Senior Site Reliability Engineer (m/f/d)

Job Teaser

As a Senior Site Reliability Engineer in our Platform Squad, you'll own critical reliability domains end-to-end and drive the squad's technical direction: leading architectural decisions on our platform, mentoring teammates, and continuously raising the reliability bar inside the team.

This role is for an engineer with a proven track record of building and operating high-throughput, highly available systems, who wants senior-level technical ownership and real impact through deep engineering work inside a tight, well-scoped team.

What awaits you with us

Co-own the architecture: Help drive the architecture and evolution of our cloud infrastructure on Azure and our Kubernetes clusters, designed for high throughput and high availability, to support Flip's rapid growth across the globe.
Drive the resilience strategy: Define how we approach global scaling, zero-downtime deployments, rollback mechanisms and disaster recovery, and make sure the platform stays available around the clock.
Evolve our observability stack: Improve our LGTM stack (Loki, Grafana, Tempo, Mimir) into a foundation our engineers can trust.
Improve our IaC Platform: Eliminate toil at the source, and make our infrastructure truly self-service for engineering teams.
Lead in incidents: Take a leading role in platform-related major incidents, drive blameless post-mortems for the squad, and translate findings into systemic improvements.
Mentor within the squad: Coach teammates, run RFCs and design reviews inside the team, and help engineers grow into stronger SREs.
Shape our roadmap: Partner with your squad to define the platform's direction.

What you bring to the table

We're looking for a hands-on, SaaS-minded Senior Site Reliability Engineer who treats scalability and reliability as first-class product concerns.

Must-Have Qualifications

5+ years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
Proven track record building and operating high-throughput, highly available systems in production.
Deep, production-level experience with Kubernetes on any hyperscaler.
Strong experience with modern observability stacks (e.g. Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear point of view on SLIs, SLOs and error budgets.
Solid software development skills in Go (strongly preferred, since our IaC runs on Pulumi in Go) or Python.
Hands-on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g. ArgoCD) and CI/CD pipeline design.
Demonstrated ability to lead complex infrastructure initiatives from design to production, including writing RFCs and driving architecture decisions within your team.
Experience mentoring engineers and raising the technical bar within a team.
Comfortable owning major incidents end-to-end and turning learnings into systemic change.
Strong communication skills and business-fluent English.
Willingness to participate in on-call rotations to ensure the reliability of our platform.

Nice-to-Have Qualifications

Rolled out production-ready API gateways with Gateway API (e.g. Envoy Gateway).
Operated multi-cluster service meshes (e.g. Cilium, Linkerd, Istio).
Deployed and maintained Kubernetes Operators (e.g. Strimzi, CNPG).
Operated highly available PostgreSQL in production.
