Wikimedia Enterprise

Senior Site Reliability Engineer

🕐 6 days ago · 📍 Luxembourg · 🌍 Remote
Applications closed

Accountabilities

In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:

  • Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
  • Design and enhance observability systems including metrics, logging, and distributed tracing
  • Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
  • Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
  • Implement infrastructure-as-code and automation-first practices to reduce operational toil
  • Design and operate scalable cloud infrastructure across production environments
  • Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
  • Improve developer experience by enabling self-service infrastructure and streamlined workflows
  • Collaborate with security, software, and release engineering teams to embed reliability and security best practices
  • Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
  • Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
  • Contribute to platform engineering initiatives that standardize infrastructure across teams
  • Mentor peers and promote best practices in SRE, automation, and systems reliability

Requirements

This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
  • Proficiency in at least one programming language (Python, Go, or similar)
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure
  • Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, Argo CD, or similar tools)
  • Strong understanding of SRE principles including SLOs, SLIs, and error budgets
  • Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
  • Proven experience in incident response, on-call operations, and postmortem analysis
  • Ability to operate and optimize large-scale distributed systems with high availability requirements
  • Strong communication and collaboration skills in distributed, remote-first environments
  • Ability to document systems clearly and contribute to shared engineering knowledge
  • Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
  • Adaptability to fast-evolving, technology-driven environments

Benefits

  • Remote-first work model with global collaboration
  • Opportunity to work on high-impact systems supporting global knowledge platforms
  • Exposure to large-scale distributed systems and modern cloud-native architectures
  • Culture of engineering excellence, automation, and continuous improvement
  • Strong emphasis on learning, experimentation, and open collaboration
  • Competitive compensation adjusted to location and experience
  • Inclusive and diverse work environment with global team exposure
  • Opportunity to contribute to open knowledge infrastructure used worldwide
