Senior Site Reliability Engineer @ Wikimedia Enterprise (Remote Position)

Accountabilities

In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:

Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
Design and enhance observability systems including metrics, logging, and distributed tracing
Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
Implement infrastructure-as-code and automation-first practices to reduce operational toil
Design and operate scalable cloud infrastructure across production environments
Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
Improve developer experience by enabling self-service infrastructure and streamlined workflows
Collaborate with security, software, and release engineering teams to embed reliability and security best practices
Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
Contribute to platform engineering initiatives that standardize infrastructure across teams
Mentor peers and promote best practices in SRE, automation, and systems reliability

Requirements

This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
Proficiency in at least one programming language (Python, Go, or similar)
Hands-on experience with cloud platforms such as AWS, GCP, or Azure
Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or Argo CD)
Strong understanding of SRE principles including SLOs, SLIs, and error budgets
Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
Proven experience in incident response, on-call operations, and postmortem analysis
Ability to operate and optimize large-scale distributed systems with high availability requirements
Strong communication and collaboration skills in distributed, remote-first environments
Ability to document systems clearly and contribute to shared engineering knowledge
Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
Adaptability to fast-evolving, technology-driven environments

Benefits

Remote-first work model with global collaboration
Opportunity to work on high-impact systems supporting global knowledge platforms
Exposure to large-scale distributed systems and modern cloud-native architectures
Culture of engineering excellence, automation, and continuous improvement
Strong emphasis on learning, experimentation, and open collaboration
Competitive compensation adjusted to location and experience
Inclusive and diverse work environment with global team exposure
Opportunity to contribute to open knowledge infrastructure used worldwide

🇧🇷 This position requires English. Are you ready?