Source Job

US

  • Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
  • Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
  • Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.

DevOps Cloud Platforms Incident Management Observability

20 jobs similar to Manager Site Reliability Operations

Jobs ranked by similarity.

$200,000–$225,000/yr

  • Lead the evaluation, adoption, and execution of technology initiatives.
  • Recruit, mentor, and motivate a high-performance operations staff.
  • Drive operational excellence through structured incident, problem, and change management practices.

Business Wire is a press release distribution company. The company's total rewards include remote work, health benefits, fitness allotment, and a 401(k) plan.

Mexico

  • Design systems with resilience, graceful degradation, and capacity in mind.
  • Define and measure SLOs and SLIs that actually reflect what our customers feel.
  • Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.

EarnIn is building products that deliver real-time financial flexibility for people living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

$188,550–$212,150/yr
Global Unlimited PTO

  • Own the technical direction of Remote's SRE/Platform domain.
  • Define and drive the reliability strategy across the platform.
  • Identify and lead AI enablement initiatives across the engineering organisation.

Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.

$29,000–$36,000/yr
India

  • Design, build, and maintain scalable, reliable systems on GCP.
  • Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
  • Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.

SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.

US

  • Ensure reliability, scalability, and performance of hosted healthcare platforms.
  • Lead incident response, root cause analysis, and implement proactive monitoring.
  • Automate operational tasks using scripting and Infrastructure-as-Code.

Altera Digital Health empowers healthcare providers to deliver superior care through innovative technology. The company is part of Constellation Software Inc., Canada's largest software company, offering a supportive and award-winning culture with opportunities for growth.

Germany

  • Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
  • Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
  • Lead incident response and postmortems in a blameless culture.

Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.

$215,000–$280,000/yr
US 4w PTO 12w maternity 12w paternity

  • Own production health, reliability, and operational support processes across critical systems and services.
  • Lead incident response efforts, stakeholder communication, root cause analysis, and post-incident reviews.
  • Design and implement AI-driven agents and workflows that automate support and operational tasks.

Quanata is on a mission to help ensure a better world through context-based insurance solutions. They are an exceptional, customer-centered team with a passion for creating innovative technologies, digital products, and brands. Quanata, LLC is wholly owned and funded by State Farm.

Canada US 4w PTO

  • Lead and grow high-performing platform engineering teams that deliver reliable, scalable infrastructure and operational excellence for Vanta’s products and customers.
  • Set technical direction and drive multi-quarter platform initiatives spanning infrastructure reliability, security, scalability, and developer experience across shared systems and services.
  • Partner closely with product engineering, security, and engineering leadership to identify organizational needs and deliver scalable platform solutions.

Vanta helps businesses earn and prove trust by empowering companies to practice better security and prove it with ease. They have a kind and talented team, and while some have prior security experience, many have been successful without it.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

Americas 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.

Europe

  • Lead Reliability Engineering for User Experience.
  • Architect for Scale, partnering with product and infrastructure teams to design highly available systems.
  • Drive Automation to eliminate repetitive operational work through tooling and systems.

Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.

$160,000–$190,000/yr
US

  • Own and evolve Launch Potato's cloud infrastructure, CI/CD platform, and compliance posture.
  • Build the SRE function from the ground up so product teams can ship faster without compromising reliability, security, or cost control.
  • Stand up the SRE practice from scratch: on-call rotation, PagerDuty configuration, SLA/SLO definitions for core infrastructure services, runbook library, and observability dashboards that tie site performance to business metrics.

Launch Potato is a digital media company that connects consumers with leading brands through data-driven content and technology. They are headquartered in South Florida with a remote-first team spanning over 15 countries, with a high-growth, high-performance culture.

Canada

  • Own and operate production cloud environments, ensuring high availability, reliability, and performance across distributed systems.
  • Design, build, and maintain scalable infrastructure using automation-first principles and Infrastructure as Code practices.
  • Drive automation initiatives and continuous improvement across infrastructure, deployment, and operational workflows.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. They have an inclusive, employee-driven culture with a strong focus on collaboration and innovation.

US Global

  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure.
  • Implementing and utilizing configuration management and deployment tools.
  • Leading continuous improvement by automating the installation, configuration, and maintenance of services on our platform.

The Wikimedia Foundation operates Wikipedia and other Wikimedia free knowledge projects with the vision of a world where every single human can freely share in the sum of all knowledge. As a charitable, not-for-profit organization, it relies on donations and has staff members based in 40+ countries.

US 12w maternity 12w paternity

  • Design and build tools and frameworks to automate operational tasks and deployments for Portal and Endpoint Agents.
  • Evolve AI tooling and workflows to enhance developer productivity and integrate AI into daily development.
  • Build and maintain CI/CD pipelines, support product teams, and optimize software architecture for scalability and reliability.

Huntress is a cybersecurity company founded in 2015 by former NSA cyber operators, focused on protecting small to midsize businesses from cyber attacks through its award-winning security platform and expert human threat hunters. The company is fully remote and fosters a culture of inclusivity, innovation, and collaboration.

Germany 6w PTO

  • Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
  • Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
  • Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.

Brazil Unlimited PTO

  • Collaborate with a tight-knit development team.
  • Design, deploy, and operate critical systems balancing reliability, cost, and agility.
  • Perform troubleshooting and root-cause analysis of system operation issues.

Loadsmart is a logistics technology company valued at over $1 billion. We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight.

Global

  • Lead the Security Operations Team to protect global IT infrastructure, ensuring system confidentiality, integrity, and availability.
  • Oversee incident response, vulnerability management, and continuous security posture improvements across the organization.
  • Collaborate with IT, Engineering, and Compliance teams to embed security into every layer of the business.

Unit4 is a cloud ERP company redefining enterprise resource planning for mid-market people-centric organizations. With over 40 years of heritage, it fosters a people-first culture with a high-performance team and a focus on employee empowerment.

US

  • Continuously monitor infrastructure, cloud platforms, identity systems, networking, and security tooling using centralized monitoring and alerting solutions.

Mercer Advisors helps families amplify and simplify their financial lives by integrating financial planning, investment management, business management, tax, estate, insurance, and more, managed by a single team. They serve over 31,300 families across 90+ cities in the U.S. and are ranked the #1 RIA Firm in the nation by Barron’s for two consecutive years.