Source Job

US

  • Build the SRE practice from scratch: define SLO frameworks, on-call rotation, and incident command for live bank customers.
  • Define severity tiers, SLA commitments, and escalation paths for production support, acting as the technical owner during incidents.
  • Set engineering operations across sprint discipline, release rituals, code review standards, and compliance artifacts for bank examiners.

Site Reliability Engineering, Process Design

17 jobs similar to Head of Site Reliability Engineering

Jobs ranked by similarity.

Global

  • Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
  • Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.

Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.

Latin America

  • Design, implement, and improve Site Reliability Engineering practices across production environments with a focus on SLOs, SLIs, and error budgets.
  • Lead incident response processes and build observability strategies including monitoring, logging, alerting, and distributed tracing.
  • Partner with engineering teams to enhance system reliability, availability, scalability, and operational efficiency.

Oowlish is a rapidly expanding software development company in Latin America that collaborates with premier clients from the United States and Europe to create pioneering digital solutions. Certified as a Great Place to Work, it offers a nurturing environment with opportunities for professional growth and international impact.

North America, Canada

  • Lead the strategic evolution of reliability engineering for ServiceNow's cloud platform, driving technical excellence and operational performance.
  • Define and execute change strategies for major SRE initiatives, including stakeholder engagement, impact assessments, and adoption risk management.
  • Manage incidents and escalations for SRE teams, ensuring performance and availability while developing management processes for Incident, Problem, Configuration, and Change management.

ServiceNow is the AI control tower for business reinvention, bringing together any AI, any data, and any workflow to help 85% of the Fortune 500 work smarter. The company has a growing global workforce and fosters an AI-native culture where technology and talent are unstoppable together.

US

  • Take ownership of incident management and operational excellence across cloud infrastructure.
  • Automate high-risk manual processes and drive reliability gains through engineering.
  • Own a platform domain such as Temporal, observability, or Kubernetes operations.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.

Americas, 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

Circle is building the world's leading all-in-one platform for online communities. It is a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.

APAC

  • Define and own the APAC infrastructure architecture end-to-end on Azure, including compute, networking, and containerisation.
  • Lead incident response for the region with calm, methodical root cause analysis and durable fixes.
  • Safely drive infrastructure migrations and PCI-DSS hardening programs across clouds.

Tilt is a mobile-first fintech company that uses machine learning to provide credit beyond traditional credit scores. With millions of customers worldwide, they value ownership, excellence, and mutual respect.

US

  • Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
  • Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
  • Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.

Mercury Insurance helps people reduce risk and overcome unexpected events, and has served customers for over 60 years. The company was recognized as one of America's Best Midsize Employers for 2026 and has a collaborative culture focused on growth and inclusion.

United States

  • Own and evolve observability strategy including monitoring, alerting, dashboards, logging, and distributed tracing.
  • Define and manage SLIs, SLOs, and reliability metrics, improving MTTD and MTTR through automation.
  • Build and maintain reliable cloud infrastructure on AWS and Kubernetes while mentoring engineers on SRE best practices.

Filevine is a Legal AI company delivering Legal Operating Intelligence for legal work. Fueled by a team of exceptional collaborators and innovators, Filevine’s rapid growth has earned AI awards and recognition from Deloitte and Inc. as one of the most innovative and fastest-growing technology companies in the country.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

US

  • Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
  • Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.

Zscaler accelerates digital transformation to secure customers with a cloud-native Zero Trust Exchange platform. The company processes over 200 billion transactions daily and fosters a culture of execution, collaboration, and accountability.

US

  • Lead design and operation of internal developer platforms and self-service infrastructure.
  • Build and optimize CI/CD pipelines, deployment workflows, and automation across GitHub Actions, Jenkins, and ArgoCD.
  • Apply SRE principles to improve developer-facing systems and software delivery performance.

Versant is a media company owning iconic brands in news, sports, and entertainment, including USA Network, Fandango, and Rotten Tomatoes. It is an independent, publicly traded company with a collaborative, inclusive culture and a remote-first work environment.

US, Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

UK, Netherlands, Ireland, Unlimited PTO

  • Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving and related systems.
  • Design, build, and maintain infrastructure, tooling, and automation to improve service reliability and engineering productivity.
  • Participate in on-call rotations, lead incident response, and drive root cause analysis and corrective actions.

Reddit is a community of communities built on shared interests, passion, and trust. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.

Global

  • Lead a team of experienced site reliability engineers to raise reliability standards in blockchain infrastructure.
  • Set engineering direction, build conditions for good work, and apply SRE disciplines like SLOs and error budgets.
  • Drive automation and foster people development in a small, broad-scope team.

Parity is a leading core blockchain infrastructure company, founded by Dr. Gavin Wood, co-founder and former CTO of Ethereum. They are a remote-first team with offices in Berlin, Lisbon, and London, focused on building advanced technologies in the blockchain sector and committed to diversity and inclusion.

US, Unlimited PTO

  • Lead a global SRE team of ~10 engineers, owning day-to-day operations and long-term technical direction.
  • Drive strategic partnerships with product engineering to shift from reactive support to proactive reliability ownership.
  • Scale multi-tenant infrastructure, manage cloud costs, and champion developer self-service.

Counterpart Health is transforming healthcare by providing an AI-enabled primary care tool that supports physicians in early diagnosis and management of chronic conditions. As a subsidiary of Clover Health, it has a remote-first culture that emphasizes collaboration and innovation.

Unlimited PTO, 16w maternity, 16w paternity

  • Own and operate customer-facing managed infrastructure across multiple AWS accounts and regions.
  • Serve as the senior technical escalation point for production incidents and complex configurations.
  • Contribute to OpenTelemetry distributions and maintain open source projects like Refinery.

Honeycomb provides observability for developer tools, helping companies like HelloFresh and Slack understand their software. They have over 200 employees and were named to Forbes' Best Startups in 2022 and 2023, with a culture that values inclusion and autonomy.

Global, Unlimited PTO, 16w maternity, 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.