Source Job

Global

  • Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
  • Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.

SRE Postgres AWS Pulumi Terraform

20 jobs similar to Site Reliability Engineer (SRE)

Jobs ranked by similarity.

Americas 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.

US

  • Take ownership of incident management and operational excellence across cloud infrastructure.
  • Automate high-risk manual processes and drive reliability gains through engineering.
  • Own a platform domain such as Temporal, observability, or Kubernetes operations.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

Europe

  • Lead reliability engineering for user experience.
  • Architect for scale, partnering with product and infrastructure teams to design highly available systems.
  • Drive automation to eliminate repetitive operational work through tooling and systems.

Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.

UK Netherlands Ireland Unlimited PTO

  • Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving and related systems.
  • Design, build, and maintain infrastructure, tooling, and automation to improve service reliability and engineering productivity.
  • Participate in on-call rotations, lead incident response, and drive root cause analysis and corrective actions.

Reddit is a community of communities built on shared interests, passion, and trust. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.

US

  • Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
  • Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
  • Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.

Mercury Insurance helps people reduce risk and overcome unexpected events, and has served customers for over 60 years. The company was recognized as one of America's Best Midsize Employers for 2026 and has a collaborative culture focused on growth and inclusion.

Europe

  • Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
  • Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
  • Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.

Reddit is a community of communities, built on shared interests, passion, and trust, home to the most open and authentic conversations on the internet. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.

US

  • Lead design and operation of internal developer platforms and self-service infrastructure.
  • Build and optimize CI/CD pipelines, deployment workflows, and automation across GitHub Actions, Jenkins, ArgoCD.
  • Apply SRE principles to improve developer-facing systems and software delivery performance.

Versant is a media company owning iconic brands in news, sports, and entertainment, including USA Network, Fandango, and Rotten Tomatoes. It is an independent, publicly traded company with a collaborative, inclusive culture and a remote-first work environment.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

US

  • Ensure reliability, availability, and observability for a large-scale cloud-based SaaS platform serving millions in education.
  • Design and maintain infrastructure-as-code and CI/CD pipelines while leading incident response and resolution.
  • Mentor peers and integrate AI-driven tools to improve SRE workflows and system performance.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company manages the application process and uses AI to shortlist top-fitting candidates based on core requirements.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

Global Unlimited PTO

  • Improve the reliability, performance, and scalability of our production platform.
  • Operate reliable infrastructure, improve observability, and drive incident response.
  • Use data-driven reliability practices such as SLIs, SLOs, SLAs, and DORA metrics.

VRChat is a game-changing platform that provides an endless collection of social VR experiences. They empower their community to bring their imaginations to life and help shape the metaverse. Their team includes people from Netflix, Twitter, Meta, and Microsoft.

$115,200–$172,800/yr
US 8w paternity

  • Build internal tooling to help other engineers and the rest of the company understand and operate our system.
  • Design and implement security best practices for our team and infrastructure.
  • Reduce toil through automation, including building and maintaining CI/CD infrastructure.

Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.

US

  • Ensure reliability, scalability, and performance of hosted healthcare platforms.
  • Lead incident response, root cause analysis, and implement proactive monitoring.
  • Automate operational tasks using scripting and Infrastructure-as-Code.

Altera Digital Health empowers healthcare providers to deliver superior care through innovative technology. The company is part of Constellation Software Inc., Canada's largest software company, offering a supportive and award-winning culture with opportunities for growth.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as a first-class collaborator in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

US

  • Manage a scrum team of 4-6 engineers building and operating high-volume bidder systems.
  • Oversee AWS-based cloud infrastructure processing over 1 billion HTTP requests per hour.
  • Drive improvements in reliability, performance, and cost efficiency across production systems.

Jamloop builds high-scale advertising technology for real-time bidding systems. We are a remote-first company focused on reliability and operational excellence.

US

  • Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
  • Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.

Zscaler accelerates digital transformation to secure customers with a cloud-native Zero Trust Exchange platform. The company processes over 200 billion transactions daily and fosters a culture of execution, collaboration, and accountability.

Brazil Unlimited PTO

  • Collaborate with a tight-knit development team.
  • Design, deploy, and operate critical systems balancing reliability, cost, and agility.
  • Perform troubleshooting and root-cause analysis of system operation issues.

Loadsmart is a logistics technology company valued at over $1 billion. We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight.

Unlimited PTO

  • Own database reliability across Aurora, OpenSearch, Redis, and the CDC pipeline, including schema design reviews, migration safety, and incident response.
  • Make the Django ORM a strength at scale by catching N+1 patterns, extending QuerySet conventions, and building CI checks that encode standards.
  • Build self-service tooling and dashboards giving teams visibility into their query footprint, and contribute to onboarding and knowledge-sharing as the engineering org grows.

Scribe provides a Workflow AI platform that automatically captures and optimizes how work gets done, used by 94% of the Fortune 500. The company has grown to over 5 million daily active users across 600,000 businesses, reached $100M ARR in May 2026, and was valued at $1.3 billion at its Series C. It fosters a builder culture with a high bar and fast pace.

Canada

  • Own and operate production cloud environments, ensuring high availability, reliability, and performance across distributed systems.
  • Design, build, and maintain scalable infrastructure using automation-first principles and Infrastructure as Code practices.
  • Drive automation initiatives and continuous improvement across infrastructure, deployment, and operational workflows.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. They have an inclusive, employee-driven culture with a strong focus on collaboration and innovation.