Source Job

Mexico Colombia

  • Build and maintain observability across the platform in Datadog, including dashboards, monitors, APM, and log pipelines.
  • Participate in on-call rotation and incident response, driving blameless post-incident reviews and automating toil.
  • Leverage AI tools to accelerate debugging, generate runbooks, and build automation for operational efficiency.

Datadog AWS Python Terraform Incident Response

20 jobs similar to SRE Engineer

Jobs ranked by similarity.

United States

  • Own and evolve observability strategy including monitoring, alerting, dashboards, logging, and distributed tracing.
  • Define and manage SLIs, SLOs, and reliability metrics, improving MTTD and MTTR through automation.
  • Build and maintain reliable cloud infrastructure on AWS and Kubernetes while mentoring engineers on SRE best practices.

Filevine is a Legal AI company delivering Legal Operating Intelligence for legal work. Fueled by a team of exceptional collaborators and innovators, Filevine’s rapid growth has earned AI awards and recognition from Deloitte and Inc. as one of the most innovative and fastest-growing technology companies in the country.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

US

  • Take ownership of incident management and operational excellence across cloud infrastructure.
  • Automate high-risk manual processes and drive reliability gains through engineering.
  • Own a platform domain such as Temporal, observability, or Kubernetes operations.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.

US Unlimited PTO

  • Design, scale, and operate resilient, cloud-native infrastructure in AWS with a strong emphasis on EKS, IAM, RBAC, and modern security-first practices.
  • Build and optimize CI/CD pipelines with GitHub Actions and GitHub Advanced Security, enabling velocity without compromising safety.
  • Own observability across the stack using Datadog (metrics, logging, alerting, and tracing).

DexCare optimizes time in healthcare, streamlining patient access, reducing waits, and enhancing overall experiences. Currently serving 57 million patients, including Kaiser Permanente and Providence, DexCare is committed to an inclusive workplace where diversity drives innovation.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

US Unlimited PTO

  • Design and build cloud-native infrastructure for reliability, observability, and automation across GCP, GKE, and Cloud Run.
  • Own incident response, root cause analysis, escalation workflows, and runbooks to prevent hard problems from recurring.
  • Develop Infrastructure as Code, CI/CD pipelines, and operational tooling to improve developer velocity and platform efficiency.

CertifyOS is building the data infrastructure that powers modern healthcare, automating provider licensing, enrollment, credentialing, and network monitoring through an API-first platform. The company is backed by leading investors with a team of deep experience in provider data systems, valuing authenticity, accountability, collaboration, results, and openness to feedback.

US

  • Lead design and operation of internal developer platforms and self-service infrastructure.
  • Build and optimize CI/CD pipelines, deployment workflows, and automation across GitHub Actions, Jenkins, ArgoCD.
  • Apply SRE principles to improve developer-facing systems and software delivery performance.

Versant is a media company owning iconic brands in news, sports, and entertainment, including USA Network, Fandango, and Rotten Tomatoes. It is an independent, publicly traded company with a collaborative, inclusive culture and a remote-first work environment.

US

  • Drive the definition and adoption of SLIs and SLOs across services, reducing toil through automation and incident response.
  • Design and architect Infrastructure as Code solutions for large-scale environments using Docker, Kubernetes, and cloud-native services.
  • Serve as primary SRE liaison for development teams, influencing architecture and conducting training for clients.

Noctua Technology, LLC is a company that drives digital transformation by treating operations as a software engineering challenge, focusing on cloud native systems. They are a dynamic team seeking a Senior SRE to define strategy and bridge development and operations for clients.

Latin America

  • Design, implement, and improve Site Reliability Engineering practices across production environments with a focus on SLOs, SLIs, and error budgets.
  • Lead incident response processes and build observability strategies including monitoring, logging, alerting, and distributed tracing.
  • Partner with engineering teams to enhance system reliability, availability, scalability, and operational efficiency.

Oowlish is a rapidly expanding software development company in Latin America that collaborates with premier clients from the United States and Europe to create pioneering digital solutions. Certified as a Great Place to Work, it offers a nurturing environment with opportunities for professional growth and international impact.

Global

  • Embed with product and platform teams from early stages to ensure reliability is designed in from the start.
  • Define production-readiness standards and measurable SLIs/SLOs to guide operational excellence.
  • Build tooling and infrastructure across AWS, GCP, and Azure using Terraform, and share on-call rotation.

We build WebContainers and Bolt.new, an AI-powered app builder that lets you create, edit, and deploy full-stack apps instantly in your browser. We are a fully remote, globally distributed team of passionate engineers serving over 1 million developers monthly.

Latin America

  • Define and implement SLOs, SLIs, and Error Budgets to ensure production system reliability.
  • Lead incident command during major outages and drive blameless postmortems.
  • Develop observability strategies, including monitoring, logging, tracing, and alerting.

Oowlish is a rapidly expanding software development company in Latin America. It is certified as a Great Place to Work and offers a nurturing environment with professional development opportunities.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as first-class collaborators in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

Americas 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.

  • Own reliability, latency, and performance for AI platform services and data infrastructure on AWS.
  • Design and maintain CI/CD pipelines, infrastructure-as-code, and observability frameworks across the stack.
  • Partner with AI and data engineers to ensure secure, cost-optimized, and scalable deployment of platform components.

HHAeXchange is the leading technology platform for home and community-based care, providing an end-to-end homecare solution for people who are aging or have disabilities. Founded in 2008, the company is passionate about transforming healthcare by connecting patients, providers, managed care organizations, and states.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

Global

  • Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
  • Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.

Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.

Unlimited PTO 16w maternity 16w paternity

  • Own and operate customer-facing managed infrastructure across multiple AWS accounts and regions.
  • Serve as the senior technical escalation point for production incidents and complex configurations.
  • Contribute to OpenTelemetry distributions and maintain open source projects like Refinery.

Honeycomb provides observability for developer tools, helping companies like HelloFresh and Slack understand their software. They have over 200 employees and were named to Forbes' Best Startups in 2022 and 2023, with a culture that values inclusion and autonomy.

US

  • Proactively identify and respond to emerging security threats and incidents.
  • Develop detection techniques and manage core security tooling such as SIEM and orchestration platforms.
  • Collaborate across teams to support security projects and participate in on-call rotations.

Circle is a leading internet financial platform company building infrastructure for digital assets, stablecoins, and blockchain. They have a flexible work environment with values of high integrity and multistakeholder collaboration.

US

  • Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
  • Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.

Zscaler accelerates digital transformation to secure customers with a cloud-native Zero Trust Exchange platform. The company processes over 200 billion transactions daily and fosters a culture of execution, collaboration, and accountability.