Source Job

US

  • Ensure reliability, availability, and observability for a large-scale cloud-based SaaS platform serving millions in education.
  • Design and maintain infrastructure-as-code and CI/CD pipelines while leading incident response and resolution.
  • Mentor peers and integrate AI-driven tools to improve SRE workflows and system performance.

Site Reliability Engineering, AWS, Kubernetes, Terraform, Python

20 jobs similar to Sr Site Reliability Engineer

Jobs ranked by similarity.

US

  • Designing and managing cloud-based infrastructure on AWS.
  • Creating and maintaining deployment architectures and continuous delivery pipelines.
  • Automating infrastructure provisioning and management using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.

Nearform is an independent team of data & AI experts, engineers, and designers who build intelligent digital solutions and capability at pace. Our team of 500 experts in 20+ countries is trusted by leading enterprises.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

Argentina 18w maternity 12w paternity

  • Own and evolve the cloud platform including compute layer, EKS fleet, serverless infrastructure, networking, and cloud operations across AWS and GCP.
  • Design and maintain infrastructure-as-code foundation and networking layer for reliability, security, and scalability.
  • Build AI-powered automation for cloud infrastructure management, including policy-as-code, drift detection, and LLM-assisted runbook generation.

Webflow builds the world's leading AI-native Digital Experience Platform, empowering teams to design, launch, and optimize for the web without barriers. As a remote-first company with over 2 million users across 190 countries, it fosters a culture of trust, transparency, and creativity.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

$115,000–$130,000/yr
US Unlimited PTO

  • Develop and maintain scalable automation and integrations across cloud platforms and services.
  • Design, implement, and operate CI/CD pipelines using Jenkins, Dagger, Terraform, and Docker.
  • Build, operate, and troubleshoot workloads on Kubernetes, using Kustomize and Helm.

People Inc. is America’s largest digital and print publisher. Our brands harness the best intent-driven content, the fastest sites, and the fewest ads to help nearly 200 million people every month make decisions.

US Unlimited PTO

  • Leads DevOps delivery for cloud-native applications, translating architecture into infrastructure and CI/CD across environments.
  • Designs and maintains AWS infrastructure as code using Terraform across multiple services.
  • Builds and enhances CI/CD pipelines in Azure DevOps and GitHub for high-velocity delivery.

Origami Risk delivers single-platform SaaS solutions that help organizations navigate the complexities of risk, insurance, compliance, and safety management. Founded by industry veterans, the company focuses on client success with award-winning software solutions.

US Unlimited PTO

  • Design, scale, and operate resilient, cloud-native infrastructure in AWS with a strong emphasis on EKS, IAM, RBAC, and modern security-first practices.
  • Build and optimize CI/CD pipelines with GitHub Actions and GitHub Advanced Security, enabling velocity without compromising safety.
  • Own observability across the stack using Datadog (metrics, logging, alerting, and tracing).

DexCare optimizes time in healthcare, streamlining patient access, reducing waits, and enhancing overall experiences. Currently serving 57 million patients, including Kaiser Permanente and Providence, DexCare is committed to an inclusive workplace where diversity drives innovation.

United States 4w PTO

  • Own and improve infrastructure, deployment systems, and operational foundation for reliability and security.
  • Build safer deployment paths, strengthen observability, and lead infrastructure migrations.
  • Partner with engineers on scaling, error handling, and backend changes to support AI-enabled workflows.

Clever is a venture-backed real estate technology company that builds a leading online education platform and has earned a 4.9 TrustPilot rating. The company has helped consumers save over $210 million in real estate fees and fosters a culture of innovation and transparency.

United States

  • Design and build core platform infrastructure for large-scale cloud-native data and analytics systems.
  • Own and improve CI/CD pipelines, testing frameworks, and deployment in a high-scale PaaS environment.
  • Contribute to reliability engineering, observability, and operational excellence across distributed systems.

Jobgether uses an AI-powered matching process to connect candidates with roles. The company is a growing platform focused on efficient job matching and data privacy compliance.

US

  • Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
  • Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
  • Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.

Mercury Insurance helps people reduce risk and overcome unexpected events, serving customers for over 60 years. They are a midsize employer recognized as one of America's Best Midsize Employers for 2026, with a collaborative culture focused on growth and inclusion.

US

  • Lead design and operation of internal developer platforms and self-service infrastructure.
  • Build and optimize CI/CD pipelines, deployment workflows, and automation across GitHub Actions, Jenkins, ArgoCD.
  • Apply SRE principles to improve developer-facing systems and software delivery performance.

Versant is a media company owning iconic brands in news, sports, and entertainment, including USA Network, Fandango, and Rotten Tomatoes. It is an independent, publicly traded company with a collaborative, inclusive culture and a remote-first work environment.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

Global

  • Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
  • Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.

Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

Brazil Unlimited PTO

  • Collaborate with a tight-knit development team.
  • Design, deploy, and operate critical systems balancing reliability, cost, and agility.
  • Perform troubleshooting and root-cause analysis of system operation issues.

Loadsmart is a logistics technology company valued at over $1 billion. We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight.

Canada

  • Own and operate production cloud environments, ensuring high availability, reliability, and performance across distributed systems.
  • Design, build, and maintain scalable infrastructure using automation-first principles and Infrastructure as Code practices.
  • Drive automation initiatives and continuous improvement across infrastructure, deployment, and operational workflows.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. They have an inclusive, employee-driven culture with a strong focus on collaboration and innovation.

US

  • Lead integration of security across the SDLC, embedding automated testing into CI/CD pipelines.
  • Secure cloud-native AWS architectures and enforce least privilege access and runtime protections.
  • Perform threat modeling, automate compliance, and innovate with AI security standards.

TrueML is a mission-driven financial software company that uses machine learning to improve customer experiences for distressed borrowers. The team includes data scientists, financial services experts, and customer experience fanatics building inclusive financial technology.

Germany 6w PTO

  • Architect and scale the cloud platform behind a mission-critical SaaS product used globally.
  • Lead Infrastructure as Code maturity and drive automation, reliability, and cost optimisation.
  • Own uptime, SLAs, and incident management practices while mentoring engineers.

Innocraft (trading as Matomo) provides an open-source analytics platform trusted by enterprises and governments for full data ownership. The company values diversity and inclusion, and operates with a stable, mature product and strong engineering team.

Global

  • Manage a team of engineers, conducting 1:1s, performance reviews, hiring, and career development in a distributed, remote-friendly environment.
  • Own the technical roadmap for shared cloud infrastructure across Azure and AWS, balancing reliability work against longer-term platform improvements.
  • Set and enforce standards for infrastructure-as-code (Terraform, Helm, Kubernetes), documentation, and operational readiness.

Delinea is a pioneer in securing human and machine identities through intelligent, centralized authorization, empowering organizations to seamlessly govern their interactions across the modern enterprise. They value diversity, innovation, and a culture of respect and fairness, with a global team supported by strategic investment from TPG.

US Unlimited PTO

  • Lead the design, implementation, management, support, and operation of cloud-native infrastructure and container orchestration platforms.
  • Drive platform reliability, scalability, automation, and operational excellence across critical SaaS and cloud-based workloads.
  • Contribute to architectural decisions, mentor engineers, and ensure alignment with security, compliance, and operational standards.

Availity delivers revenue cycle and related business solutions for health care professionals who want to build healthy, thriving organizations. They are a global team with headquarters in Jacksonville, FL, and an office in Bangalore, India, united by a mission to bring the focus back to patient care.