Source Job

Americas 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

AWS Kubernetes MySQL Postgres Redis

20 jobs similar to Senior Site Reliability Engineer

Jobs ranked by similarity.

$188,550–$212,150/yr
Global Unlimited PTO

  • Own the technical direction of Remote's SRE/Platform domain.
  • Define and drive the reliability strategy across the platform.
  • Identify and lead AI enablement initiatives across the engineering organisation.

Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

North America

  • Triage difficult technical problems and implement solutions
  • Improve our observability stack (monitoring, logging, profiling)
  • Incident Management: Respond to and resolve incidents in a timely manner, conducting post-incident reviews to identify and implement improvements.

Alpaca is a self-clearing broker-dealer and brokerage infrastructure for stocks, ETFs, options, crypto, fixed income, 24/5 trading, and more. They are a dynamic team of 380+ globally distributed members.

Global Unlimited PTO

  • Improve the reliability, performance, and scalability of our production platform.
  • Operate reliable infrastructure, improve observability, and drive incident response.
  • Use data-driven reliability practices such as SLIs, SLOs, SLAs, and DORA metrics.

VRChat is a game-changing platform that provides an endless collection of social VR experiences. They empower their community to bring their imaginations to life and help shape the metaverse. Their team includes people from Netflix, Twitter, Meta, and Microsoft.

Brazil Unlimited PTO

  • Collaborate with a tight-knit development team.
  • Design, deploy, and operate critical systems balancing reliability, cost, and agility.
  • Perform troubleshooting and root-cause analysis of system operation issues.

Loadsmart is a logistics technology company valued at over $1 billion. We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight.

US

  • Design, build, and operate core cloud infrastructure across compute, storage, databases, and networking layers.
  • Own and improve the reliability, scalability, and security of Valon’s production systems as we scale to support major enterprise deployments.
  • Evaluate, adopt, and operationalize new infrastructure technologies (e.g., Vitess, Clickhouse, Redis) to meet evolving product and scale requirements.

Valon is building the AI-native operating system for regulated finance, starting with mortgage servicing. They're a Series C company backed by a16z, transforming industries that others have written off as too complex to innovate.

Global

  • Provide production support on a shift according to the team on-call roster.
  • Work on the customer and internal engineering/implementation team raised tickets while not on-call for production support.
  • Continuously monitor the health and performance of our services, systems, and infrastructure.

Granicus builds and maintains technology that is transforming the Govtech industry by bringing governments and its constituents together. They serve 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers, and are known for being one of the best companies to work for.

$145,000–$250,000/yr
US Unlimited PTO

  • Construct infrastructure as code, developing and enforcing best practice across configurations while preventing drift between Terraform configurations and infrastructure deployments.

SentiLink provides innovative identity and risk solutions, empowering institutions and individuals to transaction with confidence. They are building the future of identity verification in the United States replacing a clunky, ineffective, and expensive status quo with solutions that are 10x faster, smarter, and more accurate.

$160,000–$190,000/yr
US

  • Own and evolve Launch Potato's cloud infrastructure, CI/CD platform, and compliance posture.
  • Build the SRE function from the ground up so product teams can ship faster without compromising reliability, security, or cost control.
  • Stand up the SRE practice from scratch: on-call rotation, PagerDuty configuration, SLA/SLO definitions for core infrastructure services, runbook library, and observability dashboards that tie site performance to business metrics.

Launch Potato is a digital media company that connects consumers with leading brands through data-driven content and technology. They are headquartered in South Florida with a remote-first team spanning over 15 countries, with a high-growth, high-performance culture.

$113,850–$126,500/yr
Europe 5w PTO

  • Design, build, and maintain infrastructure using Infrastructure as Code tools such as Terraform.
  • Improve system reliability, scalability, resilience, and performance across the Mast platform.
  • Build systems and tooling that automate infrastructure management and operational workflows wherever possible.

Mast is on a mission to make complex lending simple by building modern, cloud-native lending technology purpose-built for specialist lenders. It is a high-performance team of engineers and lending experts that values radical honesty, transparency, and speed.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.

Mexico

  • Design systems with resilience, graceful degradation, and capacity in mind.
  • Define and measure SLOs and SLIs that actually reflect what our customers feel.
  • Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.

EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.

Europe

  • Lead Reliability Engineering for User Experience.
  • Architect for Scale, partnering with product and infrastructure teams to design highly available systems.
  • Drive Automation to eliminate repetitive operational work through tooling and systems.

Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.

Canada 4w PTO

  • Design and build scalable infrastructure to support rapid growth in data volume, service usage, and engineering velocity.
  • Implement and maintain core security infrastructure and controls including, service-to-service authentication, secrets management, application security primitives.
  • Partner closely with Security Engineering to implement infrastructure that supports best-in-class security and compliance practices.

Vanta helps businesses earn and prove trust by providing a platform that continuously monitors and verifies security. They empower companies to practice better security and prove it with ease. Vanta has a kind and talented team with offices in SF, NYC, London, Dublin, Tel Aviv, and Sydney.

US 6w PTO

  • Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

Europe 6w PTO

  • Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).

Europe

  • Support and improve hybrid production infrastructure for 15+ development teams handling 100+ products, 10K+ domains, and billions of hits per day.
  • Architect and plan improvements of a multi-datacenter development environment, advocating for migration to automated, elastic infrastructures using cloud, Kubernetes, and serverless technologies.
  • Document processes, monitor performance metrics, promote CICD practices, and mentor junior DevOps engineers.

Aylo is a tech pioneer that offers world-class adult entertainment and games on safe, popular platforms. With an international team of dynamic innovators, the company focuses on trust-and-safety protocols and has offices in Montreal, Austin, and Nicosia.

US

  • Ensure reliability, scalability, and performance of hosted healthcare platforms.
  • Lead incident response, root cause analysis, and implement proactive monitoring.
  • Automate operational tasks using scripting and Infrastructure-as-Code.

Altera Digital Health empowers healthcare providers to deliver superior care through innovative technology. The company is part of Constellation Software Inc., Canada's largest software company, offering a supportive and award-winning culture with opportunities for growth.

$4,313–$5,391/mo
Europe

  • Own 5 AWS accounts across the organisation.
  • Architect and maintain infrastructure as code with Terraform.
  • Set up monitoring, alerting, and incident response.

We're a UK fintech building high-throughput digital infrastructure for the mortgage and property space. Recently acquired Trussle and we are taking our platform to the next level. The company values innovation and building high-quality products.