Source Job

US

  • Ensure near-zero downtime with monitoring and alerting, self-healing automation, and continuous improvement
  • Create highly automated, available and scalable systems by applying software and infrastructure principles
  • Employ and advise clients on DevOps and SRE principles and practices, covering deployment pipelines, HA, service reliability, technical debt, and operational toil for live services running at scale

Linux Windows K8s Python Terraform

20 jobs similar to Site Reliability Engineer

Jobs ranked by similarity.

South America

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.

Groupon is a marketplace where customers discover new experiences and services everyday and local businesses thrive. Even with thousands of employees spread across multiple continents, they still maintain a culture that inspires innovation, rewards risk-taking and celebrates success.

US

  • Ensure the smooth operation and high availability of Clarifai's core services
  • Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
  • Design and implement scalable, secure, and cost-effective infrastructure solutions

Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a diverse, globally distributed team with $100M in funding and are committed to building a diverse and inclusive team.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors and networks and more. While the job posting doesn't specify size, they appear to have a culture that values transparency, collaboration, grit, and humility.

US Canada Europe

  • Lead a global team of Site Reliability Engineers.
  • Recruit, hire, onboard and develop engineers.
  • Guide project planning by defining milestones and identifying dependencies.

AuthZed creates and maintains SpiceDB and the authorization infrastructure. They are a Series A company with a fully remote team across the US, Canada, and Europe and a hardworking, close-knit group with a software-driven culture that values integrity, collaboration, and open-mindedness.

$219,000–$245,000/yr
US Unlimited PTO

  • Architect, operate, improve and secure the platform the Garner Health app runs on
  • Boost development velocity and productivity
  • Build systems to a high engineering standard and hold others to the same high standard

Garner has developed a revolutionary approach to evaluating doctor performance and a unique incentive model that's reshaping the healthcare economy to ensure everyone can afford high quality care. They have more than doubled their revenue annually over the last 5 years. Garner's award winning culture is designed to cultivate teamwork, trust, autonomy, exceptional results, and individual growth.

US

  • Own developer operations and platform reliability across Introzy’s product stack.
  • Lead how we run infrastructure on Render, design and evolve our observability and alerting, shape our CI/CD and release practices.
  • Continuously improve internal developer experience so the engineering team can ship quickly and safely.

Introzy is a multi-app platform designed to unify networking, workflow, and productivity. As a subsidiary of Sanguine Technology Solutions, they are an early-stage company moving fast to deliver value, with a lean engineering team and a culture that embraces AI.

US

  • Play a crucial part in designing and scaling secure cloud infrastructure.
  • Lead the charge in intelligent automation systems and ensure robust deployment processes.
  • Collaborate with product, engineering, and leadership to drive company success.

Jobgether is a company that connects job seekers with employers. They utilize an AI-powered matching process to ensure applications are reviewed quickly and objectively.

India

  • Oversee the reliability, scalability, performance, and security of key production services.
  • Collaborate with cross-functional teams to develop and maintain resilient infrastructure.
  • Provide expert mentorship and guidance on best practices to engineers throughout the organization.

Cision is a global leader in PR, marketing and social media management technology and intelligence, helping brands and organizations connect with customers and stakeholders to drive business results. The company has offices in 24 countries throughout the Americas, EMEA and APAC.

$109,800–$252,500/yr
US Unlimited PTO 16w maternity 8w paternity

  • Design, implement, and maintain scalable and reliable infrastructure solutions.
  • Automate deployments and maintain a resilient, secure SaaS application platform.
  • Develop comprehensive monitoring and alerting solutions, and respond to incidents.

Veeam is the #1 global market leader in data resilience, believing businesses should control all their data whenever and wherever they need it, providing data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running.

Americas EMEA Unlimited PTO

  • Design and implement highly scalable infrastructure for GitLab.com to support current and future growth.
  • Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction.
  • Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the infrastructure department.

GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. They aim to enable everyone to contribute to and co-create the software that powers our world.

$125,000–$169,000/yr
Unlimited PTO

  • Design, scale, and operate resilient, cloud-native infrastructure in AWS with an emphasis on EKS, IAM, RBAC, and modern security-first practices.
  • Build and optimize CI/CD pipelines with GitHub Actions and GitHub Advanced Security enabling velocity without compromising safety.
  • Own observability across the stack using Datadog (metrics, logging, alerting, and tracing).

DexCare optimizes time in healthcare, streamlining patient access, reducing waits, and enhancing overall experiences. They are committed to creating an inclusive workplace where diversity drives innovation and belonging strengthens collaboration, enabling everyone to thrive.

US

  • Design, build, and maintain secure, scalable cloud infrastructure.
  • Own CI/CD pipelines and deployment workflows across services and environments.
  • Improve reliability, availability, and performance through monitoring, alerting, and incident response practices.

Jobgether is a company that uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates and share this short list directly with the hiring company.

Global

  • Automate infrastructure provisioning, configuration management, monitoring, and operational workflows using IaC and scripting languages.
  • Own the deployment, maintenance, and lifecycle management of systems supporting engineering, leveraging deep expertise in Kubernetes.
  • Troubleshoot complex infrastructure and application issues, driving root-cause analysis and developing long-term remediation solutions

SingleStore delivers the cloud-native database with the speed and scale to power the world’s data-intensive applications. They are venture-backed and headquartered in San Francisco with offices in Sunnyvale, Raleigh, Seattle, Boston, London, Lisbon, Bangalore, Dublin and Kyiv.

US

  • Design, create, and maintain software and systems to improve the availability, scalability, and efficiency of Thumbtack's services.
  • Set the architectural direction of infrastructure and platform services while supporting the engineering organization.
  • Troubleshoot and debug critical systems throughout the SDLC.

Thumbtack helps millions of people confidently care for their homes by offering personalized guidance, AI tools, and a hiring experience. They have a growing community of 300,000 local service businesses and value a cross functional collaborative culture.

Canada

  • Design, create, and maintain software and systems to improve the availability, scalability, and efficiency of Thumbtack's services
  • Set the architectural direction of infrastructure and platform services while supporting the engineering organization
  • Design and implement tools and processes used for deployment, change, service, and infrastructure management

Thumbtack helps millions of people confidently care for their homes through personalized guidance, AI tools, and a hiring experience. They have a growing community of 300,000 local service businesses.

$155,000–$165,000/yr
US Unlimited PTO

  • Lead maintenance and operations for production and development environments.
  • Architect and implement complex solutions spanning OS, virtualization, network, and cloud layers.
  • Lead automation initiatives for infrastructure provisioning and operational tasks.

NMI enables partners with choice in payments, challenging the one-size-fits-all approach. They power innovative tech for SMBs, entrepreneurs, and fintech startups, fostering a diverse and welcoming workplace with a dedicated Diversity, Equity & Inclusion action group.

Global 6w PTO 26w maternity

  • Build self-service systems that automate managing, deploying and operating services.
  • Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.
  • Ensure we hit defined SLOs, including participation in an on-call rotation.

Cohere is focused on scaling intelligence to serve humanity by training and deploying frontier models for developers and enterprises. They are a team of researchers, engineers, and designers. They value diversity and strive to create an inclusive work environment.

$167,249–$216,090/yr
US

  • Contribute to the design of a scalable cloud infrastructure platform on Google Cloud.
  • Develop and maintain infrastructure automation using Terraform and Kubernetes controllers.
  • Ensure cloud infrastructure adheres to best practices for security and compliance.

Virta Health is dedicated to reversing metabolic disease in one billion people. They innovate through technology, personalized nutrition, and virtual care, partnering with health plans, employers, and government organizations, with over $350 million raised from investors.

Latin America Unlimited PTO

  • Audit and optimize cloud usage, capacity, and spend.
  • Improve reliability through better automation, monitoring, and alerting.
  • Partner with engineers to upgrade infrastructure components and roll out changes safely.

Our client builds a high-scale data and analytics platform used by sophisticated teams to make critical business decisions. They are trusted by 800+ companies and value collaboration, high ownership, and long-term system reliability.

$89,155–$287,488/yr
Global

  • Configure and maintain cloud infrastructure automation using Terraform, focusing on CDN optimization and content delivery performance
  • Develop capacity planning strategies and performance optimization initiatives for high-volume spatial content delivery.
  • Instrument services to understand system health.

Miris is a cutting-edge technology company building the future of 3D content delivery at global scale. Our mission is to empower creators and developers to deliver high-fidelity, photorealistic 3D experiences to billions of users instantly, seamlessly, and across all major platforms and devices.