Source Job

UK

  • Act as a primary or escalation responder in a 24x7 on‑call rotation
  • Automate repetitive operational tasks to reduce manual toil
  • Support and troubleshoot Linux‑based systems and cloud platforms (AWS, Azure, GCP)

Linux AWS Kubernetes Python Bash

20 jobs similar to Site Reliability Engineer

Jobs ranked by similarity.

Canada

  • Implementing improvements to the reliability, fault tolerance, scalability, and performance of our infrastructure
  • Managing incidents using your technical know-how to involve the appropriate teams and automate away manual practices
  • Improving observability across our systems (metrics, logs, tracing) to reduce time to detection and resolution

Newton is changing how Canadians trade crypto with the goal to make financial freedom achievable for everyone by giving their customers the tools and knowledge needed to navigate the crypto world. They are a remote team spread across Canada that values pushing boundaries and getting things done.

LATAM

  • Support the availability and durability of critical services across production environments.
  • Develop automation for common operational tasks, reducing manual intervention and toil.
  • Partner with engineering, product, and operations teams to support resilient system design and operations.

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets and unleash innovators. Founded in 2007, they scaled the business with less than $3 million in outside funding until 2021, and generate over $100m in revenue managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries.

US

  • Own end-to-end availability and performance of critical services, including building automation to prevent recurring issues
  • Administer Linux and Windows systems across web, application, and database servers
  • Develop and automate solutions using various programming languages

Coupa provides a total spend management platform for businesses. They utilize AI and a global network of buyers and suppliers. The company values collaboration, teamwork, transparency, and a commitment to excellence.

Global

  • Provide production support on a shift according to the team on-call roster.
  • Work on tickets raised by customers and internal engineering/implementation teams while not on call for production support.
  • Continuously monitor the health and performance of our services, systems, and infrastructure.

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. They have served 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers.

Canada

  • Working with engineers across Yelp in supporting new features and services.
  • Integrating tools to monitor platform stability and performance.
  • Helping scale our Kubernetes clusters and AWS-based infrastructure while maintaining our platform's SLOs.

Yelp's engineering culture values individual authenticity and encourages creative solutions. They focus on helping users, growing as engineers, and having fun in a collaborative environment.

$100,000–$120,000/yr
US

  • Lead efforts to improve system reliability, scalability, and performance across critical services
  • Define and implement SLIs/SLOs and error budgets, and use them to guide engineering priorities
  • Design and develop observability systems (metrics, logging, tracing, alerting) that produce actionable alerts and data with minimal noise

UJET is an AI-powered contact center innovation company, delivering a cloud platform that redefines the customer experience. They are built on a cloud-native architecture and partner with businesses to deliver exceptional interactions and accelerated growth in the AI-driven world.

$65,500–$79,600/yr
Europe

  • Ensure the reliability of our critical products and services by meeting or exceeding SRE objectives.
  • Instantiate and maintain production infrastructure using Infrastructure as Code and Configuration Management tools.
  • Automate deployments, administration, and monitoring of our services by following CI/CD practices.

Sectigo delivers certificate lifecycle management (CLM) solutions that secure human and machine identities. They are one of the largest CAs with over 700,000 customers and strive to delight their customers and become the market leader in their industry.

Unlimited PTO

  • Develop and maintain observability solutions using platforms like Datadog, Prometheus and Grafana
  • Take a leading role in incident management, including coordinating response efforts, troubleshooting issues, and identifying follow-up actions
  • Partner with product engineering teams to architect reliable systems, recover from incidents, and learn from mistakes

Ditto is redefining how data moves at the edge, aiming to make resilient, real-time applications seamless for developers, regardless of network conditions. It's a globally distributed and fast-growing startup with over $145 million in funding that is committed to building a diverse and inclusive team.

Europe

  • Design, build, and maintain scalable, highly available and fault-tolerant infrastructures.
  • Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration.

Mistral AI is dedicated to democratizing AI through high-performance, optimized, open-source models, products, and solutions designed to integrate seamlessly into daily working life. They are a dynamic, collaborative team dedicated to innovation and passionate about AI and its potential to transform society.

Americas

  • Manage and support infrastructure for Growth teams, including Nomad, Hashistack, databases, and any other underlying systems
  • Maintain and troubleshoot GitLab CI pipelines, ensuring reliable and fast build, test, and deployment cycles
  • Provide operational support across Onboarding, Acquire, and Engage teams, helping debug issues in staging and production environments

Kraken is a mission-focused company rooted in crypto values, aiming to accelerate the global adoption of crypto, so that everyone can achieve financial freedom and inclusion. As a fully remote company, they have Krakenites in 70+ countries who speak over 50 languages.

$198,025–$287,952/yr
US

  • Building tools and applications to extend Calendly’s infrastructure platform
  • Evaluating and deploying cloud native open source tools
  • Exercising expertise in cloud infrastructure concepts and patterns

Calendly makes it possible for their customers through impactful innovation. They have millions of users and are in the midst of exciting product growth.

$198,025–$287,952/yr

  • Building tools and applications to extend Calendly’s infrastructure platform
  • Evaluating and deploying cloud native open source tools
  • Exercising expertise in cloud infrastructure concepts and patterns

Calendly's product powers connections for millions through impactful innovation. They are in the midst of exciting growth and want people who are eager to learn, grow, and do their best work.

$161,637–$175,000/yr
US

  • Handle technical escalations and engage in complex troubleshooting within a Follow-the-Sun (FTS) support model.
  • Develop automation frameworks and regression test suites (Python, Bash) to streamline deployment and testing processes.
  • Troubleshoot and manage incidents for production systems (cloud infrastructure, TCP/IP networking) and work with NoSQL databases (Redis).

Redis created the product that runs the fast apps our world runs on, building a faster world with simpler experiences. As a global company, it values a culture of curiosity, diversity of thought, and innovation from its employees, customers, and partners.

$103,174–$117,720/yr
Canada

  • Lead efforts to scale and improve our infrastructure.
  • Develop and support internal team tooling.
  • Troubleshoot, debug and resolve issues as part of a shared on-call rotation.

Lillio, formerly HiMama, empowers early childhood educators through innovative tools. They are a Series B, private-equity backed company recognized as an industry leader and selected in 2025 by Time Magazine as one of the world's top EdTech companies.

US India

  • Operate and improve platform tools so product teams can ship reliably, triaging tickets, fixing build issues, and handling routine service requests.
  • Maintain and extend self-service workflows by updating docs, examples, and guardrails under guidance from senior engineers.
  • Perform day-to-day Kubernetes operations: deploy/update Helm charts, manage namespaces, diagnose rollout issues, and follow runbooks for incident response.

ISHIR is a digital innovation and enterprise AI services provider. They work with startups and enterprises to shape the future through accelerated innovation, deep technical expertise, access to global digital talent and a passion for complex problem-solving. ISHIR attracts proactive individuals who thrive on challenges and promote self-reliance, open communication, and collaboration.

$170,000–$190,000/yr

  • Build, maintain, and support all environments which host the Medrio Platform.
  • Monitor environments for issues, configuring and building alerting and self-healing to address them.
  • Work with developers and testers to troubleshoot application/platform issues.

Medrio seeks smart, capable, and conscientious people to help expand its product capabilities, grow its business, and better serve its customers. The Medrio team values collaboration, ingenuity and creating a culture of excellence!

Global

  • Design and implement infrastructure and tools that empower our product teams to rapidly and securely iterate, emphasizing reliability and automation.
  • Influence the strategic direction of our infrastructure and operational practices, ensuring that we are well-positioned to scale and support our growing organization.
  • Take a proactive role in the resolution of production issues, ensuring that we are well-prepared to handle incidents and that we learn from them in a blameless manner.

SSV Labs is the core team behind the SSV Network - pioneering decentralized infrastructure for Ethereum staking. They are building tools, protocols, and standards to make staking more secure, scalable, and trustless.

Europe

  • Manage and support hybrid-cloud infrastructure for the Payward Services business unit, including Nomad, Kubernetes, and databases.
  • Build automation tooling, maintain CI/CD pipelines, and consult on monitoring and alerting best practices to ensure service reliability.
  • Provide operational support, participate in incident response, and debug complex distributed system issues across production and staging environments.

Kraken is a mission-focused company building premium crypto products for traders and institutions, dedicated to accelerating global crypto adoption for financial freedom. It is a fully remote company with a global team of industry pioneers spread across 70+ countries, operating with a strong crypto ethos and commitment to security and education.

Global 4w PTO

  • Act as an escalation point for Tier 1 engineers: mentorship, technical guidance, troubleshooting.
  • Maintain and monitor hybrid infrastructure (servers, Linux/Windows, Kubernetes, AWS, storage, backups, VMware).
  • Automate processes with Ansible, Terraform; manage system configurations.

Apriorit is a software engineering company established in 2002, specializing in system programming, cybersecurity, and more. With over 400 specialists, they maintain high standards in software development and teamwork, serving high-profile clients worldwide.

Spain 6w PTO

  • Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure.
  • Diagnosing and eliminating cross-layer failure modes.
  • Designing safe upgrade and rollout strategies at scale.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana, its open source visualization tool. Grafana Labs helps more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and its team thrives in an innovation-driven environment.