Source Job

Canada

  • Working with engineers across Yelp in supporting new features and services.
  • Integrating tools to monitor platform stability and performance.
  • Helping scale our Kubernetes clusters and AWS-based infrastructure while maintaining our platform's SLOs.

Linux Python Kubernetes AWS Terraform

20 jobs similar to Site Reliability Engineer, Production Reliability

Jobs ranked by similarity.

Canada

  • Implementing improvements to the reliability, fault tolerance, scalability, and performance of our infrastructure
  • Managing incidents using your technical know-how to involve the appropriate teams and automate away manual practices
  • Improving observability across our systems (metrics, logs, tracing) to reduce time to detection and resolution

Newton is changing how Canadians trade crypto, with the goal of making financial freedom achievable for everyone by giving their customers the tools and knowledge needed to navigate the crypto world. They are a remote team spread across Canada that values pushing boundaries and getting things done.

$103,174–$117,720/yr
Canada

  • Lead efforts to scale and improve our infrastructure.
  • Develop and support internal team tooling.
  • Troubleshoot, debug and resolve issues as part of a shared on-call rotation.

Lillio, formerly HiMama, empowers early childhood educators through innovative tools. They are a Series B, private-equity backed company recognized as an industry leader and selected in 2025 by Time Magazine as one of the world's top EdTech companies.

$230,000–$250,000/yr
US Unlimited PTO 12w paternity

  • Define and evolve reliability standards for the SmarterDx platform.
  • Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
  • Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.

SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.

US Canada

  • Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement.
  • Participate in an on-call rotation and act as incident commander for high-severity production events.
  • Partner with engineering teams to build reliability into new features before they ship to production

Akuity helps enterprises ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane.

Global 4w PTO

  • Act as an escalation point for Tier 1 engineers: mentorship, technical guidance, troubleshooting.
  • Maintain and monitor hybrid infrastructure (servers, Linux/Windows, Kubernetes, AWS, storage, backups, VMware).
  • Automate processes with Ansible, Terraform; manage system configurations.

Apriorit is a software engineering company established in 2002, specializing in system programming, cybersecurity, and more. With over 400 specialists, they maintain high standards in software development and teamwork, serving high-profile clients worldwide.

Global

  • Provide production support on a shift according to the team on-call roster.
  • Work on tickets raised by customers and internal engineering/implementation teams while not on call for production support.
  • Continuously monitor the health and performance of our services, systems, and infrastructure.

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. They have served 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers.

$172,614/yr
US

  • Design infrastructure, networking, and software platform architecture.
  • Build and maintain automation of Continuous Integration and Continuous Deployment pipelines.
  • Troubleshoot infrastructure, internal applications, networking, and security issues.

Loadsmart is a technology company focused on the logistics and supply chain industry. They leverage data and technology to automate and optimize freight transportation, connecting shippers and carriers to streamline the shipping process. They are a mid-sized company passionate about transforming the future of freight.

$165,000–$195,000/yr
US

  • Support and operate Legion’s AWS-based cloud platform and Kubernetes (EKS) environments.
  • Build and maintain infrastructure-as-code using Terraform.
  • Improve CI/CD pipelines to increase deployment safety and velocity.

Legion Technologies delivers the industry’s most innovative workforce management platform. The AI-driven Legion WFM platform maximizes labor efficiency and employee engagement. They are a remote, mission-driven team that embraces a collaborative, fast-paced, and entrepreneurial culture.

US Canada 16w maternity

  • Build and deploy computing services and infrastructure in customer environments.
  • Clarify and surface requirements from ambiguous use cases defined by cross-functional stakeholders.
  • Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests.

Planet designs, builds, and operates the largest constellation of imaging satellites in history. They deliver an unprecedented dataset of empirical information via a revolutionary cloud-based platform to authoritative figures in commercial, environmental, and humanitarian sectors. Planet takes a people-centric approach to culture and community, and strives to iterate in a way that puts its team members first and prepares the company for growth.

$141,000–$230,000/yr
US

  • Collaborate with engineering teams to design and implement scalable, secure systems.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
  • Enhance incident response processes and post-mortem analysis for outages.

ClickHouse, recognized on the 2025 Forbes Cloud 100 list, is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.

Europe

  • Write code, automate everything, design for reliability, and deeply understand the systems.
  • Build or extend Terraform modules and contribute to Platform Engineering around Observability.
  • Collaborate with developers to shape feature design so that reliability is built in, not added later.

InPost Group is an innovative European out-of-home delivery company, revolutionizing the way parcels are delivered to customers. With over 10,000 employees worldwide, InPost Group is one of the largest out-of-home delivery providers in Europe, committed to providing sustainable and efficient delivery solutions.

$170,000–$240,000/yr
US 4w PTO

  • Own our fundamental cloud services and tooling.
  • Own our application platform.
  • Own our developer experience.

Propel builds technology that strengthens the social safety net. They are a passionate team of ~100 Propellers who envision a future where every American has the tools and resources they need to thrive, offering a remote-first working environment with headquarters in Brooklyn.

Global

  • Cooperate closely with other Platform and Engineering teams on strategic initiatives
  • Improve, automate and grow the SmartRecruiters cloud platform
  • Respond to client threats and remediate issues

SmartRecruiters is the Recruiting AI Company that transforms hiring for the world’s leading enterprises. An SAP company, they deliver an AI-powered hiring platform that automates and optimizes the entire talent acquisition process. They are a values-driven tech company with strong financial backing and a bold vision.

US

  • Collaborate with application engineering teams on platform infrastructure.
  • Enhance observability and spearhead the adoption of SRE best practices.
  • Build and maintain reliable CI/CD pipelines, tooling, and infrastructure.

Rula strives to provide quality, evidence-based, compassionate mental healthcare and aims to create a world where mental health is no longer stigmatized. They are a remote-first company operating in most U.S. states, and are dedicated to having a culture of inclusion that supports their employees.

US 6w PTO

  • Build and roll out new features with your team, iterating based on results.
  • Drive projects from initial ideation to operational deployment.
  • Analyze, design, and build modular solutions for complex challenges.

Jobgether is a platform helping candidates find the right job. They use AI-powered matching to ensure every application is reviewed quickly, objectively, and fairly against the core requirements.

Global

  • Ensure the availability, reliability, performance, and security of our SaaS platform
  • Lead infrastructure automation efforts using Infrastructure as Code and Configuration Management tools
  • Define and monitor SLAs/SLOs/SLIs, and drive service quality improvements

Remote People builds the infrastructure to power borderless teams. Their technology enables businesses to hire anyone anywhere compliantly at the push of a button. They are committed to building a global, diverse team representing different and varied backgrounds, perspectives, and experiences.

Europe

  • Work closely with developers and operations teams to scale and optimize their infrastructure for sustained growth.
  • Design, deploy, and operate their core backend infrastructure using an automated, Infrastructure-as-Code approach.
  • Prioritize and own delivery in a small, highly efficient team — you set the bar, not just maintain it.

Relai is Europe's fastest growing Bitcoin-only app. They are looking for an experienced, results-oriented and impact-driven Senior DevOps Engineer who can help them scale their infrastructure and pursue their mission of bringing the best store of value to more people.

$100,000–$130,000/yr
Canada

  • Design, build, and maintain Kubernetes-based infrastructure and cloud environments.
  • Build and optimize CI/CD pipelines that enable fast, safe, and repeatable deployments.
  • Leverage AI coding tools and agentic workflows as a core part of your work.

Intrahealth, a subsidiary of HEALWELL AI Inc., is an enterprise-class EMR provider supporting approximately 20,000 providers and the care delivery of tens of millions of patients and clients across Canada, Australia and New Zealand. Intrahealth provides a suite of flexible software solutions to a wide variety of customers, including health authorities, public health, community health, home care, and primary care professionals.

$133,110–$148,042/yr
US

  • Collaborate with stakeholders to drive best practices for monitoring and CI/CD pipelines
  • Troubleshoot deployment issues in our CI pipeline
  • Identify areas for automation and embrace the codification of all things

Weedmaps is a global leader in the cannabis industry. They are dedicated to transparency, education, and community, serving cannabis to consumers and businesses in the U.S. and worldwide.

$110,000–$175,000/yr
US

  • Become a subject matter expert in applications supporting Ooma customers.
  • Collaborate with Development, QA and other SREs to evaluate, deploy, and debug applications.
  • Improve observability by implementing, refining, and adjusting application monitoring and thresholds.

Ooma empowers people to connect in smarter ways by creating powerful communication experiences through their cloud-based platform. They help small business owners stay connected, provide customized unified communications solutions, and offer smart home security solutions.