Source Job

US Global

  • Perform day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure.
  • Implement and utilize configuration management and deployment tools.
  • Collaborate with a global, cross-functional team in an asynchronous communication environment.

Python Puppet Linux TCP/IP HTTP

20 jobs similar to Senior Site Reliability Engineer

Jobs ranked by similarity.

$110,000–$175,000/yr
US

  • Become a subject matter expert in applications supporting Ooma customers.
  • Collaborate with Development, QA and other SREs to evaluate, deploy, and debug applications.
  • Improve observability by implementing, refining, and adjusting application monitoring and thresholds.

Ooma empowers people to connect in smarter ways by creating powerful communication experiences through their cloud-based platform. They help small business owners stay connected, provide customized unified communications solutions, and offer smart home security solutions.

$172,614/yr
US

  • Design infrastructure, networking, and software platform architecture.
  • Build and maintain automation of Continuous Integration and Continuous Deployment pipelines.
  • Troubleshoot infrastructure, internal applications, networking, and security issues.

Loadsmart is a technology company focused on the logistics and supply chain industry. They leverage data and technology to automate and optimize freight transportation, connecting shippers and carriers to streamline the shipping process. They are a mid-sized company passionate about transforming the future of freight.

Canada

  • Designing and implementing SLI/SLO frameworks with error budgets to guide reliability and performance decisions.
  • Building and maintaining AWS-based production infrastructure using Infrastructure as Code (Terraform, CloudFormation), including ECS, EKS/Kubernetes, and microservices orchestration.
  • Developing internal tools, automation frameworks, and reliability services in TypeScript, Python, or similar languages to enhance operational efficiency.

Jobgether uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

Europe

  • Implement SLI/SLO frameworks with error budgets to drive reliability decisions
  • Design release strategies including blue/green deployments and version tracking
  • Lead incident response and develop automated runbooks to reduce MTTR

Jobgether is a company that helps connect individuals with jobs through an AI-powered matching process. They ensure applications are reviewed quickly, objectively, and fairly against roles' core requirements.

4w PTO

  • Work closely with developers on prototyping and designing new features as part of the infrastructure.
  • Deploy, install, configure and maintain sophisticated Trading/Finance and related software.
  • Build & maintain CI/CD pipelines.

Devexperts works with respected financial institutions, delivering products and tailor-made solutions for retail and brokerage houses, exchanges, and buy-side firms. The company focuses on trading platforms and brokerage automation, complex software development projects, market data products, and IT consulting services.

$141,000–$230,000/yr
US

  • Collaborate with engineering teams to design and implement scalable, secure systems.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
  • Enhance incident response processes and post-mortem analysis for outages.

ClickHouse, recognized on the 2025 Forbes Cloud 100 list, is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.

Mexico

  • Collaborate with engineers in supporting new features and services.
  • Build tools to monitor site stability and performance.
  • Troubleshoot site issues using industry-leading tools like Splunk, Prometheus and OpenTelemetry.

Yelp's engineering culture is cooperative and values individual authenticity. They encourage engineers to find creative solutions to problems, help users, grow as engineers, and have fun in a collaborative environment.

US

  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning.
  • Analyze, troubleshoot and resolve operational challenges contributing to defined SLOs.
  • Manage site stability, performance, reliability, and maintain uptime for production environments.

CentralReach provides autism and IDD care software for Applied Behavior Analysis (ABA), multidisciplinary therapy, and special education. They are trusted by more than 200,000 users and are backed by Roper Technologies, Inc. (Nasdaq: ROP). Their culture is centered around impact, inclusion, and flexibility.

Global

  • Design, deploy, and manage scalable infrastructure on Google Cloud Platform (GCP).
  • Collaborate closely with the Database Administration team to ensure high availability and performance of data systems.
  • Automate operational workflows and maintenance tasks using Python (primary team standard) or Shell/Bash.

Miratech is a global IT services and consulting company that brings together enterprise and start-up innovation. They support digital transformation for some of the world's largest enterprises and retain nearly 1000 full-time professionals with an annual growth rate exceeding 25%.

Europe

  • Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
  • Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
  • Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.

Fixify is on a mission to reimagine how IT teams support companies. They need a Senior Site Reliability Engineer who finds joy in building systems that fade into the background, empowering product engineers to ship with confidence and their customers to work without interruption.

$90,000–$125,000/yr
US 3w PTO

  • Support Engineering and Platform automation efforts with development and scripting skills.
  • Automate operational processes using scripting languages.
  • Develop, implement, and continually improve system and network monitoring and alerting capabilities and procedures.

Cotiviti is focused on providing payment accuracy and analytics-driven solutions that drive measurable results. They offer team members a competitive benefits package and have a culture of valuing individual qualifications without regard to race, gender, or other protected characteristics.

$170,000–$240,000/yr
US 4w PTO

  • Own our fundamental cloud services and tooling.
  • Own our application platform.
  • Own our developer experience.

Propel builds technology that strengthens the social safety net. They are a passionate team of ~100 Propellers who envision a future where every American has the tools and resources they need to thrive, offering a remote-first working environment with headquarters in Brooklyn.

US

  • Architect and deploy on-premise and cloud-based Linux infrastructure.
  • Develop and maintain Infrastructure-as-Code (IaC) frameworks using Terraform and Ansible.
  • Implement system-level security best practices including patching and hardening.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

  • Maximize the velocity of our product engineering team.
  • Ensure platform scalability, reliability, and security.
  • Champion best practices and shape the engineering culture.

They are building a robust, scalable trading platform to serve high-traffic, latency-sensitive applications. They leverage state-of-the-art technologies to support real-time trading while providing unparalleled reliability and performance.

North America Canada

  • Operate and maintain ServiceNow’s global cloud network infrastructure.
  • Troubleshoot and resolve network issues, including urgent operational events.
  • Participate in 24/7 on-call rotation, including weekends, as part of the Network Operations Engineering team.

ServiceNow is a global market leader bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500®. Their intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work.

$100,000–$170,000/yr
US

  • Oversee the operation and maintenance of the trading systems, guaranteeing continuity and stability in the production trading environment.
  • Develop automation tools to streamline operational processes, reducing overhead and enhancing efficiency.
  • Triage, prioritize and troubleshoot complex network and systems issues, ranging from low-level hardware to in-house software applications.

They participate in a wide variety of marketplaces including global futures, equities, commodities, options, fixed income, and cryptocurrencies. Their culture emphasizes teamwork and focuses on continuous integration and test-driven development.

Global

  • Build and own the foundational infrastructure that our products run upon.
  • Work directly on our products' Go codebase to implement SRE-related objectives.
  • Take a data-driven approach to quantifying system performance and reliability.

LiveKit provides the network infrastructure for multimodal AI interfaces, enabling seamless audio and visual interactions. Founded in 2021, LiveKit supports over 3 billion calls annually, with 100,000+ developers and industry giants like OpenAI, Spotify, and Meta.

US

  • Collaborate with application engineering teams on platform infrastructure.
  • Enhance observability and spearhead the adoption of SRE best practices.
  • Build and maintain reliable CI/CD pipelines, tooling, and infrastructure.

Rula strives to provide quality, evidence-based, compassionate mental healthcare and aims to create a world where mental health is no longer stigmatized. They are a remote-first company operating in most U.S. states, and are dedicated to having a culture of inclusion that supports their employees.

US 6w PTO

  • Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
  • Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
  • Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack.

$160,000–$200,000/yr
US

  • Help drive reliability, automation and performance within our cloud-based infrastructure.
  • Become embedded within an Engineering team helping them navigate production excellence and advocate for best practices.
  • Debug production issues across services and levels of the stack as well as practice incident response and blameless postmortems.

Flywire is a global payments enablement and software company that was founded over a decade ago. They have over 1,200 global FlyMates, representing more than 40 nationalities, in 12 offices worldwide, and are looking for people to join the next stage of their journey as they continue to grow.