Source Job

US

  • Lead incident response as Incident Commander, coordinating teams, communications, and service restoration
  • Produce executive-level incident reports, run RCAs, and drive continuous improvement
  • Enforce change management and risk assessment for production changes

Linux Windows CloudWatch CI/CD

20 jobs similar to Site Reliability Operations

Jobs ranked by similarity.

$98,583–$138,016/yr
US Unlimited PTO

  • Respond to production incidents and contribute to post-incident analysis.
  • Identify and automate manual processes to improve efficiency and reduce risk.
  • Enhance monitoring tools and platforms to improve observability.

Restaurant365 is a SaaS company that provides a unique, centralized solution for accounting and back-office operations for restaurants. They focus on empowering team members to produce top-notch results while elevating their skills.

US Unlimited PTO

  • Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle.
  • Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure.
  • Define, deploy, and maintain system and service monitors.

ScienceLogic is a leader in IT Operations Management, giving modern IT operations actionable insights for faster problem resolution and prediction. They see everything across cloud and distributed architectures, contextualizing data through relationship mapping, and acting on this insight through integration and automation.

$103,200–$178,400/yr
US

  • Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents.
  • Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services and Kubernetes Clusters.
  • Collaborate post-incident with engineering teams, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.

eBay is a global ecommerce leader that is changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world, and the team fosters an inclusive and collaborative culture, encouraging open communication, continuous learning, and professional growth.

$83,000–$96,000/yr
US

  • Lead the identification, triage, escalation, and resolution of incidents to minimize customer and business impact.
  • Provide timely, clear, and professional communication to internal stakeholders throughout the incident lifecycle.
  • Develop, maintain, and improve incident management processes, procedures, runbooks, and playbooks.

NetDocuments is the world’s #1 trusted cloud-based content management and productivity platform that helps legal professionals do their best work. They strive to win together through passionate hard work, exploring new things and recognizing every interaction matters.

Global

  • Take a lead role in major incidents and ensure effective communication to stakeholders.
  • Monitor, control, and support service delivery, ensuring systems and procedures are followed.
  • Define and track service measures and KPIs to manage the performance of IT services.

RWS unlocks global understanding by growing the value of ideas, data, and content. The company values every language and culture and has a global reach, providing support services to over 7500 end users worldwide, with a dedicated team of over 500 staff across all regions.

US

  • Ensure near-zero downtime with monitoring and alerting, self-healing automation, and continuous improvement
  • Create highly automated, available and scalable systems by applying software and infrastructure principles
  • Employ and advise clients on DevOps and SRE principles and practices, covering deployment pipelines, HA, service reliability, technical debt, and operational toil for live services running at scale

66degrees is an AI transformation partner. They guide enterprises from business challenges to quantifiable outcomes, helping businesses reach their inflection point where chaotic data becomes a strategic asset, complexity becomes clarity, and AI becomes an engine for growth. They believe in thriving through challenges and winning together.

US

  • Design, build, and maintain secure, scalable cloud infrastructure.
  • Own CI/CD pipelines and deployment workflows across services and environments.
  • Improve reliability, availability, and performance through monitoring, alerting, and incident response practices.

Jobgether is a company that uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates and share this short list directly with the hiring company.

$126,000–$184,000/yr
US

  • Own the operational stability and performance of Juul’s hybrid cloud infrastructure.
  • Lead automation efforts and architect for reliability.
  • Act as the final escalation point for critical incidents.

Juul Labs aims to transition the world’s billion adult smokers away from combustible cigarettes and eliminate their use, while also combating underage usage of their products. They are backed by leading technology investors and are committed to hiring great talent and building a diverse team.

US Canada Europe

  • Design, build, and maintain highly available, scalable infrastructure.
  • Manage and optimize infrastructure across GCP, AWS, Azure, and other cloud providers.
  • Develop comprehensive monitoring, logging, and alerting systems.

Bobsled is seeking a Site Reliability Engineer to enhance its data-sharing platform's reliability and scalability. We're a company that values growth, offering flexible work hours in a fully remote environment and fully sponsored individual coaching for all employees.

Global

  • Design and implement reliable and scalable AWS architecture.
  • Support the CI/CD process with ArgoCD and GitOps, automating deployments with Terraform.
  • Optimize system performance and troubleshoot issues, collaborating with development teams.

Cloudbeds is transforming hospitality with its intelligently designed platform that powers properties across 150 countries. They are a completely remote team of 650+ employees across 40+ countries, focused on building AI-powered solutions for hotels.

US Unlimited PTO 11w maternity

  • Own and maintain the incident response process, including defining procedures, tools, and best practices
  • Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
  • Lead capacity planning initiatives, focusing on both short and long-term scalability while optimizing costs

Underdog makes sports more fun by building the best products for sports fans. They are a fast-growing sports company valued at $1.3B with a focus on a seamless, simple, easy-to-use, intuitive, and fun app.

  • Act as first responder and incident commander during production incidents
  • Improve reliability and uptime across all Wormhole services
  • Harden infrastructure for security and operational resiliency

Wormhole Foundation empowers passionate people in the research and development of blockchain interoperability technologies. They support teams building secure, open-source, and decentralized products within the Wormhole ecosystem.

Global

  • Lead and manage the DevOps team, prioritizing performance and accountability across cloud functions.
  • Define and enforce DevSecOps standards integrating automation, security, and compliance.
  • Optimize cloud infrastructure across AWS, GovCloud, and Azure for uptime and cost-effectiveness.

Jobgether is a company using an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly. This allows them to identify the top-fitting candidates for companies, and this shortlist is then shared directly with the hiring company.

$109,800–$252,500/yr
US Unlimited PTO 16w maternity 8w paternity

  • Design, implement, and maintain scalable and reliable infrastructure solutions.
  • Automate deployments and maintain a resilient, secure SaaS application platform.
  • Develop comprehensive monitoring and alerting solutions, and respond to incidents.

Veeam is the #1 global market leader in data resilience, believing businesses should control all their data whenever and wherever they need it, providing data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep their businesses running.

Europe

  • Ensure the infrastructure is configured and distributed correctly to meet stability and performance objectives.
  • Manage the day-to-day operations of the IT infrastructure environment by monitoring performance, configuration, maintenance, and repair.
  • Deploy and manage Windows/Linux/Unix servers.

Jobgether is connecting tech talent to opportunity. They focus on AI-powered matching processes.

$125,000–$175,000/yr
US

  • Oversee daily operations of all technical departments.
  • Ensure SLA adherence and quality control.
  • Partner with Client Success on service reviews.

SugarShot is an information technology company with practice areas in Cybersecurity, IT Support, and Professional Services. They are growing quickly, have been honored on the Inc. 5000 three years in a row, and have excellent opportunities for great people who are looking to make a real difference in the marketplace.

US Unlimited PTO

  • Contribute to high impact AWS cloud infrastructure initiatives.
  • Participate in operability and production readiness reviews.
  • Advocate and implement Site Reliability Engineering practices.

Patreon is a media and community platform where creators give fans access to exclusive work. They have generated over $10 billion for creators and have 25 million+ paid memberships, with a hybrid work model and offices in New York and San Francisco.

India

  • Configure/operate monitoring, logging, and tracing tools for application performance.
  • Build dashboards and automation workflows for system reliability and uptime.
  • Collaborate with software engineering teams to design and implement robust systems.

Jobgether is a platform that uses AI-powered matching to connect job seekers with employers. They ensure applications are reviewed quickly and fairly, then share a shortlist with the hiring company for final decisions.

Europe

  • Operate, maintain, and troubleshoot UNIX/Linux systems running in cloud environments
  • Support and maintain existing configuration management and Infrastructure as Code setups
  • Assist with the operation of cloud-based infrastructure, including virtual machines, networking components, and managed services

Dataiku is The Universal AI Platform™, giving organizations control over their AI talent, processes, and technologies to unleash the creation of analytics, models, and agents. Providing no-, low-, and full-code capabilities, Dataiku meets teams where they are today, allowing them to begin building with AI using their existing skills and knowledge.

North America

  • Own the strategy and execution for Runtime Platform.
  • Set the technical direction, build and develop the team, and be accountable for outcomes.
  • Translate product needs into platform capabilities and build trust through consistent delivery.

Wealthsimple aims to help everyone achieve financial freedom by reimagining how people manage their money. As the largest fintech company in Canada, it has more than 3 million users and manages more than $100 billion in assets, fostering inclusive and high-performing teams.