Source Job

Nigeria

  • Detect and triage service and reliability issues.
  • Develop automation to eliminate manual and repetitive operational tasks.
  • Investigate and resolve customer complaints escalated beyond L1 and L2 support.

Java Go Python SQL Kubernetes

20 jobs similar to Site Reliability Engineer

Jobs ranked by similarity.

$113,082–$175,725/yr
Canada

  • Operate and maintain large-scale data systems, ensuring stability and performance.
  • Design, implement, and optimize deployment processes using virtualization.
  • Monitor system health, analyze failures, and identify instability sources.

Jobgether is a platform that uses AI-powered matching to connect candidates with companies. They ensure applications are reviewed quickly, objectively, and fairly, then share a shortlist of top candidates directly with the hiring company.

LATAM

  • Monitor production systems, dashboards, logs, and alerts to ensure high availability and performance across distributed environments.
  • Assist in incident detection, triage, escalation, and resolution, following structured on-call rotations with mentorship support.
  • Maintain, follow, and continuously improve runbooks, operational procedures, and incident response workflows.

Jobgether is a platform that helps job seekers find the right opportunities. They use an AI-powered matching process to ensure applications are reviewed quickly and fairly.

US Canada Europe Asia

  • Automate the provisioning of all of Juniper Square’s infrastructure in code.
  • Partner with our Platform Engineering team on joint initiatives and enhancements that build developer tooling and improve the developer experience.
  • Partner with our Data Engineering team on improving our data posture and driving operational excellence.

Juniper Square's mission is to unlock the full potential of private markets by digitizing them to bring efficiency, transparency, and access. They are a values-driven organization with a hybrid workplace strategy, allowing employees to collaborate effectively across multiple countries and offering physical offices in several major cities.

$120,000–$180,000/yr
US

  • Develop automation code to provision and operate infrastructure at scale.
  • Build resilient, scalable, secure, and observable services with cost optimization.
  • Proactively identify and address security concerns across systems and infrastructure.

Globality uses AI to transform enterprise spending into a more efficient and inclusive process. They aim to revolutionize enterprise procurement with AI and have a culture built on trust, collaboration, and innovation, fostering an environment where every individual feels valued and included.

Global

  • Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.
  • Ensure reliability of multi-cloud Kubernetes clusters and pipelines.
  • Provide metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.

Cresta is on a mission to turn every customer conversation into a competitive advantage by unlocking the true potential of the contact center. Their platform combines the best of AI and human intelligence to help contact centers discover customer insights and behavioral best practices.

Europe Middle East Africa

  • Design, deploy, and maintain cloud infrastructure to support a Dataiku SaaS offering, mainly on AWS, Azure, and GCP.
  • Continuously improve the infrastructure, deployment, and configuration to deliver more reliable, resilient, scalable, and secure services.
  • Automate all technical operations as much as possible.

Dataiku is The Universal AI Platform™, giving organizations control over their AI talent, processes, and technologies to unleash the creation of analytics, models, and agents. They connect many data science technologies and integrate the best of data and AI tech.

Australia

  • Support and evolve the reliability of platforms used by the AI Research team.
  • Ensure production services meet expectations for availability, latency, and operational readiness.
  • Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps.

Algolia is a pioneer and market leader in AI Search, empowering 17,000+ businesses to deliver blazing-fast, predictive search and browse experiences. They have raised $150 million in Series D funding, quadrupling their valuation to $2.25 billion, investing in their market-leading platform.

US 5w maternity

  • Support teammates with goal-setting, professional development, and mentoring.
  • Ensure delivery of maintainable, high-quality platform systems.
  • Build and sustain a healthy team culture where ownership and collaboration are the norm.

onX is a pioneer in digital outdoor navigation solutions through its suite of apps. With over 400 employees, they foster a fast-paced, tech-forward environment valuing ownership, accountability, and teamwork.

India

  • Configure and operate monitoring, logging, and tracing tools for application performance.
  • Build dashboards and automation workflows for system reliability and uptime.
  • Collaborate with software engineering teams to design and implement robust systems.

Jobgether is a platform that uses AI-powered matching to connect job seekers with employers. They ensure applications are reviewed quickly and fairly, then share a shortlist with the hiring company for final decisions.

$130,000–$140,000/yr
Global 7w PTO

  • Act as a primary responder for system incidents and outages, ensuring high availability and fast recovery.
  • Own and continuously improve monitoring, alerting, and log management systems.
  • Manage, optimize, and scale database infrastructure including MySQL, PostgreSQL, ClickHouse, and Redis.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates and share this shortlist directly with the hiring company.

US

  • Work directly with customers to ensure successful Teleport deployments.
  • Meet regularly with customers, understand pain points blocking deployments and remove roadblocks.
  • Work with customers to articulate the problem they are trying to solve, gather requirements, and make the business case to the product and engineering teams to invest in resolving the issue.

Teleport is the Infrastructure Identity Company, modernizing identity, access, and policy for infrastructure, improving engineering velocity and resiliency of critical infrastructure against human factors and/or compromise. They are a fast-growing, well-funded Y-Combinator company that values craft, strongly supports work/life balance, and embraces a culture of humility, honesty, and transparency.

$150,000–$167,000/yr
US

  • Lead reliability-focused design and readiness reviews.
  • Build, operate, and continuously improve our observability stack.
  • Own and evolve incident management practices.

Transcend is building the privacy platform that easily embeds privacy into your entire tech stack. They are growing quickly, backed by top-tier investors and are proud to serve some of the world's most iconic brands.

US Unlimited PTO

  • Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle.
  • Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure.
  • Define, deploy, and maintain system and service monitors.

ScienceLogic is a leader in IT Operations Management, giving modern IT operations actionable insights for faster problem resolution and prediction. They see everything across cloud and distributed architectures, contextualizing data through relationship mapping, and acting on this insight through integration and automation.

US Canada 6w PTO

  • Work with your team to build and roll out new features, then use the results to iterate and improve.
  • Drive projects from initial ideation all the way to operations once they are in the hands of customers.
  • Maintain critical systems, and own their reliability, performance, and availability.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users. They provide observability strategies for over 3,000 companies, featuring scalable metrics, logs, and traces, and thrive in an innovation-driven environment with transparency, autonomy, and trust.

Europe

  • Collaborate with the team to design, build, and maintain a robust and scalable infrastructure.
  • Manage and optimize Linux-based systems to ensure high availability and performance.
  • Utilize Kubernetes to orchestrate containers and maintain containerized applications effectively.

As Europe’s No.1 e-pharmacy, Redcare Pharmacy is powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.

$98,583–$138,016/yr
US Unlimited PTO

  • Respond to production incidents and contribute to post-incident analysis.
  • Identify and automate manual processes to improve efficiency and reduce risk.
  • Enhance monitoring tools and platforms to improve observability.

Restaurant365 is a SaaS company that provides a unique, centralized solution for accounting and back-office operations for restaurants. They focus on empowering team members to produce top-notch results while elevating their skills.

$126,000–$184,000/yr
US

  • Own the operational stability and performance of Juul’s hybrid cloud infrastructure.
  • Lead automation efforts and architect for reliability.
  • Act as the final escalation point for critical incidents.

Juul Labs aims to transition the world’s billion adult smokers away from combustible cigarettes and eliminate their use, while also combating underage usage of their products. They are backed by leading technology investors and are committed to hiring great talent and building a diverse team.

$175,000–$195,000/yr
Americas Unlimited PTO 16w maternity

  • Lead effective squad rituals and ensure production readiness.
  • Partner with engineers to ensure solutions are scalable, architecturally sound, flexible, and secure.
  • Provide timely, specific coaching and development opportunities for your direct reports.

Customer.io's platform allows over 8,000 companies to send messages using real-time behavioral data. Their team uses Go, React, Ember, and AI to ship fast and scale with confidence, and they value ownership, leadership, and healthy skepticism.

US

  • Lead incident response as Incident Commander, coordinating teams, communications, and service restoration.
  • Produce executive-level incident reports, run RCAs, and drive continuous improvement.
  • Enforce change management and risk assessment for production changes.

Truelogic is a leading provider of nearshore staff augmentation services headquartered in New York, delivering top-tier technology solutions to companies of all sizes. Their team of 600+ highly skilled tech professionals, based in Latin America, drives digital disruption by partnering with U.S. companies on their most impactful projects.

$116,943–$140,233/yr
UK 6w PTO

  • Design and implement high-quality, scalable services to be consumed by multiple Grafana Cloud products.
  • Support the technical direction and vision of the team, contributing to strategic discussions and future development of observability solutions.
  • Be a part of your team’s follow-the-sun on-call rotations and take ownership of the services you’re running.

Grafana Labs is a remote-first, open-source powerhouse that provides the leading open source visualization tool. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack. The team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.