Source Job

US

  • Designs, implements, and continuously improves observability strategies across services.
  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards.

AWS Kubernetes Python Grafana

20 jobs similar to Site Reliability Engineer

Jobs ranked by similarity.

Design, implement, monitor and maintain Sysdig's Infrastructure at scale on different clouds and on-prem. Collaborate with development teams to improve system reliability, performance, and scalability. Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.

Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.

Brazil 26w maternity 4w paternity

Support the evolution of our platform by improving scalability, reliability, observability, and security. Proactively identify bottlenecks and unlock the autonomy of the entire engineering team. Maintain infrastructure & deployment pipelines and collaborate with engineering teams on architectural decisions and production-readiness practices.

Feegow joined the Docplanner Group, a health-tech company, in 2022 and is dedicated to developing innovative solutions for physicians and managers.

Europe

As an SRE you will be responsible for ensuring the availability, performance and cost effectiveness of these services. You will work with multiple feature development teams and the BAU/Support team to define and evolve our cloud & on-prem infrastructure & delivery pipelines, improve system observability, and proactively identify and mitigate reliability risks.

TwinStream was formed in 2019, while our founders were working as engineers solving complex cross-domain problems within government organisations.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. The job posting doesn't specify company size, but the culture appears to value transparency, collaboration, grit, and humility.

UK

Run the production environment by monitoring availability and taking a holistic view of system health. Build software and systems to manage platform infrastructure and applications. Improve reliability, quality, and time-to-market of our suite of software solutions.

NICE software products are used by 25,000+ global businesses to deliver extraordinary customer experiences, fight financial crime and ensure public safety.

$174,600–$220,000/yr
US

  • Lead capacity planning, autoscaling, and performance optimization across our application.
  • Define and enforce best practices for scalability, reliability, observability, and infrastructure resilience.
  • Conduct architectural reviews and propose improvements to enhance performance and cost efficiency.

Hypori Inc., a leading provider of SaaS cybersecurity solutions, is a disruptive technology company transforming secure mobility for government and commercial customers.

$125,000–$169,000/yr
Unlimited PTO

  • Design, scale, and operate resilient, cloud-native infrastructure in AWS with an emphasis on EKS, IAM, RBAC, and modern security-first practices.
  • Build and optimize CI/CD pipelines with GitHub Actions and GitHub Advanced Security, enabling velocity without compromising safety.
  • Own observability across the stack using Datadog (metrics, logging, alerting, and tracing).

DexCare optimizes time in healthcare, streamlining patient access, reducing waits, and enhancing overall experiences. They are committed to creating an inclusive workplace where diversity drives innovation and belonging strengthens collaboration, enabling everyone to thrive.

Germany

Shape the way Scalable runs microservices in a performant, secure, and cost-efficient way. Collaborate with cross-functional teams to understand scalability requirements. Develop and maintain internal tooling around Monitoring, Developer Portal, and Load Testing.

Scalable Capital is a leading digital investment and banking platform with a full banking licence, empowering people across Europe to shape their own finances.

India Unlimited PTO

Seeking an experienced Site Reliability Engineer to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. Implement highly-available and scalable architectures for core and third-party components of Acquia Source. Implement metrics, monitoring, and incident response processes.

Acquia is an open source digital experience company providing technology to brands that allows them to embrace innovation and create customer moments that matter.

US Unlimited PTO

  • Implement and maintain observability tools and dashboards using tools such as AWS CloudWatch, Datadog, Sentry, and OpenTelemetry.
  • Assist with cloud cost visibility and optimization: analyze infrastructure usage patterns to identify waste, and implement aggressive tagging strategies.
  • Manage the tooling and processes for deploying applications to AWS EKS / Kubernetes / ECS / Serverless and facilitate modern deployment strategies.

True is a global platform of companies that optimizes value creation by placing executive talent, developing business leaders, creating diverse and inclusive networks, and using innovative technology to advance executive talent priorities. True was founded on the belief that doing good is the pathway to doing well, and their growth and success are a by-product of their values: treating people right, listening to new ideas, and keeping culture at the heart of their business.

$145,000–$185,000/yr
US Unlimited PTO

  • Be a keen learner, working with cloud-native, highly scalable infrastructure and gaining expertise in container orchestration, networking, and observability.
  • Be a passionate problem solver, tackling scalability, reliability, and troubleshooting challenges in distributed systems.
  • Be a great communicator, engaging directly with developers, engineering teams, and product teams to understand infrastructure challenges and provide solutions.

Temporal provides an open-source programming model that simplifies code, improves application reliability, and helps developers focus on delivering features faster. They aim to be the reliable foundation of every developer’s toolbox and value curiosity, drive, collaboration, genuineness, and humility.

  • Ensure reliability, stability, and operational excellence for mission-critical contact center environments.
  • Provide incident response, troubleshoot production issues, and perform root cause analysis.
  • Manage Amazon Connect configurations, contact flows, bots (Lex), and integrations.

Miratech is a global IT services and consulting company that brings together enterprise and start-up innovation, supporting digital transformation for large enterprises.

  • Design and implement foundational patterns and libraries for Python applications.
  • Develop and maintain robust CI/CD pipelines using tools such as Jenkins and ArgoCD.
  • Instrument observability through tools such as CloudWatch and Datadog to monitor and optimize application performance across multiple environments.

As a leader in aging care innovation, Honor provides the technology, tools, and services that empower older adults to live life on their own terms.

Europe 6w PTO

  • Lead and support the platform team through coaching and clear expectations.
  • Own the platform strategy and roadmap, prioritizing initiatives and managing team capacity.
  • Provide technical direction for the AWS- and Kubernetes-based platform.

bunch is building the backbone of private markets, combining exceptional expertise, operational excellence, and frictionless technology.

Europe

Lead the Reliability & Operations function within the Developer & Production Enablement (DPE) division of RWS’s Product & Technology organization. Take ownership of global production operations and lead the transition from manual, ticket-based workflows to platform-integrated automation. Ensure stability today, while designing for scalability and autonomy in the future.

RWS's purpose is to unlock global understanding, valuing every language and culture, and celebrating diversity and inclusion to make the company strong.

Canada 5w PTO

Design, implement, and evolve large-scale, cloud-native infrastructure supporting MariaDB's global SaaS platform. Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps practices. Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.

MariaDB is making a big impact on the world and is the backbone of applications used every day, including by 75% of Fortune 500 companies.

Europe 4w PTO

Design, build, and own AWS-based MLOps infrastructure, defining standards for security, automation, cost-efficiency, and governance. Architect and operate production Kubernetes clusters, including containerizing and deploying ML models using Docker and Helm. Build and maintain CI/CD pipelines for training, validation, and deployment of ML workloads, implementing canary, blue-green, and rollback strategies.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

$110,000–$250,000/yr
US 4w PTO

  • Design and implement cloud-native infrastructure that powers core product capabilities at scale.
  • Build proprietary solutions (sync engines, observability pipelines, DNS management systems) that differentiate Files.com.
  • Engineer infrastructure for speed, resilience, and maintainability across high-volume, distributed workloads.

Files.com powers secure file transfer and automation for over 4,000 brands. They are a profitable, founder-led SaaS company with a flat, high-trust engineering organization, where engineers are empowered to take ownership of projects.

ANZ

  • Building world-class AI infrastructure to support a 100+ person research team.
  • Designing and scaling multi-cloud systems that support high-performance model training and inference.
  • Improving monitoring, alerting and system observability for AI workloads.

Canva is redefining how the world experiences design. They have campuses in Sydney and Melbourne, co-working spaces in Brisbane, Perth, Adelaide and Auckland, and trust their employees to choose the balance that empowers them and their team to achieve their goals.

US

  • Ramp on AWS architecture, Terraform patterns, Kubernetes setup, CI/CD pipelines, and observability stack.
  • Take ownership of an infrastructure area: CI/CD pipelines, observability stack, Kubernetes platform, or AWS security/networking.
  • Shape infrastructure direction with design docs, RFC proposals, and mentoring engineering teams.

Bastion enables financial institutions and enterprises to issue regulated stablecoins, generate revenue on reserves, and expand their ecosystems.