As a Senior Site Reliability Engineer, you will partner with the Engineering Department to drive the reliability, scalability, and performance of our production systems. You will define and implement best practices across infrastructure security, observability, release engineering, and developer tooling to meet department-level operational requirements, own our Incident Management process, and automate operational tasks.
Remote Devops Jobs · DataDog
6 results
Job listings
As a founding member of the Site Reliability Engineering (SRE) team, helps define the culture and build the systems that keep regulated, cloud-based production environments reliable. Designs, implements, and operates observability, reliability, and incident management systems. Partners with engineering teams to define SLIs, SLOs, and error budgets, build runbooks and operational playbooks, and develop the monitoring and automation needed to ensure systems are reliable and compliant.
Combine technical excellence and exceptional collaboration skills to deliver impact. You will drive vital initiatives including deployment velocity, observability, system reliability, and developer experience. You'll build the infrastructure and tooling that lets engineers ship faster and sleep better—think GitOps workflows, Kubernetes configuration and optimization, comprehensive Datadog observability coverage, and the kind of automation that makes deployments uneventful.
As a Staff Site Reliability Engineer at Topstep, you'll play a foundational role in shaping how we approach reliability, observability, and infrastructure at scale. You'll be instrumental in building out our SRE practice, defining our incident response culture, closing observability gaps, and optimizing our AWS infrastructure for both performance and cost.
Take ownership of reliability within the Platform Team. In this role, you will design and build the internal tools that keep our systems resilient, measurable, and dependable. By creating frameworks for automated checks, observability, and reliability testing, you will give our teams the confidence to move quickly without sacrificing stability.
Huntress is growing our Platform Engineering team and is looking for an experienced engineer who is passionate about stability, resilience, and scalability. You’ll be joining a high-performing team responsible for proactively building, monitoring, and implementing the infrastructure that is a part of the Huntress Security Platform, providing a first-class development platform to our developers, and tracking and supporting the complete lifecycle of our millions of installed endpoint agents.