Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes

On-Call & Incident Response:

Participate in an on-call rotation and act as incident commander for high-severity production events
Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
Drive improvements to alerting fidelity: reduce noise, increase signal, and eliminate toil

What We're Looking For:

5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling well enough to debug failures in any of them
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route 53), storage (S3, RDS), and IAM

Akuity

Akuity helps enterprises ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds, if not thousands, of Kubernetes clusters from a single control plane.
