Platform Reliability & SLAs:

  • Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
  • Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
  • Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes

On-Call & Incident Response:

  • Participate in an on-call rotation and act as incident commander for high-severity production events
  • Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
  • Drive improvements to alerting fidelity: reduce noise, increase signal, and eliminate toil

What We're Looking For:

  • 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
  • Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling well enough to debug production issues in any of them
  • Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route 53), storage (S3, RDS), and IAM

Akuity

Akuity helps enterprises ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane.
