The Opportunity:
- The Managed Services team owns shared, production-critical infrastructure that powers Grafana Cloud’s next-generation database products.
- They operate 100+ WarpStream clusters across multiple cloud providers and regions.
- The team works closely with high-volume analytical and storage systems that power query-heavy and aggregation-heavy workloads.
What You’ll Be Doing:
- Operate at both the system and team level, helping shape how we run and evolve shared database infrastructure.
- Operate and evolve multi-cloud streaming clusters and related database infrastructure.
- Serve as a primary escalation point and on-call responder for relevant incidents.
What Makes You a Great Fit:
- Regular 1:1s with your manager and close collaboration with teammates across regions, both of which help shape how the team operates and matures.
- Defining and evolving SLO strategy for shared database infrastructure, identifying systemic reliability gaps and driving long-term error budget improvements.
- Leading complex initiatives across high-throughput, multi-cloud infrastructure.
Requirements:
- 8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
- Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure.
- Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling.