The Managed Services team owns shared, production-critical infrastructure that powers Grafana Cloud’s next-generation database products.
They operate 100+ WarpStream clusters across multiple cloud providers and regions.
The team works closely with high-volume analytical and storage systems that power query-heavy and aggregation-heavy workloads.

What You’ll Be Doing:

Operate at both the system and team level, helping shape how we run and evolve shared database infrastructure.
Operate and evolve multi-cloud streaming clusters and related database infrastructure.
Serve as a primary escalation point and take on-call duty for relevant incidents.

What Makes You a Great Fit:

Regular 1:1s with your manager and close collaboration with teammates across regions, helping shape how the team operates and matures.
Defining and evolving SLO strategy for shared database infrastructure, identifying systemic reliability gaps and driving long-term error budget improvements.
Leading complex initiatives across high-throughput, multi-cloud infrastructure.

Requirements:

8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure.
Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling.

Grafana Labs

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack.
