Job Description
As a Site Reliability Engineer, you will contribute to the evolution of the strategic management of our GCP infrastructure and of DevOps practices such as incident management, SLOs, and error budgets. You will champion observability as a way to reduce mean time to recovery, use DORA metrics to help the Product & Engineering team get better at creating amazing products, and help other teams optimize their use of GCP and manage costs.
You will design and evolve Overstory's cloud infrastructure to support the company's scaling needs, laying the foundation for performance, security, and maintenance. Build tooling and automation that promote team autonomy while ensuring operational excellence.
Advance our observability platform to support long-term insights, meaningful alerting, and improved ease of use for the engineering teams. Build visibility into infrastructure costs to raise awareness across engineering and empower teams to make cost-aware decisions. Champion reliability best practices by shaping incident processes, defining SLOs, and fostering a culture of ownership and continuous improvement.
About Overstory
Overstory harnesses cutting-edge technology to enable a resilient electrical grid that keeps communities thriving as our world changes.