The Work:
- Are you an experienced Site Reliability Engineer who thrives at the intersection of software engineering and production operations?
- Do you take pride in keeping mission-critical customer systems reliable under real-world operational pressure?
- Are you looking for an opportunity to own production reliability for a modern hybrid infrastructure platform spanning cloud, colocation, and edge environments?
Primary Responsibilities:
- Own production reliability for Climavision’s customer-facing platform and radar-derived weather data services across Azure, colocation, and edge Kubernetes environments.
- Drive multi-replica and multi-cluster high availability across Climavision’s .NET services by refactoring C# code for safe horizontal scaling.
- Support and coordinate production incident response, including troubleshooting, mitigation, communication, and postmortem analysis.
On-Call Expectations:
- Participate in a primary on-call rotation, taking one full week of duty at a time with 24/7 availability including nights, weekends, and holidays.
- Acknowledge pages within the response-time SLO, drive incidents to resolution, and maintain reliable connectivity while on call.
- Plan personal time around the published rotation and arrange documented coverage swaps when conflicts are unavoidable.