Job Description
As a Senior Site Reliability Engineer at Runwise, you will maintain the stability and performance of our services, ensuring they are reliable, scalable, and fault-tolerant. You’ll work closely with hardware and software engineers to build and maintain tools that improve the reliability and efficiency of our systems.
Responsibilities include, but are not limited to:
* Design and maintain scalable infrastructure in the AWS cloud and across distributed on-prem systems
* Automate infrastructure provisioning, deployment pipelines, and operational workflows using tools like Terraform, Ansible, or Helm
* Build and improve monitoring, alerting, and observability systems (e.g., Cloud Health, Grafana)
* Collaborate with development teams to improve service reliability, performance, and scalability
* Participate in the on-call rotation and manage incident response, including root cause analysis and postmortems
* Define and track service-level objectives (SLOs) and service-level indicators (SLIs)
* Conduct capacity planning, chaos testing, and disaster recovery exercises
* Advocate for engineering best practices across CI/CD, security, and fault tolerance
About Runwise
Runwise is a customer-focused climate-tech startup that controls and runs the key energy systems in buildings throughout the US, reducing energy usage and carbon output.