We are looking for a highly motivated Site Reliability Engineer (SRE) to join our DevOps team and improve the reliability, performance, and observability of our applications. In this role, you will work closely with software developers and DevOps engineers to ensure that our cloud infrastructure and vendor services meet established reliability goals. You will also automate operational tasks and improve our incident management processes.
You will monitor and improve system reliability by defining and tracking Service Level Objectives (SLOs). You will respond to incidents, troubleshoot issues, and document lessons learned in a collaborative, blameless environment, including taking part in post-mortem analyses of incidents and outages. You will also automate repetitive tasks and improve existing CI/CD workflows to raise operational efficiency, and use metrics from tools such as Sentry and CloudWatch to optimize application performance and resilience. Throughout, you will collaborate closely with developers and vendors to identify and mitigate reliability risks before they cause incidents.