Site Reliability Engineer

Gauss Labs 🧪💡🔬

Job Description

You will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at customer sites. This role requires a high level of technical expertise, a collaborative mindset, and a strong desire to continuously improve systems and processes.

Responsibilities:

Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.

Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.

Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.

Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring customers' infrastructure can handle increasing workloads.

Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.

Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.

Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.

Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.

About Gauss Labs

Gauss Labs is seeking a highly skilled Site Reliability Engineer to join their team in Vancouver.
