Job Description
Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse Cloud.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure every infrastructure component in ClickHouse Cloud has monitoring and alerting in place so that incidents are detected and resolved promptly.
Enhance and refine incident response processes and post-mortem analysis for outages in ClickHouse Cloud.
Continuously improve the reliability and performance of our ClickHouse services.
Plan, enable, and drive chaos engineering initiatives across Engineering teams.
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.
About ClickHouse
Established in 2009, ClickHouse leads the industry with its open-source column-oriented database system, driven by the vision of becoming the fastest OLAP database globally.