Job Description
Continuously improve the reliability and performance of ClickHouse core.
Create and improve metrics and alerts for ClickHouse so that problems in production can be identified and prevented before they affect customers.
Dig into the most common problems customers encounter in ClickHouse core, identify their root causes, submit bug fixes and issue reports, and suggest improvements.
Enhance and refine incident response processes and post-mortem analysis for outages related to ClickHouse core, including working with the support and Cloud teams to communicate with impacted customers.
Plan, enable, and drive Chaos initiatives across Engineering teams, based on internal priorities.
Manage on-call processes for responding to performance and reliability issues, and establish best practices for coordinating escalations so that issues are resolved with minimal customer impact.
About ClickHouse
ClickHouse is the fastest open-source column-oriented database system, empowering users to generate real-time analytical reports through SQL queries.