Job Description
Ensure the reliability, scalability, and performance of Groq’s observability tools and services for provisioning and managing the full lifecycle of Groq hardware, software, and networking systems. The observability team builds the monitoring and observability infrastructure and tooling that supports Groq’s inference hardware at massive scale, both in the cloud and in our own datacenters.
Responsibilities include building and maintaining comprehensive observability systems, ensuring high-quality production systems with excellent uptime, iterating on and automating those systems, and instrumenting Kubernetes clusters, applications, and datacenter infrastructure components. Strong analytical and problem-solving skills are essential, along with excellent communication and teamwork abilities.
Ideal candidates will have 4+ years of experience in observability, a deep understanding of cloud-native technologies and IaaS, and expertise in standing up monitoring, observability, and alerting systems. Experience instrumenting large Kubernetes clusters and building Kubernetes operators is also important.
About Groq
Groq delivers fast, efficient AI inference with its LPU-based system, powering GroqCloud™, and aims to make high-performance AI compute more accessible and affordable.