Job Description
Design and implement solutions to problems of scale for multi-site deployment and management of CoreWeave’s global server hardware fleet. Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems. Develop provisioning services, automation workflows, and fleet management tools that span from bare metal to container orchestration. Write and maintain Kubernetes custom controllers and operators to automate infrastructure behavior. Design and implement observability solutions for large-scale server monitoring to improve system stability and insight. Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health. Create test plans, deployment automation, dashboards, alerts, and insights into our fleet operations. Resolve integration challenges across the entire infrastructure stack, from data center hardware to orchestration platforms. Participate in an on-call rotation.
About CoreWeave
CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting edge services powering the next wave of AI.