Design and implement solutions to problems of scale for multi-site deployment and management of CoreWeave's global server hardware fleet. Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems. Develop provisioning services, automation workflows, and fleet management tools that span from bare metal to container orchestration.
Job listings
We are seeking an experienced Senior DevOps-Networking Engineer who is comfortable working in multiple cloud environments and with cloud networking components. Candidates should be comfortable across the full Software Development Lifecycle (SDLC), with networking experience and a DevOps mindset. The DevOps-Networking Engineer will work in a fast-paced, results-driven environment and be responsible for highly scalable, secure enterprise applications.
We are looking for a skilled and motivated Lead Infrastructure Engineer to lead our Platform Engineering team. As the team leader, you will direct the planning, design, development, and implementation of our platform architecture, ensuring it meets the needs of our growing product portfolio. You will guide a talented team of engineers, driving best practices and fostering a culture of excellence and innovation.
Play a key role in shaping the future of our global infrastructure. Overseeing a global infrastructure of ~10,000 on-prem servers, you'll tackle unique technical challenges, engineer scalable systems, and have a direct impact on the reliability and performance of our products. Build reliable infrastructure, automate everything, ensure observability, solve complex issues, and collaborate and innovate.
Join a growing team as a Senior AWS Developer. The ideal candidate will have extensive experience designing, developing, and managing cloud-based solutions on Amazon Web Services (AWS). This role demands hands-on experience with AWS AI/ML services, strong programming skills, and a solid understanding of cloud infrastructure and security best practices.
Play a key role in building our Developer Experience team and owning critical infrastructure and services that support engineering across the organization. Define best practices, shape internal guidelines, and lead efforts to improve developer workflows, tooling, and system reliability. This role involves mentoring engineers, conducting code reviews, and delivering high-impact projects that power our core systems and services, ultimately enabling faster, safer, and more scalable product development.
Implement and maintain scalable solutions following DevOps best practices at a leading gaming-industry company. Automate processes using Terraform or AWS CloudFormation, enabling creation and management of AWS infrastructure and applications in an internal on-prem cluster. Build and maintain CI/CD pipelines, and automate tests, deployments, and rollback strategies.
As a Platform Engineer, MLOps, you will be critical to deploying and managing the cutting-edge infrastructure that AI/ML operations depend on, and you will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. Your expertise will also extend to setting up and maintaining monitoring, logging, and alerting systems to oversee extensive training runs and client-facing APIs.
Be part of a dynamic team that is shaping the future of energy and technology. Build and maintain backend systems and data pipelines for AI-based software platforms, integrating SQL/NoSQL databases and collaborating with engineering teams to enhance performance. Design, deploy, and optimize cloud infrastructure on Google Cloud Platform, including Kubernetes clusters, virtual machines, and cost-effective scalable architecture.
As a Platform Engineer, MLOps, you will be critical to deploying and managing the cutting-edge infrastructure that AI/ML operations depend on, collaborating with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. Your expertise will extend to setting up and maintaining monitoring, logging, and alerting systems to oversee extensive training runs and client-facing APIs. You will ensure that training environments are highly available and efficiently managed across multiple clusters, enhancing our containerization and orchestration systems with tools like Docker and Kubernetes.