Responsibilities:
- Architect and deploy GPU clusters on-premises and in cloud environments using Ansible, Terraform, Kubernetes, and Slurm.
- Engage with customers through technical deep-dives and workshops, presenting solutions to stakeholders.
- Build and maintain Infrastructure as Code modules to automate provisioning, scaling, and monitoring of GPU resources.
Requirements:
- Proven experience deploying GPU clusters at scale, including multi-node, multi-GPU setups.
- Hands-on expertise with Ansible, Terraform, Kubernetes, and programming in Python or Go.
- Solid understanding of ML ecosystems, models, tooling, and production deployment patterns.
Additional Skills:
- Experience with high-availability inference infrastructure for production AI workloads.
- Familiarity with MLflow, PyTorch, TensorFlow, and CI/CD pipelines for ML.
- Knowledge of GitOps workflows, Docker, Helm charts, and Hugging Face transformers.
Gcore
Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering digital experiences worldwide. The company collaborates with leading technology partners and employs over 550 professionals building foundational technologies.