GPU Cluster Architecture:
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
- Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
- Serve as the primary technical point of contact for customers running large-scale training workloads.
Reliability & Performance Engineering:
- Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure.
- Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
- Ensure the health and performance of high-speed interconnects that underpin distributed training.
Automation & Tooling:
- Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
- Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks.
- Drive blameless postmortems and systemic fixes.
Andromeda Cluster
Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. The company works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most, and it is expanding its team with the brightest people in AI infrastructure, research, and engineering.