Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Serve as the primary technical point of contact for customers running large-scale training workloads.

Reliability & Performance Engineering:

Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure.
Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Ensure the health and performance of high-speed interconnects that underpin distributed training.

Automation & Tooling:

Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks.
Drive blameless postmortems and systemic fixes.

Andromeda Cluster

Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. They work with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most and are expanding to find the brightest in AI infrastructure, research and engineering.

Apply for This Position