- Design resource management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure-as-code.
- Architect fault-tolerant infrastructure for distributed ML, GPU clusters, the NVIDIA runtime, S3 checkpointing, large dataset management and streaming, and health monitoring.
- Build systems that simulate and handle real-world network conditions: bandwidth shaping, latency injection, and packet loss.