What You'll Do:
Work Directly With ML Engineers:
- Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
- Translate infrastructure-level issues into actionable guidance for ML engineers
Debug ML Infrastructure & Distributed Workloads:
- Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
- Analyze logs, metrics, traces, and system behavior to isolate root causes
- Diagnose performance bottlenecks involving compute, memory, networking, or storage
Improve Reliability & Platform Operations:
- Contribute to post-incident reviews and operational improvements
- Build internal tooling, automation, documentation, and runbooks
- Help improve observability, operational visibility, and troubleshooting workflows
Lightning AI
Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. The company serves solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.