What You'll Do:

  • Work directly with ML engineers
  • Help customers diagnose and resolve complex distributed-systems and ML-infrastructure issues
  • Translate infrastructure-level issues into actionable guidance for ML engineers

Debug ML Infrastructure & Distributed Workloads:

  • Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
  • Analyze logs, metrics, traces, and system behavior to isolate root causes
  • Diagnose performance bottlenecks involving compute, memory, networking, or storage

Improve Reliability & Platform Operations:

  • Contribute to post-incident reviews and operational improvements
  • Build internal tooling, automation, documentation, and runbooks
  • Help improve observability, operational visibility, and troubleshooting workflows

Lightning AI

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. It serves solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.
