What You'll Do:

  • Work directly with ML engineers
  • Help customers diagnose and resolve complex distributed-systems and ML-infrastructure issues
  • Translate infrastructure-level issues into actionable guidance for ML engineers

Debug ML Infrastructure & Distributed Workloads:

  • Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
  • Analyze logs, metrics, traces, and system behavior to isolate root causes
  • Diagnose performance bottlenecks involving compute, memory, networking, or storage

Improve Reliability & Platform Operations:

  • Contribute to post-incident reviews and operational improvements
  • Build internal tooling, automation, documentation, and runbooks
  • Help improve observability, operational visibility, and troubleshooting workflows

Lightning AI

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. It serves solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.
