Storage Systems & Infrastructure:
- Operate and scale distributed storage systems such as VAST and S3-compatible object stores.
- Improve performance and reliability for large-scale AI/ML training and inference workloads.
- Troubleshoot complex storage and data path issues across hardware and software layers.
Automation & Tooling:
- Build and maintain Python-based automation for provisioning and monitoring storage.
- Develop tools to reduce manual operational overhead and improve lifecycle management.
- Enhance workflows for deployment, maintenance, and scaling of storage clusters.
Systems & Operations:
- Manage Linux-based systems in production bare-metal environments.
- Partner with data center teams on hardware bring-up, upgrades, and issue resolution.
- Support capacity planning and use monitoring data to guide performance tuning.
Cross-Functional Collaboration:
- Work with Infrastructure and Platform teams to integrate storage into the broader platform.
- Contribute to design discussions for new infrastructure deployments and scaling strategies.
- Help define best practices for storage in high-performance computing environments.
Lightning AI:
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production. The company operates globally, is guided by the values of speed, focus, balance, craftsmanship, and minimalism, and is backed by top-tier investors.