What You’ll Do:
- Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
- Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
- Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines:
- Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish
- Build systems that correlate signals across infrastructure layers for faster debugging and root cause analysis
- Implement streaming and real-time data pipelines using tools such as Kafka, OpenTelemetry (OTel), Promtail, or similar
Alerting, Reliability & Insights:
- Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
- Create dashboards and alerts for InfraOps, Engineering, and Customer Success teams
- Build automated insights that enable proactive detection, forecasting, and system health visibility at scale
Lightning AI
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. It combines developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.