Key Responsibilities:
- Build and maintain Python fleet tracking system that manages the full lifecycle of servers.
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery and alerting.
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet.
Requirements:
- 3+ years experience managing bare-metal and cloud based server fleets at scale (100+ nodes).
- Strong software engineering skills in Python; you write production tooling, not scripts.
- Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling.
What We Offer:
- Interesting and challenging work
- A lot of learning and growth opportunities
FAL
FAL is committed to keeping a large fleet of GPU servers healthy and productive. They offer a collaborative and supportive culture with learning and growth opportunities.