Engineer

FAL

Remote regions

US

Salary range

$180,000–$250,000/yr

Key Responsibilities:

  • Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers.
  • Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting.
  • Create and maintain metrics, dashboards, and alerting for hardware health across the fleet.

Requirements:

  • 3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes).
  • Strong software engineering skills in Python; you write production tooling, not scripts.
  • Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling.

What We Offer:

  • Interesting and challenging work
  • Many learning and growth opportunities

FAL

FAL is committed to keeping a large fleet of GPU servers healthy and productive. The company offers a collaborative and supportive culture with learning and growth opportunities.
