Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes).

Requirements:

8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.

Fal

Fal provides a GPU cloud platform. The company offers visa sponsorship and relocation assistance to San Francisco, and holds regular team events and offsites.
