Key Responsibilities:
- Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
- Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes).
Requirements:
- 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
- Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
- Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
Fal
Fal provides a GPU cloud platform. The company offers visa sponsorship and relocation assistance to San Francisco, and holds regular team events and offsites.