Reliability & Performance:

Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs.
Build monitoring, alerting, and observability to catch ML-specific failures and regressions.
Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.

Security & Safety:

Drive the security posture of the model fleet, including secure model serving and abuse detection.
Operationalize content moderation pipelines, safety classifiers, and guardrails for inference.
Lead incident response for model API outages and run blameless postmortems.

Collaboration & Culture:

Partner with model and infrastructure teams to embed reliability requirements into onboarding.
Work alongside a team dedicated to rapidly iterating on AI breakthroughs.
Contribute to a culture of automation, blameless postmortems, and continuous improvement.

Fal

Fal is the generative media ecosystem powering the next generation of AI products, providing infrastructure, tools, and model access for developers and enterprises. As a unified platform for high-performance inference, orchestration, and observability, fal is becoming the ecosystem ambitious teams build on in a market projected to grow by hundreds of billions over the next decade.

ML / Site Reliability Engineer - Model Fleet
