Reliability & Performance:
- Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs.
- Build monitoring, alerting, and observability to catch ML-specific failures and regressions.
- Improve capacity planning, autoscaling, and GPU fleet efficiency for inference workloads.
Security & Safety:
- Drive the security posture of the model fleet, including secure model serving and abuse detection.
- Operationalize content moderation pipelines, safety classifiers, and guardrails for inference.
- Lead incident response for model API outages and run blameless postmortems.
Collaboration & Culture:
- Partner with model and infrastructure teams to embed reliability requirements into onboarding.
- Work alongside a team dedicated to rapidly iterating on AI breakthroughs.
- Contribute to a culture of automation, blameless postmortems, and continuous improvement.
Fal
Fal is the generative media ecosystem powering the next generation of AI products, providing infrastructure, tools, and model access for developers and enterprises. As a unified platform for high-performance inference, orchestration, and observability, fal is becoming the ecosystem ambitious teams build on in a market projected to grow by hundreds of billions of dollars over the next decade.