Source Job

India Australia New Zealand

  • Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
  • Build monitoring, alerting, and observability to catch ML-specific failures, output quality degradation, and model regressions before customers do.
  • Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates to ship new model versions safely.

Python Kubernetes PyTorch Observability

20 jobs similar to ML / Site Reliability Engineer - Model Fleet

Jobs ranked by similarity.

US Unlimited PTO

  • Design and maintain scalable ML infrastructure including data pipelines, training workflows, and model deployment systems.
  • Own end-to-end ML lifecycle operations, ensuring reliable delivery of models into production at scale.
  • Implement monitoring, telemetry, and feedback loops for ML models running across large-scale device fleets.

Our partner company develops ML systems for connected hardware products used by customers worldwide. They operate in a fast-paced, product-driven environment with a collaborative and technically ambitious culture focused on real-world ML impact.

US Europe

  • Serve as a core safety partner embedded across product and research teams, providing Trust & Safety engineering support for all launches from early design through post-launch monitoring.
  • Build and maintain safety infrastructure ensuring Runway's models have a positive impact as they reach millions of users.
  • Design, execute, and continuously improve red teaming systems to proactively surface harmful outputs before production.

Runway builds AI to simulate the world through merging art and science, focusing on world models for general-purpose simulation. The team consists of creative, open-minded, caring, and ambitious people determined to change the world.

EMEA

  • Build and operate production-grade model serving infrastructure using vLLM, TGI, or Triton frameworks.
  • Design and implement auto-scaling, multi-model architectures, and intelligent request routing for ML inference.
  • Optimize GPU utilization, memory efficiency, and observability to ensure low-latency, cost-effective systems.

They are a distributed cloud infrastructure startup building AI-native cloud services with GPU-powered compute. The company is well-funded, fast-scaling, and operates in a remote-first environment with a focus on sustainability and decentralization.

US Unlimited PTO 16w maternity 4w paternity

  • Build and operate the ML lifecycle platform, including tooling for experiment tracking, model registry, and versioned pipelines.
  • Own CI/CD and deployment for ML workloads, building automated pipelines from notebook to production.
  • Make models observable and reliable in production with monitoring for latency, drift, data quality, and cost signals.

dv01 provides a data analytics platform for the structured finance market, offering transparency into investment performance and risk for lenders and Wall Street investors. With over 400 clients and coverage of over 100 million loans, dv01 is a data-first company with a diverse and innovative culture.

US Canada

  • Assess current pipelines and data architecture to produce a prioritized plan for change.
  • Design durable data and ML systems grounded in customer needs with documented tradeoffs.
  • Harden pipelines, upgrade data architecture, and raise standards for observability and reliability.

FutureFit AI's core mission is to help more people get to better jobs faster and cheaper, with a focus on those facing barriers to opportunity. Their team of 30-50 across the US and Canada fosters a high-trust, high-intensity culture with a will to win.

United States Canada

  • Build and operate the real-time inference service for the risk decision engine with low latency and high availability.
  • Own model deployment infrastructure including CI/CD, shadow mode, and staged rollouts.
  • Build model observability and partner with Risk Data Science for production operation.

Mercury is a fintech company that provides banking services for startups via partner banks. The company is committed to creating a safe environment and values diversity, with a growing team focused on innovation.

Global 16w maternity 16w paternity

  • Design, train, evaluate, and ship ML systems for governance and security, starting with prompt injection detection and behavioral anomaly detection.
  • Build supporting infrastructure including data pipelines, feature stores, model serving, and evaluation harnesses.
  • Set technical direction for ML work, own architecture, evaluation methodology, and model lifecycle.

Docker provides developer tools for building, sharing, and running applications across Docker Desktop, Docker Hub, and Docker Scout. With over 20 million monthly users and a globally distributed remote-first team, Docker is trusted by solo founders to the world's largest companies.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as first-class collaborators in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

United States

  • Own and evolve observability strategy including monitoring, alerting, dashboards, logging, and distributed tracing.
  • Define and manage SLIs, SLOs, and reliability metrics, improving MTTD and MTTR through automation.
  • Build and maintain reliable cloud infrastructure on AWS and Kubernetes while mentoring engineers on SRE best practices.

Filevine is a Legal AI company delivering Legal Operating Intelligence for legal work. Fueled by a team of exceptional collaborators and innovators, Filevine’s rapid growth has earned AI awards and recognition from Deloitte and Inc. as one of the most innovative and fastest-growing technology companies in the country.

US

  • Lead operational excellence, reliability, and support of enterprise AI and data platforms, ensuring stability, scalability, and observability.
  • Design and implement automation, monitoring, and operational tooling for AI/ML platforms including Palantir Foundry, AWS Bedrock, and SageMaker.
  • Serve as a senior escalation point for complex production issues, driving root cause analysis and improving platform stability.

CSAA Insurance Group, a AAA insurer, offers personal lines of property and casualty insurance to AAA members across 23 states and DC. Founded in 1914, they are one of the top personal lines insurers in the US with over 3,800 employees, known for a values-based culture and recognition in leadership development and community involvement.

US

  • Take ownership of incident management and operational excellence across cloud infrastructure.
  • Automate high-risk manual processes and drive reliability gains through engineering.
  • Own a platform domain such as Temporal, observability, or Kubernetes operations.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

  • Optimize production LLM serving with vLLM and SGLang to maximize throughput and minimize latency through batching and quantization.
  • Profile training runs to find bottlenecks and resolve them with attention implementations like FlashAttention on H200 and GB200 hardware.
  • Deploy and operate multiple models on shared GPU clusters with autoscaling, bin-packing, and efficient handling of mixed workloads.

Egen is a fast-growing technology company with a data-first mindset, partnering with clients on Google Cloud and Salesforce to drive action through data and insights. Its team of dedicated engineers thrives on solving tough problems and continually innovates to achieve fast, effective results.

Canada

  • Design and operate core AI platform components for training, deploying, and serving ML models at scale.
  • Own model serving and inference workflows end-to-end, optimizing for reliability, latency, throughput, and cost.
  • Collaborate with product, infrastructure, and security teams to build scalable platform capabilities for AI-powered features.

Mozilla Corporation is the non-profit-backed technology company behind Firefox and Pocket, with over 225 million monthly users. A wholly owned subsidiary of the Mozilla Foundation, the company is mission-driven and focused on privacy and open standards.

US Unlimited PTO

  • Own and scale AI compute and deployment platforms including Kubernetes and GitOps pipelines.
  • Build inference infrastructure and observability stacks for LLM-powered workflows.
  • Drive security, compliance, and governance at the systems level in a regulated healthcare environment.

Hims & Hers is a leading health and wellness platform focused on making healthcare accessible and personal. As a publicly traded company on the NYSE (HIMS), it offers flexible/remote work and a culture centered on innovation and employee well-being.

US Unlimited PTO

  • Act as pre-sales technical lead for federal pursuits, leading discovery workshops and architecting AI security solutions in SaaS and airgapped environments.
  • Build mission-focused demonstrations and proof-of-concept AI applications, integrating SDKs and APIs to protect computer vision, LLM, and agentic workloads.
  • Advise customers on securing AI infrastructure aligned to MITRE ATLAS, OWASP Top 10 for LLMs, and NIST AI Risk Management Framework.

HiddenLayer protects the world’s most valuable technologies from adversarial AI attacks. Founded by AI professionals and security specialists, the company has been recognized with awards such as RSA Innovation Sandbox Winner and CB Insights AI 100, and has a venture-backed team focused on accelerating secure AI adoption.

Canada

  • Build and ship AI agents, APIs, and applications on Affirm's internal platform, owning the full lifecycle from architecture to production.
  • Turn messy business requirements from People Operations stakeholders into production systems, integrating with tools like Workday and Notion.
  • Design reliability infrastructure for multi-model LLM services, including structured output validation and quality controls.

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. The People Tech & Analytics team builds and owns the data, AI, and technology infrastructure for Affirm's People function, running like a product engineering group embedded in HR.

  • Own reliability, latency, and performance for AI platform services and data infrastructure on AWS.
  • Design and maintain CI/CD pipelines, infrastructure-as-code, and observability frameworks across the stack.
  • Partner with AI and data engineers to ensure secure, cost-optimized, and scalable deployment of platform components.

HHAeXchange is the leading technology platform for home and community-based care, providing an end-to-end homecare solution for people who are aging or have disabilities. Founded in 2008, the company is passionate about transforming healthcare by connecting patients, providers, managed care organizations, and states.

India

  • Build and ship specialized agents including parsers, extractors, and synthesizers for the Aedeon agent-native modernization platform.
  • Own the full delivery of assigned agents from prototype through deployment and post-release validation, practicing test-driven development.
  • Write clear Python, document agent contracts and decision logic, and promote a culture of release discipline and quality across the team.

Mactores is a trusted leader in providing modern data platform solutions, enabling businesses to accelerate value through automation with end-to-end data solutions that are automated, agile, and secure. Since 2008, they have collaborated with customers to strategize and navigate digital transformation via assessments, migration, or modernization, fostering a culture driven by 10 core leadership principles.

India

  • Collaborate with data scientists and engineers to build scalable ML pipelines, troubleshoot infrastructure issues from Linux to Kubernetes, and optimize model performance.
  • Drive high engineering standards, design on-premises MLOps solutions, and maintain tools for deployment and monitoring.
  • Refine CI/CD workflows, incorporate ML model training and evaluation into testing, and ensure seamless handover between research and production.

Learneo is a platform of builder-driven businesses, including Course Hero, CliffsNotes, LitCharts, Quillbot, Symbolab, and Scribbr, focused on supercharging productivity and learning. The company supports high-growth businesses with centralized corporate operations and has a virtual-first culture with employees across multiple countries.