Source Job

5w PTO

  • Own the design, implementation, and evolution of core MLOps systems across Hyperstack.
  • Build and improve systems that orchestrate model training, fine-tuning, evaluation, and deployment.
  • Define and embed strong MLOps practices across teams.

Python Docker Kubernetes CI/CD MLOps

20 jobs similar to Lead MLOps Engineer

Jobs ranked by similarity.

US

  • Design and maintain robust ML deployment pipelines to ensure seamless model delivery.
  • Automate model training, deployment, and monitoring workflows to increase operational efficiency.
  • Collaborate closely with Data Scientists and Engineering teams to integrate models into production environments.

Truelogic is a leading provider of nearshore staff augmentation services, headquartered in New York. With 600+ highly skilled tech professionals based in Latin America, they drive digital disruption by partnering with U.S. companies on their most impactful projects.

Europe

  • Build scalable Edge infrastructure, designing and maintaining delivery systems for model deployment.
  • Work with cross-functional teams to integrate complex features, translating research into hardware realities.
  • Drive automation and reliability by implementing infrastructure to test models and monitor performance.

Hudl builds great teams and hires the best to ensure employees are working with people they can constantly learn from. They provide a culture where everyone feels supported and have been named one of Newsweek's Top 100 Global Most Loved Workplaces.

US

  • Design, build, and maintain scalable training infrastructure for computer vision workloads.
  • Implement and manage distributed training pipelines to support large-scale model training and hyperparameter tuning.
  • Build and maintain robust data pipelines for ML development.

Buzz is revolutionizing the analytics and maintenance of power grid infrastructure through their advanced AI solutions. Their computer vision systems analyze critical infrastructure to enhance safety, reliability, and operational efficiency across the power grid network.

Europe

  • Build and operate production-grade model serving infrastructure using frameworks such as vLLM, TGI, Triton, or equivalent.
  • Design and implement robust deployment pipelines with blue/green and canary rollout strategies for ML models.
  • Develop and maintain auto-scaling systems, multi-model serving architectures, and intelligent request routing layers.

Pragmatike is recruiting on behalf of a fast-scaling, well-funded distributed cloud infrastructure startup building next-generation AI-native cloud services. The company is redefining how compute is delivered by providing GPU-powered infrastructure for AI/ML workloads, secure storage, and high-speed data transfer through a decentralized architecture that significantly reduces environmental impact compared to traditional cloud providers.

$133,109–$239,596/yr
US 4w PTO

  • Design, build, and maintain scalable MLOps pipelines for model training, validation, deployment, and monitoring using AWS services.
  • Implement infrastructure as code and CI/CD workflows to support rapid experimentation and reliable production releases.
  • Collaborate with data scientists to productionize ML models and ensure reproducibility, versioning, and traceability.

Experian is a global data and technology company, powering opportunities for people and businesses around the world. They are a FTSE 100 Index company with a team of 23,300 people across 32 countries, investing in people and new advanced technologies to unlock the power of data and to innovate.

$134,000–$149,000/yr
US

  • Design, implement, and operate cloud-native infrastructure for production workloads.

PointClickCare's mission is to help providers deliver exceptional care. They are a leading health tech company, founder-led and privately held, that empowers employees to push boundaries, innovate, and shape the future of healthcare. They have the largest long-term and post-acute care dataset and a Marketplace of 400+ integrated partners, and their platform serves over 30,000 provider organizations.

$170,000–$240,000/yr
US

  • Own the full ML lifecycle, including data ingestion, training, validation, deployment, monitoring, retraining, and retirement.
  • Transition AI/ML prototypes into scalable, production-ready systems with CI/CD pipelines, automation, and observability.
  • Develop and maintain AI-driven applications and inference services, optimizing for performance, scalability, reliability, and cost.

IMO Health combines software development, artificial intelligence, and clinical expertise to create AI-driven solutions. They enhance access to reliable health information, support clinical decision-making, and improve patient outcomes.

LATAM

  • Design and maintain CI/CD pipelines for ML model training, packaging, and deployment across our microservices.
  • Manage containerized services on AWS ECS, optimizing for cost, latency, and availability.
  • Automate infrastructure provisioning and service configuration with Terraform.

Newsela takes authentic, real-world content from trusted sources and makes it instruction-ready for K-12 classrooms. Each text is published at five reading levels, so content is accessible to every learner; over 3.3 million teachers and 40 million students have registered.

Europe

  • Build and productionize reusable MLOps components supporting scalable and reliable ML workflows.
  • Establish strong ML lifecycle practices including experiment tracking, evaluation, and reproducibility.
  • Enable robust and monitored ML systems aligned with healthcare-grade reliability and compliance requirements.

Neko Health aims to shift healthcare from treating illness to preventing it, using advanced, non-invasive technology and clinical expertise. They have nearly 100 full-time engineers working across multiple European locations and prioritize work-life balance.

US Unlimited PTO

  • Design, build, and maintain ML infrastructure across training, evaluation, serving, and monitoring.
  • Own data pipelines including generation, cleaning, validation, and versioning.
  • Build and improve experiment tracking, orchestration, and reproducibility tooling.

Quilter is helping electrical engineers save time and accomplish more by automating the tedious and time-consuming task of designing printed circuit boards (PCBs). Their small team is composed of experts in electrical engineering, electromagnetic simulation, ML/AI, and high-performance computing (HPC).

$164,000–$194,000/yr
US Mexico Unlimited PTO

  • Architect the ML Ecosystem: You will own the end-to-end lifecycle of our ML infrastructure, designing a scalable, modern environment that enables models to thrive in production.
  • Productionize Innovation: Partner closely with our Data Science team to take complex algorithms from the "lab" to the "real world", building the high-performance pipelines required to scale them.
  • Engineer Feature Intelligence: Design and maintain both offline and online feature stores, ensuring our models have the high-quality data they need for instant decision-making.

True Accord, a wholly owned subsidiary of TrueML, combines machine learning with a human-based approach to transform debt resolution and to get people on the path towards financial health. We are a dynamic group of people who are subject matter experts with a passion for change.

Europe

  • Maintain and scale Kubernetes clusters, managing workloads and monitoring at production scale.
  • Manage and evolve our AWS and GCP cloud environments, balancing reliability, cost, and velocity.
  • Own and improve our CI/CD systems using GitHub Actions on our self-hosted AWS runners.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company develops products to enhance visual communication and enterprise skill development, helping people work better. Our valuation stands at $4 billion and our culture values building and hiring smart, kind, unrelenting people.

UK 5w PTO

  • Lead and guide development teams while working directly with clients.
  • Translate business and technical requirements into impactful applications.
  • Ensure best practices in software development, DevOps, and agile methodologies.

Nearform is an independent team of data & AI experts, engineers, and designers who build intelligent digital solutions and capability at pace. Our team of 500 experts in 20+ countries is trusted by leading enterprises.

US

  • Build and maintain infrastructure-as-code for our AWS EKS and GCP GKE clusters, plus on-premises deployments.
  • Own CI/CD pipelines and drive GitOps adoption.
  • Deploy, scale, and optimize ML/NLP inference workloads.

Vectara is the Enterprise Agent Platform that enables businesses to build and deploy governed, grounded, auditable AI agents across SaaS, VPC, and on-prem. We’re a passionate team that’s hyper-focused on solving enterprise-level technology and business problems with AI.

$123,696–$254,667/yr
US

  • Scale the decision-making process for tools for the tvScientific AI team, from our workflows to our training infrastructure to our Kubernetes deployments.
  • Improve the developer experience for the data science team and upgrade our observability tooling.
  • Make every deployment smooth as our infrastructure evolves, working with software engineering, data infra, and SRE partners.

tvScientific is the first and only CTV advertising platform purpose-built for performance marketers, leveraging massive data and cutting-edge science to automate and optimize TV advertising to drive business outcomes. It is built by industry leaders with expertise in programmatic advertising, digital media, and ad verification to create a trusted platform for advertisers to grow their business.

Europe

  • Design, build, and maintain scalable, highly available and fault-tolerant infrastructures.
  • Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration.

Mistral AI is dedicated to democratizing AI through high-performance, optimized, open-source models, products, and solutions designed to integrate seamlessly into daily working life. They are a dynamic, collaborative team dedicated to innovation and passionate about AI and its potential to transform society.

$200,000–$250,000/yr
US Canada Unlimited PTO

  • Design the BYOC deployment model for Archie across customer environments.
  • Build and own Kubernetes-based infrastructure that runs reliably across multiple clouds and customer setups.
  • Create deployment tooling using Helm, GitOps, or similar approaches to make installation and operations repeatable.

P-1 AI is building an engineering AGI with their first product, Archie, an AI engineer. They closed a $23 million seed round and aim to put an Archie on every engineering team at every industrial company on earth.

Turkey

  • Lead the strategy and architecture for a scalable AI platform that integrates model orchestration, tool integration, and real-time decision systems.
  • Design, develop, and maintain the platform with full ownership from ideation to deployment, ensuring reliability, observability, and security.
  • Mentor engineers and collaborate across teams to evangelize AI best practices and drive the integration of AI throughout the product development lifecycle.

JumpCloud is an AI-powered unified IT management platform designed to secure the modern workforce by consolidating identity, device, and access management. The company is remote-first with teams in over 15 countries, fostering a culture that values building connections, out-of-the-box thinking, and passionate collaboration on challenging technical problems.

$175,000–$215,000/yr
US Unlimited PTO

  • Transform our DevOps platform and bring it into alignment with modern tooling and practices.
  • Transform the way we monitor and operate software in production to fully incorporate modern automation and AI tools.
  • Enable engineering teams to rearchitect monolithic .NET Framework and legacy JavaScript framework systems into a modular, platform-focused architecture.

Campminder builds software for summer camps, supporting the industry's digital transformation. With over 20 years of experience, they're a stable and profitable company with a loyal customer base, 85+ employees, and a values-led culture that prioritizes employee experience.

Europe Unlimited PTO

  • Design, build, and maintain the inference infrastructure that powers Sword Health's AI products.
  • Own the end-to-end deployment pipeline for AI models.
  • Architect and scale Kubernetes clusters for GPU-accelerated workloads.

Sword Health is shifting healthcare from human-first to AI-first through its AI Care platform, making healthcare available anytime, anywhere, and reducing costs. They have over 1,000 enterprise clients and have raised more than $500 million from leading investors.