Source Job

US

  • Design, build, and maintain scalable training infrastructure for computer vision workloads
  • Implement and manage distributed training pipelines to support large-scale model training and hyperparameter tuning
  • Build and maintain robust data pipelines for ML development

Python MLOps AWS GCP Kubernetes

20 jobs similar to Machine Learning Platform Engineer

Jobs ranked by similarity.

US

  • Design and maintain robust ML deployment pipelines to ensure seamless model delivery.
  • Automate model training, deployment, and monitoring workflows to increase operational efficiency.
  • Collaborate closely with Data Scientists and Engineering teams to integrate models into production environments.

Truelogic is a leading provider of nearshore staff augmentation services, headquartered in New York. With more than 600 highly skilled tech professionals based in Latin America, they drive digital disruption by partnering with U.S. companies on their most impactful projects.

$133,109–$239,596/yr
US 4w PTO

  • Design, build, and maintain scalable MLOps pipelines for model training, validation, deployment, and monitoring using AWS services.
  • Implement infrastructure as code and CI/CD workflows to support rapid experimentation and reliable production releases.
  • Collaborate with data scientists to productionize ML models and ensure reproducibility, versioning, and traceability.

Experian is a global data and technology company, powering opportunities for people and businesses around the world. They are a FTSE 100 Index company with a team of 23,300 people across 32 countries, investing in people and new advanced technologies to unlock the power of data and to innovate.

LATAM

  • Design and maintain CI/CD pipelines for ML model training, packaging, and deployment across our microservices.
  • Manage containerized services on AWS ECS, optimizing for cost, latency, and availability.
  • Automate infrastructure provisioning and service configuration with Terraform.

Newsela takes authentic, real-world content from trusted sources and makes it instruction-ready for K-12 classrooms. Each text is published at five reading levels, so content is accessible to every learner; over 3.3 million teachers and 40 million students have registered.

US

  • Build and maintain infrastructure-as-code for our AWS EKS and GCP GKE clusters, plus on-premises deployments.
  • Own CI/CD pipelines and drive GitOps adoption.
  • Deploy, scale, and optimize ML/NLP inference workloads.

Vectara is the Enterprise Agent Platform that enables businesses to build and deploy governed, grounded, auditable AI agents across SaaS, VPC, and on-prem. We’re a passionate team that’s hyper-focused on solving enterprise-level technology and business problems with AI.

Europe

  • Build and operate production-grade model serving infrastructure using frameworks such as vLLM, TGI, Triton, or equivalent
  • Design and implement robust deployment pipelines with blue/green and canary rollout strategies for ML models
  • Develop and maintain auto-scaling systems, multi-model serving architectures, and intelligent request routing layers

Pragmatike is recruiting on behalf of a fast-scaling, well-funded distributed cloud infrastructure startup building next-generation AI-native cloud services. The company is redefining how compute is delivered by providing GPU-powered infrastructure for AI/ML workloads, secure storage, and high-speed data transfer through a decentralized architecture that significantly reduces environmental impact compared to traditional cloud providers.

5w PTO

  • Own the design, implementation, and evolution of core MLOps systems across Hyperstack.
  • Build and improve systems that orchestrate model training, fine-tuning, evaluation, and deployment.
  • Define and embed strong MLOps practices across teams.

NexGen Cloud is the company behind Hyperstack, a full-stack AI cloud serving tens of thousands of customers from AI researchers to enterprises running the world's most compute-intensive workloads. They deliver on-demand and private GPU infrastructure to teams who treat performance as a requirement, not a feature.

$164,000–$194,000/yr
US Mexico Unlimited PTO

  • Architect the ML Ecosystem: You will own the end-to-end lifecycle of our ML infrastructure, designing a scalable, modern environment that enables models to thrive in production.
  • Productionize Innovation: Partner closely with our Data Science team to take complex algorithms from the "lab" to the "real world", building the high-performance pipelines required to scale them.
  • Engineer Feature Intelligence: Design and maintain both offline and online feature stores, ensuring our models have the high-quality data they need for instant decision-making.

TrueAccord, a wholly owned subsidiary of TrueML, combines machine learning with a human-based approach to transform debt resolution and to get people on the path towards financial health. We are a dynamic group of people who are subject matter experts with a passion for change.

$170,000–$200,000/yr
US Unlimited PTO

  • Working with customers, engineers, and other stakeholders to define clear requirements that solve customers’ problems and leverage the capabilities of our AIOps platform
  • Developing and validating machine learning models and custom analytic algorithms that are applied to image, video, text, geospatial, time series, and structured data
  • Orchestrating and automating complex data and analytic pipelines

Striveworks helps organizations harness the power of artificial intelligence to solve real-world national security and business challenges by serving as the command center between data, models, and business outcomes. Founded by data scientists and engineers, Striveworks set out to make the journey from deployment to ongoing optimization simple and effective.

US Unlimited PTO

  • Design, build, and maintain ML infrastructure across training, evaluation, serving, and monitoring
  • Own data pipelines including generation, cleaning, validation, and versioning
  • Build and improve experiment tracking, orchestration, and reproducibility tooling

Quilter is helping electrical engineers save time and accomplish more by automating the tedious and time-consuming task of designing printed circuit boards (PCBs). Their small team is composed of experts in electrical engineering, electromagnetic simulation, ML/AI, and high-performance computing (HPC).

Europe

  • Build and productionize reusable MLOps components supporting scalable and reliable ML workflows.
  • Establish strong ML lifecycle practices including experiment tracking, evaluation, and reproducibility.
  • Enable robust and monitored ML systems aligned with healthcare-grade reliability and compliance requirements.

Neko Health aims to shift healthcare from treating illness to preventing it, using advanced, non-invasive technology and clinical expertise. They have nearly 100 full-time engineers working across multiple European locations and prioritize work-life balance.

Europe

  • Build scalable Edge infrastructure, designing and maintaining delivery systems for model deployment.
  • Work with cross-functional teams to integrate complex features, translating research into hardware realities.
  • Drive automation and reliability by implementing infrastructure to test models and monitor performance.

Hudl builds great teams and hires the best to ensure employees are working with people they can constantly learn from. They provide a culture where everyone feels supported, becoming one of Newsweek's Top 100 Global Most Loved Workplaces.

US

  • Build and deploy end-to-end AI/ML solutions, from data pipelines and feature engineering to model training and inference
  • Develop and maintain data pipelines for ingesting, transforming, and preparing data for analytics and machine learning
  • Write clean, modular, and maintainable code to support scalable AI applications

Eimagine fosters a remote-enabled environment where their people can thrive. They are a team of professionals who take pride in their craft, continuously learn, and support one another, helping clients navigate technology and business change while delivering meaningful outcomes.

$200,000–$250,000/yr
US Unlimited PTO

  • Work with customers, engineers, and other stakeholders to define clear requirements that solve the customers’ problems and leverage the capabilities of our AI operations platform.
  • Translate requirements into a technical approach, design, scoping estimate, and execution plan.
  • Lead execution teams to achieve on-time completion of project deliverables mapped to customer business value while making key individual contributions throughout the process.

Striveworks helps organizations harness the power of artificial intelligence to solve real-world national security and business challenges. Founded by data scientists and engineers, they set out to make the journey from deployment to ongoing optimization simple and effective.

Europe

  • Maintain and scale Kubernetes clusters, managing workloads and monitoring at production scale.
  • Manage and evolve our AWS and GCP cloud environments, balancing reliability, cost, and velocity.
  • Own and improve our CI/CD systems using GitHub Actions on our self-hosted AWS runners.

Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company develops products to enhance visual communication and enterprise skill development, helping people work better. Our valuation stands at $4 billion, and our culture values building and hiring smart, kind, unrelenting people.

US

  • Develop and enhance backend features, ensuring system reliability and scalability.
  • Collaborate with stakeholders to define requirements and improve system performance.
  • Manage infrastructure using Terraform and other infrastructure-as-code tools.

Aura is on a mission to create a safer internet, offering a suite of intelligent digital safety products that help millions of customers protect themselves against digital threats. With over 400 employees worldwide, Aura is guided by experienced leadership and fosters an inclusive community.

$180,000–$190,000/yr
US

  • Build and own product and platform capabilities across Spiral and AXO, from early prototypes to production systems.
  • Design and implement AI-powered workflows, agent capabilities, and backend services that are scalable, secure, and reliable.
  • Develop high-performance APIs, async workers, and application logic in Python and TypeScript.

UJET leads in AI-powered contact center innovation, offering a cloud platform that redefines customer experience. They ensure security, scalability, and prioritized data insights, partnering with businesses for smarter decision-making and accelerated growth in the AI-driven world.

$123,696–$254,667/yr
US

  • Scale decision-making on tooling for the tvScientific AI team, from our workflows to our training infrastructure to our Kubernetes deployments.
  • Improve the developer experience for the data science team and upgrade our observability tooling.
  • Make every deployment smooth as our infrastructure evolves, working with software engineering, data infra, and SRE partners.

tvScientific is the first and only CTV advertising platform purpose-built for performance marketers, leveraging massive data and cutting-edge science to automate and optimize TV advertising to drive business outcomes. It is built by industry leaders with expertise in programmatic advertising, digital media, and ad verification to create a trusted platform for advertisers to grow their business.

Canada

  • Design and maintain training systems that can process and learn from petabyte-scale multimodal datasets.
  • Identify and resolve bottlenecks in the training pipeline to maximize GPU utilization and reduce training time.
  • Work with the ML team to develop and refine neural network architectures suitable for autonomy tasks.

Serve Robotics is reimagining how things move in cities. Their personable sidewalk robot is their vision for the future; it's designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses. Their team is agile, diverse, and driven, aiming to grow robotic deliveries from a surprising novelty to an efficient ubiquity.

US

  • Analyze ML models to identify and resolve performance bottlenecks.
  • Incorporate OSS tools that enable ML engineers to profile and optimize models self-sufficiently.
  • Deliver solutions to streamline model deployment across various hardware platforms.

Stack is developing revolutionary AI and advanced autonomous systems designed to enhance the safety, reliability, and efficiency of modern operations. With decades of experience creating and deploying real-world systems for demanding environments, the Stack team is dedicated to developing an autonomous solution ecosystem tailored to the trucking industry's unique demands.

Europe Unlimited PTO

  • Build and ship customer‑facing AI, combining Generative AI with machine‑learning techniques.
  • Develop new models end-to-end, from understanding product requirements to implementation and deployment.
  • Create an ML Ops framework to ensure models scale effectively with proper monitoring and alerts.

Qonto is creating the leading finance workspace with banking at its core for SMEs in Europe, augmented by financial tools. Founded in 2017 by Alexandre and Steve, Qonto has grown to over 1,600 employees and serves over 600,000 customers across 8 European countries, with a culture that prioritizes customer satisfaction.