Source Job

Global

  • Design, implement, and maintain high-performance ML training and inference platforms.
  • Ship tools that allow any ML engineer to deploy a model in minutes, not days.
  • Improve scalability, reliability, and cost efficiency of model training and serving systems.

Python Docker Kubernetes Terraform MLflow

20 jobs similar to AI Infrastructure Engineer

Jobs ranked by similarity.

North America Unlimited PTO

  • Build and operate scalable backend services and internal APIs for the AI platform.
  • Integrate LLMs and AI tool execution into reliable, production-ready workflows.
  • Own production reliability for AI platform infrastructure through observability, alerting, and incident response.

MaintainX is the world's leading Asset and Work Intelligence platform for industrial and frontline environments. It is a modern, IoT-enabled, cloud-based tool for reliability, safety, and operations on physical equipment and facilities, powering operational excellence for 13,000+ businesses. MaintainX recently completed a $150 million Series D round at a valuation of $2.5 billion.

$141,487–$184,800/yr
Europe

  • Design scalable, future-proof data platforms optimized for AI research workloads.
  • Build efficient self-serve data processing pipelines leveraging GCP's advanced services.
  • Implement guardrails for cost, quality, and performance.

AssemblyAI is at the forefront of Speech AI, creating powerful models for speech-to-text and speech understanding via an API. They're a remote team of startup veterans and AI researchers looking to build one of the next great AI companies.

US Canada Argentina India

  • Work with research teams to design and build our training infrastructure.
  • Prototype new training frameworks and production-ize solutions at scale.
  • Design, optimize and test model integration infrastructure.

Clarifai is a leading AI platform specializing in computer vision, NLP, LLMs, and audio recognition, helping organizations transform unstructured data into structured data. Founded in 2013, they operate remotely across multiple countries with backing from industry leaders, fostering a diverse and equal opportunity workplace.

US

  • Design, develop, and deploy AI/ML models and pipelines that meet mission and performance objectives.
  • Build, train, and fine-tune models using frameworks such as PyTorch, TensorFlow, scikit-learn, Hugging Face, and LangChain.
  • Write clean, efficient Python code for data ingestion, feature engineering, embeddings, and inference services.

Frontier Technology Inc. (FTI) delivers mission-focused solutions to the Department of Defense (DoD/DoW) and Intelligence Community (IC) through advanced engineering, digital transformation, and program execution expertise. They help their customers solve complex challenges and achieve mission success by integrating people, process, and technology.

India

  • Design and manage AWS infrastructure for AI services.
  • Implement Infrastructure as Code using Terraform.
  • Collaborate with cross-functional teams to enhance performance.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

Global

  • Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
  • Optimize for AI/ML training by collaborating with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance.
  • Troubleshoot and resolve complex issues and proactively identify infrastructure bottlenecks, performance degradation, and system failures.

Cohere's mission is to scale intelligence to serve humanity by training and deploying frontier models for developers and enterprises who are building AI systems. The company is composed of researchers, engineers, and designers passionate about their craft and believes that a diverse range of perspectives is a requirement for building great products.

Canada

  • Contribute to our core ML infrastructure.
  • Prototype new training frameworks and production-ize solutions at scale.
  • Design, optimize and test model integration infrastructure.

Clarifai is a leading, full-lifecycle deep learning AI platform for computer vision, natural language processing, LLMs, and audio recognition. Clarifai was founded in 2013 and has employees based remotely throughout the United States, Canada, Argentina, India, and Estonia.

US

  • Design and implement advanced GPU virtualization solutions.
  • Manage and optimize large-scale GPU and HPC clusters.
  • Collaborate with data science and engineering teams to optimize AI models.

Jobgether is a company that connects job seekers with potential employers. They use AI-powered matching to ensure applications are reviewed quickly and fairly, and their system identifies top-fitting candidates for hiring companies.

$140,000–$180,000/yr
US

  • Design and deliver scalable AI systems that connect models, data, and products.
  • Turn research prototypes into secure, reliable, production-ready services.
  • Build pipelines and serving layers that power adaptive, real-time features.

KnowBe4 is a cybersecurity company that puts security first, offering an AI-driven Human Risk Management platform. They empower over 70,000 organizations worldwide to strengthen their security culture and transform their workforce into their strongest security asset.

$133,109–$239,596/yr
US 4w PTO

  • Develop scalable MLOps pipelines for model training, validation, deployment, and monitoring using AWS services.
  • Implement infrastructure as code and CI/CD workflows to support rapid experimentation and reliable production releases.
  • Collaborate with data scientists to productionize ML models and ensure reproducibility, versioning, and traceability.

Experian is a global data and technology company, powering opportunities for people and businesses around the world. A FTSE 100 Index company listed on the London Stock Exchange (EXPN), they have a team of 23,300 people across 32 countries, with corporate headquarters in Dublin, Ireland.

$125,600–$157,000/yr
US

  • Design, build, and scale enterprise-grade AI/ML systems that power internal workflows and external-facing AI/ML platforms.
  • Develop a production-ready Generative AI and MLOps platform with reusable components used to deploy multiple AI solutions across Natera’s business units.
  • Implement cloud-native infrastructure for large-scale model training and serving using Kubernetes, MLflow, Terraform, and AWS-native services.

Natera is a global leader in cell-free DNA (cfDNA) testing. They are dedicated to oncology, women’s health, and organ health, aiming to make personalized genetic testing and diagnostics part of the standard of care. The Natera team consists of highly dedicated statisticians, geneticists, doctors, laboratory scientists, business professionals, software engineers and many other professionals from world-class institutions.

Latin America

  • Strong computer science or engineering background with 3+ years of coding experience with Python.
  • Advanced knowledge of AWS services including but not limited to their ML services (AWS SageMaker and AWS Step Functions).
  • Experience with ML monitoring and automation tools (MLflow, SageMaker Pipelines).

Bluelight is a leading software consultancy dedicated to designing and developing innovative technology that enhances users' lives. With a presence across the United States and Central/South America, Bluelight is in an exciting phase of expansion, continually seeking exceptional talent to join its dynamic and diverse community.

US

  • Architect enterprise-grade AI systems using LLMs and multimodal models.
  • Design end-to-end pipelines for data ingestion, model training, and deployment.
  • Define engineering standards for MLOps and data quality.

Jobgether connects job seekers with companies using AI-powered matching. They focus on quick, objective, and fair reviews of applications.

$110,000–$135,000/yr
US Unlimited PTO 9w maternity

  • Design, develop, and maintain backend systems in Python, integrating AI-driven functionalities and APIs.
  • Implement prompt engineering, Retrieval-Augmented Generation (RAG), and basic generative AI solutions into platform workflows.
  • Collaborate on AI model deployment, fine-tuning, and production integration.

Impiricus is the first and only AI-powered HCP Engagement Engine created to cut through the noise and put physician care delivery at the forefront. The company is committed to providing life science companies with AI technology needed to deliver evidence-based resources into the hands of HCPs.

$145,000–$165,000/yr
US

  • Take ownership of an ML deployment system spanning multiple production environments and continue to research efficient and effective strategies.
  • Improve, expand, and streamline our existing deployment pipelines to support faster deployments and automated model retraining.
  • Collaborate with Data Scientists to understand model requirements and provide guidance to ensure seamless integration with production environments.

Best Egg is a market-leading, tech-enabled financial platform helping people build financial confidence through lending solutions and financial health tools. They foster an inclusive, flexible, and fun workplace with top-tier benefits and growth opportunities.

Global 5w PTO

  • Design, develop, and deploy robust ML systems and multi-model AI agents that solve real-world retail challenges.
  • Lead the entire lifecycle, including prototyping, deployment, monitoring, and maintenance using modern CI/CD and containerisation practices.
  • Build high-performance data pipelines (ETL/ELT) for both training and real-time inference, ensuring our systems are scalable and reliable.

EDITED is the world’s leading AI-driven retail intelligence platform. They empower the world’s most successful brands and retailers with real-time decision-making power. Their environment is dynamic and supportive, encouraging team members to take initiative, innovate, and continuously grow.

US 4w PTO

  • Architect, design, and oversee delivery of end-to-end AI/ML solutions.
  • Lead cross-functional teams to implement robust ML platforms, pipelines, and applications.
  • Communicate the business value and ROI of AI/ML solutions to stakeholders.

Jobgether is using an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. The system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

US

  • Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools.
  • Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime.
  • Implement comprehensive monitoring, logging, and alerting to provide deep visibility into system health.

Filevine provides cloud-based workflow tools for legal professionals, helping them manage organizations and serve clients. They are recognized as a fast-growing and innovative technology company with a team of passionate professionals.

Europe

  • Design, implement, and maintain robust, containerized, and reproducible pipelines for model training, evaluation, and deployment—across both batch and real-time settings.
  • Build and manage ML services, APIs, and model serving infrastructure using tools like MLflow, Amazon SageMaker, and Feature Store.
  • Set up and maintain monitoring, observability, and alerting systems to ensure high availability and performance (including model/data drift, feature logging, and inference latency).

AUTO1 Group Technology drives innovation in the used car market across Europe. They operate at the intersection of software engineering, data science, and DevOps, helping bring state-of-the-art ML models—such as large-scale recommendation systems and transformer-based neural networks—safely into production.

UK 7w PTO

  • Design, develop, and deploy machine learning models and pipelines using Python.
  • Build and maintain end-to-end ML systems from data ingestion to model serving.
  • Write clean, efficient, and maintainable Python code following best practices.

Activate Group was named by the Sunday Times as one of the UK’s 100 fastest-growing private companies. They employ more than 700 team members nationwide and work with some of the UK's largest fleets and insurance companies, supporting drivers involved in road incidents.