Source Job

UK

  • Build and maintain backend services, Python libraries, and model lifecycle tooling for internal ML teams.
  • Design and operate distributed systems for model serving, evaluation, and feature engineering.
  • Focus on developer experience and reliability to help teams train, deploy, and serve ML models safely.

Go Python Kubernetes AWS GCP

20 jobs similar to ML/AI Platform Engineer

Jobs ranked by similarity.

Canada

  • Design and operate core AI platform components for training, deploying, and serving ML models at scale.
  • Own model serving and inference workflows end-to-end, optimizing for reliability, latency, throughput, and cost.
  • Collaborate with product, infrastructure, and security teams to build scalable platform capabilities for AI-powered features.

Mozilla Corporation is the non-profit-backed technology company behind Firefox and Pocket, with over 225 million monthly users. A wholly owned subsidiary of the Mozilla Foundation, the company is mission-driven and focused on privacy and open standards.

US Unlimited PTO

  • Own and scale AI compute and deployment platforms including Kubernetes and GitOps pipelines.
  • Build inference infrastructure and observability stacks for LLM-powered workflows.
  • Drive security, compliance, and governance at the systems level in a regulated healthcare environment.

Hims & Hers is a leading health and wellness platform focused on making healthcare accessible and personal. As a publicly traded company on the NYSE (HIMS), it offers flexible/remote work and a culture centered on innovation and employee well-being.

Europe 5w PTO

  • Design, build, and maintain scalable backend services and APIs that power Chattermill’s core analytics platform.
  • Architect reliable, maintainable distributed systems and contribute to the evolution of backend service design and infrastructure.
  • Own end-to-end delivery of backend engineering workstreams, from technical scoping and architecture through to implementation, testing, observability, and production support.

Chattermill helps large, successful brands like Uber, Amazon, and Wise put their customers at the centre of everything they do. Using best-in-class tech in a fast-evolving AI space, its Customer Experience Intelligence platform continuously analyses feedback to help clients identify what to do next.

US

  • Design and build a next-generation reliability platform for Affirm's production systems, blending distributed systems engineering with AI-assisted development.
  • Create AI agents and a centralized command center to assist with incident triage, root-cause analysis, and unified system health visualization.
  • Own projects end-to-end, from requirements to rollout, collaborating with partner teams to build powerful, simple solutions for developers.

Affirm is reinventing credit to make it more honest and friendly, offering consumers the flexibility to buy now and pay later without hidden fees. The company is a remote-first organization with a strong focus on people-first values and inclusive benefits.

Brazil

  • Evolve and maintain our Kubeflow, Feast, and Spark-on-Kubernetes ML infrastructure.
  • Design tools and APIs empowering teams to transition from centralized bottlenecks to self-service excellence.
  • Collaborate with Data Science teams to apply software engineering best practices to ML workflows.

Wellhub revolutionizes workplace wellness by connecting employees to partners for fitness, mindfulness, therapy, nutrition, and sleep in one subscription. Headquartered in NYC with team members across the globe, the company values wellbeing, collaboration, and different perspectives.

United States Canada

  • Build and operate the real-time inference service for the risk decision engine with low latency and high availability.
  • Own model deployment infrastructure including CI/CD, shadow mode, and staged rollouts.
  • Build model observability and partner with Risk Data Science for production operation.

Mercury is a fintech company that provides banking services for startups via partner banks. The company is committed to creating a safe environment and values diversity, with a growing team focused on innovation.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as a first-class collaborator in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

United States Unlimited PTO

  • Design and build scalable backend systems powering AI agents in real-time enterprise environments.
  • Develop agent orchestration frameworks and low-latency inference pipelines integrating LLMs and SLMs.
  • Build robust APIs and work with cross-functional teams to productionize agentic AI at scale.

Level AI is an AI-native platform that helps enterprises transform contact centers into engines of customer intelligence and operational efficiency. The company is a Series C startup backed by Battery Ventures and ENIAC, based in Mountain View, California, with a globally distributed team.

US Unlimited PTO

  • Design and operate the developer platform powering services, focusing on AI-native tooling like agentic service catalogs and MCP-backed APIs.
  • Build and scale agent golden paths, treating AI agents as first-class platform users, and drive Istio service mesh adoption across the fleet.
  • Establish platform guardrails with scorecards and CI policies, and own core components like Kubernetes, IaC, and CI/CD pipelines.

Hims & Hers is the leading health and wellness platform on a mission to help the world feel great through better health. The company is public, traded on the NYSE as "HIMS," and offers a talent-first flexible/remote work approach with outstanding benefits and culture.

US Unlimited PTO 16w maternity 4w paternity

  • Build and operate the ML lifecycle platform, including tooling for experiment tracking, model registry, and versioned pipelines.
  • Own CI/CD and deployment for ML workloads, building automated pipelines from notebook to production.
  • Make models observable and reliable in production with monitoring for latency, drift, data quality, and cost signals.

dv01 provides a data analytics platform for the structured finance market, offering transparency into investment performance and risk for lenders and Wall Street investors. With over 400 clients and coverage of over 100 million loans, dv01 is a data-first company with a diverse and innovative culture.

$180,000–$250,000/yr
US

  • Build our core Python/Rust platform: request routing, AI workload orchestration, scheduling, GPU autoscaling, large-scale file storage, queueing, etc.
  • Produce forward-looking designs for platform evolution as we scale to 100x current traffic and need to provide low latency worldwide.
  • Leverage AI to an extreme level to automate the mundane parts of building complex but reliable systems.

Fal is building the infrastructure, tools, and model access to move from AI idea to production. They aim to be the unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.

  • Own reliability, latency, and performance for AI platform services and data infrastructure on AWS.
  • Design and maintain CI/CD pipelines, infrastructure-as-code, and observability frameworks across the stack.
  • Partner with AI and data engineers to ensure secure, cost-optimized, and scalable deployment of platform components.

HHAeXchange is the leading technology platform for home and community-based care, providing an end-to-end homecare solution for people who are aging or have disabilities. Founded in 2008, the company is passionate about transforming healthcare by connecting patients, providers, managed care organizations, and states.

$81,112–$92,025/yr
Europe

  • Empower ML Engineers with the tools, infrastructure, and frameworks they need to iterate fast autonomously.
  • Accelerate time-to-market for production-ready ML products with seamless integration and access to data and resources.
  • Own ML CI/CD in close collaboration with the DevExp team, adapting existing frameworks to ML-specific needs.

Dailymotion is a video platform designed to broaden users' horizons with a unique algorithm. They foster inclusivity and aim to build a better and safer Internet with cutting-edge solutions for video hosting and advertising. With 400 employees in France, New York, and Singapore, Dailymotion is shaking up the global video platform ecosystem.

US Unlimited PTO

  • Provide frontline technical expertise to help developers deploy and scale Temporal in cloud-native environments.
  • Troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.
  • Collaborate with engineering and product teams to influence platform improvements and enhance developer experience.

Temporal provides an open source programming model that simplifies code and makes applications more reliable. The company is a growing team driven by values of curiosity, collaboration, and humility, focused on improving developer experience.

US

  • Design and optimize scalable, secure, and maintainable AI-powered software solutions, integrating machine learning models and generative AI services.
  • Champion engineering excellence by writing high-quality, well-tested code and guiding peers in best practices for AI integration.
  • Collaborate cross-functionally to evaluate new AI capabilities and contribute to the roadmap for AI-enabled features and platforms.

BECU is a financial cooperative with 1.5 million members and over $30 billion in managed assets, focused on people over profits. With 90 years of history and a purpose-driven culture, they are one of the nation's leading credit unions, emphasizing employee support and community well-being.

United States

  • Design and build core platform infrastructure for large-scale cloud-native data and analytics systems.
  • Own and improve CI/CD pipelines, testing frameworks, and deployment in a high-scale PaaS environment.
  • Contribute to reliability engineering, observability, and operational excellence across distributed systems.

Jobgether uses an AI-powered matching process to connect candidates with roles. The company is a growing platform focused on efficient job matching and data privacy compliance.

UK Netherlands

  • Design and build systems that improve the efficiency of ML training and inference workloads.
  • Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  • Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.

Reddit is a community of communities built on shared interests, passion, and trust, hosting the most open and authentic conversations on the internet. With over 100,000 active communities and approximately 126 million daily active users, Reddit is one of the internet's largest sources of information.

Europe

  • Design, build, and maintain scalable services that support the AI lifecycle.
  • Develop tools for pre/post-processing data for AI and other usage.
  • Design scalable pipelines for data collection, processing, and transformation.

Planner 5D is a global hub for home design, uniting over 100 million users. They simplify the home renovation process with their cutting-edge software, fostering a vibrant community of enthusiastic and product-oriented professionals.

Global Unlimited PTO

  • Build and scale high-throughput ingestion and trace-query systems for LangSmith, a purpose-built observability platform.
  • Set API, SDK, and CLI standards across Python, TypeScript, Go, and Java for consistent developer experiences.
  • Own integrations with AI frameworks and tools, ensuring LangSmith remains framework-agnostic and easy to adopt.

LangChain builds the foundation for agent engineering, helping developers go from prototypes to production-ready AI agents with platforms like LangSmith and open-source frameworks. With $125M raised from top VCs and 100M+ monthly open source downloads, the team is small but impactful, shaping how AI agents operate in the real world.

US Unlimited PTO

  • Maintain, improve, and extend an AI platform already running in production.
  • Handle a mix of backend development, data pipelines, DevOps, and infrastructure work.
  • Translate business and product requirements into technical decisions independently.

Provectus is an AI consultancy and solutions provider that helps businesses adopt AI technologies through development and integration services. The job posting does not state the company's size; the culture appears flexible, autonomous, and tech-forward.