20 jobs similar to Staff Software Engineer, Machine Learning Infrastructure

Jobs ranked by similarity.

US Canada Argentina India

  • Work with research teams to design and build our training infrastructure
  • Prototype new training frameworks and productionize solutions at scale
  • Design, optimize and test model integration infrastructure

Clarifai is a leading AI platform specializing in computer vision, NLP, LLMs, and audio recognition, helping organizations transform unstructured data into structured data. Founded in 2013, they operate remotely across multiple countries with backing from industry leaders, fostering a diverse and equal opportunity workplace.

  • Own the end-to-end lifecycle of ML model deployment—from training artifacts to production inference services.
  • Design, build, and maintain scalable inference pipelines using modern orchestration frameworks (e.g., Kubeflow, Airflow, Ray, MLflow).
  • Implement and optimize model serving infrastructure for latency, throughput, and cost efficiency across GPU and CPU clusters.

MARA is building a modular platform that unifies IaaS, PaaS, and SaaS, which will enable governments, enterprises, and AI innovators to deploy, scale, and govern workloads across data centers, edge environments, and sovereign clouds. They are redefining the future of sovereign, energy-aware AI infrastructure.

Australia New Zealand

  • Act as a solution expert across ML domains including evaluations, training, inference, data pipelines, quality, and optimisation.
  • Work directly alongside product teams as a trusted partner, helping them navigate technical challenges and arrive at effective solutions.
  • Develop blueprints, patterns, and paved roads that allow other teams to follow proven approaches and accelerate their own implementations.

Canva is a design platform that enables users to create professional designs. They have a flagship campus in Sydney, a second campus in Melbourne, and co-working spaces in other locations, with a flexible work environment.

$145,831–$218,747/yr
Canada

  • Build, maintain and improve Torc ML frameworks.
  • Use Terraform, AWS Managed Services, EKS, Ray.
  • Focus on data ops, ML development pipeline, logging & aggregation.

Torc has been a leader in autonomous driving since 2007. Now a part of the Daimler family, they are focused solely on developing software for automated trucks to transform how the world moves freight. Their culture is collaborative, energetic, and team-focused.

US Europe Unlimited PTO

  • Design and deliver advanced solutions that generate predictions from a wide range of Computer Vision models.
  • Build and evolve key components of the Roboflow Platform to ensure seamless, reliable model deployment at scale.
  • Contribute to and maintain Roboflow’s open-source projects, helping grow and support the broader developer community.

Roboflow simplifies building and using computer vision models, and over 1M developers, including those from half the Fortune 100, use Roboflow’s tools.

  • Help define the direction for the team.
  • Define and prioritize ML Platform initiatives.
  • Enable teams to build features at scale by providing a foundation of reusable software components and infrastructure.

Motive empowers the people who run physical operations with tools to make their work safer, more productive, and more profitable. Motive serves nearly 100,000 customers – from Fortune 500 enterprises to small businesses – across a wide range of industries.

$82,300–$140,580/yr
US

  • Deploy and optimize ML/LLM models on platforms like NVIDIA Triton and vLLM within Kubernetes clusters.
  • Integrate models with Rackspace’s Unified Inference API and API Gateway for multi-tenant routing.
  • Configure telemetry for GPU utilization, request tracing, and error monitoring.

We combine our expertise with the world’s leading technologies — across applications, data and security — to deliver end-to-end solutions.

US

  • Draft detailed natural-language plans and code implementations for machine learning tasks.
  • Convert novel machine learning problems into agent-executable tasks for reinforcement learning environments.
  • Identify failure modes and apply golden patches to LLM-generated trajectories for machine learning tasks.

At Mercor, we’re building the talent engine that helps leading labs and research orgs move AI forward.

Canada

  • Train, evaluate, and optimize machine learning models for high performance.
  • Contribute to R&D in object detection and multi-object tracking for remote sensing.
  • Design and deliver production-grade, maintainable code while managing multi-phase development.

Clarifai is a leading AI platform specializing in computer vision and generative AI. They empower organizations to transform unstructured data into actionable insights. Their globally distributed team operates across the United States, Canada, Estonia, Argentina, and India and is committed to building a diverse and inclusive team.

$315,000–$340,000/yr
US

  • Design and build infrastructure that enables researchers to rapidly iterate on reward signals.
  • Develop systems for automated quality assessment of rewards, including detection of reward hacks and other pathologies.
  • Collaborate with researchers to translate science requirements into platform capabilities.

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society.

Global

  • Design and maintain infrastructure supporting scalable real time data pipelines to handle huge datasets.
  • Develop and support tooling enabling implementation of custom ML algorithms in a low latency environment.
  • Work on infrastructure for running training, inference, monitoring, and deployment on thousands of ML tasks concurrently.

StackAdapt empowers marketers to reach, engage, and convert audiences with precision through its AI-powered marketing platform.

Global

  • Build end-to-end training pipelines: data → training → eval → inference
  • Design new model architectures or adapt open-source frontier models
  • Fine-tune models using state-of-the-art methods (LoRA/QLoRA, SFT, DPO, distillation)

A1 is a self-funded AI group operating in full stealth, building a new global consumer AI application focused on an important but underexplored use case. They are assembling a small, elite team of ML and engineering builders who want to work on meaningful, high-impact problems.

US

  • Design and implement advanced GPU virtualization solutions.
  • Manage and optimize large-scale GPU and HPC clusters.
  • Collaborate with data science and engineering teams to optimize AI models.

Jobgether is a company that connects job seekers with potential employers. They use AI-powered matching to ensure applications are reviewed quickly and fairly, and their system identifies top-fitting candidates for hiring companies.

$125,600–$157,000/yr
US

  • Design, build, and scale enterprise-grade AI/ML systems that power internal workflows and external-facing AI/ML platforms.
  • Develop a production-ready Generative AI and MLOps platform with reusable components used to deploy multiple AI solutions across Natera’s business units.
  • Implement cloud-native infrastructure for large-scale model training and serving using Kubernetes, MLflow, Terraform, and AWS-native services.

Natera is a global leader in cell-free DNA (cfDNA) testing. They are dedicated to oncology, women’s health, and organ health, aiming to make personalized genetic testing and diagnostics part of the standard of care. The Natera team consists of highly dedicated statisticians, geneticists, doctors, laboratory scientists, business professionals, software engineers and many other professionals from world-class institutions.

Australia New Zealand

As a Senior MLE, debug complex AI implementations and optimize inference performance. Work directly with product teams building solutions and develop blueprints for proven patterns. Operate in a high-velocity environment where priorities shift rapidly based on team needs.

Join the team redefining how the world experiences design.

Europe

  • Design, implement, and maintain SFT and RL post-training pipelines for multi-step coding agents.
  • Train and adapt LLMs for agent workflows, including planning, tool use, and multi-step interactions inside JetBrains IDEs.
  • Build and develop evaluation and simulation environments where coding agents can act, be measured, and compared on realistic developer tasks.

At JetBrains, code is their passion and they strive to make the strongest, most effective developer tools on earth. Today, AI-powered assistance and agents are becoming a core part of how developers work in their IDEs.

Brazil

  • Combine Software Engineering and Data Science disciplines to create production-ready Machine Learning models.
  • Develop frameworks and a platform to build, deploy, serve and monitor ML-based services.
  • Contribute to the vision and architecture for scaling ML solutions across QuintoAndar's business.

We are Grupo QuintoAndar, the largest real estate ecosystem in Latin America, guided by a shared purpose of helping people love the place they live.

$133,109–$239,596/yr
US 4w PTO

  • Develop scalable MLOps pipelines for model training, validation, deployment, and monitoring using AWS services
  • Implement infrastructure as code and CI/CD workflows to support rapid experimentation and reliable production releases
  • Collaborate with data scientists to productionize ML models and ensure reproducibility, versioning, and traceability

Experian is a global data and technology company, powering opportunities for people and businesses around the world. A FTSE 100 Index company listed on the London Stock Exchange (EXPN), they have a team of 23,300 people across 32 countries, with corporate headquarters in Dublin, Ireland.

Europe

  • Continuously improve the performance and scalability of ML models.
  • Build and deploy models from inception to live in production pipelines.
  • Advocate for code and process improvements across your team, and help to define best practices based on personal industry experience and research.

BenchSci's mission is to exponentially increase the speed and quality of life-saving research and development.

$140,000–$180,000/yr
US Unlimited PTO

  • Develop and evaluate AI-based biomarkers using multimodal data.
  • Design, implement, and improve machine-learning models to predict patient outcomes and treatment response.
  • Conduct research and experimentation to improve model performance, robustness, generalizability, and interpretability.

Artera is an AI startup that develops medical artificial intelligence tests to personalize therapy for cancer patients. They are on a mission to personalize medical decisions for patients and physicians on a global scale and value bringing together individuals from diverse backgrounds.