Source Job

US Unlimited PTO

  • Design, build, and maintain ML infrastructure across training, evaluation, serving, and monitoring
  • Own data pipelines including generation, cleaning, validation, and versioning
  • Build and improve experiment tracking, orchestration, and reproducibility tooling

CI/CD GPU Software Engineering

20 jobs similar to Senior or Staff ML Systems Engineer

Jobs ranked by similarity.

Europe US

  • Design and build training pipelines, fine-tuning workflows, and RL infrastructure.
  • Implement data ingestion and curation systems, inference services, and scalability and backend architecture.
  • Own the platform that turns models into production systems.

Fastino is building the next generation of LLMs, with a team of alumni from Google Research, Apple, Stanford, and Cambridge. Fastino's GLiNER family of open-source models has been downloaded more than 5 million times and is used by companies such as NVIDIA, Meta, and Airbnb.

$170,000–$245,000/yr
US

  • Design, prototype and productionize scalable machine learning and optimization models.
  • Develop frameworks, pipelines, libraries, utilities and tools that process massive data for ML tasks.
  • Build end-to-end reusable pipelines from data acquisition to model output delivery.

BetterHelp's mission is to remove the traditional barriers to therapy and make mental health care more accessible. Founded in 2013, they are now the world’s largest online therapy service, with over 30,000 licensed therapists.

$70,000–$87,000/yr
Argentina Mexico Unlimited PTO

  • Architect the ML Ecosystem.
  • Productionize Innovation.
  • Engineer Feature Intelligence.

TrueML is a mission-driven financial software company that aims to create better customer experiences for distressed borrowers. The TrueML team includes inspired data scientists, financial services industry experts and customer experience fanatics building technology to serve people.

US

  • Design and develop machine learning infrastructure, tooling, and models to help teams deliver world-class experiences.
  • Help product and development teams understand the data lifecycle and the inherent experimental nature of machine learning.
  • Build internal products and platforms to enable teams to incorporate AI into their features and customer-facing products.

Weave provides an all-in-one platform for small businesses to streamline communications and patient experiences. The company has a phenomenal culture, and Weave's teams are cross-functional agile teams composed of a product owner, backend and frontend devs, and devops.

Canada

  • Design and maintain training systems that can process and learn from petabyte-scale multimodal datasets.
  • Identify and resolve bottlenecks in the training pipeline to maximize GPU utilization and reduce training time.
  • Work with the ML team to develop and refine neural network architectures suitable for autonomy tasks.

Serve Robotics is reimagining how things move in cities. Their personable sidewalk robot is their vision for the future; it's designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses. Their team is agile, diverse, and driven, aiming to grow robotic deliveries from surprising novelty to efficient ubiquity.

US

  • Set up and manage GPU cluster infrastructure on major cloud providers.
  • Build and operate job orchestration and scheduling systems.
  • Integrate and maintain ML training frameworks and post-training pipelines.

Snorkel AI helps enterprises transform expert knowledge into specialized AI at scale. They started as a research project in the Stanford AI Lab and work with some of the world’s largest organizations to empower scientists, engineers, financial experts, product creators, journalists, and more to build custom AI with their data faster than ever before.

$150,000–$180,000/yr
US

  • Design and implement reliable, maintainable, and scalable GenAI systems.
  • Serve as a subject matter expert for machine learning systems owned by the team.
  • Mentor junior and mid-level engineers through code reviews and design collaboration.

Trajector specializes in medical evidence services, guiding clients through disability benefits complexities. They are a global team of over 1,800 dedicated individuals, streamlining the path to benefits and ensuring access to rightful compensation for those with disabilities.

LATAM

  • Design and maintain CI/CD pipelines for ML model training, packaging, and deployment across our microservices.
  • Manage containerized services on AWS ECS, optimizing for cost, latency, and availability.
  • Automate infrastructure provisioning and service configuration with Terraform.

Newsela takes authentic, real-world content from trusted sources and makes it instruction-ready for K-12 classrooms. Each text is published at five reading levels, so content is accessible to every learner; over 3.3 million teachers and 40 million students have registered.

$230,000–$322,000/yr
US

  • Architect, build, and deploy large-scale ML systems powering recommendations, search, messaging, and content understanding.
  • Lead projects from ideation → modeling → experimentation → production → iteration.
  • Design and improve recommender systems and ranking models across surfaces (feed, search, notifications).

Reddit is a community of communities that's built on shared interests, passion, and trust. It is home to open and authentic conversations, with over 100,000 active communities and approximately 121 million daily active unique visitors.

US

  • Design and build the AI platform layer, including data pipelines and serving infrastructure.
  • Productionize AI/ML capabilities, ensuring reliability, performance, and scalability.
  • Architect data pipelines to ingest, transform, and serve data to power AI features.

Lone Wolf Technologies is building AI capabilities into the core of its platform, transforming how real estate professionals manage transactions, serve clients, and grow their businesses. The company's culture fosters innovation and collaboration, empowering employees to contribute to impactful projects.

$170,000–$240,000/yr
US

  • Own the full ML lifecycle, including data ingestion, training, validation, deployment, monitoring, retraining, and retirement.
  • Transition AI/ML prototypes into scalable, production-ready systems with CI/CD pipelines, automation, and observability.
  • Develop and maintain AI-driven applications and inference services, optimizing for performance, scalability, reliability, and cost.

IMO Health combines software development, artificial intelligence, and clinical expertise to create AI-driven solutions. They enhance access to reliable health information, support clinical decision-making, and improve patient outcomes.

$185,800–$303,400/yr
US

  • Design, build, and deploy production-grade machine learning models and systems at scale
  • Own the full ML lifecycle: from problem definition and feature engineering to training, evaluation, deployment, and monitoring
  • Work with large-scale datasets to improve ranking, recommendations, search relevance, prediction, content/user understanding, and optimization systems

Reddit is a community of communities built on shared interests and trust, and is home to open conversations on the internet. With 100,000+ active communities and approximately 121 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.

Europe

  • Own the architecture and delivery of production-grade LLM systems and classical ML solutions.
  • Design, evaluate, and optimize RAG pipelines (retrieval strategy, chunking, indexing, monitoring).
  • Build scalable, production-grade LLM services and agentic workflows, alongside traditional ML systems where appropriate.

Hiflylabs is a team of 250+ data and tech enthusiasts based in Budapest. They focus on data engineering, data science, artificial intelligence and application development, working on a wide range of projects around the world. Hiflylabs values its people and is committed to nurturing their personal and professional development through a mentoring system.

US Europe

  • Help customers love Rerun by serving as their primary technical interface.
  • Run demos, build engineer-to-engineer trust, and support onboarding.
  • Map customer needs to the product roadmap and build features based on that understanding.

Rerun is building the data stack for Physical AI. They have an uncommonly talented tech team and expect everyone to take broad responsibility for what they build.

$100,000–$150,000/yr
US

  • Contribute to the development of software infrastructure to support the creation of agentic systems.
  • Deploy and operate cloud-hosted services for use by the research community.
  • Define key directions to keep UChicago at the forefront of AI/ML and national data infrastructure.

The University of Chicago delivers solutions to the research community worldwide through Globus, a sustainable, non-profit unit. They develop cloud-based software for governmental, academic, and commercial organizations, emphasizing data management challenges, with employees based in downtown Chicago and remotely.

$160,000–$190,000/yr
US Unlimited PTO

  • Design, build, and deploy production AI agents and multi-agent orchestration systems.
  • Architect RAG pipelines with vector search and knowledge base management for AI-driven support.
  • Build production microservices and APIs serving as orchestration layers for AI agent systems.

Greenlight is a family fintech company helping parents raise financially smart kids. They serve over 6 million parents and kids with their banking app, aiming to ensure every child has the opportunity to become financially healthy and happy.

Australia

  • You’ll design, build, and maintain scalable systems for serving machine learning models in production.
  • You’ll optimise inference performance, including latency, throughput, and cost efficiency.
  • You’ll collaborate with ML researchers and engineers to productionise models.

Canva is a design platform that enables users to create a variety of visual content. They have campuses in Sydney and Melbourne, with co-working spaces in other Australian cities, and promote a flexible work environment.

US Canada

  • Bring deep expertise in machine learning and applied AI to turn emerging techniques into practical solutions.
  • Provide broad technical leadership across teams while remaining hands-on in applied research and innovation.
  • Guide major technical decisions, identify opportunities for differentiation, and translate new ideas into future product capabilities.

Kinaxis is a global leader in modern supply chain orchestration that powers complex global supply chains and supports the people who manage them. They have grown into a global organization with over 2,000 employees around the world, with 6 global offices and a best-in-class HQ in Ottawa, Canada.

Australia

  • Drive the design and evolution of AI-ready tools and APIs for LLM platforms.
  • Own and evolve evaluation frameworks that measure tool-use accuracy across platforms.
  • Shape Canva's agent architecture, making strategic technical decisions about intelligence location.

Canva is a design platform that enables users to create various visual content. They have offices in multiple locations in Australia and New Zealand, and they offer a flexible work environment.

$151,000–$205,000/yr
US Unlimited PTO

  • Extend, optimize, and maintain core data models for reports, machine learning, and generative AI.
  • Implement automation and operationalize ML models to streamline operational processes and improve efficiency.
  • Partner with engineering, product, and analytics teams to deliver seamless integrations and customer-facing data products.

Boulevard provides a client experience platform for appointment-based, self-care businesses, helping customers enhance client experiences. They value diversity and inclusivity, offering equal opportunities and aiming to create a supportive work environment.