Source Job

EMEA

  • Build and operate production-grade model serving infrastructure using vLLM, TGI, or Triton frameworks.
  • Design and implement auto-scaling, multi-model architectures, and intelligent request routing for ML inference.
  • Optimize GPU utilization, memory efficiency, and observability to ensure low-latency, cost-effective systems.

Python Kubernetes Terraform

20 jobs similar to AI Infrastructure Engineer

Jobs ranked by similarity.

  • Optimize production LLM serving with vLLM and SGLang to maximize throughput and minimize latency through batching and quantization.
  • Profile training runs to find bottlenecks and resolve them with attention implementations like FlashAttention on H200 and GB200 hardware.
  • Deploy and operate multiple models on shared GPU clusters with autoscaling, bin-packing, and efficient handling of mixed workloads.

Egen is a fast-growing technology company with a data-first mindset, partnering with clients on Google Cloud and Salesforce to drive action through data and insights. We are a team of dedicated engineers who thrive on solving tough problems and continually innovate to achieve fast, effective results.

US Unlimited PTO

  • Design and maintain scalable ML infrastructure including data pipelines, training workflows, and model deployment systems.
  • Own end-to-end ML lifecycle operations, ensuring reliable delivery of models into production at scale.
  • Implement monitoring, telemetry, and feedback loops for ML models running across large-scale device fleets.

Our partner company develops ML systems for connected hardware products used by customers worldwide. They operate in a fast-paced, product-driven environment with a collaborative and technically ambitious culture focused on real-world ML impact.

Canada

  • Design and operate core AI platform components for training, deploying, and serving ML models at scale.
  • Own model serving and inference workflows end-to-end, optimizing for reliability, latency, throughput, and cost.
  • Collaborate with product, infrastructure, and security teams to build scalable platform capabilities for AI-powered features.

Mozilla Corporation is the non-profit-backed technology company behind Firefox and Pocket, with over 225 million monthly users. A wholly-owned subsidiary of the Mozilla Foundation, the company is mission-driven and focused on privacy and open standards.

Europe

  • Monitor, operate, and support production AI infrastructure platforms.
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

US Unlimited PTO

  • Own and scale AI compute and deployment platforms including Kubernetes and GitOps pipelines.
  • Build inference infrastructure and observability stacks for LLM-powered workflows.
  • Drive security, compliance, and governance at the systems level in a regulated healthcare environment.

Hims & Hers is a leading health and wellness platform focused on making healthcare accessible and personal. As a publicly traded company on the NYSE (HIMS), it offers flexible/remote work and a culture centered on innovation and employee well-being.

US Unlimited PTO 16w maternity 4w paternity

  • Build and operate the ML lifecycle platform, including tooling for experiment tracking, model registry, and versioned pipelines.
  • Own CI/CD and deployment for ML workloads, building automated pipelines from notebook to production.
  • Make models observable and reliable in production with monitoring for latency, drift, data quality, and cost signals.

dv01 provides a data analytics platform for the structured finance market, offering transparency into investment performance and risk for lenders and Wall Street investors. With over 400 clients and coverage of over 100 million loans, dv01 is a data-first company with a diverse and innovative culture.

India

  • Collaborate with data scientists and engineers to build scalable ML pipelines, troubleshoot infrastructure issues from Linux to Kubernetes, and optimize model performance.
  • Drive high engineering standards, design on-premises MLOps solutions, and maintain tools for deployment and monitoring.
  • Refine CI/CD workflows, incorporate ML model training and evaluation into testing, and ensure seamless handover between research and production.

Learneo is a platform of builder-driven businesses, including Course Hero, CliffsNotes, LitCharts, Quillbot, Symbolab, and Scribbr, focused on supercharging productivity and learning. The company supports high-growth businesses with centralized corporate operations and has a virtual-first culture with employees across multiple countries.

US

  • Own the technical design and delivery of subsystems in a high-throughput, low-latency inference platform.
  • Develop robust API layers and SDKs that abstract complex distributed inference orchestration.
  • Build and harden a multi-tenant control plane for metering, rate limiting, and tenant isolation.

Stack develops revolutionary AI and autonomous systems to enhance safety and efficiency in trucking. The team has decades of experience deploying real-world systems and is committed to inclusion, entrepreneurship, and innovation.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

UK Netherlands

  • Design and build systems that improve the efficiency of ML training and inference workloads.
  • Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  • Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.

Reddit is a community of communities built on shared interests, passion, and trust, hosting the most open and authentic conversations on the internet. With over 100,000 active communities and approximately 126 million daily active users, Reddit is one of the internet's largest sources of information.

  • Own reliability, latency, and performance for AI platform services and data infrastructure on AWS.
  • Design and maintain CI/CD pipelines, infrastructure-as-code, and observability frameworks across the stack.
  • Partner with AI and data engineers to ensure secure, cost-optimized, and scalable deployment of platform components.

HHAeXchange is the leading technology platform for home and community-based care, providing an end-to-end homecare solution for people who are aging or have disabilities. Founded in 2008, the company is passionate about transforming healthcare by connecting patients, providers, managed care organizations, and states.

India

  • Collaborate with data scientists and software engineers to build scalable data pipelines and ML deployment systems.
  • Troubleshoot issues across the ML infrastructure stack, from Linux and Docker to Kubernetes and model serving.
  • Drive high engineering standards through code reviews, testing, and CI/CD enhancements.

Quillbot helps students and professionals strengthen their writing with AI-powered tools. We serve over 56 million users globally and foster a collaborative, virtual-first culture.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as first-class collaborators in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

India

  • Research and implement state-of-the-art techniques to accelerate AI inference: quantization, sparsity, distillation, speculative decoding, and caching.
  • Partner closely with hardware and compiler teams to ensure algorithmic improvements translate to real gains on custom silicon.
  • Build profiling tools and comprehensive benchmarking frameworks to measure model quality and efficiency.

EnCharge AI is building the next generation AI platform using novel in-memory-computing architecture. The team consists of experienced AI researchers, silicon & systems engineers, and architects backed by leading investors.

United States Canada

  • Build and operate the real-time inference service for the risk decision engine with low latency and high availability.
  • Own model deployment infrastructure including CI/CD, shadow mode, and staged rollouts.
  • Build model observability and partner with Risk Data Science for production operation.

Mercury is a fintech company that provides banking services for startups via partner banks. The company is committed to creating a safe environment and values diversity, with a growing team focused on innovation.

Global Unlimited PTO

  • Lead and scale the Forward Deployed Engineering and Technical Support teams, defining engagement models and operating standards.
  • Own the FDE engagement lifecycle from technical discovery to deployment guidance, ensuring customer value.
  • Drive operational discipline across support tools and partner with Sales, Product, and Engineering on roadmap alignment.

Runpod is the AI Developer Cloud. More than one million developers use the platform to experiment, train, deploy, and scale AI. The small, remote-first team has processed over 20 billion inference requests and closed a $100M Series A.

Global 4w PTO

  • Own the ML serving API and deploy models to production with CI/CD and infrastructure as code.
  • Build monitoring, alerting, and reliability for NBA models and LLM agents.
  • Drive architectural decisions and mentor engineers on MLOps patterns.

Clutch is a vertical SaaS company backed by Andreessen Horowitz, revolutionizing how credit unions engage with members via fintech lending software. The company is small and ambitious, with a lean data team of five that values pragmatism and fast shipping.

Europe

  • Lead investigation and resolution of complex infrastructure, networking, and platform incidents.
  • Provide technical leadership for Kubernetes platform operations and drive automation initiatives.
  • Mentor engineers and develop operational standards, runbooks, and best practices.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI, machine learning, and data-intensive applications. Serving enterprises like Adobe, PayPal, and Volkswagen, Mirantis is committed to open standards and freedom from lock-in.

Brazil

  • Evolve and maintain our Kubeflow, Feast, and Spark-on-Kubernetes ML infrastructure.
  • Design tools and APIs empowering teams to transition from centralized bottlenecks to self-service excellence.
  • Collaborate with Data Science teams to apply software engineering best practices to ML workflows.

Wellhub revolutionizes workplace wellness by connecting employees to partners for fitness, mindfulness, therapy, nutrition, and sleep in one subscription. Headquartered in NYC with team members across the globe, we value wellbeing, collaboration, and different perspectives.

United States

  • Own the reliability of event-driven messaging with backpressure, idempotency, and dead-letter handling.
  • Build and operate infrastructure for LLM orchestration workloads at scale.
  • Maintain production support for CI infrastructure including on-call responsibilities and incident response.

Scorpion is a leading provider of technology and services for local businesses, helping them understand market dynamics and improve marketing. The company fosters a culture of constant improvement and unbeatable teamwork, valuing winning mindsets and genuine care.