Source Job

  • Optimize production LLM serving with vLLM and SGLang to maximize throughput and minimize latency through batching and quantization.
  • Profile training runs to find bottlenecks and resolve them with attention implementations like FlashAttention on H200 and GB200 hardware.
  • Deploy and operate multiple models on shared GPU clusters with autoscaling, bin-packing, and efficient handling of mixed workloads.

Python, Kubernetes

17 jobs similar to Lead Machine Learning Engineer, Inference & Performance

Jobs ranked by similarity.

EMEA

  • Build and operate production-grade model serving infrastructure using vLLM, TGI, or Triton frameworks.
  • Design and implement auto-scaling, multi-model architectures, and intelligent request routing for ML inference.
  • Optimize GPU utilization, memory efficiency, and observability to ensure low-latency, cost-effective systems.

They are a distributed cloud infrastructure startup building AI-native cloud services with GPU-powered compute. The company is well-funded, fast-scaling, and operates in a remote-first environment with a focus on sustainability and decentralization.

US

  • Own the technical design and delivery of subsystems in a high-throughput, low-latency inference platform.
  • Develop robust API layers and SDKs that abstract complex distributed inference orchestration.
  • Build and harden a multi-tenant control plane for metering, rate limiting, and tenant isolation.

Stack develops revolutionary AI and autonomous systems to enhance safety and efficiency in trucking. The team has decades of experience deploying real-world systems and is committed to inclusion, entrepreneurship, and innovation.

India

  • Research and implement state-of-the-art techniques to accelerate AI inference: quantization, sparsity, distillation, speculative decoding, and caching.
  • Partner closely with hardware and compiler teams to ensure algorithmic improvements translate to real gains on custom silicon.
  • Build profiling tools and comprehensive benchmarking frameworks to measure model quality and efficiency.

EnCharge AI is building the next-generation AI platform using a novel in-memory computing architecture. The team consists of experienced AI researchers, silicon & systems engineers, and architects backed by leading investors.

Canada

  • Design and operate core AI platform components for training, deploying, and serving ML models at scale.
  • Own model serving and inference workflows end-to-end, optimizing for reliability, latency, throughput, and cost.
  • Collaborate with product, infrastructure, and security teams to build scalable platform capabilities for AI-powered features.

Mozilla Corporation is the non-profit-backed technology company behind Firefox and Pocket, with over 225 million monthly users. A wholly-owned subsidiary of the Mozilla Foundation, the company is mission-driven, employee-owned, and focused on privacy and open standards.

UK, Netherlands

  • Design and build systems that improve the efficiency of ML training and inference workloads.
  • Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  • Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.

Reddit is a community of communities built on shared interests, passion, and trust, hosting the most open and authentic conversations on the internet. With over 100,000 active communities and approximately 126 million daily active users, Reddit is one of the internet's largest sources of information.

US, Unlimited PTO

  • Design and maintain scalable ML infrastructure including data pipelines, training workflows, and model deployment systems.
  • Own end-to-end ML lifecycle operations, ensuring reliable delivery of models into production at scale.
  • Implement monitoring, telemetry, and feedback loops for ML models running across large-scale device fleets.

Our partner company develops ML systems for connected hardware products used by customers worldwide. They operate in a fast-paced, product-driven environment with a collaborative and technically ambitious culture focused on real-world ML impact.

United States

  • Own the reliability of event-driven messaging with backpressure, idempotency, and dead-letter handling.
  • Build and operate infrastructure for LLM orchestration workloads at scale.
  • Maintain production support for CI infrastructure including on-call responsibilities and incident response.

Scorpion is a leading provider of technology and services for local businesses, helping them understand market dynamics and improve marketing. The company fosters a culture of constant improvement and unbeatable teamwork, valuing winning mindsets and genuine care.

Global, 6w PTO

  • Build, optimize, and embed machine learning models for on-device inference within the QSIDS detection engine.
  • Collaborate closely with systems engineers to integrate models efficiently into a Go-based engine.
  • Take models all the way to production and own them once they're running, monitoring performance, detecting drift, and iterating to keep them reliable.

Qohash builds the zero-copy data security control layer for enterprises to adopt AI safely. The company has a strong culture centered on five core values: pursuit of excellence, resilience, mission focus, accountability, and embracing conflict.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

Global, 4w PTO

  • Take ownership of the ML API serving NBA recommendations and harden it for low-latency production traffic.
  • Ship your first agent tool contract end-to-end: schema design, handler implementation, and unit tests.
  • Set up the eval foundation for agents with golden transcripts, rubric-based judges, and regression suites.

Clutch is a vertical SaaS company backed by Andreessen Horowitz that helps credit unions become fintech lenders, providing affordable lending solutions to over 130 million Americans. The team is small, ambitious, and shipping fast with a culture that values pragmatism and real autonomy.

United States

  • Design and deliver production AI and agentic systems across document intelligence, workflow automation, and copilots.
  • Own architecture decisions for LLM-based systems, including retrieval, tool use, orchestration, memory, and evaluation.
  • Manage evals and observability for production AI, ensuring system accuracy and detecting regressions.

Maxwell is a mortgage technology and fulfillment company on a mission to make lending simpler, faster, and more accessible. It is a remote-first team that takes craft seriously and moves with intention, building a cutting-edge AI company in mortgage technology.

US

  • Develop and improve NLP systems and language model-powered experiences.
  • Fine-tune and optimize language models for domain-specific use cases and build evaluation frameworks.
  • Deploy and maintain production-grade ML systems on GPU infrastructure with a focus on scalability and safety.

BetterHelp removes barriers to therapy and makes mental health care accessible globally. Founded in 2013, it is now the world's largest online therapy service with over 30,000 licensed therapists, and it invests deeply in employee well-being and professional development.

India, Australia, New Zealand

  • Own availability, latency, and throughput SLOs across a large fleet of generative media model APIs serving production traffic at scale.
  • Build monitoring, alerting, and observability to catch ML-specific failures, output quality degradation, and model regressions before customers do.
  • Harden model deployment workflows with canary releases, shadow testing, automated rollbacks, and validation gates to ship new model versions safely.

Fal is the generative media ecosystem powering the next generation of AI products, providing infrastructure, tools, and model access for developers and enterprises. As a unified platform for high-performance inference, orchestration, and observability, fal is becoming the ecosystem ambitious teams build on in a market projected to grow by hundreds of billions over the next decade.

US

  • Develop and operate production-ready AI and machine learning systems for enterprise-scale products.
  • Build and optimize LLM-powered applications, RAG pipelines, and intelligent agents.
  • Implement software engineering best practices for AI development including CI/CD and testing.

Our partner is building enterprise-grade AI solutions that deliver measurable business impact. They offer a remote-friendly work environment with a collaborative engineering culture focused on innovation, quality, and continuous learning.

US

  • Act as the primary NVIDIA AI Enterprise and vector database expert for HyperPOD customer environments, owning end-to-end triage across GPU, NVAIE services, and storage.
  • Author and maintain support triage runbooks, diagnostics bundles, and collaborate on observability dashboards for platform health and RAG metrics.
  • Build hands-on labs, PoCs, and reusable technical assets to accelerate support readiness and partner success.

DataDirect Networks (DDN) is a global market leader in AI and high-performance data storage, powering many of the world's most demanding AI data centers across industries like life sciences, healthcare, financial services, and research. They are a global company with strong innovation, customer-centricity, and a team of passionate professionals committed to shaping the future of AI and data management.

United States, Canada

  • Build and operate the real-time inference service for the risk decision engine with low latency and high availability.
  • Own model deployment infrastructure including CI/CD, shadow mode, and staged rollouts.
  • Build model observability and partner with Risk Data Science for production operation.

Mercury is a fintech company that provides banking services for startups via partner banks. The company is committed to creating a safe environment and values diversity, with a growing team focused on innovation.

Global, Unlimited PTO

  • Lead and scale the Forward Deployed Engineering and Technical Support teams, defining engagement models and operating standards.
  • Own the FDE engagement lifecycle from technical discovery to deployment guidance, ensuring customer value.
  • Drive operational discipline across support tools and partner with Sales, Product, and Engineering on roadmap alignment.

Runpod is the AI Developer Cloud. More than one million developers use the platform to experiment, train, deploy, and scale AI. The company is a small, remote-first team that has processed over 20 billion inference requests and closed a $100M Series A.