Source Job

Global

  • Own and operate GPU and accelerator clusters for AI training, inference, and experimentation, ensuring reliability and cost-efficiency.
  • Build and optimize scheduling, orchestration, and serving systems using frameworks like vLLM and Triton to improve latency, throughput, and memory efficiency.
  • Partner with ML engineers to remove workflow bottlenecks and build observability for GPU utilization, capacity, and incident response.

Kubernetes, Python, Distributed Systems

11 jobs similar to AI Compute and Infrastructure Engineer

Jobs ranked by similarity.

Poland

  • Design and deploy GPU cluster architectures using tools like Ansible, Terraform, Kubernetes, and Slurm.
  • Lead technical deep-dives and workshops, and present solutions to stakeholders, translating complex concepts.
  • Automate provisioning and monitoring with Infrastructure as Code, and produce documentation and training materials.

Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering digital experiences worldwide. The company collaborates with leading technology partners and employs over 550 professionals building foundational technologies.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.

APAC

  • Partner directly with customer engineering teams running training and inference workloads in production.
  • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
  • Identify recurring patterns across customer issues and drive long-term reliability improvements.

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. They serve solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.

US

  • Design, build, and maintain the core infrastructure layer supporting GenAI products.
  • Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
  • Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.

PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, closely integrated with key horizontal partners to ensure delivery of safe, scalable and high-impact AI Products.

Europe

  • Define and evolve the architecture and roadmap for enterprise‑scale Data and AI platforms.
  • Design and build multi‑tenant, multi‑region, highly available AI platforms with governance.
  • Lead capacity planning and cost optimization strategies for GPU and CPU workloads.

NEORIS accelerates growth in Ibero‑America, combining global engineering with regional expertise. With over 60,000 professionals across 55+ countries, they offer technical specialization career paths and value responsibility, collaboration, creativity, and commitment.

Global

  • Contribute to the development of the Everywhere Inference platform, a Kubernetes-based solution.
  • Design and implement APIs and developer tools to simplify deployment, management, and monitoring of AI applications.
  • Optimize serverless container workflows for AI workloads, ensuring performance, scalability, and seamless autoscaling.

Gcore provides infrastructure and software solutions for AI, cloud, network, and security. They have 550+ professionals globally and power everything from real-time communication and streaming to enterprise AI and secure web applications.

Canada

  • Define, drive, design, and build/ship end-to-end solutions that solve real customer problems.
  • Contribute to the end-to-end AI/ML software development lifecycle, ensuring reproducible research.
  • Drive architecture, design, and delivery of advanced ML systems in the Product R&D team.

Kinaxis is a global leader in modern supply chain orchestration. Known for its AI-infused platform and transparency across end-to-end supply chains, Kinaxis helps customers make faster, better decisions. The company has over 2000 employees worldwide and is recognized with Top Employer awards.

Canada

  • Design, develop, and maintain core infrastructure supporting large-scale optimization engines and planning workflows to improve scalability and performance.
  • Analyze and optimize performance bottlenecks in optimization pipelines, focusing on compute, memory usage, and data flow for complex planning problems.
  • Contribute to evolving platform architecture, designing systems for large datasets and parallel execution while ensuring enterprise-grade reliability and maintainability.

Kinaxis is a global leader in modern supply chain orchestration, providing an AI-powered platform for end-to-end supply chain transparency and faster decision-making. The company has over 2000 employees globally, is a multi-time Top Employer award winner, and fosters a culture of innovation that takes technology and customers seriously while keeping its internal environment collaborative and not too serious.

$180,000–$300,000/yr
US, 20 weeks maternity leave, 12 weeks paternity leave

  • Act as a trusted advisor to clients, providing technical expertise and guidance throughout engagements.
  • Conduct PoCs, workshops, presentations, and training sessions on GPU cloud technologies and best practices.
  • Collaborate with clients to understand their business requirements and develop solution architectures.

Lavendo partners with startups and high‑growth companies to help them hire top‑tier sales, go‑to-market, and technical talent. They are an equal opportunity workplace and consider all qualified applicants without regard to race, color, religion, national origin, age, sex, marital status, ancestry, disability, genetic information, veteran or military status, gender identity or expression, sexual orientation, or any other characteristic protected by law.

$245,000–$295,000/yr
US

  • Build, lead, and grow the platform team, setting the pace and creating an environment where strong engineers want to stay.
  • Remain hands-on by writing code, reviewing architecture decisions, and debugging production issues while owning the platform's technical direction.
  • Steer projects through ambiguity, solving technical problems, resourcing gaps, and prioritization calls to ensure the infrastructure scales effectively.

OpenRouter is the leading AI routing and infrastructure layer that enterprises use to access, manage, and optimize the best large language models across providers. It's a fast-scaling technology company powering advanced AI teams by providing flexibility, scalability, and future-proof infrastructure.

India

  • Design end-to-end AI integration architectures connecting LLM APIs, vector databases, and inference systems to existing backend infrastructure.
  • Build reusable ML infrastructure components like feature pipelines, model serving layers, and evaluation frameworks that multiple portfolio companies standardize on.
  • Establish AI system integration best practices and governance patterns that become repeatable playbooks across the holding company.

Emergence is a thematic holding company backed by the Pritzker Organization focused exclusively on acquiring and scaling category-defining software businesses. They invest in focused portfolios and specialized operating groups with deep domain expertise and proven playbooks.