Source Job

Australia New Zealand

  • Building world-class AI infrastructure to support a 100+ person research team.
  • Designing and scaling multi-cloud systems that support high-performance model training and inference.
  • Improving monitoring, alerting and system observability for AI workloads.

AWS GCP Terraform Kubernetes

20 jobs similar to Engineering Manager (Infra) - AI Reliability

Jobs ranked by similarity.

Australia

  • Support and evolve the reliability of platforms used by the AI Research team.
  • Ensure production services meet expectations for availability, latency, and operational readiness.
  • Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps.

Algolia is a pioneer and market leader in AI Search, empowering 17,000+ businesses to deliver blazing-fast, predictive search and browse experiences. They raised $150 million in Series D funding, quadrupling their valuation to $2.25 billion, and are investing in their market-leading platform.

US Unlimited PTO

  • Influence the technical direction for infrastructure and platform capabilities that support our rapidly growing AI product suite.
  • Architect and evolve our cloud infrastructure (primarily on AWS) to support current and future products.
  • Mentor and level up engineers across Platform and product teams; review design docs, guide architecture decisions, and model high standards.

Rad AI is on a mission to transform healthcare with artificial intelligence. Our AI-driven solutions are revolutionizing radiology—saving time, reducing burnout, and improving patient care. Rad AI has secured over $140M in funding and our valuation is at $528M.

Australia New Zealand

  • Leading a new team pioneering AI developer tooling workflows that boost engineering productivity at scale
  • Driving the strategy and execution of Canva’s “AI-Orchestrated Developer Experience” goal
  • Building agentic, context-aware AI systems that understand and integrate into our software development lifecycle

Canva is a design platform redefining how the world experiences design. They operate at scale across multiple campuses and foster a culture that embraces change and empowers employees.

Global

  • Design, implement, and maintain high-performance ML training and inference platforms.
  • Ship tools that allow any ML engineer to deploy a model in minutes, not days.
  • Improve scalability, reliability, and cost efficiency of model training and serving systems.

Speechify's mission is to make sure that reading is never a barrier to learning. With nearly 200 people around the globe working in a 100% distributed setting, Speechify's team includes frontend and backend engineers, AI research scientists, and others.

Global

  • Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
  • Optimize for AI/ML training by collaborating with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance.
  • Troubleshoot and resolve complex issues and proactively identify infrastructure bottlenecks, performance degradation, and system failures.

Cohere's mission is to scale intelligence to serve humanity by training and deploying frontier models for developers and enterprises who are building AI systems. The company is composed of researchers, engineers, and designers passionate about their craft and believes that a diverse range of perspectives is a requirement for building great products.

North America Unlimited PTO

  • Build and operate scalable backend services and internal APIs for the AI platform.
  • Integrate LLMs and AI tool execution into reliable, production-ready workflows.
  • Own production reliability for AI platform infrastructure through observability, alerting, and incident response.

MaintainX is the world's leading Asset and Work Intelligence platform for industrial and frontline environments. They are a modern IoT-enabled cloud-based tool for reliability, safety, and operations on physical equipment and facilities, powering operational excellence for 13,000+ businesses. MaintainX recently completed a $150 million Series D round, at a valuation of $2.5 billion.

US

  • Ensure the smooth operation and high availability of Clarifai's core services
  • Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
  • Design and implement scalable, secure, and cost-effective infrastructure solutions

Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a diverse, globally distributed team with $100M in funding and are committed to building a diverse and inclusive team.

US

  • Make deployments boring (in the best way possible)
  • Own CI/CD pipelines: optimize build times, improve caching, reduce flakiness
  • Evolve our Kubernetes (EKS) deployment strategy for reliability and speed

Obvious is building an AI-native workspace, an operating system for work that puts co-intelligence at the center. They are a small and talent-dense team with world-class builders, former founders, and leaders from companies like Netflix, Google, and Meta.

Europe 5w PTO

  • Design, implement, and manage AI Platform architecture.
  • Control AI-related costs, including models, GPUs, and other resources.
  • Collaborate with ML teams to operationalize AI models and integrate them into systems.

Docplanner empowers patients by letting them leave and read reviews about their visits, and gives doctors the technology to manage bookings easily and save time. They are leaders in 13 countries with 2,500+ employees globally and maintain a startup mindset.

$100,000–$185,000/yr
US Unlimited PTO

  • Work hands-on with the infrastructure that supports our distributed & highly scalable services.
  • Gather requirements from customers and adapt manifests and software to support new environments.
  • Automate and optimize the release pipeline to make it as frictionless as possible.

Arize AI is transforming the world by providing a leading AI observability and evaluation platform. They empower AI engineers to ship high-performing, reliable agents and applications, unifying build, test, and run in a single workspace, with over 150 leading enterprises as customers.

Europe

  • Own the reliability, scalability, and performance of Peec AI’s core systems and infrastructure
  • Design, build, and maintain the tooling, automation, and monitoring that keep our services fast, secure, and highly available
  • Partner closely with product and engineering teams to ensure new features are reliable, observable, and easy to operate from day one

Peec AI is one of Europe’s fastest-growing Series A startups. They provide exciting and challenging work in the AI space.

Australia New Zealand

  • You’ll own challenging infrastructure problems end-to-end.
  • You’ll design scalable, maintainable services and contribute to technical proposals.
  • You’ll contribute to the roadmap for our Provisioning team.

Canva is a design platform that enables users to create a variety of visual content. They have campuses in Sydney and Melbourne, co-working spaces in other major cities, and offer a flexible work environment.

Australia New Zealand

  • Owning and leading a high-performing engineering team, with a strong focus on career development, wellbeing, and impact
  • Partnering with Engineering Managers and Staff Engineers to define and execute Canva’s longer-term observability strategy
  • Translating high-level goals into clear roadmaps, priorities, and delivery plans

Canva is a design platform redefining how the world experiences design. They have campuses in Sydney and Melbourne and co-working spaces across ANZ, trusting their Canvanauts to find the balance that helps them and their teams do their best work.

US Unlimited PTO

  • Build Enterprise-Scale Infrastructure: leveraging infrastructure-as-code to manage complex cloud environments.
  • Sustain Platform Health and Performance: owning critical systems in production, including reliability and security.
  • Enable Teams and Customers to Move Faster: creating abstractions and tooling that deploy, run, and scale AI/ML workloads.

Cake is on a mission to make cutting-edge AI accessible to enterprise teams. Backed by top investors, Cake is seeing strong adoption and is positioned for rapid growth in the next 12 months, emphasizing ownership, clear communication, and collaboration.

US

  • Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools.
  • Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime.
  • Implement comprehensive monitoring, logging, and alerting to provide deep visibility into system health.

Filevine provides cloud-based workflow tools for legal professionals, helping them manage organizations and serve clients. They are recognized as a fast-growing and innovative technology company with a team of passionate professionals.

US

  • Develop and manage strategic technical partnerships across the AI infrastructure ecosystem.
  • Support Business Development leadership as the primary technical liaison between Mirantis and strategic technology partners.
  • Collaborate with product management, engineering, and sales to drive joint solution development, technical validation, and technical go-to-market alignment.

Mirantis is a Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI, machine learning, and data-intensive applications. They are committed to open standards and freedom from lock-in, ensuring that customers retain full control of their infrastructure strategy.

South America

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.

Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. Even with thousands of employees spread across multiple continents, they still maintain a culture that inspires innovation, rewards risk-taking and celebrates success.

US Canada Argentina India

  • Work with research teams to design and build our training infrastructure
  • Prototype new training frameworks and production-ize solutions at scale
  • Design, optimize and test model integration infrastructure

Clarifai is a leading AI platform specializing in computer vision, NLP, LLMs, and audio recognition, helping organizations transform unstructured data into structured data. Founded in 2013, they operate remotely across multiple countries with backing from industry leaders, fostering a diverse and equal opportunity workplace.

$100,000–$165,000/yr
Europe Latin America 3w PTO

  • You’ll lead the initial setup of our DevOps and platform engineering practices
  • You’ll design and deliver an internal platform for personal or feature environments to boost developer velocity
  • You’ll build and maintain AWS-based infrastructure for performance, scale, and security

DualEntry, founded in 2024, is a rapidly growing AI startup focused on revolutionizing the finance industry. Our AI-native ERP platform helps accounting teams achieve more with less effort, automating manual data entry using AI for businesses ranging from $5M-ARR to NYSE-listed companies.

$170,000–$208,000/yr
US

  • Lead strategic initiatives to modernize and automate OpenSesame’s infrastructure.
  • Drive greater intelligence, resilience, and scalability across our global learning platform.
  • Partner across teams to embed automation, reliability, and intelligent systems into everything we do.

OpenSesame is the trusted partner for Workforce Reinvention in the age of AI. They deliver integrated software, curated and customizable content, and expert services – embedded into existing learning, HR, and work systems.