Source Job

ANZ

  • Building world-class AI infrastructure to support a 100+ person research team.
  • Designing and scaling multi-cloud systems that support high-performance model training and inference.
  • Improving monitoring, alerting and system observability for AI workloads.

AWS GCP Terraform Kubernetes DevOps

20 jobs similar to Engineering Manager (Infra) - AI Reliability

Jobs ranked by similarity.

India

  • Design and manage AWS infrastructure for AI services.
  • Implement Infrastructure as Code using Terraform.
  • Collaborate with cross-functional teams to enhance performance.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

$90–$120/hr
US

  • Design and maintain cloud-based infrastructure for AI development pipelines.
  • Automate infrastructure using tools like Terraform, Ansible, or similar.
  • Monitor and improve system performance, reliability, and scalability.

Labelbox builds the data engine that accelerates breakthrough AI, enabling safer, smarter models in production and is trusted by leading research labs and enterprises worldwide.

Canada 5w PTO

  • Design and evolve infrastructure systems to ensure scalability, reliability, and cost efficiency.
  • Lead and mentor a distributed infrastructure team, fostering a collaborative and inclusive culture.
  • Oversee all cloud environments supporting MZLA’s products and business systems.

MZLA Technologies Corporation (MZLA) is a wholly owned, for-profit subsidiary of the Mozilla Foundation and home to Thunderbird. They are a small but growing team of 50+ people distributed across seven countries building an open-source email and productivity platform.

Europe 4w PTO

  • Design, build, and own AWS-based MLOps infrastructure, defining standards for security, automation, cost-efficiency, and governance.
  • Architect and operate production Kubernetes clusters, including containerizing and deploying ML models using Docker and Helm.
  • Build and maintain CI/CD pipelines for training, validation, and deployment of ML workloads, implementing canary, blue-green, and rollback strategies.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

$120,000–$140,000/yr

  • Design and plan cloud-native systems aligned with business goals and security best practices.
  • Implement and support AI-based automation tools and services.
  • Continuously tune cloud and automation workloads to improve reliability and performance.

PerfectServe offers unified healthcare communication solutions to help physicians, nurses, and care team members provide exceptional patient care.

$150,100–$188,100/yr
US Canada 2w PTO 12w maternity 12w paternity

  • Create and test reliable cloud infrastructure services that support Webflow’s range of products.
  • Balance reliability, scalability, and cost efficiency concerns while refactoring and modernizing existing services.
  • Collaborate with product engineering teams to deliver new solutions for services and ways of working that might not exist yet.

Webflow is the leading visual development platform for building powerful websites without writing code.

US

  • Lead and manage the Platform Engineering team, providing technical guidance and mentorship.
  • Design, build, and evangelize Golden Paths and Service Scaffolding to reduce friction across the development lifecycle.
  • Oversee the design, implementation, and maintenance of Shared DB Platforms, ensuring optimal performance, integrity, and security across the organization.

Founded in 2012, EasyPost is a YC unicorn whose mission is to make shipping simple for businesses from garage startups to the Fortune 500.

  • Lead the design, implementation, and continuous improvement of our cloud-native platform infrastructure.
  • Create and maintain tooling and automation that improves efficiency and developer experience.
  • Drive platform optimization initiatives focused on performance, cost efficiency, and reliability.

Intelerad's medical imaging solutions streamline the flow of information, simplifying complex processes, maximizing efficiencies, and shining a light on the unknown.

ANZ

  • Own challenging infrastructure problems end-to-end by understanding how engineers use the platform.
  • Design scalable, maintainable services and contribute to technical proposals.
  • Contribute to the roadmap, highlighting opportunities, validating approaches and helping keep our platform solutions current with cloud best practices.

Canva's intuitive suite of design products is powered by our large distributed infrastructure group, which sets large and ambitious goals.

US Canada

  • Design, build, and maintain our petabyte-scale data and ML platform.
  • Ensure reliability, security, scalability, and performance across our internal systems.
  • Automate deployment pipelines, monitoring, and alerting for ML and data services.

Serve Robotics is reimagining how things move in cities with its personable sidewalk robot designed to take deliveries away from congested streets.

US

  • Ramp up on AWS architecture, Terraform patterns, Kubernetes setup, CI/CD pipelines, and observability stack.
  • Take ownership of an infrastructure area: CI/CD pipelines, observability stack, Kubernetes platform, or AWS security/networking.
  • Shape infrastructure direction with design docs, RFC proposals, and mentoring engineering teams.

Bastion enables financial institutions and enterprises to issue regulated stablecoins, generate revenue on reserves, and expand their ecosystems.

Australia New Zealand

  • Become a go-to expert in Canva’s Infrastructure for System Integrations, helping shape the direction and technical foundation of our 2P integrations.
  • Partner closely with product teams across the Ecosystem supergroup to understand the product landscape and influence roadmaps with technical insight.
  • Craft clear and accessible technical documentation, including reference architectures, best practices, white papers, and solution briefs.

Canva is a company redefining how the world experiences design.

  • Design, implement, monitor and maintain Sysdig's Infrastructure at scale on different clouds and on-prem.
  • Collaborate with development teams to improve system reliability, performance, and scalability.
  • Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.

Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. While the job posting doesn't specify company size, they appear to have a culture that values transparency, collaboration, grit, and humility.

$120,000–$205,000/yr
US

  • Dive into client environments to explore application workloads, infrastructure dependencies, and security controls.
  • Aid in designing and implementing migration strategies to reduce risks and unlock automation opportunities.
  • Develop scalable and secure infrastructure using Infrastructure as Code (IaC) tools.

Kunai builds full-stack technology solutions for banks, credit and payment networks, infrastructure providers, and their customers.

$160,000–$182,000/yr
US

  • Lead and mentor multiple teams across SRE, cloud infrastructure, and platform engineering functions.
  • Drive multi-team initiatives to deliver scalable, secure, and cost-efficient infrastructure leveraging AWS-native and serverless technologies.
  • Drive adoption of FinOps practices and partner with finance and product teams on budgeting and forecasting.

Model N is the leader in revenue optimization and compliance for pharmaceutical, medtech, and high-tech innovators. Model N is trusted by over 150 of the world’s leading companies across more than 120 countries.

$230,000–$265,000/yr
US

  • Lead a cross-functional AI product engineering team, owning strategy and execution across multiple Copilot experiences.
  • Drive technical direction and product evolution for the AI stack, inference backend, and user-facing AI products.
  • Interface closely with the AI Research and Model Training team to align on model capabilities, evaluation methodology, data/training pipelines, and the productionization path for new models and techniques.

Cribl is a data engine for IT and Security. Many big names in demanding industries trust Cribl to solve their pressing data needs. The company is growing rapidly and has a collaborative team of curious, motivated people who are passionate about putting customers first.

Latin America

  • Design, build, and maintain cloud infrastructure primarily on AWS, with exposure to GCP and Azure.
  • Support developers and product teams by troubleshooting infrastructure and deployment issues.
  • Enforce and promote security best practices, including least-privilege access and monitoring.

EX Squared LATAM works with international clients to build scalable, data-driven platforms that support complex digital ecosystems. They have a multicultural, LATAM-based engineering team with a culture focused on collaboration, ownership, and continuous improvement.

US Unlimited PTO

  • Deploy and manage cloud infrastructure across all three clouds using Terraform IaC.
  • Architect, build, and maintain reliable CI/CD pipelines in GitHub Actions and Argo CD.
  • Contribute to decisions around our departmental roadmap and project priorities.

Coalesce is the only data transformation and governance platform designed for the AI era, improving data professionals' lives since its founding in 2020.

Brazil 26w maternity 4w paternity

  • Support the evolution of our platform by improving scalability, reliability, observability, and security.
  • Proactively identify bottlenecks and unlock the autonomy of the entire engineering team.
  • Maintain infrastructure & deployment pipelines and collaborate with engineering teams on architectural decisions and production-readiness practices.

Feegow joined the Docplanner Group, a health-tech company, in 2022 and is dedicated to developing innovative solutions for physicians and managers.