Source Job

Europe

  • Lead investigation and resolution of complex infrastructure, networking, and platform incidents.
  • Provide technical leadership for Kubernetes platform operations and drive automation initiatives.
  • Mentor engineers and develop operational standards, runbooks, and best practices.

Linux, Kubernetes, Networking, Infrastructure-as-Code

20 jobs similar to Senior AI Infrastructure & Platform Operations Engineer

Jobs ranked by similarity.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

Europe

  • Monitor, operate, and support production AI infrastructure platforms.
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

Europe

  • Operate and maintain Linux-based infrastructure, deploy and scale Kubernetes clusters, and implement automation with Ansible and GitOps.
  • Design networking architecture, build observability stacks, and lead incident response across the platform.
  • Manage virtualization layers and collaborate with development teams to optimize resource utilization and system availability.

Pragmatike develops cutting-edge solutions in Cloud Computing, focusing on ambitious projects with a culture of collaboration and innovation. The team is passionate and collaborative, working in a dynamic and flexible environment to shape tomorrow's technologies.

Ireland

  • Diagnose and resolve complex production issues across Linux, Kubernetes, networking, storage, and GPU systems.
  • Act as a senior escalation point for critical incidents, collaborating with engineering teams on root cause analysis.
  • Develop tools and automation in Python, Bash, or Go to improve troubleshooting efficiency and observability.

The partner company provides advanced AI and cloud infrastructure solutions, supporting large-scale distributed computing and AI workloads. They operate in a fast-moving, collaborative environment with highly skilled engineering teams focused on cutting-edge technology and operational excellence.

Germany

  • Investigate and resolve complex production issues across cloud and customer environments with root cause analysis.
  • Debug across Linux, Kubernetes, networking, storage, and GPU-based systems as a senior escalation point.
  • Develop internal tools and automation to enhance troubleshooting efficiency and platform reliability.

Our partner is a company building cutting-edge AI and cloud infrastructure solutions. They foster an inclusive, innovation-driven culture with a strong focus on engineering excellence and continuous improvement.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as a first-class collaborator in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

US

  • Design and build the orchestration layer using Kubernetes, Slurm, or comparable technologies.
  • Build customer-facing platform APIs, CLIs, web portals, and SDKs.
  • Drive infrastructure-as-code, multi-tenant isolation, and platform reliability.

GPU One provides GPU-as-a-Service (GPUaaS), turning raw GPU infrastructure into a usable cloud platform. The company is building a multi-tenant orchestration layer to serve customers at scale, with a focus on platform engineering and AI infrastructure.

Global

  • Troubleshoot and resolve issues in customer environments based on Linux, OpenStack, Kubernetes, and networking technologies, owning escalations end-to-end.
  • Reproduce customer issues in labs, confirm bug reports, and collaborate with the development team to improve product stability.
  • Communicate with customers during incidents via email and remote sessions, guiding them through troubleshooting and resolution processes.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI and data-intensive applications. With deep expertise in open source and Kubernetes, Mirantis empowers platform engineering teams across enterprises worldwide.

US

  • Act as the primary NVIDIA AI Enterprise and vector database expert for HyperPOD customer environments, owning end-to-end triage across GPU, NVAIE services, and storage.
  • Author and maintain support triage runbooks, diagnostics bundles, and collaborate on observability dashboards for platform health and RAG metrics.
  • Build hands-on labs, PoCs, and reusable technical assets to accelerate support readiness and partner success.

DataDirect Networks (DDN) is a global market leader in AI and high-performance data storage, powering many of the world's most demanding AI data centers across industries like life sciences, healthcare, financial services, and research. They are a global company with strong innovation, customer-centricity, and a team of passionate professionals committed to shaping the future of AI and data management.

US

  • Design, deploy, and manage production Kubernetes clusters with workload scheduling, resource quotas, network policies, and RBAC.
  • Build and optimize CI/CD pipelines using Infrastructure as Code and GitOps principles.
  • Implement observability solutions using Prometheus, Grafana, and OpenTelemetry for performance tuning and reliability.

VerTALENTS is a subsidiary of VerSprite Cybersecurity, specializing in technology staffing. The company connects top technical talent with industry clients through various methods, adding value to both clients and candidates for full-time and contracting opportunities.

Europe

  • Support and improve hybrid production infrastructure for 15+ development teams handling 100+ products, 10K+ domains, and billions of hits per day.
  • Architect and plan improvements of a multi-datacenter development environment, advocating for migration to automated, elastic infrastructures using cloud, Kubernetes, and serverless technologies.
  • Document processes, monitor performance metrics, promote CI/CD practices, and mentor junior DevOps engineers.

Aylo is a tech pioneer that offers world-class adult entertainment and games on safe, popular platforms. With an international team of dynamic innovators, the company focuses on trust-and-safety protocols and has offices in Montreal, Austin, and Nicosia.

US

  • Develop and maintain core messaging, positioning, and value propositions for Mirantis’ AI infrastructure and cloud-native platform portfolio.
  • Translate technical capabilities like GPU orchestration and MLOps into compelling narratives for practitioners, platform engineers, and executives.
  • Produce high-quality technical content including solution briefs, white papers, blog posts, and enable sales teams with battlecards and objection-handling guides.

Mirantis is a Kubernetes-native AI infrastructure company that enables organizations to build scalable, secure, and sovereign infrastructure for AI and data-intensive workloads. The company fosters a culture of openness, collaboration, risk-taking, and continuous growth, working with passionate colleagues to help Fortune 500 customers implement next-generation cloud technologies.

India

  • Develop and maintain automated provisioning pipelines for bare-metal servers across global data centers.
  • Perform security monitoring, and repair and recover from hardware or software failures.
  • Act as technical lead, mentor engineers, and report directly to the CTO.

Kayzen is a mobile demand-side platform (DSP) that democratizes programmatic advertising. With 160B+ daily ad requests and 1B+ ads served per day globally, it powers top mobile marketing teams with a focus on performance, transparency, and control.

Global

  • Design, build, and operate scalable cloud infrastructure and infrastructure-as-code for globally distributed services.
  • Develop and maintain CI/CD pipelines to support rapid and reliable delivery of backend and client components.
  • Own service reliability by implementing observability (metrics, logs, tracing) and leading incident response with actionable improvements.

NetBird develops an open-source zero-trust network security platform that is easy to use and affordable for teams of all sizes. Since its launch in 2021, it has gained trust among thousands of companies and connects hundreds of thousands of users worldwide, driven by a community-focused culture.

Global, Unlimited PTO

  • Lead and scale the Forward Deployed Engineering and Technical Support teams, defining engagement models and operating standards.
  • Own the FDE engagement lifecycle from technical discovery to deployment guidance, ensuring customer value.
  • Drive operational discipline across support tools and partner with Sales, Product, and Engineering on roadmap alignment.

Runpod is the AI Developer Cloud. More than one million developers use the platform to experiment, train, deploy, and scale AI. The company is a small, remote-first team that has processed over 20 billion inference requests and closed a $100M Series A.

Global

  • Act as the final escalation point for complex Cloud infrastructure issues, analyzing logs and metrics to identify root causes.
  • Own high-severity incidents, coordinate resolution with Engineering, DevOps, and SRE teams, and contribute to preventive actions.
  • Mentor L1 and L2 support engineers, create runbooks and SOPs, and collaborate with Product teams to reproduce issues.

Gcore provides infrastructure and software solutions for AI, cloud, network, and security, powering real-time communication, streaming, enterprise AI, and secure web applications. With 550+ professionals globally, they collaborate with partners like Intel, NVIDIA, Dell, and Equinix to support the digital ecosystem.

US, Unlimited PTO

  • Design and build cloud-native infrastructure for reliability, observability, and automation across GCP, GKE, and Cloud Run.
  • Own incident response, root cause analysis, escalation workflows, and runbooks to prevent hard problems from recurring.
  • Develop Infrastructure as Code, CI/CD pipelines, and operational tooling to improve developer velocity and platform efficiency.

CertifyOS is building the data infrastructure that powers modern healthcare, automating provider licensing, enrollment, credentialing, and network monitoring through an API-first platform. The company is backed by leading investors, and its team has deep experience in provider data systems. It values authenticity, accountability, collaboration, results, and openness to feedback.

EMEA

  • Build and operate production-grade model serving infrastructure using vLLM, TGI, or Triton frameworks.
  • Design and implement auto-scaling, multi-model architectures, and intelligent request routing for ML inference.
  • Optimize GPU utilization, memory efficiency, and observability to ensure low-latency, cost-effective systems.

They are a distributed cloud infrastructure startup building AI-native cloud services with GPU-powered compute. The company is well-funded, fast-scaling, and operates in a remote-first environment with a focus on sustainability and decentralization.

US

  • Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
  • Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.

Zscaler accelerates digital transformation to secure customers with a cloud-native Zero Trust Exchange platform. The company processes over 200 billion transactions daily and fosters a culture of execution, collaboration, and accountability.

Germany, 6w PTO

  • Architect and scale the cloud platform behind a mission-critical SaaS product used globally.
  • Lead Infrastructure as Code maturity and drive automation, reliability, and cost optimisation.
  • Own uptime, SLAs, and incident management practices while mentoring engineers.

Innocraft (trading as Matomo) provides an open-source analytics platform trusted by enterprises and governments for full data ownership. The company values diversity and inclusion, and operates with a stable, mature product and a strong engineering team.