Source Job

Europe

  • Own performance optimization and reliability of large-scale GPU clusters and InfiniBand networking for HPC workloads.
  • Diagnose and resolve complex system-level issues across GPU, network, and compute layers, integrating new hardware components.
  • Develop automation for monitoring, fault detection, and proactive remediation in distributed compute environments.

C, C++, Go, Python, Linux

10 jobs similar to Senior HPC Cluster Engineer

Jobs ranked by similarity.

APAC

  • Operate and maintain large-scale Linux environments (bare metal, clusters, cloud) and monitor system health to ensure high availability.
  • Help scale clusters toward hundreds to thousands of nodes, improving performance, reliability, and resource utilization.
  • Automate operational tasks using Python, Bash, Ansible, or Terraform and contribute to system design and architecture decisions.

Mistral AI builds high-performance, open, and efficient AI systems to power next-generation applications. We are a collaborative, low-ego, and highly technical team operating across Europe, the US, and beyond, scaling rapidly to support thousands of nodes.

US

  • Design, deploy, and maintain HPC clusters and cloud-based compute environments.
  • Support scientific workflows and compute-intensive applications in life sciences.
  • Administer HPC schedulers like SLURM and implement Infrastructure-as-Code with tools such as Terraform.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company operates as a remote team focused on streamlining recruitment through technology.

Europe

  • Monitor, operate, and support production AI infrastructure platforms.
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

US

  • Own the technical design and delivery of subsystems in a high-throughput, low-latency inference platform.
  • Develop robust API layers and SDKs that abstract complex distributed inference orchestration.
  • Build and harden a multi-tenant control plane for metering, rate limiting, and tenant isolation.

Stack develops revolutionary AI and autonomous systems to enhance safety and efficiency in trucking. The team has decades of experience deploying real-world systems and is committed to inclusion, entrepreneurship, and innovation.

Europe

  • Lead the architecture and development of security agent capabilities for runtime threat detection and workload protection.
  • Design and build reusable eBPF-based monitoring for process, file, and network visibility in Linux environments.
  • Partner with product, security, infrastructure, and engineering teams to deliver shared platform capabilities across multiple products.

Datadog is the leading observability and security platform for the AI era, providing businesses with unified visibility across the technology stack. With thousands of employees globally, Datadog values an inclusive culture and offers a hybrid workplace to foster collaboration and work-life harmony.

Europe, 5w PTO

  • Design and optimize high-performance low-level C++ code for system-critical JVM runtime components and distributed communications.
  • Lead complex technical projects from design through production, owning outcomes under real-time constraints.
  • Mentor junior engineers and collaborate across teams to ensure robust solutions through constructive peer review.

Azul develops award-winning enhanced builds of OpenJDK for superior application performance and efficiency. The company is a global leader in Java runtime solutions with offices in Prague, Limassol, and Belgrade, and fosters a culture of collaboration and top engineering expertise.

UK, Netherlands

  • Design and build systems that improve the efficiency of ML training and inference workloads.
  • Develop tooling that helps ML engineers debug, profile, optimize, and monitor model performance.
  • Partner with ML researchers and product teams to identify bottlenecks and drive performance improvements.

Reddit is a community of communities built on shared interests, passion, and trust, hosting the most open and authentic conversations on the internet. With over 100,000 active communities and approximately 126 million daily active users, Reddit is one of the internet's largest sources of information.

India

  • Lead end-to-end sourcing for high-performance compute and sovereign AI cloud platforms.
  • Drive end-to-end contract lifecycles including MPAs, SLAs, and complex negotiations.
  • Track global semiconductor trends to mitigate long-lead-time risks and ensure just-in-time inventory.

Armada is the hyperscaler for the edge, delivering modular AI infrastructure from first deployment to AI factory with speed, scale and sovereignty. With nearly half a billion dollars in funding, Armada is backed by top investors such as Microsoft (M12), Founders Fund, and BlackRock, and has collaborations and partnerships including NVIDIA, Palantir and Dell Technologies.

Europe

  • Design, build, and operate scalable cloud infrastructure using Kubernetes, Terraform, and modern infrastructure-as-code practices.
  • Improve and evolve cloud networking architecture, including VPC/VNet design, peering, routing, DNS, TLS, ingress/egress, and load balancing.
  • Contribute to system reliability through on-call support, incident response, root cause analysis, and performance optimization.

Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company uses automated review and matching to ensure fair candidate evaluation.