Source Job

Global

  • Support deployment, configuration, and maintenance of InfiniBand and Ethernet network infrastructure.
  • Troubleshoot network issues including connectivity, latency, and performance degradation.
  • Collaborate with compute and storage teams to support HPC and AI workloads.

Networking, Linux, Fortinet, Scripting

6 jobs similar to HPC Network Engineer

Jobs ranked by similarity.

EU

  • Design, deploy, and maintain high-performance network infrastructures for HPC environments with a focus on InfiniBand fabrics.
  • Troubleshoot complex network issues across InfiniBand and Ethernet, manage Fortinet solutions, and perform performance tuning.
  • Collaborate with compute, storage, and platform teams to support HPC workloads and document network architecture and procedures.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable infrastructure for modern AI, machine learning, and data-intensive applications. Mirantis serves many of the world’s leading enterprises including Adobe, DocuSign, PayPal, and Volkswagen, fostering a collaborative and innovative work environment.

Europe

  • Own performance optimization and reliability of large-scale GPU clusters and InfiniBand networking for HPC workloads.
  • Diagnose and resolve complex system-level issues across GPU, network, and compute layers, integrating new hardware components.
  • Develop automation for monitoring, fault detection, and proactive remediation in distributed compute environments.

Our partner is building a next-generation AI cloud infrastructure environment, focusing on large-scale high-performance computing systems. They foster a highly technical engineering culture with experts across systems, networking, and virtualization, offering career development and continuous learning opportunities.

US

  • Contribute to end-to-end development of Network Monitoring features, from ideation to implementation within the Datadog Agent.
  • Build and maintain shared eBPF functionality for product teams and investigate complex production issues spanning kernel, eBPF, and agent runtime.
  • Research, prototype, and document solutions to hard problems in eBPF and network monitoring while providing technical input to product decisions.

Datadog is the leading observability and security platform for the AI era, providing businesses with unified visibility across the technology stack to manage complexity at scale. Trusted by Fortune 500 companies and high-growth AI leaders, Datadog fosters an inclusive culture with continuous learning opportunities and comprehensive benefits.

Global

  • Troubleshoot and resolve issues in customer environments based on Linux, OpenStack, Kubernetes, and networking technologies, owning escalations end-to-end.
  • Reproduce customer issues in labs, confirm bug reports, and collaborate with the development team to improve product stability.
  • Communicate with customers during incidents via email and remote sessions, guiding them through troubleshooting and resolution processes.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI and data-intensive applications. With deep expertise in open source and Kubernetes, Mirantis empowers platform engineering teams across enterprises worldwide.

Europe

  • Monitor, operate, and support production AI infrastructure platforms.
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

US, Canada

  • Define and own the architectural roadmap for enterprise, data center, and cloud networks.
  • Co-author the secure design and lifecycle management of high-performance data center networks.
  • Build an AI-agent-based analysis framework for network changes and self-improvement.

Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPUs, delivering industry-leading training and inference speeds. The company partners with top model labs, global enterprises, and AI-native startups, including OpenAI, and has a non-corporate work culture that values continuous learning and growth.