Source Job

US

  • Act as the primary NVIDIA AI Enterprise and vector database expert for HyperPOD customer environments, owning end-to-end triage across GPU, NVAIE services, and storage.
  • Author and maintain support triage runbooks, diagnostics bundles, and collaborate on observability dashboards for platform health and RAG metrics.
  • Build hands-on labs, PoCs, and reusable technical assets to accelerate support readiness and partner success.

Kubernetes Linux

5 jobs similar to Platform Support Architect

Jobs ranked by similarity.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

Europe

  • Monitor, operate, and support production AI infrastructure platforms.
  • Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

APAC

  • Partner directly with customer engineering teams running training and inference workloads in production.
  • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
  • Identify recurring patterns across customer issues and drive long term reliability improvements.

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. They serve solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.

US

  • Design, build, and maintain the core infrastructure layer supporting GenAI products.
  • Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
  • Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.

PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, closely integrated with key horizontal partners to ensure delivery of safe, scalable and high-impact AI Products.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.