Own and operate GPU and accelerator clusters for AI training, inference, and experimentation, ensuring reliability and cost-efficiency.
Build and optimize scheduling, orchestration, and serving systems using frameworks like vLLM and Triton to improve latency, throughput, and memory efficiency.
Partner with ML engineers to remove workflow bottlenecks and build observability for GPU utilization, capacity, and incident response.
Kraken is a crypto exchange platform building premium financial products for traders and institutions, accelerating global crypto adoption. It is a mission-driven, fully remote company with a world-class team of crypto experts spread across more than 70 countries.
Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.
Build and maintain CI/CD pipelines and deployment infrastructure.
Leverage AI to automate analysis and resolution of production issues.
Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.
Benchmark FP8 quantization across GPU families and ship a production config that delivers a speedup.
Evaluate serving frameworks with speculative decoding to improve performance.
Build a fine-tuning pipeline to enable faster model training and deployment.
Fathom eliminates the needless overhead of meetings with an AI assistant that captures, summarizes, and organizes key moments. They are a small company that creates magical experiences through focused builders and values a supportive environment.
Define, drive, design, and build/ship end-to-end solutions that solve real customer problems.
Contribute to the end-to-end AI/ML software development lifecycle, ensuring reproducible research.
Drive architecture, design, and delivery of advanced ML systems in the Product R&D team.
Kinaxis is a global leader in modern supply chain orchestration. Known for its AI-infused platform and transparency across end-to-end supply chains, Kinaxis helps customers make faster, better decisions. The company has over 2000 employees worldwide and is recognized with Top Employer awards.
Design and deploy GPU cluster architectures using tools like Ansible, Terraform, Kubernetes, and Slurm.
Lead technical deep-dives and workshops, and present solutions to stakeholders, translating complex concepts.
Automate provisioning and monitoring with Infrastructure as Code, and produce documentation and training materials.
Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering digital experiences worldwide. The company collaborates with leading technology partners and employs over 550 professionals building foundational technologies.
Maintain the reliability and performance of customer environments remotely, supporting Mirantis OpenStack/k0s layers.
Diagnose and resolve system-level issues, requiring hands-on Linux administration experience.
Troubleshoot customer environments based on Linux, OpenStack, Kubernetes, networking, and other cloud technologies; detect, report, and resolve issues.
Mirantis helps enterprises move to the cloud on their terms, delivering a true cloud experience on any infrastructure, powered by Kubernetes. They serve many of the world’s leading enterprises and value openness, collaboration, risk-taking, and continuous growth.
Design, build, and maintain the core infrastructure layer supporting GenAI products.
Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.
PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, working closely with key horizontal partners to ensure delivery of safe, scalable, and high-impact AI products.
Own the messaging and content that defines MinIO's role in the NVIDIA AI Factory across NVIDIA products.
Develop the technical positioning and content for MinIO's integrations with NVIDIA technologies.
Build solutions content that shows how MinIO and NVIDIA infrastructure solve specific customer problems.
MinIO is the industry leader in high-performance object storage. It is the company behind the world’s fastest, most widely deployed object store, powering production infrastructure for more than half of the Fortune 500. The enterprise offering, AIStor, is engineered to handle the scale, speed, and pressure of modern AI and analytics, from terabytes to exabytes, all in a single namespace.