The Team:
- The dedicated AI Compute and Infrastructure team owns the backbone for running AI workloads with control, speed, and reliability.
- You will join a small, senior team working directly with AI researchers, platform engineers, and product teams to build production-grade infrastructure.
Key Responsibilities:
- Design and operate GPU clusters, including scheduling, configuration, workload isolation, and cost optimization.
- Optimize inference pipelines for performance and cost using advanced serving frameworks and tooling.
- Build observability systems for utilization, latency, and capacity, and drive reliability and incident response improvements.
Required Skills:
- 5+ years in infrastructure engineering with hands-on experience operating GPU clusters and ML infrastructure in production.
- Strong systems fundamentals in Linux, networking, containers, and Kubernetes, plus proficiency in Python for automation.
- Experience working with ML serving frameworks, weighing performance tradeoffs, optimizing cost, and building observable, high-availability systems.
About Kraken:
Kraken is a crypto exchange platform building premium financial products for traders and institutions, accelerating global crypto adoption. It is a mission-driven, fully remote company with a world-class team of crypto experts spread across more than 70 countries.