Job Description
Looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You'll work in our GPU cluster, helping researchers train and scale foundation models with frameworks such as Hugging Face Transformers, Accelerate, DeepSpeed, and FSDP. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale. This role is not research: it's about building and scaling the systems that let researchers move fast and models grow big. You'll partner closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.
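For a flavor of the sharding decisions this role owns, here is a minimal sketch of wrapping a transformer block with PyTorch FSDP and picking a sharding strategy. The model, sizes, and launch setup are illustrative assumptions, not our actual stack.

```python
# Hypothetical sketch (not our actual pipeline): choosing an FSDP sharding
# strategy for one transformer block. Launch with e.g.
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
# so RANK / WORLD_SIZE / MASTER_ADDR are set for the process group.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

block = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()

# FULL_SHARD splits parameters, gradients, and optimizer state across all
# ranks (ZeRO-3 style); HYBRID_SHARD instead shards within a node and
# replicates across nodes, trading memory for less inter-node traffic.
model = FSDP(block, sharding_strategy=ShardingStrategy.FULL_SHARD)
```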
Responsibilities include:
- Owning the architecture and maintenance of our distributed training pipeline
- Training LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate
- Designing and debugging multi-node/multi-GPU training runs (Kubernetes-based)
- Optimizing training performance: memory usage, speed, throughput, and cost
- Helping manage experiment tracking, artifact storage, and resume logic (see the sketch after this list)
- Building reusable, scalable training templates for internal use
- Collaborating with researchers to bring their training scripts into production shape
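As an illustration of the checkpoint-and-resume piece, here is a minimal sketch of a resumable training loop built on Hugging Face Accelerate. The tiny model, synthetic data, and `checkpoints/latest` path are placeholder assumptions; Accelerate's `save_state`/`load_state` capture model, optimizer, and RNG state for whichever backend (DDP, FSDP, DeepSpeed) was configured at launch.

```python
# Hypothetical sketch (placeholder model/data/paths): a resumable training
# loop with Hugging Face Accelerate. The same loop body runs under DDP,
# FSDP, or DeepSpeed depending on the `accelerate config` used at launch.
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

CKPT_DIR = "checkpoints/latest"  # assumed checkpoint layout, for illustration

accelerator = Accelerator()
model = torch.nn.Linear(512, 512)  # stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))
loader = DataLoader(data, batch_size=32, shuffle=True)

# prepare() wraps everything for the configured backend without changing
# the training loop below.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# Resume logic: restore model, optimizer, and RNG state if a checkpoint exists.
if os.path.isdir(CKPT_DIR):
    accelerator.load_state(CKPT_DIR)

for epoch in range(3):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)  # backend-aware gradient sync/scaling
        optimizer.step()
    accelerator.save_state(CKPT_DIR)  # shard-aware checkpoint per epoch
```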
About CloudWalk
CloudWalk is a fintech company reimagining the future of financial services by building intelligent infrastructure powered by AI, blockchain, and thoughtful design.