Design, scale, and mature the Kubernetes-based training platform that powers distributed AI workloads across teams and frameworks.
Improve the reliability, observability, debugging, and operational support for training systems to ensure efficient and reproducible training.
Collaborate with research scientists, ML engineers, and infrastructure teams to shape the platform roadmap and enhance workflows for large-scale AI training.
Own and operate GPU and accelerator clusters for AI training, inference, and experimentation, ensuring reliability and cost-efficiency.
Build and optimize scheduling, orchestration, and serving systems using frameworks like vLLM and Triton to improve latency, throughput, and memory efficiency.
Partner with ML engineers to remove workflow bottlenecks and build observability for GPU utilization, capacity, and incident response.
Kraken is a crypto exchange platform building premium financial products for traders and institutions, accelerating global crypto adoption. It is a mission-driven, fully remote company with a world-class team of crypto experts spread across more than 70 countries.
Identify recurring patterns across customer issues and drive long-term reliability improvements.
Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. They serve solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.
Define and evolve the architecture and roadmap for enterprise‑scale Data and AI platforms.
Design and build multi‑tenant, multi‑region, highly available AI platforms with governance.
Lead capacity planning and cost optimization strategies for GPU and CPU workloads.
NEORIS accelerates growth in Ibero‑America, combining global engineering with regional expertise. With over 60,000 professionals across 55+ countries, they offer technical specialization career paths and value responsibility, collaboration, creativity, and commitment.
Design, build, and maintain the core infrastructure layer supporting GenAI products.
Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.
PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, closely integrated with key horizontal partners to ensure delivery of safe, scalable, and high-impact AI products.
Design and deploy GPU cluster architectures using tools like Ansible, Terraform, Kubernetes, and Slurm.
Lead technical deep-dives and workshops, and present solutions to stakeholders, translating complex concepts.
Automate provisioning and monitoring with Infrastructure as Code, and produce documentation and training materials.
Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering digital experiences worldwide. The company collaborates with leading technology partners and employs over 550 professionals building foundational technologies.
Define, drive, design, and build/ship end-to-end solutions that solve real customer problems.
Contribute to the end-to-end AI/ML software development lifecycle, ensuring reproducible research.
Drive architecture, design, and delivery of advanced ML systems in the Product R&D team.
Kinaxis is a global leader in modern supply chain orchestration. Known for its AI-infused platform and transparency across end-to-end supply chains, Kinaxis helps customers make faster, better decisions. The company has over 2000 employees worldwide and is recognized with Top Employer awards.
Design, develop, and maintain core infrastructure supporting large-scale optimization engines and planning workflows to improve scalability and performance.
Analyze and optimize performance bottlenecks in optimization pipelines, focusing on compute, memory usage, and data flow for complex planning problems.
Contribute to evolving platform architecture, designing systems for large datasets and parallel execution while ensuring enterprise-grade reliability and maintainability.
Kinaxis is a global leader in modern supply chain orchestration, providing an AI-powered platform for end-to-end supply chain transparency and faster decision-making. The company has over 2000 employees globally, is a multi-time Top Employer award winner, and fosters a culture of innovation with a serious focus on technology, customers, and a collaborative, not-too-serious internal environment.
Design, implement, and deploy ML/AI models end-to-end, from concept through production, including data pipelines, training workflows, and optimization.
Maintain and evolve AI systems in production, monitoring for drift, debugging issues, and driving ongoing improvements to reliability and scalability.
Partner closely with product, engineering, and data teams to align AI work with broader product and business goals.
Robots & Pencils is an applied AI engineering firm that designs and ships AI co-workers integrating into operations and delivering results for clients. Founded in 2009, they have delivery centers in Canada, the United States, Eastern Europe, and Latin America, with teams averaging 15+ years of experience.
Collaborate with engineering and cross-functional teams to translate business problems into an ML product roadmap.
Contribute hands-on technical expertise as a player-coach, providing strategic direction and mentorship to the team.
Establish an engineering setup enabling rapid iteration, experimentation, and deployment of models, fostering operational excellence.
Twilio is shaping the future of communications by delivering innovative solutions to hundreds of thousands of businesses. They empower millions of developers worldwide to craft personalized customer experiences, emphasizing a remote-first culture with a vibrant and globally inclusive team.
Translate AI research into production systems for creating accurate, conflated map data products from aerial imagery and geospatial sources.
Design, build, and operate scalable software systems using Python, PyTorch, and GIS tools to transform petabytes of imagery into insights.
Collaborate closely with researchers and engineers to solve complex problems, improve data quality, and utilize agentic coding tools for exploratory work.
Nearmap provides aerial imagery, AI analytics, and geospatial tools to help professionals plan, build, insure, and govern properties. The company emphasizes a collaborative culture that brings out the best in its employees.
Design, build, and maintain scalable services that support the AI lifecycle.
Develop tools for pre- and post-processing data for AI and other uses.
Design scalable pipelines for data collection, processing, and transformation.
Planner 5D is a global hub for home design, uniting over 100 million users. They simplify the home renovation process with their cutting-edge software, fostering a vibrant community of enthusiastic and product-oriented professionals.
Build our core Python/Rust platform: request routing, AI workload orchestration, scheduling, GPU autoscaling, large-scale file storage, queueing, etc.
Produce forward-looking designs for platform evolution as we scale to 100x current traffic and need to provide low latency across the world.
Leverage AI to an extreme level to automate the mundane parts of building complex but reliable systems.
Fal is building the infrastructure, tools, and model access to move from AI idea to production. They aim to be the unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.
Contribute to the development of the Everywhere Inference platform, a Kubernetes-based solution.
Design and implement APIs and developer tools to simplify deployment, management, and monitoring of AI applications.
Optimize serverless container workflows for AI workloads, ensuring performance, scalability, and seamless autoscaling.
Gcore provides infrastructure and software solutions for AI, cloud, network, and security. They have 550+ professionals globally and power everything from real-time communication and streaming to enterprise AI and secure web applications.
Build and maintain CI/CD pipelines and deployment infrastructure.
Leverage AI to automate analysis and resolution of production issues.
Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.