Infrastructure and Orchestration:
- Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform).
- Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.
- Architect fault-tolerant infrastructure for distributed ML: GPU clusters, NVIDIA runtime, S3 checkpointing.
Networking and Data Handling:
- Build systems that simulate and handle real-world network conditions — bandwidth shaping, latency injection, packet loss.
- Manage dynamic node churn and ensure efficient data flow across workers with heterogeneous connectivity, because our training happens on consumer nodes and non-co-located infrastructure, not in a datacenter.
- Handle large dataset management and streaming.
Required Skills:
- Experience in a startup environment with an emphasis on microservices orchestration, or a big-tech background.
- Deep understanding of multi-cloud infra & distributed training systems.
- Excellent team player with high attention to detail.
Pluralis Research
Pluralis Research is pioneering Protocol Learning, a fully decentralised way to train and deploy AI models that opens this layer to individuals rather than only well-resourced corporations.