What You'll Work On:
- Audit, secure, and optimize our existing cloud infrastructure (AWS) to ensure high availability, fault tolerance, and security for both training and production workloads.
- Design and maintain scalable architectures for serving deep learning models (PyTorch/TensorFlow), optimizing for low latency and high throughput when processing complex infrastructure data.
Required Technical Skills:
- Deep expertise in AWS (e.g., EC2, S3, EKS, SageMaker, Lambda) and cloud security best practices.
- Strong experience with Docker and Kubernetes for packaging and scaling ML applications.
- Proficiency with Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation.
What We're Looking For:
- 4-6+ years of experience in MLOps, DevOps, or Data Engineering, with a strong emphasis on machine learning workloads.
- A security-first, stability-first mindset and the collaborative instincts to work closely with Data Scientists.
- Clear communication skills to articulate architectural decisions and tradeoffs to the broader technical team.