We are looking for a Senior Site Reliability Engineer to join the Model Serving team at Cohere. The team is responsible for developing, deploying, and operating the AI platform that delivers Cohere's large language models through easy-to-use API endpoints. In this role, you will work closely with many teams to deploy optimized NLP models to production in low-latency, high-throughput, high-availability environments. You will also have the opportunity to interface with customers and create customized deployments to meet their specific needs.
You may be a good fit if you have 5+ years of engineering experience running production infrastructure at large scale, experience designing large, highly available distributed systems with Kubernetes, and experience with GPU workloads on those clusters.