Source Job

Australia

  • Support and evolve the reliability of platforms used by the AI Research team.
  • Ensure production services meet expectations for availability, latency, and operational readiness.
  • Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps.

Kubernetes Terraform Go Python GCP

20 jobs similar to Senior Site Reliability Engineer, AI Research

Jobs ranked by similarity.

US

  • Ensure the smooth operation and high availability of Clarifai's core services
  • Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
  • Design and implement scalable, secure, and cost-effective infrastructure solutions

Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a globally distributed team, $100M in funding, and a commitment to building a diverse and inclusive team.

Americas EMEA Unlimited PTO

  • Design and implement highly scalable infrastructure for GitLab.com to support current and future growth.
  • Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction.
  • Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the infrastructure department.

GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. They aim to enable everyone to contribute to and co-create the software that powers our world.

ANZ

  • Building world-class AI infrastructure to support a 100+ person research team.
  • Designing and scaling multi-cloud systems that support high-performance model training and inference.
  • Improving monitoring, alerting and system observability for AI workloads.

Canva is redefining how the world experiences design. They have campuses in Sydney and Melbourne, co-working spaces in Brisbane, Perth, Adelaide and Auckland, and trust their employees to choose the balance that empowers them and their team to achieve their goals.

South America

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.

Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. Even with thousands of employees spread across multiple continents, they still maintain a culture that inspires innovation, rewards risk-taking and celebrates success.

Global 6w PTO 26w maternity

  • Build self-service systems that automate managing, deploying and operating services.
  • Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.
  • Ensure we hit defined SLOs, including participation in an on-call rotation.

Cohere is focused on scaling intelligence to serve humanity by training and deploying frontier models for developers and enterprises. They are a team of researchers, engineers, and designers. They value diversity and strive to create an inclusive work environment.

US Unlimited PTO

  • Build Enterprise-Scale Infrastructure leveraging infrastructure-as-code to manage complex cloud environments.
  • Sustain Platform Health and Performance owning critical systems in production, including reliability and security.
  • Enable Teams and Customers to Move Faster creating abstractions and tooling that deploy, run, and scale AI/ML workloads.

Cake is on a mission to make cutting-edge AI accessible to enterprise teams. Backed by top investors, Cake is seeing strong adoption and is positioned for rapid growth in the next 12 months, emphasizing ownership, clear communication, and collaboration.

ANZ

  • Deeply understand the needs and workflows of platform engineers.
  • Write tools, services, and configuration to ensure a low-friction experience.
  • Take on ownership of our existing configuration frameworks.

Canva is a design platform that empowers everyone to create and publish anything. They have campuses in Sydney and Melbourne, as well as co-working spaces in Brisbane, Perth and Adelaide, and strive to create moments of magic, connectivity and fun throughout life at Canva.

US Canada Europe

  • Lead a global team of Site Reliability Engineers.
  • Recruit, hire, onboard and develop engineers.
  • Guide project planning by defining milestones and identifying dependencies.

AuthZed creates and maintains SpiceDB and the authorization infrastructure. They are a Series A company with a fully remote team across the US, Canada, and Europe and a hardworking, close-knit group with a software-driven culture that values integrity, collaboration, and open-mindedness.

India

  • Design and manage AWS infrastructure for AI services.
  • Implement Infrastructure as Code using Terraform.
  • Collaborate with cross-functional teams to enhance performance.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

North America Unlimited PTO

  • Build and operate scalable backend services and internal APIs for the AI platform.
  • Integrate LLMs and AI tool execution into reliable, production-ready workflows.
  • Own production reliability for AI platform infrastructure through observability, alerting, and incident response.

MaintainX is the world's leading Asset and Work Intelligence platform for industrial and frontline environments. Their product is a modern, IoT-enabled, cloud-based tool for reliability, safety, and operations on physical equipment and facilities, powering operational excellence for 13,000+ businesses. MaintainX recently completed a $150 million Series D round at a valuation of $2.5 billion.

$219,000–$245,000/yr
US Unlimited PTO

  • Architect, operate, improve and secure the platform the Garner Health app runs on
  • Boost development velocity and productivity
  • Build systems to a high engineering standard and hold others to the same high standard

Garner has developed a revolutionary approach to evaluating doctor performance and a unique incentive model that's reshaping the healthcare economy to ensure everyone can afford high-quality care. They have more than doubled their revenue annually over the last 5 years. Garner's award-winning culture is designed to cultivate teamwork, trust, autonomy, exceptional results, and individual growth.

Global

  • Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
  • Optimize for AI/ML training by collaborating with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance.
  • Troubleshoot and resolve complex issues and proactively identify infrastructure bottlenecks, performance degradation, and system failures.

Cohere's mission is to scale intelligence to serve humanity by training and deploying frontier models for developers and enterprises who are building AI systems. The company is composed of researchers, engineers, and designers passionate about their craft and believes that a diverse range of perspectives is a requirement for building great products.

$131,600–$282,000/yr
US Unlimited PTO

  • Hire, lead, and support a high-performing Infrastructure Platforms team.
  • Connect business goals and customer needs with sound engineering.
  • Guide the security, reliability, performance, and scalability of core platform components.

GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. Their mission is to enable everyone to contribute to and co-create the software that powers our world.

Australia New Zealand

  • You’ll own challenging infrastructure problems end-to-end.
  • You’ll design scalable, maintainable services and contribute to technical proposals.
  • You’ll contribute to the roadmap for our Provisioning team.

Canva is a design platform that enables users to create a variety of visual content. They have campuses in Sydney and Melbourne, co-working spaces in other major cities, and offer a flexible work environment.

India

  • Oversee the reliability, scalability, performance, and security of key production services.
  • Collaborate with cross-functional teams to develop and maintain resilient infrastructure.
  • Provide expert mentorship and guidance on best practices to engineers throughout the organization.

Cision is a global leader in PR, marketing and social media management technology and intelligence, helping brands and organizations connect with customers and stakeholders to drive business results. The company has offices in 24 countries throughout the Americas, EMEA and APAC.

  • Helping improve the infrastructure and data platform using a lean approach.
  • Creating a data platform and infrastructure optimized for development using machine learning and massive data processing.
  • Improving the development experience and spreading the DevOps culture in the company.

Clarity AI is a global tech company founded in 2017 with a mission to bring societal impact to markets. They leverage AI and machine learning to provide data, methodologies, and tools to investors, governments, companies, and consumers for informed decisions; they are a team of over 300 individuals with offices in New York, Madrid, London, Paris, and Abu Dhabi, backed by investors like BlackRock and SoftBank.

US

  • Own the reliability, performance, and operational health of production AI systems.
  • Lead efforts to refactor and harden the AI codebase.
  • Design and build monitoring, alerting, and debugging tools.

MixMode is a leading provider of AI-powered cybersecurity solutions at scale, pioneering a patented third-wave, context-aware AI approach. Large organizations with big data workloads trust MixMode to defend their most important assets.

Global

  • Own and operate core platform systems across AWS, GCP, Vercel, GitHub, and Cloudflare.
  • Improve reliability, scalability, and security of production and non-production environments.
  • Improve local development environments and onboarding experience for engineers.

Moxie empowers ambitious aesthetic entrepreneurs to build profitable, independent practices. A global, remote-first team of more than 140 people supports hundreds of practices nationwide as they unlock sustainable success for aesthetic entrepreneurs.

US

  • Understand and participate in the changing FedRAMP space.
  • Own and champion high operational standards of Confluent Cloud systems leveraged by federal agencies.
  • Innovate and design solutions to reduce toil, bolster operational maturity, and make day-to-day work life easier.

Confluent is rewriting how data moves and what the world can do with it. Their platform puts information in motion, streaming in near real-time so companies can react faster and build smarter. They value team players who ask hard questions, give honest feedback, and show up for each other.

Australia New Zealand

  • Owning End-to-End Technical Solutions: Designing, documenting, and implementing complex infrastructure projects.
  • Building Developer-Facing Tools & Automation: Creating pragmatic automation that eliminates repetitive manual work for engineers.
  • Collaborating & Enabling Engineering Teams: Partnering with engineering teams to understand pain points and gather requirements.

Canva is a design platform that enables users to create social media graphics, presentations, posters, documents and other visual content. They foster a culture of connectivity and fun, offering employees various benefits and opportunities for growth.