Source Job

Global

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Serve as the primary technical point of contact for customers running large-scale training workloads.
  • Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.

GPU Kubernetes Python Go

20 jobs similar to Senior Site Reliability Engineer - AI Infrastructure

Jobs ranked by similarity.

Global

  • Design and implement infrastructure and tools that empower our product teams to rapidly and securely iterate, emphasizing reliability and automation.
  • Influence the strategic direction of our infrastructure and operational practices, ensuring that we are well-positioned to scale and support our growing organization.
  • Take a proactive role in the resolution of production issues, ensuring that we are well-prepared to handle incidents and that we learn from them in a blameless manner.

SSV Labs is the core team behind the SSV Network - pioneering decentralized infrastructure for Ethereum staking. They are building tools, protocols, and standards to make staking more secure, scalable, and trustless.

Europe Unlimited PTO

  • Design, build, and maintain the inference infrastructure that powers Sword Health's AI products.
  • Own the end-to-end deployment pipeline for AI models, from real-time computer vision to large language models.
  • Architect and scale Kubernetes clusters for GPU-accelerated workloads, including autoscaling strategies and resource scheduling.

Sword Health is shifting healthcare from human-first to AI-first through its AI Care platform. They make world-class healthcare available anytime, anywhere, while significantly reducing costs. Sword Health has over 1,000 enterprise clients and has raised more than $500 million from leading investors.

Australia 5w PTO

  • Own the design, deployment and operation of OpenStack and Kubernetes environments.
  • Build and improve infrastructure using infrastructure-as-code and GitOps practices.
  • Optimise GPU workload scheduling using Kubernetes and NVIDIA tooling.

NexGen Cloud is building next-generation GPU cloud infrastructure, and is the company behind Hyperstack, a high-performance cloud platform designed for compute-intensive workloads. We're a scale-up by design, solving complex infrastructure challenges at pace, with real-world impact.

Europe

  • Design, build, and maintain scalable, highly available and fault-tolerant infrastructures.
  • Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration.

Mistral AI is dedicated to democratizing AI through high-performance, optimized, open-source models, products, and solutions designed to integrate seamlessly into daily working life. They are a dynamic, collaborative team dedicated to innovation and passionate about AI and its potential to transform society.

US

  • Design and implement scalable distributed systems that handle heavy CPU, disk, and network workloads.
  • Analyze system behavior to identify bottlenecks across compute, storage, and network layers.
  • Build instrumentation, metrics, and telemetry to measure system performance.

RapidFort is a Series A cybersecurity company backed by $42M from leading investors, building the next generation of container and software supply-chain security. Our platform helps enterprises and U.S. government agencies eliminate vulnerabilities in container images, secure Kubernetes environments, and protect cloud-native infrastructure at runtime.

Europe

  • Own the full user journey across GPU clusters, including workflows and capacity management.
  • Define the product direction from problem discovery to solution delivery and customer adoption.
  • Lead open-source strategy and execution for the control plane, fostering community engagement.

Jobgether is a company that uses an AI-powered matching process to ensure that your application is reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

$130,000–$160,000/yr
US

  • Design, build, and optimize cloud platform capabilities.
  • Tackle complex infrastructure challenges and raise engineering quality.
  • Apply AI and AIOps to make the platform smarter and more resilient.

PerfectServe offers Best in KLAS clinical communication and physician scheduling solutions and is a Leader in the Gartner Magic Quadrant for Clinical Communication and Collaboration. We focus on optimizing provider schedules and dynamically routing messages to advance patient care and clinical workflows, valuing growth, transparency, and innovation.

Global

  • Build and own the foundational infrastructure that our products run upon.
  • Work directly on our products' Go codebase to implement SRE-related objectives.
  • Take a data-driven approach to quantifying system performance and reliability.

LiveKit provides the network infrastructure for multimodal AI interfaces, enabling seamless audio and visual interactions. Founded in 2021, LiveKit supports over 3 billion calls annually, with 100,000+ developers and industry giants like OpenAI, Spotify, and Meta.

UK 5w PTO

  • Lead and guide development teams while working directly with clients.
  • Translate business and technical requirements into impactful applications.
  • Ensure best practices in software development, DevOps, and agile methodologies.

Nearform is an independent team of data & AI experts, engineers, and designers who build intelligent digital solutions and capability at pace. Our team of 500 experts in 20+ countries is trusted by leading enterprises.

Spain 6w PTO

  • Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure.
  • Diagnosing and eliminating cross-layer failure modes.
  • Designing safe upgrade and rollout strategies at scale.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana, its open-source visualization tool. Grafana Labs helps more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and its team thrives in an innovation-driven environment.

Australia

  • Serve as the primary technical partner between customers and Armada’s Product and Engineering teams, translating real‑world requirements into actionable designs.
  • Provide hands‑on technical guidance on AI Factory solutions, including modular and liquid‑cooled data centers and NVIDIA‑based GPU systems.
  • Advise customers on workload suitability, rack‑level design, system architecture, and deployment tradeoffs.

Armada is a full-stack edge infrastructure company delivering compute, connectivity, and sovereign AI/ML to some of the world’s most remote places. They are backed by top investors such as Microsoft (M12) and Founders Fund, and have strategic partnerships with Starlink, Skydio, and NVIDIA.

EU US

  • Define and drive Impossible Cloud’s global Go-to-Market (GTM) strategy.
  • Establish scalable customer acquisition and retention strategies.
  • Build and lead a high-performing global GTM team.

Impossible Cloud has revolutionized enterprise storage with patented, decentralized object storage that offers cost-effective, high-performance infrastructure. They are expanding this foundation to build the next-generation AI-first platform encompassing storage, compute, and GPU capabilities.

US EMEA

  • Design and implement the complex distributed infrastructure that powers our core AI engine and distributed analysis systems.
  • Tune and optimize cloud services across compute, storage, networking, and observability to drive performance and reliability.
  • Develop our core services, written in TypeScript, Kotlin, and Go, to support our unique deployment and infrastructure requirements.

XBOW is building the future of offensive security. They create the platform that puts security ahead in the arms race, using AI to autonomously discover, validate, and exploit vulnerabilities. Founded by Oege de Moor, the company is backed by Sequoia, Altimeter, and other leading investors.

  • Designing, building, and operating Kubernetes infrastructure across multiple cloud providers.
  • Building and maintaining automation for cluster lifecycle management, node provisioning, and provider onboarding.
  • Developing platform tooling and abstractions that enable other Canva engineers to deploy and scale workloads.

Canva is a design platform redefining how the world experiences design. They have campuses in Sydney and Melbourne, along with co-working spaces in Brisbane, Perth and Adelaide, offering a flexible and inclusive work environment.

Australia

  • Own and drive the design, deployment, and operation of OpenStack and Kubernetes clusters optimised for GPU workloads.
  • Lead and develop a team of 4–5 infrastructure engineers, setting clear direction and standards.
  • Build and improve infrastructure through automation (IaC, GitOps, CI/CD pipelines).

NexGen Cloud is a fast-growing company building next-generation GPU cloud infrastructure. At the core of NexGen Cloud is a team of curious, driven people who care deeply about quality, ownership and collaboration.

Europe

  • Collaborate within a multi-disciplinary team of product managers, designers, software engineers, machine learning and biomedical scientists.
  • Design, build, and maintain scalable, reliable AI systems.
  • Drive technical decisions and provide context-aware solutions for AI systems in biological research.

Owkin is an AI company with a mission to solve the complexity of biology. They are building the first Biology Super Intelligence (BASI) by combining powerful biological large language models, multimodal patient data, and agentic software.

Europe

  • Build scalable Edge infrastructure, designing and maintaining delivery systems for model deployment.
  • Work with cross-functional teams to integrate complex features, translating research into hardware realities.
  • Drive automation and reliability by implementing infrastructure to test models and monitor performance.

Hudl builds great teams and hires the best to ensure employees are working with people they can constantly learn from. They provide a culture where everyone feels supported, becoming one of Newsweek's Top 100 Global Most Loved Workplaces.

Europe US

  • Design and build training pipelines, fine-tuning workflows, and RL infrastructure.
  • Implement data ingestion and curation systems, inference services, and scalability and backend architecture.
  • Own the platform that turns models into production systems.

Fastino is building the next generation of LLMs, with a team of alumni from Google Research, Apple, Stanford, and Cambridge. Fastino's GLiNER family of open source models has been downloaded more than 5 million times and is used by companies such as NVIDIA, Meta, and Airbnb.

$198,025–$287,952/yr
US

  • Building tools and applications to extend Calendly’s infrastructure platform.
  • Evaluating and deploying cloud-native open-source tools.
  • Exercising expertise in cloud infrastructure concepts and patterns.

Calendly makes it possible for their customers through impactful innovation. They have millions of users and are in the midst of exciting product growth.

Europe 6w PTO

  • Operate and evolve multi-cloud streaming clusters and related database infrastructure, diagnosing and eliminating cross-layer failure modes.
  • Define and evolve the technical direction for operating shared database systems at scale, leading complex initiatives and reliability investments.
  • Mentor and support engineers, reduce systems toil through automation, and partner with database and platform teams to align on strategy.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which features scalable metrics, logs, and traces, and they thrive in an innovation-driven environment where transparency, autonomy, and trust fuel everything.