Source Job

UK 5w PTO

  • Own and drive the design, deployment, and operation of OpenStack and Kubernetes clusters optimized for GPU workloads.
  • Lead and develop a team of 4-5 infrastructure engineers, setting clear direction and standards through automation and incident management.
  • Collaborate closely with DevOps, Product, and Support teams to align infrastructure with customer needs and communicate performance to leadership.

OpenStack Kubernetes Infrastructure Automation Linux Networking

20 jobs similar to Lead Infrastructure Engineer

Jobs ranked by similarity.

Australia

  • Own and drive the design, deployment, and operation of OpenStack and Kubernetes clusters optimised for GPU workloads.
  • Lead and develop a team of 4–5 infrastructure engineers, setting clear direction and standards.
  • Build and improve infrastructure through automation (IaC, GitOps, CI/CD pipelines).

NexGen Cloud is a fast-growing company building next-generation GPU cloud infrastructure. At the core of NexGen Cloud is a team of curious, driven people who care deeply about quality, ownership and collaboration.

UK 5w PTO

  • Own the design, deployment, and operation of OpenStack and Kubernetes environments to ensure performance, scalability, and resilience for GPU workloads.
  • Build and improve infrastructure using infrastructure-as-code and GitOps practices, driving automation across provisioning, deployment, and operational workflows.
  • Optimize GPU workload scheduling using Kubernetes and NVIDIA tooling, and implement monitoring, logging, and alerting to ensure platform stability.

NexGen Cloud is the company behind Hyperstack, a full-stack AI cloud that provides on-demand and private GPU infrastructure to customers ranging from AI researchers to enterprises for compute-intensive workloads. It is a fast-moving, tight-knit team that equips its people with AI tools to solve complex problems and innovate in enterprise GPU infrastructure.

Australia 5w PTO

  • Own the design, deployment and operation of OpenStack and Kubernetes environments.
  • Build and improve infrastructure using infrastructure-as-code and GitOps practices.
  • Optimise GPU workload scheduling using Kubernetes and NVIDIA tooling.

NexGen Cloud is building next-generation GPU cloud infrastructure, and is the company behind Hyperstack, a high-performance cloud platform designed for compute-intensive workloads. We're a scale-up by design, solving complex infrastructure challenges at pace, with real-world impact.

Global Unlimited PTO

  • Lead a platform engineering team delivering managed Kubernetes and cloud infrastructure across multiple providers and deployment models.
  • Own the platform delivery roadmap, coordinating with Cloud Organization, Security, and Professional Services to manage dependencies.
  • Drive foundational infrastructure programs in private networking and cloud governance to establish Ditto's deployment baseline.

Ditto redefines data movement at the edge by providing a peer-to-peer sync engine for building resilient, real-time applications in any network condition. This venture-backed, globally distributed startup is trusted by major enterprises across aviation, retail, and defense, and is committed to building a diverse and inclusive team.

Global

  • Deliver a scalable internal infrastructure platform on public cloud environments.
  • Establish and evolve Kubernetes-based platform capabilities to support high-availability, production-grade workloads at scale.
  • Build a secure and reliable foundation that supports CI/CD pipelines and minimizes operational risk across engineering teams.

Chainlink is the industry-standard oracle platform bringing the capital markets onchain and powering the majority of decentralized finance (DeFi). Since inventing decentralized oracle networks, Chainlink has enabled tens of trillions in transaction value and now secures the vast majority of DeFi.

US

  • Lead the design and implementation of scalable, secure, and resilient cloud infrastructure across AWS and Azure.
  • Drive the architectural vision and strategy, ensuring alignment with long-term business goals.
  • Take the lead on automating and accelerating SDLC processes by identifying bottlenecks.

Candidly flips the script on planning, borrowing, repaying, and saving for college and is a category leader with an AI-driven student debt and savings optimization platform. They partner with hundreds of top employers and have a fully remote, international team of 70+ including alumni from Google, UBS, and Twitter.

Spain 6w PTO

  • Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure.
  • Diagnosing and eliminating cross-layer failure modes.
  • Designing safe upgrade and rollout strategies at scale.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana, its open source visualization tool. Grafana Labs helps more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and its team thrives in an innovation-driven environment.

Global

  • Design and implement infrastructure and tools that empower our product teams to rapidly and securely iterate, emphasizing reliability and automation.
  • Influence the strategic direction of our infrastructure and operational practices, ensuring that we are well-positioned to scale and support our growing organization.
  • Take a proactive role in the resolution of production issues, ensuring that we are well-prepared to handle incidents and that we learn from them in a blameless manner.

SSV Labs is the core team behind the SSV Network, pioneering decentralized infrastructure for Ethereum staking. They are building tools, protocols, and standards to make staking more secure, scalable, and trustless.

Global

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Serve as the primary technical point of contact for customers running large-scale training workloads.
  • Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.

Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. They work with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most and are expanding to find the brightest in AI infrastructure, research and engineering.

Europe 5w PTO

  • Work with other Engineering teams to design sustainable infrastructure and microservice solutions.
  • Automate tools and infrastructure to reduce manual work.
  • Monitor applications and participate in an on-call rotation as required.

Bloomreach is building the world’s premier agentic platform for personalization, revolutionizing how businesses connect with their customers by building and deploying AI agents to personalize the entire customer journey. They power personalization for more than 1,400 global brands.

Global

  • Own the SRE roadmap end-to-end, setting priorities independently and driving execution to make the team's impact visible across the organization.
  • Drive compliance, security, and infrastructure topics for your business unit by identifying risks early and owning the resolution before they escalate.
  • Lead a 4–6-person generalist SRE team through 1:1s, performance cycles, and meaningful career development while contributing technical credibility to architectural discussions.

Kraken is a mission-focused cryptocurrency exchange building the future of crypto and blockchain technology. It is a fully remote company with employees in over 70 countries, offering premium crypto products and services for traders and institutions while emphasizing security, education, and client support.

Global

  • Spearhead the evolution of our scalable, secure, and high-performing platform, driving the infrastructure that fuels our startup.
  • Conduct gap analyses to strengthen infrastructure, maintain exceptional uptime, and enhance monitoring systems for rapid incident detection.
  • Mentor and develop your team, recruit top talent, and foster a culture of collaboration, technical excellence, and continuous improvement.

Ethena Labs is actively building and deploying groundbreaking digital dollar products, aiming to upgrade money into the internet era. They have scaled to $15B in 18 months and continue to develop new product lines and foundational infrastructure for a more open, efficient, and interconnected global financial system.

US Unlimited PTO 12w maternity 12w paternity

  • Help define and drive the technical direction of our Cloud Infrastructure team within Platform Engineering.
  • Work across Valon’s production systems—compute, databases, storage, and networking—shaping the infrastructure foundations that every product and team depends on.
  • Set the technical direction for how we meet those challenges.

Valon is building the AI-native operating system for regulated finance, starting with mortgage servicing. We're a Series C company backed by a16z, transforming industries that others have written off as too complex to innovate.

US

  • Design, build, and operate core cloud infrastructure across compute, storage, databases, and networking layers.
  • Own and improve the reliability, scalability, and security of Valon’s production systems as we scale to support major enterprise deployments.
  • Evaluate, adopt, and operationalize new infrastructure technologies (e.g., Vitess, ClickHouse, Redis) to meet evolving product and scale requirements.

Valon is building the AI-native operating system for regulated finance, starting with mortgage servicing. They are a Series C company backed by a16z, transforming industries that others have written off as too complex to innovate.

Global

  • Leading a team focused on designing, building, and evolving cloud-native, containerized infrastructure.
  • Driving complex technical initiatives and ensuring the availability, security, scalability, and reliability of our data ecosystem.
  • Guiding and developing engineering talent, setting priorities, driving execution, and partnering across teams.

Pismo, founded in 2016, provides a comprehensive processing platform for banking, card issuing and financial market infrastructure. Pismo has 500+ employees located in more than 10 countries around the world and was acquired by Visa in 2024.

$150,000–$190,000/yr
US Unlimited PTO

  • Design self-healing infrastructure and automated root-cause analysis workflows.
  • Drive the strategic roadmap for our GCP and Kubernetes-based cloud capabilities.
  • Transform CI/CD, deployment, and build tooling into a cohesive, self-service product.

Signifyd helps merchants confidently grow their businesses by building trusted relationships with their customers. They have thousands of leading merchants across more than 100 countries and securely process billions of transactions each year.

$145,000–$170,000/yr
US Unlimited PTO 12w maternity 12w paternity

  • Learn platform infrastructure, developer tooling, and deployment patterns.
  • Own and drive at least one architecture decision that improves platform reliability.
  • Ship infrastructure improvements that measurably improve developer experience or platform stability.

Homebot is a homeownership platform for lenders and real estate, title & insurance agents that drives client retention and partner referrals. They maintain a clear focus on culture, engagement, and creating an environment where people are valued and can thrive.

UK 5w PTO

  • Lead, support, and develop a team of software engineers through regular feedback, coaching, and career development to build an environment of clarity and high standards.
  • Partner with Product and stakeholders to translate priorities into realistic plans and clear execution, balancing speed, quality, and business impact.
  • Stay close to architecture and design decisions, supporting strong engineering judgement and maintaining high standards in code quality, testing, and system design.

NexGen Cloud is the company behind Hyperstack, a full-stack AI cloud serving tens of thousands of customers from AI researchers to enterprises running compute-intensive workloads. It is a tight-knit, fast-moving team with an international culture built on trust, transparency, and ownership, practicing what it preaches by equipping its people with AI at every level.

Canada Global

  • Lead, mentor, and foster a healthy, high-performing globally distributed engineering team.
  • Own the execution and delivery of highly critical, complex yearly roadmap items centered around large-scale foundational infrastructure upgrades, high availability, and platform resilience.
  • Own and drive the change management processes across engineering and product domains.

Alpaca is a US-headquartered self-clearing broker-dealer and brokerage infrastructure provider for stocks, ETFs, options, crypto, fixed income, 24/5 trading, and more. Their global team of 230+ members is a diverse group of experienced engineers, traders, and brokerage professionals fostering a vibrant community.

$160,000–$200,000/yr
US Unlimited PTO

  • Maintain, optimize, and enhance on-premises and cloud computing environments.
  • Execute technical aspects of implementation projects, ensuring seamless software integration and customization.
  • Automate Infrastructure-as-Code (IaC) to manage virtual machines and deploy containers, services, and other infrastructure.

Striveworks helps organizations harness AI to solve national security and business challenges, acting as a command center for data and models. Founded by data scientists and engineers, they aim to simplify the deployment and optimization of AI systems, ensuring reliability and scalability.