Source Job

Global

  • Build and maintain our host provisioning stack to bring new bare metal online quickly and confidently.
  • Evolve our homegrown orchestration engine to manage clusters, containers, and VMs.
  • Build out internal observability and alerting so we catch fleet problems before customers feel them.

Ansible Terraform Golang Rust GRPC

20 jobs similar to Infrastructure Engineer

Jobs ranked by similarity.

US

  • Own and operate end-to-end infrastructure for backend services, frontend systems and databases.
  • Build and maintain reliable deployment workflows including CI/CD pipelines and rollback procedures.
  • Improve system-wide observability through metrics, logging, alerting, and monitoring to ensure uptime.

Jito Labs builds a high-performance trading terminal on Solana. They are a lean, high-output team building something that sits at the intersection of execution quality, user experience, and on-chain infrastructure.

$115,200–$172,800/yr
US 8w paternity

  • Build internal tooling to help other engineers and the rest of the company understand and operate our system.
  • Design and implement security best practices for our team and infrastructure.
  • Reduce toil through automation, including building and maintaining CI/CD infrastructure.

Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.

$127,800–$135,900/yr
US

  • Building infrastructure as code and DevOps pipelines and reviewing solutions.
  • Researching and analyzing technical solutions, maintaining and enhancing documentation.
  • Proactively identifying blockers, risks, and issues, proposing solutions or escalating as appropriate.

Nava is a consultancy and public benefit corporation working to make government services simple and effective. They guide agencies constrained by legacy systems to a future with sharp user experiences built on secure, reliable, fault-tolerant cloud infrastructure.

Canada

  • Own the end-to-end infrastructure product vision, including installers, deployment tooling, reference architectures, and operational patterns.
  • Define and evolve a cohesive infrastructure roadmap aligned with Platform architecture, customer needs, and GTM strategy.
  • Partner closely with Product Leadership to balance near-term customer needs with long-term platform scalability and repeatability.

Mechanical Orchard is reinventing how the world’s most critical software gets modernized, focusing on system behavior to turn modernization into a repeatable process. They are an applied AI company challenging industry assumptions and prioritizing quality, rigor, and progress.

Engineer

FAL
$180,000–$250,000/yr
US

  • Build and maintain a Python fleet tracking system that manages the full lifecycle of servers.
  • Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery and alerting.
  • Create and maintain metrics, dashboards, and alerting for hardware health across the fleet.

FAL is committed to keeping a large fleet of GPU servers healthy and productive. They offer a collaborative and supportive culture with learning and growth opportunities.

Unlimited PTO 16w maternity 16w paternity

  • Scale and mature Vesta’s infrastructure to support the entire mortgage market reliably, securely, and efficiently.
  • Build the foundational systems that power engineering velocity and platform reliability.
  • Focus on cloud architecture, deployment systems, observability, incident response, and internal developer tooling.

Vesta is building the next-generation system of record to power the multi-trillion-dollar mortgage market. They value humility, empathy, self-awareness, and an orientation towards action and have raised $45M from top-tier investors.

US

  • Oversee a specialized SRE team focused on the design, deployment, and maintenance of automation toolsets.
  • Establish and enforce standards for IaC to ensure consistent, repeatable, and secure deployments.
  • Drive the automated lifecycle of both physical and virtual assets, from initial template creation/deployment to automated patching, scaling, and decommissioning.

Galaxy is a global leader in digital assets and data center infrastructure, delivering solutions that accelerate progress in finance and artificial intelligence. Led by CEO and Founder Michael Novogratz, their team blends deep crypto expertise with institutional experience and a shared commitment to shaping the future of Web3 and AI.

$29,000–$36,000/yr
India

  • Design, build, and maintain scalable, reliable systems on GCP.
  • Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
  • Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.

SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.

$210,000–$278,000/yr
US Unlimited PTO

  • Architect future iterations of core systems, addressing scaling requirements.
  • Design and implement developer tools to enhance deployment safety and reproducibility.
  • Drive excellence in monitoring and guide incident response for quick issue resolution.

Found provides tools for self-employed individuals, offering a business bank account that automates taxes and expense tracking. They aim to give self-employed people the security and peace of mind historically available only at large corporations and are looking for kind, resourceful, and passionate people.

$131,000–$152,000/yr
US 12w maternity

  • Build and maintain product features across backend services, APIs, data systems, and user-facing workflows.
  • Contribute to services that process SaaS activity, identity data, permissions, alerts, and security findings.
  • Improve existing systems for performance, reliability, maintainability, and observability.

Obsidian Security secures SaaS applications and platforms. They are backed by top investors and trusted by global enterprises.

$127,160–$205,700/yr
North America Unlimited PTO

  • Own the delivery of developer platform capabilities end-to-end, including design, implementation, rollout, and iteration.
  • Build and evolve paved roads that make it easy to deploy, operate, and scale services.
  • Drive improvements to GitOps workflows and harden CI/CD to improve pipeline performance and developer ergonomics.

Phaidra is building the future of industrial automation with AI-powered control systems. They are a 100% remote company with employees located throughout the USA, Canada, UK, Sweden, Spain, Portugal, the Netherlands, Singapore, Australia, and India.

Europe 6w PTO

  • Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).

$245,000–$295,000/yr
US

  • Build, lead, and grow the platform team, setting the pace and creating an environment where strong engineers want to stay.
  • Remain hands-on by writing code, reviewing architecture decisions, and debugging production issues while owning the platform's technical direction.
  • Steer projects through ambiguity, solving technical problems, resourcing gaps, and prioritization calls to ensure the infrastructure scales effectively.

OpenRouter is the leading AI routing and infrastructure layer that enterprises use to access, manage, and optimize the best large language models across providers. It's a fast-scaling technology company powering advanced AI teams by providing flexibility, scalability, and future-proof infrastructure.

US Unlimited PTO

  • Lead Onboarding end-to-end and extend with additional use cases.
  • Own a small portfolio of customer accounts and act as a trusted technical partner all year.
  • Provide technical support and communicate crisply with customers throughout.

OpsMill is building the next generation of infrastructure data management, focusing on helping automation teams unify data and scale automation reliably. As a commercial open-source company, they are practitioners who understand the real-world challenges of scaling infrastructure automation.

US Global

  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure.
  • Implementing and utilizing configuration management and deployment tools.
  • Leading continuous improvement by automating the installation, configuration, and maintenance of services on our platform.

The Wikimedia Foundation operates Wikipedia and other Wikimedia free knowledge projects with the vision of a world where every single human can freely share in the sum of all knowledge. As a charitable, not-for-profit organization, it relies on donations and has staff members based in 40+ countries.

$145,000–$250,000/yr
US Unlimited PTO

  • Construct infrastructure as code, developing and enforcing best practices across configurations while preventing drift between Terraform configurations and infrastructure deployments.

SentiLink provides innovative identity and risk solutions, empowering institutions and individuals to transact with confidence. They are building the future of identity verification in the United States, replacing a clunky, ineffective, and expensive status quo with solutions that are 10x faster, smarter, and more accurate.

$159,925–$222,230/yr
Canada

  • Build prototypes and POCs that showcase Tailscale for AI agents and tooling.
  • Work with reference customers to integrate Tailscale, both for internal adoption and for embedding into their products to enable secure customer connectivity.
  • Create reference architectures and share your work through documentation, open source, community engagement, and conference presentations.

Tailscale is building a new Internet by delivering software that makes it easy to securely interconnect people and their devices, no matter where they are. They are a fully distributed company, and teams of every size use Tailscale each day to protect their networks and share access to internal tools.

$160,000–$200,000/yr
US

  • Drive the stability and reliability of Epic's GCP infrastructure.
  • Manage and harden our Docker and GKE container platform.
  • Maintain and improve CI/CD pipelines.

Epic is the leading digital reading platform for kids ages 12 and under, used by millions of children, families, and educators around the world. As Epic continues to grow, they are reimagining what reading can be through thoughtful technology, data, and global collaboration to make learning more engaging, accessible, and impactful.

$190,800–$267,100/yr
US

  • Design and build backend systems, APIs, infrastructure, and platform capabilities that improve developer workflows across Reddit.
  • Build scalable and reliable systems across both AI-powered developer workflows and the core non-AI systems engineers rely on every day.
  • Lead high-impact projects across Reddit’s developer tooling ecosystem by writing and reviewing code and design docs, aligning stakeholders, and making pragmatic technical tradeoffs.

Reddit is a community-based platform built on shared interests, passion, and trust, facilitating open and authentic conversations. With over 100,000 active communities and approximately 126 million daily active unique visitors, it serves as one of the internet’s largest sources of information.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.