Source Job

US 4w PTO 14w maternity 14w paternity

  • Own core compute infrastructure across multiple cloud providers and regions.
  • Design capabilities for greater performance and flexibility in service deployment.
  • Investigate and resolve challenging cloud and compute issues across the stack.

Kubernetes, Go, Rust, Distributed Systems, Infrastructure

20 jobs similar to Infrastructure Engineer

Jobs ranked by similarity.

Global 16w maternity 16w paternity

  • Lead the design and implementation of self-service platform infrastructure for provisioning, deployment, and observability across engineering teams.
  • Evolve multi-tenant EKS foundations toward better reliability, security, scale, and multi-region connectivity.
  • Set delivery standards using Terraform, GitOps, and progressive rollout, while improving SLOs and alerting on Grafana Cloud.

Docker is a developer tooling company trusted by over 20 million monthly users and 20 billion container image pulls. They are a globally distributed, remote-first team building tools that define how software gets built and delivered.

US Unlimited PTO

  • Provide frontline technical expertise to help developers deploy and scale Temporal in cloud-native environments.
  • Troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.
  • Collaborate with engineering and product teams to influence platform improvements and enhance developer experience.

Temporal provides an open source programming model that simplifies code and makes applications more reliable. The company is a growing team driven by values of curiosity, collaboration, and humility, focused on improving developer experience.

Argentina 18w maternity 12w paternity

  • Own and evolve the cloud platform including compute layer, EKS fleet, serverless infrastructure, networking, and cloud operations across AWS and GCP.
  • Design and maintain infrastructure-as-code foundation and networking layer for reliability, security, and scalability.
  • Build AI-powered automation for cloud infrastructure management, including policy-as-code, drift detection, and LLM-assisted runbook generation.

Webflow builds the world's leading AI-native Digital Experience Platform, empowering teams to design, launch, and optimize for the web without barriers. As a remote-first company with over 2 million users across 190 countries, it fosters a culture of trust, transparency, and creativity.

Global Unlimited PTO 16w maternity 16w paternity

  • Design, implement, and operate core services that power Docker’s Cloud Sandboxes platform.
  • Build scalable systems for microVM orchestration, workload scheduling, and lifecycle management.
  • Ensure system reliability, observability, and performance across Docker’s Cloud Sandbox infrastructure.

Docker is a globally distributed, remote-first company that builds tools for developers to build, share, and run applications. Trusted by over 20 million monthly users and 20 billion container image pulls, it has a collaborative culture focused on innovation and reliability.

Global Unlimited PTO

  • Lead the architecture and implementation of managed Kubernetes infrastructure across AWS, Azure, and GCP for enterprise customer deployments.
  • Own the systems that provision, organize, and manage cloud accounts, including resource governance and multi-tenant isolation.
  • Mentor P3/P4 engineers and define architectural patterns that scale across the company's infrastructure.

Ditto builds the world's leading edge sync platform, enabling applications to share data peer-to-peer with or without internet connectivity. With over $145 million in funding and trusted by organizations like Chick-fil-A and Delta Airlines, Ditto is a fast-growing, globally distributed startup committed to building a diverse and inclusive team.

US Unlimited PTO 20w maternity 20w paternity

  • Design, build, and maintain highly available Kubernetes infrastructure at scale.
  • Lead design for components and features, and contribute to architecture decisions for container orchestration.
  • Mentor engineers on Kubernetes best practices and drive initiatives to improve system reliability.

Marqeta provides a card issuing platform for companies to issue cards, authorize transactions, and manage payment operations in real time. They are a publicly traded company with a Flex First culture that values remote work and employee growth.

  • Co-own the architecture of cloud infrastructure on Azure and Kubernetes clusters for high throughput and availability.
  • Drive resilience strategy for global scaling, zero-downtime deployments, and disaster recovery.
  • Evolve observability stack with LGTM (Loki, Grafana, Tempo, Mimir) and lead incident response.

Flip is an AI-powered employee experience platform for frontline workers in retail, manufacturing, and logistics. The company is a young, rapidly growing tech company with a remote-first culture and offices in Berlin and Stuttgart.

Germany Unlimited PTO

  • Design and maintain scalable infrastructure-as-code solutions using Terraform and Kubernetes.
  • Build and operate observability systems while leading incident response and reliability improvements.
  • Embed security and compliance practices into infrastructure and optimize system performance and cloud costs.

This partner company builds a next-generation platform enabling AI-driven services across global employment infrastructure. It is a highly distributed, async-first organization where engineers thrive in ownership and autonomy.

North America 6w PTO 26w maternity 26w paternity

  • Lead and mentor a team of Forward Deployed Engineers deploying the North platform.
  • Drive end-to-end deployment in private cloud and on-premises environments for customer success.
  • Collaborate with Product, Engineering, and Sales while optimizing cloud infrastructure and K8s services.

Cohere is a security-first enterprise AI company building cutting-edge foundation AI models and end-to-end products for real-world business problems. They are a global technology company with offices in Toronto, San Francisco, London, New York City, Montreal, Seoul, Germany, and Paris, employing a team of researchers, engineers, and designers.

Global

  • Own and evolve Webshare's production infrastructure by leading migration from Docker Swarm to Kubernetes and maintaining high availability across hundreds of servers and ~50 services.
  • Drive observability, IaC practices, and CI/CD pipeline reliability, and participate in the on-call rotation alongside backend developers.
  • Contribute platform tooling to improve developer experience and reduce infrastructure toil, ensuring no silos and shared infrastructure ownership.

We develop cutting-edge proxy and web data scraping solutions for thousands of the world's best-known businesses, including Fortune 500 companies. We are a team of 500+ professionals with a culture focused on growth, learning, and shared infrastructure ownership.

United States Unlimited PTO

  • Own full-stack design and delivery of platform capabilities from architecture to deployment and observability.
  • Build open source infrastructure packages for airgap and cloud-native environments and write comprehensive tests.
  • Work directly with product and customers to translate mission problems into platform capabilities and mentor team members.

Defense Unicorns delivers mission value by streamlining software delivery for defense and civil agencies, focusing on speed, security, and optionality. The team includes innovators, software engineers, and veterans with decades of experience delivering technology programs across the federal market.

US

  • Design and build the orchestration layer using Kubernetes, Slurm, or comparable technologies.
  • Build customer-facing platform APIs, CLIs, web portals, and SDKs.
  • Drive infrastructure-as-code, multi-tenant isolation, and platform reliability.

GPU One provides GPU-as-a-Service (GPUaaS), turning raw GPU infrastructure into a usable cloud platform. The company is building a multi-tenant orchestration layer to serve customers at scale, with a focus on platform engineering and AI infrastructure.

United States

  • Design and build core platform infrastructure for large-scale cloud-native data and analytics systems.
  • Own and improve CI/CD pipelines, testing frameworks, and deployment in a high-scale PaaS environment.
  • Contribute to reliability engineering, observability, and operational excellence across distributed systems.

Jobgether uses an AI-powered matching process to connect candidates with roles. The company is a growing platform focused on efficient job matching and data privacy compliance.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

United States 6w PTO

  • Build and operate the internal engineering platform that provides application engineers with the tools, systems, and Kubernetes clusters they need to deploy and run their workloads.
  • Focus on cloud infrastructure, capacity management, security, engineering productivity, monitoring, and US Federal compliance across squads.
  • Participate in on-call rotations to ensure the health of the system and understand how people use our products.

Grafana Labs, the company behind the open observability cloud, is founded on the principles of open source, open standards, open ecosystems, and open culture. We are a 100% remote company with 1,600+ team members across 40+ countries, backed by leading investors including Lightspeed Venture Partners, Sequoia Capital, GIC, Coatue, J.P. Morgan, CapitalG, and Lead Edge Capital.

US Unlimited PTO

  • Develop internal tools and automate infrastructure using AWS, Kubernetes, and programming languages.
  • Research and design solutions to increase website robustness, availability, and cost efficiency.
  • Collaborate on documentation, code reviews, and rollout of new processes.

Angi powers the future of the home services industry, connecting homeowners with skilled pros. With 9 brands in 8 countries and employees worldwide, Angi has helped homeowners with over 300 million home projects.

US

  • Work as part of a small, cross-functional XP team installing Imogen into client cloud environments, partnering with client infosec, infrastructure, and IT teams.
  • Pair program with other engineers and collaborate closely with product managers and designers.
  • Lead technical discovery efforts for existing customer systems and adapt Imogen to their public cloud estate.

Mechanical Orchard specializes in safely rewriting critical business applications using a unique method that eliminates modernization risks. The company is known for its expertise in Agile practices and has a small, cross-functional team culture focused on collective ownership and continuous improvement.

US

  • Spearhead the evolution of compute and data delivery services, with an emphasis on scale and user requirements.
  • Collaborate to enable efficient and rapid access to new and growing data sets.
  • Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests.

Planet designs, builds, and operates the largest constellation of imaging satellites, delivering an unprecedented dataset via a cloud-based platform. With a global team and a people-centric approach, the company focuses on culture and community while preparing for growth.

Global

  • Design and implement AI inference and training cloud products optimized for Kubernetes, including autoscaling and distributed jobs across GPU fleets.
  • Write clean, efficient Go code for Kubernetes controllers, operators, and custom resources supporting AI workloads.
  • Build APIs, CLIs, and developer tools to simplify deployment, lifecycle management, and monitoring of AI applications.

Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering digital experiences worldwide. With 550+ professionals and 210+ edge locations, the company collaborates with partners like Intel, NVIDIA, and Equinix to build the foundation for an AI-driven world.

Europe

  • Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
  • Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
  • Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.

Reddit is a community of communities, built on shared interests, passion, and trust, home to the most open and authentic conversations on the internet. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.