Source Job

US Canada

  • Build platforms that scale: Design and operate foundational infrastructure that handles billions of events and enables the company to grow with minimal friction.
  • Enable product velocity: Create tooling that lets engineers ship faster and more reliably without becoming infrastructure experts themselves.
  • Drive technical direction: Shape Metronome's infrastructure strategy, make platform-level architectural decisions, and mentor engineers across the organization.

Kubernetes Kafka Spark CI/CD Observability

20 jobs similar to Software Engineer, Infrastructure

Jobs ranked by similarity.

US EMEA

  • Design and implement the complex distributed infrastructure that powers our core AI engine and distributed analysis systems.
  • Tune and optimize cloud services across compute, storage, networking, and observability to drive performance and reliability.
  • Develop our core services, written in TypeScript, Kotlin, and Go, to support our unique deployment and infrastructure requirements.

XBOW is building the future of offensive security. They create the platform that puts security ahead in the arms race, using AI to autonomously discover, validate, and exploit vulnerabilities. Founded by Oege de Moor, the company is backed by Sequoia, Altimeter, and other leading investors.

US Unlimited PTO

  • Define long-term architectural strategy for multi-cloud compute and traffic platforms.
  • Provide mentorship to engineers through design reviews and code contributions.
  • Partner with Security to build ‘secure by default’ systems.

Temporal Technologies develops an open-source programming model that simplifies code and enhances application reliability. With a focus on developer experience and open-source software, they foster a culture of curiosity, collaboration, and genuine impact.

$230,000–$250,000/yr
US Unlimited PTO 12w paternity

  • Define and evolve reliability standards for the SmarterDx platform.
  • Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
  • Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.

SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.

Unlimited PTO

  • Build and operate cutting-edge cloud infrastructure to support Diagrid's core products
  • Define standards, deliver tools, processes, and frameworks to make our products secure, reliable, efficient, and highly available
  • Build and maintain CI/CD pipelines that enable delivering software quickly and securely across clouds

Diagrid believes that open-source software, open standards, and APIs are the greatest transformational tools for organizations. They provide developers with APIs and tools that help them focus on their code rather than on infrastructure. The company was founded by the creators of the Dapr and KEDA open-source projects.

Europe 6w PTO

  • Own reliability, scalability and cost discipline of ingestion and transformation systems
  • Design and deliver infrastructure for real-time/near-real-time feature computation
  • Lead and grow a small, ambitious team while raising technical standards

Yazio is a nutrition app with millions of users in over 150 countries, driven by its mission to transform the world through healthy eating. They champion a focus-driven culture that values efficiency, offering a high-impact environment supported by a diverse, international team committed to growth and well-being.

$156,000–$211,000/yr
US Canada

  • Own and deliver infrastructure projects end-to-end.
  • Build and improve platform primitives for service teams.
  • Improve observability and implement cost and performance improvements.

Afresh is the leading AI company in fresh food, partnering with grocers to improve how they order fresh food. They've experienced record-breaking growth and are on a mission to eliminate food waste. They have over $148 million in funding and embody values of proactivity, kindness, candor, and humility.

US

  • You will lead the teams responsible for the foundation that every engineering team builds on: infrastructure, developer experience, shared libraries, CI/CD, observability, and governance.
  • You will own delivery across developer experience, core platform tooling, and platform infrastructure, ensuring that domain teams building on the platform have an opinionated, supported path.
  • This is a hands-on engineering management role where you review technical designs, pair with engineers on hard problems, and make informed tradeoff calls across infrastructure, tooling, and developer experience.

EzCater is the leading food-for-work technology company in the US, connecting anyone who needs food for their workplace to over 100,000 restaurants nationwide. They are backed by top investors including Insight, Iconiq, Lightspeed, GIC, SoftBank, and Quadrille, and have engaged and passionate colleagues.

US Europe

  • Build and lead the team responsible for the reliability, security, and scalability of Gensyn’s production infrastructure and developer platform.
  • Own the availability, scalability, and security posture of production systems: SLOs/SLIs, incident response, postmortems, reliability improvements, and hardening.
  • Drive delivery across ambiguous, high-stakes initiatives: roadmap planning, prioritization, and execution against tight timelines.

Gensyn is building a protocol that networks together the core resources required for machine intelligence to flourish alongside human intelligence. They value autonomy, independence, direct feedback and an extreme learning rate, and strive to reject mediocrity and waste.

Canada EMEA Unlimited PTO

  • Evolve ArgoCD GitOps standards across environments
  • Build reusable Terraform modules and practices for safe, repeatable cloud infrastructure provisioning and drift detection
  • Lead the operation and evolution of production-grade Kubernetes clusters across cloud environments

GitLab is the intelligent orchestration platform for DevSecOps. More than 50 million registered users and more than 50% of the Fortune 100 trust GitLab to ship better, more secure software faster.

US 6w PTO

  • Design, implement, and maintain scalable integrations for metrics, logs, and traces across cloud and Kubernetes environments.
  • Build middleware, libraries, and services to simplify development and observability workflows.
  • Lead technical direction and strategic planning for observability projects.

They are currently looking for a Staff Software Engineer - Grafana Cloud Observability, Kubernetes Monitoring in United States. This role offers a unique opportunity to shape and advance cloud observability solutions for large-scale systems, focusing on metrics, logs, and traces.

US Unlimited PTO

  • Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
  • Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
  • Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers

OnePay is a consumer fintech company trusted by millions of Americans to make money better, providing an all-in-one financial services platform. Backed by Walmart and Ribbit Capital, OnePay provides banking, savings, credit cards, lending, investing, and crypto services, as well as embedded financial services for frontline workers.

$106,500–$202,500/yr
US

  • Architect new and existing systems to enhance performance, reliability, and scalability.
  • Build, implement, and iterate on CI/CD pipelines.
  • Assist with the management, development, design, and deployment of microservice and containerized applications.

AbbVie's mission is to discover and deliver innovative medicines and solutions that solve serious health issues today and address the medical challenges of tomorrow. They strive to have a remarkable impact on people's lives across several key therapeutic areas.

US

  • Collaborate with application engineering teams on platform infrastructure.
  • Enhance observability and spearhead the adoption of SRE best practices.
  • Build and maintain reliable CI/CD pipelines, tooling, and infrastructure.

Rula strives to provide quality, evidence-based, compassionate mental healthcare and aims to create a world where mental health is no longer stigmatized. They are a remote-first company operating in most U.S. states, and are dedicated to having a culture of inclusion that supports their employees.

US

  • Build and maintain infrastructure-as-code for our AWS EKS and GCP GKE clusters, plus on-premises deployments.
  • Own CI/CD pipelines and drive GitOps adoption.
  • Deploy, scale, and optimize ML/NLP inference workloads.

Vectara is the Enterprise Agent Platform that enables businesses to build and deploy governed, grounded, auditable AI agents across SaaS, VPC, and on-prem. We’re a passionate team that’s hyper-focused on solving enterprise-level technology and business problems with AI.

Europe

  • Design, build, and manage our cloud infrastructure using modern tools (Pulumi) to ensure all infrastructure changes are reproducible, secure, and easily auditable.
  • Orchestrate and optimize our Kubernetes clusters for complex, compute-heavy AI workloads, guaranteeing maximum efficiency and fault tolerance.
  • Implement a flawless monitoring setup using Datadog and OpenTelemetry to make the black box of our distributed systems transparent, hunting down latency spikes or bottlenecks before they impact users.

Deepslate is building Speech to Speech Voice AI models that sound and act indistinguishably from a human, with the belief that everyone should be able to use them. Backed by top-tier investors from the Tech and AI sectors, we are incredibly well-funded and moving fast.

Global

  • Drive improvements to our OpenSearch and Thanos infrastructure.
  • Design, build, and maintain backend services for high-performance data ingestion and storage
  • Operate and scale services across AWS (EC2, EKS, ECR)

They are looking for a software engineer who tackles big, unsolved problems head-on, researching, experimenting, and iterating until they crack them. The data pipeline is the backbone of everything they build and is pushing the boundaries of ingestion throughput, query performance, and storage efficiency.

US Canada 16w maternity

  • Build and deploy computing services and infrastructure in customer environments.
  • Clarify and surface requirements from ambiguous use cases defined by cross-functional stakeholders.
  • Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests.

Planet designs, builds, and operates the largest constellation of imaging satellites in history. They deliver an unprecedented dataset of empirical information via a revolutionary cloud-based platform to authoritative figures in commercial, environmental, and humanitarian sectors. Planet takes a people-centric approach to culture and community, and strives to iterate in a way that puts its team members first and prepares the company for growth.

$172,614/yr
US

  • Design infrastructure, networking, and software platform architecture.
  • Build and maintain automation of Continuous Integration and Continuous Deployment pipelines.
  • Troubleshoot infrastructure, internal applications, networking, and security issues.

Loadsmart is a technology company focused on the logistics and supply chain industry. They leverage data and technology to automate and optimize freight transportation, connecting shippers and carriers to streamline the shipping process. They are a mid-sized company passionate about transforming the future of freight.

$120,000–$140,000/yr
US Unlimited PTO

  • Architect and manage scalable cloud infrastructure within AWS.
  • Implement and maintain infrastructure using Terraform.
  • Develop automation scripts to improve operational efficiency.

Attune empowers insurance agents with their technology solutions. We foster a remote-first culture and value employee development.

North America Europe

  • Build distributed systems that support reliability, resiliency, and safe operation at scale.
  • Design and operate traffic control mechanisms: circuit breakers, rate limiting, admission control, backpressure, and graceful degradation.
  • Develop tooling that improves incident detection, response, and automated mitigation.

Whatnot is the largest live shopping platform in North America and Europe to buy, sell, and discover the things you love. They are a remote co-located team, inspired by innovation and anchored in their values.