Source Job

Global

  • Design and evolve cloud-native, containerized infrastructure for data products and services.
  • Lead cross-functional technical initiatives ensuring availability, security, scalability, and reliability.
  • Contribute hands-on expertise in systems design, automation, and high-scale distributed systems.

Kubernetes Terraform Python Data Engineering CI/CD

18 jobs similar to Data Reliability Engineer

Jobs ranked by similarity.

US

  • Design, deploy, and manage production Kubernetes clusters with workload scheduling, resource quotas, network policies, and RBAC.
  • Build and optimize CI/CD pipelines using Infrastructure as Code and GitOps principles.
  • Implement observability solutions using Prometheus, Grafana, and OpenTelemetry for performance tuning and reliability.

VerTALENTS is a subsidiary of VerSprite Cybersecurity, specializing in technology staffing. The company connects top technical talent with industry clients through various methods, adding value to both clients and candidates for full-time and contracting opportunities.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

$188,550–$212,150/yr
Global Unlimited PTO

  • Own the technical direction of Remote's SRE/Platform domain.
  • Define and drive the reliability strategy across the platform.
  • Identify and lead AI enablement initiatives across the engineering organisation.

Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.

$115,200–$172,800/yr
US 8w paternity

  • Build internal tooling to help other engineers and the rest of the company understand and operate our system.
  • Design and implement security best practices for our team and infrastructure.
  • Reduce toil through automation, including building and maintaining CI/CD infrastructure.

Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.

Europe

  • Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
  • Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
  • Define and implement our observability strategy across metrics, logs, and tracing.

Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as a first-class collaborator in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

India

  • Design, deploy, and manage Kubernetes-based platforms in production.
  • Implement and manage automation frameworks for infrastructure provisioning and operations.
  • Administer and optimize VMware environments (vSphere, ESXi, vCenter).

EPlus believes technology is a people business and delivers solutions that make a real difference. Their team is passionate, skilled, and driven, valuing collaboration, innovation, and extraordinary results and dedicated to fostering, cultivating, and preserving a culture that represents diversity, enables inclusion.

Latin America

  • Build and operate the self-service infrastructure platform for developers and AI agents.
  • Own core platform layers including CI/CD, GitOps, IaC module catalog, and golden-path scaffolding.
  • Build internal tooling, observability, and metrics to make pipelines observable and improvable.

Luxury Presence is building the AI growth platform for real estate. Backed by top investors like Bessemer Venture Partners, we're a Series C company with over $100M in ARR and more than 90,000 real estate professionals using our platform.

US Unlimited PTO

  • Lead the design, implementation, manage, support and operation of cloud-native infrastructure and container orchestration platforms.
  • Drive platform reliability, scalability, automation, and operational excellence across critical SaaS and cloud-based workloads.
  • Contribute to architectural decisions, mentoring engineers, and ensuring alignment with security, compliance, and operational standards.

Availity delivers revenue cycle and related business solutions for health care professionals who want to build healthy, thriving organizations. They are a global team with headquarters in Jacksonville, FL, and an office in Bangalore, India, united by a mission to bring the focus back to patient care.

Canada Unlimited PTO

  • Design, build, and operate distributed systems powering observability across ClickHouse Cloud.
  • Own reliability, performance, and cost-efficiency of the telemetry pipeline and storage systems.
  • Take part in on-call rotation and drive root-cause resolution and long-term fixes.

ClickHouse is a real-time analytics and data warehousing company recognized on the 2025 Forbes Cloud 100 list. With over 3,000 customers and rapid growth, the company fosters an innovative and fast-paced culture.

UK

  • Manage and optimise Kubernetes clusters in GKE through Terraform.
  • Design and implement automation strategies that empower developers to self-serve.
  • Serve as the technical point-of-contact for GCP and Kubernetes-related queries.

Prolific builds the human data infrastructure that reshapes AI development by enabling collection of high-quality, ethically sourced human behavioral data. They are a mission-driven company with a competitive salary and benefits, offering remote working within a culture focused on impact and innovation.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.

Germany

  • Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
  • Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
  • Lead incident response and postmortems in a blameless culture.

Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.

$180,000–$200,000/yr
US

  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
  • Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.

Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.

India

  • Assist in managing multiregion and multicloud infrastructure, ensuring resiliency, scalability, and performance.
  • Support infrastructure provisioning and deployments primarily on GCP, while gaining exposure to other cloud providers.
  • Collaborate with development teams to design and maintain CI/CD pipelines in GitLab CI and contribute to GitOps-based deployments using ArgoCD.

Learneo is a platform of builder-driven businesses, including Course Hero, CliffsNotes, LitCharts, Quillbot, Symbolab, and Scribbr, united around supercharging productivity and learning. Each team innovates independently, supported by centralized corporate operations functions, and the company values collaboration and growth.