Source Job

US

  • Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
  • Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
  • Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.

Python Go AWS Linux Prometheus

20 jobs similar to Sr. Production Engineer

Jobs ranked by similarity.

US Canada

  • Deliver network stack projects end-to-end including service mesh, DNS, CDN, and edge protection while shaping technical vision and maintaining operability.
  • Integrate networking into self-service platforms to streamline workflows and enable engineering teams to operate independently.
  • Participate in on-call rotation, driving incident resolution and continuous improvement through postmortem analysis.

1Password builds a safe, productive digital future by unleashing employee productivity without compromising security. Over 180,000 businesses trust them, and they've earned a spot on the Forbes Cloud 100 for four consecutive years, fostering a collaborative, curious, and driven culture.

Canada

  • Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
  • Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
  • Drive cloud cost efficiency, estimate schedules, and use AI tools as a first-class collaborator in daily workflows.

Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.

Americas 7w PTO

  • Act as a first responder for system incidents and outages, ensuring high availability and performance.
  • Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
  • Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.

Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.

$165,000/yr
US

  • Design, build, and maintain scalable cloud infrastructure services in AWS and GCP.
  • Contribute production-quality Go and Python code to existing cloud services.
  • Develop and own automation and software deployment pipelines with maximum efficiency.

Dragos is dedicated to arming customers with best-in-class technology, threat intelligence, and services to protect their systems. They embody core values of authenticity, transparency, and trust and are a remote-first culture with operations in North America, Europe, the Middle East, and APAC.

US

  • Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
  • Build, operate, and improve observability, monitoring, and incident response processes.
  • Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.

Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.

$115,200–$172,800/yr
US 8w paternity

  • Build internal tooling to help other engineers and the rest of the company understand and operate our system.
  • Design and implement security best practices for our team and infrastructure.
  • Reduce toil through automation, including building and maintaining CI/CD infrastructure.

Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

UK

  • Design, build, and maintain CI/CD pipelines and Infrastructure as Code using tools like CloudFormation, Ansible, and Terraform.
  • Monitor and respond to infrastructure and application health, troubleshoot operational issues, and provide on-call support.
  • Maintain operational documentation, communicate proactively with teams, and ensure service delivery meets client expectations.

NICE Ltd. provides software used by 25,000+ global businesses, including 85 of the Fortune 100, to deliver customer experiences, fight financial crime, and ensure public safety. With over 8,500 employees across 30+ countries, NICE is recognized as a market leader in AI, cloud, and digital innovation.

US

  • Lead and mentor a high-performing team of security engineers, setting technical direction and standards for excellence.
  • Define and execute the security roadmap for infrastructure, remote access, endpoints, and M&A.
  • Design and implement security controls across cloud, production, and corporate environments.

Anduril Industries is a defense technology company transforming U.S. and allied military capabilities with advanced technology, powered by Lattice OS. They bring the expertise and business model of innovative companies to the defense industry, focusing on autonomy, AI, and networking.

US Canada

  • Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
  • Maintain CI/CD reliability and developer tooling across the full engineering org.
  • Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.

Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.

US

  • Lead a platform team building high-throughput messaging, eventing, and notification infrastructure.
  • Manage and grow engineers, driving roadmap planning and cross-team communication.
  • Own delivery reliability for email, chat integrations, and event pub/sub systems at scale.

KnowBe4 empowers the modern workforce to make smarter security decisions every day. Trusted by more than 70,000 organizations worldwide, the company is the pioneer of digital workforce security.

APAC

  • Define and own the APAC infrastructure architecture end-to-end on Azure, including compute, networking, and containerisation.
  • Lead incident response for the region with calm, methodical root cause analysis and durable fixes.
  • Drive infrastructure migrations and PCI-DSS hardening programs across clouds safely.

Tilt is a mobile-first fintech company that uses machine learning to provide credit beyond traditional credit scores. With millions of customers worldwide, they value ownership, excellence, and mutual respect.

SRE

Fal
$180,000–$250,000/yr
US

  • Own and operate our Kubernetes infrastructure.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.

Europe

  • Lead Reliability Engineering for User Experience.
  • Architect for Scale, partnering with product and infrastructure teams to design highly available systems.
  • Drive Automation to eliminate repetitive operational work through tooling and systems.

Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

Europe

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Mirantis helps organizations ship code faster on public and private clouds, providing a public cloud experience on any infrastructure from the data center to the edge. The company serves many of the world's leading enterprises, including Adobe, DocuSign, Liberty Mutual, and PayPal, and is a leader in container management.

US Canada Unlimited PTO

  • Help guide technical direction and contribute to platform architectural strategy.
  • Champion engineering principles and hold the bar on code quality.
  • Elevate engineers around you through pairing and knowledge sharing.

Arctic Wolf is a cybersecurity company that helps organizations end cyber risk. They have a global presence with over 10,000 customers and more than 2,000 channel partners, and are known for their award-winning Aurora Platform.

US Unlimited PTO

  • Lead Onboarding end‑to‑end and extend it with additional use cases.
  • Own a small portfolio of customer accounts and act as a trusted technical partner all year.
  • Provide technical support and communicate crisply with customers throughout.

OpsMill is building the next generation of infrastructure data management, focusing on helping automation teams unify data and scale automation reliably. As a commercial open-source company, they are practitioners who understand the real-world challenges of scaling infrastructure automation.

Global Unlimited PTO

  • Improve the reliability, performance, and scalability of our production platform.
  • Operate reliable infrastructure, improve observability, and drive incident response.
  • Use data-driven reliability practices such as SLIs, SLOs, SLAs, and DORA metrics.

VRChat is a game-changing platform that provides an endless collection of social VR experiences. They empower their community to bring their imaginations to life and help shape the metaverse. Their team includes people from Netflix, Twitter, Meta, and Microsoft.

US

  • Design and implement complex software systems and integrations with minimal oversight.
  • Mentor junior and mid-level engineers through code reviews, design discussions, and pairing sessions.
  • Participate in on-call rotations and drive improvements in monitoring, alerting, and system reliability.

EasyPost is a YC unicorn founded in 2012 that makes shipping simple for businesses from startups to Fortune 500 with a developer-friendly REST API. The company is rapidly growing with a scrappy, fast-moving culture and a team of builders and problem-solvers.