Partner with product engineering squads to own production reliability for high-SLA customer environments, designing automation and defining per-tenant SLOs.
Serve as a primary escalation point for incidents, leading response, post-incident reviews, and reducing SLO burn to prevent repeats.
Influence feature design for scalability and operability, improve alert quality, and eliminate toil through automation.
Own and operate 100+ multi-cloud streaming clusters and related database infrastructure in production.
Diagnose and eliminate cross-layer failure modes such as object storage latency, noisy neighbors, and query performance regressions.
Design safe upgrade and rollout strategies at scale, improving observability, automation, and operational ergonomics.
Grafana Labs is the company behind the open observability cloud, providing a fully managed observability platform built for scale. With over 35 million users and 7,000+ customers, we are a 100% remote company of 1,600+ team members across 40+ countries, backed by leading investors.
Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.
Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.
Co-own the architecture of cloud infrastructure on Azure and Kubernetes clusters for high throughput and availability.
Drive resilience strategy for global scaling, zero-downtime deployments, and disaster recovery.
Evolve the observability stack with LGTM (Loki, Grafana, Tempo, Mimir) and lead incident response.
Flip is an AI-powered employee experience platform for frontline workers in retail, manufacturing, and logistics. It is a young, rapidly growing tech company with a remote-first culture and offices in Berlin and Stuttgart.
Act as a first responder for system incidents and outages, ensuring high availability and performance.
Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.
Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.
Embed with product and platform teams from early stages to ensure reliability is designed in from the start.
Define production-readiness standards and measurable SLIs/SLOs to guide operational excellence.
Build tooling and infrastructure across AWS, GCP, and Azure using Terraform, and share in the on-call rotation.
We build WebContainers and Bolt.new, an AI-powered app builder that lets you create, edit, and deploy full-stack apps instantly in your browser. We are a fully remote, globally distributed team of passionate engineers serving over 1 million developers monthly.
Take ownership of incident management and operational excellence across cloud infrastructure.
Automate high-risk manual processes and drive reliability gains through engineering.
Own a platform domain such as Temporal, observability, or Kubernetes operations.
Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.
Build and operate the internal engineering platform that provides application engineers with the tools, systems, and Kubernetes clusters they need to deploy and run their workloads.
Focus on cloud infrastructure, capacity management, security, engineering productivity, monitoring, and US Federal compliance across squads.
Participate in on-call rotations to ensure the health of the system and understand how people use our products.
Grafana Labs, the company behind the open observability cloud, is founded on the principles of open source, open standards, open ecosystems, and open culture. We are a 100% remote company with 1,600+ team members across 40+ countries, backed by leading investors including Lightspeed Venture Partners, Sequoia Capital, GIC, Coatue, J.P. Morgan, CapitalG, and Lead Edge Capital.
Design and maintain scalable infrastructure-as-code solutions using Terraform and Kubernetes.
Build and operate observability systems while leading incident response and reliability improvements.
Embed security and compliance practices into infrastructure and optimize system performance and cloud costs.
This partner company builds a next-generation platform enabling AI-driven services across global employment infrastructure. It is a highly distributed, async-first organization where engineers thrive in ownership and autonomy.
Own and evolve observability strategy including monitoring, alerting, dashboards, logging, and distributed tracing.
Define and manage SLIs, SLOs, and reliability metrics, improving MTTD and MTTR through automation.
Build and maintain reliable cloud infrastructure on AWS and Kubernetes while mentoring engineers on SRE best practices.
Filevine is a Legal AI company delivering Legal Operating Intelligence for legal work. Fueled by a team of exceptional collaborators and innovators, Filevine’s rapid growth has earned AI awards and recognition from Deloitte and Inc. as one of the most innovative and fastest-growing technology companies in the country.
Drive the definition and adoption of SLIs and SLOs across services, and reduce toil through automation and improved incident response.
Design and architect Infrastructure as Code solutions for large-scale environments using Docker, Kubernetes, and cloud-native services.
Serve as primary SRE liaison for development teams, influencing architecture and conducting training for clients.
Noctua Technology, LLC is a company that drives digital transformation by treating operations as a software engineering challenge, focusing on cloud native systems. They are a dynamic team seeking a Senior SRE to define strategy and bridge development and operations for clients.
Design, implement, and improve Site Reliability Engineering practices across production environments with a focus on SLOs, SLIs, and error budgets.
Lead incident response processes and build observability strategies including monitoring, logging, alerting, and distributed tracing.
Partner with engineering teams to enhance system reliability, availability, scalability, and operational efficiency.
Oowlish is a rapidly expanding software development company in Latin America that collaborates with premier clients from the United States and Europe to create pioneering digital solutions. Certified as a Great Place to Work, it offers a nurturing environment with opportunities for professional growth and international impact.
Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
Lead incident response, build observability systems, and drive continuous improvement in system reliability.
Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.
Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.
Own and operate customer-facing managed infrastructure across multiple AWS accounts and regions.
Serve as the senior technical escalation point for production incidents and complex configurations.
Contribute to OpenTelemetry distributions and maintain open source projects like Refinery.
Honeycomb provides observability for developer tools, helping companies like HelloFresh and Slack understand their software. They have over 200 employees and were named to Forbes' Best Startups in 2022 and 2023, with a culture that values inclusion and autonomy.
Design, build, and operate distributed systems powering observability across ClickHouse Cloud.
Own reliability, performance, and cost-efficiency of the telemetry pipeline and storage systems.
Take part in the on-call rotation and drive root-cause resolution and long-term fixes.
ClickHouse is a real-time analytics and data warehousing company recognized on the 2025 Forbes Cloud 100 list. With over 3,000 customers and rapid growth, the company fosters an innovative and fast-paced culture.
Design, provision, and manage AWS infrastructure using Terraform and Kubernetes.
Build, operate, and improve observability, monitoring, and incident response processes.
Collaborate with engineering teams on capacity planning, performance optimization, and resilient system design.
Vynca provides comprehensive care for individuals with complex needs, focusing on quality days at home. The company is a close-knit community guided by core values of Excellence, Compassion, Curiosity, and Integrity.
Design and implement high-quality, scalable integrations for observability solutions.
Collaborate with cross-functional teams to deliver features aligned with product strategy.
Participate in on-call rotations and contribute to open-source communities.
Grafana Labs provides an open-source observability platform, Grafana Cloud, that integrates metrics, logs, and traces. With over 1,600 team members across 40+ countries, they maintain a remote-first, collaborative culture backed by leading investors.
Lead the design, development and operation of large-scale, secure observability systems to keep services online and performant.
Deploy and scale Prometheus, Elasticsearch clusters, and high-throughput Kafka data pipelines for millions of customer devices.
Collaborate with the Observability team to build alerting systems, APIs, and self-service monitoring tools using Terraform and multiple languages.
ItD is a new generation consulting and software development company that blends diversity, innovation, and integrity with real business results. It is a woman- and minority-led firm with a global community, empowering employees and offering benefits like medical, dental, vision, 401(k), and career development.
Build and operate the delivery platform across AWS, EKS, ArgoCD, Helm, and Terraform, fixing production problems and driving root-cause analysis.
Standardize CI/CD pipelines using GitHub Actions and Azure DevOps, implement progressive delivery with Argo Rollouts, and build observability with Grafana and Prometheus.
Support platform adoption, reduce toil and cost, unblock cross-team delivery, and write documentation to eliminate knowledge silos.
Attain Finance is a leading consumer credit lender with over 50 years of expertise providing credit solutions across the U.S. and Canada. The company employs a dynamic team that fosters innovation and collaboration, with a portfolio including brands like Cash Money, LendDirect, Heights Finance, and others.
Own and evolve AWS infrastructure using Terraform, managing EKS clusters, databases, and core services.
Maintain CI/CD reliability and developer tooling across the full engineering org.
Lead incident response, drive post-incident reviews, and improve monitoring and alerting standards.
Babylist is the leading platform for expecting and new families, helping parents feel confident, connected, and cared for at every step. As a modern, AI-forward tech company with over 10 million yearly shoppers, Babylist has expanded into a full ecosystem and generated $750M in revenue in 2025, reshaping the $235B kids and baby market.
Design, write, and deliver software to implement and support large web-scale, highly performant, highly available infrastructure on GCP/AWS.
Monitor infrastructure, respond to incidents, correct and improve systems to prevent incidents, and plan capacity.
Tune large-scale clusters for optimal performance and efficiency and support system deployments and product releases.
OpenX develops digital advertising marketplaces and technologies to optimize ad delivery for publishers and advertisers. The company operates a large-scale cloud infrastructure in Poland and values teamwork, customer centricity, and continuous learning.