Source Job

US

  • Design and operate enterprise-grade observability platforms across metrics, logs, traces, and events.
  • Build scalable monitoring stacks with Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and Datadog.
  • Define SLOs, SLIs, error budgets, and alerting strategies to ensure system reliability.

Prometheus Grafana OpenTelemetry

20 jobs similar to Observability Engineer

Jobs ranked by similarity.

Germany 6w PTO

  • Anticipate and support the Solutions Engineering team by designing technical presentations, demos, and white papers.
  • Create and deliver training materials, product workshops, and webinars for internal teams and customers.
  • Partner with Product, Marketing, and Engineering to enable the field with deep technical expertise and strategic support.

Grafana Labs is the company behind the open-source observability platform, providing a fully managed cloud service for monitoring and analytics. With over 1,600 team members across 40+ countries, they foster a global collaborative culture rooted in open source, transparency, and autonomy.

Canada Unlimited PTO

  • Design, build, and operate distributed systems powering observability across ClickHouse Cloud.
  • Own reliability, performance, and cost-efficiency of the telemetry pipeline and storage systems.
  • Take part in on-call rotation and drive root-cause resolution and long-term fixes.

ClickHouse is a real-time analytics and data warehousing company recognized on the 2025 Forbes Cloud 100 list. With over 3,000 customers and rapid growth, the company fosters an innovative and fast-paced culture.

UK 6w PTO

  • Act as a trusted technical partner, guiding organizations through onboarding, implementation, and expansion with white-glove support and best practices.
  • Deliver high-impact training, jumpstart engagements, and provide tailored technical consulting to help customers succeed.
  • Identify recurring issues, monitor support needs, and advocate for product improvements in close collaboration with internal teams.

Grafana Labs is the company behind Grafana, the open observability platform. With over 1,600 team members across 40+ countries, we are a 100% remote company backed by leading investors and trusted by more than 35 million users and 7,000+ customers.

United States 6w PTO

  • Build and operate the internal engineering platform that provides application engineers with the tools, systems, and Kubernetes clusters they need to deploy and run their workloads.
  • Focus on cloud infrastructure, capacity management, security, engineering productivity, monitoring, and US Federal compliance across squads.
  • Participate in on-call rotations to ensure the health of the system and understand how people use our products.

Grafana Labs, the company behind the open observability cloud, is founded on the principles of open source, open standards, open ecosystems, and open culture. We are a 100% remote company with 1,600+ team members across 40+ countries, backed by leading investors including Lightspeed Venture Partners, Sequoia Capital, GIC, Coatue, J.P. Morgan, CapitalG, and Lead Edge Capital.

US Canada 6w PTO

  • Earning the trust of our large-scale operator customers to further Grafana's "big tent" philosophy of data accessibility and to meet clear business objectives.
  • Designing and leading the development of backend services, distributed systems, and enterprise features at scale.
  • Driving continuous improvement of our engineering culture through words and actions.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack. The Grafana team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

UK

  • Design, build, and scale backend services powering a large-scale observability platform, including telemetry ingestion, storage systems, query engines, and alerting pipelines.
  • Develop and optimize distributed systems that process logs, metrics, and traces at high volume with a strong focus on reliability and performance.
  • Collaborate with cross-functional engineering teams to improve system architecture, scalability, and developer experience.

Global

  • Defining and driving the vision and strategy for Infrastructure Observability.
  • Identifying gaps in the end-to-end experience, then defining and owning the roadmap to fill those gaps.
  • Working closely across teams and orgs, collaborating with Engineering, UX, Design, and other teams to deliver on your roadmap.

Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale — unleashing the potential of businesses and people. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter.

US Unlimited PTO

  • Provide frontline technical expertise to help developers deploy and scale Temporal in cloud-native environments.
  • Troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.
  • Collaborate with engineering and product teams to influence platform improvements and enhance developer experience.

Temporal provides an open source programming model that simplifies code and makes applications more reliable. The company is a growing team driven by values of curiosity, collaboration, and humility, focused on improving developer experience.

United States

  • Design, deploy, and operate service mesh platforms (Istio and Linkerd) across multi-cluster Kubernetes environments.
  • Implement mTLS, certificate lifecycle automation, and workload identity propagation for secure communication.
  • Build and enhance observability for service-to-service communication using tracing, metrics, and topology insights.

Jobgether uses AI-powered matching to connect candidates with roles. They focus on efficient hiring processes and data privacy.

$116,449–$139,531/yr
Europe 6w PTO

  • Take an active role in influencing our roadmap and your own career objectives.
  • Drive projects from initial ideation all the way to operations once they are in the hands of customers.
  • Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability.

Grafana Labs is behind the open observability cloud, and is founded on the principles of open source, open standards, open ecosystems, and open culture. They are a 100% remote company with 1,600+ team members across 40+ countries.

Europe

  • Build and operate secure agent runtimes with sandboxing, runtime isolation, and RBAC.
  • Design and maintain integration surfaces with MCP-style adapters and gateways across marketplace teams.
  • Implement observability and cost control including traces, telemetry, and cost-per-workflow.

Zartis is a global AI transformation and technology consulting partner that designs, builds, and scales technology solutions for ambitious organizations. With engineering hubs across EMEA and LATAM and long-term partnerships in financial services, healthcare, and energy, they foster an inclusive culture based on trust and innovation.

Latin America Unlimited PTO 16w maternity 16w paternity

  • Lead customers in strategic application of Honeycomb and observability practices to meet technical and business goals.
  • Act as a trusted advisor on telemetry schema design, data modeling, and sampling strategies.
  • Coach and mentor engineering teams on observability, SRE concepts, and instrumentation best practices.

Honeycomb defines observability for developer tools, working with companies like HelloFresh, Slack, and Vanguard. They are a fully distributed company of over 200 employees, named to Forbes' America's Best Startups in 2022 and 2023, with a culture focused on impact, inclusion, and autonomy.

United States

  • Design and build core platform infrastructure for large-scale cloud-native data and analytics systems.
  • Own and improve CI/CD pipelines, testing frameworks, and deployment in a high-scale PaaS environment.
  • Contribute to reliability engineering, observability, and operational excellence across distributed systems.

Jobgether uses an AI-powered matching process to connect candidates with roles. The company is a growing platform focused on efficient job matching and data privacy compliance.

US Unlimited PTO

  • Build and operate the delivery platform across AWS, EKS, ArgoCD, Helm, and Terraform, fixing production problems and driving root-cause analysis.
  • Standardize CI/CD pipelines using GitHub Actions and Azure DevOps, implement progressive delivery with Argo Rollouts, and build observability with Grafana and Prometheus.
  • Support platform adoption, reduce toil and cost, unblock cross-team delivery, and write documentation to eliminate knowledge silos.

Attain Finance is a leading consumer credit lender with over 50 years of expertise providing credit solutions across the U.S. and Canada. The company employs a dynamic team that fosters innovation and collaboration, with a portfolio including brands like Cash Money, LendDirect, Heights Finance, and others.

Global

  • Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
  • Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.

Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.

Global Unlimited PTO 16w maternity 16w paternity

  • Own the operational excellence and infrastructure strategy for Remote Build's platform, ensuring reliability, performance, and security.
  • Lead incident response, build observability systems, and drive continuous improvement in system reliability.
  • Embed security into infrastructure, optimize costs, and automate operational toil to scale efficiently.

Remote solves modern organizations' biggest challenge of navigating global employment compliantly. With a fully distributed team across 6 continents, the company fosters a future-focused culture with core values of innovation and async work.

Europe

  • Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
  • Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
  • Define and implement our observability strategy across metrics, logs, and tracing.

Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.

Germany

  • Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
  • Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
  • Lead incident response and postmortems in a blameless culture.

Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

Europe

  • Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
  • Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
  • Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.

Reddit is a community of communities, built on shared interests, passion, and trust, home to the most open and authentic conversations on the internet. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.