Source Job

Canada Unlimited PTO

  • Design, build, and operate distributed systems powering observability across ClickHouse Cloud.
  • Own reliability, performance, and cost-efficiency of the telemetry pipeline and storage systems.
  • Take part in on-call rotation and drive root-cause resolution and long-term fixes.

Golang Kubernetes Terraform OpenTelemetry Prometheus

20 jobs similar to Cloud Software Engineer - Observability Platform

Jobs ranked by similarity.

US 6w PTO

  • Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

Europe 6w PTO

  • Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
  • Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
  • Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).

US Canada 6w PTO

  • Earning the trust of our large-scale operator customers to further Grafana's "big tent" philosophy of data accessibility and to meet clear business objectives.
  • Designing and leading the development of backend services, distributed systems, and enterprise features at scale.
  • Driving continuous improvement of our engineering culture through words and actions.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack. The Grafana team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

Europe 6w PTO

  • Develop and maintain features as part of Observability solutions in Grafana Cloud.
  • Contribute to the design and implementation of high-quality, scalable integrations for various infrastructure components, databases, and applications
  • Build prototypes and present your ideas as part of a cross-functional team

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and thrive in an innovation-driven environment with a global collaborative culture.

$180,000–$200,000/yr
US

  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
  • Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.

Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.

$126,353–$147,947/yr
Europe 6w PTO

  • Manage, hire, and develop a team of engineers, providing regular feedback.
  • Act as project manager and work with product owners to ensure the product roadmap is up-to-date.
  • Engage in technical conversations and challenge teams to arrive at strong technical decisions.

Grafana Labs is a remote-first, open-source powerhouse that provides visualization tools and helps companies manage their observability strategies. We value transparency, autonomy, and trust.

Unlimited PTO

  • Assess and improve visibility by identifying gaps in dashboards, metrics, and logs.
  • Refine alerts and dashboards for critical services to catch issues earlier.
  • Automate routine checks and monitoring tasks to free up engineers.

PlayOn is where high school sports come to life through platforms like GoFan, NFHS Network, and MaxPreps. As a growth-stage company backed by KKR, we build the technology that powers high school athletics from ticketing and streaming to fundraising and merchandise.

Germany 6w PTO

  • Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
  • Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
  • Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.

Mexico

  • Design systems with resilience, graceful degradation, and capacity in mind.
  • Define and measure SLOs and SLIs that actually reflect what our customers feel.
  • Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.

EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.

$116,449–$139,531/yr
Europe 6w PTO

  • Take an active role in influencing our roadmap and your own career objectives.
  • Drive projects from initial ideation all the way to operations once it is in the hands of customers.
  • Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability.

Grafana Labs is behind the open observability cloud, and is founded on the principles of open source, open standards, open ecosystems, and open culture. They are a 100% remote company with 1,600+ team members across 40+ countries.

$124,845–$146,205/yr
Europe 6w PTO

  • Manage and grow a distributed team of engineers, providing feedback and supporting career development.
  • Partner with product management to shape the Usage squad's roadmap, ensuring alignment with company mission and customer impact.
  • Guide the team through the full project lifecycle, ensuring high-quality and timely outcomes within the Usage domain.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users globally. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

US 5w PTO

  • Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
  • Diagnose debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
  • Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

$210,000–$278,000/yr
US Unlimited PTO

  • Architect future iterations of core systems, addressing scaling requirements.
  • Design and implement developer tools to enhance deployment safety and reproducibility.
  • Drive excellence in monitoring and guide incident response for quick issue resolution.

Found provides tools for self-employed individuals, offering a business bank account that automates taxes and expense tracking. They aim to give self-employed people the security and peace of mind historically available only at large corporations and are looking for kind, resourceful, and passionate people.

US

  • Take part in all development aspects from design through production of a dynamic hybrid/cloud native product.
  • Write high quality, testable and efficient code in Go and Typescript.
  • Initiate and promote new ideas for continuous improvement of the product functionality.

Torq is a cybersecurity company with an AI SOC platform that grabs attention. Backed by Series D funding, they have experienced 200% employee growth and 300% revenue growth, making them one of Forbes' Best Startup Employers in America and a Business Insider 'startup to bet your career on'.

$160,000–$200,000/yr
US

  • Drive the stability and reliability of Epic's GCP infrastructure.
  • Manage and harden our Docker and GKE container platform.
  • Maintain and improve CI/CD pipelines.

Epic is the leading digital reading platform for kids ages 12 and under, used by millions of children, families, and educators around the world. As Epic continues to grow, we are reimagining what reading can be through thoughtful technology, data, and global collaboration to make learning more engaging, accessible, and impactful.

$190,800–$267,100/yr
US

  • Design and build backend systems, APIs, infrastructure, and platform capabilities that improve developer workflows across Reddit.
  • Build scalable and reliable systems across both AI-powered developer workflows and the core non-AI systems engineers rely on every day.
  • Lead high-impact projects across Reddit’s developer tooling ecosystem by writing and reviewing code and design docs, aligning stakeholders, and making pragmatic technical tradeoffs.

Reddit is a community-based platform built on shared interests, passion, and trust, facilitating open and authentic conversations. With over 100,000 active communities and approximately 126 million daily active unique visitors, it serves as one of the internet’s largest sources of information.

$115,200–$172,800/yr
US 8w paternity

  • Build internal tooling to help other engineers and the rest of the company understand and operate our system.
  • Design and implement security best practices for our team and infrastructure.
  • Reduce toil through automation, including building and maintaining CI/CD infrastructure.

Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.

$29,000–$36,000/yr
India

  • Design, build, and maintain scalable, reliable systems on GCP.
  • Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
  • Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.

SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.

Europe

  • Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
  • Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
  • Define and implement our observability strategy across metrics, logs, and tracing.

Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.

Global 16w maternity 16w paternity

  • Lead the design and implementation of self-service platform infrastructure for provisioning, deployment, and observability across engineering teams.
  • Evolve multi-tenant EKS foundations toward better reliability, security, scale, and multi-region connectivity.
  • Set delivery standards using Terraform, GitOps, and progressive rollout, while improving SLOs and alerting on Grafana Cloud.

Docker is a developer tooling company trusted by over 20 million monthly users and 20 billion container image pulls. They are a globally distributed, remote-first team building tools that define how software gets built and delivered.