Source Job

Global

  • Defining and driving the vision and strategy for Infrastructure Observability.
  • Identifying gaps in the end-to-end experience, then defining and owning the roadmap to fill those gaps.
  • Working closely across teams and orgs, collaborating with Engineering, UX, Design, and other teams to deliver on your roadmap.

Product Management Observability Cloud Infrastructure AI

20 jobs similar to Principal Product Manager Infrastructure, Observability

Jobs ranked by similarity.

$180,000–$200,000/yr
US

  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
  • Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.

Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.

Unlimited PTO

  • Assess and improve visibility by identifying gaps in dashboards, metrics, and logs.
  • Refine alerts and dashboards for critical services to catch issues earlier.
  • Automate routine checks and monitoring tasks to free up engineers.

PlayOn is where high school sports come to life through platforms like GoFan, NFHS Network, and MaxPreps. As a growth-stage company backed by KKR, we build the technology that powers high school athletics from ticketing and streaming to fundraising and merchandise.

Europe 6w PTO

  • Develop and maintain features as part of Observability solutions in Grafana Cloud.
  • Contribute to the design and implementation of high-quality, scalable integrations for various infrastructure components, databases, and applications.
  • Build prototypes and present your ideas as part of a cross-functional team.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and thrive in an innovation-driven environment with a global collaborative culture.

$126,353–$147,947/yr
Europe 6w PTO

  • Manage, hire, and develop a team of engineers, providing regular feedback.
  • Act as project manager and work with product owners to ensure the product roadmap is up-to-date.
  • Engage in technical conversations and challenge teams to arrive at strong technical decisions.

Grafana Labs is a remote-first, open-source powerhouse that provides visualization tools and helps companies manage their observability strategies. We value transparency, autonomy, and trust.

$205,000–$235,000/yr
US

  • Provide technical leadership for infrastructure, reliability, and observability.
  • Own the observability stack using Datadog and CloudWatch.
  • Design and evolve AWS infrastructure for reliability, security, scalability, and cost efficiency.

Topstep offers an engaging working environment, with arrangements ranging from fully remote to hybrid. They foster a culture of collaboration by keeping cameras on during meetings and maintaining a robust Slack environment for communication.

Europe 6w PTO

  • Develop and own the product vision and strategy for Data Collection, Transformation, and Ingestion as a core platform capability for Grafana Cloud.
  • Partner closely with senior R&D leaders to align product and technical strategy across teams, make clear tradeoffs, and ensure the roadmap balances customer value, platform leverage, operational excellence, and business impact.
  • Use customer research, product analytics, competitive insight, and business context to identify the highest-impact problems to solve.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.

US

  • Own the AI observability category narrative, defining what it is, why it matters, and why Monte Carlo leads it.
  • Craft differentiated positioning that lands with technical practitioners and executive buyers.
  • Build and execute GTM plans for new product releases, including messaging frameworks and launch plans.

Monte Carlo is an Agent Trust Platform that unifies data and agent observability to monitor, troubleshoot, and improve production AI systems. Founded in 2019 and backed by leading investors, Monte Carlo empowers data and AI teams to ship trusted AI at scale.

Canada

  • Own the end-to-end infrastructure product vision, including installers, deployment tooling, reference architectures, and operational patterns.
  • Define and evolve a cohesive infrastructure roadmap aligned with Platform architecture, customer needs, and GTM strategy.
  • Partner closely with Product Leadership to balance near-term customer needs with long-term platform scalability and repeatability.

Mechanical Orchard is reinventing how the world’s most critical software gets modernized, focusing on system behavior to turn modernization into a repeatable process. They are an applied AI company challenging industry assumptions and prioritizing quality, rigor, and progress.

Global 6w PTO

  • Develop and evolve monitoring infrastructure for high-load production systems to improve observability and reliability.
  • Build and maintain Grafana dashboards, uptime monitoring, and alerting systems, optimizing integrations using Elasticsearch.
  • Improve incident detection and collaborate with Infrastructure, DevOps, and Engineering teams on monitoring solutions.

Social Discovery Group (SDG) is one of the world's largest groups of social discovery companies, uniting millions of users on dozens of products designed to solve loneliness, isolation, and disconnection. The company has an international team of 1000+ professionals and digital nomads working remotely from over 150 countries, with a culture recognized as a 'Great Place to Work' and a top company for work-from-anywhere jobs.

$124,845–$146,205/yr
Europe 6w PTO

  • Manage and grow a distributed team of engineers, providing feedback and supporting career development.
  • Partner with product management to shape the Usage squad's roadmap, ensuring alignment with company mission and customer impact.
  • Guide the team through the full project lifecycle, ensuring high-quality and timely outcomes within the Usage domain.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users globally. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

Canada US 4w PTO

  • Lead and grow high-performing platform engineering teams that deliver reliable, scalable infrastructure and operational excellence for Vanta’s products and customers.
  • Set technical direction and drive multi-quarter platform initiatives spanning infrastructure reliability, security, scalability, and developer experience across shared systems and services.
  • Partner closely with product engineering, security, and engineering leadership to identify organizational needs and deliver scalable platform solutions.

Vanta helps businesses earn and prove trust by empowering companies to practice better security and prove it with ease. They have a kind and talented team, and while some have prior security experience, many have been successful without it.

Mexico

  • Design systems with resilience, graceful degradation, and capacity in mind.
  • Define and measure SLOs and SLIs that actually reflect what our customers feel.
  • Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.

EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.

Europe 5w PTO

  • Build and improve scalable infrastructure operations processes that support a growing cloud platform.
  • Enable customer-facing and operational teams with secure automation, diagnostics, tooling and clear workflows.
  • Reduce repeatable manual work by identifying operational pain points and turning them into automated or self-service solutions.

NexGen Cloud delivers on-demand and private GPU infrastructure to a wide array of customers. They're a tight-knit, fast-moving team working at the cutting edge of AI cloud infrastructure, equipping their people with AI at every level.

US Unlimited PTO 16w maternity

  • Lead and grow high-performing platform engineering teams.
  • Set technical direction and drive multi-quarter platform initiatives.
  • Design and evolve internal platforms for product teams.

Vanta's mission is to help businesses earn and prove trust by making security continuous and verifiable. They empower companies to practice better security, automating security monitoring for compliance standards. Vanta has a kind and talented team.

US Canada 6w PTO

  • Earning the trust of our large-scale operator customers to further Grafana's "big tent" philosophy of data accessibility and to meet clear business objectives.
  • Designing and leading the development of backend services, distributed systems, and enterprise features at scale.
  • Driving continuous improvement of our engineering culture through words and actions.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack. The Grafana team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.

Germany

  • Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
  • Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
  • Lead incident response and postmortems in a blameless culture.

Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.

$200,000–$225,000/yr

  • Lead the evaluation, adoption, and execution of technology initiatives.
  • Recruit, mentor, and motivate a high-performance operations staff.
  • Drive operational excellence through structured incident, problem, and change management practices.

Business Wire is a press release distribution company. The company's total rewards include remote work, health benefits, fitness allotment, and a 401(k) plan.

$188,550–$212,150/yr
Global Unlimited PTO

  • Own the technical direction of Remote's SRE/Platform domain.
  • Define and drive the reliability strategy across the platform.
  • Identify and lead AI enablement initiatives across the engineering organisation.

Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.

Germany Sweden Spain Ireland UK 6w PTO

  • Lead a team covering corporate systems, employee device lifecycles, helpdesk queues, and internal tooling development.
  • Handle corporate security initiatives and compliance checks in an employee-enabling way.
  • Help define the wider internal AI rollout, enablement, and security strategy across teams.

Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack and thrive in an innovation-driven environment, scaling fast while staying true to their open-source legacy and a global collaborative culture.

US

  • Own and operate end-to-end infrastructure for backend services, frontend systems and databases.
  • Build and maintain reliable deployment workflows including CI/CD pipelines and rollback procedures.
  • Improve system-wide observability through metrics, logging, alerting, and monitoring to ensure uptime.

Jito Labs builds a high-performance trading terminal on Solana. They are a lean, high-output team building something that sits at the intersection of execution quality, user experience, and on-chain infrastructure.