Defining and driving the vision and strategy for Infrastructure Observability.
Identifying gaps in end to end experience, defining and owning the roadmap to fill those gaps.
Working closely across teams and across Orgs, collaborating with Engineering, UX, Design and other teams to deliver on your roadmap.
Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale — unleashing the potential of businesses and people. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter.
Provide production support on a shift according to the team on-call roster.
Work on the customer and internal engineering/implementation team raised tickets while not on-call for production support.
Continuously monitor the health and performance of our services, systems, and infrastructure.
Granicus builds and maintains technology that is transforming the Govtech industry by bringing governments and its constituents together. They serve 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers, and are known for being one of the best companies to work for.
Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).
Take an active role in influencing our roadmap and your own career objectives.
Drive projects from initial ideation all the way to operations once it is in the hands of customers.
Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability.
Grafana Labs is behind the open observability cloud, and is founded on the principles of open source, open standards, open ecosystems, and open culture. They are a 100% remote company with 1,600+ team members across 40+ countries.
Assess and improve visibility by identifying gaps in dashboards, metrics, and logs.
Refine alerts and dashboards for critical services to catch issues earlier.
Automate routine checks and monitoring tasks to free up engineers.
PlayOn is where high school sports come to life through platforms like GoFan, NFHS Network, and MaxPreps. As a growth-stage company backed by KKR, we build the technology that powers high school athletics from ticketing and streaming to fundraising and merchandise.
Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.
Own and operate end-to-end infrastructure for backend services, frontend systems and databases.
Build and maintain reliable deployment workflows including CI/CD pipelines and rollback procedures.
Improve system-wide observability through metrics, logging, alerting, and monitoring to ensure uptime.
Jito Labs builds a high-performance trading terminal on Solana. They are a lean, high-output team building something that sits at the intersection of execution quality, user experience, and on-chain infrastructure.
Manage, hire, and develop a team of engineers, providing regular feedback.
Act as project manager and work with product owners to ensure the product roadmap is up-to-date.
Engage in technical conversations and challenge teams to arrive at strong technical decisions.
Grafana Labs is a remote-first, open-source powerhouse that provides visualization tools and helps companies manage their observability strategies. We value transparency, autonomy, and trust.
Supervise and coordinate NOC activities to ensure network infrastructure availability and efficiency.
Monitor network performance, proactively implementing solutions to prevent service disruptions.
Develop and maintain standard operating procedures and best practices for the NOC team.
Honest Networks delivers high-quality and affordable internet service as a catalyst for community growth, fostering learning, creativity, and enjoyment. They are a rapidly expanding, venture-backed internet service provider headquartered in Manhattan.
Earning the trust of our large-scale operator customers to further Grafana's "big tent" philosophy of data accessibility and to meet clear business objectives.
Designing and leading the development of backend services, distributed systems, and enterprise features at scale.
Driving continuous improvement of our engineering culture through words and actions.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack. The Grafana team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.
Improve the reliability, performance, and scalability of our production platform.
Operate reliable infrastructure, improve observability, and drive incident response.
Use data-driven reliability practices such as SLIs, SLOs, SLAs, and DORA metrics.
VRChat is a game-changing platform that provides an endless collection of social VR experiences. They empower their community to bring their imaginations to life and help shape the metaverse. Their team includes people from Netflix, Twitter, Meta, and Microsoft.
Develop and maintain features as part of Observability solutions in Grafana Cloud.
Contribute to the design and implementation of high-quality, scalable integrations for various infrastructure components, databases, and applications
Build prototypes and present your ideas as part of a cross-functional team
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and thrive in an innovation-driven environment with a global collaborative culture.
Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance.
Develop and maintain comprehensive automation solutions to eliminate toil and streamline operational efficiency.
Proactively monitor production systems and implement automated incident response mechanisms to minimise downtime.
Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. The company is well-established and profitable with over $8 billion in revenue and values diversity and inclusivity.
Take an active role in influencing our roadmap and your own career objectives.
Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability.
Mentor and support other team members, participate in design discussions and collaborate with the team.
Grafana Labs is a remote-first, open-source powerhouse that provides visualization tools. They help companies manage their observability strategies. Grafana Labs has a global collaborative culture, and a passion for meaningful work.
Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
Lead incident response and postmortems in a blameless culture.
Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.
Design, build, and maintain scalable, reliable systems on GCP.
Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.
SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.
Act as the 3rd-level escalation point for complex issues in CDN and Edge Network products, diagnosing advanced problems with caching, DNS, routing, SSL/TLS, and web security.
Take ownership of high-severity incidents and drive resolution by collaborating with Engineering, Network, and Operations teams, and assist with root cause analysis and preventive measures.
Provide technical guidance and mentorship to L1 and L2 Support Engineers, maintain internal documentation, and participate in post-incident reviews and improvement initiatives.
Gcore is a global provider of infrastructure and software solutions for AI, cloud, network, and security, powering real-time communication, streaming, enterprise AI, and secure web applications. The company has a team of over 550 professionals and partners with leading technology firms like Intel and NVIDIA to build foundational infrastructure for an AI-driven world.
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.
Engage with customers to provide technical assistance, troubleshooting, and best-practice guidance.
Diagnose, reproduce, and resolve issues related to agent connectivity, device enrollment, patch deployment, software installation.
Collaborate cross-functionally with Engineering, Customer Success, Professional Services, and Product teams to resolve customer issues.
Automox is a cloud-native IT operations platform for modern organizations, helping to keep every endpoint automatically configured, patched, and secured – anywhere in the world. They are trusted by more than 2,500 leading companies and MSPs worldwide.
Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.