Design systems with resilience, graceful degradation, and capacity in mind.
Define and measure SLOs and SLIs that actually reflect what our customers feel.
Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.
EarnIn is building products that deliver real-time financial flexibility for people living paycheck to paycheck and the unique needs that come with it. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.
Provide production support on a shift according to the team on-call roster.
Work on tickets raised by customers and by internal engineering/implementation teams while not on call for production support.
Continuously monitor the health and performance of our services, systems, and infrastructure.
Granicus builds and maintains technology that is transforming the Govtech industry by bringing governments and their constituents together. They serve 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers, and are known for being one of the best companies to work for.
Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.
Develop and maintain features as part of Observability solutions in Grafana Cloud.
Contribute to the design and implementation of high-quality, scalable integrations for various infrastructure components, databases, and applications.
Build prototypes and present your ideas as part of a cross-functional team.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and thrive in an innovation-driven environment with a global collaborative culture.
Design, build, and maintain scalable, reliable systems on GCP.
Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.
SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.
Build internal tooling to help other engineers and the rest of the company understand and operate our system.
Design and implement security best practices for our team and infrastructure.
Reduce toil through automation, including building and maintaining CI/CD infrastructure.
Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.
Provide technical leadership for infrastructure, reliability, and observability.
Own the observability stack using Datadog and CloudWatch.
Design and evolve AWS infrastructure for reliability, security, scalability, and cost efficiency.
Topstep offers an engaging working environment that ranges from fully remote to hybrid. They foster a culture of collaboration by keeping cameras on during meetings and maintaining a robust Slack environment for communication.
Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
Lead incident response and postmortems in a blameless culture.
Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.
Defining and driving the vision and strategy for Infrastructure Observability.
Identifying gaps in the end-to-end experience, then defining and owning the roadmap to fill those gaps.
Working closely across teams and orgs, collaborating with Engineering, UX, Design, and other teams to deliver on your roadmap.
Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale — unleashing the potential of businesses and people. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter.
Drive major technical initiatives from design through production, improving scalability, reliability, and correctness across critical systems.
Design and evolve backend services, APIs, event-driven workflows, and data models that support complex business processes at scale.
Improve the operational foundations of the platform through better observability, testing, deployment safety, and incident reduction.
Tem is rebuilding the energy transaction, making it transparent and fair. They aim to put power back in the hands of customers and tackle the critical problem of access to low-cost electricity, leveraging AI-driven infrastructure for efficient and sustainable energy markets.
Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure.
Implementing and utilizing configuration management and deployment tools.
Leading continuous improvement by automating the installation, configuration, and maintenance of services on our platform.
The Wikimedia Foundation operates Wikipedia and other Wikimedia free knowledge projects with the vision of a world where every single human can freely share in the sum of all knowledge. As a charitable, not-for-profit organization, it relies on donations and has staff members based in 40+ countries.
Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance.
Develop and maintain comprehensive automation solutions to eliminate toil and streamline operational efficiency.
Proactively monitor production systems and implement automated incident response mechanisms to minimise downtime.
Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. The company is well-established and profitable with over $8 billion in revenue and values diversity and inclusivity.
Own and maintain data pipeline architectures, ensuring reliability and monitoring.
Manage and evolve data modeling environments for analysts and engineers.
Implement observability for data systems, detecting issues early and continuously monitoring data quality.
Voltus unlocks the full value of distributed energy resources for customers and the grid. They are a fast-growing climate-tech company with a bright, gritty, and good team that values innovation, impact, and integrity.
Architect for Scale, partnering with product and infrastructure teams to design highly available systems.
Drive Automation to eliminate repetitive operational work through tooling and systems.
Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.
Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack. Their team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything they do.
Own and operate end-to-end infrastructure for backend services, frontend systems and databases.
Build and maintain reliable deployment workflows including CI/CD pipelines and rollback procedures.
Improve system-wide observability through metrics, logging, alerting, and monitoring to ensure uptime.
Jito Labs builds a high-performance trading terminal on Solana. They are a lean, high-output team building something that sits at the intersection of execution quality, user experience, and on-chain infrastructure.
Drive the stability and reliability of Epic's GCP infrastructure.
Manage and harden our Docker and GKE container platform.
Maintain and improve CI/CD pipelines.
Epic is the leading digital reading platform for kids ages 12 and under, used by millions of children, families, and educators around the world. As Epic continues to grow, they are reimagining what reading can be through thoughtful technology, data, and global collaboration to make learning more engaging, accessible, and impactful.
Own the technical direction of Remote's SRE/Platform domain.
Define and drive the reliability strategy across the platform.
Identify and lead AI enablement initiatives across the engineering organisation.
Remote is solving modern organizations’ biggest challenge: navigating global employment compliantly and with ease. With their core values at heart and a future-focused work culture, their team works tirelessly on ambitious problems, asynchronously, around the world.
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.
Design, build, and operate reconciliation systems to track desired stack state, detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration.
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient.
Improve operational efficiency by reducing deployment complexity and contributing to the Stack Config Reconciliation project.
Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).