Build distributed systems that support reliability, resiliency, and safe operation at scale.
Design and operate traffic control mechanisms: circuit breakers, rate limiting, admission control, backpressure, and graceful degradation.
Develop tooling that improves incident detection, response, and automated mitigation.
Whatnot is the largest live shopping platform in North America and Europe to buy, sell, and discover the things you love. They are a remote co-located team, inspired by innovation and anchored in their values.
Evolve replication protocols to make failures a non-event for customers.
Deliver scalability primitives that unlock Temporal Cloud growth.
Raise the bar on safety, observability, and operability of Temporal’s replication layer.
Temporal is an open source programming model company simplifying code and enhancing application reliability. They aim to be the reliable foundation for every developer's toolbox with a growing team valuing curiosity, drive, collaboration, and genuine humility.
Design and implement core backend service features for Nexus.
Provide appropriate unit, integration, and performance test coverage for your feature ownership area.
Clearly document the design choices and operational knowledge needed to deploy and run the service with those features.
Temporal provides an open source programming model simplifying code and enhancing application reliability, allowing developers to focus on feature delivery. They value curiosity, collaboration, and authenticity, fostering a culture that challenges conventional thinking and encourages innovation, aiming to be the reliable foundation of every developer’s toolbox.
Define long-term architectural strategy for multi-cloud compute and traffic platforms.
Provide mentorship to engineers through design reviews and code contributions.
Partner with Security to build ‘secure by default’ systems.
Temporal Technologies develops an open-source programming model that simplifies code and enhances application reliability. With a focus on developer experience and open-source software, they foster a culture of curiosity, collaboration, and genuine impact.
Collaborate with engineering teams to design and implement scalable, secure systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
Enhance incident response processes and post-mortem analysis for outages.
ClickHouse, recognized on the 2025 Forbes Cloud 100 list, is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.
Collaborate with exceptional engineers on building systems and services for the world's largest companies.
Lead architecture for distributed services at scale that synchronize shared state across clients.
Drive cross-team technical alignment via design docs and decision records; unblock execution across org boundaries.
Webflow is building the world’s leading AI-native Digital Experience Platform. They are a remote-first company built on trust, transparency, and creativity, empowering teams to design, launch, and optimize for the web without barriers.
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack.
Improve the scalability and reliability of our core data systems.
Define and evolve how we model, store, and query resource data across Vanta.
Collaborate with product, design, and other engineering teams to understand user needs.
Vanta helps businesses earn and prove trust by providing continuous security monitoring and verification. They empower companies to practice better security and prove it with ease. Vanta has a kind and talented team, with many having succeeded without prior extensive security experience.
Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
OnePay is a consumer fintech company trusted by millions of Americans to make money better, providing an all-in-one financial services platform. Backed by Walmart and Ribbit Capital, OnePay provides banking, savings, credit cards, lending, investing, and crypto services and embedded financial services to frontline workers.
Define and evolve reliability standards for the SmarterDx platform.
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.
Build and scale infrastructure to support billions of messages per day and real-time events
Automate deployments, alerting, and incident response
Tune MySQL and other datastore performance and improve reliability across distributed systems
Customer.io's platform enables over 8,000 companies, from scrappy startups to global brands, to send billions of automated emails, push notifications, in-app messages, and SMS every day. They foster a culture that values empathy, transparency, and responsibility.
Develop automation code to provision and operate infrastructure at scale.
Build resilient, scalable, secure, and observable services with cost optimization.
Proactively identify and address security concerns across systems and infrastructure.
Globality uses AI to transform enterprise spending into a more efficient and inclusive process. They aim to revolutionize enterprise procurement with AI and have a culture built on trust, collaboration, and innovation, fostering an environment where every individual feels valued and included.
Help deploy and configure Dynatrace OneAgent and ActiveGates with automated tooling.
Define and instrument user‑centric metrics and objectives in Dynatrace.
Combine Davis® AI with Copilot/Claude to identify root causes and reduce MTTR.
AWP Safety's IT Internship Program is a hands‑on learning experience for early‑career professionals who want to build a future in IT Site Reliability Engineering. They operate at the intersection of Software Engineering and Systems Operations, using Dynatrace to diagnose performance bottlenecks and automate "toil" out of existence.
Take an active role in influencing our roadmap and your own career objectives.
Work with your team to deliver new features, then use the results to iterate and improve.
Drive projects from initial idea all the way to operations once they are in the hands of customers.
Grafana Labs is a remote-first, open-source powerhouse with over 20M Grafana users globally. With a global collaborative culture, Grafana Labs fosters transparency, autonomy, and trust in an innovation-driven environment.
Build and Lead the Platform Architecture Organization.
Own Production Readiness as a Company Capability.
Drive Operational Excellence and Business Outcomes.
Temporal is an open source programming model that simplifies code, makes applications reliable, and helps developers focus on delivering features faster. They aim to be the reliable foundation of every developer’s toolbox, with a team that embraces curiosity, drive, collaboration, and humility.
Read, understand, and write code and unit tests (primarily in Java)
Investigate, diagnose, and implement improvements for performance bottlenecks and cost inefficiencies
Implement, test, and deploy architecture and library changes which enable new insights and understanding, including cost modeling/reporting and data patterns
Airship helps brands drive revenue growth and customer loyalty with exceptional cross-channel customer experiences. Airship's platform empowers growth-focused teams to create, test, and orchestrate hyper-personalized experiences across all channels.
Design and implement scalable distributed systems that handle heavy CPU, disk, and network workloads.
Analyze system behavior to identify bottlenecks across compute, storage, and network layers.
Build instrumentation, metrics, and telemetry to measure system performance.
RapidFort is a Series A cybersecurity company backed by $42M from leading investors, building the next generation of container and software supply-chain security. Its platform helps enterprises and U.S. government agencies eliminate vulnerabilities in container images, secure Kubernetes environments, and protect cloud-native infrastructure at runtime.
Design and implement distributed scheduling and workflow systems.
Build scalable, reliable platform services and storage abstractions.
Improve system reliability, observability, and operational performance.
Voleon is a technology company that applies state-of-the-art AI and machine learning techniques to real-world problems in finance. They have become a multibillion-dollar asset manager, and they have ambitious goals for the future.
Partner closely with product engineering squads (embedded model)
Own production reliability for high-SLA and complex customer environments
Design and implement automation to scale our reliability practices
Grafana Labs is a remote-first, open-source powerhouse that helps more than 3,000 companies manage their observability strategies. They are scaling fast and staying true to what makes them different: an open-source legacy, a global collaborative culture, and a passion for meaningful work.
Support teammates with goal-setting, professional development, and mentoring.
Ensure delivery of maintainable, high-quality platform systems.
Build and sustain a healthy team culture where ownership and collaboration are the norm.
onX is a pioneer in digital outdoor navigation solutions through its suite of apps. With over 400 employees, they foster a fast-paced, tech-forward environment valuing ownership, accountability, and teamwork.