Maximize the velocity of our product engineering team.
Ensure platform scalability, reliability, and security.
Champion best practices and shape the engineering culture.
They are building a robust, scalable trading platform to serve high-traffic, latency-sensitive applications. They leverage state-of-the-art technologies to support real-time trading while providing unparalleled reliability and performance.
Collaborate with application engineering teams on platform infrastructure.
Enhance observability and spearhead the adoption of SRE best practices.
Build and maintain reliable CI/CD pipelines, tooling, and infrastructure.
Rula strives to provide quality, evidence-based, compassionate mental healthcare and aims to create a world where mental health is no longer stigmatized. They are a remote-first company operating in most U.S. states, and are dedicated to having a culture of inclusion that supports their employees.
Own the end-to-end lifecycle (design, provisioning, upgrades, and decommissioning) of core platform components.
Lead the design and implementation of infrastructure bootstrap orchestration, including: Automated cluster and environment provisioning.
Apply and promote SRE practices across the platform, including: Clear ownership and runbooks for platform components.
Pismo provides a comprehensive processing platform for banking, card issuing and financial market infrastructure and helps customers innovate and build the next generation of banking and payment solutions. Pismo’s 500+ employees are located in more than 10 countries around the world.
Own the end‑to‑end lifecycle of core platform components, including cloud infrastructure primitives and Kubernetes clusters.
Design platform components to be resilient by default, applying SRE principles like fault isolation and capacity planning.
Drive Infrastructure‑as‑Code and GitOps‑first practices to ensure platform components are reproducible and auditable.
Pismo, founded in 2016, provides a comprehensive processing platform for banking, card issuing, and financial market infrastructure, helping customers innovate in banking and payments. With over 500 employees across 10+ countries, Pismo joined Visa in 2024, leveraging Visa’s solutions to advance financial technology.
Define and evolve reliability standards for the SmarterDx platform.
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.
Help drive reliability, automation and performance within our cloud-based infrastructure.
Become embedded within an Engineering team helping them navigate production excellence and advocate for best practices.
Debug production issues across services and levels of the stack as well as practice incident response and blameless postmortems.
Flywire is a global payments enablement and software company that was founded over a decade ago. They have over 1,200 global FlyMates, representing more than 40 nationalities, in 12 offices worldwide, and are looking for people to join the next stage of their journey as they continue to grow.
Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
Fixify is on a mission to reimagine how IT teams support companies. They need a Senior Site Reliability Engineer who finds joy in building systems that fade into the background, empowering product engineers to ship with confidence and their customers to work without interruption.
Analyze, troubleshoot, and resolve operational challenges contributing to defined SLOs.
Manage site stability, performance, and reliability, and maintain uptime for production environments.
CentralReach provides autism and IDD care software for Applied Behavior Analysis (ABA), multidisciplinary therapy, and special education. They are trusted by more than 200,000 users and are backed by Roper Technologies, Inc. (Nasdaq: ROP). Their culture is centered around impact, inclusion, and flexibility.
Build and operate cutting-edge cloud infrastructure to support Diagrid's core products
Define standards, deliver tools, processes, and frameworks to make our products secure, reliable, efficient, and highly available
Build and maintain CI/CD pipelines that enable delivering software quickly and securely across clouds
Diagrid believes that open-source software, open standards and APIs are the greatest transformational tools for organizations. They provide developers with APIs and tools that help them focus on their code and not on infrastructure. The company was founded by the creators of the Dapr and KEDA open-source projects.
Design and implement comprehensive monitoring strategies.
Take ownership of production incident response, lead incident handling, and drive remediation.
Continuously improve operational processes, reliability practices, and team readiness.
InvestorFlow delivers industry specialized CRM and digital portals to help alternative asset firms find opportunities, create and manage relationships, and turn relationship insights into action. They serve over 175 clients, including 25 of the top 50 alternative asset managers, managing more than $6 trillion in assets.
Lead the Infrastructure Engineering team, taking full ownership of cloud infrastructure, Kubernetes platforms, DevOps tooling, and CI/CD pipelines.
Drive reliability, scalability, and security across the production environment while maintaining a sharp focus on developer velocity and business impact.
Mentor and guide engineers across SRE, DevOps, and Database Reliability functions, fostering a culture of operational excellence and pragmatic problem-solving.
Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs with an all-in-one B2B platform. They have raised $346 million, are expanding across key EU markets, and foster innovation, prioritizing research and solutions that benefit users, employees, partners, and the business.
Collaborate with engineering teams to design and implement scalable, secure systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
Enhance incident response processes and post-mortem analysis for outages.
ClickHouse, recognized on the 2025 Forbes Cloud 100 list, is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.
Develop automation to eliminate manual and repetitive operational tasks.
Investigate and resolve customer complaints escalated beyond L1 and L2 support.
Moniepoint is an all-in-one financial services platform for emerging markets. Since 2019, Moniepoint’s technology has powered over 3 million people, offering personal and business banking, payment, credit and business management tools to help them succeed.
Support and operate Legion’s AWS-based cloud platform and Kubernetes (EKS) environments.
Build and maintain infrastructure-as-code using Terraform.
Improve CI/CD pipelines to increase deployment safety and velocity.
Legion Technologies delivers the industry’s most innovative workforce management platform. The AI-driven Legion WFM platform maximizes labor efficiency and employee engagement. They are a remote, mission-driven team that embraces a collaborative, fast-paced, and entrepreneurial culture.
Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.
Ensure reliability of multi-cloud Kubernetes clusters and pipelines.
Focus on automation so we can spend energy where it matters.
Cresta is on a mission to turn every customer conversation into a competitive advantage by unlocking the true potential of the contact center. Their platform combines the best of AI and human intelligence to help contact centers discover customer insights and behavioral best practices.
Partner closely with product engineering squads (embedded model)
Own production reliability for high-SLA and complex customer environments
Design and implement automation to scale our reliability practices
Grafana Labs is a remote-first, open-source powerhouse that helps more than 3,000 companies manage their observability strategies. They are scaling fast and staying true to what makes them different: an open-source legacy, a global collaborative culture, and a passion for meaningful work.
Implement SLI/SLO frameworks with error budgets to drive reliability decisions
Design release strategies including blue/green deployments and version tracking
Lead incident response and develop automated runbooks to reduce MTTR
Jobgether is a company that helps connect individuals with jobs through an AI-powered matching process. They ensure applications are reviewed quickly, objectively, and fairly against roles' core requirements.
Help deploy and configure Dynatrace OneAgent and ActiveGates with automated tooling.
Define and instrument user‑centric metrics and objectives in Dynatrace.
Combine Davis® AI with Copilot/Claude to identify root causes and reduce MTTR.
AWP Safety's IT Internship Program is a hands‑on learning experience for early‑career professionals who want to build a future in IT Site Reliability Engineering. They operate at the intersection of Software Engineering and Systems Operations, using Dynatrace to diagnose performance bottlenecks and automate "toil" out of existence.
Define and execute the reliability engineering roadmap.
Establish SLO/SLI/error budget frameworks for system stability.
Drive continuous improvement through DORA metrics and analysis.
Jobgether leverages AI for HR solutions. They focus on connecting talent with opportunities, using AI-driven matching to ensure fair and objective application reviews.
Build and maintain CI/CD pipelines and GitOps workflows across a diverse set of engineering teams.
Own observability — monitoring, alerting, logging — and support development teams in instrumenting their services.
Optimise infrastructure for security, cost, performance and reliability.
1inch is a decentralized finance (DeFi) platform. It empowers users to access the best rates and execute efficient and secure trades across multiple liquidity sources.