Help drive reliability, automation and performance within our cloud-based infrastructure.
Become embedded within an Engineering team helping them navigate production excellence and advocate for best practices.
Debug production issues across services and all levels of the stack, and practice incident response and blameless postmortems.
Flywire is a global payments enablement and software company that was founded over a decade ago. They have over 1,200 global FlyMates, representing more than 40 nationalities, in 12 offices worldwide, and are looking for people to join the next stage of their journey as they continue to grow.
Set the vision and drive execution for Reliability Engineering at Affirm
Build software and program management structure to perform continual risk management across the entire Affirm system and Engineering organization
Hire and build a global team of SREs, system engineers, and full stack engineers
Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. They are a remote-first company with competitive benefits anchored to their core value of "people come first."
Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement.
Participate in an on-call rotation and act as incident commander for high-severity production events.
Partner with engineering teams to build reliability into new features before they ship to production
Akuity helps enterprises ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane.
Maximize the velocity of our product engineering team.
Ensure platform scalability, reliability, and security.
Champion best practices and shape the engineering culture.
They are building a robust, scalable trading platform to serve high-traffic, latency-sensitive applications. They leverage state-of-the-art technologies to support real-time trading while providing unparalleled reliability and performance.
Define and execute the reliability engineering roadmap.
Establish SLO/SLI/error budget frameworks for system stability.
Drive continuous improvement through DORA metrics and analysis.
Jobgether leverages AI for HR solutions. They focus on connecting talent with opportunities, using AI-driven matching to ensure fair and objective application reviews.
Set the vision and drive execution for Reliability Engineering.
Build software and program management structure to perform continual risk management.
Hire and build a global team of SREs, system engineers, and full stack engineers.
Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. They are a remote-first company that values learning, experimentation, and accountability.
Design and implement comprehensive monitoring strategies.
Take ownership of production incident response, lead incident handling, and drive remediation.
Continuously improve operational processes, reliability practices, and team readiness.
InvestorFlow delivers industry specialized CRM and digital portals to help alternative asset firms find opportunities, create and manage relationships, and turn relationship insights into action. They serve over 175 clients, including 25 of the top 50 alternative asset managers, managing more than $6 trillion in assets.
Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
Fixify is on a mission to reimagine how IT teams support companies. They need a Senior Site Reliability Engineer who finds joy in building systems that fade into the background, empowering product engineers to ship with confidence and their customers to work without interruption.
Define and evolve reliability standards for the SmarterDx platform.
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.
Own the end-to-end lifecycle (design, provisioning, upgrades, and decommissioning) of core platform components.
Lead the design and implementation of infrastructure bootstrap orchestration, including: Automated cluster and environment provisioning.
Apply and promote SRE practices across the platform, including: Clear ownership and runbooks for platform components.
Pismo provides a comprehensive processing platform for banking, card issuing and financial market infrastructure and helps customers innovate and build the next generation of banking and payment solutions. Pismo’s 500+ employees are located in more than 10 countries around the world.
Analyze, troubleshoot and resolve operational challenges contributing to defined SLOs.
Manage site stability, performance, and reliability, and maintain uptime for production environments.
CentralReach provides autism and IDD care software for Applied Behavior Analysis (ABA), multidisciplinary therapy, and special education. They are trusted by more than 200,000 users and are backed by Roper Technologies, Inc. (Nasdaq: ROP). Their culture is centered around impact, inclusion, and flexibility.
Act as the primary responder for high-priority production incidents during the Australian business day.
Work with the core product team to identify recurring support patterns and develop automated fixes or feature enhancements.
Participate in daily handovers with EMEA and US teams to ensure seamless continuity of operations.
EngFlow helps developers save time by accelerating software builds and tests. They are backed by top investors and are redefining how companies build software, with solutions that speed up builds and an observability platform for actionable insights.
Collaborate with application engineering teams on platform infrastructure.
Enhance observability and spearhead the adoption of SRE best practices.
Build and maintain reliable CI/CD pipelines, tooling, and infrastructure.
Rula strives to provide quality, evidence-based, compassionate mental healthcare and aims to create a world where mental health is no longer stigmatized. They are a remote-first company operating in most U.S. states, and are dedicated to having a culture of inclusion that supports their employees.
Lead the Infrastructure Engineering team, taking full ownership of cloud infrastructure, Kubernetes platforms, DevOps tooling, and CI/CD pipelines.
Drive reliability, scalability, and security across the production environment while maintaining a sharp focus on developer velocity and business impact.
Mentor and guide engineers across SRE, DevOps, and Database Reliability functions, fostering a culture of operational excellence and pragmatic problem-solving.
Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs with an all-in-one B2B platform. They have raised $346 million, are expanding across key EU markets, and foster innovation, prioritizing research and solutions that benefit users, employees, partners, and the business.
Lead, coach, and grow a team of highly effective engineers, fostering a culture of continuous learning and high performance.
Own the end-to-end vulnerability lifecycle, ensuring the organization meets strict remediation SLAs and prioritizes risks based on actual business impact.
Partner with DevOps and Engineering teams to integrate security earlier in the SDLC, ensuring vulnerabilities are identified and remediated during the design and build phases.
ServiceNow is a global market leader that brings innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Their intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work.
Help deploy and configure Dynatrace OneAgent and ActiveGates with automated tooling.
Define and instrument user‑centric metrics and objectives in Dynatrace.
Combine Davis® AI with Copilot/Claude to identify root causes and reduce MTTR.
AWP Safety's IT Internship Program is a hands‑on learning experience for early‑career professionals who want to build a future in IT Site Reliability Engineering. They operate at the intersection of Software Engineering and Systems Operations, using Dynatrace to diagnose performance bottlenecks and automate "toil" out of existence.
Execute expert-level real-time monitoring and incident dispositioning for critical client applications.
Correlate complex data across metrics, traces, and logs to perform deep-dive root cause analysis.
Lead the triage of complex alerting environments to filter noise and ensure that high-priority incidents are managed.
Atmosera empowers businesses to redefine what's possible with modern technology and human expertise. They enable organizations to accelerate innovation, enhance security, and optimize operational agility as a Microsoft Partner.
Guide and support a team of developers through coaching, career development, and regular feedback conversations
Partner with product teams to identify reliability challenges and create solutions that improve the client experience
Promote best practices for production engineering and help establish patterns that scale across the organization
Wealthsimple aims to provide financial freedom to everyone by making money management transparent and low-cost through smart technology. As the largest fintech company in Canada, they serve 3+ million users and manage over $100 billion in assets, fostering a collaborative, humble culture focused on quality.
Partner with Product and Engineering Leads to translate strategy and OKRs into coordinated execution plans.
Lead cross-team efforts spanning product, infrastructure, and platform capabilities ensuring alignment.
Contribute to how PointClickCare executes at scale, evolving shared tools, processes, and frameworks.
PointClickCare is evolving toward an empowered, product-led organization, where outcomes define success and flow defines health. They operate at the intersection of product, engineering, and platform, driving systems that enable every team to deliver with purpose, speed, and measurable impact.
Own the end‑to‑end lifecycle of core platform components, including cloud infrastructure primitives and Kubernetes clusters.
Design platform components to be resilient by default, applying SRE principles like fault isolation and capacity planning.
Drive Infrastructure‑as‑Code and GitOps‑first practices to ensure platform components are reproducible and auditable.
Pismo, founded in 2016, provides a comprehensive processing platform for banking, card issuing, and financial market infrastructure, helping customers innovate in banking and payments. With over 500 employees across 10+ countries, Pismo joined Visa in 2024, leveraging Visa’s solutions to advance financial technology.