Implementing improvements to the reliability, fault tolerance, scalability, and performance of our infrastructure
Managing incidents using your technical know-how to involve the appropriate teams and automate away manual practices
Improving observability across our systems (metrics, logs, tracing) to reduce time to detection and resolution
Newton is changing how Canadians trade crypto, with the goal of making financial freedom achievable for everyone by giving customers the tools and knowledge needed to navigate the crypto world. They are a remote team spread across Canada that values pushing boundaries and getting things done.
Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
Fixify is on a mission to reimagine how IT teams support companies. They need a Senior Site Reliability Engineer who finds joy in building systems that fade into the background, empowering product engineers to ship with confidence and their customers to work without interruption.
Designing and implementing SLI/SLO frameworks with error budgets to guide reliability and performance decisions.
Building and maintaining AWS-based production infrastructure using Infrastructure as Code (Terraform, CloudFormation), including ECS, EKS/Kubernetes, and microservices orchestration.
Developing internal tools, automation frameworks, and reliability services in TypeScript, Python, or similar languages to enhance operational efficiency.
Jobgether uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. They identify the top-fitting candidates, and this shortlist is then shared directly with the hiring company.
Define and evolve reliability standards for the SmarterDx platform.
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.
Design, deploy, and manage scalable and highly available cloud infrastructure on AWS.
Design reusable Terraform/OpenTofu modules following DRY principles and organizational standards.
Implement AIOps practices, leveraging AI tools to enhance monitoring, incident response, and predictive alerting.
DistroKid is the world’s largest distributor of music to Spotify, Apple Music, YouTube, and beyond, empowering millions of independent artists to get their music into streaming services and keep 100% of their earnings. They move fast, stay curious, and build tools that directly impact how artists share their music with the world.
Build Reliable Cloud Infrastructure: Implement and maintain AWS infrastructure using Terraform across EKS, Lambda, EC2, and S3.
Improve Developer Workflows: Contribute to CI/CD pipelines, starter kits, and internal tooling that reduce manual effort and improve deployment confidence.
Strengthen Observability & Operations: Add monitoring, logging, and alerting (DataDog) to platform services and participate in an on-call rotation.
Spreetail helps brands increase their ecommerce market share globally while improving operational costs. They are building one of the fastest-growing ecommerce companies in history with a focus on innovation.
Collaborate with service engineering teams to design, implement, and maintain scalable and resilient infrastructure solutions.
Implement SRE principles to improve system reliability and reduce downtime.
Improve developer workflows by creating self-service tools, optimizing CI/CD pipelines, and enhancing deployment processes.
Flex is a growth-stage FinTech company creating the best rent payment experience. They empower renters with flexibility over their most significant recurring expense and are growing quickly with a focus on building an inclusive culture.
Maximize the velocity of our product engineering team.
Ensure platform scalability, reliability, and security.
Champion best practices and shape the engineering culture.
They are building a robust, scalable trading platform to serve high-traffic, latency-sensitive applications. They leverage state-of-the-art technologies to support real-time trading while providing unparalleled reliability and performance.
Using automation and Infrastructure as Code (IaC) to continuously improve the reliability, scalability, and performance of services deployed on AWS.
Performance tuning and configuration of both Linux system and application parameters supporting highly concurrent web stacks.
Manage infrastructure through code using configuration management and IaC templating software such as Terraform and Puppet.
Open LMS is a Moodle-based Learning Management System that helps educators improve the learning experience and outcomes of millions of learners across the globe. Open LMS continually innovates to better enable educators, parents, and learners of all types to teach, learn, connect, and communicate whenever they want and wherever they are.
Develop and maintain resilient, cost-efficient infrastructure using AWS and other cloud services to meet evolving business needs.
Use IaC solutions to enable automated provisioning and ensure consistency across all environments.
Design, develop, and maintain advanced pipelines, ensuring automated testing integration and deployment efficiency at scale.
Pagefreezer's vision is to make the Internet a safer place by delivering solutions that transform how people protect integrity online, ensure accountability, and enable the pursuit of justice. They simplify compliance and litigation by automatically archiving websites, social media, mobile text messages, and enterprise collaboration platforms. They have been recognized for their company culture, having been named Canada’s Most Admired Culture 2023, 2024 and 2025, one of BC’s Top Employers 2024, and one of Canada’s Top Small & Medium Employers for 2024.
Lead infrastructure initiatives across the engineering organization.
Define the technical quality bar and architectural standards.
Build platforms and AI-enabled systems for multiple teams.
Fieldguide is automating and streamlining the work of assurance and audit practitioners specifically within cybersecurity, privacy, and financial audit, building software for the people who enable trust between businesses. They are based in San Francisco, CA, but built as a remote-first company with an inclusive, driven, humble and supportive team.
Design, implement, and manage scalable cloud infrastructure.
Automate and optimize infrastructure management tasks.
Rival Group is a forward-thinking, results-driven organization obsessed with helping innovative brands get closer to their customers. They are a fast-growing tech company with an award-winning market research agency and offices in Chicago, Toronto, and Vancouver.
Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
OnePay is a consumer fintech company trusted by millions of Americans to make money better, providing an all-in-one financial services platform. Backed by Walmart and Ribbit Capital, OnePay provides banking, savings, credit cards, lending, investing, and crypto services, as well as embedded financial services for frontline workers.
Define and execute the reliability engineering roadmap.
Establish SLO/SLI/error budget frameworks for system stability.
Drive continuous improvement through DORA metrics and analysis.
Jobgether leverages AI for HR solutions. They focus on connecting talent with opportunities, using AI-driven matching to ensure fair and objective application reviews.
Own the end-to-end lifecycle (design, provisioning, upgrades, and decommissioning) of core platform components.
Lead the design and implementation of infrastructure bootstrap orchestration, including automated cluster and environment provisioning.
Apply and promote SRE practices across the platform, including clear ownership and runbooks for platform components.
Pismo provides a comprehensive processing platform for banking, card issuing and financial market infrastructure and helps customers innovate and build the next generation of banking and payment solutions. Pismo’s 500+ employees are located in more than 10 countries around the world.
Design and maintain scalable cloud environments using tools like Terraform, CloudFormation, or Ansible.
Build and optimize automated deployment pipelines to ensure rapid and reliable software delivery.
Implement robust monitoring, logging, and alerting frameworks to ensure 24/7 system health.
CodeRoad offers end-to-end software development services, helping businesses scale with infrastructure solutions. They provide staff augmentation, dedicated IT teams, and software engineering to empower businesses in a digital landscape.
Cooperate closely with other Platform and Engineering teams on strategic initiatives
Improve, automate, and grow the SmartRecruiters cloud platform
Respond to client threats and remediate issues
SmartRecruiters is the Recruiting AI Company that transforms hiring for the world’s leading enterprises. An SAP company, they deliver an AI-powered hiring platform that automates and optimizes the entire talent acquisition process. They are a values-driven tech company with strong financial backing and a bold vision.
Build our observability and alerting platform from the ground up.
Lead infrastructure builds for compliance (SOC 2, HIPAA).
Truv is transforming the financial data industry with a secure and real-time API platform for payroll account access. Backed by $30M from top investors, they're disrupting a $2B legacy market with cutting-edge innovation and a customer-first approach.
Own and deliver infrastructure projects end-to-end.
Build and improve platform primitives for service teams.
Improve observability and implement cost and performance improvements.
Afresh is the leading AI company in fresh food, partnering with grocers to order fresh food. They've experienced record-breaking growth and are on a mission to eliminate food waste. They have over 148 million in funding and embody values of proactivity, kindness, candor, and humility.