Ensure reliability, availability, and observability for a large-scale cloud-based SaaS platform serving millions in education.
Design and maintain infrastructure-as-code and CI/CD pipelines while leading incident response and resolution.
Mentor peers and integrate AI-driven tools to improve SRE workflows and system performance.
Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company manages the application process and uses AI to shortlist top-fitting candidates based on core requirements.
Lead and mentor SRE/DevOps engineers, driving team growth and performance.
Ensure system reliability, uptime, and performance across production systems.
Implement DevOps and SRE best practices with a focus on automation and scalability.
InspiredXpert is a specialist IT talent solutions company providing high-quality contract and permanent talent across software development, cloud, AI, cybersecurity, and data-driven roles. We connect skilled professionals with innovative companies, offering opportunities to work on impactful projects across the globe.
Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.
Mercury Insurance helps people reduce risk and overcome unexpected events, and has served customers for over 60 years. The company was recognized as one of America's Best Midsize Employers for 2026 and has a collaborative culture focused on growth and inclusion.
Design and maintain scalable infrastructure-as-code solutions using Terraform and Kubernetes.
Build and operate observability systems while leading incident response and reliability improvements.
Embed security and compliance practices into infrastructure and optimize system performance and cloud costs.
This partner company builds a next-generation platform enabling AI-driven services across global employment infrastructure. It is a highly distributed, async-first organization where engineers thrive in ownership and autonomy.
Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.
Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.
Build and operate the delivery platform across AWS, EKS, ArgoCD, Helm, and Terraform, fixing production problems and driving root-cause analysis.
Standardize CI/CD pipelines using GitHub Actions and Azure DevOps, implement progressive delivery with Argo Rollouts, and build observability with Grafana and Prometheus.
Support platform adoption, reduce toil and cost, unblock cross-team delivery, and write documentation to eliminate knowledge silos.
Attain Finance is a leading consumer credit lender with over 50 years of expertise providing credit solutions across the U.S. and Canada. The company employs a dynamic team that fosters innovation and collaboration, with a portfolio including brands like Cash Money, LendDirect, Heights Finance, and others.
Lead and mentor a team of Forward Deployed Engineers deploying the North platform.
Drive end-to-end deployment in private cloud and on-premises environments for customer success.
Collaborate with Product, Engineering, and Sales while optimizing cloud infrastructure and K8s services.
Cohere is a security-first enterprise AI company building cutting-edge foundation AI models and end-to-end products for real-world business problems. They are a global technology company with offices in Toronto, San Francisco, London, New York City, Montreal, Seoul, Germany, and Paris, employing a team of researchers, engineers, and designers.
Build and lead a high-performance product engineering team focused on innovation, accountability, and reliability.
Develop scalable reliability, risk management, and operational governance capabilities for production systems.
Drive alignment across Platform Engineering, SRE, Infrastructure, and product teams to deliver long-term technical roadmap outcomes.
Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without hidden fees or compounding interest. It is a publicly traded, remote-first company with competitive benefits and a culture focused on innovation and people.
Design, implement, and improve Site Reliability Engineering practices across production environments with a focus on SLOs, SLIs, and error budgets.
Lead incident response processes and build observability strategies including monitoring, logging, alerting, and distributed tracing.
Partner with engineering teams to enhance system reliability, availability, scalability, and operational efficiency.
Oowlish is a rapidly expanding software development company in Latin America that collaborates with premier clients from the United States and Europe to create pioneering digital solutions. Certified as a Great Place to Work, it offers a nurturing environment with opportunities for professional growth and international impact.
Manage a team of engineers, conducting 1:1s, performance reviews, hiring, and career development in a distributed, remote-friendly environment.
Own the technical roadmap for shared cloud infrastructure across Azure and AWS, balancing reliability work against longer-term platform improvements.
Set and enforce standards for infrastructure-as-code (Terraform, Helm, Kubernetes), documentation, and operational readiness.
Delinea is a pioneer in securing human and machine identities through intelligent, centralized authorization, empowering organizations to seamlessly govern their interactions across the modern enterprise. They value diversity, innovation, and a culture of respect and fairness, with a global team supported by strategic investment from TPG.
Design, build, and maintain highly available Kubernetes infrastructure at scale.
Lead design for components and features, and contribute to architecture decisions for container orchestration.
Mentor engineers on Kubernetes best practices and drive initiatives to improve system reliability.
Marqeta provides a card-issuing platform for companies to issue cards, authorize transactions, and manage payment operations in real time. They are a publicly traded company with a Flex First culture that values remote work and employee growth.
Provide frontline technical expertise to help developers deploy and scale Temporal in cloud-native environments.
Troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.
Collaborate with engineering and product teams to influence platform improvements and enhance developer experience.
Temporal provides an open source programming model that simplifies code and makes applications more reliable. The company is a growing team driven by values of curiosity, collaboration, and humility, focused on improving developer experience.
Build and maintain end-to-end observability with ELK, Prometheus, and Grafana.
Own and improve CI/CD pipelines (CircleCI, GitLab CI, GitHub Actions, ArgoCD).
Lead incident response and postmortems in a blameless culture.
Redcare Pharmacy is Europe’s No.1 e-pharmacy, powered by passionate teams and cutting-edge innovation. They strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribute to their vision “Until every human has their health”.
Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
Define and implement our observability strategy across metrics, logs, and tracing.
Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.
Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.
Reddit is a community of communities, built on shared interests, passion, and trust, home to the most open and authentic conversations on the internet. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.
Manage a scrum team of 4-6 engineers building and operating high-volume bidder systems.
Oversee AWS-based cloud infrastructure processing over 1 billion HTTP requests per hour.
Drive improvements in reliability, performance, and cost efficiency across production systems.
Jamloop builds high-scale advertising technology for real-time bidding systems. We are a remote-first company focused on reliability and operational excellence.
Own and evolve the cloud platform including compute layer, EKS fleet, serverless infrastructure, networking, and cloud operations across AWS and GCP.
Design and maintain the infrastructure-as-code foundation and networking layer for reliability, security, and scalability.
Build AI-powered automation for cloud infrastructure management, including policy-as-code, drift detection, and LLM-assisted runbook generation.
Webflow builds the world's leading AI-native Digital Experience Platform, empowering teams to design, launch, and optimize for the web without barriers. As a remote-first company with over 2 million users across 190 countries, it fosters a culture of trust, transparency, and creativity.
Lead a team of experienced SRE engineers to raise reliability standards in blockchain infrastructure.
Set engineering direction, build conditions for good work, and apply SRE disciplines like SLOs and error budgets.
Drive automation and foster people development in a small, broad-scope team.
Parity is a leading core blockchain infrastructure company, founded by Dr. Gavin Wood, co-founder and former CTO of Ethereum. They are a remote-first team with offices in Berlin, Lisbon, and London, focused on building advanced technologies in the blockchain sector and committed to diversity and inclusion.
Build and maintain infrastructure platforms for over 200 backend services running on Kubernetes clusters with 40,000+ cores.
Lead and mentor other engineers, own complex infrastructure failures, and participate in a shared on-call rotation.
Drive cloud cost efficiency, estimate schedules, and use AI tools as first-class collaborators in daily workflows.
Life360's mission is to keep people close to the ones they love through location sharing, safe driver reports, and crash detection. The company serves approximately 97.8 million monthly active users across more than 180 countries and has more than 500 remote-first employees.
Co-own the architecture of cloud infrastructure on Azure and Kubernetes clusters for high throughput and availability.
Drive resilience strategy for global scaling, zero-downtime deployments, and disaster recovery.
Evolve the observability stack with LGTM (Loki, Grafana, Tempo, Mimir) and lead incident response.
Flip is an AI-powered employee experience platform for frontline workers in retail, manufacturing, and logistics. It is a young, rapidly growing tech company with a remote-first culture and offices in Berlin and Stuttgart.