Lead reliability initiatives across multiple Ads domains including ad serving, auctions, targeting, reporting, measurement, and billing.
Partner with engineering leadership to improve reliability, scalability, operational excellence, and engineering efficiency across the Ads organization.
Design and build platforms, tooling, and automation that improve reliability and developer productivity at scale.
Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving and related systems.
Design, build, and maintain infrastructure, tooling, and automation to improve service reliability and engineering productivity.
Participate in on-call rotations, lead incident response, and drive root cause analysis and corrective actions.
Reddit is a community of communities built on shared interests, passion, and trust. With 100,000+ active communities and approximately 126 million daily active unique visitors, it is one of the internet's largest sources of information.
Architect for scale, partnering with product and infrastructure teams to design highly available systems.
Drive automation to eliminate repetitive operational work through tooling and systems.
Reddit is a community-based platform where users submit, vote, and comment on various topics. It hosts over 100,000 active communities and attracts millions of daily active users, making it one of the largest and most influential internet platforms.
Collaborate with service teams to define SLIs and SLOs based on customer experience and build error budget policies that influence engineering decisions.
Own the Operational Readiness Review process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
Act as a reliability expert for architecture reviews, failure mode analysis, dependency mapping, and resilience design.
Supabase provides the Postgres development platform with a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. With 280+ team members across 55+ countries, they are an open-source-first company that values async work and has raised $500M.
Design and build core platform infrastructure for large-scale cloud-native data and analytics systems.
Own and improve CI/CD pipelines, testing frameworks, and deployment in a high-scale PaaS environment.
Contribute to reliability engineering, observability, and operational excellence across distributed systems.
Jobgether uses an AI-powered matching process to connect candidates with roles. The company is a growing platform focused on efficient job matching and data privacy compliance.
Provide frontline technical expertise to help developers deploy and scale Temporal in cloud-native environments.
Troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.
Collaborate with engineering and product teams to influence platform improvements and enhance developer experience.
Temporal provides an open source programming model that simplifies code and makes applications more reliable. The company is a growing team driven by values of curiosity, collaboration, and humility, focused on improving developer experience.
Gain a deep understanding of ad technology and industry standards such as OpenRTB, MRAID, and VAST.
Analyze and improve end-to-end ad workflows, reproducing problematic cases and proposing improvements.
Lead implementation of architectural solutions and provide hands-on technical leadership.
RTB House is a global marketing technology company providing AI-powered ad-buying solutions. The company processes over 20 million requests per second and fosters a culture of technical excellence and ownership.
Act as a first responder for system incidents and outages, ensuring high availability and performance.
Own and evolve monitoring, alerting, and log management systems while optimizing database infrastructure.
Collaborate with engineering teams to build scalable, resilient systems and contribute to SRE tooling and automation.
Circle is building the world's leading all-in-one platform for online communities. We're a fully remote company of around 200 team members from 30+ countries, with a culture that values autonomy, async collaboration, and high expectations.
Implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments.
Drive an "automation-first" culture by writing code in Python/Go to build self-healing systems.
Act as lead Incident Commander, develop response playbooks, and conduct post-incident analyses.
Zscaler accelerates digital transformation to secure customers with a cloud-native Zero Trust Exchange platform. The company processes over 200 billion transactions daily and fosters a culture of execution, collaboration, and accountability.
Design and deliver robust, high-scale routing experiences for Data Pipelines for Twilio Segment.
Operate always-available, complex distributed systems in cloud environments.
Collaborate cross-functionally with design, product, and other engineers to define solutions.
Twilio is shaping the future of communications, delivering innovative solutions to hundreds of thousands of businesses and empowering millions of developers worldwide. The company is remote-first with a strong culture of connection and global inclusion, and employs a diverse team of Twilions.
Take ownership of incident management and operational excellence across cloud infrastructure.
Automate high-risk manual processes and drive reliability gains through engineering.
Own a platform domain such as Temporal, observability, or Kubernetes operations.
Synthesia is the world’s leading AI video platform for business, used by over 90% of the Fortune 100. Founded in 2017, the company is headquartered in London with offices across Europe and the US, and has over $530 million in funding from premier investors like Accel and Nvidia's VC arm.
Manage a scrum team of 4-6 engineers building and operating high-volume bidder systems.
Oversee AWS-based cloud infrastructure processing over 1 billion HTTP requests per hour.
Drive improvements in reliability, performance, and cost efficiency across production systems.
Jamloop builds high-scale advertising technology for real-time bidding systems. We are a remote-first company focused on reliability and operational excellence.
Design and build backend systems, APIs, infrastructure, and platform capabilities that improve developer workflows across Reddit.
Build scalable and reliable systems across both AI-powered developer workflows and the core non-AI systems engineers rely on every day.
Lead high-impact projects across Reddit’s developer tooling ecosystem by writing and reviewing code and design docs, aligning stakeholders, and making pragmatic technical tradeoffs.
Reddit is a community-based platform built on shared interests, passion, and trust, facilitating open and authentic conversations. With over 100,000 active communities and approximately 126 million daily active unique visitors, it serves as one of the internet’s largest sources of information.
Lead a high-impact CloudOps and infrastructure engineering team powering large-scale, real-time advertising systems under extreme performance and reliability constraints.
Own planning and delivery processes including sprint planning, backlog prioritization, execution tracking, and team retrospectives.
Drive initiatives to improve system reliability, observability, deployment safety, incident response, and production readiness.
Jobgether uses an AI-powered matching process to review applications quickly, objectively, and fairly against role requirements. Their platform identifies top-fitting candidates and shares shortlists directly with hiring companies.
Lead the Site Reliability Operations team, overseeing observability, monitoring, incident response, and operational excellence for key enterprise services.
Partner with product, engineering, and infrastructure teams to embed CI/CD and release best practices, automating build/test/deploy and release monitoring.
Own problem management, driving root cause analysis and corrective actions to improve system resilience and reduce incident impact.
Mercury Insurance helps people reduce risk and overcome unexpected events, serving customers for over 60 years. The company was recognized as one of America's Best Midsize Employers for 2026 and has a collaborative culture focused on growth and inclusion.
Lead a team of software engineers building scalable services, APIs, and SDKs for our digital merchandising platform.
Drive architecture and design decisions, set technical direction, and work cross-functionally with product managers and data scientists.
Own and evolve engineering standards, manage team growth through 1:1s and performance feedback, and partner on hiring.
Jane Technologies is an MIT-founded eCommerce company in the cannabis industry, connecting consumers with local dispensaries and brands. We are a small close-knit team of highly technical engineers with diverse backgrounds, rapidly growing 20% month over month and valuing lean development and data-driven practices.
Ensure reliability, availability, and observability for a large-scale cloud-based SaaS platform serving millions in education.
Design and maintain infrastructure-as-code and CI/CD pipelines while leading incident response and resolution.
Mentor peers and integrate AI-driven tools to improve SRE workflows and system performance.
Jobgether is an AI-powered job matching platform that connects candidates with hiring companies. The company manages the application process and uses AI to shortlist top-fitting candidates based on core requirements.
Spearhead the evolution of compute and data delivery services with an emphasis on scale and user requirements.
Collaborate to enable efficient and rapid access to new and growing data sets.
Improve reliability and scalability by resolving edge cases, studying failure modes, and writing tests.
Planet designs, builds, and operates the largest constellation of imaging satellites, delivering an unprecedented dataset via a cloud-based platform. With a global team and a people-centric approach, the company focuses on culture and community while preparing for growth.
Build internal tooling to help other engineers and the rest of the company understand and operate our system.
Design and implement security best practices for our team and infrastructure.
Reduce toil through automation, including building and maintaining CI/CD infrastructure.
Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.
Design, build, and operate distributed systems powering observability across ClickHouse Cloud.
Own reliability, performance, and cost-efficiency of the telemetry pipeline and storage systems.
Take part in on-call rotation and drive root-cause resolution and long-term fixes.
ClickHouse is a real-time analytics and data warehousing company recognized on the 2025 Forbes Cloud 100 list. With over 3,000 customers and rapid growth, the company fosters an innovative and fast-paced culture.