Design systems with resilience, graceful degradation, and capacity in mind.
Define and measure SLOs and SLIs that actually reflect what our customers feel.
Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.
EarnIn is building products that deliver real-time financial flexibility for people living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent on board to help shape the next chapter of their growth journey.
Design and build backend systems, APIs, infrastructure, and platform capabilities that improve developer workflows across Reddit.
Build scalable and reliable systems across both AI-powered developer workflows and the core non-AI systems engineers rely on every day.
Lead high-impact projects across Reddit’s developer tooling ecosystem by writing and reviewing code and design docs, aligning stakeholders, and making pragmatic technical tradeoffs.
Reddit is a community-based platform built on shared interests, passion, and trust, facilitating open and authentic conversations. With over 100,000 active communities and approximately 126 million daily active unique visitors, it serves as one of the internet’s largest sources of information.
Own the technical direction of Remote's SRE/Platform domain.
Define and drive the reliability strategy across the platform.
Identify and lead AI enablement initiatives across the engineering organisation.
Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.
Assess and improve visibility by identifying gaps in dashboards, metrics, and logs.
Refine alerts and dashboards for critical services to catch issues earlier.
Automate routine checks and monitoring tasks to free up engineers.
PlayOn is where high school sports come to life through platforms like GoFan, NFHS Network, and MaxPreps. As a growth-stage company backed by KKR, we build the technology that powers high school athletics, from ticketing and streaming to fundraising and merchandise.
Incident Management: Respond to and resolve incidents in a timely manner, conducting post-incident reviews to identify and implement improvements.
Alpaca is a self-clearing broker-dealer and brokerage infrastructure for stocks, ETFs, options, crypto, fixed income, 24/5 trading, and more. They are a dynamic team of 380+ globally distributed members.
Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure.
Implementing and utilizing configuration management and deployment tools.
Leading continuous improvement by automating the installation, configuration, and maintenance of services on our platform.
The Wikimedia Foundation operates Wikipedia and other Wikimedia free knowledge projects with the vision of a world where every single human can freely share in the sum of all knowledge. As a charitable, not-for-profit organization, it relies on donations and has staff members based in 40+ countries.
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.
Design, build, and maintain scalable, reliable systems on GCP.
Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager.
Manage incident response, conduct postmortems, and implement improvements to reduce recurrence.
SupplyHouse.com is an industry-leading e-commerce company specializing in HVAC, plumbing, heating, and electrical supplies since 2004. They value every individual team member and cultivate a community where people come first with Generosity, Respect, Innovation, Teamwork, and GRIT.
Build internal tooling to help other engineers and the rest of the company understand and operate our system.
Design and implement security best practices for our team and infrastructure.
Reduce toil through automation, including building and maintaining CI/CD infrastructure.
Openly is rebuilding insurance from the ground up by re-envisioning and enhancing every aspect of the customer experience. They are a rapidly growing team of exceptional, curious, empathetic people with a wide range of skill sets, spanning many departments.
Improve the reliability, performance, and scalability of our production platform.
Operate reliable infrastructure, improve observability, and drive incident response.
Use data-driven reliability practices such as SLIs, SLOs, SLAs, and DORA metrics.
VRChat is a game-changing platform that provides an endless collection of social VR experiences. They empower their community to bring their imaginations to life and help shape the metaverse. Their team includes people from Netflix, Twitter, Meta, and Microsoft.
Lead high-performing engineering teams focused on AI-native developer productivity.
Partner with leaders to translate strategy into scalable platforms and engineering roadmaps.
Drive alignment across various departments and build organizational processes for AI-assisted workflows.
Reddit is a community-based platform built on shared interests and open conversations. With over 100,000 active communities and millions of daily active users, it's a major source of information and discussion on the internet.
Lead software engineering teams providing infrastructure-as-code to manage cloud infrastructure.
Hire experienced site reliability staff and a line manager to grow and oversee the SRE team.
Establish design-before-build discipline; facilitate lightweight design documents, architectural decision records, and working group reviews.
Horizon3.ai is a cybersecurity company dedicated to enabling organizations to proactively find, fix, and verify exploitable attack vectors. They are a fast-growing company with a culture of respect, collaboration, ownership, and results.
Foster and guide the technical strategy for growth.
Partner with the EM to cultivate a growth engineering mindset.
Scale impact through collaborative engineering.
Reddit is a community of communities built on shared interests, passion, and trust, and is home to open and authentic conversations. With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.
Architect future iterations of core systems, addressing scaling requirements.
Design and implement developer tools to enhance deployment safety and reproducibility.
Drive excellence in monitoring and guide incident response for quick issue resolution.
Found provides tools for self-employed individuals, offering a business bank account that automates taxes and expense tracking. They aim to give self-employed people the security and peace of mind historically available only at large corporations, and they are looking for kind, resourceful, and passionate people.
Work cross-functionally to build novel products and features.
Contribute to the full development cycle.
Contribute standards that improve developer workflows.
Reddit is a community-driven platform built on shared interests and trust, hosting open conversations. With over 100,000 active communities and approximately 126 million daily active unique visitors, it's a major source of information.
Provide production support on a shift according to the team on-call roster.
Work on tickets raised by customers and by the internal engineering/implementation team while not on call for production support.
Continuously monitor the health and performance of our services, systems, and infrastructure.
Granicus builds and maintains technology that is transforming the Govtech industry by bringing governments and their constituents together. They serve 5,500 federal, state, and local government agencies and more than 300 million citizen subscribers, and are known for being one of the best companies to work for.
Design, build and scale products within our notifications system, focusing on the end-user experience.
Work across the stack and with cross-functional teams like Product, Machine Learning and Data Science.
Contribute to the full development cycle: technical design, development, test, experimentation, analysis, and launch.
Reddit is a community of communities built on shared interests, passion, and trust, fostering open and authentic conversations. With 100,000+ active communities and approximately 126 million daily active unique visitors, it’s one of the internet’s largest sources of information.
Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance.
Develop and maintain comprehensive automation solutions to eliminate toil and streamline operational efficiency.
Proactively monitor production systems and implement automated incident response mechanisms to minimise downtime.
Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus, and routing environments. The company is well established and profitable, with over $8 billion in revenue, and values diversity and inclusivity.