The Azure Cloud Engineer will focus on ensuring the reliability, performance, and availability of our Azure infrastructure while implementing SRE best practices. This role champions reliability engineering principles, drives automation initiatives, and builds robust observability solutions to maintain world-class uptime and performance.
Job listings
Own the platform that powers our protocol and apps by designing and running AWS-first, highly available systems. Turn infrastructure into code using Kubernetes and CDK/Terraform and wire up deploys with GitHub Actions and CodePipeline. Build end-to-end observability to keep latency low and uptime high, while partnering with product and protocol teams to operate execution clients and RPC/DA/indexing workloads in production.
The Cloud Reliability Engineer will write and integrate various open-source and closed-source tools and will be responsible for configuration management, containerization, and scripting. Duties include developing, configuring, and deploying tools for cloud-based systems and services, containerizing new and legacy applications, and providing LOE/scoping for projects.
Join our team as an Observability Engineer, handling monitoring and system reliability in a high-scale, complex environment for a large multinational food and beverage company. Transition into the SRE role, leveraging your software development background to improve incident detection and prevention. Collaborate with development teams to enhance service reliability and performance by investigating production issues and supporting software architecture.
Join our team as an Observability Engineer and work for a large multinational company in the food and beverage sector, handling monitoring and system reliability in a high-scale, complex environment. Responsibilities include performance monitoring and analysis, software architecture support, troubleshooting, and collaboration with development teams, improving service reliability and performance. Position also available for Mid-Level Developers interested in transitioning to the SRE role.
As a Staff Site Reliability Engineer at Zapier, you'll lead Zapier's observability strategy, partnering with product and platform teams to remove friction in adopting golden paths for observability, service ownership, and incident response. You will also mentor and guide engineers, champion AI integration, and act as a reliability steward for Zapier.
Responsible for the operational management of a team delivering and maintaining a Kubernetes (K8S) platform for a key client, a Managed Service Provider in Western Europe. The team is responsible for the client's largest K8S deployment and plays a critical role in the platform's ongoing development, lifecycle management, and stability.
As a Senior DevOps Engineer, you will continuously improve our development operations and support the reliability and availability of all our applications and services deployed to the cloud. Partner with various engineering teams to own and manage availability, latency, performance, reliability and scalability of all services to maintain SLAs that our customers expect from us. Provide strong technical leadership and people management to the team.
In this role, you will lead the platform engineering team that handles SRE (Site Reliability Engineering), Infrastructure, and Developer Experience. You will influence platform engineering vision, set goals, hire, and drive platform engineering efforts, collaborating with engineering leaders and product managers to provide reliable service and a great developer experience, while developing engineering talent.
As a Platform Engineer focused on Resilience, you'll build and maintain robust processes and systems to meet the highest standards of reliability and operational excellence. You will steward production readiness, support engineers in best practices, and advocate for an improved developer experience in creating resilient services.