Pythian is building a next-generation Site Reliability Engineering team, and we're looking for talented, motivated engineers who thrive in fast-paced, problem-solving environments. As an SRE, you'll design, deploy, and operate large-scale distributed systems across compute, storage, networking, and AI/ML environments. You'll lead projects from architecture to automation to intelligent monitoring, collaborating with both clients and teammates to build resilient, high-performing infrastructure.
Job listings
As one of the first joiners to our Reliability Engineering Team at ClickHouse, you will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of our cloud infrastructure that runs ClickHouse databases. You will collaborate with different teams and guide them to design and implement scalable, secure, highly available and fault-tolerant distributed systems.
As a Platform Engineer focused on Resilience, you'll build and maintain robust processes and systems to meet the highest standards of reliability and operational excellence. You will steward production readiness, support engineers in best practices, and advocate for an improved developer experience in creating resilient services.
As a Senior Site Reliability Engineer at Runwise, you will maintain the stability and performance of our services, ensuring they are reliable, scalable, and fault-tolerant. You'll collaborate with hardware and software engineers to build and maintain tools that improve the reliability and efficiency of our systems. Responsibilities include designing scalable infrastructure in AWS cloud and automating infrastructure provisioning, deployment pipelines, and operational workflows.
Build and maintain robust software systems using Python, Go or Java. Apply deep knowledge of software design patterns, data structures, algorithms, and testing methodologies to deliver scalable, high-quality solutions. Solve complex networking challenges across TCP/IP, DNS, HTTP/HTTPS, and routing protocols such as BGP and OSPF.
The Internal Platform team accelerates product development by providing a reliable, scalable, and self-service ecosystem. As an AI Platform Engineer, you will be responsible for designing, building, and maintaining the company's AI infrastructure, ensuring high availability, scalability, security, and cost efficiency. You will also work closely with Product teams to provide technical expertise and propose innovative solutions.
This role involves implementing the infrastructure and tooling that power the software delivery pipeline and participating in an on-call rotation to support software delivery pipelines and systems. As a Software Engineer II, you will be responsible for the efficiency of the software development lifecycle for Eventbrite's applications.
We are looking for a seasoned Site Reliability Engineer (SRE) to join our distributed team. This is a fully remote, work-from-home opportunity. As a key member of our DevOps team, you will be responsible for designing, implementing, and maintaining mission-critical monitoring, alerting, and incident response systems. This role ensures high availability, reliability, and performance of our infrastructure, supporting scalable services in production environments.