Drive the design of our next-generation AI infrastructure. In this high-impact, hands-on role, you will make end-to-end architectural decisions across compute, networking, and storage, ensuring our platforms can meet the massive scale, performance, and reliability requirements of modern AI workloads. You'll define how tens of thousands of GPUs are interconnected and optimized across multiple data center sites.
Job listings
Improve Scalable's cloud infrastructure and automation using tools such as AWS and Terraform and languages like Python or Go. Design, maintain, and operate multiple cloud networks on AWS, providing a secure and highly available infrastructure. You will maintain financial middleware systems, ensuring high-availability connectivity for our most critical applications and partner integrations. Strengthen our DevOps culture.
Looking for a Senior Cloud Performance Engineer to build the cloud-native ClickHouse Cloud platform. The ideal candidate will have experience with database benchmarking, test automation, system engineering, performance analysis, and capacity management. This role offers the opportunity to make a significant impact on our elastic, high-performance, serverless ClickHouse Cloud, which is built for limitless scale.
As a Team Lead, you'll be responsible for leading a team of site reliability engineers that are designing, deploying, and operating large-scale distributed systems across compute, storage, networking, and AI/ML environments. You will act as the primary technical escalation point, oversee day-to-day operational delivery, mentor and coach team members, and ensure adherence to SLAs and quality standards.
As a Senior Site Reliability Engineer (SRE) at GitLab, you'll help keep all user-facing services and production systems reliable, scalable, and efficient. Our SREs combine a pragmatic operations mindset with strong software engineering practices to drive automation, reduce toil, and improve resilience across our platform. This position centers on automating the lifecycle of many tenant environments, ensuring they remain secure, consistent, and reliable at scale.
Build the foundation that will help our company move as fast as possible while meeting security and compliance requirements. You will be one of the key people defining and driving the future vision of what reliability and observability should look like. Responsibilities include shipping automation and tooling that reduce toil, backed by high-quality, well-structured code, and designing and codifying self-healing workflows and guardrails that improve reliability.
CapIntel is looking for a Senior Fullstack Engineer to strengthen the architecture, reliability, and scalability of their systems. This role focuses on platform and tooling, improving developer experience and supporting secure, efficient product delivery. You'll collaborate closely with product teams and may contribute to service design or integration when needed.
We are seeking a highly skilled DevOps Engineer with expertise in Go/Golang, Python, and Terraform to join our innovative team. The ideal candidate values proactive problem-solving and can readily clarify requirements.
The Infrastructure Engineering team is the backbone of Docker's cloud-native platform, powering products like Docker Hub and Docker Build Cloud for millions of developers worldwide. The team designs, builds, and operates the infrastructure services and platforms that make Docker fast, reliable, and secure at global scale. They own core building blocks like compute, networking, observability, deployment, security, and cloud infra provisioning.
Work as an SRE / Platform Engineer at Teravision Technologies! The role involves working with AWS services, IaC tools like Terraform, and CI/CD tools such as GitHub Actions. It requires proficiency in scripting languages like Python or Go and a solid understanding of observability tools.