Design, build, and maintain scalable, highly available and fault-tolerant infrastructures.
Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
Drive continuous improvement in infrastructure automation, deployment, and orchestration.
Mistral AI is dedicated to democratizing AI through high-performance, optimized, open-source models, products, and solutions designed to integrate seamlessly into daily working life. They are a dynamic, collaborative team dedicated to innovation and passionate about AI and its potential to transform society.
Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure.
Diagnosing and eliminating cross-layer failure modes.
Designing safe upgrade and rollout strategies at scale.
Grafana Labs is a remote-first, open-source powerhouse with over 20M users of Grafana, its open-source visualization tool. Grafana Labs helps more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and its team thrives in an innovation-driven environment.
Support the availability and durability of critical services across production environments.
Develop automation for common operational tasks, reducing manual intervention and toil.
Partner with engineering, product, and operations teams to support resilient system design and operations.
Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets and unleash innovators. Founded in 2007, they scaled the business with less than $3 million in outside funding until 2021, and they generate over $100 million in revenue managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries.
Design and implement infrastructure and tools that empower our product teams to rapidly and securely iterate, emphasizing reliability and automation.
Influence the strategic direction of our infrastructure and operational practices, ensuring that we are well-positioned to scale and support our growing organization.
Take a proactive role in the resolution of production issues, ensuring that we are well-prepared to handle incidents and that we learn from them in a blameless manner.
SSV Labs is the core team behind the SSV Network, pioneering decentralized infrastructure for Ethereum staking. They are building tools, protocols, and standards to make staking more secure, scalable, and trustless.
Lead the push toward a modern, cloud-native organization by designing and managing scalable, resilient systems on AWS.
Own the Infrastructure as Code (IaC) strategy using Terraform, ensuring environments are repeatable, versioned, and stable.
Build and optimize high-velocity deployment pipelines using GitHub Actions, ArgoCD, and Helm to get code from "commit" to "production" seamlessly.
TrueML is undergoing a major platform rearchitecture, moving toward a fully cloud-native, modernized infrastructure. They seem to be a medium-sized company with a focus on innovation and providing engineers with the tools and data they need to make smart, impactful choices.
Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
Serve as the primary technical point of contact for customers running large-scale training workloads.
Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Andromeda Cluster gives early-stage startups access to scaled AI infrastructure. They work with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most, and they are expanding to find the brightest in AI infrastructure, research, and engineering.
Design and develop multi-threaded asynchronous replication systems with parallel streaming capabilities
Build object-level delta replication with checkpointing and resume functionality
Implement secure data transfer mechanisms using TLS 1.3 with mutual authentication
DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing. DDN's cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data.
Build and develop our operator-based platform on Kubernetes.
Work on existing operators and design new ones as we extend the platform.
Create self-service solutions across multiple Kubernetes clusters.
REWE Group Austria's IT department develops innovative IT products and services for its corporate divisions in Austria and abroad, setting the tone for modern trade. They have over 700 employees.
Support Voleon's build processes and continuous delivery tools, manage software development tools, and automate systems and operations
Work with stakeholders and product teams to create robust, high-performance software and systems architectures
Establish and promote best practices for software development and testing
Voleon is a technology company that applies state-of-the-art AI and machine learning techniques to real-world problems in finance. They have become a multibillion-dollar asset manager, and they have ambitious goals for the future, with colleagues including internationally recognized experts.
Collaborate with stakeholders to drive best practices for monitoring and CI/CD pipelines
Troubleshoot deployment issues in our CI pipeline
Identify areas for automation and embrace the codification of all things
Weedmaps is a global leader in the cannabis industry. They are dedicated to transparency, education, and community, serving cannabis to consumers and businesses in the U.S. and worldwide.