Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.
Scale and mature Vesta’s infrastructure to support the entire mortgage market reliably, securely, and efficiently.
Build the foundational systems that power engineering velocity and platform reliability.
Focus on cloud architecture, deployment systems, observability, incident response, and internal developer tooling.
Vesta is building the next-generation system of record to power the multi-trillion-dollar mortgage market. They value humility, empathy, self-awareness, and an orientation towards action, and have raised $45M from top-tier investors.
Oversee a specialized SRE team focused on the design, deployment, and maintenance of automation toolsets.
Establish and enforce standards for IaC to ensure consistent, repeatable, and secure deployments.
Drive the automated lifecycle of both physical and virtual assets, from initial template creation/deployment to automated patching, scaling, and decommissioning.
Galaxy is a global leader in digital assets and data center infrastructure, delivering solutions that accelerate progress in finance and artificial intelligence. Led by CEO and Founder Michael Novogratz, their team blends deep crypto expertise with institutional experience and a shared commitment to shaping the future of Web3 and AI.
Own the delivery of developer platform capabilities end-to-end, including design, implementation, rollout, and iteration.
Build and evolve paved roads that make it easy to deploy, operate, and scale services.
Drive improvements to GitOps workflows and harden CI/CD to improve pipeline performance and developer ergonomics.
Phaidra is building the future of industrial automation with AI-powered control systems. They are a 100% remote company with employees located throughout the USA, Canada, UK, Sweden, Spain, Portugal, the Netherlands, Singapore, Australia, and India.
Drive the architecture for how vNode wraps containerd, integrates with the kubelet, and exposes safe isolation primitives.
Lead the work where vNode meets containerd, Kata Containers, gVisor, runc, and the kernel.
Own how vNode plugs into the node lifecycle: CRI, kubelet device plugins, cgroups v2, eviction.
VCluster Labs is pioneering Kubernetes virtualization for the AI era. They are a venture-backed tech startup in a hyper-growth phase with a remote-first work culture.
Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.
Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack, and their team thrives in an innovation-driven environment.
Design systems with resilience, graceful degradation, and capacity in mind.
Define and measure SLOs and SLIs that actually reflect what our customers feel.
Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.
EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. They are growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of their growth journey.
Own the Go Terraform provider codebase, including architecture, implementation quality, test strategy, and release readiness.
Improve Terraform provider reliability and ergonomics, including resource behavior, data sources, lifecycle edge cases, and upgrade safety.
Drive technical strategy for IaC workflows through design docs, RFCs, and iterative delivery.
Supabase is a Postgres development platform, built by developers for developers, providing a complete backend solution including Database, Auth, Storage, Edge Functions, Realtime, and Vector Search. They are a remote-first company with over 280 team members across 55+ countries.
Design, implement, and maintain reliable, scalable, and secure infrastructure, applications, and tooling, with a focus on our ML/AI pipelines and workloads.
Write clean, maintainable code and perform peer code reviews.
Write clear and concise documentation and engage in cross-team communication and knowledge sharing.
Bright Machines is a next-generation, AI-enabled manufacturer focused on data center infrastructure assembly operations. The company uses AI-based robotics and software to assemble AI infrastructure hardware products for hyperscalers and leading OEMs. It has fewer than 500 employees and a culture rooted in innovation and expertise.
Assess and improve visibility by identifying gaps in dashboards, metrics, and logs.
Refine alerts and dashboards for critical services to catch issues earlier.
Automate routine checks and monitoring tasks to free up engineers.
PlayOn is where high school sports come to life through platforms like GoFan, NFHS Network, and MaxPreps. A growth-stage company backed by KKR, they build the technology that powers high school athletics, from ticketing and streaming to fundraising and merchandise.