Monitor, operate, and support production AI infrastructure platforms.
Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.
Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.
Build and maintain CI/CD pipelines and deployment infrastructure.
Leverage AI to automate analysis and resolution of production issues.
Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.
Identify recurring patterns across customer issues and drive long-term reliability improvements.
Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. They serve solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.
Maintain the reliability and performance of customer environments remotely, supporting Mirantis OpenStack/k0s layers.
Diagnose and resolve system-level issues, requiring hands-on Linux administration experience.
Troubleshoot customer environments based on Linux, OpenStack, Kubernetes, networking, and other cloud technologies; detect, report, and resolve issues.
Mirantis helps enterprises move to the cloud on their terms, delivering a true cloud experience on any infrastructure, powered by Kubernetes. They serve many of the world’s leading enterprises and value openness, collaboration, risk-taking, and continuous growth.
Own the technical direction of Remote's SRE/Platform domain.
Define and drive the reliability strategy across the platform.
Identify and lead AI enablement initiatives across the engineering organisation.
Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.
Design, build, and maintain the core infrastructure layer supporting GenAI products.
Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.
PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, closely integrated with key horizontal partners to ensure delivery of safe, scalable, and high-impact AI products.
Own Render's core network infrastructure across multiple data centers and cloud providers, shaping how networking evolves as Render rapidly scales.
Design and build customer-facing networking capabilities that give users greater flexibility in how their services connect and communicate, and how traffic is routed.
Investigate complex networking issues across the stack, from the kernel and data plane to distributed systems and edge networking.
Render is building a modern cloud platform for developers creating AI-native, full-stack, multi-service applications, eliminating the tradeoff between hyperscaler power and developer-friendliness. They are a diverse and talented team that values craft, velocity, and user experience.
Support and improve hybrid production infrastructure for 15+ development teams handling 100+ products, 10K+ domains, and billions of hits per day.
Architect and plan improvements of a multi-datacenter development environment, advocating for migration to automated, elastic infrastructures using cloud, Kubernetes, and serverless technologies.
Aylo is a tech pioneer that offers world-class adult entertainment and games on safe, popular platforms. With an international team of dynamic innovators, the company focuses on trust-and-safety protocols and has offices in Montreal, Austin, and Nicosia.
Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.
Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.
Lead the architecture of a high-scale AWS environment optimized for AI workloads.
Manage and mentor a high-performing team of 8 engineers, providing technical leadership and career coaching.
Conduct user research with internal Natera developers to identify friction points.
Natera is a global leader in cell-free DNA (cfDNA) testing, dedicated to oncology, women’s health, and organ health. The Natera team consists of statisticians, geneticists, doctors, laboratory scientists, business professionals, software engineers, and many other professionals from world-class institutions.
Act as the final escalation point for complex Cloud infrastructure issues, analyzing logs and metrics to identify root causes.
Own high-severity incidents, coordinate resolution with Engineering, DevOps, and SRE teams, and contribute to preventive actions.
Mentor L1 and L2 support engineers, create runbooks and SOPs, and collaborate with Product teams to reproduce issues.
Gcore provides infrastructure and software solutions for AI, cloud, network, and security, powering real-time communication, streaming, enterprise AI, and secure web applications. With 550+ professionals globally, they collaborate with partners like Intel, NVIDIA, Dell, and Equinix to support the digital ecosystem.
Set delivery standards using Terraform, GitOps, and progressive rollout, while improving SLOs and alerting on Grafana Cloud.
Docker is a developer tooling company with over 20 million monthly users and 20 billion container image pulls. They are a globally distributed, remote-first team building tools that define how software gets built and delivered.
Design, implement, and maintain reliable, scalable, and secure infrastructure, applications, and tooling, with a focus on our ML/AI pipelines and workloads.
Write clean, maintainable code, and perform peer code reviews.
Write clear and concise documentation and engage in cross-team communication and knowledge sharing.
Bright Machines is a next-generation, AI-enabled manufacturer focused on data center infrastructure assembly operations. The company uses AI-based robotics and software to assemble AI infrastructure hardware products for hyperscalers and leading OEMs. It has fewer than 500 employees and a culture rooted in innovation and expertise.
Own the end-to-end infrastructure product vision, including installers, deployment tooling, reference architectures, and operational patterns.
Define and evolve a cohesive infrastructure roadmap aligned with Platform architecture, customer needs, and GTM strategy.
Partner closely with Product Leadership to balance near-term customer needs with long-term platform scalability and repeatability.
Mechanical Orchard is reinventing how the world’s most critical software gets modernized, focusing on system behavior to turn modernization into a repeatable process. They are an applied AI company challenging industry assumptions and prioritizing quality, rigor, and progress.
Build and improve scalable infrastructure operations processes that support a growing cloud platform.
Enable customer-facing and operational teams with secure automation, diagnostics, tooling and clear workflows.
Reduce repeatable manual work by identifying operational pain points and turning them into automated or self-service solutions.
NexGen Cloud delivers on-demand and private GPU infrastructure to a wide array of customers. They're a tight-knit, fast-moving team working at the cutting edge of AI cloud infrastructure, equipping their people with AI at every level.
Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.
Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.
Identify systemic engineering challenges across our platforms and drive their resolution.
Write code, review PRs, debug production issues, and optimize system performance.
Partner with engineering teams as a technical point of contact on complex projects.
Zeta Global is an AI-Powered Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to help marketers acquire, grow, and retain customers more efficiently. They were founded in 2007 and are headquartered in New York City with offices around the world.
Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
Define and implement our observability strategy across metrics, logs, and tracing.
Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.
Own production health, reliability, and operational support processes across critical systems and services.
Lead incident response efforts, stakeholder communication, root cause analysis, and post-incident reviews.
Design and implement AI-driven agents and workflows that automate support and operational tasks.
Quanata is on a mission to help ensure a better world through context-based insurance solutions. They are an exceptional, customer-centered team with a passion for creating innovative technologies, digital products, and brands. Quanata, LLC is wholly owned and funded by State Farm.
Drive the architecture for how vNode wraps containerd, integrates with the kubelet, and exposes safe isolation primitives.
Lead the work where vNode meets containerd, Kata Containers, gVisor, runc, and the kernel.
Own how vNode plugs into the node lifecycle: CRI, kubelet device plugins, cgroups v2, eviction.
VCluster Labs is pioneering Kubernetes virtualization for the AI era. They are a venture-backed tech startup in a hyper-growth phase with a remote-first work culture.