Scale the decision-making process for tools for the tvScientific AI team, from our workflows to our training infrastructure to our Kubernetes deployments.
Improve the developer experience for the data science team and upgrade our observability tooling.
Make every deployment smooth as our infrastructure evolves, working with software engineering, data infra, and SRE partners.
Build scalable Edge infrastructure, designing and maintaining delivery systems for model deployment.
Work with cross-functional teams to integrate complex features, translating research into hardware realities.
Drive automation and reliability by implementing infrastructure to test models and monitor performance.
Hudl builds great teams and hires the best to ensure employees are working with people they can constantly learn from. They provide a culture where everyone feels supported and have been named one of Newsweek's Top 100 Global Most Loved Workplaces.
Build and operate production-grade model serving infrastructure using frameworks such as vLLM, TGI, Triton, or equivalent
Design and implement robust deployment pipelines with blue/green and canary rollout strategies for ML models
Develop and maintain auto-scaling systems, multi-model serving architectures, and intelligent request routing layers
Pragmatike is recruiting on behalf of a fast-scaling, well-funded distributed cloud infrastructure startup building next-generation AI-native cloud services. The company is redefining how compute is delivered by providing GPU-powered infrastructure for AI/ML workloads, secure storage, and high-speed data transfer through a decentralized architecture that significantly reduces environmental impact compared to traditional cloud providers.
Build and maintain infrastructure-as-code for our AWS EKS and GCP GKE clusters, plus on-premises deployments.
Own CI/CD pipelines and drive GitOps adoption.
Deploy, scale, and optimize ML/NLP inference workloads.
Vectara is the Enterprise Agent Platform that enables businesses to build and deploy governed, grounded, auditable AI agents across SaaS, VPC, and on-prem. We’re a passionate team that’s hyper-focused on solving enterprise-level technology and business problems with AI.
Design, build, and maintain scalable training infrastructure for computer vision workloads
Implement and manage distributed training pipelines to support large-scale model training and hyperparameter tuning
Build and maintain robust data pipelines for ML development
Buzz is revolutionizing the analytics and maintenance of power grid infrastructure through their advanced AI solutions. Their computer vision systems analyze critical infrastructure to enhance safety, reliability, and operational efficiency across the power grid network.
Own the design, implementation, and evolution of core MLOps systems across Hyperstack.
Build and improve systems that orchestrate model training, fine-tuning, evaluation, and deployment.
Define and embed strong MLOps practices across teams.
NexGen Cloud is the company behind Hyperstack, a full-stack AI cloud serving tens of thousands of customers from AI researchers to enterprises running the world's most compute-intensive workloads. They deliver on-demand and private GPU infrastructure to teams who treat performance as a requirement, not a feature.
Design and build the core data infrastructure powering Vantage's platform.
Own architecture decisions for systems built on ClickHouse, Temporal, Kubernetes, and Postgres.
Drive reliability, performance, and scalability initiatives across the platform as data volume and customer load grow.
Vantage is the FinOps platform built for modern engineering teams. They are a high-output team of ~50 employees based in New York City with a remote-friendly culture.
Design the BYOC deployment model for Archie across customer environments.
Build and own Kubernetes-based infrastructure that runs reliably across multiple clouds and customer setups.
Create deployment tooling using Helm, GitOps, or similar approaches to make installation and operations repeatable.
P-1 AI is building an engineering AGI with their first product, Archie, an AI engineer. They closed a $23 million seed round and aim to put an Archie on every engineering team at every industrial company on earth.
Work with customers, engineers, and other stakeholders to define clear requirements that solve the customers’ problems and leverage the capabilities of our AI operations platform.
Translate requirements into a technical approach, design, scoping estimate, and execution plan.
Lead execution teams to achieve on-time completion of project deliverables mapped to customer business value while making key individual contributions throughout the process.
Striveworks helps organizations harness the power of artificial intelligence to solve real-world national security and business challenges. Founded by data scientists and engineers, they set out to make the journey from deployment to ongoing optimization simple and effective.
Develop and enhance backend features, ensuring system reliability and scalability.
Collaborate with stakeholders to define requirements and improve system performance.
Manage infrastructure using Terraform and other infrastructure-as-code tools.
Aura is on a mission to create a safer internet, offering a suite of intelligent digital safety products that help millions of customers protect themselves against digital threats. With over 400 employees worldwide, Aura is guided by experienced leadership and fosters an inclusive community.
Design, build, and maintain ML infrastructure across training, evaluation, serving, and monitoring
Own data pipelines including generation, cleaning, validation, and versioning
Build and improve experiment tracking, orchestration, and reproducibility tooling
Quilter is helping electrical engineers save time and accomplish more by automating the tedious and time-consuming task of designing printed circuit boards (PCBs). Their small team is composed of experts in electrical engineering, electromagnetic simulation, ML/AI, and high-performance computing (HPC).
Design, build, and optimize cloud platform capabilities.
Tackle complex infrastructure challenges and raise engineering quality.
Apply AI and AIOps to make the platform smarter and more resilient.
PerfectServe offers Best in KLAS clinical communication and physician scheduling solutions and is a Leader in the Gartner Magic Quadrant for Clinical Communication and Collaboration. We focus on optimizing provider schedules and dynamically routing messages to advance patient care and clinical workflows, valuing growth, transparency, and innovation.
Design, build, and maintain the inference infrastructure that powers Sword Health's AI products.
Own the end-to-end deployment pipeline for AI models.
Architect and scale Kubernetes clusters for GPU-accelerated workloads.
Sword Health is shifting healthcare from human-first to AI-first through its AI Care platform, making healthcare available anytime, anywhere, and reducing costs. They have over 1,000 enterprise clients and have raised more than $500 million from leading investors.
Partner with Sales and Field Engineering to design and architect complex, enterprise-grade solutions tailored to customer needs.
Lead the implementation of custom solutions within customer environments across multi-cloud and hybrid architectures.
Optimize solutions for performance, scalability, and reliability in production environments.
Striim is a unified data integration and streaming platform that connects clouds, data, and applications. We believe in all of our employees and expect them to operate as one, with unlimited potential and dignity.
Work with engineers across Yelp to support new features and services.
Integrate tools to monitor platform stability and performance.
Help scale our Kubernetes clusters and AWS-based infrastructure while maintaining our platform's SLOs.
Yelp's engineering culture values individual authenticity and encourages creative solutions. They focus on helping users, growing as engineers, and having fun in a collaborative environment.
Lead the strategy and architecture for a scalable AI platform that integrates model orchestration, tool integration, and real-time decision systems.
Design, develop, and maintain the platform with full ownership from ideation to deployment, ensuring reliability, observability, and security.
Mentor engineers and collaborate across teams to evangelize AI best practices and drive the integration of AI throughout the product development lifecycle.
JumpCloud is an AI-powered unified IT management platform designed to secure the modern workforce by consolidating identity, device, and access management. The company is remote-first with teams in over 15 countries, fostering a culture that values building connections, out-of-the-box thinking, and passionate collaboration on challenging technical problems.
Design and implement infrastructure and tools that empower our product teams to rapidly and securely iterate, emphasizing reliability and automation.
Influence the strategic direction of our infrastructure and operational practices, ensuring that we are well-positioned to scale and support our growing organization.
Take a proactive role in the resolution of production issues, ensuring that we are well-prepared to handle incidents and that we learn from them in a blameless manner.
SSV Labs is the core team behind the SSV Network - pioneering decentralized infrastructure for Ethereum staking. They are building tools, protocols, and standards to make staking more secure, scalable, and trustless.
Design and maintain CI/CD pipelines for ML model training, packaging, and deployment across our microservices.
Manage containerized services on AWS ECS, optimizing for cost, latency, and availability.
Automate infrastructure provisioning and service configuration with Terraform.
Newsela takes authentic, real-world content from trusted sources and makes it instruction-ready for K-12 classrooms. Each text is published at five reading levels, so content is accessible to every learner; over 3.3 million teachers and 40 million students have registered.
You’ll design, build, and maintain scalable systems for serving machine learning models in production.
You’ll optimise inference performance, including latency, throughput, and cost efficiency.
You’ll collaborate with ML researchers and engineers to productionise models.
Canva is a design platform that enables users to create a variety of visual content. They have campuses in Sydney and Melbourne, with co-working spaces in other Australian cities, and promote a flexible work environment.
ScienceLogic is redefining IT operations for the modern enterprise. Their AIOps platform empowers organizations to achieve Autonomic IT, helping enterprises and service providers gain unified visibility across hybrid and multi-cloud environments.
Own the architecture and evolution of P2P.org's internal developer platform—Kubernetes, monitoring, secrets management, and delivery infrastructure.
Design and build scalable, fault-tolerant platform components—including capacity planning, multi-tenancy, networking topology, and storage architecture.
Use AI tooling as a core part of how you work and champion its adoption across the infrastructure team and wider engineering organization.
P2P.org is the largest institutional staking provider with a TVL of over $10B and a market share exceeding 20% in restaking. They unite talented individuals globally and prioritize customer satisfaction, developing innovative solutions.