Design and build production-grade AI systems, including RAG pipelines, multi-step agents, and LLM-powered features.
Build comprehensive evaluation and observability frameworks to measure model accuracy, grounding, and quality drift.
Create production-quality Python services to wrap AI logic into secure microservices.
League, founded in 2014, is the leading healthcare consumer experience (CX) platform powered by AI, reaching over 63 million people globally. Payers, providers, and consumer health partners use League’s platform to deliver high-engagement healthcare solutions and improve health outcomes.
Osano is an innovative B-Corporation focused on giving modern enterprises the ability to innovate quickly and earn customer trust by respecting data privacy and complying with consent guidelines. We are scaling fast with a multi-year runway and ambitious growth plans.
Lead the AI Evaluation team, owning staffing, coaching, performance management, and delivery of evaluation and testing frameworks.
Manage the AI evaluation lifecycle — including pre-launch testing, simulation, and post-deployment health monitoring — ensuring alignment with governance standards and expectations.
Create domain-specific evaluation tracks (e.g., Compliance & Risk, Bot Experience, Agent Experience) to assess AI quality from multiple perspectives.
Chime is a financial technology company that believes everyone can achieve financial progress. They are a team of problem solvers, dreamers, and builders with one shared obsession: their members.
Design and curate evaluation datasets for retrieval quality.
Measure retrieval quality using metrics like Recall@k, Precision@k, MRR, and NDCG@k.
Conduct systematic error analysis on AI/ML system outputs; build structured failure taxonomies.
Jump empowers financial advisors, firms, and clients to thrive in the age of AI by automating tasks like meeting prep and compliance. As a Series A company, Jump has raised $30M and grown to 100+ employees including leaders from top companies and schools, fostering a culture of velocity, world-class standards, direct communication, and kindness.
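For readers unfamiliar with the retrieval metrics named in this listing (Recall@k, Precision@k, MRR, NDCG@k), here is a minimal sketch of how they are commonly computed. It assumes binary relevance labels and one ranked list of document IDs per query. The function names and sample IDs are hypothetical and are not taken from any employer's codebase.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical query: four retrieved IDs, two of which are labeled relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
scores = {
    "precision@3": precision_at_k(ranked, relevant, 3),
    "recall@3": recall_at_k(ranked, relevant, 3),
    "mrr": mean_reciprocal_rank([(ranked, relevant)]),
    "ndcg@3": ndcg_at_k(ranked, relevant, 3),
}
```

Graded relevance (for example, scores from 0 to 3) would change only the gain term in `ndcg_at_k`; the binary form is used here to keep the sketch short.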
Design and optimise AI-ready tools and APIs that enable LLM platforms to reliably interact with Canva's design capabilities.
Build and maintain evaluation frameworks to systematically measure tool-use accuracy across platforms.
Experiment with LLM orchestration and agent architectures, developing Canva agents that any third-party provider can call to design quickly, efficiently, and at scale.
Canva is a platform redefining how the world experiences design. They have a flagship campus in Sydney, with a second campus in Melbourne and co-working spaces in Brisbane, Perth, Adelaide, and Auckland, NZ.
Co-create evaluation frameworks, proctoring solutions, and critical security mechanisms.
Evaluate and implement state-of-the-art AI/ML techniques.
Design, build, and deploy scalable AI services and pipelines.
The company develops and scales a global Certification-as-a-Service platform that automates the entire lifecycle of professional exams. The solution enables companies and organizations to quickly and inexpensively create and administer official online exams/certifications.
Build and ship agentic features across the RevenueCat universe
Design and implement tool integrations that expand what agents can see and do
Own the reliability and quality of agent responses
RevenueCat is a monetization platform for mobile apps, helping developers understand and grow their revenue by removing the headaches of building and scaling in-app subscriptions. They are a remote-first company of 120+ employees across 25 countries, valuing customer obsession and continuous improvement.
You will define, build, and evolve foundational systems that enable autonomous agents to operate reliably in production.
You’ll explore new approaches, prototype quickly, and turn what works into durable platform foundations.
You’ll identify high-leverage architectural improvements, abstractions, and guardrails that expand what the platform can do while keeping it reliable, secure, observable, and maintainable under real-world conditions.
Kindo is an agent automation platform for DevOps and SecOps teams, helping organizations automate high-friction operational work using autonomous agents. They are a small, highly technical team with strong customer traction and real enterprise revenue, where engineers have direct ownership over critical systems.
n8n is the open workflow orchestration platform built for the new era of AI. They give technical teams the freedom of code with the speed of no-code, so they can automate faster, smarter, and without limits. Since their founding in 2019, they’ve grown into a diverse team of over 220 working across Europe and the US, connected by a shared builder spirit and with their centre of gravity in Berlin.
Building a truly flexible and scalable conversational AI platform.
Fine-tuning and evaluating LLM-based models to improve performance.
Contributing to platform engineering across both ML and backend systems.
Canva is a design platform that allows users to create social media graphics, presentations, posters, documents, and other visual content. They have a campus in Sydney, a second campus in Melbourne, and co-working spaces in Brisbane, Perth, Adelaide, and Auckland, NZ.
Build and ship AI-powered product features using LLMs and generative models
Develop and maintain services and APIs around ML models
Integrate AI models into production systems and user-facing applications
Social Discovery Group (SDG) is the third-largest social discovery company in the world, uniting 60+ brands with 500 million users. They transform virtual intimacy into the new normal by solving the problems of loneliness, isolation, and disconnection. Their international team of 1,200 professionals and digital nomads works all over the world.
Design, develop, and refine large language model workflows to steer and improve model behaviors.
Build language processing components for intent detection, summarization and conversational response quality.
Drive R&D-style exploration on cutting-edge speech and language systems, rapidly prototyping novel approaches.
Cresta's platform combines AI and human intelligence to help contact centers discover customer insights and behavioral best practices, automate conversations, and empower team members. They are led by founders with experience at Google, Waymo, and OpenAI, and are on a mission to revolutionize the workforce with AI.
Migrate and test existing bulk flashcard creation prompts.
Run test suites and manually review AI outputs for quality and correctness.
Analyze real user data to identify failure patterns and improve prompts.
Brainscape is the world's leading web & mobile EdTech study platform, helping millions of learners create better flashcards. The company is looking for an AI Prompt Engineer to join its team.
You will design, build, and operate core systems that enable autonomous agents to function reliably in production.
You’ll build production-grade agentic workflows, retrieval and memory systems, multi-model execution, and tool-calling integrations that interact safely with enterprise systems.
You’ll explore new approaches, prototype quickly, and turn what works into durable production systems.
Kindo is an agent automation platform for DevOps and SecOps teams. They help organizations automate high-friction operational work using autonomous agents. Kindo is a small, highly technical team with strong customer traction and real enterprise revenue.
Design and develop machine learning infrastructure, tooling, and models to help teams deliver world-class experiences.
Help product and development teams understand the data lifecycle and the inherent experimental nature of machine learning.
Build internal products and platforms to enable teams to incorporate AI into their features and customer-facing products.
Weave provides an all-in-one platform for small businesses to streamline communications and patient experiences. The company has a phenomenal culture, and its teams are cross-functional agile teams, each composed of a product owner, backend and frontend developers, and DevOps.
Serve as the primary AI engineering partner to the CEO and executive leadership team, translating ideas into production-ready AI agents.
Independently take ideas from concept to production, shaping problem statements and operationalizing solutions.
Develop production-grade AI systems using modern LLMs, with strong attention to scalability and clean engineering practices.
Webflow is building the world’s leading AI-native Digital Experience Platform as a remote-first company. Their mission is to bring development superpowers to everyone and empower teams to design, launch, and optimize for the web without barriers.
Ship AI-powered products and tools from zero to production.
Architect systems that scale beyond demos.
Work across the full stack.
Human Agency partners with organizations of all sizes to explore, design, and implement AI strategies that are secure, scalable, and human-centered. They are scaling rapidly and have a growing pipeline of opportunities that demand exceptional talent across disciplines.
Drive the design and evolution of AI-ready tools and APIs for LLM platforms.
Own and evolve evaluation frameworks that measure tool-use accuracy across platforms.
Shape Canva's agent architecture, making strategic technical decisions about intelligence location.
Canva is a design platform that enables users to create various visual content. They have offices in multiple locations in Australia and New Zealand, and they offer a flexible work environment.
Build and ship AI-powered product and internal solutions using LLMs, RAG, tool calling, workflows, and agentic patterns
Design quality and evaluation frameworks for AI systems, including offline evals, online signals, failure analysis, and continuous improvement loops
Contribute to AI platform and tooling decisions that improve reuse, speed, and consistency across teams
Finom is a European tech startup headquartered in Amsterdam, revolutionizing the financial landscape for entrepreneurs. They develop an all-in-one financial B2B solution that integrates banking, accounting, financial management, and invoicing into a mobile-first platform, and they nurture innovation in an inspiring work environment.
Execute structured test plans across open source repositories, sample applications, SDK extensions, and AI workflow integrations.
Perform functional, integration, and regression testing on frameworks, applications, notebooks, scripts, APIs, and reference implementations.
Validate reproducibility of AI workflows in Jupyter and Google Colab environments.
Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Founded in 2007, they scaled the business and today generate over $136M ARR managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries.