Source Job

$850,000/yr

  • Build novel, long-horizon evaluations
  • Develop novel measurement approaches for understanding how model capabilities emerge and evolve during RL training
  • Lead strategic evaluation coverage across the company

Machine Learning Red Teaming

20 jobs similar to New Research Lead, Training Insights

Jobs ranked by similarity.

US Unlimited PTO

  • Architect and deploy autonomous AI agents and multi-agent workflows.
  • Design strict-source-following Retrieval-Augmented Generation (RAG) systems.
  • Build scalable backend services using FastAPI.

Osano is an innovative B-Corporation focused on giving modern enterprises the ability to innovate quickly and earn customer trust by respecting data privacy and complying with consent guidelines. We are scaling fast with a multi-year runway and ambitious growth plans.

US Unlimited PTO

  • Lead a senior applied science and ML engineering team focused on foundational personalization capabilities.
  • Set the vision and roadmap for shared reward models and entity/metadata libraries.
  • Build and develop a diverse, high-caliber team of ML engineers.

Jobgether is a platform that connects job seekers with companies. They use an AI-powered matching process to ensure applications are reviewed quickly and fairly.

$230,000–$300,000/yr
US Unlimited PTO

  • Define and lead the technical vision for Cresta’s next-generation Agentic AI systems.
  • Architect scalable, production-grade LLM systems that integrate reasoning, retrieval, planning, tool use, and real-time decision-making into cohesive, intelligent workflows.
  • Own evaluation strategy for complex, non-deterministic AI systems, including offline benchmarking, online experimentation, LLM-as-a-judge methodologies, and systematic failure analysis.

Cresta is on a mission to turn every customer conversation into a competitive advantage by unlocking the true potential of the contact center. Their platform combines the best of AI and human intelligence to help contact centers discover customer insights and behavioral best practices, automate conversations and inefficient processes, and empower every team member to work smarter and faster.

Global

  • Engage the model with investment scenarios, analytical questions, and market-based reasoning tasks; verify factual correctness and financial logic.
  • Assess the validity of investment reasoning; capture reproducible error traces; and provide structured feedback to improve prompts, evaluation frameworks, and analytical depth.
  • Identify where models oversimplify market behavior or misinterpret financial data.

They are evolving large-scale language models from simple conversational tools into systems capable of analyzing financial markets, interpreting investment strategies, and supporting decision-making across asset classes. They seem to have a growing team.

Global

  • Challenge AI models on realistic educational scenarios.
  • Validate whether the model's understanding of pedagogical concepts reflects best-in-class teaching practice.
  • Evaluate AI outputs for clarity and correctness, analyze subtle reasoning errors, and document gaps in logic.

The company is seeking independent Instructional Experts with hands-on experience teaching, tutoring, or building curriculum to train AI models. As a contractor, you’ll supply a secure computer and high-speed internet; company-sponsored benefits such as health insurance and PTO do not apply.

US

  • Design and execute the enterprise AI enablement strategy.
  • Design, build, and facilitate leadership development programs.
  • Rethink and transform the Learning & Development function through an AI-first lens.

Transcarent is a health and care company bringing medical, pharmacy, and point solutions together with a generative AI-powered platform. They empower health consumers with more choice, higher-quality care, and lower costs for 21 million members, partnering with over 1,700 employers and health plans.

Tech Lead

EverAI
Global 4w PTO

  • Architect the system, mentor the team, and spend significant time hands-on in the codebase.
  • Drive our strategy for SFT and RLHF/DPO; oversee the sourcing, labeling, and cleaning of diverse datasets.
  • Design and train custom classifiers to detect and filter non-consensual or illegal content within an explicit environment.

EverAI is building the future of AI companionship, creating entirely new categories. They have 40 million users and are aiming for 100 million and then 500 million. The team of 75 people is enthusiastic, passionate, and hardworking.

US

  • Design and curate evaluation datasets for retrieval quality.
  • Measure retrieval quality using metrics like Recall@k, Precision@k, MRR, and NDCG@k.
  • Conduct systematic error analysis on AI/ML system outputs; build structured failure taxonomies.

Jump empowers financial advisors, firms, and clients to thrive in the age of AI by automating tasks like meeting prep and compliance. As a Series A company, Jump has raised $30M and grown to 100+ employees including leaders from top companies and schools, fostering a culture of velocity, world-class standards, direct communication, and kindness.

$177,000–$250,300/yr
US

  • Own Agent retrieval accuracy and relevance.
  • Drive automated resolution rates.
  • Manage AI safety and trust.

Airtable is the no-code app platform that empowers people closest to the work to accelerate their most critical business processes. More than 500,000 organizations, including 80% of the Fortune 100, rely on Airtable to transform how work gets done.

  • Designing and developing the core platform that enables the efficient deployment, scaling, and management of LLMs and multi-agent systems.
  • Building specialized infrastructure to support long-running agentic workflows, including state management, tool-calling interfaces, and complex reasoning loops.
  • Scaling inference for LLMs to handle global demand while optimizing for latency, throughput, and cost.

Clarity AI is a global tech company founded in 2017 with a unique mission: bringing societal impact to markets. They leverage AI and machine learning technologies to provide top international investors, governments, companies, and consumers with the right data, methodologies, and tools to make more informed decisions. They are now a team of more than 300 highly passionate and curious individuals from all over the world.

Global Unlimited PTO

  • Apply AI to Real Financial Problems: Use GenAI and ML to help users make sense of their money.
  • Choose the Right Tool for Each Problem: Navigate the AI toolkit thoughtfully.
  • Ship with Confidence: Leverage and enhance our sophisticated evaluation framework to ensure AI quality.

Monarch is a personal finance platform designed to simplify finances. They are a fully remote company with a team of do-ers led by experienced entrepreneurs who are passionate about helping members reach their financial goals.

US

  • Lead research initiatives that shape the integrity, quality, and governance of open knowledge systems.
  • Translate complex findings into actionable recommendations and advocate for scientific rigor.
  • Engage with global scientific communities, represent organizational research perspectives, and promote best practices.

Jobgether is a platform that connects job seekers with potential employers. They use AI-powered matching to ensure applications are reviewed quickly and fairly, and share shortlisted candidates with hiring companies for final decisions.

US

  • Identify high-impact AI use cases tied to revenue growth, margin expansion, and efficiency.
  • Build practical AI roadmaps tailored to each company’s data and capabilities.
  • Design and implement AI solutions, including automation, analytics, and LLM applications.

LLR Partners is a lower middle market private equity firm focused on investing in software and tech-enabled companies within the knowledge economy. Founded in 1999 and headquartered in Philadelphia, LLR has raised over $7.5 billion across seven funds and has partnered with over 130 companies.

  • Design, build, and deploy the critical small language models that are foundational to Fastino’s product.
  • As an engineer, you will own the full lifecycle of our state-of-the-art models, from prototyping and data analysis to deployment and monitoring.
  • Drive the data strategy to continuously improve model performance by analyzing distribution gaps and contributing to synthetic data pipelines.

Fastino is building the next generation of LLMs, with a team of alumni from Google Research, Apple, Stanford, and Cambridge. They have raised $25M through their seed round and are backed by leading investors including Microsoft, Khosla Ventures, and Insight Partners.

$40–$100/hr
Global

  • Migrate and test existing bulk flashcard creation prompts.
  • Run test suites and manually review AI outputs for quality and correctness.
  • Analyze real user data to identify failure patterns and improve prompts.

Brainscape is the world's leading web & mobile EdTech study platform. They help millions of learners create better flashcards, and they are looking for an AI Prompt Engineer to join their team.

Global

  • Fine-tuning pre-trained LLMs on small to medium datasets.
  • Implementing parameter-efficient fine-tuning (e.g., LoRA-style methods).
  • Optimising training for cost and performance.

They deliver cutting-edge ML and GenAI solutions across diverse industries and collaborate with global organizations. The company solves real-world challenges at scale with dynamic, high-impact projects.

US

  • You'll work with AI tools, test model outputs, and evaluate responses.
  • Document errors and gaps, and collaborate with our team.
  • Spot inconsistencies and provide structured feedback.

Project World Wide is involved in shaping the future of AI through training data. They seek motivated individuals to contribute to the development of cutting-edge AI systems.

US

  • You will define, build, and evolve foundational systems that enable autonomous agents to operate reliably in production.
  • You’ll explore new approaches, prototype quickly, and turn what works into durable platform foundations.
  • You’ll identify high-leverage architectural improvements, abstractions, and guardrails that expand what the platform can do while keeping it reliable, secure, observable, and maintainable under real-world conditions.

Kindo is an agent automation platform for DevOps and SecOps teams, helping organizations automate high-friction operational work using autonomous agents. They are a small, highly technical team with strong customer traction and real enterprise revenue, where engineers have direct ownership over critical systems.

Global

  • Construct expert benchmarks: Build and validate real-world investment cases and portfolio management frameworks to evaluate AI systems.
  • Stress-test model reasoning: Diagnose weaknesses in AI-generated investment analyses, identifying where logic or market intuition fails.
  • Design frameworks: Translate how institutional investors evaluate securities and manage risk into problems that push the limits of AI reasoning.

Mentis AI operates where institutional investment expertise meets frontier AI systems. They combine asset management experience with machine learning and applied AI research, collaborating with leading AI labs to improve how models reason and make decisions in financial contexts.

Global

  • Design and execute complex jailbreak attempts to identify vulnerabilities in state-of-the-art models.
  • Use your background in linguistics or social sciences to find "hidden" biases or harms that standard automated filters miss.
  • Systematically rank LLM outputs to determine where safety guardrails are failing or succeeding.

They are building safer, more robust intelligence. They appear to be a small team with a culture that values asynchronous work and self-starters.