Source Job

US

  • Design and curate evaluation datasets for retrieval quality.
  • Measure retrieval quality using metrics like Recall@k, Precision@k, MRR, and NDCG@k.
  • Conduct systematic error analysis on AI/ML system outputs; build structured failure taxonomies.

Python SQL LLMs Data Analysis Experimental Design

18 jobs similar to Applied AI Evaluation Scientist

Jobs ranked by similarity.

$130,000–$140,000/yr
Global 7w PTO

  • Build evaluation infrastructure to measure AI system speed and accuracy.
  • Create observability tooling and dashboards that surface quality metrics week-over-week.
  • Prototype and validate improvements to our RAG pipeline.

Circle is building an all-in-one platform for online communities, enabling creators and businesses to bring together their audience with discussions, live streams, events, chat, courses, and payments. They are a fully remote company of around 200 team members from 30+ countries, valuing autonomy, trust, and collaboration across time zones.

$177,000–$250,300/yr
US

  • Own Agent retrieval accuracy and relevance.
  • Drive automated resolution rates.
  • Manage AI safety and trust.

Airtable is the no-code app platform that empowers people closest to the work to accelerate their most critical business processes. More than 500,000 organizations, including 80% of the Fortune 100, rely on Airtable to transform how work gets done.

US Canada Unlimited PTO

  • Design and build end-to-end LLM-powered workflows, including RAG pipelines, tool-calling systems, and agent architectures.
  • Rapidly prototype internal AI assistants and automation tools across business functions.
  • Develop shared connectors to major LLM providers and internal data sources.

AssetWatch serves global manufacturers by powering manufacturing uptime through the delivery of an unparalleled condition monitoring experience. As they enter the next phase of rapid growth, they are seeking people to help lead the journey as a devoted and capable team.

$111,888–$128,633/yr
Canada US

  • Design and build production-grade AI systems, including RAG pipelines, multi-step agents, and LLM-powered features.
  • Build comprehensive evaluation and observability frameworks to measure model accuracy, grounding, and quality drift.
  • Create production-quality Python services to wrap AI logic into secure microservices.

League, founded in 2014, is the leading healthcare consumer experience (CX) platform powered by AI, reaching over 63 million people globally. Payers, providers, and consumer health partners use League’s platform to deliver high-engagement healthcare solutions and improve health outcomes.

Global

  • Review and validate Oracle SQL queries and output generated from existing natural language questions.
  • Ensure that SQL logic is correct, Oracle-compliant, and that it produces realistic, accurate results.
  • Verify that query outputs correctly and fully answer the original natural language questions and provide edits if needed.

CrowdGen, by Appen, provides AI annotation projects. This role is a project-based opportunity in which you will join as an Independent Contractor.

$40–$100/hr
Global

  • Migrate and test existing bulk flashcard creation prompts.
  • Run test suites and manually review AI outputs for quality and correctness.
  • Analyze real user data to identify failure patterns and improve prompts.

Brainscape is the world's leading web & mobile EdTech study platform. They help millions of learners create better flashcards and the company is looking for an AI Prompt Engineer to join their team.

$150,000–$200,000/yr
US

  • Partner with executive sponsors and end users to identify high impact use cases and turn them into measurable business outcomes on Glean.
  • Lead strategic reviews and advise customers on their AI roadmap, ensuring they get the most value from Glean’s platform.
  • Translate business needs into clear problem statements, success metrics, and practical AI solutions; collaborate with Product and R&D to shape priorities.

Glean is an innovative AI-powered knowledge management platform designed to help organizations quickly find, organize, and share information across their teams. The company's cutting-edge AI technology simplifies knowledge discovery, making it faster and more efficient for teams to leverage their collective intelligence.

$130,000–$210,000/yr
US

  • Partner with executive sponsors to identify high-impact use cases and turn them into measurable business outcomes.
  • Translate business needs into clear problem statements and success metrics; collaborate with Product and R&D.
  • Design and build AI agents with and for customers, rethinking business processes for maximum usability.

Jobgether connects job seekers with companies using an AI-powered matching process. They focus on ensuring applications are reviewed quickly and fairly, and they share top candidates with hiring companies.

US Canada

  • Establish the foundation of LiveKit's analytics practice.
  • Translate business concepts into robust data models, create actionable KPIs, and enable data-driven decision making.
  • Balance pragmatic short-term solutions with building for scale in an AI-first development environment.

LiveKit is building the infrastructure layer for the voice-driven era of computing. Their platform gives developers everything they need to build, test, deploy, scale, and observe agents in production, facilitating billions of calls each year.

$15/hr
US

  • Data collection, evaluation, and annotation.
  • Pairwise comparisons.
  • Object tagging and labeling across different content types.

RWS TrainAI focuses on improving AI-generated content. They embrace DEI and equal opportunity, committed to a discrimination-free work environment where employment decisions are based on business needs and qualifications.

$170,000–$190,000/yr
Global

  • Own and evolve Orchestry’s Fabric-based Data Warehouse as the trusted foundation for analytics and AI.
  • Design the warehouse not just for reporting, but for reasoning.
  • Develop internal AI agents that query the warehouse, interpret results, and recommend or take action.

Orchestry is a rapidly growing SaaS company in the Microsoft 365 ecosystem, helping organizations simplify, govern, and automate their digital workplace. They work globally with partners and enterprise customers and operate as a fully remote company by design.

$15/hr
US

  • Data collection, evaluation, and annotation.
  • Pairwise comparisons.
  • Object tagging and labeling across different content types (audio, video, images, or collected data).

RWS TrainAI focuses on improving AI-generated content. They embrace DEI and are an equal opportunity employer committed to providing a work environment free of discrimination and harassment.

Global

  • Design and ship production-grade agentic AI systems that meaningfully improve customer workflows and internal operations.
  • Establish a clear technical architecture for AI at Moxie, including agent orchestration, tool/function calling and observability.
  • Integrate AI deeply into the Moxie platform, ensuring AI systems are secure, resilient, cost-aware, and aligned with a regulated environment.

Moxie empowers ambitious aesthetic entrepreneurs to build profitable, independent practices. They are a global, remote-first team of more than 140 people, supporting hundreds of practices nationwide, aiming to unlock sustainable success for aesthetic entrepreneurs.

$110,000–$140,000/yr
US Canada

  • Design, build, and ship agentic workflows across multiple domains.
  • Build multi-step agents capable of autonomous planning, context tracking, memory, tool use, and API orchestration.
  • Drive technical and architectural decisions to meet product requirements while also anticipating and designing for future needs.

Cority helps customers see and prevent risks across their operations in real time. They provide a platform that converges people, data, and AI agents and is trusted by more than 1,500 of the most complex organizations worldwide.

US

  • Review search results and evaluate their helpfulness and relevance.
  • Answer true/false questions about content quality.
  • Complete online tasks that improve AI systems, following the provided guidelines.

Welo Data provides AI services and data validation. They appear to have a flexible and supportive culture that emphasizes the importance of individual contributions to improving AI technology.

US

  • Identify high-impact AI use cases tied to revenue growth, margin expansion, and efficiency.
  • Build practical AI roadmaps tailored to each company’s data and capabilities.
  • Design and implement AI solutions, including automation, analytics, and LLM applications.

LLR Partners is a lower middle market private equity firm focused on investing in software and tech-enabled companies within the knowledge economy. Founded in 1999 and headquartered in Philadelphia, LLR has raised over $7.5 billion across seven funds and has partnered with over 130 companies.

Europe 6w PTO

  • Scope and implement AI Agent deployments, providing strategic advice and execution support to customers and partners.
  • Leverage knowledge of LLM internals to analyze customer requirements and design precise prompts for reliable, user-aligned behavior.
  • Fine-tune conversational flows and voice output to align with customer brand standards.

Parloa is a fast-growing startup in the world of Generative AI and customer service. They have over 400 employees in Berlin, Munich, and New York and are expanding globally.

Global Unlimited PTO

  • Apply AI to Real Financial Problems: Use GenAI and ML to help users make sense of their money.
  • Choose the Right Tool for Each Problem: Navigate the AI toolkit thoughtfully.
  • Ship with Confidence: Leverage and enhance our sophisticated evaluation framework to ensure AI quality.

Monarch is a personal finance platform designed to simplify finances. They are a fully remote company with a team of do-ers led by experienced entrepreneurs who are passionate about helping members reach their financial goals.