Applied AI Evaluation Scientist

Location: Remote (US)

Agentic RAG Pipeline Evaluation & Optimization:

  • Design evaluation datasets (synthetic query-answer pairs, adversarial cases, real user query sets).
  • Measure retrieval quality using Recall@k, Precision@k, MRR, NDCG@k; assess appropriateness per use case.
  • Evaluate and optimize chunking strategies; benchmark embedding models and re-rankers.
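The retrieval metrics named above (Recall@k, Precision@k, MRR, NDCG@k) could be sketched as follows, assuming binary relevance judgments; the function names and signatures are illustrative, not any particular library's API:

```python
import math

def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    rel = set(relevant)
    return sum(1 for d in retrieved[:k] if d in rel) / k

def mrr(relevant, retrieved):
    """Reciprocal rank of the first relevant doc (0 if none is retrieved)."""
    rel = set(relevant)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant, retrieved, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

In practice these are averaged over an evaluation query set, and the choice of k is matched to the use case (e.g. how many chunks the generator actually consumes).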

Broader AI/ML Evaluation:

  • Conduct systematic error analysis through trace reading and failure mode identification.
  • Design and validate LLM-as-Judge evaluators, refining iteratively and measuring TPR/TNR.
  • Build and maintain golden datasets for CI regression testing of AI pipelines.
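Validating an LLM-as-Judge evaluator against human labels reduces to computing TPR and TNR over an aligned set of verdicts. A minimal sketch, assuming boolean pass/fail labels (the helper is hypothetical, not a specific tool):

```python
def judge_agreement(human_labels, judge_labels):
    """TPR and TNR of an LLM judge measured against human ground truth.

    human_labels, judge_labels: aligned lists of booleans (True = pass).
    TPR = fraction of human-labeled passes the judge also passes;
    TNR = fraction of human-labeled failures the judge also fails.
    """
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    pos = sum(1 for h, _ in pairs if h)
    neg = len(pairs) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return tpr, tnr
```

Iterating on the judge prompt until both rates are acceptably high on a held-out labeled set is what makes the judge trustworthy enough to gate CI regression runs.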

Collaboration & Data Review:

  • Partner with Product to translate product requirements into measurable evaluation criteria.
  • Partner with Engineering to instrument pipelines for observability and integrate evaluation checks into CI/CD.
  • Lead or facilitate annotation workflows, measure inter-annotator agreement (Cohen's Kappa), and produce labeled datasets.
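Inter-annotator agreement via Cohen's Kappa corrects raw agreement for chance. A self-contained sketch for two annotators over categorical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' aligned categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    distribution. 1.0 = perfect agreement, 0.0 = chance-level.
    """
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

A common workflow is to double-label a sample of the dataset, check that kappa clears a pre-agreed threshold (often ~0.6-0.8 depending on task subjectivity), and refine the annotation guidelines before scaling up labeling.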

Jump

Jump empowers financial advisors, firms, and clients to thrive in the age of AI by automating tasks like meeting prep and compliance. As a Series A company, Jump has raised $30M and grown to 100+ employees including leaders from top companies and schools, fostering a culture of velocity, world-class standards, direct communication, and kindness.
