Our value is directly tied to content quality at scale, but we currently have no systematic way to evaluate changes to models or pipeline architecture.
You will build the evaluation function from scratch: define quality, measure it, and create an experimental framework that lets us ship with confidence.
This foundational role involves owning the LLM evaluation strategy and eventually seeding and growing a team.

What You'll Do:

Establish what "good" looks like by building gold-standard datasets and rubrics for accuracy, completeness, and readability.
Create automated evaluation pipelines and infrastructure for A/B comparisons, LLM-as-judge evaluation, and CI integration.
Develop scalable quality signals, monitor trends, design human review sampling, and run experiments to quantify cost/quality/latency tradeoffs.

Qualifications:

Requires a Bachelor's, Master's, or PhD in a quantitative field and 3-5 years of experience (7+ preferred) in applied science or data science with a focus on evaluation, NLP, or generative AI.
Must have strong statistical foundations, experience evaluating LLM/NLP systems, proficiency in Python and data stack tools, and strong data storytelling skills.
Preferred skills include experience with LLM APIs, evaluation frameworks, data pipelines, SQL, visualization tools, and annotation workflows.

Driver

Driver builds the context layer that employees and AI agents use to develop software, turning source code into human language. It is an early-stage, fast-growing startup backed by Y Combinator and Google Ventures, with a culture that values delivery speed, flexibility, and working within a small, close-knit team.
