Source Job

United States

  • Own the knowledge surface, ingestion pipelines, and retrieval quality.
  • Own the eval harness for benchmarks and regression detection.
  • Pair with Agent Engineers on the prompt-and-eval iteration cycle.

RAG Evaluation Python TypeScript

20 jobs similar to RAG and Evaluation Engineer

Jobs ranked by similarity.

$160,000–$240,000/yr
US

  • Build agentic AI systems that change how Dataiku runs internally.
  • Turn real problems into working software.
  • See your solutions through from first conversation to production.

Dataiku is the Platform for AI Success, the enterprise orchestration layer for building, deploying, and governing AI. The world’s leading companies rely on Dataiku to operationalize AI and run it as a true business performance engine delivering measurable value.

$115,000–$130,000/yr
US 4w PTO

  • Write, iterate, and maintain system prompts and instruction sets for Noodle’s AI agents across the student journey.
  • Build and maintain evaluation frameworks to measure agent accuracy, tone, hallucination rate, task completion, and alignment with rubric-based learning objectives.
  • Partner with Noodle teammates and university stakeholders to design, build, and test agents — translating learning objectives, operational flows, rubric assessments, and more into prompt-level agent instructions.

Noodle is higher education’s leading strategy, services, and technology partner. It develops infrastructure, provides life-changing learning experiences, and grows awareness of and enrollment in some of the best academic institutions in the world. They empower universities to change the world by offering partners a range of products and services.

$100,000–$200,000/yr
Global

  • Improve prompts, model selection, and tool usage so the system gets more decisions right over time.
  • Reduce latency, token usage, and cost while preserving decision quality and operational reliability.
  • Design validation, retries, and human review paths for ambiguous, adversarial, incomplete, or conflicting inputs.

Risk Labs is the core team behind UMA and Across, building infrastructure that pushes crypto forward. They value ownership, curiosity, thoughtful risk-taking, and direct communication.

US

  • Own the agent layer of the platform, including architecture, prompts, tool surfaces, and multi-agent orchestration.
  • Drive translation and dependency-mapping accuracy across unfamiliar legacy paradigms.
  • Write production agent code daily, using subagents and multi-agent workflows as the normal way of working.

LTS applies frontier AI to modernize legacy systems in healthcare and government IT. It is a small, senior engineering team operating with high leverage and a culture of innovation and collaboration.

$15/hr
US

  • Identify and label languages and dialects from model-generated responses.
  • Review outputs from two different AI models and determine which model correctly identified the proposed language.
  • Compare model responses and select the appropriate evaluation outcome from predefined options.

RWS – TrainAI is looking for Language Data Annotators. They embrace DEI, promote equal opportunity, and prohibit discrimination and harassment of any kind.

Global

  • Design, configure, and maintain AI agent workflows using Cursor and Claude Code for automated data system architecture.
  • Build and maintain a RAW → Base → Data Marts pipeline using dbt Core and implement business logic at the transformation layer.
  • Build comprehensive test suites using Great Expectations and ensure data quality through manual inspection and automation.

BiOptimizers helps people go from baseline health to peak biological performance with science-backed supplements and wellness tools. As a remote-first company, their globally distributed team focuses on clarity, autonomy, and operational excellence.

Canada

  • Design and implement multi-agent AI systems using frameworks like LangChain and CrewAI, building agent-to-agent orchestration pipelines.
  • Fine-tune foundation models, integrate retrieval-augmented generation, and develop APIs and backend services for production deployment.
  • Containerize and deploy agents with Docker and Kubernetes, while collaborating with QA and product teams to benchmark accuracy and safety.

Innodata is a global data engineering company focused on enabling the responsible advancement of artificial intelligence by providing data, evaluation frameworks, and human expertise. With over 36 years of experience, the company delivers high-quality data solutions and services for Generative AI builders and adopters.

Canada US

  • Own customer solutions end-to-end, rapidly prototyping and deploying solutions in live operational environments.
  • Build trusted relationships from IC level to executive sponsor, becoming the technical face of the company.
  • Operate as part of a tight, multi-disciplinary unit with focus and urgency, seamlessly trading tasks to whoever is closest to the skills needed.

Kinaxis is a global leader in modern supply chain orchestration, powering complex global supply chains with an AI-infused platform. With over 2,000 employees worldwide and 6 global offices, it has been recognized with several Top Employer awards and fosters a culture focused on technology, customers, and innovation.

US

  • Interact with generative AI models and project guidelines.
  • Create prompts to test model behavior across safety categories.
  • Document model breakability and effort level.

Welo Data provides AI services and specializes in data annotation. They foster a collaborative and innovative culture where employees contribute to cutting-edge AI safety evaluation.

AI Engineer

Zinier
India

  • Design pragmatic solutions for real problems, assessing each use case and selecting the right approach.
  • Prototype rapidly and deliver iteratively, shipping functional prototypes within days and validating value with real users.
  • Build agentic AI systems where justified, designing and implementing multi-agent architectures and LLM-based tooling.

Zinier empowers frontline workers to achieve greater things. They are a remote-first, global team headquartered in Silicon Valley with a hybrid workforce across the United States, Canada, Europe, Latin America, Singapore, and Bangalore, India.

US

  • Shape technical direction and architecture: Define the foundational architecture for enterprise agentic AI at Benchling.
  • Build and ship the early portfolio yourself: Write production code at least half your time, particularly during the team's first year.
  • Design for enterprise from day one: Build for multi-tenant isolation, secrets management, audit logging, payload encryption, role-based access controls, and human-in-the-loop controls calibrated to risk.

Benchling is the AI platform for biotech R&D. Scientists use Benchling to design experiments, capture structured data, and run AI agents and models directly in their workflows. They have over 200,000 scientists around the world, from academic labs to Sanofi and Moderna.

$180,000–$225,000/yr
US

  • Instrument fal's core infrastructure to capture CPU, GPU, and request-level signals.
  • Build ingestion pipelines from partner APIs, compute vendors, and internal services into BigQuery.
  • Design and operate the ETL backbone that powers cost, margin, and usage analytics.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production at scale.

India

  • Build and ship specialized agents including parsers, extractors, and synthesizers for the Aedeon agent-native modernization platform.
  • Own the full delivery of assigned agents from prototype through deployment and post-release validation, practicing test-driven development.
  • Write clear Python, document agent contracts and decision logic, and promote a culture of release discipline and quality across the team.

Mactores is a trusted leader in providing modern data platform solutions, enabling businesses to accelerate value through automation with end-to-end data solutions that are automated, agile, and secure. Since 2008, they have collaborated with customers to strategize and navigate digital transformation via assessments, migration, or modernization, fostering a culture driven by 10 core leadership principles.

$180,000–$320,000/yr
North America

  • Quickly iterate and develop proofs of concept to explore integrating AI into data and marketing workflows.
  • Make key decisions about the choice of AI architecture and frameworks.
  • Build production data agents to seamlessly answer analytics and data science questions.

Hightouch is an Agentic Marketing Platform that provides a composable CDP. They enable marketing teams to analyze performance, brainstorm ideas, and generate creative quickly. The team is ambitious and impact-driven, with a focus on humility, kindness, and compassion.

US Europe

  • Serve as a core safety partner embedded across product and research teams, providing Trust & Safety engineering support for all launches from early design through post-launch monitoring.
  • Build and maintain safety infrastructure ensuring Runway's models have a positive impact as they reach millions of users.
  • Design, execute, and continuously improve red teaming systems to proactively surface harmful outputs before production.

Runway builds AI to simulate the world through merging art and science, focusing on world models for general-purpose simulation. The team consists of creative, open-minded, caring, and ambitious people determined to change the world.

US

  • Interact with generative AI models using project-provided guidelines, safety taxonomies, and attack-vector guidance.
  • Create and evaluate prompts designed to test model behavior across safety-related categories.
  • Identify where model responses become unsafe, noncompliant, inconsistent, or otherwise problematic.

Welo Data is an AI services company that specializes in data annotation. They deliver multilingual content transformation services in translation, localization, and adaptation for over 250 languages with a growing network of over 400,000 in-country linguistic resources.

$25/hr
US

  • Collaborate with engineering and design to optimize prompt engineering frameworks for open-ended generative AI features.
  • Research customer interaction models from LLMs to downstream features.
  • Evaluate the evolving AI ecosystem, including the ChatGPT store and third-party LLM integrations.

Acorns is a financial wellness app that helps everyday people and families save and invest money for the long term. Since 2014, Acorns has grown into a global company with multiple life-stage products serving the needs of kids, teens, adults, and parents.

Global

  • Design and write high-quality prompts for LLM-based agents and build agentic tools and workflows using Netomi's no-code platform.
  • Integrate external and internal APIs, implement unit tests, and ensure reliability of agent workflows.
  • Optimize agents for performance, cost, and fault tolerance while collaborating with Product, QA, and Delivery teams.

Netomi is the leading agentic AI platform for enterprise customer experience, working with global brands like Delta Airlines and MetLife. Backed by WndrCo, Y Combinator, and Index Ventures, the company helps enterprises drive efficiency and lower costs.

Europe North America 7w PTO

  • Design a Python framework for implementing internal and public benchmarks.
  • Build and maintain a pipeline that runs distributed evaluations at scale.
  • Collaborate with modeling and product teams to improve experimentation and evaluation tooling.

Poolside aims to be the leading company in building a world where AI drives economically valuable work and scientific progress. They are a remote-first team across Europe and North America, gathering monthly in person for 3 days and twice a year for longer offsites.

Australia Canada France Germany Spain US Unlimited PTO

  • Design and build Claude skills, MCP integrations, and automated pipelines that transform internal knowledge into publication-ready docs with minimal manual intervention.
  • Act as the final reviewer for content produced by AI-assisted workflows and engineers, maintaining a high bar for technical accuracy and polish.
  • Define content structures and metadata standards that ensure our documentation is agent-consumable and machine-parseable.

Upsun, formerly Platform.sh, is the cloud application platform humans and robots love. They give developers, DevOps engineers, and platform teams the ability to build, ship, and scale confidently without wrestling with backend infrastructure.