Source Job

Egypt

  • Design scenario-based and edge-case prompts to test AI behavior.
  • Develop evaluation rubrics to assess AI responses across multiple criteria.
  • Perform side-by-side evaluations of AI outputs and score them using defined criteria.

Linguistics, Prompt Engineering, Technical Writing

11 jobs similar to Arabic (Egyptian) AI Evaluation Specialists

Jobs ranked by similarity.

Indonesia

  • Review short, pre-segmented datasets.
  • Evaluate model-generated replies based on Tone or Fluency.
  • Read a user prompt and two model replies, then rate each using a five-point scale.

CrowdGen, by Appen, focuses on AI response evaluation. They are looking for native Javanese speakers to contribute to a multilingual AI response evaluation project reviewing large language model outputs.

Europe

  • Evaluate AI-generated responses for accuracy, grammar, and cultural relevance.
  • Identify issues and provide refined, high-quality rewritten responses.
  • Create natural prompts and responses in Spanish to improve conversational datasets.

Welo Data, part of Welocalize, is a global AI data company with 500,000+ contributors delivering high-quality, ethical data to train the world’s most advanced AI systems. They're building smarter, more human AI with a diverse community in 100+ countries.

$25–$30/hr
Global

  • Evaluate AI-generated French speech and text for linguistic accuracy, naturalness, and educational quality.
  • Assess learner speech and writing across proficiency levels from CEFR Pre-A1 through B2+.
  • Apply expert judgment to identify learner errors, unnatural phrasing, and pedagogical gaps.

Alignerr partners with leading AI labs to build expert-driven data pipelines. They improve how models reason, learn, and communicate by working with domain specialists to evaluate and refine AI systems where precision, pedagogy, and human judgment matter most.

Global

  • Evaluate AI-generated Japanese speech and text for linguistic accuracy, naturalness, and educational quality.
  • Assess learner speech and writing across proficiency levels from CEFR Pre-A1 through B2+.
  • Apply expert judgment to identify learner errors, unnatural phrasing, and pedagogical gaps.

Alignerr partners with leading AI labs to build expert-driven data pipelines that improve how models reason, learn, and communicate. They work with domain specialists around the world to evaluate and refine AI systems in areas where precision, pedagogy, and human judgment matter most.

Global

  • Completing AI training tasks such as analyzing, editing, and writing in Mandarin
  • Judging how well AI handles Mandarin prompts
  • Improving cutting-edge AI models

Prolific is building the biggest pool of quality human data in the world and is not just another player in the AI space. Over 35,000 AI developers, researchers, and organizations use Prolific to gather data from paid study participants with a wide variety of experiences, knowledge, and skills.

Global

  • Native or near-native fluency in Central Khmer.
  • Based in: Cambodia, Thailand.
  • Comfortable with digital tools.

Welo Data, part of Welocalize, is a global AI data company with 500,000+ contributors delivering high-quality, ethical data to train the world’s most advanced AI systems. They’re building smarter, more human AI with a diverse community in 100+ countries.

$30–$35/hr

  • Evaluate AI-generated Korean speech and text for linguistic accuracy, naturalness, and educational quality.
  • Assess learner speech and writing across proficiency levels from CEFR Pre-A1 through B2+.
  • Apply expert judgment to identify learner errors, unnatural phrasing, and pedagogical gaps.

Alignerr collaborates with top AI labs, creating data pipelines driven by experts to enhance AI models' reasoning, learning, and communication. They partner with domain specialists worldwide, perfecting AI systems where precision, pedagogy, and human judgment are crucial.

US

  • Evaluate AI-generated presentations for accuracy and visual quality.
  • Provide detailed feedback to improve future AI performance.
  • Collaborate with product, design, and content partners to refine criteria.

Blueprint is a technology solutions firm headquartered in Bellevue, Washington, with a strong presence across the United States. They solve complicated problems, using technology to bridge the gap between strategy and execution, powered by the knowledge, skills, and expertise of their teams. They are bold, smart, agile, and fun.

Global

  • Evaluate AI models' output in occupational therapy.
  • Assess content related to the occupational therapy field.
  • Provide clear feedback to improve AI understanding.

Handshake connects students with early talent recruiting. They provide the opportunity to evaluate what AI models produce and deliver feedback that strengthens the models’ understanding of workplace tasks and language.

$30–$75/hr
US

  • Train and refine Grok for voice interactions across diverse languages.
  • Curate and annotate high-quality audio data to enhance Grok's global accessibility.
  • Collaborate with technical staff to improve AI's handling of multilingual audio nuances.

xAI aims to create AI systems that understand the universe and aid humanity. The team is small, motivated, and focused on engineering excellence with a flat organizational structure, expecting all employees to be hands-on.

$80,000–$150,000/yr

  • Research, Document, Test, and Ideate: Explore the best ways to achieve our customers’ goals using LLMs and other AI tools.
  • Master Our Dialogue Platform: Become an expert, answer questions, and train others on prompting both within and outside of our platform.
  • Train Our AIs: Utilize prompting, knowledge-base creation, and fine-tuning to enhance our AI capabilities.

1mind is a platform that deploys multimodal Superhumans for revenue teams, combining a face, a voice, and a GTM brain. The company has a remote-first, fast-moving culture with ownership, autonomy, and impact from day one.