
AI Model Evaluator (LLM & Agent Systems)
Pay: $20 - $30/hour
Required Skills
LLMs
Generative AI
AI Model Evaluation
AI Benchmarking
AI Quality Assessment
Model Performance Evaluation
Prompt Response Evaluation
AI Output Analysis
Rubric-Based Scoring
Job Description
Job Title: AI Model Evaluator (LLM & Agent Systems)
Job Type: Contract (Minimum 2 weeks, with potential extension)
Location: Remote
Job Summary:
Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll help benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive improvements. This position is ideal for analytical professionals passionate about AI quality and real-world impact.
Key Responsibilities:
- Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics (see the illustrative scoring sketch after this list)
- Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
- Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
- Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
- Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
- Work collaboratively to adapt to evolving scenarios and ambiguous evaluation situations
- Document findings and communicate insights clearly, both in writing and verbally, to relevant stakeholders
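For context on what rubric-based scoring typically involves, the sketch below shows one way a single evaluation record could be structured in Python; the criterion names, 1-5 scale, and weights are illustrative assumptions, not the customer's actual rubric or tooling.

```python
from dataclasses import dataclass, field

@dataclass
class RubricScore:
    """One evaluator's structured judgment of a single model response (illustrative only)."""
    response_id: str
    criteria: dict[str, int] = field(default_factory=dict)  # criterion name -> 1-5 rating (assumed scale)
    notes: str = ""  # free-text feedback flagging edge cases or failure modes

    def weighted_total(self, weights: dict[str, float]) -> float:
        """Aggregate per-criterion ratings into one weighted score."""
        return sum(self.criteria[c] * w for c, w in weights.items() if c in self.criteria)

# Example record: hypothetical criteria and weights, not the customer's rubric.
score = RubricScore(
    response_id="resp-001",
    criteria={"accuracy": 4, "instruction_following": 5, "tone": 3},
    notes="Factually correct, but terse relative to the requested persona.",
)
print(score.weighted_total({"accuracy": 0.5, "instruction_following": 0.3, "tone": 0.2}))  # 4.1
```

In practice, the structured notes field is often as important as the numeric score, since it feeds the benchmarking and model-refinement feedback described above.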
Required Skills and Qualifications:
- Demonstrated experience with LLM evaluation, AI output analysis, QA/testing, UX research, or similar analytical roles
- Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
- Exceptional attention to detail and sound judgement in ambiguous or edge-case scenarios
- Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
- Ability to adapt quickly to evolving guidelines and work independently
- Comfort with remote work and a commitment of at least 20 hours per week for the initial term
- Analytical mindset with a focus on actionable, qualitative feedback
Preferred Qualifications:
- Experience with RLHF, annotation workflows, or AI benchmarking frameworks
- Familiarity with autonomous agent systems or workflow automation tools
- Background in mobile apps or digital product evaluation processes