
AI Model Evaluator (LLM & Agent Systems)
Pay: $20 - $30/hour
Required Skills
- LLMs
- Generative AI
- AI Model Evaluation
- AI Benchmarking
- AI Quality Assessment
- Model Performance Evaluation
- Prompt Response Evaluation
- AI Output Analysis
- Rubric-Based Scoring
Job Description
Job Title: AI Model Evaluator (LLM & Agent Systems)
Job Type: Contract (Minimum 2 weeks, with potential extension)
Location: Remote
Job Summary:
Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive model improvements. This position is ideal for analytical professionals who care about AI quality and its practical impact.
Key Responsibilities:
- Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics
- Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
- Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
- Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
- Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
- Collaborate with the team to adapt to evolving scenarios and resolve ambiguous evaluation cases
- Document findings and communicate insights clearly both in writing and verbally to relevant stakeholders
Required Skills and Qualifications:
- Demonstrated experience in LLM evaluation, AI output analysis, QA/testing, UX research, or a similar analytical role
- Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
- Exceptional attention to detail and sound judgment in ambiguous or edge-case scenarios
- Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
- Ability to adapt quickly to evolving guidelines and work independently
- Comfort with remote work and a commitment of at least 20 hours per week for the initial term
- Analytical mindset with a focus on actionable, qualitative feedback
Preferred Qualifications:
- Experience with RLHF, annotation workflows, or AI benchmarking frameworks
- Familiarity with autonomous agent systems or workflow automation tools
- Background in mobile apps or digital product evaluation processes
About micro1
micro1 is a data engine that helps AI labs train foundation models and enterprises build AI agents. We provide frontier evaluations and reinforcement learning environments used to improve LLM capabilities, as well as contextual evaluations used to monitor and improve AI agents in enterprise settings. Our data engine includes an AI recruiter agent that sources and vets domain experts, a data platform that enables rapid production of high-quality training data, and a pipeline performance system that ensures both quality and velocity.
Our goal is to have 1 billion people doing meaningful work by contributing their expertise to the development of frontier AI models. We’ve raised $40M+ in funding, and our AI recruiter has powered more than 1 million AI-led interviews as our global network of experts expands to form the human intelligence layer for AGI.