
Chain‑of‑Thought Data Curator
Required Skills
- Gold-standard CoT datasets
- Rubric design and multi-step reasoning evaluation
- Generalist
- STEM-leaning profile
About micro1
micro1 is a data engine that helps AI labs train foundation models and enterprises build AI agents. We provide frontier evaluations and reinforcement learning environments used to improve LLM capabilities, as well as contextual evaluations used to monitor and improve AI agents in enterprise settings. Our data engine includes an AI recruiter agent that sources and vets domain experts, a data platform that enables rapid production of high-quality training data, and a pipeline performance system that ensures both quality and velocity.
Our goal is to have 1 billion people doing meaningful work by contributing their expertise to the development of frontier AI models. We’ve raised $40M+ in funding, and our AI recruiter has powered more than 1 million AI-led interviews as our global network of experts expands to form the human intelligence layer for AGI.
Job Description
Job Title: Chain‑of‑Thought Data Curator
Job Type: Full-time or part-time, contract
Location: Remote
Job Summary:
Join our customer's team as a Chain‑of‑Thought Data Curator and play a pivotal role in advancing large-language-model reasoning. You'll be responsible for crafting and evaluating gold-standard datasets that push the limits of multi-step reasoning in AI, leveraging your generalist, STEM-leaning mindset to create benchmarks that set the industry standard.
Key Responsibilities:
- Develop and curate gold-standard Chain-of-Thought (CoT) datasets across diverse reasoning-heavy tasks.
- Design clear, scalable rubrics and instructions to evaluate and annotate multi-step reasoning processes.
- Write precise, well-structured CoT responses that demonstrate high-level generalist reasoning, with a preference for STEM contexts.
- Critically assess logical flow, correctness, and justification within reasoning chains, ensuring rigor and fidelity.
- Identify and document common model failure types, such as hallucination, shortcut reasoning, and unsupported leaps.
- Collaborate with AI trainers, model evaluators, and RLHF annotators to refine CoT benchmarks and annotation protocols.
- Stress-test the depth and reliability of LLM reasoning across varied benchmarks.
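To make the rubric work above concrete, here is a minimal, illustrative sketch (not micro1's actual tooling; all names are hypothetical) of how a graded rubric and the failure-mode tags named in the responsibilities might be represented in code:

```python
from dataclasses import dataclass, field

# Failure types named in the responsibilities above.
FAILURE_MODES = {"hallucination", "shortcut_reasoning", "unsupported_leap"}

@dataclass
class CoTAnnotation:
    """One annotator's judgment of a single chain-of-thought response.

    Each dimension is graded 0 (fails), 1 (partial), or 2 (meets the bar),
    mirroring a simple graded rubric; a binary rubric would use {0, 1}.
    """
    logical_flow: int      # steps follow from one another
    correctness: int       # final answer and intermediate facts are right
    justification: int     # each step is explicitly supported
    failure_modes: set[str] = field(default_factory=set)

    def __post_init__(self):
        for score in (self.logical_flow, self.correctness, self.justification):
            if score not in (0, 1, 2):
                raise ValueError("graded rubric scores must be 0, 1, or 2")
        unknown = self.failure_modes - FAILURE_MODES
        if unknown:
            raise ValueError(f"unknown failure modes: {unknown}")

    def passes(self, threshold: int = 5) -> bool:
        """Gold-standard gate: total score meets the threshold and
        no failure modes were flagged."""
        total = self.logical_flow + self.correctness + self.justification
        return total >= threshold and not self.failure_modes

# Example: sound flow and a correct answer, but one step was
# asserted without support, so the response is rejected.
ann = CoTAnnotation(logical_flow=2, correctness=2, justification=1,
                    failure_modes={"unsupported_leap"})
print(ann.passes())  # prints False
```

The point of a structure like this is the balance the role calls for: criteria fine-grained enough to be meaningful, but simple enough that instructions scale across a diverse annotation team.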
Required Skills and Qualifications:
- Extensive experience creating or curating CoT or instruction-tuning datasets for AI/LLMs.
- Proven ability to design and implement binary or graded rubrics for evaluating multi-step reasoning outputs.
- Robust generalist analytical skills, ideally with a STEM or competitive exam background.
- Exceptional written and verbal communication abilities, with attention to clarity and structure.
- A deep understanding of LLM failure modes and reasoning pitfalls in model outputs.
- Experience balancing fine-grained evaluation criteria with scalable instructions for diverse teams.
- A background in RLHF annotation, AI model evaluation, or prompt engineering is highly valued.
Preferred Qualifications:
- Experience with instruction tuning, model evaluation, or advanced prompt engineering projects.
- Exposure to cross-disciplinary reasoning tasks and datasets.
- Strong track record of collaborating with AI research or data curation teams.
This job is currently closed and not accepting applications. Thank you for your interest!