
Chain‑of‑Thought Data Curator
Required Skills
- Gold-standard CoT datasets
- Rubric design and multi-step reasoning evaluation
- Generalist
- STEM-leaning profile
About micro1
micro1 is a data engine that helps AI labs train foundation models and enterprises build AI agents. We provide frontier evaluations and reinforcement learning environments used to improve LLM capabilities, as well as contextual evaluations used to monitor and improve AI agents in enterprise settings. Our data engine includes an AI recruiter agent that sources and vets domain experts, a data platform that enables rapid production of high-quality training data, and a pipeline performance system that ensures both quality and velocity.
Our goal is to have 1 billion people doing meaningful work by contributing their expertise to the development of frontier AI models. We’ve raised $40M+ in funding, and our AI recruiter has powered more than 1 million AI-led interviews as our global network of experts expands to form the human intelligence layer for AGI.
Job Description
Job Title: Chain‑of‑Thought Data Curator
Job Type: Full-time or part-time, contract
Location: Remote
Job Summary:
Join our customer's team as a Chain‑of‑Thought Data Curator and play a pivotal role in advancing large-language-model reasoning. You'll be responsible for crafting and evaluating gold-standard datasets that push the limits of multi-step reasoning in AI, leveraging your generalist, STEM-leaning mindset to create benchmarks that set the industry standard.
Key Responsibilities:
- Develop and curate gold-standard Chain-of-Thought (CoT) datasets across diverse reasoning-heavy tasks.
- Design clear, scalable rubrics and instructions to evaluate and annotate multi-step reasoning processes.
- Write precise, well-structured CoT responses that demonstrate high-level generalist reasoning, with a preference for STEM contexts.
- Critically assess logical flow, correctness, and justification within reasoning chains, ensuring rigor and fidelity.
- Identify and document common model failure types, such as hallucination, shortcut reasoning, and unsupported leaps.
- Collaborate with AI trainers, model evaluators, and RLHF annotators to refine CoT benchmarks and annotation protocols.
- Stress-test the depth and reliability of LLM reasoning across varied benchmarks.
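To make the rubric work above concrete, here is a minimal, illustrative sketch (not micro1's actual tooling; all names are hypothetical) of how a graded rubric and the failure-mode tags named in the responsibilities might be represented in code:

```python
from dataclasses import dataclass, field

# Failure types named in the responsibilities above.
FAILURE_MODES = {"hallucination", "shortcut_reasoning", "unsupported_leap"}

@dataclass
class CoTAnnotation:
    """One annotator's judgment of a single chain-of-thought response.

    Each dimension is graded 0 (fails), 1 (partial), or 2 (meets the bar),
    mirroring a simple graded rubric; a binary rubric would use {0, 1}.
    """
    logical_flow: int      # steps follow from one another
    correctness: int       # final answer and intermediate facts are right
    justification: int     # each step is explicitly supported
    failure_modes: set[str] = field(default_factory=set)

    def __post_init__(self):
        for score in (self.logical_flow, self.correctness, self.justification):
            if score not in (0, 1, 2):
                raise ValueError("graded rubric scores must be 0, 1, or 2")
        unknown = self.failure_modes - FAILURE_MODES
        if unknown:
            raise ValueError(f"unknown failure modes: {unknown}")

    def passes(self, threshold: int = 5) -> bool:
        """Gold-standard gate: total score meets the threshold and
        no failure modes were flagged."""
        total = self.logical_flow + self.correctness + self.justification
        return total >= threshold and not self.failure_modes

# Example: sound flow and a correct answer, but one step was
# asserted without support, so the response is rejected.
ann = CoTAnnotation(logical_flow=2, correctness=2, justification=1,
                    failure_modes={"unsupported_leap"})
print(ann.passes())  # prints False
```

The point of a structure like this is the balance the role calls for: criteria fine-grained enough to be meaningful, but simple enough that instructions scale across a diverse annotation team.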
Required Skills and Qualifications:
- Extensive experience creating or curating CoT or instruction-tuning datasets for AI/LLMs.
- Proven ability to design and implement binary or graded rubrics for evaluating multi-step reasoning outputs.
- Robust generalist analytical skills, ideally with a STEM or competitive exam background.
- Exceptional written and verbal communication abilities, with attention to clarity and structure.
- A deep understanding of LLM failure modes and reasoning pitfalls in model outputs.
- Experience balancing fine-grained evaluation criteria with scalable instructions for diverse teams.
- A background in RLHF annotation, AI model evaluation, or prompt engineering is highly valued.
Preferred Qualifications:
- Experience with instruction tuning, model evaluation, or advanced prompt engineering projects.
- Exposure to cross-disciplinary reasoning tasks and datasets.
- Strong track record of collaborating with AI research or data curation teams.
This job is currently closed and not accepting applications. Thank you for your interest!