AI Researcher

Required Skills

LangChain
RLHF
Artificial Intelligence
Tech Research

About micro1
micro1 is a data engine that helps AI labs train foundation models and enterprises build AI agents. We provide frontier evaluations and reinforcement learning environments used to improve LLM capabilities, as well as contextual evaluations used to monitor and improve AI agents in enterprise settings. Our data engine includes an AI recruiter agent that sources and vets domain experts, a data platform that enables rapid production of high-quality training data, and a pipeline performance system that ensures both quality and velocity.
Our goal is to have 1 billion people doing meaningful work by contributing their expertise to the development of frontier AI models. We’ve raised $40M+ in funding, and our AI recruiter has powered more than 1 million AI-led interviews as our global network of experts expands to form the human intelligence layer for AGI.

Job Description

Job Type: Full-time

Location: Remote (Anywhere)

Total Compensation: $220k–$320k


Job Summary:

At micro1, we’ve built an AI recruitment engine to help companies hire top global talent. Our AI agent, Zara, autonomously sources and vets candidates, cutting recruitment costs by 87%.


We work with top AI labs to help them build robust human-in-the-loop evaluation pipelines, conduct post-training model evaluations, and identify critical failure modes. Long term, we aim to become the default infrastructure for human oversight of frontier models and create the most reliable pre-vetted global talent pool for AI labs.

We’re hiring a US-based AI Researcher to help push the boundaries of LLM and VLM evaluation. You’ll join our core research team to work on real-world evaluation challenges—designing custom eval sets, identifying model failure modes, and partnering with leading labs to improve model robustness and alignment. If you care about model safety, enjoy fast prototyping, and want to shape how foundation models are benchmarked, this role is for you.


What You’ll Do:

  • Design custom evaluation sets and benchmark tasks for reasoning, math, safety, and planning
  • Develop failure taxonomies for hallucinations, refusal behavior, overconfidence, and jailbreaks
  • Build scalable human-in-the-loop workflows for rubric-based scoring, preference ranking, and adversarial testing
  • Work directly with top AI labs to stress test frontier models and identify hidden failure patterns
  • Prototype automated evaluation pipelines and agent-assisted evaluators to scale human oversight
  • Lead or support the development of GAIA-style evaluations or internal benchmarks for labs
  • Publish internal memos and external papers on findings from research pilots and lab collaborations
  • Collaborate with researchers from top AI labs on joint eval initiatives



What We’re Looking For:

  • Strong Python fundamentals and experience writing production-grade research code
  • Hands-on experience with LangChain, LangGraph, and RLHF or RLAIF evaluation pipelines
  • Familiarity with model evaluation frameworks (e.g., TruthfulQA, MMLU, ARC, MT-Bench, GAIA, VLM evals)
  • Deep understanding of foundation model failure modes and evaluation methodologies
  • Graduate degree in CS or a related field (PhD or Master’s preferred), or equivalent research experience


Bonus If You Have:

  • Publications in top-tier AI conferences or journals (NeurIPS, ICLR, ICML, ACL, etc.)
  • Experience working with top AI labs
  • Competitive gaming experience or deep understanding of multi-agent interaction
  • Exposure to VLM evaluation, multimodal reasoning, or agentic collaboration frameworks


Perks & Details:

  • Work closely with founders and leading AI researchers on problems that matter
  • Drive real impact: your evaluations shape how leading labs measure model safety and capability
  • Hardcore but flexible schedules with a remote global team

Apply now

Please note that after completing the interview process, you’ll be added to our talent pool and considered for this and other roles that match your skills.
