San Francisco, CA
$150K - $250K/yr
Full-time
Posted 2 days ago
Location: San Francisco, CA (in-person)
Compensation: $150,000 - $250,000 base, plus bonus and equity (total cash compensation can reach $250,000 - $450,000+)
Join a fast-growing AI infrastructure company as a Research Scientist, designing the datasets and evaluation frameworks that shape how frontier AI models are trained and measured.
What You'll Do
- Design data slices and explore data shapes that expose meaningful model failure modes across domains, including finance, code, and enterprise workflows
- Build and refine evaluation rubrics and reward signals for RLHF and RLVR training pipelines
- Model annotator behavior and run experiments to improve different model capabilities
- Develop quantitative frameworks for measuring dataset quality, diversity, and downstream impact on model alignment and capability
- Partner with research teams at the world's top AI labs to translate their training objectives into concrete data and evaluation specifications
- Move fast from hypothesis to experiment, extract actionable insights from messy results, and iterate quickly
What You'll Bring
- Strong quantitative instincts and familiarity with LLM training pipelines, RLHF or RLVR, or evaluation methodology (no PhD required)
- A genuine, intrinsic obsession with how data structure, selection, and quality drive model behavior
- The ability to design lightweight experiments, move fast, and extract insights from messy or incomplete results
- Comfort working across domains such as finance, software engineering, and policy, with the ability to context-switch and reason clearly
- A strong bias toward building and shipping experiments over theorizing
Nice to Have
- Prior work or internship at an RL environment company, AI safety organization, or benchmarking organization
- Background in evaluation methodology, benchmark design, or dataset curation at a lab or research organization
- Exposure to annotator modeling, reward signal design, or alignment-related research
This is a high-leverage research seat on a small, high-caliber team, where your work directly shapes how the next generation of frontier models learns.