Mechanistic Interpretability Jobs (NOW HIRING)

You will develop and carry out a research plan in mechanistic interpretability, in close collaboration with a highly motivated team. You will play a critical role in helping OpenAI ensure future ...

Mechanistic Interpretability information


Salary chart values: $31K, $36.3K, $50.5K

How much do mechanistic interpretability jobs pay per year?

As of Jun 7, 2026, the average yearly pay for mechanistic interpretability jobs in the United States is $36,260, according to ZipRecruiter salary data. Most workers in this role earn between $33,500 and $34,000 per year, depending on experience, location, and employer.

What is the difference between Mechanistic Interpretability and Data Scientist roles?

Aspect | Mechanistic Interpretability | Data Scientist
Required credentials | Advanced degrees in AI, ML, or related fields | Degree in Data Science, Statistics, or Computer Science
Work environment | Research labs, AI development teams | Business, tech companies, consulting firms
Industry usage | AI research, model transparency, safety | Data analysis, predictive modeling, insights
Search intent | Understanding model internals, interpretability techniques | Data analysis, insights, model building

Mechanistic Interpretability focuses on understanding how AI models work internally, often requiring deep technical expertise. Data Scientists analyze data to build models and extract insights. While both roles involve data and algorithms, Mechanistic Interpretability is more research-oriented, emphasizing transparency and safety of AI systems, whereas Data Scientists focus on practical data analysis and modeling for business applications.

Research Fellowship - Mechanistic Interpretability

Vmax

San Francisco, CA

Other

Posted 18 days ago


Job description

About Vmax

Vmax is an applied research lab developing AI capable of open-ended learning. We are building systems to exceed humans in all capacities by optimizing beyond the local maxima of learning from human expertise.

About the role

LLMs are fantastically powerful, and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.

This 3 to 6 month fellowship is for PhD students or equivalent early-career researchers who want to work at the intersection of mechanistic interpretability and reinforcement learning. You will own a focused research project, work closely with Vmax technical staff, and contribute to research publications.

Responsibilities
  • Develop mechanistic interpretability methods for understanding internal representations, features, circuits, and computations in language models and agents.
  • Investigate how model internals can be used to generate intrinsic rewards, auxiliary objectives, diagnostics, or training signals for reinforcement learning.
  • Design and run experiments that test whether interpretability-derived signals improve learning, exploration, generalization, robustness, or sample efficiency.
  • Compare internally derived rewards against baselines such as human-generated verifiers, reward models, task-level outcome rewards, and standard intrinsic motivation methods.
  • Use techniques such as probing, activation analysis, sparse autoencoders, causal interventions, feature attribution, or representation analysis to study model behavior.
  • Analyze failure modes, including reward hacking, spurious features, non-causal correlations, objective misspecification, and overfitting to narrow evaluation distributions.
  • Build research code, evaluation harnesses, and experimental infrastructure that make results reproducible and useful to the broader team.
  • Communicate research progress clearly through written updates, internal presentations, and final project outputs.
Role Requirements
  • Currently enrolled in a PhD program in machine learning, computer science, artificial intelligence, computational neuroscience, mathematics, or a related technical field. Exceptional candidates with equivalent research experience may also be considered.
  • Track record of research excellence or strong research promise, demonstrated through publications, preprints, open-source work, technical projects, competitions, or publicly available artifacts.
  • Working understanding of reinforcement learning.
  • Familiarity with mechanistic interpretability, representation analysis, or empirical methods for understanding neural networks.
  • Strong programming ability in Python and experience with at least one major ML framework such as PyTorch or JAX.
  • Clear written and verbal communication of technical ideas.
Nice to have
  • Experience with LLM post-training methods.
  • Familiarity with intrinsic motivation, unsupervised RL, auxiliary objectives, representation learning for RL, or curiosity-driven learning.
  • Experience with scalable ML experimentation, distributed training, experiment tracking, or reproducible research infrastructure.
  • Interest in turning mechanistic understanding into practical training methods, rather than only analyzing models after training.
Role-specific location policy
  • This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.