We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.
Our focus is on the basic science of mechanistic interpretability, striving to reverse-engineer the internal computations of large language models to ensure their safety, alignment, and reliability.
We're focused on mechanistic interpretability, which aims to discover how neural network parameters map to meaningful algorithms. Some useful analogies might be to think of us as trying to do ...
Researcher, Interpretability
San Francisco, CA · On-site
$295K - $445K/yr
You will develop and carry out a research plan in mechanistic interpretability, in close collaboration with a highly motivated team. You will play a critical role in helping OpenAI ensure future ...
We believe that a mechanistic understanding is the most robust way to make advanced systems safe. People mean many different things by "interpretability". We're focused on mechanistic ...
Research Assistant (Maoz - 3mo)
Irvine, CA · On-site
$20.50 - $28/hr
Chapman University School of Pharmacy is seeking a Research Assistant to provide research support related to mechanistic interpretability of large language models and its connections to neuroscience.
Research Assistant (Maoz - 3mo)
Orange, CA · On-site
$18 - $20/hr
Independently perform research-related activities supporting projects in mechanistic interpretability of large language models. * Assist with implementation of algorithms related to AI explainability.
Member of the Technical Staff, Interpretability
New York, NY · On-site
$120K - $250K/yr
You have a strong publication record at top-tier venues (e.g., NeurIPS, ICML, ICLR) with contributions to mechanistic interpretability, representation analysis, probing methods, or model ...
Preferred : • Experience with mechanistic interpretability, probing, or other techniques for understanding model internals. • Familiarity with red-teaming or adversarial evaluation of post ...
... interpretability, mechanistic interpretability, or model internals (sparse autoencoders, feature steering, etc.). Company : Goodfire is an AI research lab using interpretability to turn AI into ...
Machine Learning Engineer: LLM Interpretability & Systems
San Francisco, CA · On-site
$175K - $250K/yr
Take ideas from mechanistic interpretability and related work and turn them into code that runs in production, making research into reality. * Work directly with model internals to improve behavior ...
Research Engineer, Interpretability
San Francisco, CA · On-site +1
How can we trust them?" The Interpretability team at Anthropic is working to reverse-engineer how trained models work because we believe that a mechanistic understanding is the most robust way to ...
Mechanistic Interpretability salary information
| Salary range | Share of jobs |
|---|---|
| $31K - $32.8K | 13% |
| $32.8K - $34.5K | 56% |
| $34.5K - $36.3K | 26% |
| $36.3K - $38.1K | 1% |
| $38.1K - $39.9K | 0% |
| $39.9K - $41.6K | 0% |
| $41.6K - $43.4K | 0% |
| $43.4K - $45.2K | 1% |
| $45.2K - $47K | 1% |
| $47K - $48.7K | 1% |
| $48.7K - $50.5K | 1% |
$33.2K is the 25th percentile. Wages below this are outliers.
$35K is the 75th percentile. Wages above this are outliers.
How much do mechanistic interpretability jobs pay per year?
What is the difference between Mechanistic Interpretability and Data Scientist roles?
| Aspect | Mechanistic Interpretability | Data Scientist |
|---|---|---|
| Required credentials | Advanced degrees in AI, ML, or related fields | Degree in Data Science, Statistics, or Computer Science |
| Work environment | Research labs, AI development teams | Business, tech companies, consulting firms |
| Industry usage | AI research, model transparency, safety | Data analysis, predictive modeling, insights |
| Search intent | Understanding model internals, interpretability techniques | Data analysis, insights, model building |
Mechanistic Interpretability focuses on understanding how AI models work internally, often requiring deep technical expertise. Data Scientists analyze data to build models and extract insights. While both roles involve data and algorithms, Mechanistic Interpretability is more research-oriented, emphasizing transparency and safety of AI systems, whereas Data Scientists focus on practical data analysis and modeling for business applications.
Other
Posted 18 days ago
Job description
Vmax is an applied research lab developing AI capable of open-ended learning. We are building systems to exceed humans in all capacities by optimizing beyond the local maxima of learning from human expertise.
About the role
LLMs are fantastically powerful and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.
This 3 to 6 month fellowship is for PhD students or equivalent early-career researchers who want to work at the intersection of mechanistic interpretability and reinforcement learning. You will own a focused research project, work closely with Vmax technical staff, and contribute to research publications.
Responsibilities
- Develop mechanistic interpretability methods for understanding internal representations, features, circuits, and computations in language models and agents.
- Investigate how model internals can be used to generate intrinsic rewards, auxiliary objectives, diagnostics, or training signals for reinforcement learning.
- Design and run experiments that test whether interpretability-derived signals improve learning, exploration, generalization, robustness, or sample efficiency.
- Compare internally derived rewards against baselines such as human-generated verifiers, reward models, task-level outcome rewards, and standard intrinsic motivation methods.
- Use techniques such as probing, activation analysis, sparse autoencoders, causal interventions, feature attribution, or representation analysis to study model behavior.
- Analyze failure modes, including reward hacking, spurious features, non-causal correlations, objective misspecification, and overfitting to narrow evaluation distributions.
- Build research code, evaluation harnesses, and experimental infrastructure that make results reproducible and useful to the broader team.
- Communicate research progress clearly through written updates, internal presentations, and final project outputs.
- Currently enrolled in a PhD program in machine learning, computer science, artificial intelligence, computational neuroscience, mathematics, or a related technical field. Exceptional candidates with equivalent research experience may also be considered.
- Track record of research excellence or strong research promise, demonstrated through publications, preprints, open-source work, technical projects, competitions, or publicly available artifacts.
- Working understanding of reinforcement learning.
- Familiarity with mechanistic interpretability, representation analysis, or empirical methods for understanding neural networks.
- Strong programming ability in Python and experience with at least one major ML framework such as PyTorch or JAX.
- Clear written and verbal communication of technical ideas.
- Experience with LLM post-training methods.
- Familiarity with intrinsic motivation, unsupervised RL, auxiliary objectives, representation learning for RL, or curiosity-driven learning.
- Experience with scalable ML experimentation, distributed training, experiment tracking, or reproducible research infrastructure.
- Interest in turning mechanistic understanding into practical training methods, rather than only analyzing models after training.
- This role is based in our San Francisco office; for exceptional candidates we are willing to consider a hybrid arrangement.