LawZero is a non-profit building safe-by-design AI systems. We're building the Scientist AI, an advanced AI system designed from the ground up to be both highly capable and safe. As we develop both general-purpose Scientist AI models and safety guardrails for frontier LLMs, we need rigorous, independent evaluation of every capability and safety claim we make. We are looking for a Director of Evaluations to build, lead, and grow LawZero's Evaluations Team.
This is a foundational hire. You will define what world-class evaluation looks like at LawZero, build the team and infrastructure to deliver it, and ensure that evaluations remain independent of the main research stream so that capability and safety claims can be trusted both internally and externally by the wider AI and AI safety community.
Key responsibilities
- Define LawZero's evaluations strategy and roadmap, prioritising what needs to be measured and when, in close coordination with both research and product teams.
- Build up the Evaluations Team during your first 3-6 months, scaling to roughly 8-10 people across research, engineering, dataset and benchmark design, and red-teaming.
- Operate the team independently of the main research and product streams in order to avoid conflicts of interest, including designing novel benchmarks that can be applied apples-to-apples to evaluate both the Scientist AI and frontier LLMs.
- Oversee the design and construction of new datasets, tasks, and virtual or interactive environments to measure the performance of the Scientist AI across capabilities, safety (including honesty and goal-directedness), explainability, causal mechanisms, and detection of adversarial attacks.
- Lead evaluation of the Scientist AI when deployed as a guardrail around frontier models, including its ability to comply with harm specifications, detect and block harmful responses, explain its decisions, and resist adversarial attacks such as jailbreaks, prompt injection, and data poisoning.
- Establish and lead our automated and manual red-teaming programmes, both in-house and in partnership with external providers, to stress-test the Scientist AI as a general-purpose model and as a guardrail.
- Lead the construction of the internal tooling and infrastructure needed to run evaluations at scale, automating and standardising the pipeline wherever possible.
- As needed and where possible, directly support the research and product streams with their own internal evaluation and benchmarking requirements, to unblock and speed up their work.
- Own LawZero's public communication of evaluation results, including model and system cards, technical reports, peer-reviewed publications, and blog posts, to build trust with the wider AI safety community.
- Represent LawZero externally on evaluations and AI safety measurement, including engagements with AI safety institutes, research collaborators, and grant funders.
Skills and qualifications
- An advanced degree (MSc or higher) in machine learning, computer science, or a closely related field.
- 10+ years of experience in machine learning, with at least 5 years in a leadership role building or scaling technical teams working on real-world ML products.
- Hands-on expertise in designing and running large-scale evaluations of LLMs or other frontier ML systems across capabilities, safety, and adversarial robustness.
- A track record of building evaluation datasets, benchmarks, or interactive environments from scratch, including for safety-relevant properties such as honesty, sycophancy, refusal behaviour, and adversarial robustness.
- Strong written and verbal communication skills, including the ability to translate technical results for non-technical audiences such as executives, funders, and policymakers.
- Comfortable operating in a research-driven, fast-moving environment with significant ambiguity, and able to bring structure to it without slowing it down.
Nice to have:
- Experience leading red-teaming exercises (automated, manual, or both) and working with third-party evaluation or red-teaming partners.
- Experience working with third-party partners for benchmark and dataset creation.
- Experience releasing open-source datasets, benchmarks, or evaluation tooling.
- Familiarity with current AI safety policy and standards work (UK AISI, CAISI, NIST, EU AI Act, etc.).
- Experience contributing to or coordinating with external safety institutes, grant funders, or government bodies.
What we offer
- The opportunity to contribute to a unique mission with a major impact
- Comprehensive health benefits
- A minimum of 20 days of vacation per year from your start date
- A minimum retirement savings employer contribution of 4%
- Generous flexible benefits designed to contribute to your well-being
- A team of passionate experts in their field
- A collaborative and inclusive work environment with offices in the heart of Little Italy, in the trendy Mile-Ex district, close to public transportation.