We are sharing a specialised part-time consulting opportunity for professors, PhD students, and advanced academic researchers experienced in domain-specific problem design, Python-based evaluation, benchmark task development, and structured reasoning assessment.
The role covers current and upcoming remote engagements in academic benchmark task design, Python-based evaluation workflows, domain-specific problem development, golden solution preparation, and model behavior analysis. Selected professionals will apply their academic expertise to create challenging real-world tasks, define precise expected outputs, develop executable tests, and evaluate reasoning or problem-solving performance across advanced subject areas.
Key Responsibilities
Professionals in this role may contribute to:
Academic Task Design & Development
- Design challenging, real-world problems drawn from your academic or professional domain
- Create tasks across areas such as machine learning, coding, data science, computer science, physics, mathematics, engineering, statistics, biology, chemistry, finance, accounting, economics, law, or business
- Build tasks that test reasoning, problem solving, instruction following, and domain-specific judgment
- Ensure task prompts are clear, rigorous, realistic, and aligned with expert-level expectations
Python-Based Solutions & Evaluation Assets
- Prepare task specifications, golden solutions, and supporting evaluation components using Python
- Develop executable tests or structured checks that support objective evaluation
- Translate complex domain problems into clear, testable workflows with measurable success criteria
- Review task materials for correctness, completeness, reproducibility, and technical clarity
Model Behavior Analysis & Failure Classification
- Evaluate model or agent performance on domain-specific tasks
- Identify tasks where outputs fail to satisfy tests, instructions, or expected reasoning standards
- Classify failure modes involving logical reasoning, problem decomposition, technical execution, or domain understanding
- Write clear analyses explaining where and why a task response succeeds or fails
Rubric Development & Structured Review
- Develop detailed rubrics and evaluation frameworks for academic and technical benchmark tasks
- Apply consistent evaluation standards across tasks, outputs, and solution materials
- Provide clear written feedback explaining quality, reasoning gaps, and improvement areas
- Collaborate with other subject matter experts to support consistency and accuracy across review workflows
Ideal Profile
Strong candidates may have:
- Current or retired professor status, or current PhD student status, in a relevant academic or professional field
- Academic expertise in STEM, quantitative, professional, or research-intensive domains
- Working proficiency in Python applied through research, industry work, GitHub projects, coursework, or technical task development
- Ability to design rigorous domain-specific problems and evaluate solutions with precision
- Strong reasoning, written communication, problem-solving, and independent work skills
- Ability to manage time effectively and contribute reliably in a remote project-based environment
- Availability for high-commitment project work, potentially 30+ hours per week during weekdays depending on project scope
Educational Background
- A completed or in-progress PhD from a strong university program is highly relevant
- Academic backgrounds may include machine learning, coding, data science, computer science, physics, mathematics, engineering, statistics, biology, chemistry, finance, accounting, economics, law, business, or related fields
- Teaching, research, publication, technical writing, benchmark design, coding, or evaluation experience may be especially valuable
Nice to Have
- Experience in AI training, model evaluation, benchmark development, data annotation, or structured task review
- Experience writing Python tests, executable checks, golden solutions, or reproducible research code
- Familiarity with agentic task design, model behavior analysis, reasoning evaluation, or failure-mode classification
- Experience developing academic assessments, problem sets, rubrics, grading criteria, or research evaluation materials
- Strong ability to turn complex academic or professional problems into clear, testable tasks
Why This Opportunity
- Apply academic expertise to structured remote benchmark and evaluation work
- Contribute to high-quality task design, Python-based solution development, and reasoning assessment
- Work on flexible assignments aligned with your research field, domain knowledge, and technical strengths
- Use your ability to identify reasoning gaps, design rigorous problems, and evaluate outputs with precision
- Remote structure with competitive hourly compensation
Contract Details
- Independent contractor role
- Fully remote with flexible scheduling
- Depending on project needs, eligibility may be limited to professionals based in the United States
- High-commitment project availability may be required, potentially 30+ hours per week during weekdays depending on project scope
- Competitive rates of $70–$100 per hour depending on expertise and project scope
- Weekly payments via Stripe or Wise
- Projects may be extended, shortened, or adjusted depending on scope and performance
- Work will not involve access to confidential or proprietary information from any employer, client, or institution
About the Platform
This opportunity is available through 24-MAG LLC. We connect experienced professionals with remote consulting opportunities across technical, evaluation, and project-based workstreams.
By submitting this application, you acknowledge that your information may be processed by 24-MAG LLC for recruitment and opportunity matching in accordance with our Privacy Policy: https://www.24-mag.com/privacy-policy.