PhysicsEval

An OpenReward environment, hosted on Hugging Face.

Description

PhysicsEval is an environment for evaluating agents on physics problems. It contains 19,609 physics problems from authoritative textbooks covering mechanics, thermodynamics, electromagnetism, quantum physics, and more. An LLM grader evaluates the agent's answer against the gold target.

Capabilities

  • Solving physics problems across multiple domains (mechanics, thermodynamics, electromagnetism, quantum physics, etc.)
  • Mathematical reasoning and problem-solving
  • Single-turn question answering with LLM-graded correctness

Compute Requirements

PhysicsEval does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There are two splits: train (17,647 tasks) and test (1,962 tasks), totaling 19,609 physics problems. Each task includes a problem statement, gold answer, category, and difficulty level. Problems span a range of physics topics from authoritative textbooks.

Reward Structure

This is a sparse, verifiable reward environment. The agent calls the answer tool once with its solution, and the environment grades it with an LLM grader (gpt-5-mini) that compares the answer against the ground-truth response and returns a verdict:

  • CORRECT: Reward 1.0.
  • INCORRECT: Reward 0.0.
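
A minimal sketch of this grading flow, assuming the grader is prompted to emit a one-word CORRECT/INCORRECT verdict (the prompt wording and function names below are illustrative, not the environment's actual implementation):

```python
# Sketch of LLM-based grading: the real environment calls gpt-5-mini;
# only the prompt shape and verdict-to-reward mapping are shown here.

GRADER_PROMPT = (
    "You are grading a physics answer.\n"
    "Problem: {problem}\n"
    "Gold target: {gold}\n"
    "Agent answer: {answer}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def verdict_to_reward(grader_output: str) -> float:
    """Map the grader's one-word verdict to the sparse reward."""
    verdict = grader_output.strip().upper()
    # "INCORRECT" does not start with "CORRECT", so this check is unambiguous.
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```

Any verdict other than CORRECT is treated as incorrect, which keeps the reward conservative when the grader returns unexpected output.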

Data

Problems are sourced from the hosted Hugging Face dataset and converted to Parquet format for efficient loading.
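
A sketch of turning raw dataset rows into validated task records. The field names mirror the task description above (problem, gold answer, category, difficulty); the dataset's actual column names may differ:

```python
def load_tasks(rows):
    """Validate raw dataset rows (e.g. read from a Parquet file) into
    task records. Field names are assumptions based on the task schema
    described above, not the dataset's confirmed column names."""
    required = ("problem", "answer", "category", "difficulty")
    tasks = []
    for i, row in enumerate(rows):
        missing = [key for key in required if key not in row]
        if missing:
            raise ValueError(f"row {i} is missing fields: {missing}")
        # Normalize difficulty to an integer rating for tier bucketing.
        tasks.append({**row, "difficulty": int(row["difficulty"])})
    return tasks
```

In practice the rows would come from reading the train/test Parquet splits, e.g. with `pandas.read_parquet`.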

Tools

Agents are given a single tool:

  • answer: Submit a final answer to the physics problem. The answer is graded by the LLM grader against the gold target. Returns whether the answer is correct. This tool can only be called once per task.
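
The once-per-task constraint can be sketched as a tool object that guards against repeat calls (class name and message text are illustrative; the environment's real tool interface may differ):

```python
class AnswerTool:
    """Sketch of a single-use answer tool: records the submitted answer
    for grading and rejects any second call."""

    def __init__(self):
        self._called = False
        self.submitted = None  # answer text, recorded for the LLM grader

    def __call__(self, answer: str) -> str:
        if self._called:
            raise RuntimeError("The answer tool can only be called once per task.")
        self._called = True
        self.submitted = answer
        return "Answer submitted for grading."
```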

Time Horizon

PhysicsEval is a single-turn environment. The agent receives a physics problem and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Problems are rated on a 1-10 difficulty scale, grouped into three tiers:

Difficulty Tier  Scale  Train Tasks     Test Tasks
Easy             1-4    3,308 (18.8%)   365 (18.6%)
Medium           5-7    13,492 (76.4%)  1,488 (75.8%)
Hard             8-10   847 (4.8%)      109 (5.6%)
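
The tier boundaries above can be expressed as a small helper (the function name is illustrative):

```python
def difficulty_tier(rating: int) -> str:
    """Map a 1-10 difficulty rating to its tier, per the table above."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "Easy"
    if rating <= 7:
        return "Medium"
    return "Hard"
```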

Model Performance (from paper):

Model                               Easy  Medium  Hard
Phi-4-reasoning-plus (multi-agent)  94.7  93.9    87.6
o4-mini                             86.8  88.2    85.4
DeepSeek-R1                         94.1  83.4    72.7
QwQ-32B                             94.6  81.9    71.0
Llama 4 Maverick                    92.9  82.4    52.1
Gemma 3 27B                         87.6  59.1    40.6

The benchmark shows clear difficulty scaling: top models achieve ~95% on easy problems but drop to ~70-88% on hard problems. The majority of tasks (76%) are medium difficulty.

Other Environment Requirements

PhysicsEval requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.

Safety

Agents in PhysicsEval are asked to solve physics problems. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet.

Citations

@misc{siddique2025physicsevalinferencetimetechniquesimprove,
      title={PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems}, 
      author={Oshayer Siddique and J. M Areeb Uzair Alam and Md Jobayer Rahman Rafy and Syed Rifat Raiyan and Hasan Mahmud and Md Kamrul Hasan},
      year={2025},
      eprint={2508.00079},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.00079}, 
}