PhysicsEval

An OpenReward environment, hosted on Hugging Face.

Description

PhysicsEval is an environment for evaluating agents on physics problems. It contains 19,609 physics problems from authoritative textbooks covering mechanics, thermodynamics, electromagnetism, quantum physics, and more. An LLM grader evaluates the agent's answer against the gold target.

Capabilities

  • Solving physics problems across multiple domains (mechanics, thermodynamics, electromagnetism, quantum physics, etc.)
  • Mathematical reasoning and problem-solving
  • Single-turn question answering with LLM-graded correctness

Compute Requirements

PhysicsEval does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There are two splits: train (17,647 tasks) and test (1,962 tasks), totaling 19,609 physics problems. Each task includes a problem statement, gold answer, category, and difficulty level. Problems span a range of physics topics from authoritative textbooks.

Reward Structure

This is a sparse, verifiable reward environment. The agent calls the answer tool once with its solution, and the environment grades it with an LLM grader (gpt-5-mini) that compares the answer against the ground-truth response and returns a verdict:

  • CORRECT: Reward 1.0.
  • INCORRECT: Reward 0.0.
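
A minimal sketch of this grading flow, assuming the grader is prompted to emit a one-word CORRECT/INCORRECT verdict (the prompt wording and function names below are illustrative, not the environment's actual implementation):

```python
# Sketch of LLM-based grading: the real environment calls gpt-5-mini;
# only the prompt shape and verdict-to-reward mapping are shown here.

GRADER_PROMPT = (
    "You are grading a physics answer.\n"
    "Problem: {problem}\n"
    "Gold target: {gold}\n"
    "Agent answer: {answer}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def verdict_to_reward(grader_output: str) -> float:
    """Map the grader's one-word verdict to the sparse reward."""
    verdict = grader_output.strip().upper()
    # "INCORRECT" does not start with "CORRECT", so this check is unambiguous.
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```

Any verdict other than CORRECT is treated as incorrect, which keeps the reward conservative when the grader returns unexpected output.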

Data

Problems are sourced from the hosted Hugging Face dataset and converted to Parquet format for efficient loading.
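
A sketch of turning raw dataset rows into validated task records. The field names mirror the task description above (problem, gold answer, category, difficulty); the dataset's actual column names may differ:

```python
def load_tasks(rows):
    """Validate raw dataset rows (e.g. read from a Parquet file) into
    task records. Field names are assumptions based on the task schema
    described above, not the dataset's confirmed column names."""
    required = ("problem", "answer", "category", "difficulty")
    tasks = []
    for i, row in enumerate(rows):
        missing = [key for key in required if key not in row]
        if missing:
            raise ValueError(f"row {i} is missing fields: {missing}")
        # Normalize difficulty to an integer rating for tier bucketing.
        tasks.append({**row, "difficulty": int(row["difficulty"])})
    return tasks
```

In practice the rows would come from reading the train/test Parquet splits, e.g. with `pandas.read_parquet`.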

Tools

Agents are given a single tool:

  • answer: Submit a final answer to the physics problem. The answer is graded by the LLM grader against the gold target. Returns whether the answer is correct. This tool can only be called once per task.
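
The once-per-task constraint can be sketched as a tool object that guards against repeat calls (class name and message text are illustrative; the environment's real tool interface may differ):

```python
class AnswerTool:
    """Sketch of a single-use answer tool: records the submitted answer
    for grading and rejects any second call."""

    def __init__(self):
        self._called = False
        self.submitted = None  # answer text, recorded for the LLM grader

    def __call__(self, answer: str) -> str:
        if self._called:
            raise RuntimeError("The answer tool can only be called once per task.")
        self._called = True
        self.submitted = answer
        return "Answer submitted for grading."
```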

Time Horizon

PhysicsEval is a single-turn environment. The agent receives a physics problem and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Problems are rated on a 1-10 difficulty scale, grouped into three tiers:

Difficulty Tier  Scale  Train Tasks     Test Tasks
Easy             1-4    3,308 (18.8%)   365 (18.6%)
Medium           5-7    13,492 (76.4%)  1,488 (75.8%)
Hard             8-10   847 (4.8%)      109 (5.6%)
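
The tier boundaries above can be expressed as a small helper (the function name is illustrative):

```python
def difficulty_tier(rating: int) -> str:
    """Map a 1-10 difficulty rating to its tier, per the table above."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "Easy"
    if rating <= 7:
        return "Medium"
    return "Hard"
```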

Model Performance (from paper):

Model                               Easy  Medium  Hard
Phi-4-reasoning-plus (multi-agent)  94.7  93.9    87.6
o4-mini                             86.8  88.2    85.4
DeepSeek-R1                         94.1  83.4    72.7
QwQ-32B                             94.6  81.9    71.0
Llama 4 Maverick                    92.9  82.4    52.1
Gemma 3 27B                         87.6  59.1    40.6

The benchmark shows clear difficulty scaling: top models achieve ~95% on easy problems but drop to ~70-88% on hard problems. The majority of tasks (76%) are medium difficulty.

Other Environment Requirements

PhysicsEval requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.

Safety

Agents in PhysicsEval are asked to solve physics problems. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet.

Citations

@misc{siddique2025physicsevalinferencetimetechniquesimprove,
      title={PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems}, 
      author={Oshayer Siddique and J. M Areeb Uzair Alam and Md Jobayer Rahman Rafy and Syed Rifat Raiyan and Hasan Mahmud and Md Kamrul Hasan},
      year={2025},
      eprint={2508.00079},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.00079}, 
}