SuperGPQA
Description
SuperGPQA is an environment for evaluating graduate-level knowledge and reasoning. It contains 26,500 multiple-choice questions organized into 13 disciplines, 72 fields, and 285 academic subfields, with 4-10 answer options per question. Questions may contain LaTeX notation and are labeled with easy, middle, or hard difficulty.
Capabilities
- Graduate-level knowledge evaluation across 285 subfields
- Multiple-choice with variable options (4-10 choices, A-J)
- Coverage of 13 disciplines including STEM, humanities, and professional fields
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
There is one split in this environment:
- test: 26,500 tasks
Tasks span 13 disciplines: Engineering, Science, Medicine, Economics, Philosophy, Law, History, Education, Management, Literature, Military Science, Agriculture, and Art.
Reward Structure
Single-turn evaluation with deterministic grading. The agent submits a single letter answer (A-J) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth. Reward is 1.0 if correct, 0.0 if incorrect.
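The grading logic above can be sketched as a small function. This is an illustrative reconstruction, not the environment's actual code; the function name `grade_answer` and the case/whitespace normalization are assumptions.

```python
def grade_answer(submitted: str, ground_truth: str) -> float:
    """Deterministic exact-match grading for a single-letter answer (A-J).

    Returns 1.0 if the submitted letter matches the ground truth, else 0.0.
    Normalizing case and whitespace is an assumption about the grader.
    """
    return 1.0 if submitted.strip().upper() == ground_truth.strip().upper() else 0.0
```

Because the reward is binary and the comparison is exact, there is no partial credit for near-miss answers.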
Data
data.parquet (26,500 questions) sourced from HuggingFace m-a-p/SuperGPQA. Stored on the OpenReward platform.
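A record from the dataset can be rendered into a multiple-choice prompt along these lines. This is a minimal sketch; the field names (`question`, `options`, `answer_letter`) are assumptions about the parquet schema, not confirmed from the source.

```python
import string

def format_prompt(record: dict) -> str:
    """Render one question record as a lettered multiple-choice prompt.

    Options are labeled A, B, C, ... in order; SuperGPQA questions
    carry between 4 and 10 options, so labels stay within A-J.
    """
    lines = [record["question"], ""]
    for letter, option in zip(string.ascii_uppercase, record["options"]):
        lines.append(f"{letter}) {option}")
    lines.append("")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

# Hypothetical record for illustration only.
sample = {
    "question": "Which of the following is a prime number?",
    "options": ["21", "23", "25", "27"],
    "answer_letter": "B",
}
print(format_prompt(sample))
```

In practice the parquet file could be loaded with any parquet reader before formatting each row this way.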
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a single letter answer (A-J). Deterministic evaluation via exact match. Ends the episode. |
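Since submit_answer ends the episode, an agent harness would typically validate the argument before calling the tool. A hedged sketch, assuming the tool accepts exactly one letter in A-J (the helper name `validate_submission` is illustrative, not part of the environment's API):

```python
VALID_CHOICES = set("ABCDEFGHIJ")

def validate_submission(answer: str) -> str:
    """Normalize and check a candidate submit_answer argument.

    Accepts a single letter A-J (case-insensitive, surrounding
    whitespace ignored); raises ValueError otherwise.
    """
    letter = answer.strip().upper()
    if letter not in VALID_CHOICES:
        raise ValueError(f"submit_answer expects one letter A-J, got {answer!r}")
    return letter
```

Catching malformed answers before submission matters here because the episode cannot be retried after the tool call.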
Time Horizon
Single-turn. The agent reads the question with options and submits one answer.
Environment Difficulty
The SuperGPQA paper reports results for more than 50 LLMs; representative accuracies include:
| Model | Accuracy |
|---|---|
| Gemini-2.5-Pro | 63.6% |
| DeepSeek-R1 | 61.8% |
| o1-2024-12-17 | ~60% |
The benchmark reveals a substantial gap between reasoning models and chat models, with fully open-sourced LLMs lagging behind proprietary models.
Other Environment Requirements
There are no further environment requirements; SuperGPQA works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in SuperGPQA answer graduate-level multiple-choice questions in a standard environment. The environment does not present direct safety risks.
Citation
@article{du2025supergpqa,
  title={SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines},
  author={Du, Xinrun and Yao, Yifan and Ma, Kaijing and Wang, Bingli and Zheng, Tianyu and Zhu, King and Liu, Minghao and Liang, Yiming and Jin, Xiaolong and others},
  journal={arXiv preprint arXiv:2502.14739},
  year={2025}
}