SuperGPQA
Description
SuperGPQA is an environment for evaluating graduate-level knowledge and reasoning. It contains 26,500 multiple-choice questions organized into 13 disciplines, 72 fields, and 285 academic subfields, with 4-10 answer options per question. Questions may contain LaTeX notation and are labeled with easy, middle, or hard difficulty.
Capabilities
- Graduate-level knowledge evaluation across 285 subfields
- Multiple-choice with variable options (4-10 choices, A-J)
- Coverage of 13 disciplines including STEM, humanities, and professional fields
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
There is one split in this environment:
- test: 26,500 tasks
Tasks span 13 disciplines: Engineering, Science, Medicine, Economics, Philosophy, Law, History, Education, Management, Literature, Military Science, Agriculture, and Art.
Reward Structure
Single-turn evaluation with deterministic grading. The agent submits a single letter answer (A-J) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth. Reward is 1.0 if correct, 0.0 if incorrect.
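The grading logic above can be sketched as a small function. This is an illustrative reconstruction, not the environment's actual code; the function name `grade_answer` and the case/whitespace normalization are assumptions.

```python
def grade_answer(submitted: str, ground_truth: str) -> float:
    """Deterministic exact-match grading for a single-letter answer (A-J).

    Returns 1.0 if the submitted letter matches the ground truth, else 0.0.
    Normalizing case and whitespace is an assumption about the grader.
    """
    return 1.0 if submitted.strip().upper() == ground_truth.strip().upper() else 0.0
```

Because the reward is binary and the comparison is exact, there is no partial credit for near-miss answers.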
Data
data.parquet (26,500 questions) sourced from HuggingFace m-a-p/SuperGPQA. Stored on the OpenReward platform.
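A record from the dataset can be rendered into a multiple-choice prompt along these lines. This is a minimal sketch; the field names (`question`, `options`, `answer_letter`) are assumptions about the parquet schema, not confirmed from the source.

```python
import string

def format_prompt(record: dict) -> str:
    """Render one question record as a lettered multiple-choice prompt.

    Options are labeled A, B, C, ... in order; SuperGPQA questions
    carry between 4 and 10 options, so labels stay within A-J.
    """
    lines = [record["question"], ""]
    for letter, option in zip(string.ascii_uppercase, record["options"]):
        lines.append(f"{letter}) {option}")
    lines.append("")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

# Hypothetical record for illustration only.
sample = {
    "question": "Which of the following is a prime number?",
    "options": ["21", "23", "25", "27"],
    "answer_letter": "B",
}
print(format_prompt(sample))
```

In practice the parquet file could be loaded with any parquet reader before formatting each row this way.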
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a single letter answer (A-J). Deterministic evaluation via exact match. Ends the episode. |
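Since submit_answer ends the episode, an agent harness would typically validate the argument before calling the tool. A hedged sketch, assuming the tool accepts exactly one letter in A-J (the helper name `validate_submission` is illustrative, not part of the environment's API):

```python
VALID_CHOICES = set("ABCDEFGHIJ")

def validate_submission(answer: str) -> str:
    """Normalize and check a candidate submit_answer argument.

    Accepts a single letter A-J (case-insensitive, surrounding
    whitespace ignored); raises ValueError otherwise.
    """
    letter = answer.strip().upper()
    if letter not in VALID_CHOICES:
        raise ValueError(f"submit_answer expects one letter A-J, got {answer!r}")
    return letter
```

Catching malformed answers before submission matters here because the episode cannot be retried after the tool call.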
Time Horizon
Single-turn. The agent reads the question with options and submits one answer.
Environment Difficulty
The SuperGPQA paper reports results for more than 50 LLMs; representative accuracies include:
| Model | Accuracy |
|---|---|
| Gemini-2.5-Pro | 63.6% |
| DeepSeek-R1 | 61.8% |
| o1-2024-12-17 | ~60% |
The benchmark reveals a substantial gap between reasoning models and chat models, with fully open-sourced LLMs lagging behind proprietary models.
Other Environment Requirements
There are no further environment requirements; SuperGPQA works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in SuperGPQA answer graduate-level multiple-choice questions in a standard environment. The environment does not present direct safety risks.
Citation
@article{du2025supergpqa,
  title={SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines},
  author={Du, Xinrun and Yao, Yifan and Ma, Kaijing and Wang, Bingli and Zheng, Tianyu and Zhu, King and Liu, Minghao and Liang, Yiming and Jin, Xiaolong and others},
  journal={arXiv preprint arXiv:2502.14739},
  year={2025}
}