SuperGPQA

OpenReward Environment · Hugging Face Dataset

Description

SuperGPQA is an environment for evaluating graduate-level knowledge and reasoning across 285 academic subfields. It contains 26,500 multiple-choice questions spanning 13 disciplines and 72 fields, with 4-10 answer options per question. Questions may contain LaTeX notation and cover easy, middle, and hard difficulty levels.

Capabilities

  • Graduate-level knowledge evaluation across 285 subfields
  • Multiple-choice with variable options (4-10 choices, A-J)
  • Coverage of 13 disciplines including STEM, humanities, and professional fields
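Because the number of options varies per question (4-10) and options are labeled A-J, a prompt builder must assign letters dynamically. A minimal sketch of such a builder (the function name, field layout, and wording are illustrative assumptions, not the environment's actual prompt template):

```python
from string import ascii_uppercase

def format_question(question: str, options: list[str]) -> str:
    """Render a SuperGPQA-style item as a multiple-choice prompt.

    Options are labeled A, B, C, ... in order; with up to 10 options
    the labels run A through J.
    """
    lines = [question]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

prompt = format_question(
    "Which planet is largest?",
    ["Mars", "Jupiter", "Venus", "Mercury"],
)
```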

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

ODC-BY.

Tasks

There is one split in this environment:

  • test: 26,500 tasks

Tasks span 13 disciplines: Engineering, Science, Medicine, Economics, Philosophy, Law, History, Education, Management, Literature, Military Science, Agriculture, and Art.

Reward Structure

Single-turn evaluation with deterministic grading. The agent submits a single letter answer (A-J) via the submit_answer tool. The submitted answer is compared via exact match against the ground truth. Reward is 1.0 if correct, 0.0 if incorrect.
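The grading logic described above can be approximated as follows (a sketch; the README does not specify the environment's exact normalization rules, so the trimming and case-folding here are assumptions):

```python
def grade(submitted: str, ground_truth: str) -> float:
    """Exact-match grading: 1.0 for the correct letter, 0.0 otherwise.

    The submission is normalized to a single uppercase letter A-J
    before comparison; anything else scores 0.0.
    """
    answer = submitted.strip().upper()
    if len(answer) != 1 or answer not in "ABCDEFGHIJ":
        return 0.0
    return 1.0 if answer == ground_truth.strip().upper() else 0.0
```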

Data

A single data.parquet file (26,500 questions) sourced from the HuggingFace dataset m-a-p/SuperGPQA and stored on the OpenReward platform.

Tools

  • submit_answer — Submit a single letter answer (A-J). Deterministic evaluation via exact match. Ends the episode.
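A hypothetical declaration of the submit_answer tool in the JSON-schema style common to LLM tool-calling APIs (the actual OpenReward tool definition is not shown in this README and may differ):

```python
# Hypothetical tool schema; the A-J constraint is expressed as a
# regex pattern on the single "answer" parameter.
submit_answer_tool = {
    "name": "submit_answer",
    "description": "Submit a single letter answer (A-J). Ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "pattern": "^[A-J]$",
                "description": "The chosen option letter.",
            }
        },
        "required": ["answer"],
    },
}
```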

Time Horizon

Single-turn. The agent reads the question with options and submits one answer.

Environment Difficulty

The SuperGPQA paper reports results for more than 50 LLMs; representative accuracies:

  • Gemini-2.5-Pro: 63.6%
  • DeepSeek-R1: 61.8%
  • o1-2024-12-17: ~60%

The benchmark reveals a substantial gap between reasoning models and chat models, with fully open-source LLMs lagging behind proprietary models.

Other Environment Requirements

There are no further environment requirements; SuperGPQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in SuperGPQA answer graduate-level multiple-choice questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{du2025supergpqa,
  title={SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines},
  author={Du, Xinrun and Yao, Yifan and Ma, Kaijing and Wang, Bingli and Zheng, Tianyu and Zhu, King and Liu, Minghao and Liang, Yiming and Jin, Xiaolong and others},
  journal={arXiv preprint arXiv:2502.14739},
  year={2025}
}