GPQA
Description
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is an environment for evaluating expert-level question answering. It contains challenging multiple-choice questions in Biology, Physics, and Chemistry that are designed to be difficult even for skilled non-experts with unrestricted internet access. Questions are written to require genuine domain expertise rather than simple information retrieval.
Capabilities
- Graduate-level scientific reasoning
- Expert knowledge in Biology, Physics, and Chemistry
- Multiple-choice question answering
- Distinguishing between plausible-sounding incorrect answers
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
There are three splits in this environment:
- main: 448 tasks
- diamond: 198 tasks (highest-quality subset)
- extended: 546 tasks
Questions span Biology, Physics, and Chemistry subdomains. Each task presents a question with four answer choices (A, B, C, D).
Reward Structure
This is a single-turn environment. The agent submits an answer letter (A, B, C, or D) via the submit_answer tool. Validation is a deterministic exact match against the correct letter. Reward is binary: 1.0 if correct, 0.0 if incorrect.
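The reward rule can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual grader: the function name `grade_answer` is hypothetical, and the sketch assumes a strict comparison with no case or whitespace normalization, which the description above does not specify.

```python
VALID_CHOICES = ("A", "B", "C", "D")


def grade_answer(submitted: str, correct: str) -> float:
    """Return 1.0 on an exact letter match, 0.0 otherwise.

    Assumption: the comparison is strict, so "a" or " A" does not match "A".
    """
    if submitted not in VALID_CHOICES:
        # Anything other than a single valid letter earns no reward.
        return 0.0
    return 1.0 if submitted == correct else 0.0
```

Because the reward is binary, the accuracy on a split is simply the mean of the per-task rewards.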
Data
Data consists of one CSV file per split, sourced from the Hugging Face dataset Idavidrein/gpqa. Each row contains a question, the correct answer, three incorrect answers, and a subdomain. Data is stored on the OpenReward platform.
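One way a row could be turned into a four-option task is sketched below. Everything here is an assumption made for illustration: the column names, the helper `build_task`, the seeded shuffle of the options, and the toy row (which is invented and not taken from the dataset). The environment's real preprocessing may differ.

```python
import random

LETTERS = ("A", "B", "C", "D")


def build_task(row: dict, seed: int = 0) -> dict:
    """Map one CSV row to a question with lettered choices.

    Assumed column names: "Question", "Correct Answer",
    "Incorrect Answer 1".."Incorrect Answer 3", "Subdomain".
    """
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    # Shuffle positions rather than texts so the correct option can be
    # tracked by index (index 0 is always the correct answer).
    order = [0, 1, 2, 3]
    random.Random(seed).shuffle(order)
    choices = {letter: options[i] for letter, i in zip(LETTERS, order)}
    answer = LETTERS[order.index(0)]
    return {
        "question": row["Question"],
        "choices": choices,
        "answer": answer,
        "subdomain": row["Subdomain"],
    }


# Invented toy row, for demonstration only.
toy_row = {
    "Question": "What is 2 + 2?",
    "Correct Answer": "4",
    "Incorrect Answer 1": "3",
    "Incorrect Answer 2": "5",
    "Incorrect Answer 3": "22",
    "Subdomain": "Arithmetic",
}
task = build_task(toy_row, seed=0)
```

Seeding the shuffle keeps the letter assignment reproducible across runs, so the same task always has the same correct letter.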
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit your answer choice (A, B, C, or D). Ends the episode. |
Time Horizon
Single-turn. The agent reads the question and options, then submits one answer.
Environment Difficulty
GPQA evaluates expert-level scientific reasoning. Accuracy on the diamond split for several models, with human baselines for comparison:
| Model or Baseline | Diamond Accuracy |
|---|---|
| Claude Opus 4.5 | 77.4% |
| GPT-5 | 72.7% |
| Gemini 2 Flash | 67.7% |
| Claude Sonnet 4 | 65.5% |
| Human Experts | 69.7% |
| Human Non-Experts | 34.1% |
Other Environment Requirements
There are no further environment requirements; GPQA works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in GPQA answer graduate-level science questions in a standard environment. The environment does not present direct safety risks.
Citation
@article{rein2023gpqa,
title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.},
journal={arXiv preprint arXiv:2311.12022},
year={2023}
}