GPQA

Description

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is an environment for evaluating expert-level question answering. It contains challenging multiple-choice questions in Biology, Physics, and Chemistry that are designed to be difficult even for skilled non-experts with unrestricted internet access (hence "Google-proof") and to remain challenging for domain experts. Answering them requires genuine expertise rather than simple information retrieval.

Capabilities

  • Graduate-level scientific reasoning
  • Expert knowledge in Biology, Physics, and Chemistry
  • Multiple-choice question answering
  • Distinguishing between plausible-sounding incorrect answers

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

  • main: 448 tasks
  • diamond: 198 tasks (highest quality subset)
  • extended: 546 tasks

Questions span Biology, Physics, and Chemistry subdomains. Each task presents a question with four answer choices (A, B, C, D).

Reward Structure

This is a single-turn environment. The agent submits an answer letter (A, B, C, or D) via the submit_answer tool. Validation is deterministic exact match. Reward is binary: 1.0 if correct, 0.0 if incorrect.
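
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the environment's actual implementation: the function name score_answer is hypothetical, and the sketch assumes a strict comparison with no case or whitespace normalization, which the source does not specify.

```python
VALID_LETTERS = ("A", "B", "C", "D")


def score_answer(submitted: str, correct: str) -> float:
    """Return 1.0 on an exact letter match, else 0.0 (binary reward).

    Assumption: the comparison is strict, so "b" does not match "B".
    """
    if correct not in VALID_LETTERS:
        raise ValueError(f"correct answer must be one of {VALID_LETTERS}")
    return 1.0 if submitted == correct else 0.0
```

Under this strict sketch, score_answer("B", "B") gives 1.0, while any other submission, including a lowercase "b", gives 0.0.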

Data

Data consists of one CSV file per split, sourced from the Hugging Face dataset Idavidrein/gpqa. Each row contains a question, the correct answer, three incorrect answers, and a subdomain. Data is stored on the OpenReward platform.
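
As an illustration of how one row could become a four-option task, the sketch below shuffles the correct answer in with the three incorrect ones and records which letter ends up correct. The column names used here ("Question", "Correct Answer", "Incorrect Answer 1" and so on, "Subdomain") and the helper build_task are assumptions made for the example; the environment's real loading and shuffling logic is not described in this README.

```python
import random

LETTERS = "ABCD"


def build_task(row: dict, seed: int = 0) -> dict:
    """Turn one CSV row into a lettered multiple-choice task.

    Assumption: column names are illustrative and may differ from the
    actual CSV headers.
    """
    choices = [
        row["Correct Answer"],      # index 0 is always the correct answer
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    # Seeded shuffle so the same row always yields the same ordering.
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    options = {LETTERS[pos]: choices[src] for pos, src in enumerate(order)}
    return {
        "question": row["Question"],
        "options": options,
        "correct_letter": LETTERS[order.index(0)],
        "subdomain": row["Subdomain"],
    }
```

Seeding the shuffle per row keeps the letter assignment reproducible across runs, which is one way to keep a deterministic exact-match check meaningful.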

Tools

Tool          | Description
submit_answer | Submit your answer choice (A, B, C, or D). Ends the episode.

Time Horizon

Single-turn. The agent reads the question and options, then submits one answer.

Environment Difficulty

GPQA evaluates expert-level scientific reasoning:

Model             | Diamond Accuracy
Claude Opus 4.5   | 77.4%
GPT-5             | 72.7%
Gemini 2 Flash    | 67.7%
Claude Sonnet 4   | 65.5%
Human Experts     | 69.7%
Human Non-Experts | 34.1%
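
For context when reading these figures: with a binary reward, accuracy over a split is simply the mean reward across its tasks, and uniform random guessing among four options scores 25% in expectation. The helper below is a hypothetical sketch of that calculation, not part of the environment.

```python
# Expected accuracy of uniform random guessing over four options.
CHANCE_ACCURACY = 1 / 4


def accuracy(rewards: list[float]) -> float:
    """Mean of per-task binary rewards (each 1.0 or 0.0)."""
    if not rewards:
        raise ValueError("no episodes to score")
    return sum(rewards) / len(rewards)
```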

Other Environment Requirements

There are no further environment requirements; GPQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GPQA answer graduate-level science questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{rein2023gpqa,
  title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:2311.12022},
  year={2023}
}