Implementation of arXiv/GPQA

GPQA

Description

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is an environment for evaluating expert-level question answering. It contains challenging multiple-choice questions in Biology, Physics, and Chemistry that are designed to be difficult even for domain experts with unrestricted internet access. Questions are crafted to require genuine expertise rather than simple information retrieval.

Capabilities

Graduate-level scientific reasoning
Expert knowledge in Biology, Physics, and Chemistry
Multiple-choice question answering
Distinguishing between plausible-sounding incorrect answers

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

main: 448 tasks
diamond: 198 tasks (highest quality subset)
extended: 546 tasks

Questions span Biology, Physics, and Chemistry subdomains. Each task presents a question with four answer choices (A, B, C, D).

Reward Structure

This is a single-turn environment. The agent submits an answer letter (A, B, C, or D) via the `submit_answer` tool. The submitted letter is validated by deterministic exact match against the correct choice. Reward is binary: 1.0 if correct, 0.0 if incorrect.
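The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual implementation: the function name `score_answer` is hypothetical, and the sketch assumes a strict comparison with no whitespace or case normalization, which the README does not specify.

```python
VALID_CHOICES = ("A", "B", "C", "D")


def score_answer(submitted: str, correct: str) -> float:
    """Binary exact-match reward: 1.0 if the letters are identical, else 0.0.

    Hypothetical helper. Assumes no normalization of the submitted string.
    """
    if correct not in VALID_CHOICES:
        raise ValueError(f"correct answer must be one of {VALID_CHOICES}")
    # Anything other than the exact correct letter scores zero,
    # including lowercase letters and strings outside A-D.
    return 1.0 if submitted == correct else 0.0
```

Under this strict reading, `score_answer("B", "B")` yields the full reward, while a lowercase `"b"` or any other string yields none.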

Data

Data consists of one CSV file per split, sourced from the Hugging Face dataset Idavidrein/gpqa. Each row contains a question, the correct answer, three incorrect answers, and a subdomain. Data is stored on the OpenReward platform.
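As a rough sketch of how one such row could become a four-choice task, the Python below shuffles the correct answer in with the three incorrect ones and records which letter it lands on. The column names, the `build_task` helper, and the shuffling step are all assumptions made for illustration; the README does not state the CSV schema or how options are ordered. The sample row is a toy example and is not taken from GPQA.

```python
import csv
import io
import random

# Toy row for illustration only; it is NOT taken from GPQA.
# Column names are assumptions, not the dataset's confirmed schema.
SAMPLE_CSV = (
    "question,correct_answer,incorrect_1,incorrect_2,incorrect_3,subdomain\n"
    "Which particle mediates the electromagnetic force?,"
    "Photon,Gluon,W boson,Graviton,Physics\n"
)

LETTERS = "ABCD"


def build_task(row: dict, rng: random.Random) -> dict:
    """Turn one CSV row into a task with lettered choices (hypothetical helper)."""
    options = [
        row["correct_answer"],
        row["incorrect_1"],
        row["incorrect_2"],
        row["incorrect_3"],
    ]
    rng.shuffle(options)  # assumed: option order is randomized per task
    choices = dict(zip(LETTERS, options))
    answer = LETTERS[options.index(row["correct_answer"])]
    return {
        "question": row["question"],
        "choices": choices,
        "answer": answer,
        "subdomain": row["subdomain"],
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
task = build_task(rows[0], random.Random(0))
```

Passing a seeded `random.Random` keeps the option order reproducible, so the stored answer letter stays consistent with the deterministic validation described above.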

Tools

Tool	Description
`submit_answer`	Submit your answer choice (A, B, C, or D). Ends the episode.

Time Horizon

Single-turn. The agent reads the question and options, then submits one answer.

Environment Difficulty

GPQA evaluates expert-level scientific reasoning. Accuracy on the diamond split:

Model	Diamond Accuracy
Claude Opus 4.5	77.4%
GPT-5	72.7%
Gemini 2 Flash	67.7%
Claude Sonnet 4	65.5%
Human Experts	69.7%
Human Non-Experts	34.1%

Other Environment Requirements

There are no further environment requirements; GPQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GPQA answer graduate-level science questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{rein2023gpqa,
  title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:2311.12022},
  year={2023}
}

Repository

Source repository

EnvCommons/GPQA

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096
1-hour session: $0.1152
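The example figures follow directly from the per-second rate. A short Python sketch of the arithmetic, assuming the total rate of $0.0000320 per second from the table above and billing for the exact session duration (the `session_cost` helper is hypothetical):

```python
from decimal import Decimal

# Total per-second rate from the cost table (environment only, no sandbox).
RATE_PER_SECOND = Decimal("0.0000320")


def session_cost(seconds: int) -> Decimal:
    """Cost in USD of a session lasting the given number of seconds."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return RATE_PER_SECOND * seconds


five_minutes = session_cost(5 * 60)   # 300 seconds
one_hour = session_cost(60 * 60)      # 3600 seconds
```

`Decimal` is used instead of `float` so the small per-second rate multiplies exactly; the two results equal the $0.0096 and $0.1152 figures listed above.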
