GPQA
Description
GPQA is a benchmark dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof" and extremely difficult for both human experts (≈65% accuracy) and state-of-the-art AI systems (GPT-4 baseline: ≈39%). The dataset is intended to support realistic scalable oversight experiments, which study how humans can reliably supervise AI systems on very hard scientific questions where those systems may surpass human capabilities.
Implementations (1)
| Environment | Stars | Last Updated | |
|---|---|---|---|
| | 0 | 1 month ago | |