SimpleQA
Description
SimpleQA is an environment for evaluating short-form factual question answering. It implements the SimpleQA benchmark from OpenAI, which consists of short, fact-seeking questions with single, unambiguous answers. An LLM grader evaluates the agent's response against the gold target using a three-way classification.
Capabilities
- Short-form factual question answering across diverse topics
- Single-turn evaluation (one question, one answer)
- LLM-graded correctness with three-way classification (Correct / Incorrect / Not Attempted)
Compute Requirements
SimpleQA does not require a sandbox. It has minimal compute requirements.
License
MIT.
Tasks
There is one test split containing 4,326 short, fact-seeking questions loaded from simple_qa_test_set.csv. Questions span a range of topics including science and technology, politics, art, history, geography, and entertainment.
Each question has a single, verifiable gold target answer. Questions were adversarially collected against GPT-4 to ensure difficulty, and independently verified by human annotators. Answer types include dates (32.8%), people (24.1%), numbers (15.3%), places (9.9%), and other (18.0%).
Reward Structure
This is a sparse reward environment. The agent calls the answer tool once with its response, and the environment grades it using an LLM grader (gpt-4o). The grader assigns one of three grades:
- A (CORRECT): The answer fully contains the important information from the gold target with no contradictions. Reward: 1.0.
- B (INCORRECT): The answer contains a factual statement that contradicts the gold target. Reward: 0.0.
- C (NOT_ATTEMPTED): Important information from the gold target is missing, but nothing contradicts it. Reward: 0.0.
Grading rules:
- Case, punctuation, grammar, and word order do not matter; only semantic meaning is compared.
- Hedging and uncertainty are permitted if the gold target is fully included with no contradictions.
- Numeric answers must be correct to the last significant figure in the gold target.
- Typos in names are permitted if clearly the same name.
- Answers are not penalized for omitting information clearly inferred from the question.
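The grade-to-reward mapping above can be sketched in a few lines. This is an illustrative snippet, not the environment's actual API; the names `GRADE_REWARDS` and `score_answer` are hypothetical.

```python
# Sketch of the three-way grade-to-reward mapping described above.
# Only grade A (CORRECT) earns reward; B (INCORRECT) and
# C (NOT_ATTEMPTED) both score 0.0.
GRADE_REWARDS = {"A": 1.0, "B": 0.0, "C": 0.0}

def score_answer(grade: str) -> float:
    """Map the grader's letter (A/B/C) to a scalar reward."""
    if grade not in GRADE_REWARDS:
        raise ValueError(f"unexpected grade: {grade!r}")
    return GRADE_REWARDS[grade]
```

Note that the environment cannot distinguish B from C by reward alone; the grade letter carries the extra information about whether the model attempted the question.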
Data
Questions are loaded from simple_qa_test_set.csv, which contains columns for id, metadata, problem, and answer. No additional data files are provided to the agent.
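A minimal loader for this file might look like the following, assuming the column names listed above; the function name `load_questions` is illustrative, not the environment's real code.

```python
import csv

def load_questions(path="simple_qa_test_set.csv"):
    """Load (problem, answer) pairs from the SimpleQA test CSV.

    Assumes the file has id/metadata/problem/answer columns, as
    described in the environment card.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"problem": row["problem"], "answer": row["answer"]}
            for row in csv.DictReader(f)
        ]
```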
Tools
Agents are given a single tool:
answer: Submit an answer to the question. The answer is graded by the LLM grader against the gold target. Returns the grade (A/B/C) and reward (1.0 or 0.0). This tool can only be called once per task.
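The single-use constraint on the answer tool can be sketched as follows. This is a hypothetical implementation, assuming a caller-supplied `grade_fn` that stands in for the LLM grader; the class and method names are not the environment's real identifiers.

```python
class AnswerTool:
    """Illustrative single-use answer tool.

    grade_fn stands in for the LLM grader and returns "A", "B", or "C".
    """

    def __init__(self, grade_fn):
        self._grade_fn = grade_fn
        self._used = False

    def answer(self, text: str) -> dict:
        # Enforce the one-call-per-task rule described above.
        if self._used:
            raise RuntimeError("answer may only be called once per task")
        self._used = True
        grade = self._grade_fn(text)
        return {"grade": grade, "reward": 1.0 if grade == "A" else 0.0}
```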
Time Horizon
SimpleQA is a single-turn environment. The agent receives a question and submits one answer. Each task requires exactly one tool call.
[Statistics on average tool calls here]
Environment Difficulty
[Statistics on environment difficulty here]
Other Environment Requirements
SimpleQA requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
Safety
Agents in SimpleQA are asked to answer factual questions. The environment does not present direct safety risks, as agents only provide text answers to knowledge questions with no access to external systems, tools, or the internet.
However, agents trained to be information-seeking may provide capability uplift to non-expert actors in obtaining hard-to-find information.
Citations
@misc{wei2024measuringshortformfactualitylarge,
title={Measuring short-form factuality in large language models},
author={Jason Wei and Nguyen Karina and Hyung Won Chung and Yunxin Joy Jiao and Spencer Papay and Amelia Glaese and John Schulman and William Fedus},
year={2024},
eprint={2411.04368},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.04368}
}