SimpleQA
Description
SimpleQA is an environment for evaluating short-form factual question answering. It implements the SimpleQA benchmark from OpenAI, which consists of short, fact-seeking questions with single, unambiguous answers. An LLM grader evaluates the agent's response against the gold target using a three-way classification.
Capabilities
- Short-form factual question answering across diverse topics
- Single-turn evaluation (one question, one answer)
- LLM-graded correctness with three-way classification (Correct / Incorrect / Not Attempted)
Compute Requirements
SimpleQA does not require a sandbox. It has minimal compute requirements.
License
MIT.
Tasks
There is one test split containing 4,326 short, fact-seeking questions loaded from simple_qa_test_set.csv. Questions span a range of topics including science and technology, politics, art, history, geography, and entertainment.
Each question has a single, verifiable gold target answer. Questions were adversarially collected against GPT-4 to ensure difficulty, and independently verified by human annotators. Answer types include dates (32.8%), people (24.1%), numbers (15.3%), places (9.9%), and other (18.0%).
Reward Structure
This is a sparse reward environment. The agent calls the answer tool once with its response, and the environment grades it using an LLM grader (gpt-4o). The grader assigns one of three grades:
- A (CORRECT): The answer fully contains the important information from the gold target with no contradictions. Reward: 1.0.
- B (INCORRECT): The answer contains a factual statement that contradicts the gold target. Reward: 0.0.
- C (NOT_ATTEMPTED): Important information from the gold target is missing, but nothing contradicts it. Reward: 0.0.
Grading rules:
- Case, punctuation, grammar, and word order do not matter; only semantic meaning is compared.
- Hedging and uncertainty are permitted if the gold target is fully included with no contradictions.
- Numeric answers must be correct to the last significant figure in the gold target.
- Typos in names are permitted if clearly the same name.
- Answers are not penalized for omitting information clearly inferred from the question.
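The grade-to-reward mapping above can be sketched in a few lines. This is an illustrative snippet, not the environment's actual API; the names `GRADE_REWARDS` and `score_answer` are hypothetical.

```python
# Sketch of the three-way grade-to-reward mapping described above.
# Only grade A (CORRECT) earns reward; B (INCORRECT) and
# C (NOT_ATTEMPTED) both score 0.0.
GRADE_REWARDS = {"A": 1.0, "B": 0.0, "C": 0.0}

def score_answer(grade: str) -> float:
    """Map the grader's letter (A/B/C) to a scalar reward."""
    if grade not in GRADE_REWARDS:
        raise ValueError(f"unexpected grade: {grade!r}")
    return GRADE_REWARDS[grade]
```

Note that the environment cannot distinguish B from C by reward alone; the grade letter carries the extra information about whether the model attempted the question.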
Data
Questions are loaded from simple_qa_test_set.csv, which contains columns for id, metadata, problem, and answer. No additional data files are provided to the agent.
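A minimal loader for this file might look like the following, assuming the column names listed above; the function name `load_questions` is illustrative, not the environment's real code.

```python
import csv

def load_questions(path="simple_qa_test_set.csv"):
    """Load (problem, answer) pairs from the SimpleQA test CSV.

    Assumes the file has id/metadata/problem/answer columns, as
    described in the environment card.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"problem": row["problem"], "answer": row["answer"]}
            for row in csv.DictReader(f)
        ]
```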
Tools
Agents are given a single tool:
answer: Submit an answer to the question. The answer is graded by the LLM grader against the gold target. Returns the grade (A/B/C) and reward (1.0 or 0.0). This tool can only be called once per task.
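The single-use constraint on the answer tool can be sketched as follows. This is a hypothetical implementation, assuming a caller-supplied `grade_fn` that stands in for the LLM grader; the class and method names are not the environment's real identifiers.

```python
class AnswerTool:
    """Illustrative single-use answer tool.

    grade_fn stands in for the LLM grader and returns "A", "B", or "C".
    """

    def __init__(self, grade_fn):
        self._grade_fn = grade_fn
        self._used = False

    def answer(self, text: str) -> dict:
        # Enforce the one-call-per-task rule described above.
        if self._used:
            raise RuntimeError("answer may only be called once per task")
        self._used = True
        grade = self._grade_fn(text)
        return {"grade": grade, "reward": 1.0 if grade == "A" else 0.0}
```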
Time Horizon
SimpleQA is a single-turn environment. The agent receives a question and submits one answer. Each task requires exactly one tool call.
[Statistics on average tool calls here]
Environment Difficulty
[Statistics on environment difficulty here]
Other Environment Requirements
SimpleQA requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
Safety
Agents in SimpleQA are asked to answer factual questions. The environment does not present direct safety risks, as agents only provide text answers to knowledge questions with no access to external systems, tools, or the internet.
However, agents trained to be information-seeking may provide capability uplift to non-expert actors in obtaining hard-to-find information.
Citations
@misc{wei2024measuringshortformfactualitylarge,
title={Measuring short-form factuality in large language models},
author={Jason Wei and Nguyen Karina and Hyung Won Chung and Yunxin Joy Jiao and Spencer Papay and Amelia Glaese and John Schulman and William Fedus},
year={2024},
eprint={2411.04368},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.04368}
}