Discovery30s

OpenReward Environment

Description

Discovery30s is an environment for testing whether vintage language models can reproduce foundational scientific discoveries from the 1930s, a period after those models' training cutoff. The benchmark presents historical discoveries as progressive "question ladders" in which each question builds on previous answers, simulating how researchers originally developed these insights.

Capabilities

  • Scientific reasoning across multiple domains
  • Progressive contextual learning with accumulated Q&A history
  • Mathematical and formal logic reasoning
  • Scientific terminology and concept equivalence
  • Historical scientific knowledge synthesis

Compute Requirements

The agent is not given access to a sandbox in this environment.

License

MIT

Tasks

There is one split in this environment:

  • test_v0_1: 196 questions across 10 historical discoveries

The 10 discoveries span 5 domains:

Mathematics (70 tasks)

  • Gödel's Incompleteness Theorems (22 questions, 1931)
  • Ergodic Theory (23 questions, 1932)
  • Quantum Math Framework (25 questions, 1932)

Biology (59 tasks)

  • Chromosomal Crossover (24 questions, 1932)
  • Homeostasis (21 questions, 1932)
  • Haldane Unification (14 questions, 1932)

Physics (34 tasks)

  • Zero-Length Spring (19 questions, 1932)
  • Dirac's Magnetic Monopoles (15 questions, 1932)

Chemistry (17 tasks)

  • Hückel's Rule (17 questions, 1931)

Astronomy (16 tasks)

  • Oort Cloud (16 questions, 1932)

Each discovery is structured as a progressive question ladder where later questions include all previous Q&A pairs as context, allowing agents to build understanding incrementally.
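The accumulation described above can be sketched in a few lines. This is a minimal illustration only: the field layout and the `build_context` helper are assumptions, not the environment's actual schema (the example Q&A pair borrows the "6 pi electrons" equivalence mentioned under Reward Structure).

```python
# Minimal sketch of how a question ladder accumulates context.
# Field layout and helper name are illustrative assumptions, not
# the environment's actual data format.

def build_context(history):
    """Render all previous Q&A pairs in the ladder as one context string."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in history)

ladder_so_far = [
    ("How many pi electrons does benzene have?",
     "6 pi electrons"),
    ("What pattern do other aromatic rings follow?",
     "4n + 2 pi electrons"),
]

# The third question in the ladder would receive both earlier pairs
# as context, letting the agent build understanding incrementally.
context_for_next_question = build_context(ladder_so_far)
print(context_for_next_question)
```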

Reward Structure

This is a sparse, LLM-graded reward environment. Rewards are issued after each answer submission:

  • 1.0: Semantically correct answer
  • 0.0: Incorrect answer

The LLM grader (gpt-5-mini) evaluates semantic equivalence using domain-specific rubrics. Equivalent scientific terminology is accepted (e.g., "6 pi electrons" = "six pi electrons" = "three pi bonds worth of electrons").
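The environment's actual grading prompts are not published; the following is a guess at their general shape, showing how a question, reference answer, candidate answer, and domain rubric might be assembled into a single prompt for the grader. The rubric wording and the binary CORRECT/INCORRECT verdict format are assumptions.

```python
# Sketch of how an LLM grading prompt might be assembled. The rubric
# wording and verdict format are illustrative assumptions; only the
# grader model (gpt-5-mini) and the 1.0/0.0 reward mapping come from
# the environment docs.

def build_grading_prompt(question, reference, candidate, rubric):
    """Assemble a semantic-equivalence grading prompt for the LLM grader."""
    return (
        "You are grading a scientific answer for semantic equivalence.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )

prompt = build_grading_prompt(
    question="How many pi electrons satisfy Huckel's rule for benzene?",
    reference="6 pi electrons",
    candidate="six pi electrons",
    rubric="Accept equivalent scientific terminology and phrasings.",
)
# A CORRECT verdict from the grader maps to reward 1.0, INCORRECT to 0.0.
print(prompt)
```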

Data

The benchmark consists of 196 question-answer pairs organized into 10 question ladders. Each question includes:

  • Question text with historical framing
  • Context containing all previous Q&A pairs in the ladder
  • Reference answer for grading
  • Metadata (domain, year, task type)

Questions are presented as if exploring discoveries in their original 1930s era, requiring agents to reason through evidence without modern hindsight.

Tools

Agents have access to one tool:

  • submit_answer: Submit an answer for grading. Returns reward (1.0 or 0.0) and marks the task as finished.
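As an illustration, the tool could be exposed to the agent with a JSON-schema-style definition like the one below. Only the tool name `submit_answer` comes from this README; the parameter name and descriptions are assumptions.

```python
# JSON-schema-style sketch of the submit_answer tool definition.
# Only the tool name is documented; the "answer" parameter and the
# descriptions are illustrative assumptions.

submit_answer_tool = {
    "name": "submit_answer",
    "description": "Submit an answer for grading. Returns reward "
                   "(1.0 or 0.0) and marks the task as finished.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The agent's answer to the current question.",
            }
        },
        "required": ["answer"],
    },
}
```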

Time Horizon

Discovery30s is a single-turn environment. Each task requires exactly one tool call to submit an answer.

Environment Difficulty

GPT-5.2 (high reasoning effort), with the benefit of modern-day hindsight, scores 97.96% on test_v0_1.

No vintage language models have been evaluated on this benchmark yet.

Other Environment Requirements

Discovery30s requires an OpenAI API key for LLM-based grading:

  • openai_api_key: Required in secrets parameter for grading answers with gpt-5-mini

Export it before running:

export OPENAI_API_KEY=your_api_key_here

Pass the key via the secrets parameter when creating a session:

async with environment.session(task=task, secrets={"openai_api_key": OPENAI_API_KEY}) as session:
    ...

Safety

Discovery30s presents no direct safety risks. Agents interact only through Q&A submission about historical scientific discoveries. There are no external API calls, file system access, or real-world actions beyond answering questions.

The benchmark tests scientific reasoning capabilities without enabling potentially harmful applications.

Citations

@dataset{GRDiscovery30s,
  author    = {General Reasoning Inc. Team},
  title     = {Discovery30s},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/Discovery30s}
}