Discovery30s

OpenReward Environment

Description

Discovery30s is an environment for testing whether vintage language models can reproduce foundational scientific discoveries from the 1930s, a period after those models' training cutoff. The benchmark presents historical discoveries as progressive "question ladders" in which each question builds on previous answers, simulating how researchers originally developed these insights.

Capabilities

  • Scientific reasoning across multiple domains
  • Progressive contextual learning with accumulated Q&A history
  • Mathematical and formal logic reasoning
  • Scientific terminology and concept equivalence
  • Historical scientific knowledge synthesis

Compute Requirements

The agent is not given access to a sandbox in this environment.

License

MIT

Tasks

There is one split in this environment:

  • test_v0_1: 196 questions across 10 historical discoveries

The 10 discoveries span 5 domains:

Mathematics (70 tasks)

  • Gödel's Incompleteness Theorems (22 questions, 1931)
  • Ergodic Theory (23 questions, 1932)
  • Quantum Math Framework (25 questions, 1932)

Biology (59 tasks)

  • Chromosomal Crossover (24 questions, 1932)
  • Homeostasis (21 questions, 1932)
  • Haldane Unification (14 questions, 1932)

Physics (34 tasks)

  • Zero-Length Spring (19 questions, 1932)
  • Dirac's Magnetic Monopoles (15 questions, 1932)

Chemistry (17 tasks)

  • Hückel's Rule (17 questions, 1931)

Astronomy (16 tasks)

  • Oort Cloud (16 questions, 1932)

Each discovery is structured as a progressive question ladder where later questions include all previous Q&A pairs as context, allowing agents to build understanding incrementally.
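The accumulation described above can be sketched in a few lines. This is a minimal illustration only: the field layout and the `build_context` helper are assumptions, not the environment's actual schema (the example Q&A pair borrows the "6 pi electrons" equivalence mentioned under Reward Structure).

```python
# Minimal sketch of how a question ladder accumulates context.
# Field layout and helper name are illustrative assumptions, not
# the environment's actual data format.

def build_context(history):
    """Render all previous Q&A pairs in the ladder as one context string."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in history)

ladder_so_far = [
    ("How many pi electrons does benzene have?",
     "6 pi electrons"),
    ("What pattern do other aromatic rings follow?",
     "4n + 2 pi electrons"),
]

# The third question in the ladder would receive both earlier pairs
# as context, letting the agent build understanding incrementally.
context_for_next_question = build_context(ladder_so_far)
print(context_for_next_question)
```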

Reward Structure

This is a sparse, LLM-graded reward environment. Rewards are issued after each answer submission:

  • 1.0: Semantically correct answer
  • 0.0: Incorrect answer

The LLM grader (gpt-5-mini) evaluates semantic equivalence using domain-specific rubrics. Equivalent scientific terminology is accepted (e.g., "6 pi electrons" = "six pi electrons" = "three pi bonds worth of electrons").
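The environment's actual grading prompts are not published; the following is a guess at their general shape, showing how a question, reference answer, candidate answer, and domain rubric might be assembled into a single prompt for the grader. The rubric wording and the binary CORRECT/INCORRECT verdict format are assumptions.

```python
# Sketch of how an LLM grading prompt might be assembled. The rubric
# wording and verdict format are illustrative assumptions; only the
# grader model (gpt-5-mini) and the 1.0/0.0 reward mapping come from
# the environment docs.

def build_grading_prompt(question, reference, candidate, rubric):
    """Assemble a semantic-equivalence grading prompt for the LLM grader."""
    return (
        "You are grading a scientific answer for semantic equivalence.\n"
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )

prompt = build_grading_prompt(
    question="How many pi electrons satisfy Huckel's rule for benzene?",
    reference="6 pi electrons",
    candidate="six pi electrons",
    rubric="Accept equivalent scientific terminology and phrasings.",
)
# A CORRECT verdict from the grader maps to reward 1.0, INCORRECT to 0.0.
print(prompt)
```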

Data

The benchmark consists of 196 question-answer pairs organized into 10 question ladders. Each question includes:

  • Question text with historical framing
  • Context containing all previous Q&A pairs in the ladder
  • Reference answer for grading
  • Metadata (domain, year, task type)

Questions are presented as if exploring discoveries in their original 1930s era, requiring agents to reason through evidence without modern hindsight.

Tools

Agents have access to one tool:

  • submit_answer: Submit an answer for grading. Returns reward (1.0 or 0.0) and marks the task as finished.
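As an illustration, the tool could be exposed to the agent with a JSON-schema-style definition like the one below. Only the tool name `submit_answer` comes from this README; the parameter name and descriptions are assumptions.

```python
# JSON-schema-style sketch of the submit_answer tool definition.
# Only the tool name is documented; the "answer" parameter and the
# descriptions are illustrative assumptions.

submit_answer_tool = {
    "name": "submit_answer",
    "description": "Submit an answer for grading. Returns reward "
                   "(1.0 or 0.0) and marks the task as finished.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The agent's answer to the current question.",
            }
        },
        "required": ["answer"],
    },
}
```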

Time Horizon

Discovery30s is a single-turn environment. Each task requires exactly one tool call to submit an answer.

Environment Difficulty

GPT-5.2 (high reasoning effort), with the benefit of modern-day hindsight, scores 97.96% on test_v0_1.

No vintage language models have been evaluated on this benchmark yet.

Other Environment Requirements

Discovery30s requires an OpenAI API key for LLM-based grading:

  • openai_api_key: Required in secrets parameter for grading answers with gpt-5-mini

Export it before running:

export OPENAI_API_KEY=your_api_key_here

Pass the key via the secrets parameter when creating a session:

async with environment.session(task=task, secrets={"openai_api_key": OPENAI_API_KEY}) as session:
    ...

Safety

Discovery30s presents no direct safety risks. Agents interact only through Q&A submission about historical scientific discoveries. There are no external API calls, file system access, or real-world actions beyond answering questions.

The benchmark tests scientific reasoning capabilities without enabling potentially harmful applications.

Citations

@dataset{GRDiscovery30s,
  author    = {General Reasoning Inc. Team},
  title     = {Discovery30s},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/Discovery30s}
}