Implementation of arXiv/discoverybench

DiscoveryBench

Description

DiscoveryBench is an environment for evaluating data-driven scientific hypothesis discovery. Agents analyze CSV datasets and formulate hypotheses, which are scored against gold-standard hypotheses using the HMS (Hypothesis Matching Score) metric, with gpt-5-mini acting as the LLM judge.

Capabilities

Exploratory data analysis
Statistical hypothesis formulation
Scientific reasoning across domains (humanities, social science, economics, etc.)

Compute Requirements

Agents run in a sandboxed Docker environment. The default sandbox size is 1 CPU and 2 GB RAM. Network access is enabled, and no GPU is required.

Tasks

Train split: ~25 tasks (from DiscoveryBench real/train, 4 topics)
Test split: 239 tasks (from DiscoveryBench real/test, 10 topics)
Each task provides one or more CSV datasets and a natural language query. Topics include archaeology, economics, NLP/requirements engineering, and more.

Reward Structure

Continuous reward via HMS (Hypothesis Matching Score):

HMS = context_score × var_f1 × rel_score
context_score (0 or 1): Does the predicted hypothesis context match the gold context?
var_f1 (0–1): F1 score of variable overlap (fuzzy matching)
rel_score (0, 0.5, or 1.0): Relationship similarity
Scored by gpt-5-mini as LLM judge (original DiscoveryBench/AstaBench used gpt-4o-2024-08-06)
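
As a rough illustration of how the three components combine, the product above can be sketched in Python. The function names and the count-based `var_f1` helper are assumptions made for this example, not the environment's actual code. In the real pipeline the LLM judge decides which variables match (fuzzily), and this sketch simply takes the resulting counts as input.

```python
def var_f1(num_matched: int, num_predicted: int, num_gold: int) -> float:
    """F1 over variables, given how many predicted variables matched gold ones.

    Hypothetical helper: the matching itself (fuzzy, done by the judge)
    happens elsewhere; only the counts are used here.
    """
    if num_matched == 0 or num_predicted == 0 or num_gold == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_gold
    return 2 * precision * recall / (precision + recall)


def hms(context_score: int, var_f1_score: float, rel_score: float) -> float:
    """HMS = context_score * var_f1 * rel_score, with range checks."""
    if context_score not in (0, 1):
        raise ValueError("context_score must be 0 or 1")
    if not 0.0 <= var_f1_score <= 1.0:
        raise ValueError("var_f1 must lie in [0, 1]")
    if rel_score not in (0.0, 0.5, 1.0):
        raise ValueError("rel_score must be 0, 0.5, or 1.0")
    return context_score * var_f1_score * rel_score
```

Because the score is a product, a context mismatch (context_score = 0) zeroes the reward regardless of how well the variables and relationship match.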

Data

Source: allenai/discoverybench on HuggingFace
Gold hypotheses (test): From answer_key/answer_key_real.csv
Gold hypotheses (train): Embedded in metadata files
Format: Task JSONs + CSV data files staged per topic for sandbox bucket mount

Tools

bash: Execute shell commands in the sandbox (for data analysis)
submit_hypothesis: Submit a hypothesis and workflow for HMS evaluation (terminal action, one attempt)

Time Horizon

Multi-turn. Agents typically explore data via multiple bash calls before submitting. Expected: 5–30 tool calls.
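
The explore-then-submit pattern can be sketched as a simple loop. Everything here is hypothetical: `policy`, `bash`, and `submit_hypothesis` are stand-in callables, the action tuples are an invented format, and the 30-call cap only mirrors the upper end of the expected range rather than a limit the environment is known to enforce.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_episode(
    policy: Callable[[History], tuple],
    bash: Callable[[str], str],
    submit_hypothesis: Callable[[str, str], float],
    max_tool_calls: int = 30,  # assumed budget, not taken from the environment
) -> float:
    """Run bash calls until the policy submits, then return the reward.

    The policy returns ("bash", command) to keep exploring, or
    ("submit", hypothesis, workflow) to end the episode.
    """
    history: History = []
    for _ in range(max_tool_calls):
        action = policy(history)
        if action[0] == "submit":
            _, hypothesis, workflow = action
            # Submission is terminal and allows one attempt, so return at once.
            return submit_hypothesis(hypothesis, workflow)
        command = action[1]
        history.append((command, bash(command)))
    raise RuntimeError("tool-call budget exhausted without a submission")
```

Returning directly from the submit branch reflects the one-attempt rule: nothing the agent does after submitting can change the score.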

Environment Difficulty

Challenging. Requires statistical reasoning, domain understanding, and clear scientific communication.

Safety

Code is executed in an isolated sandbox. HMS evaluation makes gpt-5-mini API calls and therefore requires an OpenAI API key.

Citations

@article{majumder2024discoverybench,
      title={DiscoveryBench: Towards Data-Driven Discovery with Large Language Models},
      author={Bodhisattwa Prasad Majumder and Harshit Surana and Dhruv Agarwal and Bhavana Dalvi Mishra and Abhijeetsingh Meena and Aryan Prakhar and Tirth Vora and Tushar Khot and Ashish Sabharwal and Peter Clark},
      year={2024},
      eprint={2407.01725},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}

@article{bragg2025astabench,
      title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
      author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
      journal={arXiv preprint arXiv:2510.21652},
      year={2025},
}

Repository

Source repository

EnvCommons/DiscoveryBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550

Examples

5-minute session: $0.0165

1-hour session: $0.1980
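
The example figures follow directly from the per-second rates. A minimal sketch of the arithmetic, using `Decimal` to avoid floating-point rounding on small currency values (the constant and function names are illustrative only):

```python
from decimal import Decimal

# Per-second rates from the cost table above.
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000230")


def session_cost(seconds: int) -> Decimal:
    """Total cost in USD for a session lasting the given number of seconds."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


five_minutes = session_cost(5 * 60)  # 300 seconds
one_hour = session_cost(60 * 60)     # 3600 seconds
```

With the combined rate of $0.0000550 per second, 300 seconds comes to $0.0165 and 3600 seconds to $0.1980, matching the listed examples.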
