DiscoveryBench
Description
DiscoveryBench is an environment for evaluating data-driven scientific hypothesis discovery. Agents analyze CSV datasets and formulate hypotheses, which are scored against gold-standard hypotheses using the HMS (Hypothesis Matching Score) metric with GPT-5-mini as the LLM judge.
Capabilities
- Exploratory data analysis
- Statistical hypothesis formulation
- Scientific reasoning across domains (humanities, social science, economics, etc.)
Compute Requirements
Agents are given a sandboxed Docker environment. The default sandbox size is 1 CPU and 2 GB RAM. Network access is enabled. No GPU is required.
Tasks
- Train split: ~25 tasks (from DiscoveryBench real/train, 4 topics)
- Test split: 239 tasks (from DiscoveryBench real/test, 10 topics)
- Each task provides one or more CSV datasets and a natural language query. Topics include archaeology, economics, NLP/requirements engineering, and more.
Reward Structure
Continuous reward via HMS (Hypothesis Matching Score):
- HMS = context_score × var_f1 × rel_score
- context_score (0 or 1): Does the predicted hypothesis context match the gold context?
- var_f1 (0–1): F1 score of variable overlap (fuzzy matching)
- rel_score (0, 0.5, or 1.0): Relationship similarity
- Scored by gpt-5-mini as LLM judge (the original DiscoveryBench/AstaBench used gpt-4o-2024-08-06)
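The product form of HMS means a context mismatch zeroes the reward regardless of the other two components. The following is a minimal sketch of that arithmetic, not the environment's scoring code: the function names `variable_f1` and `hms` are hypothetical, and exact set overlap stands in for the fuzzy variable matching that the LLM judge performs.

```python
from __future__ import annotations


def variable_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 of variable overlap.

    Assumption: exact string matching is used here for illustration;
    the actual metric matches variables fuzzily via the LLM judge.
    """
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def hms(context_score: int, var_f1: float, rel_score: float) -> float:
    """HMS = context_score x var_f1 x rel_score, with range checks."""
    if context_score not in (0, 1):
        raise ValueError("context_score must be 0 or 1")
    if not 0.0 <= var_f1 <= 1.0:
        raise ValueError("var_f1 must lie in [0, 1]")
    if rel_score not in (0.0, 0.5, 1.0):
        raise ValueError("rel_score must be 0, 0.5, or 1.0")
    return context_score * var_f1 * rel_score


# Hypothetical example: one extra predicted variable, partial relationship match.
f1 = variable_f1({"income", "education", "age"}, {"income", "education"})
score = hms(context_score=1, var_f1=f1, rel_score=0.5)
```

In this example precision is 2/3 and recall is 1, so the variable F1 is about 0.8 and the resulting HMS is about 0.4. With `context_score=0` the same hypothesis would score 0.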
Data
- Source: allenai/discoverybench on HuggingFace
- Gold hypotheses (test): From answer_key/answer_key_real.csv
- Gold hypotheses (train): Embedded in metadata files
- Format: Task JSONs + CSV data files staged per topic for sandbox bucket mount
Tools
- bash: Execute shell commands in the sandbox (for data analysis)
- submit_hypothesis: Submit a hypothesis and workflow for HMS evaluation (terminal action, one attempt)
Time Horizon
Multi-turn. Agents typically explore data via multiple bash calls before submitting. Expected: 5–30 tool calls.
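As a rough illustration of what a single exploration step might look like, the sketch below loads a CSV and computes a Pearson correlation between two columns using only the Python standard library. Everything specific here is an assumption made for illustration: the inline data, the column names `education_years` and `income`, and the helper names are invented and do not come from any DiscoveryBench task.

```python
import csv
import io
import math

# Made-up illustration data; a real task would read a CSV file from the sandbox.
SAMPLE_CSV = """education_years,income
10,30
12,36
14,42
16,48
"""


def load_columns(text: str, x_name: str, y_name: str) -> tuple[list[float], list[float]]:
    """Parse CSV text and return two named columns as lists of floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(row[x_name]) for row in rows]
    ys = [float(row[y_name]) for row in rows]
    return xs, ys


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs, ys = load_columns(SAMPLE_CSV, "education_years", "income")
r = pearson(xs, ys)
print(f"rows={len(xs)} pearson_r={r:.3f}")
```

An agent would typically run several such steps (summary statistics, group comparisons, regressions) before committing to its single submit_hypothesis call.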
Environment Difficulty
Challenging. Requires statistical reasoning, domain understanding, and clear scientific communication.
Safety
Code is executed in an isolated sandbox. HMS evaluation uses GPT-5-mini API calls (requires OpenAI API key).
Citations
@article{majumder2024discoverybench,
title={DiscoveryBench: Towards Data-Driven Discovery with Large Language Models},
author={Bodhisattwa Prasad Majumder and Harshit Surana and Dhruv Agarwal and Bhavana Dalvi Mishra and Abhijeetsingh Meena and Aryan Prakhar and Tirth Vora and Tushar Khot and Ashish Sabharwal and Peter Clark},
year={2024},
eprint={2407.01725},
archivePrefix={arXiv},
primaryClass={cs.CL},
}
@article{bragg2025astabench,
title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
journal={arXiv preprint arXiv:2510.21652},
year={2025},
}