DiscoveryBench

OpenReward Environment

Description

DiscoveryBench is an environment for evaluating data-driven scientific hypothesis discovery. Agents analyze CSV datasets and formulate hypotheses, which are scored against gold-standard hypotheses using the Hypothesis Matching Score (HMS), with gpt-5-mini acting as the LLM judge.

Capabilities

  • Exploratory data analysis
  • Statistical hypothesis formulation
  • Scientific reasoning across domains (humanities, social science, economics, etc.)

Compute Requirements

Agents are given a sandboxed Docker environment. The default sandbox size is 1 CPU and 2 GB RAM. Network access is enabled, and no GPU is required.

Tasks

  • Train split: ~25 tasks (from DiscoveryBench real/train, 4 topics)
  • Test split: 239 tasks (from DiscoveryBench real/test, 10 topics)
  • Each task provides one or more CSV datasets and a natural language query. Topics include archaeology, economics, NLP/requirements engineering, and more.

Reward Structure

Continuous reward via HMS (Hypothesis Matching Score):

  • HMS = context_score × var_f1 × rel_score
  • context_score (0 or 1): Does the predicted hypothesis context match the gold context?
  • var_f1 (0–1): F1 score of variable overlap (fuzzy matching)
  • rel_score (0, 0.5, or 1.0): Relationship similarity
  • Scored by gpt-5-mini as LLM judge (original DiscoveryBench/AstaBench used gpt-4o-2024-08-06)
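
As a rough illustration of how the three components combine, the sketch below multiplies them into a single HMS value. It is not the environment's scoring code: the function names are hypothetical, and the variable F1 here uses exact set matching as a stand-in for the fuzzy matching that the LLM judge performs.

```python
def variable_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over variable names.

    Assumption: exact string matching, used only as a stand-in for the
    fuzzy matching done by the LLM judge in the real metric.
    """
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def hms(context_score: int, var_f1: float, rel_score: float) -> float:
    """Combine the three components: HMS = context_score * var_f1 * rel_score."""
    if context_score not in (0, 1):
        raise ValueError("context_score must be 0 or 1")
    if not 0.0 <= var_f1 <= 1.0:
        raise ValueError("var_f1 must lie in [0, 1]")
    if rel_score not in (0.0, 0.5, 1.0):
        raise ValueError("rel_score must be 0, 0.5, or 1.0")
    return context_score * var_f1 * rel_score


# Toy example with made-up variable names (not taken from the benchmark).
predicted_vars = {"income", "education", "age"}
gold_vars = {"income", "education"}

f1 = variable_f1(predicted_vars, gold_vars)  # precision 2/3, recall 1
score = hms(context_score=1, var_f1=f1, rel_score=0.5)
print(f"var_f1={f1:.2f}, HMS={score:.2f}")
```

Because the components are multiplied, a context mismatch (context_score = 0) zeroes the reward regardless of how well the variables and relationship match.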

Data

  • Source: allenai/discoverybench on HuggingFace
  • Gold hypotheses (test): From answer_key/answer_key_real.csv
  • Gold hypotheses (train): Embedded in metadata files
  • Format: Task JSONs + CSV data files staged per topic for sandbox bucket mount

Tools

  • bash: Execute shell commands in the sandbox (for data analysis)
  • submit_hypothesis: Submit a hypothesis and workflow for HMS evaluation (terminal action, one attempt)

Time Horizon

Multi-turn. Agents typically explore data via multiple bash calls before submitting. Expected: 5–30 tool calls.
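
A minimal sketch of the kind of exploratory step an agent might run through the bash tool before submitting is shown below. It uses only the Python standard library; the helper names, the inline CSV, and its column names are made up for illustration and do not come from the benchmark data. In the sandbox, the text would be read from one of the task's CSV files instead.

```python
import csv
import io
import math


def load_numeric_columns(csv_text: str) -> dict[str, list[float]]:
    """Parse CSV text and keep only columns where every value is numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns: dict[str, list[float]] = {}
    for name in (rows[0].keys() if rows else []):
        try:
            columns[name] = [float(row[name]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric columns and columns with missing values
    return columns


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Toy data standing in for a task CSV (hypothetical columns).
SAMPLE_CSV = """site,depth_cm,artifact_count
A,10,3
B,20,5
C,30,7
D,40,9
"""

columns = load_numeric_columns(SAMPLE_CSV)
r = pearson(columns["depth_cm"], columns["artifact_count"])
print(sorted(columns), f"r={r:.2f}")
```

A check like this only suggests a candidate relationship; the submitted hypothesis still needs to state the context, the variables, and the relationship clearly, since those are what the HMS components score.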

Environment Difficulty

Challenging. Requires statistical reasoning, domain understanding, and clear scientific communication.

Safety

Code is executed in an isolated sandbox. HMS evaluation makes API calls to gpt-5-mini and therefore requires an OpenAI API key.

Citations

@article{majumder2024discoverybench,
      title={DiscoveryBench: Towards Data-Driven Discovery with Large Language Models},
      author={Bodhisattwa Prasad Majumder and Harshit Surana and Dhruv Agarwal and Bhavana Dalvi Mishra and Abhijeetsingh Meena and Aryan Prakhar and Tirth Vora and Tushar Khot and Ashish Sabharwal and Peter Clark},
      year={2024},
      eprint={2407.01725},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}

@article{bragg2025astabench,
      title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
      author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
      journal={arXiv preprint arXiv:2510.21652},
      year={2025},
}