GDPVal
Description
GDPVal is an environment for evaluating agents on real-world knowledge work tasks. Based on OpenAI's GDPval benchmark, it presents agents with workplace tasks across 44 occupations; agents must analyze reference materials (Excel, PDF, Word, PowerPoint) and create deliverable files. Evaluation uses weighted rubric-based scoring with an LLM grader that has tool access to inspect submitted files.
Capabilities
- Analyzing reference materials (Excel, PDF, Word, PowerPoint documents)
- Creating deliverable files (reports, spreadsheets, presentations)
- Meeting detailed rubric criteria (30-56 criteria per task)
- Multi-step knowledge work reasoning across 44 occupations
Compute Requirements
Agents in GDPVal are given a sandbox with 2 CPUs and 2 GB RAM, with access to document manipulation tools (Excel, PDF, Word, PowerPoint toolsets).
License
Tasks
There is one split: testv2 (220 tasks). Each task corresponds to a real-world knowledge work challenge across 44 occupations spanning 9 major U.S. GDP-contributing sectors (Professional Services, Finance, Healthcare, Education, etc.).
Reward Structure
This is a multi-turn environment with weighted rubric-based scoring. The agent works in the sandbox and calls submit_deliverable when finished. An LLM grader (gpt-5-mini with tool access) evaluates each rubric criterion:
- Reward ranges from 0.0 to 1.0, computed as (sum of passed criteria points) / (total possible points)
- Tasks have 30-56 evaluation criteria with varying point values
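The weighted scoring above can be sketched as follows (a minimal illustration; the criterion descriptions and point values are hypothetical, not actual GDPVal rubric items):

```python
# Minimal sketch of weighted rubric scoring: reward = passed points / total points.

def rubric_reward(criteria):
    """criteria: list of (points, passed) pairs as judged by the LLM grader."""
    total = sum(points for points, _ in criteria)
    earned = sum(points for points, passed in criteria if passed)
    return earned / total if total else 0.0

criteria = [
    (5, True),   # e.g. "Report includes an executive summary"
    (3, False),  # e.g. "All figures cite source data"
    (2, True),   # e.g. "File delivered in the requested format"
]
print(rubric_reward(criteria))  # 7 of 10 points -> 0.7
```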
Data
Tasks are derived from the GDPval benchmark from OpenAI. Reference files and deliverable templates are stored on the OpenReward platform.
Tools
Agents are given CLI tools and document manipulation tools:
CLI Tools: bash, read, write, edit, glob, grep, ls, todo_write
Document Tools:
- Excel: excel_read, excel_write, excel_list_sheets, etc.
- PDF: pdf_extract_text, pdf_read, etc.
- Word: word_read, word_write, etc.
- PowerPoint: ppt_read, ppt_write, etc.
Submission: submit_deliverable - Submit the completed file for rubric-based evaluation.
Time Horizon
GDPVal is a multi-turn environment. Agents iteratively analyze reference files, create deliverables, and refine their work before submitting the final file.
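The analyze-draft-refine-submit loop described above might look like the following (purely illustrative; the callables passed in are hypothetical stand-ins for the environment's tools, not its actual API):

```python
# Hypothetical sketch of the multi-turn loop: read references, draft a
# deliverable, revise until a self-check passes, then submit.
# The four callables are stand-ins, not the environment's real interface.

def run_task(read_refs, draft, check_rubric, submit):
    """Iteratively refine a deliverable before final submission."""
    context = read_refs()                 # e.g. excel_read / pdf_extract_text
    deliverable = draft(context)          # e.g. word_write / ppt_write
    while not check_rubric(deliverable):  # self-check against the task brief
        deliverable = draft(context)      # revise based on remaining gaps
    return submit(deliverable)            # e.g. submit_deliverable
```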
Environment Difficulty
Model performance on GDPVal from the original paper (win rates against human experts):
| Model | Win Rate |
|---|---|
| Claude Opus 4.1 | 47.6% |
| GPT-5 | 39.0% |
| o3 | 35.2% |
| o4-mini | 29.1% |
| GPT-4o | 12.5% |
Frontier models are approaching but have not yet matched industry experts (averaging 14 years of experience) in deliverable quality.
Other Environment Requirements
- OpenAI API key: Required for LLM-based rubric grading. Pass via secrets={"openai_api_key": "..."}.
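A common way to build that secrets mapping is to read the key from an environment variable rather than hard-coding it (the OPENAI_API_KEY variable name is the conventional one for OpenAI clients; the fallback placeholder here is for illustration only):

```python
import os

# Build the secrets mapping shown above, reading the key from an
# environment variable with a placeholder fallback for illustration.
openai_api_key = os.environ.get("OPENAI_API_KEY", "sk-...")
secrets = {"openai_api_key": openai_api_key}
```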
Safety
Agents in GDPVal complete knowledge work tasks in a sandboxed environment. The environment does not involve sensitive personal data or real business operations.
Citations
@article{patwardhan2025gdpval,
title={GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks},
author={Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim{\'o}n Posada and Aljubeh, Marwan and Thacker, Phoebe and Fauconnet, Laurance and Kim, Natalie S. and Chao, Patrick and Miserendino, Samuel and Chabot, Gildas and Li, David and Sharman, Michael and Barr, Alexandra and Glaese, Amelia and Tworek, Jerry},
journal={arXiv preprint arXiv:2510.04374},
year={2025},
url={https://arxiv.org/abs/2510.04374}
}