GDPVal
Description
GDPVal is an environment for evaluating agents on real-world knowledge work tasks. Based on OpenAI's GDPval benchmark, it presents agents with workplace tasks across 44 occupations; agents must analyze reference materials (Excel, PDF, Word, PowerPoint) and create deliverable files. Evaluation uses weighted rubric-based scoring with an LLM grader that has tool access to inspect submitted files.
Capabilities
- Analyzing reference materials (Excel, PDF, Word, PowerPoint documents)
- Creating deliverable files (reports, spreadsheets, presentations)
- Meeting detailed rubric criteria (30-56 criteria per task)
- Multi-step knowledge work reasoning across 44 occupations
Compute Requirements
Agents in GDPVal are given a sandbox with 2 CPUs and 2 GB RAM, with access to document manipulation tools (Excel, PDF, Word, PowerPoint toolsets).
License
Tasks
There is one split: testv2 (220 tasks). Each task corresponds to a real-world knowledge work challenge across 44 occupations spanning 9 major U.S. GDP-contributing sectors (Professional Services, Finance, Healthcare, Education, etc.).
Reward Structure
This is a multi-turn environment with weighted rubric-based scoring. The agent works in the sandbox and calls submit_deliverable when finished. An LLM grader (gpt-5-mini with tool access) evaluates each rubric criterion:
- Reward ranges from 0.0 to 1.0, computed as (sum of passed criteria points) / (total possible points)
- Tasks have 30-56 evaluation criteria with varying point values
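The weighted scoring above can be sketched as follows (a minimal illustration; the criterion descriptions and point values are hypothetical, not actual GDPVal rubric items):

```python
# Minimal sketch of weighted rubric scoring: reward = passed points / total points.

def rubric_reward(criteria):
    """criteria: list of (points, passed) pairs as judged by the LLM grader."""
    total = sum(points for points, _ in criteria)
    earned = sum(points for points, passed in criteria if passed)
    return earned / total if total else 0.0

criteria = [
    (5, True),   # e.g. "Report includes an executive summary"
    (3, False),  # e.g. "All figures cite source data"
    (2, True),   # e.g. "File delivered in the requested format"
]
print(rubric_reward(criteria))  # 7 of 10 points -> 0.7
```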
Data
Tasks are derived from the GDPval benchmark from OpenAI. Reference files and deliverable templates are stored on the OpenReward platform.
Tools
Agents are given CLI tools and document manipulation tools:
CLI Tools: bash, read, write, edit, glob, grep, ls, todo_write
Document Tools:
- Excel: excel_read, excel_write, excel_list_sheets, etc.
- PDF: pdf_extract_text, pdf_read, etc.
- Word: word_read, word_write, etc.
- PowerPoint: ppt_read, ppt_write, etc.
Submission: submit_deliverable - Submit the completed file for rubric-based evaluation.
Time Horizon
GDPVal is a multi-turn environment. Agents iteratively analyze reference files, create deliverables, and refine their work before submitting the final file.
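The analyze-draft-refine-submit loop described above might look like the following (purely illustrative; the callables passed in are hypothetical stand-ins for the environment's tools, not its actual API):

```python
# Hypothetical sketch of the multi-turn loop: read references, draft a
# deliverable, revise until a self-check passes, then submit.
# The four callables are stand-ins, not the environment's real interface.

def run_task(read_refs, draft, check_rubric, submit):
    """Iteratively refine a deliverable before final submission."""
    context = read_refs()                 # e.g. excel_read / pdf_extract_text
    deliverable = draft(context)          # e.g. word_write / ppt_write
    while not check_rubric(deliverable):  # self-check against the task brief
        deliverable = draft(context)      # revise based on remaining gaps
    return submit(deliverable)            # e.g. submit_deliverable
```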
Environment Difficulty
Model performance on GDPVal from the original paper (win rates against human experts):
| Model | Win Rate |
|---|---|
| Claude Opus 4.1 | 47.6% |
| GPT-5 | 39.0% |
| o3 | 35.2% |
| o4-mini | 29.1% |
| GPT-4o | 12.5% |
Frontier models are approaching but have not yet matched industry experts (averaging 14 years of experience) in deliverable quality.
Other Environment Requirements
- OpenAI API key: Required for LLM-based rubric grading. Pass via secrets={"openai_api_key": "..."}.
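A common way to build that secrets mapping is to read the key from an environment variable rather than hard-coding it (the OPENAI_API_KEY variable name is the conventional one for OpenAI clients; the fallback placeholder here is for illustration only):

```python
import os

# Build the secrets mapping shown above, reading the key from an
# environment variable with a placeholder fallback for illustration.
openai_api_key = os.environ.get("OPENAI_API_KEY", "sk-...")
secrets = {"openai_api_key": openai_api_key}
```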
Safety
Agents in GDPVal complete knowledge work tasks in a sandboxed environment. The environment does not involve sensitive personal data or real business operations.
Citations
@article{patwardhan2025gdpval,
title={GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks},
author={Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim{\'o}n Posada and Aljubeh, Marwan and Thacker, Phoebe and Fauconnet, Laurance and Kim, Natalie S. and Chao, Patrick and Miserendino, Samuel and Chabot, Gildas and Li, David and Sharman, Michael and Barr, Alexandra and Glaese, Amelia and Tworek, Jerry},
journal={arXiv preprint arXiv:2510.04374},
year={2025},
url={https://arxiv.org/abs/2510.04374}
}