# OfficeQA

## Description
OfficeQA is an environment for evaluating AI agents on grounded, multi-document reasoning over a large corpus of U.S. Treasury Bulletins spanning nearly a century (1939–2025). Agents must retrieve relevant documents, parse dense financial tables, and perform multi-step analytical reasoning to answer precise numerical questions.
## Capabilities
- Document retrieval and search across 696 Treasury Bulletin text files
- Numerical reasoning and multi-step computation
- Statistical analysis (linear regression, geometric mean, correlation, etc.)
- Web search for external data (CPI values, exchange rates)
- Code execution for complex calculations
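The statistical capabilities above can be sketched in a few lines of NumPy. The debt figures below are purely illustrative, not actual Treasury Bulletin values:

```python
import numpy as np

# Hypothetical yearly debt figures (USD trillions) -- illustrative only.
years = np.array([2019, 2020, 2021, 2022, 2023], dtype=float)
debt = np.array([22.7, 26.9, 28.4, 30.9, 33.1])

# Linear regression: slope and intercept of debt vs. year
slope, intercept = np.polyfit(years, debt, 1)

# Geometric mean of year-over-year growth factors
growth = debt[1:] / debt[:-1]
geo_mean_growth = np.exp(np.log(growth).mean())

# Pearson correlation between year and debt
corr = np.corrcoef(years, debt)[0, 1]
```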
## Compute Requirements
- Sandbox: 0.5 CPU / 1 GB memory
- Network access enabled (required for 22% of questions)
## License
- Dataset (question CSVs): CC-BY-SA 4.0
- Code (reward.py): Apache License 2.0
- Treasury Bulletin documents: U.S. Government public domain
## Tasks

| Split | Source CSV | Count | Description |
|---|---|---|---|
| train | officeqa_full.csv | ~246 | Full development set (easy + hard) |
| test | officeqa_pro.csv | ~133 | Benchmark set (hard only) |
Each task contains a natural language question, difficulty rating, and pointers to relevant source documents.
## Reward Structure
Sparse, verifiable, binary reward (1.0 or 0.0). Scoring uses deterministic fuzzy numerical matching:
- Unit-aware comparison (million, billion, trillion, thousand)
- Configurable tolerance (default: exact match)
- Multi-number list support
- Text overlap checking for hybrid answers (e.g., "March 1977")
- `<FINAL_ANSWER>` tag extraction
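A minimal sketch of what unit-aware matching can look like, assuming the scorer normalizes magnitudes before comparing; the function and regex here are illustrative, and the actual reward.py may differ:

```python
import re

# Scale factors for unit words (assumed set; the real scorer may cover more).
UNITS = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}

def normalize(answer):
    """Extract the first number in a string, scaled by any unit word present."""
    m = re.search(r"-?[\d,]*\.?\d+", answer)
    if not m:
        return None
    value = float(m.group().replace(",", ""))
    for unit, scale in UNITS.items():
        if unit in answer.lower():
            value *= scale
            break
    return value

def numbers_match(predicted, gold, rel_tol=0.0):
    """Binary reward: True iff normalized values agree within tolerance.
    Default rel_tol=0.0 mirrors the exact-match default described above."""
    p, g = normalize(predicted), normalize(gold)
    if p is None or g is None:
        return False
    if rel_tol:
        return abs(p - g) <= rel_tol * abs(g)
    return p == g
```

With this normalization, "3,200 million" and "3.2 billion" reduce to the same magnitude and count as a match.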
## Data
- Corpus: 696 transformed Treasury Bulletin text files (~200 MB) with Markdown-formatted tables
- Source: databricks/officeqa on GitHub
- Mount path: `/home/ubuntu/documents/` in sandbox
## Tools

| Tool | Description |
|---|---|
| bash | Execute shell commands: search documents, run Python, access the web |
| submit | Submit final answer for scoring (single attempt) |
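A typical retrieval step with the bash tool might look like the following. The mount path is the one given in this README; the file name and search phrase are illustrative, and a stand-in file is created here so the commands are runnable anywhere:

```shell
# Stand-in for a mounted bulletin file (real files live under
# /home/ubuntu/documents/ in the sandbox).
mkdir -p /tmp/documents
printf 'Table FD-1\nTotal gross public debt: 33,100 million\n' \
    > /tmp/documents/bulletin_2023.txt

# 1. Find files mentioning the phrase of interest
grep -rl "gross public debt" /tmp/documents/

# 2. Pull the matching lines, with line numbers, from a candidate file
grep -n "gross public debt" /tmp/documents/bulletin_2023.txt
```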
## Time Horizon
Multi-turn. Agents typically perform 10–50+ tool calls: searching documents, reading files, writing and executing Python scripts, and optionally querying external data sources before submitting.
## Environment Difficulty
Frontier models achieve ~57% accuracy on the Pro (hard) set. Human performance is ~34–51% depending on setup. 62% of questions require analytical reasoning beyond basic arithmetic.
## Safety
Treasury Bulletin data is publicly available U.S. Government financial information. No PII or sensitive data concerns. Questions are factual and grounded in historical financial records.
## Citations

```bibtex
@article{opsahlong2025officeqapro,
  title={OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning},
  author={Krista Opsahl-Ong and Arnav Singhvi and Jasmine Collins and Ivan Zhou and Cindy Wang and Ashutosh Baheti and Owen Oertell and Jacob Portes and Sam Havens and Erich Elsen and Michael Bendersky and Matei Zaharia and Xing Chen},
  year={2026},
  eprint={2603.08655},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.08655},
}
```