Implementation of arXiv/spreadsheetbench

SpreadsheetBench

Description

SpreadsheetBench is an environment for evaluating LLM agents on real-world spreadsheet manipulation tasks. It comprises 905 instructions sourced from online Excel forums (the original dataset has 912; 7 were excluded because of broken metadata). Each instruction comes with multiple test cases (typically 3), so that online-judge (OJ) style evaluation checks whether a solution generalizes.

Capabilities

Reading and analyzing Excel spreadsheet structure
Writing Python code to manipulate spreadsheet data
Handling cell-level and sheet-level operations
Producing general solutions that work across multiple test cases

Compute Requirements

Sandbox: 0.5 CPU / 1 GB memory per session
Network access: enabled
No GPU required

License

CC-BY-SA-4.0

Tasks

Single test split with 905 tasks. Each task has 1–3 test cases (~2,700 total) with different spreadsheet values but the same instruction. Tasks are divided into:

Cell-Level Manipulation (560 tasks): modifying specific cells or ranges
Sheet-Level Manipulation (345 tasks): modifying entire sheets, cross-sheet operations

Reward Structure

Binary reward (0.0 or 1.0). The agent's Python script is executed on each test case input file. Cell values at the specified answer_position are compared against ground-truth answer files. All test cases must pass for reward=1.0 (OJ-style hard metric).

Data

Source: KAKA22/SpreadsheetBench on HuggingFace
Format: Excel .xlsx files with JSON metadata
Size: ~91 MB compressed
Input spreadsheets mounted read-only at /data/ in the sandbox

Tools

bash — Execute bash commands in the sandbox for writing code and testing
submit — Submit a Python script for OJ-style evaluation across all test cases
excel_list_tabs_in_spreadsheet — List all worksheet names
excel_read_tab — Read data from a specific worksheet
excel_read_csv — Read CSV files
excel_create_spreadsheet, excel_add_tab, excel_edit_spreadsheet, excel_add_content_text, excel_delete_content_cell, excel_create_chart, excel_delete_tab, excel_delete_spreadsheet — Full Excel manipulation via the ExcelToolset

Time Horizon

Multi-turn. Agents typically explore the spreadsheet, write a solution script, test it, and submit. Average interaction involves 5–15 tool calls.

Environment Difficulty

The original benchmark reports ChatGPT Agent achieving a 45.5% task success rate, which indicates substantial difficulty. Tasks range from simple cell extraction to complex multi-sheet operations.

Safety

Tasks involve spreadsheet data manipulation only. Input data is sourced from public Excel forum questions and contains no personally identifiable or sensitive information.

Citations

@inproceedings{ma2024spreadsheetbench,
  title={SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation},
  author={Ma, Zeyao and Zhang, Bohan and Zhang, Jing and Yu, Jifan and Zhang, Xiaokang and Zhang, Xiaohan and Luo, Sijia and Wang, Xi and Tang, Jie},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}

Repository

Source repository: EnvCommons/SpreadsheetBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	0.5 vCPUs / 1 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000115
Total	$0.0000435

Examples

5-minute session: $0.0131

1-hour session: $0.1566
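
The example figures follow from multiplying the combined per-second rate by the session length. The sketch below reproduces that arithmetic; the function name session_cost and the half-up rounding to four decimal places are assumptions chosen to match the figures shown, not a documented billing rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-second rates taken from the cost table.
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000115")


def session_cost(seconds: int) -> Decimal:
    """Estimated cost in dollars for a session of the given length."""
    total_rate = ENVIRONMENT_RATE + SANDBOX_RATE  # 0.0000435 per second
    # Assumption: round half-up to four decimal places, as in the examples.
    return (total_rate * seconds).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


five_minutes = session_cost(5 * 60)   # 300 s * 0.0000435 = 0.01305
one_hour = session_cost(60 * 60)      # 3600 s * 0.0000435 = 0.1566
```

Decimal is used instead of float so that the 0.01305 midpoint rounds predictably.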
