Implementation of arXiv/longbenchv2

LongBench-v2

Description

LongBench-v2 is an environment for evaluating long-context understanding and reasoning. It is based on the LongBench v2 benchmark from THUDM: agents are given long documents (8K–2M words) and must answer multiple-choice questions (A/B/C/D) that require deep comprehension across six task domains.

Capabilities

Long-context document comprehension (8K–2M words)
Multiple-choice reasoning over extended text
Cross-domain understanding including QA, code, dialogue, structured data, and in-context learning

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0

Tasks

One primary split: test (503 tasks).

Also available as domain-based splits:

single-doc-qa (175 tasks)
multi-doc-qa (125 tasks)
long-icl (81 tasks)
long-dialogue (39 tasks)
code-repo (50 tasks)
structured-data (33 tasks)

Tasks span two difficulty levels:

easy: 192 tasks
hard: 311 tasks

Tasks span three length categories:

short (8K–30K words): 180 tasks
medium (30K–100K words): 215 tasks
long (100K–2M words): 108 tasks
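
The three breakdowns above (domain, difficulty, length) each partition the same 503 test tasks. As a quick sanity check, the sketch below sums each breakdown and computes percentage shares. The dictionary and function names are illustrative only and are not part of the environment.

```python
# Task counts copied from the lists above; the names are illustrative only.
TOTAL_TASKS = 503

DOMAIN_SPLITS = {
    "single-doc-qa": 175,
    "multi-doc-qa": 125,
    "long-icl": 81,
    "long-dialogue": 39,
    "code-repo": 50,
    "structured-data": 33,
}
DIFFICULTY = {"easy": 192, "hard": 311}
LENGTH = {"short": 180, "medium": 215, "long": 108}


def is_partition(counts: dict[str, int], total: int = TOTAL_TASKS) -> bool:
    """Return True if the category counts add up to the full test split."""
    return sum(counts.values()) == total


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of its breakdown, in percent (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for label, counts in [("domain", DOMAIN_SPLITS),
                          ("difficulty", DIFFICULTY),
                          ("length", LENGTH)]:
        print(label, is_partition(counts), shares(counts))
```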

Reward Structure

Single-turn evaluation. Agents submit an answer choice (A/B/C/D) via the submit_answer tool. The reward is deterministic and based on an exact match with the correct choice:

1.0 if the submitted answer is correct
0.0 if the submitted answer is incorrect
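
The exact-match rule can be sketched as follows. This is a minimal illustration, not the environment's actual scoring code: the function names are hypothetical, and the sketch assumes no case or whitespace normalization, since the description only mentions exact match.

```python
VALID_CHOICES = {"A", "B", "C", "D"}


def compute_reward(submitted: str, correct: str) -> float:
    """Return 1.0 if the submitted letter equals the correct one, else 0.0.

    Assumption: the comparison is a strict string match, so "a" or " A"
    does not count as "A".
    """
    if correct not in VALID_CHOICES:
        raise ValueError(f"correct answer must be one of A/B/C/D, got {correct!r}")
    return 1.0 if submitted == correct else 0.0


def accuracy(rewards: list[float]) -> float:
    """Mean reward over a set of tasks, as a percentage."""
    if not rewards:
        raise ValueError("no rewards to average")
    return 100.0 * sum(rewards) / len(rewards)
```

Because each task yields either 0.0 or 1.0, the mean reward over a split is directly comparable to the accuracy figures reported for the benchmark.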

Data

The dataset is a single file, longbench_v2.parquet (162 MB), sourced from THUDM/LongBench-v2 on Hugging Face. It is stored on the OpenReward platform.

Tools

submit_answer: Submit an answer choice (A, B, C, or D) to the multiple-choice question. This is the only tool available and completes the task.
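
Since submit_answer takes a single letter and ends the task, an agent harness typically needs to reduce free-form model output to one of A, B, C, or D before calling it. The sketch below shows one way to do that. The helper name extract_choice and the phrasing it looks for are assumptions made for illustration; the argument schema of submit_answer is not documented here, so the sketch stops at producing the letter.

```python
import re
from typing import Optional

# Matches phrases such as "The correct answer is (B)" or "answer is: C".
# The negative lookahead rejects words that merely start with A-D ("Data").
_ANSWER_PHRASE = re.compile(r"[Aa]nswer is:?\s*\(?([A-D])(?![A-Za-z])")

# Matches a reply that consists solely of the letter, optionally in parentheses.
_BARE_LETTER = re.compile(r"\(?([A-D])\)?")


def extract_choice(model_output: str) -> Optional[str]:
    """Return the answer letter found in model_output, or None if unclear."""
    match = _ANSWER_PHRASE.search(model_output)
    if match:
        return match.group(1)
    match = _BARE_LETTER.fullmatch(model_output.strip())
    if match:
        return match.group(1)
    return None
```

Returning None instead of guessing leaves the decision to the caller. Because the task allows only one tool call, a harness may prefer to re-prompt the model before submitting rather than send an arbitrary letter.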

Time Horizon

Single-turn evaluation with one tool call.

Environment Difficulty

The LongBench v2 leaderboard reports the following accuracy for frontier models:

Model	Accuracy
Gemini-2.5-Pro	63.3%
Gemini-2.5-Flash	62.1%
Qwen3-235B-A22B-Thinking	60.6%
DeepSeek-R1	58.3%
o1-preview	57.7%
Human Baseline	53.7%
GPT-4o	51.4%
Claude 3.5 Sonnet	46.7%
Qwen2.5-72B	43.5%
o1-mini	38.9%

Human experts achieved 53.7% accuracy under a 15-minute time constraint.
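
To put the table in context, the sketch below compares each model with the human baseline and with uniform random guessing, which scores 25% in expectation on four-option questions. The scores are copied from the table above; the variable and function names are illustrative only.

```python
HUMAN_BASELINE = 53.7   # accuracy %, experts under a 15-minute limit
RANDOM_CHANCE = 25.0    # expected accuracy % when guessing among A/B/C/D

# Model accuracies (%) copied from the table above.
LEADERBOARD = {
    "Gemini-2.5-Pro": 63.3,
    "Gemini-2.5-Flash": 62.1,
    "Qwen3-235B-A22B-Thinking": 60.6,
    "DeepSeek-R1": 58.3,
    "o1-preview": 57.7,
    "GPT-4o": 51.4,
    "Claude 3.5 Sonnet": 46.7,
    "Qwen2.5-72B": 43.5,
    "o1-mini": 38.9,
}


def models_above(threshold: float, scores: dict[str, float] = LEADERBOARD) -> list[str]:
    """Return the models whose accuracy strictly exceeds the threshold."""
    return [name for name, acc in scores.items() if acc > threshold]


if __name__ == "__main__":
    print("Above human baseline:", models_above(HUMAN_BASELINE))
    print("Above random chance:", models_above(RANDOM_CHANCE))
```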

Other Environment Requirements

There are no further environment requirements; LongBench-v2 works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in LongBench-v2 answer multiple-choice questions about long documents in a standard environment. The environment does not present direct safety risks.

Citations

@misc{bai2025longbenchv2deeperunderstanding,
      title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks}, 
      author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
      year={2025},
      eprint={2412.15204},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.15204}, 
}

Repository

Source repository

EnvCommons/LongBench-v2

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
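
The example figures follow directly from the per-second rate: cost = rate × seconds. The sketch below reproduces that arithmetic with Decimal to avoid floating-point rounding. The function and constant names are illustrative, and the sandbox rate is assumed to be zero because the sandbox is listed as not configured.

```python
from decimal import Decimal

ENVIRONMENT_RATE = Decimal("0.0000320")  # USD per second, from the table above
SANDBOX_RATE = Decimal("0")              # assumption: "Not configured" means no charge


def session_cost(seconds: int) -> Decimal:
    """Return the cost in USD of an active session lasting the given seconds."""
    if seconds < 0:
        raise ValueError("session length cannot be negative")
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


if __name__ == "__main__":
    print("5-minute session:", session_cost(5 * 60))
    print("1-hour session:", session_cost(60 * 60))
```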

LongBench-v2

bys0318/LongBench-v2