LongBench-v2

API Endpoint
Leaderboard
Loading leaderboard...
Implementation of
README

LongBench-v2

OpenReward Badge Hugging Face Badge

Description

LongBench-v2 is an environment for evaluating long-context understanding and reasoning. Based on the LongBench v2 benchmark from THUDM, agents are given long documents (8K–2M words) and must answer multiple-choice questions (A/B/C/D) that require deep comprehension across six task domains.

Capabilities

  • Long-context document comprehension (8K–2M words)
  • Multiple-choice reasoning over extended text
  • Cross-domain understanding including QA, code, dialogue, structured data, and in-context learning

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0

Tasks

One primary split: test (503 tasks).

Also available as domain-based splits:

  • single-doc-qa (175 tasks)
  • multi-doc-qa (125 tasks)
  • long-icl (81 tasks)
  • long-dialogue (39 tasks)
  • code-repo (50 tasks)
  • structured-data (33 tasks)

Tasks span three difficulty levels:

  • easy: 192 tasks
  • hard: 311 tasks

Tasks span three length categories:

  • short (8K–30K words): 180 tasks
  • medium (30K–100K words): 215 tasks
  • long (100K–2M words): 108 tasks

Reward Structure

Single-turn evaluation. Agents submit an answer choice (A/B/C/D) via the submit_answer tool. Reward is deterministic based on exact match:

  • 1.0 if the submitted answer is correct
  • 0.0 if the submitted answer is incorrect

Data

longbench_v2.parquet (162 MB) sourced from HuggingFace THUDM/LongBench-v2. Data is stored on the OpenReward platform.

Tools

submit_answer: Submit an answer choice (A, B, C, or D) to the multiple-choice question. This is the only tool available and completes the task.

Time Horizon

Single-turn evaluation with one tool call.

Environment Difficulty

The LongBench v2 Leaderboard evaluates frontier models (Accuracy %):

ModelAccuracy
Gemini-2.5-Pro63.3%
Gemini-2.5-Flash62.1%
Qwen3-235B-A22B-Thinking60.6%
DeepSeek-R158.3%
o1-preview57.7%
Human Baseline53.7%
GPT-4o51.4%
Claude 3.5 Sonnet46.7%
Qwen2.5-72B43.5%
o1-mini38.9%

Human experts achieved 53.7% accuracy under a 15-minute time constraint.

Other Environment Requirements

There are no further environment requirements; LongBench-v2 works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in LongBench-v2 answer multiple-choice questions about long documents in a standard environment. The environment does not present direct safety risks.

Citations

@misc{bai2025longbenchv2deeperunderstanding,
      title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks}, 
      author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
      year={2025},
      eprint={2412.15204},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.15204}, 
}
bys0318/LongBench-v2 | OpenReward