MRCRV2

OpenReward Environment · Hugging Face Dataset

Description

MRCRV2 is an environment for evaluating multi-round co-reference resolution in long contexts, based on OpenAI's MRCR benchmark. Agents are given long multi-turn conversations (up to 1M+ tokens) containing multiple identical requests scattered throughout, and must identify and reproduce one specific instance. Tasks test precise long-context retrieval with 2-, 4-, or 8-needle variants.
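For illustration, a single task might look like the sketch below. The field names (`prompt`, `answer`, `random_string_to_prepend`) are assumptions borrowed from the upstream openai/mrcr schema, and the conversation content is invented:

```python
import json

# Hypothetical task row; the column names follow the upstream
# openai/mrcr dataset and are assumptions, not this environment's
# confirmed schema.
row = {
    # The conversation is stored as a JSON-encoded message list.
    "prompt": json.dumps([
        {"role": "user", "content": "write a poem about tapirs"},
        {"role": "assistant", "content": "Tapirs in the twilight..."},
        {"role": "user", "content": "write a poem about tapirs"},
        {"role": "assistant", "content": "A snuffling nose..."},
        {"role": "user", "content": "Prepend mGJzXo to the first poem about tapirs."},
    ]),
    # Expected answer: the required random prefix plus the target response.
    "answer": "mGJzXoTapirs in the twilight...",
    "random_string_to_prepend": "mGJzXo",
}

messages = json.loads(row["prompt"])
# Two identical "needle" requests appear in the conversation; the final
# user turn asks the agent to reproduce one specific instance.
needles = [m for m in messages if m["content"] == "write a poem about tapirs"]
print(len(needles))  # 2
```

In real tasks the identical requests are scattered across a much longer conversation, so the agent must track which instance is which rather than match on surface text.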

Capabilities

  • Long-context retrieval and comprehension (up to 1M+ tokens)
  • Multi-round co-reference resolution
  • Distinguishing between identical requests in extended conversations

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

One split: test (2,400 tasks). Tasks span three needle counts (2-, 4-, and 8-needle variants) with context lengths from roughly 4K to over 1M tokens.

Reward Structure

Single-turn. The agent submits its response via the answer tool. Evaluation uses Python's SequenceMatcher to compute a sequence match ratio between the response and the ground truth. Reward is continuous in [0.0, 1.0], with 1.0 requiring an exact match. The response must start with a required random prefix string.
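The scoring rule can be sketched as follows. This is a minimal reconstruction, not the platform's exact grader; the prefix-stripping step mirrors OpenAI's published MRCR grading logic and is an assumption here:

```python
from difflib import SequenceMatcher


def grade(response: str, answer: str, random_prefix: str) -> float:
    """Continuous reward in [0.0, 1.0]; 1.0 requires an exact match."""
    # A response that lacks the required random prefix scores zero.
    if not response.startswith(random_prefix):
        return 0.0
    # Compare the remainder of the response against the ground truth.
    response = response.removeprefix(random_prefix)
    answer = answer.removeprefix(random_prefix)
    return float(SequenceMatcher(None, response, answer).ratio())


print(grade("XYZ hello world", "XYZ hello world", "XYZ"))  # 1.0
print(grade("hello world", "XYZ hello world", "XYZ"))      # 0.0
```

Because the ratio is continuous, near-misses (e.g. a single dropped character) earn partial credit rather than zero.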

Data

Parquet files sourced from the Hugging Face dataset openai/mrcr and stored on the OpenReward platform.

Tools

answer — submit a response for sequence-match evaluation against the ground truth.
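As an illustration, an answer-tool invocation might carry a payload like the following. The shape and field names are a hypothetical sketch, not the OpenReward wire format:

```python
import json

# Hypothetical answer-tool call; "tool", "arguments", and "response"
# are assumed field names for illustration only.
call = {
    "tool": "answer",
    "arguments": {
        # The response must begin with the task's required random prefix.
        "response": "mGJzXoTapirs in the twilight...",
    },
}
payload = json.dumps(call)
print(payload)
```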

Time Horizon

Single-turn.

Environment Difficulty

Tasks require precise retrieval from conversations spanning up to 1M+ tokens, with identical requests acting as confounding needles.

| Model             | Configuration  | Score |
|-------------------|----------------|-------|
| GPT-5.2           | 4-needle, 256K | 98%   |
| Claude Sonnet 4.6 | 4-needle, 256K | ~82%  |
| Claude Opus 4.6   | 8-needle, 1M   | 76%   |
| Gemini 3 Pro      | 8-needle, 128K | 77%   |
| Gemini 3 Pro      | 8-needle, 1M   | 26.3% |

Other Environment Requirements

There are no further environment requirements; MRCRV2 works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MRCRV2 answer questions about long conversations in a standard environment. The environment does not present direct safety risks.

Citation

@article{vodrahalli2024michelangelo,
  title={Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries},
  author={Vodrahalli, Kiran and Ontanon, Santiago and Tripuraneni, Nilesh and Xu, Kelvin and Jain, Sanil and Shivanna, Rakesh and Hui, Jeffrey and Dikkala, Nishanth and Kazemi, Mehran and Fatemi, Bahare and Anil, Rohan and Dyer, Ethan and Shakeri, Siamak and Vij, Roopali and Mehta, Harsh and Ramasesh, Vinay and Le, Quoc and Chi, Ed and Lu, Yifeng and Firat, Orhan and Lazaridou, Angeliki and Lespiau, Jean-Baptiste and Attaluri, Nithya and Olszewska, Kate},
  journal={arXiv preprint arXiv:2409.12640},
  year={2024}
}