MRCRV2

OpenReward Environment · Hugging Face Dataset

Description

MRCRV2 is an environment for evaluating multi-round co-reference resolution in long contexts, based on OpenAI's MRCR benchmark. Agents are given long multi-turn conversations (up to 1M+ tokens) containing multiple identical requests scattered throughout, and must identify and reproduce one specific instance. Tasks test precise long-context retrieval with 2-, 4-, or 8-needle variants.
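For illustration, a single task might look like the sketch below. The field names (`prompt`, `answer`, `random_string_to_prepend`) are assumptions borrowed from the upstream openai/mrcr schema, and the conversation content is invented:

```python
import json

# Hypothetical task row; the column names follow the upstream
# openai/mrcr dataset and are assumptions, not this environment's
# confirmed schema.
row = {
    # The conversation is stored as a JSON-encoded message list.
    "prompt": json.dumps([
        {"role": "user", "content": "write a poem about tapirs"},
        {"role": "assistant", "content": "Tapirs in the twilight..."},
        {"role": "user", "content": "write a poem about tapirs"},
        {"role": "assistant", "content": "A snuffling nose..."},
        {"role": "user", "content": "Prepend mGJzXo to the first poem about tapirs."},
    ]),
    # Expected answer: the required random prefix plus the target response.
    "answer": "mGJzXoTapirs in the twilight...",
    "random_string_to_prepend": "mGJzXo",
}

messages = json.loads(row["prompt"])
# Two identical "needle" requests appear in the conversation; the final
# user turn asks the agent to reproduce one specific instance.
needles = [m for m in messages if m["content"] == "write a poem about tapirs"]
print(len(needles))  # 2
```

In real tasks the identical requests are scattered across a much longer conversation, so the agent must track which instance is which rather than match on surface text.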

Capabilities

  • Long-context retrieval and comprehension (up to 1M+ tokens)
  • Multi-round co-reference resolution
  • Distinguishing between identical requests in extended conversations

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

One split: test (2,400 tasks). Tasks span three needle counts (2-, 4-, and 8-needle variants) with context lengths from roughly 4K to over 1M tokens.

Reward Structure

Single-turn. The agent submits its response via the answer tool. Evaluation uses Python's SequenceMatcher to compute a sequence match ratio between the response and the ground truth. Reward is continuous in [0.0, 1.0], with 1.0 requiring an exact match. The response must start with a required random prefix string.
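The scoring rule can be sketched as follows. This is a minimal reconstruction, not the platform's exact grader; the prefix-stripping step mirrors OpenAI's published MRCR grading logic and is an assumption here:

```python
from difflib import SequenceMatcher


def grade(response: str, answer: str, random_prefix: str) -> float:
    """Continuous reward in [0.0, 1.0]; 1.0 requires an exact match."""
    # A response that lacks the required random prefix scores zero.
    if not response.startswith(random_prefix):
        return 0.0
    # Compare the remainder of the response against the ground truth.
    response = response.removeprefix(random_prefix)
    answer = answer.removeprefix(random_prefix)
    return float(SequenceMatcher(None, response, answer).ratio())


print(grade("XYZ hello world", "XYZ hello world", "XYZ"))  # 1.0
print(grade("hello world", "XYZ hello world", "XYZ"))      # 0.0
```

Because the ratio is continuous, near-misses (e.g. a single dropped character) earn partial credit rather than zero.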

Data

Parquet files sourced from the Hugging Face dataset openai/mrcr and stored on the OpenReward platform.

Tools

answer — submit a response for sequence-match evaluation against the ground truth.
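As an illustration, an answer-tool invocation might carry a payload like the following. The shape and field names are a hypothetical sketch, not the OpenReward wire format:

```python
import json

# Hypothetical answer-tool call; "tool", "arguments", and "response"
# are assumed field names for illustration only.
call = {
    "tool": "answer",
    "arguments": {
        # The response must begin with the task's required random prefix.
        "response": "mGJzXoTapirs in the twilight...",
    },
}
payload = json.dumps(call)
print(payload)
```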

Time Horizon

Single-turn.

Environment Difficulty

Tasks require precise retrieval from conversations spanning up to 1M+ tokens, with identical requests acting as confounding needles.

| Model             | Configuration  | Score |
|-------------------|----------------|-------|
| GPT-5.2           | 4-needle, 256K | 98%   |
| Claude Sonnet 4.6 | 4-needle, 256K | ~82%  |
| Claude Opus 4.6   | 8-needle, 1M   | 76%   |
| Gemini 3 Pro      | 8-needle, 128K | 77%   |
| Gemini 3 Pro      | 8-needle, 1M   | 26.3% |

Other Environment Requirements

There are no further environment requirements; MRCRV2 works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MRCRV2 answer questions about long conversations in a standard environment. The environment does not present direct safety risks.

Citation

@article{vodrahalli2024michelangelo,
  title={Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries},
  author={Vodrahalli, Kiran and Ontanon, Santiago and Tripuraneni, Nilesh and Xu, Kelvin and Jain, Sanil and Shivanna, Rakesh and Hui, Jeffrey and Dikkala, Nishanth and Kazemi, Mehran and Fatemi, Bahare and Anil, Rohan and Dyer, Ethan and Shakeri, Siamak and Vij, Roopali and Mehta, Harsh and Ramasesh, Vinay and Le, Quoc and Chi, Ed and Lu, Yifeng and Firat, Orhan and Lazaridou, Angeliki and Lespiau, Jean-Baptiste and Attaluri, Nithya and Olszewska, Kate},
  journal={arXiv preprint arXiv:2409.12640},
  year={2024}
}