arc-agi-2
ARC-AGI-2
Description
ARC-AGI-2 is an environment for evaluating abstract reasoning with increased difficulty over ARC-AGI-1. Agents are given few-shot examples demonstrating transformation patterns from input grids to output grids, then must apply the deduced rule to question inputs. Tasks require compositional reasoning, global rule induction, and multi-step transformations.
Capabilities
- Advanced abstract reasoning and pattern induction
- Compositional reasoning across multiple transformation steps
- Grid-based spatial reasoning with increased complexity
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
This environment has two splits:
- train: Training tasks
- test: Evaluation tasks
Each task includes few-shot examples and question inputs requiring predicted outputs.
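As a rough illustration of the task shape, the sketch below uses a toy task whose hidden rule is a left-right mirror. The field names (`examples`, `questions`, `input`, `output`) and the helpers `mirror` and `rule_fits` are assumptions made for this example and may not match the dataset's actual schema.

```python
# Toy task for illustration only; field names are assumed, not taken
# from the dataset schema.
Grid = list[list[int]]

toy_task = {
    "examples": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "questions": [{"input": [[5, 0, 0, 0]]}],
}


def mirror(grid: Grid) -> Grid:
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def rule_fits(rule, examples) -> bool:
    """Check that a candidate rule reproduces every few-shot output exactly."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)


# Apply the rule to the question inputs only if it explains all examples.
if rule_fits(mirror, toy_task["examples"]):
    predictions = [mirror(q["input"]) for q in toy_task["questions"]]
else:
    predictions = []
```

Real tasks generally need richer, compositional rules than a single mirror; the point here is only the loop of proposing a rule, checking it against the few-shot examples, and then applying it to the question inputs.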
Reward Structure
Evaluation is multi-attempt with deterministic grading. The agent submits predicted output grids via the answer tool, with up to 2 attempts allowed per task. Submitted outputs are compared against the ground truth by exact match. The reward is 1.0 if all outputs are correct and 0.0 otherwise. The episode ends on a correct answer or after 2 failed attempts.
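The grading rule can be sketched as follows. This is a minimal approximation based only on the description above, not the environment's actual grader; `grade`, `run_episode`, and `MAX_ATTEMPTS` are hypothetical names.

```python
Grid = list[list[int]]
MAX_ATTEMPTS = 2  # attempt limit stated in the description above


def grade(predicted: list[Grid], truth: list[Grid]) -> float:
    """Return 1.0 only if every predicted grid exactly matches the truth."""
    # List equality also fails when the number of grids differs.
    return 1.0 if predicted == truth else 0.0


def run_episode(attempts: list[list[Grid]], truth: list[Grid]) -> tuple[float, int]:
    """Grade attempts in order; stop on success or after MAX_ATTEMPTS."""
    used = 0
    for predicted in attempts[:MAX_ATTEMPTS]:
        used += 1
        if grade(predicted, truth) == 1.0:
            return 1.0, used
    return 0.0, used
```

For example, with `truth = [[[1, 2], [3, 4]]]`, a wrong first attempt followed by a correct second attempt gives a reward of 1.0 after 2 attempts, while two wrong attempts end the episode with a reward of 0.0. There is no partial credit: a single mismatched cell in any output grid fails the whole attempt.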
Data
The dataset is loaded from the Hugging Face repository arc-agi-community/arc-agi-2. Tasks contain few-shot examples and question inputs.
Tools
| Tool | Description |
|---|---|
| answer | Submit predicted output grids as a list of objects with "output" keys. Up to 2 attempts. Ends the episode on success or on the final attempt. |
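A payload in the shape the answer tool describes might be built like this. The helper `build_answer` and the rectangularity check are assumptions for illustration; how the tool is actually invoked depends on the client in use.

```python
import json


def build_answer(grids):
    """Wrap each predicted grid in an object with an "output" key."""
    for grid in grids:
        # Assumed sanity check: a grid is a non-empty list of equal-length rows.
        if not grid or any(len(row) != len(grid[0]) for row in grid):
            raise ValueError("each grid must be a non-empty rectangle")
    return [{"output": grid} for grid in grids]


# One object per question input, in the same order as the questions.
payload = build_answer([[[0, 0, 5]], [[1, 2], [3, 4]]])
print(json.dumps(payload))
```

Validating locally before submitting is worthwhile because each task allows only 2 attempts, so a malformed grid would waste one of them.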
Time Horizon
Multi-attempt. The agent analyzes few-shot examples, deduces the transformation rule, and submits outputs with up to 2 attempts.
Environment Difficulty
ARC-AGI-2 is significantly harder than ARC-AGI-1 and is designed to remain challenging as AI capabilities improve. Reported accuracy and cost per task:
| Model | Accuracy | Cost/Task |
|---|---|---|
| Poetiq SOTA | 54% | $30.57 |
| GPT-5.2 (X-High) | 52.9% | $1.90 |
| Gemini 3 Deep Think | 45% | $77.16 |
| Opus 4.5 (Thinking) | 37.6% | $2.20 |
| Human Average | 60% | - |
100% of tasks have been solved by humans in under 2 attempts. The benchmark enforces an implicit efficiency frontier: systems must improve accuracy without incurring higher costs.
Other Environment Requirements
There are no further environment requirements; ARC-AGI-2 works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in ARC-AGI-2 solve abstract reasoning puzzles in a standard environment. The environment does not present direct safety risks.
Citation
@misc{arcprize2025,
title={ARC Prize 2025: Technical Report},
author={ARC Prize Foundation},
year={2025},
url={https://arcprize.org}
}