AARDData
Description
AARDData is an environment for evaluating agents on pre-training data tasks. Tasks involve downloading, parsing, filtering, and extracting data from web archives using quality heuristics inspired by C4, Gopher, and FineWeb dataset processing pipelines.
Capabilities
- Downloading and processing large WARC (Web ARChive) files from Common Crawl
- Implementing data quality filtering pipelines (repetition, quality, paragraph)
- URL extraction and filtering with blocklists and banned word lists
- Text extraction from HTML using trafilatura
- Writing data processing scripts in Python
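As an illustration of the repetition-filtering capability above, a Gopher-style repetition filter can be sketched in a few lines of pure Python. The thresholds and helper names below are illustrative assumptions, not the environment's actual filtering logic:

```python
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicated = sum(c - 1 for c in counts.values())
    return duplicated / len(lines)


def top_bigram_fraction(text: str) -> float:
    """Fraction of words covered by the single most common word bigram."""
    words = text.split()
    if len(words) < 2:
        return 0.0
    bigrams = Counter(zip(words, words[1:]))
    _, top_count = bigrams.most_common(1)[0]
    return (top_count * 2) / len(words)


def passes_repetition_filter(text: str,
                             max_dup_lines: float = 0.30,
                             max_top_bigram: float = 0.20) -> bool:
    """Keep a document only if it is not dominated by repeated content."""
    return (duplicate_line_fraction(text) <= max_dup_lines
            and top_bigram_fraction(text) <= max_top_bigram)
```

Real pipelines such as Gopher's apply many more heuristics (symbol ratios, stop-word presence, mean word length); this sketch shows only the repetition family.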
Compute Requirements
Agents in AARDData are given a sandbox with 1 CPU and 2GB RAM, with network access enabled and a Python 3.12 data science image.
Tasks
There are seven training tasks in this environment, each requiring agents to download a WARC file from Common Crawl and apply specific processing or filtering logic.
Reward Structure
This is a sparse, verifiable reward environment. Rewards are returned when the agent calls submit_answer:
- Hash-based tasks (extract_text): Reward is 1.0 if the SHA256 hash of the solution file matches the expected hash, 0.0 otherwise.
- F1-based tasks (all filter tasks): Reward is the F1 score comparing the agent's predicted list against the ground truth.
F1-based scoring is used when a data processing component (e.g. filtering) produces a list of files, which is matched against ground-truth files from an oracle solution.
We do not use LLM graders in this environment.
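The two reward computations above can be sketched as follows. The function names and file-handling details are illustrative, not the environment's actual grader code:

```python
import hashlib
from pathlib import Path


def hash_reward(solution_path: str, expected_sha256: str) -> float:
    """1.0 if the file's SHA256 digest matches the expected hash, else 0.0."""
    digest = hashlib.sha256(Path(solution_path).read_bytes()).hexdigest()
    return 1.0 if digest == expected_sha256 else 0.0


def f1_reward(predicted: list[str], ground_truth: list[str]) -> float:
    """F1 score between the predicted list and the ground-truth list."""
    pred, truth = set(predicted), set(ground_truth)
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)  # true positives: items in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted list sharing two of three items with a three-item ground truth yields precision = recall = 2/3 and therefore F1 = 2/3.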
Data
Each task requires agents to download WARC files from Common Crawl at runtime. The url_filter task additionally uses pre-staged asset files (domain blocklists, URL blocklists, banned word lists, soft-banned word lists) that are copied from a mounted bucket into the sandbox at /home/ubuntu/assets/.
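A minimal sketch of blocklist-based URL filtering of the kind the url_filter task requires, assuming one-entry-per-line asset files as described above. The helper names and matching rules here are illustrative assumptions, not the task's actual specification:

```python
from urllib.parse import urlparse


def load_list(path: str) -> set[str]:
    """Load a one-entry-per-line asset file into a lowercased set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def url_passes(url: str,
               domain_blocklist: set[str],
               banned_words: set[str]) -> bool:
    """Reject URLs on blocked domains (including subdomains) or containing banned words."""
    domain = urlparse(url).netloc.lower()
    if any(domain == d or domain.endswith("." + d) for d in domain_blocklist):
        return False
    lowered = url.lower()
    return not any(word in lowered for word in banned_words)
```

A real pipeline would likely also handle the soft-banned word list (e.g. requiring several hits before rejection) rather than a single hard match.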
Ground truth files are stored server-side and used for F1 scoring; they are not visible to the agent.
Tools
Agents are given access to CLI tools for creating, viewing, and searching files in the sandbox filesystem. They are also given one environment-specific tool:
submit_answer: Submit the solution file for evaluation. For hash-based tasks, reads /home/ubuntu/solution.txt and compares its SHA256 hash against the expected value. For F1-based tasks, reads /home/ubuntu/solution.json and calculates the F1 score against ground truth.
Time Horizon
AARDData tasks are single-session: the agent downloads data, writes processing scripts, executes them, and submits the result. Each task involves writing a data processing pipeline from scratch.
[Statistics on average tool calls here]
Environment Difficulty
[Statistics on environment difficulty here]
Other Environment Requirements
Network access is enabled so agents can download WARC files from Common Crawl.
Safety
Agents in AARDData interact only with publicly available Common Crawl data within a sandboxed environment. The environment does not present direct safety risks. Network access is limited to downloading data files required for the tasks.
Long-term, there may be risks from machines contributing to pre-training data creation, such as "data poisoning" or the inclusion of data that produces dangerous behaviour in successor models. Since this environment is procedural rather than open-ended, we do not believe it would directly promote such behaviour.
Citation
@dataset{GRAARDData,
  author    = {General Reasoning Inc. Team},
  title     = {AARDData},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/AARDData}
}