AARDData
Description
AARDData is an environment for evaluating agents on pre-training data tasks. Tasks involve downloading, parsing, filtering, and extracting data from web archives using quality heuristics inspired by C4, Gopher, and FineWeb dataset processing pipelines.
Capabilities
- Downloading and processing large WARC (Web ARChive) files from Common Crawl
- Implementing data quality filtering pipelines (repetition, quality, paragraph)
- URL extraction and filtering with blocklists and banned word lists
- Text extraction from HTML using trafilatura
- Writing data processing scripts in Python
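As an illustration of the repetition-filtering capability above, a Gopher-style repetition filter can be sketched in a few lines of pure Python. The thresholds and helper names below are illustrative assumptions, not the environment's actual filtering logic:

```python
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicated = sum(c - 1 for c in counts.values())
    return duplicated / len(lines)


def top_bigram_fraction(text: str) -> float:
    """Fraction of words covered by the single most common word bigram."""
    words = text.split()
    if len(words) < 2:
        return 0.0
    bigrams = Counter(zip(words, words[1:]))
    _, top_count = bigrams.most_common(1)[0]
    return (top_count * 2) / len(words)


def passes_repetition_filter(text: str,
                             max_dup_lines: float = 0.30,
                             max_top_bigram: float = 0.20) -> bool:
    """Keep a document only if it is not dominated by repeated content."""
    return (duplicate_line_fraction(text) <= max_dup_lines
            and top_bigram_fraction(text) <= max_top_bigram)
```

Real pipelines such as Gopher's apply many more heuristics (symbol ratios, stop-word presence, mean word length); this sketch shows only the repetition family.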
Compute Requirements
Agents in AARDData are given a sandbox with 1 CPU and 2GB RAM, with network access enabled and a Python 3.12 data science image.
Tasks
There are seven training tasks in this environment, each requiring agents to download a WARC file from Common Crawl and apply specific processing or filtering logic.
Reward Structure
This is a sparse, verifiable reward environment. Rewards are returned when the agent calls submit_answer:
- Hash-based tasks (extract_text): Reward is 1.0 if the SHA256 hash of the solution file matches the expected hash, 0.0 otherwise.
- F1-based tasks (all filter tasks): Reward is the F1 score comparing the agent's predicted list against the ground truth.
F1-based scoring is used when a data processing component (e.g. filtering) produces a list of files, which is matched against ground-truth files from an oracle solution.
We do not use LLM graders in this environment.
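The two reward computations above can be sketched as follows. The function names and file-handling details are illustrative, not the environment's actual grader code:

```python
import hashlib
from pathlib import Path


def hash_reward(solution_path: str, expected_sha256: str) -> float:
    """1.0 if the file's SHA256 digest matches the expected hash, else 0.0."""
    digest = hashlib.sha256(Path(solution_path).read_bytes()).hexdigest()
    return 1.0 if digest == expected_sha256 else 0.0


def f1_reward(predicted: list[str], ground_truth: list[str]) -> float:
    """F1 score between the predicted list and the ground-truth list."""
    pred, truth = set(predicted), set(ground_truth)
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)  # true positives: items in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted list sharing two of three items with a three-item ground truth yields precision = recall = 2/3 and therefore F1 = 2/3.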
Data
Each task requires agents to download WARC files from Common Crawl at runtime. The url_filter task additionally uses pre-staged asset files (domain blocklists, URL blocklists, banned word lists, soft-banned word lists) that are copied from a mounted bucket into the sandbox at /home/ubuntu/assets/.
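A minimal sketch of blocklist-based URL filtering of the kind the url_filter task requires, assuming one-entry-per-line asset files as described above. The helper names and matching rules here are illustrative assumptions, not the task's actual specification:

```python
from urllib.parse import urlparse


def load_list(path: str) -> set[str]:
    """Load a one-entry-per-line asset file into a lowercased set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def url_passes(url: str,
               domain_blocklist: set[str],
               banned_words: set[str]) -> bool:
    """Reject URLs on blocked domains (including subdomains) or containing banned words."""
    domain = urlparse(url).netloc.lower()
    if any(domain == d or domain.endswith("." + d) for d in domain_blocklist):
        return False
    lowered = url.lower()
    return not any(word in lowered for word in banned_words)
```

A real pipeline would likely also handle the soft-banned word list (e.g. requiring several hits before rejection) rather than a single hard match.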
Ground truth files are stored server-side and used for F1 scoring; they are not visible to the agent.
Tools
Agents are given access to CLI tools for creating, viewing, and searching files in the sandbox filesystem. They are also given one environment-specific tool:
submit_answer: Submit the solution file for evaluation. For hash-based tasks, reads /home/ubuntu/solution.txt and compares its SHA256 hash against the expected value. For F1-based tasks, reads /home/ubuntu/solution.json and calculates the F1 score against ground truth.
Time Horizon
AARDData tasks are single-session: the agent downloads data, writes processing scripts, executes them, and submits the result. Each task involves writing a data processing pipeline from scratch.
[Statistics on average tool calls here]
Environment Difficulty
[Statistics on environment difficulty here]
Other Environment Requirements
Network access is enabled so agents can download WARC files from Common Crawl.
Safety
Agents in AARDData interact only with publicly available Common Crawl data within a sandboxed environment. The environment does not present direct safety risks. Network access is limited to downloading data files required for the tasks.
Long-term, there may be risks from machines contributing to pre-training data creation, such as "data poisoning" or the inclusion of data that produces dangerous behaviour in successor models. Since this environment is procedural rather than open-ended, we do not believe it would directly promote such behaviour.
Citation
@dataset{GRAARDData,
  author    = {General Reasoning Inc. Team},
  title     = {AARDData},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/AARDData}
}