# DataCompEnvs
## Description
DataCompEnvs is a suite of seven data science competition environments sourced from CrunchDAO. Each environment presents an agent with a real-world data science task -- spanning financial forecasting, causal discovery, time-series anomaly detection, and venture capital prediction -- and asks the agent to explore the data, build a predictive model, and submit predictions for evaluation. The environments cover a wide range of ML problem types including ranking, binary classification, multiclass classification, and trading signal generation.
## Capabilities
- Exploratory data analysis on tabular and time-series datasets
- Feature engineering and selection across hundreds of anonymized features
- Training machine learning models (regression, classification, ranking)
- Handling class imbalance, missing data, and high-dimensional feature spaces
- Generating and formatting predictions for deterministic evaluation
- Long-horizon multi-turn execution using CLI tools (bash, file I/O, search)
## Compute Requirements
Agents are given a sandboxed environment with 8GB RAM and 4 CPUs, bash access, file editing tools, and scientific Python libraries (pandas, numpy, scikit-learn, scipy).
## Tasks
DataCompEnvs contains 7 environments, each with 1 task on the train split (7 tasks total):
| Environment | Task ID | Problem Type | Description |
|---|---|---|---|
| ADIACrossSection | adia-task | Ranking | Cross-section asset ranking forecast across 461 features and ~269 dates. Predict relative performance of ~4100 assets per date for 5 test dates. |
| VCPortfolio | vcportfolio-task | Binary Classification | Predict which VC-backed startups will experience an upround. 229 features, ~2.66M training samples across 36 dates, predict for 1 test date. |
| DataRally | datarally-task | Ranking (Alpha Scoring) | Rank assets in the Russell 3000 universe to capture idiosyncratic returns at low volatility. Predictions are processed through gaussianization, industry orthogonalization, and L1 normalization. |
| CausalityDiscovery | causality-discovery-task | Multiclass Classification | Discover causal DAGs from observational data. Classify each node's role (confounder, collider, mediator, etc.) relative to treatment X and outcome Y across 1,880 test datasets. |
| MidOne | midone-task | Ternary Trading Signal | Detect martingale exceptions in high-frequency time series. Predict buy (+1), sell (-1), or hold (0) with a transaction cost of 0.0025 per trade. |
| StructuralBreak | structuralbreak-task | Binary Classification | Detect structural breaks in univariate time series at a designated boundary point. Predict probability scores between 0 and 1. |
| DataCrunch | datacrunch-task | Ranking (Multi-Target) | Equity market neutral ranking of the 3000 most liquid US equities across 4 investment horizons (7-day, 28-day, 63-day, 91-day) with 1768 anonymized features. |
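As a concrete example of one of these objectives, the MidOne task scores ternary signals against subsequent price moves, net of the stated 0.0025 cost per executed trade. The sketch below is an illustrative reconstruction of such a metric, not the environment's actual evaluation code; the function name and the use of per-tick returns are assumptions:

```python
import numpy as np

EPSILON = 0.0025  # per-trade transaction cost stated in the task description

def avg_profit_per_trade(signals: np.ndarray, returns: np.ndarray) -> float:
    """Average profit of ternary signals (+1 buy, -1 sell, 0 hold).

    `returns` holds the subsequent price move for each tick; every
    non-zero signal is a trade and pays the transaction cost EPSILON.
    """
    signals = np.asarray(signals)
    returns = np.asarray(returns)
    trades = signals != 0
    if not trades.any():
        return 0.0  # no trades executed, no profit and no cost
    profit = signals[trades] * returns[trades] - EPSILON
    return float(profit.mean())
```

Because the cost applies per trade, a strategy that holds (0) on low-confidence ticks can outscore one that trades every tick even when both call the same directions.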
## Reward Structure
All environments use deterministic, verifiable reward functions with no LLM grader. Each environment applies its own evaluation metric when the agent calls the `submit_predictions` tool:
| Environment | Metric | Reward Range | Details |
|---|---|---|---|
| ADIACrossSection | Spearman rank correlation | [-1.0, 1.0] | Mean of per-date Spearman correlations between predicted and true asset rankings. |
| VCPortfolio | F1 score | [0.0, 1.0] | F1 score on binary upround predictions (probabilities thresholded at 0.5). |
| DataRally | Cumulative return (alpha score) | Unbounded | Predictions are gaussianized, industry-orthogonalized, L1-normalized, then dot-producted with weekly returns. Final score is cumulative compounded return. |
| CausalityDiscovery | Multiclass balanced accuracy | [0.0, 1.0] | Balanced accuracy across 8 node role classes (confounder, collider, mediator, independent, cause/consequence of X/Y). |
| MidOne | Average profit per trade | Unbounded | Total profit from buy/sell/hold decisions minus transaction costs (epsilon=0.0025), divided by number of trades. |
| StructuralBreak | ROC AUC | [0.0, 1.0] | Area under the ROC curve for structural break detection. Random chance = 0.5. |
| DataCrunch | Weighted Spearman correlation | [-1.0, 1.0] | Weighted average of per-target Spearman correlations: target_b=54.5%, target_g=18.2%, target_r=18.2%, target_w=9.1%. |
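The two Spearman-based ranking scores might be reproduced along these lines. This is a sketch, not the environments' actual scoring code: the `spearman` helper uses plain argsort ranking (ties are not averaged), and the dict-of-arrays layout for the DataCrunch targets is an assumption:

```python
import numpy as np

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation as Pearson correlation of ranks
    (simple argsort ranking; ties are not averaged in this sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def mean_per_date_spearman(preds_by_date, trues_by_date) -> float:
    """ADIACrossSection-style score: mean correlation over test dates."""
    return float(np.mean([spearman(p, t)
                          for p, t in zip(preds_by_date, trues_by_date)]))

# DataCrunch-style weighted average over the four target horizons
WEIGHTS = {"target_b": 0.545, "target_g": 0.182,
           "target_r": 0.182, "target_w": 0.091}

def weighted_spearman(preds: dict, trues: dict) -> float:
    return sum(w * spearman(preds[k], trues[k]) for k, w in WEIGHTS.items())
```

Note that because both scores depend only on ranks, the magnitude of predictions is irrelevant; only their ordering within each date (and per target, for DataCrunch) matters.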
A successful call to `submit_predictions` returns the metric as reward together with `finished=True`. If the submission is invalid (wrong format, missing predictions, etc.), the tool returns `reward=0.0` with `finished=False`, allowing the agent to fix the issue and resubmit.
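The DataRally post-processing chain (gaussianize, orthogonalize by industry, L1-normalize, dot with returns) could be approximated as below, assuming the simplest version of each step: rank-based gaussianization via standard-normal quantiles, and per-industry demeaning for orthogonalization. The environment's actual transforms may differ:

```python
import numpy as np
from statistics import NormalDist

def gaussianize(scores: np.ndarray) -> np.ndarray:
    """Map raw scores to standard-normal quantiles via their ranks."""
    n = len(scores)
    ranks = np.argsort(np.argsort(scores)) + 1          # ranks 1..n
    return np.array([NormalDist().inv_cdf(r / (n + 1)) for r in ranks])

def orthogonalize_by_industry(weights: np.ndarray,
                              industries: np.ndarray) -> np.ndarray:
    """Demean within each industry so net industry exposure is zero."""
    out = weights.astype(float).copy()
    for ind in np.unique(industries):
        mask = industries == ind
        out[mask] -= out[mask].mean()
    return out

def alpha_score(scores, industries, returns) -> float:
    """One period's return of an L1-normalized, industry-neutral portfolio."""
    w = gaussianize(np.asarray(scores))
    w = orthogonalize_by_industry(w, np.asarray(industries))
    w /= np.abs(w).sum()                                # L1 normalization
    return float(w @ np.asarray(returns))
```

The final DataRally reward would then compound such per-week values across the test period, so the processed weights, not the raw predictions, determine the score.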
## Data
Data files are provided in Parquet format (6 environments) and Pickle format (CausalityDiscovery). Each environment mounts its data read-only into the sandbox from CrunchDAO-sourced datasets:
| Environment | Mount Path | Training Files | Data Scale |
|---|---|---|---|
| ADIACrossSection | /tmp/adia-data/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 461 features, ~269 dates, ~800-900 assets/date |
| VCPortfolio | /tmp/gr-datasets/ | vc_X_train.parquet, vc_y_train.parquet, vc_X_test_reduced.parquet | 229 features, 36 dates, ~2.66M samples |
| DataRally | /tmp/gr-datasets/ | dr_X_train.parquet, dr_y_train.parquet, dr_X_test_reduced.parquet | Weekly data with industry classification |
| CausalityDiscovery | /tmp/gr-datasets/ | Pickle files with 23,500 training datasets (1000 observations, 3-10 variables each) | 1,880 test datasets |
| MidOne | /tmp/gr-datasets/ | mid_X_train.parquet, mid_y_train.parquet, mid_X_test_reduced.parquet | High-frequency time series with stream identifiers |
| StructuralBreak | /tmp/gr-datasets/ | sb_X_train.parquet, sb_y_train.parquet, sb_X_test_reduced.parquet | MultiIndex (id, time) with value and period columns |
| DataCrunch | /tmp/gr-datasets/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 1768 features, weekly frequency, 4 target horizons |
## Tools
Each environment provides 1 custom tool plus 9 inherited CLI tools from CLIEnvironment (10 tools total per environment):
Custom tool (per environment):
- `submit_predictions` -- Validates and scores the agent's predictions from `submission.csv`. Returns the evaluation metric as reward and marks the episode as finished on success. Returns `finished=False` with an error message on invalid submissions so the agent can retry.
Inherited CLI tools (shared across all 7 environments):
- `bash` -- Execute shell commands in the sandbox (timeout: 600s)
- `glob` -- Find files matching a glob pattern
- `grep` -- Search for patterns in files
- `ls` -- List files and directories
- `read` -- Read file contents with optional offset/limit
- `write` -- Write content to a file
- `edit` -- Perform exact string replacement in a file
- `multi_edit` -- Apply multiple edits to a single file
- `todo_write` -- Manage a structured todo list for task planning
## Time Horizon
DataCompEnvs environments are open-ended and multi-step. A typical session involves exploring the dataset, engineering features, training one or more models, generating predictions, and submitting them. An episode terminates when the agent calls `submit_predictions` with a valid submission.
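A minimal end-to-end baseline in that shape might look like the following. Synthetic arrays stand in for the mounted Parquet files, and the two-column submission schema is purely illustrative; each environment defines its own required format:

```python
import csv
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train.parquet / y_train.parquet / X_test_reduced.parquet
X_train = rng.normal(size=(500, 10))
y_train = X_train @ rng.normal(size=10) + 0.1 * rng.normal(size=500)
X_test = rng.normal(size=(50, 10))

# Ridge-style least-squares fit (numpy only; the sandbox also provides scikit-learn)
lam = 1.0
A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
coef = np.linalg.solve(A, X_train.T @ y_train)
preds = X_test @ coef

# Write the file that submit_predictions scores
# (column names here are illustrative, not the environments' actual schema)
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction"])
    for i, p in enumerate(preds):
        writer.writerow([i, f"{p:.6f}"])
```

After writing `submission.csv`, the agent would call `submit_predictions` and, if the tool reports `finished=False`, inspect the error message, repair the file, and resubmit.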
## Environment Difficulty
[Put environment difficulty statistics here]
## Other Environment Requirements
There are no further environment requirements; DataCompEnvs works out of the box with the OpenReward endpoint without any external API keys.
## Safety
Agents in DataCompEnvs operate within sandboxed compute environments with read-only access to competition data. The environments do not interact with live markets, real financial systems, or external services. Agents can only affect the sandbox filesystem and submit predictions for deterministic evaluation. The data is anonymized and does not contain personally identifiable information.
## Citation
```bibtex
@dataset{GRDataCompEnvs,
  author    = {General Reasoning Inc. Team},
  title     = {DataCompEnvs: Data Science Competition Environments},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://www.openreward.ai/GeneralReasoning/DataCompEnvs}
}
```