# DataCompEnvs
## Description
DataCompEnvs is a suite of seven data science competition environments sourced from CrunchDAO. Each environment presents an agent with a real-world data science task -- spanning financial forecasting, causal discovery, time-series anomaly detection, and venture capital prediction -- and asks the agent to explore the data, build a predictive model, and submit predictions for evaluation. The environments cover a wide range of ML problem types including ranking, binary classification, multiclass classification, and trading signal generation.
## Capabilities
- Exploratory data analysis on tabular and time-series datasets
- Feature engineering and selection across hundreds of anonymized features
- Training machine learning models (regression, classification, ranking)
- Handling class imbalance, missing data, and high-dimensional feature spaces
- Generating and formatting predictions for deterministic evaluation
- Long-horizon multi-turn execution using CLI tools (bash, file I/O, search)
## Compute Requirements
Agents are given a sandboxed environment with 8GB RAM and 4 CPUs, bash access, file editing tools, and scientific Python libraries (pandas, numpy, scikit-learn, scipy).
## Tasks
DataCompEnvs contains 7 environments, each with 1 task on the train split (7 tasks total):
| Environment | Task ID | Problem Type | Description |
|---|---|---|---|
| ADIACrossSection | adia-task | Ranking | Cross-section asset ranking forecast across 461 features and ~269 dates. Predict relative performance of ~4100 assets per date for 5 test dates. |
| VCPortfolio | vcportfolio-task | Binary Classification | Predict which VC-backed startups will experience an upround. 229 features, ~2.66M training samples across 36 dates, predict for 1 test date. |
| DataRally | datarally-task | Ranking (Alpha Scoring) | Rank assets in the Russell 3000 universe to capture idiosyncratic returns at low volatility. Predictions are processed through gaussianization, industry orthogonalization, and L1 normalization. |
| CausalityDiscovery | causality-discovery-task | Multiclass Classification | Discover causal DAGs from observational data. Classify each node's role (confounder, collider, mediator, etc.) relative to treatment X and outcome Y across 1,880 test datasets. |
| MidOne | midone-task | Ternary Trading Signal | Detect martingale exceptions in high-frequency time series. Predict buy (+1), sell (-1), or hold (0) with a transaction cost of 0.0025 per trade. |
| StructuralBreak | structuralbreak-task | Binary Classification | Detect structural breaks in univariate time series at a designated boundary point. Predict probability scores between 0 and 1. |
| DataCrunch | datacrunch-task | Ranking (Multi-Target) | Equity market neutral ranking of the 3000 most liquid US equities across 4 investment horizons (7-day, 28-day, 63-day, 91-day) with 1768 anonymized features. |
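As a concrete example of one of these objectives, the MidOne task scores ternary signals against subsequent price moves, net of the stated 0.0025 cost per executed trade. The sketch below is an illustrative reconstruction of such a metric, not the environment's actual evaluation code; the function name and the use of per-tick returns are assumptions:

```python
import numpy as np

EPSILON = 0.0025  # per-trade transaction cost stated in the task description

def avg_profit_per_trade(signals: np.ndarray, returns: np.ndarray) -> float:
    """Average profit of ternary signals (+1 buy, -1 sell, 0 hold).

    `returns` holds the subsequent price move for each tick; every
    non-zero signal is a trade and pays the transaction cost EPSILON.
    """
    signals = np.asarray(signals)
    returns = np.asarray(returns)
    trades = signals != 0
    if not trades.any():
        return 0.0  # no trades executed, no profit and no cost
    profit = signals[trades] * returns[trades] - EPSILON
    return float(profit.mean())
```

Because the cost applies per trade, a strategy that holds (0) on low-confidence ticks can outscore one that trades every tick even when both call the same directions.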
## Reward Structure
All environments use deterministic, verifiable reward functions with no LLM grader. Each environment applies its own evaluation metric when the agent calls the `submit_predictions` tool:
| Environment | Metric | Reward Range | Details |
|---|---|---|---|
| ADIACrossSection | Spearman rank correlation | [-1.0, 1.0] | Mean of per-date Spearman correlations between predicted and true asset rankings. |
| VCPortfolio | F1 score | [0.0, 1.0] | F1 score on binary upround predictions (probabilities thresholded at 0.5). |
| DataRally | Cumulative return (alpha score) | Unbounded | Predictions are gaussianized, industry-orthogonalized, L1-normalized, then dot-producted with weekly returns. Final score is cumulative compounded return. |
| CausalityDiscovery | Multiclass balanced accuracy | [0.0, 1.0] | Balanced accuracy across 8 node role classes (confounder, collider, mediator, independent, cause/consequence of X/Y). |
| MidOne | Average profit per trade | Unbounded | Total profit from buy/sell/hold decisions minus transaction costs (epsilon=0.0025), divided by number of trades. |
| StructuralBreak | ROC AUC | [0.0, 1.0] | Area under the ROC curve for structural break detection. Random chance = 0.5. |
| DataCrunch | Weighted Spearman correlation | [-1.0, 1.0] | Weighted average of per-target Spearman correlations: target_b=54.5%, target_g=18.2%, target_r=18.2%, target_w=9.1%. |
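The two Spearman-based ranking scores might be reproduced along these lines. This is a sketch, not the environments' actual scoring code: the `spearman` helper uses plain argsort ranking (ties are not averaged), and the dict-of-arrays layout for the DataCrunch targets is an assumption:

```python
import numpy as np

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation as Pearson correlation of ranks
    (simple argsort ranking; ties are not averaged in this sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def mean_per_date_spearman(preds_by_date, trues_by_date) -> float:
    """ADIACrossSection-style score: mean correlation over test dates."""
    return float(np.mean([spearman(p, t)
                          for p, t in zip(preds_by_date, trues_by_date)]))

# DataCrunch-style weighted average over the four target horizons
WEIGHTS = {"target_b": 0.545, "target_g": 0.182,
           "target_r": 0.182, "target_w": 0.091}

def weighted_spearman(preds: dict, trues: dict) -> float:
    return sum(w * spearman(preds[k], trues[k]) for k, w in WEIGHTS.items())
```

Note that because both scores depend only on ranks, the magnitude of predictions is irrelevant; only their ordering within each date (and per target, for DataCrunch) matters.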
A successful call to `submit_predictions` returns the metric as reward together with `finished=True`. If the submission is invalid (wrong format, missing predictions, etc.), the tool returns `reward=0.0` with `finished=False`, allowing the agent to fix the issue and resubmit.
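The DataRally post-processing chain (gaussianize, orthogonalize by industry, L1-normalize, dot with returns) could be approximated as below, assuming the simplest version of each step: rank-based gaussianization via standard-normal quantiles, and per-industry demeaning for orthogonalization. The environment's actual transforms may differ:

```python
import numpy as np
from statistics import NormalDist

def gaussianize(scores: np.ndarray) -> np.ndarray:
    """Map raw scores to standard-normal quantiles via their ranks."""
    n = len(scores)
    ranks = np.argsort(np.argsort(scores)) + 1          # ranks 1..n
    return np.array([NormalDist().inv_cdf(r / (n + 1)) for r in ranks])

def orthogonalize_by_industry(weights: np.ndarray,
                              industries: np.ndarray) -> np.ndarray:
    """Demean within each industry so net industry exposure is zero."""
    out = weights.astype(float).copy()
    for ind in np.unique(industries):
        mask = industries == ind
        out[mask] -= out[mask].mean()
    return out

def alpha_score(scores, industries, returns) -> float:
    """One period's return of an L1-normalized, industry-neutral portfolio."""
    w = gaussianize(np.asarray(scores))
    w = orthogonalize_by_industry(w, np.asarray(industries))
    w /= np.abs(w).sum()                                # L1 normalization
    return float(w @ np.asarray(returns))
```

The final DataRally reward would then compound such per-week values across the test period, so the processed weights, not the raw predictions, determine the score.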
## Data
Data files are provided in Parquet format (6 environments) and Pickle format (CausalityDiscovery). Each environment mounts its data read-only into the sandbox from CrunchDAO-sourced datasets:
| Environment | Mount Path | Training Files | Data Scale |
|---|---|---|---|
| ADIACrossSection | /tmp/adia-data/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 461 features, ~269 dates, ~800-900 assets/date |
| VCPortfolio | /tmp/gr-datasets/ | vc_X_train.parquet, vc_y_train.parquet, vc_X_test_reduced.parquet | 229 features, 36 dates, ~2.66M samples |
| DataRally | /tmp/gr-datasets/ | dr_X_train.parquet, dr_y_train.parquet, dr_X_test_reduced.parquet | Weekly data with industry classification |
| CausalityDiscovery | /tmp/gr-datasets/ | Pickle files with 23,500 training datasets (1000 observations, 3-10 variables each) | 1,880 test datasets |
| MidOne | /tmp/gr-datasets/ | mid_X_train.parquet, mid_y_train.parquet, mid_X_test_reduced.parquet | High-frequency time series with stream identifiers |
| StructuralBreak | /tmp/gr-datasets/ | sb_X_train.parquet, sb_y_train.parquet, sb_X_test_reduced.parquet | MultiIndex (id, time) with value and period columns |
| DataCrunch | /tmp/gr-datasets/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 1768 features, weekly frequency, 4 target horizons |
## Tools
Each environment provides 1 custom tool plus 9 inherited CLI tools from CLIEnvironment (10 tools total per environment):
Custom tool (per environment):
- `submit_predictions` -- Validates and scores the agent's predictions from `submission.csv`. Returns the evaluation metric as reward and marks the episode as finished on success. Returns `finished=False` with an error message on invalid submissions so the agent can retry.
Inherited CLI tools (shared across all 7 environments):
- `bash` -- Execute shell commands in the sandbox (timeout: 600s)
- `glob` -- Find files matching a glob pattern
- `grep` -- Search for patterns in files
- `ls` -- List files and directories
- `read` -- Read file contents with optional offset/limit
- `write` -- Write content to a file
- `edit` -- Perform exact string replacement in a file
- `multi_edit` -- Apply multiple edits to a single file
- `todo_write` -- Manage a structured todo list for task planning
## Time Horizon
DataCompEnvs environments are open-ended and multi-step. A typical session involves exploring the dataset, engineering features, training one or more models, generating predictions, and submitting them. An episode terminates when the agent calls `submit_predictions` with a valid submission.
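A minimal end-to-end baseline in that shape might look like the following. Synthetic arrays stand in for the mounted Parquet files, and the two-column submission schema is purely illustrative; each environment defines its own required format:

```python
import csv
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train.parquet / y_train.parquet / X_test_reduced.parquet
X_train = rng.normal(size=(500, 10))
y_train = X_train @ rng.normal(size=10) + 0.1 * rng.normal(size=500)
X_test = rng.normal(size=(50, 10))

# Ridge-style least-squares fit (numpy only; the sandbox also provides scikit-learn)
lam = 1.0
A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
coef = np.linalg.solve(A, X_train.T @ y_train)
preds = X_test @ coef

# Write the file that submit_predictions scores
# (column names here are illustrative, not the environments' actual schema)
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction"])
    for i, p in enumerate(preds):
        writer.writerow([i, f"{p:.6f}"])
```

After writing `submission.csv`, the agent would call `submit_predictions` and, if the tool reports `finished=False`, inspect the error message, repair the file, and resubmit.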
## Environment Difficulty
[Put environment difficulty statistics here]
## Other Environment Requirements
There are no further environment requirements; DataCompEnvs works out of the box with the OpenReward endpoint without any external API keys.
## Safety
Agents in DataCompEnvs operate within sandboxed compute environments with read-only access to competition data. The environments do not interact with live markets, real financial systems, or external services. Agents can only affect the sandbox filesystem and submit predictions for deterministic evaluation. The data is anonymized and does not contain personally identifiable information.
## Citation
```bibtex
@dataset{GRDataCompEnvs,
  author    = {General Reasoning Inc. Team},
  title     = {DataCompEnvs: Data Science Competition Environments},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://www.openreward.ai/GeneralReasoning/DataCompEnvs}
}
```