DataScienceComps

DataCompEnvs

⭐ OpenReward Environment

Description

DataCompEnvs is a suite of seven data science competition environments sourced from CrunchDAO. Each environment presents an agent with a real-world data science task -- spanning financial forecasting, causal discovery, time-series anomaly detection, and venture capital prediction -- and asks the agent to explore the data, build a predictive model, and submit predictions for evaluation. The environments cover a wide range of ML problem types including ranking, binary classification, multiclass classification, and trading signal generation.

Capabilities

  • Exploratory data analysis on tabular and time-series datasets
  • Feature engineering and selection across hundreds of anonymized features
  • Training machine learning models (regression, classification, ranking)
  • Handling class imbalance, missing data, and high-dimensional feature spaces
  • Generating and formatting predictions for deterministic evaluation
  • Long-horizon multi-turn execution using CLI tools (bash, file I/O, search)

Compute Requirements

Agents are given a sandboxed environment with 8GB RAM and 4 CPUs, bash access, file editing tools, and scientific Python libraries (pandas, numpy, scikit-learn, scipy).
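With only 8GB of RAM and training sets reaching millions of rows (VCPortfolio has ~2.66M samples), memory-efficient loading matters. One common trick is downcasting numeric dtypes before modeling; a minimal sketch on synthetic data (the helper name `downcast` is illustrative, not part of the environment):

```python
import numpy as np
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64/int64 columns to smaller dtypes to save memory."""
    out = df.copy()
    for col in out.select_dtypes(include="float64"):
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="int64"):
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Synthetic stand-in for a wide training frame.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(1000, 50)))
df.columns = [f"f{i}" for i in df.columns]
before = df.memory_usage(deep=True).sum()
after = downcast(df).memory_usage(deep=True).sum()  # float64 -> float32 halves this
```

On all-float64 frames this roughly halves resident memory, which can be the difference between fitting a dataset in the sandbox or not.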

Tasks

DataCompEnvs contains 7 environments, each with 1 task on the train split (7 tasks total):

| Environment | Task ID | Problem Type | Description |
| --- | --- | --- | --- |
| ADIACrossSection | adia-task | Ranking | Cross-section asset ranking forecast across 461 features and ~269 dates. Predict relative performance of ~4100 assets per date for 5 test dates. |
| VCPortfolio | vcportfolio-task | Binary Classification | Predict which VC-backed startups will experience an upround. 229 features, ~2.66M training samples across 36 dates, predict for 1 test date. |
| DataRally | datarally-task | Ranking (Alpha Scoring) | Rank assets in the Russell 3000 universe to capture idiosyncratic returns at low volatility. Predictions are processed through gaussianization, industry orthogonalization, and L1 normalization. |
| CausalityDiscovery | causality-discovery-task | Multiclass Classification | Discover causal DAGs from observational data. Classify each node's role (confounder, collider, mediator, etc.) relative to treatment X and outcome Y across 1,880 test datasets. |
| MidOne | midone-task | Ternary Trading Signal | Detect martingale exceptions in high-frequency time series. Predict buy (+1), sell (-1), or hold (0) with a transaction cost of 0.0025 per trade. |
| StructuralBreak | structuralbreak-task | Binary Classification | Detect structural breaks in univariate time series at a designated boundary point. Predict probability scores between 0 and 1. |
| DataCrunch | datacrunch-task | Ranking (Multi-Target) | Equity market neutral ranking of the 3000 most liquid US equities across 4 investment horizons (7-day, 28-day, 63-day, 91-day) with 1768 anonymized features. |
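Despite their different metrics, the tabular tasks share a common shape: load features, fit a model, and write a submission.csv. A minimal baseline sketch on synthetic data (the column names "id" and "prediction" are illustrative; each environment documents its own required submission schema):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train.parquet / y_train.parquet / X_test_reduced.parquet.
X_train = pd.DataFrame(rng.normal(size=(500, 20)))
y_train = X_train @ rng.normal(size=20) + rng.normal(scale=0.1, size=500)
X_test = pd.DataFrame(rng.normal(size=(100, 20)))

# A regularized linear baseline; real sessions would iterate on features and models.
model = Ridge(alpha=1.0).fit(X_train, y_train)
preds = model.predict(X_test)

submission = pd.DataFrame({"id": X_test.index, "prediction": preds})
submission.to_csv("submission.csv", index=False)
```

From here the agent would call submit_predictions, inspect the reward, and iterate.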

Reward Structure

All environments use deterministic, verifiable reward functions with no LLM grader. Each environment has its own evaluation metric applied when the agent calls the submit_predictions tool:

| Environment | Metric | Reward Range | Details |
| --- | --- | --- | --- |
| ADIACrossSection | Spearman rank correlation | [-1.0, 1.0] | Mean of per-date Spearman correlations between predicted and true asset rankings. |
| VCPortfolio | F1 score | [0.0, 1.0] | F1 score on binary upround predictions (probabilities thresholded at 0.5). |
| DataRally | Cumulative return (alpha score) | Unbounded | Predictions are gaussianized, industry-orthogonalized, L1-normalized, then dot-producted with weekly returns. Final score is cumulative compounded return. |
| CausalityDiscovery | Multiclass balanced accuracy | [0.0, 1.0] | Balanced accuracy across 8 node role classes (confounder, collider, mediator, independent, cause/consequence of X/Y). |
| MidOne | Average profit per trade | Unbounded | Total profit from buy/sell/hold decisions minus transaction costs (epsilon=0.0025), divided by number of trades. |
| StructuralBreak | ROC AUC | [0.0, 1.0] | Area under the ROC curve for structural break detection. Random chance = 0.5. |
| DataCrunch | Weighted Spearman correlation | [-1.0, 1.0] | Weighted average of per-target Spearman correlations: target_b=54.5%, target_g=18.2%, target_r=18.2%, target_w=9.1%. |
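DataRally's post-processing is the most involved of these metrics. A sketch of the documented three steps (gaussianization via rank-to-inverse-normal mapping, industry demeaning, L1 normalization) on toy data; this mirrors the described pipeline but the actual scorer may differ in detail:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata

def alpha_weights(preds: np.ndarray, industries: np.ndarray) -> np.ndarray:
    """Turn raw scores into industry-neutral portfolio weights (sketch)."""
    # 1. Gaussianize: map ranks into (0, 1), then through the inverse normal CDF.
    u = rankdata(preds) / (len(preds) + 1)
    z = norm.ppf(u)
    # 2. Industry orthogonalization: demean within each industry group.
    z = pd.Series(z).groupby(pd.Series(industries)).transform(
        lambda s: s - s.mean()).to_numpy()
    # 3. L1 normalization: absolute weights sum to 1.
    return z / np.abs(z).sum()

w = alpha_weights(np.array([0.3, -1.2, 0.8, 0.1, 2.0, -0.5]),
                  np.array([0, 0, 0, 1, 1, 1]))
weekly_ret = np.array([0.01, -0.02, 0.015, 0.0, 0.03, -0.01])
score = w @ weekly_ret  # one week's contribution; the final score compounds over weeks
```

Because each industry is demeaned before normalization, the resulting weights are long-short within every industry, which is what makes the captured returns idiosyncratic rather than sector bets.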

Reward is returned upon calling submit_predictions with finished=True. If the submission is invalid (wrong format, missing predictions, etc.), the tool returns reward=0.0 with finished=False, allowing the agent to fix and resubmit.
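Since invalid submissions cost a turn, agents benefit from checking the file locally before calling the tool. A hypothetical pre-flight check mirroring the failure modes described above (column names and expected row count are illustrative; real validation happens server-side inside submit_predictions):

```python
import pandas as pd

def preflight(path: str, expected_rows: int,
              required_cols=("id", "prediction")) -> bool:
    """Return False for the kinds of problems that would yield
    reward=0.0 with finished=False: missing file, wrong columns,
    wrong row count, or missing predictions."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        return False
    if any(c not in df.columns for c in required_cols):
        return False
    if len(df) != expected_rows:
        return False
    if df["prediction"].isna().any():
        return False
    return True

pd.DataFrame({"id": [0, 1], "prediction": [0.4, 0.9]}).to_csv(
    "submission.csv", index=False)
ok = preflight("submission.csv", expected_rows=2)
```

Catching a malformed file locally is cheaper than burning a submit_predictions call on it.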

Data

Data files are provided in Parquet format (6 environments) and Pickle format (CausalityDiscovery). Each environment mounts its data read-only into the sandbox from CrunchDAO-sourced datasets:

| Environment | Mount Path | Training Files | Data Scale |
| --- | --- | --- | --- |
| ADIACrossSection | /tmp/adia-data/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 461 features, ~269 dates, ~800-900 assets/date |
| VCPortfolio | /tmp/gr-datasets/ | vc_X_train.parquet, vc_y_train.parquet, vc_X_test_reduced.parquet | 229 features, 36 dates, ~2.66M samples |
| DataRally | /tmp/gr-datasets/ | dr_X_train.parquet, dr_y_train.parquet, dr_X_test_reduced.parquet | Weekly data with industry classification |
| CausalityDiscovery | /tmp/gr-datasets/ | Pickle files with 23,500 training datasets (1000 observations, 3-10 variables each) | 1,880 test datasets |
| MidOne | /tmp/gr-datasets/ | mid_X_train.parquet, mid_y_train.parquet, mid_X_test_reduced.parquet | High-frequency time series with stream identifiers |
| StructuralBreak | /tmp/gr-datasets/ | sb_X_train.parquet, sb_y_train.parquet, sb_X_test_reduced.parquet | MultiIndex (id, time) with value and period columns |
| DataCrunch | /tmp/gr-datasets/ | X_train.parquet, y_train.parquet, X_test_reduced.parquet | 1768 features, weekly frequency, 4 target horizons |

Tools

Each environment provides 1 custom tool plus 9 inherited CLI tools from CLIEnvironment (10 tools total per environment):

Custom tool (per environment):

  • submit_predictions -- Validates and scores the agent's predictions from submission.csv. Returns the evaluation metric as reward and marks the episode as finished on success. Returns finished=False with an error message on invalid submissions so the agent can retry.

Inherited CLI tools (shared across all 7 environments):

  • bash -- Execute shell commands in the sandbox (timeout: 600s)
  • glob -- Find files matching a glob pattern
  • grep -- Search for patterns in files
  • ls -- List files and directories
  • read -- Read file contents with optional offset/limit
  • write -- Write content to a file
  • edit -- Perform exact string replacement in a file
  • multi_edit -- Apply multiple edits to a single file
  • todo_write -- Manage a structured todo list for task planning

Time Horizon

DataCompEnvs is an open-ended, multi-step environment. A typical session involves the agent exploring the dataset, engineering features, training one or more models, generating predictions, and submitting them. The episode terminates when the agent calls submit_predictions with a valid submission.

Environment Difficulty

[Put environment difficulty statistics here]

Other Environment Requirements

There are no further environment requirements; DataCompEnvs works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in DataCompEnvs operate within sandboxed compute environments with read-only access to competition data. The environments do not interact with live markets, real financial systems, or external services. Agents can only affect the sandbox filesystem and submit predictions for deterministic evaluation. The data is anonymized and does not contain personally identifiable information.

Citation

@dataset{GRDataCompEnvs,
  author    = {General Reasoning Inc. Team},
  title     = {DataCompEnvs: Data Science Competition Environments},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://www.openreward.ai/GeneralReasoning/DataCompEnvs}
}