KramaBench
Description
KramaBench is an environment for evaluating AI agents on data-to-insight pipelines over data lakes. Tasks require agents to discover relevant data files, perform data wrangling and cleaning, execute statistical analysis, and produce answers from real-world datasets spanning six domains: archaeology, astronomy, biomedical, environment, legal, and wildfire.
This OpenReward implementation is ported from the original Harbor Framework implementation by Michael Glass.
Capabilities
- Data discovery and retrieval from heterogeneous data lakes
- Data wrangling, cleaning, and transformation
- Statistical reasoning and analysis
- Building end-to-end data science pipelines
- Working across multiple scientific and professional domains
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools for data analysis. Default sandbox size is 1 CPU and 2 GB RAM, configurable per task.
License
Tasks
There is one split in this environment:
- Test: 104 data science pipeline tasks
Tasks span six domains (archaeology, astronomy, biomedical, environment, legal, and wildfire) at two difficulty levels, easy and hard.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent builds data processing pipelines iteratively, writes the answer to /app/answer.txt, and calls submit_answer to trigger verification. The reward depends on answer type:
- Numeric/String answers: 1.0 for exact match, 0.0 otherwise.
- List answers: F1 score (0.0-1.0) comparing predicted vs. expected list elements.
- Approximate numeric: Relative absolute error score for tolerance-based matching.
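The three scoring modes above can be sketched as follows. This is an illustrative approximation, not the environment's actual grader: the function names and the exact tolerance handling for approximate-numeric answers are assumptions.

```python
def exact_match(pred: str, expected: str) -> float:
    """Numeric/string answers: 1.0 for exact match, 0.0 otherwise."""
    return 1.0 if pred.strip() == expected.strip() else 0.0

def list_f1(pred: list, expected: list) -> float:
    """List answers: F1 score over predicted vs. expected elements."""
    pred_set, exp_set = set(pred), set(expected)
    if not pred_set or not exp_set:
        return 0.0
    tp = len(pred_set & exp_set)  # true positives: elements in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(exp_set)
    return 2 * precision * recall / (precision + recall)

def approx_numeric(pred: float, expected: float, tolerance: float = 0.05) -> float:
    """Approximate numeric answers: score based on relative absolute error.
    The 5% tolerance and the zero-expected fallback are assumptions."""
    if expected == 0:
        return 1.0 if abs(pred) <= tolerance else 0.0
    rel_err = abs(pred - expected) / abs(expected)
    return 1.0 if rel_err <= tolerance else 0.0
```

For example, a predicted list `["a", "b"]` against an expected `["b", "c"]` has one true positive, so precision and recall are both 0.5 and the F1 reward is 0.5.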
Data
Each task directory contains an instruction.md describing the data science question and a tests/ directory with verification scripts. Tasks draw from 1,700 data files across 24 data sources. Task data is stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
Time Horizon
KramaBench is a multi-turn environment. Agents discover data files, build processing pipelines, execute analysis, and submit results for verification.
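A minimal sketch of the discover-analyze-submit loop, assuming a toy CSV in a temporary directory; the file path, column names, and data values here are invented for illustration and do not reflect the actual data lake layout:

```python
import csv
import os
import tempfile

# Stand-in for data discovery: create a small CSV like one an agent
# might find while exploring the data lake with bash/view tools.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "stations.csv")
with open(data_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["station", "temp"])
    writer.writerows([["A", "12.5"], ["B", "14.0"]])

# Stand-in for the analysis pipeline: compute a summary statistic.
with open(data_path) as f:
    temps = [float(row["temp"]) for row in csv.DictReader(f)]
mean_temp = sum(temps) / len(temps)

# Stand-in for submission: write the answer where the verifier expects it
# (in the real environment this would be /app/answer.txt, followed by a
# submit_answer call).
answer_path = os.path.join(workdir, "answer.txt")
with open(answer_path, "w") as f:
    f.write(f"{mean_temp:.2f}\n")
```

In the real environment the agent would perform these steps across multiple turns with the bash and file-editing tools, refining the pipeline as it learns more about the data.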
Environment Difficulty
KramaBench is challenging. The original paper evaluates multiple models on end-to-end automation:
| Model | Score |
|---|---|
| GPT-o3 (self-correcting) | 22.1% |
| Gemini 2.5 Pro | 18.5% |
| Claude-3.5 (self-correcting) | 14.4% |
| Qwen2.5-Coder | 10.0% |
| GPT-4o | 8.3% |
| DeepSeek-R1 | 6.4% |
Performance varies by domain, with wildfire tasks achieving 50.7% (GPT-o3) while astronomy tasks remain below 2%. Data discovery from heterogeneous data lakes remains a key challenge.
Other Environment Requirements
There are no further environment requirements; KramaBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in KramaBench analyze publicly available datasets in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{lai2025kramabench,
author = {E. Lai and G. Vitagliano and Z. Zhang and S. Sudhir and O. Chabra and A. Zeng and A. A. Zabreyko and C. Li and F. Kossmann and J. Ding and others},
title = {KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes},
journal = {arXiv preprint arXiv:2506.06541},
year = {2025},
url = {https://arxiv.org/abs/2506.06541}
}