KramaBench
Description
KramaBench is an environment for evaluating AI agents on data-to-insight pipelines over data lakes. Tasks require agents to discover relevant data files, perform data wrangling and cleaning, execute statistical analysis, and produce answers from real-world datasets spanning six domains: archaeology, astronomy, biomedical, environment, legal, and wildfire.
This OpenReward implementation is ported from the original Harbor Framework implementation by Michael Glass.
Capabilities
- Data discovery and retrieval from heterogeneous data lakes
- Data wrangling, cleaning, and transformation
- Statistical reasoning and analysis
- Building end-to-end data science pipelines
- Working across multiple scientific and professional domains
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools for data analysis. Default sandbox size is 1 CPU and 2 GB RAM, configurable per task.
License
Tasks
There is one split in this environment:
- Test: 104 data science pipeline tasks
Tasks span six domains (archaeology, astronomy, biomedical, environment, legal, and wildfire) at two difficulty levels, easy and hard.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent builds data processing pipelines iteratively, writes the answer to /app/answer.txt, and calls submit_answer to trigger verification. The reward depends on answer type:
- Numeric/String answers: 1.0 for exact match, 0.0 otherwise.
- List answers: F1 score (0.0-1.0) comparing predicted vs. expected list elements.
- Approximate numeric: Relative absolute error score for tolerance-based matching.
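The three scoring modes above can be sketched as follows. This is an illustrative approximation, not the environment's actual grader: the function names and the exact tolerance handling for approximate-numeric answers are assumptions.

```python
def exact_match(pred: str, expected: str) -> float:
    """Numeric/string answers: 1.0 for exact match, 0.0 otherwise."""
    return 1.0 if pred.strip() == expected.strip() else 0.0

def list_f1(pred: list, expected: list) -> float:
    """List answers: F1 score over predicted vs. expected elements."""
    pred_set, exp_set = set(pred), set(expected)
    if not pred_set or not exp_set:
        return 0.0
    tp = len(pred_set & exp_set)  # true positives: elements in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(exp_set)
    return 2 * precision * recall / (precision + recall)

def approx_numeric(pred: float, expected: float, tolerance: float = 0.05) -> float:
    """Approximate numeric answers: score based on relative absolute error.
    The 5% tolerance and the zero-expected fallback are assumptions."""
    if expected == 0:
        return 1.0 if abs(pred) <= tolerance else 0.0
    rel_err = abs(pred - expected) / abs(expected)
    return 1.0 if rel_err <= tolerance else 0.0
```

For example, a predicted list `["a", "b"]` against an expected `["b", "c"]` has one true positive, so precision and recall are both 0.5 and the F1 reward is 0.5.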
Data
Each task directory contains an instruction.md describing the data science question and a tests/ directory with verification scripts. Tasks draw from 1,700 data files across 24 data sources. Task data is stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
Time Horizon
KramaBench is a multi-turn environment. Agents discover data files, build processing pipelines, execute analysis, and submit results for verification.
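A minimal sketch of the discover-analyze-submit loop, assuming a toy CSV in a temporary directory; the file path, column names, and data values here are invented for illustration and do not reflect the actual data lake layout:

```python
import csv
import os
import tempfile

# Stand-in for data discovery: create a small CSV like one an agent
# might find while exploring the data lake with bash/view tools.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "stations.csv")
with open(data_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["station", "temp"])
    writer.writerows([["A", "12.5"], ["B", "14.0"]])

# Stand-in for the analysis pipeline: compute a summary statistic.
with open(data_path) as f:
    temps = [float(row["temp"]) for row in csv.DictReader(f)]
mean_temp = sum(temps) / len(temps)

# Stand-in for submission: write the answer where the verifier expects it
# (in the real environment this would be /app/answer.txt, followed by a
# submit_answer call).
answer_path = os.path.join(workdir, "answer.txt")
with open(answer_path, "w") as f:
    f.write(f"{mean_temp:.2f}\n")
```

In the real environment the agent would perform these steps across multiple turns with the bash and file-editing tools, refining the pipeline as it learns more about the data.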
Environment Difficulty
KramaBench is challenging. The original paper evaluates multiple models on end-to-end automation:
| Model | Score |
|---|---|
| GPT-o3 (self-correcting) | 22.1% |
| Gemini 2.5 Pro | 18.5% |
| Claude-3.5 (self-correcting) | 14.4% |
| Qwen2.5-Coder | 10.0% |
| GPT-4o | 8.3% |
| DeepSeek-R1 | 6.4% |
Performance varies by domain, with wildfire tasks achieving 50.7% (GPT-o3) while astronomy tasks remain below 2%. Data discovery from heterogeneous data lakes remains a key challenge.
Other Environment Requirements
There are no further environment requirements; KramaBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in KramaBench analyze publicly available datasets in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{lai2025kramabench,
author = {E. Lai and G. Vitagliano and Z. Zhang and S. Sudhir and O. Chabra and A. Zeng and A. A. Zabreyko and C. Li and F. Kossmann and J. Ding and others},
title = {KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes},
journal = {arXiv preprint arXiv:2506.06541},
year = {2025},
url = {https://arxiv.org/abs/2506.06541}
}