ADE-Bench

Description

ADE-Bench (Analytics and Data Engineering Benchmark) is an environment for evaluating AI agents on analytics and data engineering tasks using dbt projects and DuckDB databases. Agents work in isolated containers to fix bugs, create models, refactor projects, and debug data quality issues in realistic dbt workflows.

Capabilities

Bash command execution in a dbt + DuckDB environment
Full dbt project access (models, macros, packages, profiles)
DuckDB database access for data inspection and querying
Automated evaluation via dbt tests comparing output against reference data

Compute Requirements

Sandbox: 1 CPU / 2 GB RAM per session. Network access is enabled so that dbt deps can fetch packages.

License

Apache License 2.0

Tasks

Single test split with 46 tasks across 9 domains:

Domain	Count	Description
Airbnb	11	Airbnb analytics (reviews, listings, NPS)
Formula 1	11	F1 racing data (drivers, results, standings)
Analytics Engineering	8	General AE tasks (refactoring, best practices)
Asana	5	Project management data
QuickBooks	4	Accounting/financial data
Intercom	3	Customer messaging data
Simple	2	Lightweight diagnostic tasks
Shopify Analytics	1	E-commerce analytics
Workday	1	HR/workforce data

Tasks span easy, medium, and hard difficulty. Some tasks have multiple prompt variants (base, medium, hard), each of which appears as a separate task.

Reward Structure

Binary reward (0.0 or 1.0). Evaluation runs dbt test --select "test_type:singular", which covers three kinds of tests:

Existence tests: Verify expected tables exist
Equality tests: Row-by-row comparison of agent output tables against reference CSVs (supports column filtering and alternate correct answers)
Manual tests: Task-specific SQL assertions (e.g., checking model materialization, source references)

All tests must pass for reward=1.0.
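
The all-or-nothing rule above can be sketched in a few lines of Python. This is only an illustration, not the environment's actual scoring code: the helper names (eval_argv, binary_reward) and the dictionary of per-test outcomes are hypothetical, and treating an empty result set as a failure is an assumption of this sketch. Only the command string is taken from this README.

```python
import shlex

# Evaluation command as quoted in this README.
EVAL_COMMAND = 'dbt test --select "test_type:singular"'


def eval_argv(command: str = EVAL_COMMAND) -> list[str]:
    """Split the command string into an argv list (e.g. for subprocess.run)."""
    return shlex.split(command)


def binary_reward(test_passed: dict[str, bool]) -> float:
    """Return 1.0 only if at least one test ran and every test passed.

    `test_passed` maps a hypothetical test name to its pass/fail outcome.
    Scoring an empty result set as 0.0 is an assumption of this sketch.
    """
    if not test_passed:
        return 0.0
    return 1.0 if all(test_passed.values()) else 0.0
```

Under this sketch, a single failing existence, equality, or manual test is enough to drop the reward to 0.0; there is no partial credit.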

Data

Source: dbt-labs/ade-bench on GitHub.
DuckDB databases: Google Drive.

Each domain has a dbt project and DuckDB database. Per-task data includes setup scripts, reference CSVs (seeds), and test SQL files.

Tools

bash — Execute bash commands in the dbt project directory (/app/)
submit — Submit solution for evaluation (runs dbt tests, returns binary reward)

Time Horizon

Multi-turn. Agents typically need 5-30 tool calls depending on task complexity. Easy tasks (renaming a model) may need ~5 calls; hard tasks (debugging complex dbt package issues, creating new analytical models) may need 20+.

Environment Difficulty

Easy: ~30% of tasks (simple renames, straightforward fixes)
Medium: ~50% of tasks (model creation, refactoring, data inspection)
Hard: ~20% of tasks (complex debugging, multi-file changes, analytical reasoning)

Safety

Tasks involve SQL queries and file modifications in isolated containers. No access to external systems beyond dbt package registries. All execution is sandboxed.

Repository

Source repository

EnvCommons/ADE-Bench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550

Examples

5-minute session: $0.0165

1-hour session: $0.1980
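
The example figures follow directly from the per-second rates in the table. A minimal Python sketch of the arithmetic, assuming a session is billed for its full duration at the combined rate (the constant and function names here are illustrative, not part of any billing API):

```python
# Per-second rates from the cost table (USD).
ENVIRONMENT_RATE = 0.0000320
SANDBOX_RATE = 0.0000230


def session_cost(seconds: float) -> float:
    """Estimated cost in USD for a session lasting `seconds` seconds."""
    return seconds * (ENVIRONMENT_RATE + SANDBOX_RATE)


five_minutes = session_cost(5 * 60)   # about 0.0165
one_hour = session_cost(60 * 60)      # about 0.1980
```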
