ADE-Bench

OpenReward Environment

Description

ADE-Bench (Analytics and Data Engineering Benchmark) is an environment for evaluating AI agents on analytics and data engineering tasks using dbt projects and DuckDB databases. Agents work in isolated containers to fix bugs, create models, refactor projects, and debug data quality issues in realistic dbt workflows.

Capabilities

  • Bash command execution in a dbt + DuckDB environment
  • Full dbt project access (models, macros, packages, profiles)
  • DuckDB database access for data inspection and querying
  • Automated evaluation via dbt tests comparing output against reference data

Compute Requirements

Sandbox: 1 CPU / 2 GB RAM per session. Network access is enabled so that dbt deps can fetch packages.

License

Apache License 2.0

Tasks

Single test split with 46 tasks across 9 domains:

Domain                 Count  Description
Airbnb                 11     Airbnb analytics (reviews, listings, NPS)
Formula 1              11     F1 racing data (drivers, results, standings)
Analytics Engineering  8      General AE tasks (refactoring, best practices)
Asana                  5      Project management data
QuickBooks             4      Accounting/financial data
Intercom               3      Customer messaging data
Simple                 2      Lightweight diagnostic tasks
Shopify Analytics      1      E-commerce analytics
Workday                1      HR/workforce data
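
As a quick sanity check on the split, the per-domain counts can be tallied against the stated totals. This is only an illustrative snippet; the dictionary name is made up here and the numbers are copied from the table above.

```python
# Task counts per domain, copied from the table above.
# TASKS_PER_DOMAIN is an illustrative name, not part of the benchmark.
TASKS_PER_DOMAIN = {
    "Airbnb": 11,
    "Formula 1": 11,
    "Analytics Engineering": 8,
    "Asana": 5,
    "QuickBooks": 4,
    "Intercom": 3,
    "Simple": 2,
    "Shopify Analytics": 1,
    "Workday": 1,
}

total_tasks = sum(TASKS_PER_DOMAIN.values())
print(len(TASKS_PER_DOMAIN), "domains,", total_tasks, "tasks")
```

The tally agrees with the 46 tasks and 9 domains stated for the test split.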

Tasks span easy, medium, and hard difficulty. Some tasks have multiple prompt variants (base, medium, hard), each of which appears as a separate task.

Reward Structure

The reward is binary (0.0 or 1.0). Evaluation runs dbt test --select "test_type:singular", which covers three kinds of tests:

  • Existence tests: Verify expected tables exist
  • Equality tests: Row-by-row comparison of agent output tables against reference CSVs (supports column filtering and alternate correct answers)
  • Manual tests: Task-specific SQL assertions (e.g., checking model materialization, source references)
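
To make the equality tests more concrete, here is a minimal Python sketch of the idea: compare an output table against one or more accepted reference CSVs, optionally restricted to a subset of columns. The real tests are dbt singular tests written in SQL; the function names below are hypothetical, values are compared as strings, and ignoring row order is an assumption of this sketch rather than documented behavior.

```python
import csv
import io


def load_rows(csv_text, columns=None):
    """Parse CSV text into a sorted list of row tuples.

    If `columns` is given, only those columns are kept (column filtering).
    Sorting makes the comparison independent of row order (an assumption).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = columns if columns is not None else reader.fieldnames
    rows = [tuple(row[c] for c in keep) for row in reader]
    return sorted(rows)


def tables_equal(output_csv, reference_csvs, columns=None):
    """Return True if the output matches any one of the accepted references."""
    actual = load_rows(output_csv, columns)
    return any(actual == load_rows(ref, columns) for ref in reference_csvs)


# Example: the timestamp column differs, so it is excluded from the comparison.
output = "id,name,updated_at\n1,a,2024-01-01\n2,b,2024-01-02\n"
reference = "id,name,updated_at\n2,b,1999-01-01\n1,a,1999-01-01\n"
print(tables_equal(output, [reference], columns=["id", "name"]))
```

Passing several entries in reference_csvs models the "alternate correct answers" case: the output only needs to match one of them.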

All tests must pass for reward=1.0.
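
The all-or-nothing rule can be sketched as a thin wrapper around the dbt command above. This is an assumed illustration, not the environment's actual harness: the function names are hypothetical, and it relies on dbt returning exit code 0 only when every selected test passes.

```python
import subprocess


def reward_from_exit_code(exit_code):
    """Map a test run's exit code to the binary reward.

    Exit code 0 means every selected test passed; anything else yields 0.0.
    """
    return 1.0 if exit_code == 0 else 0.0


def evaluate(project_dir="/app"):
    """Run the singular tests and return the reward (assumes dbt is on PATH)."""
    result = subprocess.run(
        ["dbt", "test", "--select", "test_type:singular"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return reward_from_exit_code(result.returncode)
```

There is no partial credit in this scheme: a single failing existence, equality, or manual test gives a reward of 0.0.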

Data

Source: dbt-labs/ade-bench on GitHub.
DuckDB databases: Google Drive.

Each domain has a dbt project and DuckDB database. Per-task data includes setup scripts, reference CSVs (seeds), and test SQL files.

Tools

  • bash — Execute bash commands in the dbt project directory (/app/)
  • submit — Submit solution for evaluation (runs dbt tests, returns binary reward)
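
For local experimentation, the bash tool can be approximated with a small Python helper. This is a hypothetical stand-in, not the environment's implementation: the function name, the timeout parameter, and the choice to concatenate stdout and stderr are all assumptions.

```python
import subprocess


def bash(command, cwd="/app", timeout=120):
    """Hypothetical stand-in for the bash tool.

    Runs `command` with bash in `cwd` and returns stdout followed by stderr.
    """
    result = subprocess.run(
        ["bash", "-c", command],
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

A session might, for instance, call bash("dbt deps") and bash("dbt run") to install packages and build models before invoking submit.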

Time Horizon

Multi-turn. Agents typically need 5 to 30 tool calls depending on task complexity. Easy tasks (for example, renaming a model) may need about 5 calls; hard tasks (debugging complex dbt package issues, creating new analytical models) may need 20 or more.

Environment Difficulty

  • Easy: ~30% of tasks (simple renames, straightforward fixes)
  • Medium: ~50% of tasks (model creation, refactoring, data inspection)
  • Hard: ~20% of tasks (complex debugging, multi-file changes, analytical reasoning)

Safety

Tasks involve SQL queries and file modifications in isolated containers. Agents have no access to external systems beyond dbt package registries, and all execution is sandboxed.
