ACI-Bench


Description

ACI-Bench (Ambient Clinical Intelligence Benchmark) is an environment for evaluating an agent's ability to generate clinical visit notes from doctor-patient dialogues. The agent reads an enumerated dialogue and must produce a comprehensive clinical note, which is scored against a detailed rubric covering accuracy, completeness, and communication quality.

Capabilities

  • Generating structured clinical notes from medical dialogues
  • Understanding medical terminology and clinical workflows
  • Producing notes with appropriate sections (Subjective, Objective, Assessment and Plan)

Compute Requirements

This is a single-turn environment with no sandbox.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

  • Train: 114 tasks
  • Validation: 35 tasks
  • Test: 210 tasks

Tasks are drawn from three subsets: aci, virtassist, and virtscribe.

Each task presents a numbered doctor-patient dialogue. The agent must generate a complete clinical note.

Reward Structure

This is a single-turn environment with a continuous reward between 0.0 and 1.0.

Scoring is rubric-based with 10–15 criteria per note, graded in parallel by gpt-5-mini. Criteria are grouped into three evaluation axes:

  • Accuracy (~40–50%): Factual correctness of clinical information
  • Completeness (~30–40%): Coverage of required note sections
  • Communication Quality (~20–30%): Clarity and professional formatting

The final reward is: earned points / total available points across all criteria.
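A minimal sketch of this aggregation, assuming each criterion is graded as an (earned, available) point pair; the pair representation is illustrative, not the environment's internal format:

```python
def rubric_reward(criteria):
    """Aggregate per-criterion grades into a single reward in [0.0, 1.0].

    criteria: list of (earned_points, available_points) pairs, one per
    rubric criterion (the pair format here is illustrative).
    """
    earned = sum(e for e, _ in criteria)
    total = sum(a for _, a in criteria)
    return earned / total if total else 0.0

# Example: three criteria worth 2 points each, one only partially met
print(rubric_reward([(2, 2), (1, 2), (2, 2)]))  # 0.8333333333333334
```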

Data

Data consists of a consolidated Parquet file with dialogues and reference notes, plus a JSON file with per-task rubrics. Data is stored on the OpenReward platform.

Source: mkieffer/ACI-Bench-MedARC

Tools

  • submit_note: Submit your generated clinical note for rubric-based evaluation. Returns per-criterion scores and overall reward.
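For illustration, a call to submit_note might be expressed as the usual JSON function-calling payload; the argument name "note" and the payload shape are assumptions, not taken from the environment's actual tool schema:

```python
import json

# Hypothetical submit_note tool-call payload; the "note" argument name
# and the overall shape are assumptions for illustration only.
call = {
    "name": "submit_note",
    "arguments": json.dumps({
        "note": "SUBJECTIVE: ...\nOBJECTIVE: ...\nASSESSMENT AND PLAN: ..."
    }),
}
print(call["name"])  # submit_note
```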

Time Horizon

ACI-Bench is a single-turn environment. The agent receives a dialogue and submits one clinical note.

Environment Difficulty

The original paper evaluates baseline models on clinical note generation:

  Model            ROUGE-1  ROUGE-2  ROUGE-L  MedCon
  GPT-4            51.8     22.6     46.0     57.8
  BART + FTSAMSum  53.5     25.1     48.6     -
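As a reference point for the ROUGE-1 column, here is a minimal sketch of unigram ROUGE-1 F1 using plain whitespace tokenization; the published scores use the standard ROUGE tooling (with stemming and other preprocessing), so exact numbers will differ:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F1 between a candidate note and a reference note."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the patient reports pain",
                      "the patient reports chest pain"), 3))  # 0.889
```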

GPT-4 achieved the highest medical concept (MedCon) score without fine-tuning, but fine-tuned models outperform it on ROUGE metrics for the structured note format.

Other Environment Requirements

  • OpenAI API key: Required for rubric-based grading via gpt-5-mini. Pass via secrets={"openai_api_key": "..."}.
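A minimal sketch of building that secrets mapping, assuming the key is stored in the conventional OPENAI_API_KEY environment variable; the client call that consumes it is omitted, since the exact OpenReward API is not shown here:

```python
import os

# Read the grading key from the environment; OPENAI_API_KEY is a common
# convention, and the placeholder fallback is for illustration only.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "sk-placeholder")}
print(sorted(secrets))  # ['openai_api_key']
```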

Safety

Agents in ACI-Bench generate clinical notes from synthetic or de-identified dialogues. The environment does not involve real patient care. Generated notes should not be used in clinical settings.

Citations

@article{yim2023acibench,
  author    = {Wen-wai Yim and Yujuan Fu and Asma Ben Abacha and Neal Snider and Thomas Lin and Meliha Yetisgen},
  title     = {ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation},
  journal   = {Scientific Data},
  volume    = {10},
  pages     = {586},
  year      = {2023},
  doi       = {10.1038/s41597-023-02487-3},
  url       = {https://arxiv.org/abs/2306.02022}
}