# ACI-Bench
## Description
ACI-Bench (Ambient Clinical Intelligence Benchmark) is an environment for evaluating an agent's ability to generate clinical visit notes from doctor-patient dialogues. The agent reads an enumerated dialogue and must produce a comprehensive clinical note, which is scored against a detailed rubric covering accuracy, completeness, and communication quality.
## Capabilities
- Generating structured clinical notes from medical dialogues
- Understanding medical terminology and clinical workflows
- Producing notes with appropriate sections (Subjective, Objective, Assessment and Plan)
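The expected section structure can be sketched as a simple template. The section names follow the list above; the helper name `format_note` and all content are illustrative, not part of the environment's API:

```python
# Illustrative note skeleton; all content is placeholder, not real clinical data.
def format_note(subjective: str, objective: str, assessment_plan: str) -> str:
    """Assemble a clinical note with the standard sections."""
    return (
        f"SUBJECTIVE:\n{subjective}\n\n"
        f"OBJECTIVE:\n{objective}\n\n"
        f"ASSESSMENT AND PLAN:\n{assessment_plan}\n"
    )

note = format_note("Chief complaint ...", "Exam findings ...", "Plan ...")
print(note.splitlines()[0])  # SUBJECTIVE:
```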
## Compute Requirements
This is a single-turn environment with no sandbox.
## License
## Tasks
There are three splits in this environment:
- Train: 114 tasks
- Validation: 35 tasks
- Test: 210 tasks
Tasks are drawn from three subsets: `aci`, `virtassist`, and `virtscribe`.
Each task presents a numbered doctor-patient dialogue. The agent must generate a complete clinical note.
## Reward Structure
This is a single-turn environment with a continuous reward in the range 0.0–1.0. Scoring is rubric-based, with 10–15 criteria per note graded in parallel by gpt-5-mini. Criteria are grouped into three evaluation axes:
- Accuracy (~40–50%): Factual correctness of clinical information
- Completeness (~30–40%): Coverage of required note sections
- Communication Quality (~20–30%): Clarity and professional formatting
The final reward is `earned points / total available points` across all criteria.
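A minimal sketch of that computation, assuming each rubric criterion carries a point weight and a pass/fail grade; the field names `weight` and `met` are illustrative, not the environment's actual schema:

```python
# Hedged sketch of the rubric reward: earned points / total available points.
def rubric_reward(criteria: list[dict]) -> float:
    """Each criterion is assumed to look like {"weight": 2.0, "met": True}."""
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if c["met"])
    return earned / total if total else 0.0

criteria = [
    {"weight": 2.0, "met": True},   # e.g. an accuracy criterion
    {"weight": 1.0, "met": False},  # e.g. a completeness criterion
    {"weight": 1.0, "met": True},   # e.g. a communication criterion
]
print(rubric_reward(criteria))  # 3.0 / 4.0 = 0.75
```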
## Data
Data consists of a consolidated Parquet file with dialogues and reference notes, plus a JSON file with per-task rubrics. Data is stored on the OpenReward platform.
Source: `mkieffer/ACI-Bench-MedARC`
## Tools
| Tool | Description |
|---|---|
| `submit_note` | Submit your generated clinical note for rubric-based evaluation. Returns per-criterion scores and overall reward. |
## Time Horizon
ACI-Bench is a single-turn environment. The agent receives a dialogue and submits one clinical note.
## Environment Difficulty
The original paper evaluates baseline models on clinical note generation:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | MedCon |
|---|---|---|---|---|
| GPT-4 | 51.8 | 22.6 | 46.0 | 57.8 |
| BART + FTSAMSum | 53.5 | 25.1 | 48.6 | - |
GPT-4 achieved the highest medical concept (MedCon) score without fine-tuning, while the fine-tuned BART + FTSAMSum model scores higher on ROUGE metrics, reflecting its fit to the structured note format.
## Other Environment Requirements
- OpenAI API key: required for rubric-based grading via gpt-5-mini. Pass via `secrets={"openai_api_key": "..."}`.
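A minimal sketch of assembling that secrets mapping from an environment variable; how the mapping is handed to the environment loader is platform-specific and not shown here:

```python
import os

# Build the secrets mapping expected by the grader; reading the key from
# OPENAI_API_KEY is a common convention, not a requirement of this environment.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}
```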
## Safety
Agents in ACI-Bench generate clinical notes from synthetic or de-identified dialogues. The environment does not involve real patient care. Generated notes should not be used in clinical settings.
## Citations
```bibtex
@article{yim2023acibench,
  author  = {Wen-wai Yim and Yujuan Fu and Asma Ben Abacha and Neal Snider and Thomas Lin and Meliha Yetisgen},
  title   = {ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation},
  journal = {Scientific Data},
  volume  = {10},
  pages   = {586},
  year    = {2023},
  doi     = {10.1038/s41597-023-02487-3},
  url     = {https://arxiv.org/abs/2306.02022}
}
```