# ACI-Bench
## Description
ACI-Bench (Ambient Clinical Intelligence Benchmark) is an environment for evaluating an agent's ability to generate clinical visit notes from doctor-patient dialogues. The agent reads an enumerated dialogue and must produce a comprehensive clinical note, which is scored against a detailed rubric covering accuracy, completeness, and communication quality.
## Capabilities
- Generating structured clinical notes from medical dialogues
- Understanding medical terminology and clinical workflows
- Producing notes with appropriate sections (Subjective, Objective, Assessment and Plan)
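The expected section structure can be sketched as a simple template. The section names follow the list above; the helper name `format_note` and all content are illustrative, not part of the environment's API:

```python
# Illustrative note skeleton; all content is placeholder, not real clinical data.
def format_note(subjective: str, objective: str, assessment_plan: str) -> str:
    """Assemble a clinical note with the standard sections."""
    return (
        f"SUBJECTIVE:\n{subjective}\n\n"
        f"OBJECTIVE:\n{objective}\n\n"
        f"ASSESSMENT AND PLAN:\n{assessment_plan}\n"
    )

note = format_note("Chief complaint ...", "Exam findings ...", "Plan ...")
print(note.splitlines()[0])  # SUBJECTIVE:
```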
## Compute Requirements
This is a single-turn environment with no sandbox.
## License
## Tasks
There are three splits in this environment:
- Train: 114 tasks
- Validation: 35 tasks
- Test: 210 tasks
Tasks are drawn from three subsets: `aci`, `virtassist`, and `virtscribe`.
Each task presents a numbered doctor-patient dialogue. The agent must generate a complete clinical note.
## Reward Structure
This is a single-turn environment with a continuous reward in the range 0.0–1.0. Scoring is rubric-based, with 10–15 criteria per note graded in parallel by gpt-5-mini. Criteria are grouped into three evaluation axes:
- Accuracy (~40–50%): Factual correctness of clinical information
- Completeness (~30–40%): Coverage of required note sections
- Communication Quality (~20–30%): Clarity and professional formatting
The final reward is `earned points / total available points` across all criteria.
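A minimal sketch of that computation, assuming each rubric criterion carries a point weight and a pass/fail grade; the field names `weight` and `met` are illustrative, not the environment's actual schema:

```python
# Hedged sketch of the rubric reward: earned points / total available points.
def rubric_reward(criteria: list[dict]) -> float:
    """Each criterion is assumed to look like {"weight": 2.0, "met": True}."""
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if c["met"])
    return earned / total if total else 0.0

criteria = [
    {"weight": 2.0, "met": True},   # e.g. an accuracy criterion
    {"weight": 1.0, "met": False},  # e.g. a completeness criterion
    {"weight": 1.0, "met": True},   # e.g. a communication criterion
]
print(rubric_reward(criteria))  # 3.0 / 4.0 = 0.75
```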
## Data
Data consists of a consolidated Parquet file with dialogues and reference notes, plus a JSON file with per-task rubrics. Data is stored on the OpenReward platform.
Source: `mkieffer/ACI-Bench-MedARC`
## Tools
| Tool | Description |
|---|---|
| `submit_note` | Submit your generated clinical note for rubric-based evaluation. Returns per-criterion scores and overall reward. |
## Time Horizon
ACI-Bench is a single-turn environment. The agent receives a dialogue and submits one clinical note.
## Environment Difficulty
The original paper evaluates baseline models on clinical note generation:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | MedCon |
|---|---|---|---|---|
| GPT-4 | 51.8 | 22.6 | 46.0 | 57.8 |
| BART + FTSAMSum | 53.5 | 25.1 | 48.6 | - |
GPT-4 achieved the highest medical concept (MedCon) score without fine-tuning, while the fine-tuned BART + FTSAMSum model scores higher on ROUGE metrics, reflecting its fit to the structured note format.
## Other Environment Requirements
- OpenAI API key: required for rubric-based grading via gpt-5-mini. Pass via `secrets={"openai_api_key": "..."}`.
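A minimal sketch of assembling that secrets mapping from an environment variable; how the mapping is handed to the environment loader is platform-specific and not shown here:

```python
import os

# Build the secrets mapping expected by the grader; reading the key from
# OPENAI_API_KEY is a common convention, not a requirement of this environment.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}
```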
## Safety
Agents in ACI-Bench generate clinical notes from synthetic or de-identified dialogues. The environment does not involve real patient care. Generated notes should not be used in clinical settings.
## Citations
```bibtex
@article{yim2023acibench,
  author  = {Wen-wai Yim and Yujuan Fu and Asma Ben Abacha and Neal Snider and Thomas Lin and Meliha Yetisgen},
  title   = {ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation},
  journal = {Scientific Data},
  volume  = {10},
  pages   = {586},
  year    = {2023},
  doi     = {10.1038/s41597-023-02487-3},
  url     = {https://arxiv.org/abs/2306.02022}
}
```