MEDEC
Description
MEDEC is an environment for evaluating an agent's ability to detect medical errors in clinical notes, identify the erroneous sentence, and provide a medically accurate correction. Based on the MEDEC-MS dataset, it covers five types of medical errors: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism.
Capabilities
- Detecting whether a clinical note contains a medical error
- Identifying the specific sentence containing the error
- Providing medically accurate corrections
- Reasoning about clinical presentations and medical knowledge
Compute Requirements
Agents in MEDEC are given a standard sandbox environment. No special compute resources are required.
License
Tasks
There are three splits in this environment:
- Train: 2,189 clinical texts
- Validation: 574 clinical texts
- Test: 597 clinical texts
Total: 3,360 tasks
Each task presents a clinical note that either contains a medical error or is correct. The agent must determine whether an error exists and, if so, identify the erroneous sentence and provide a correction.
Reward Structure
This is a single-turn environment with a weighted reward structure (0.0–1.0):
- 40% — Error detection accuracy (binary: did the agent correctly classify whether an error exists?)
- 30% — Error sentence identification (LLM-graded via gpt-5-mini: did the agent identify the correct sentence?)
- 30% — Correction quality (LLM-graded via gpt-5-mini: is the agent's correction medically equivalent to the reference?)
For notes without errors, the reward is 1.0 for correctly identifying no error, and 0.0 for a false positive.
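The weighting above can be sketched as a small scoring function. This is an illustrative reconstruction, not the environment's actual implementation: the function name and the assumption that a missed error zeroes out all three components are our own.

```python
# Hypothetical sketch of MEDEC's weighted reward (0.0-1.0).
# Assumption: if the note contains an error but the agent reports none,
# no partial credit is given for sentence identification or correction.
def medec_reward(error_exists: bool,
                 detected: bool,
                 sentence_correct: bool,
                 correction_correct: bool) -> float:
    if not error_exists:
        # Error-free notes are all-or-nothing on the detection flag.
        return 1.0 if not detected else 0.0
    if not detected:
        return 0.0  # missed error: detection component (and the rest) is 0
    reward = 0.4                                  # error detection
    reward += 0.3 if sentence_correct else 0.0    # sentence identification
    reward += 0.3 if correction_correct else 0.0  # correction quality
    return reward
```

For example, correctly flagging an error and identifying the sentence, but giving a wrong correction, would score 0.4 + 0.3 = 0.7 under this sketch.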
Data
The dataset consists of clinical notes from the MEDEC-MS collection. Each note is either error-free or contains exactly one medical error of one of five types. Data is stored as Parquet files on the OpenReward platform.
Source: MEDEC GitHub Repository
Tools
| Tool | Description |
|---|---|
| `submit_answer` | Submit error detection results: `error_detected` (bool), `error_sentence` (string), `corrected_sentence` (string). Returns weighted reward and detailed feedback. |
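As a concrete illustration, a `submit_answer` call for a note judged to contain an error might carry a payload like the one below. Only the three field names come from the table above; the clinical sentences are invented examples.

```python
import json

# Illustrative submit_answer payload (field names from the tool table;
# the sentence contents are made-up examples, not real dataset entries).
answer = {
    "error_detected": True,
    "error_sentence": "The patient was started on amoxicillin for suspected viral pharyngitis.",
    "corrected_sentence": "The patient was advised supportive care for suspected viral pharyngitis.",
}
payload = json.dumps(answer)
```

For an error-free note, `error_detected` would be `False` and the two sentence fields would presumably be left empty.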
Time Horizon
MEDEC is a single-turn environment. The agent receives a clinical note and submits one answer.
Environment Difficulty
MEDEC is a challenging benchmark. The original paper reports that Claude 3.5 Sonnet achieves 70.16% accuracy on error flag detection and 65.62% on error sentence detection, while medical doctors outperform all LLMs on these tasks.
Other Environment Requirements
- OpenAI API key: Required for LLM-based grading of sentence identification and correction quality. Pass via `secrets={"openai_api_key": "..."}`.
Safety
MEDEC evaluates an agent's ability to detect errors in clinical notes and should not be used as a substitute for professional medical review. The environment does not involve real patient care or clinical decision-making.
Citations
@inproceedings{BenAbacha2025MEDEC,
author = {Ben Abacha, Asma and Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Xia, Fei and Yetisgen, Meliha},
title = {MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
year = {2025},
url = {https://arxiv.org/abs/2412.19260}
}