MEDEC
Description
MEDEC is an environment for evaluating an agent's ability to detect medical errors in clinical notes, identify the erroneous sentence, and provide a medically accurate correction. Based on the MEDEC-MS dataset, it covers five types of medical errors: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism.
Capabilities
- Detecting whether a clinical note contains a medical error
- Identifying the specific sentence containing the error
- Providing medically accurate corrections
- Reasoning about clinical presentations and medical knowledge
Compute Requirements
Agents in MEDEC are given a standard sandbox environment. No special compute resources are required.
License
Tasks
There are three splits in this environment:
- Train: 2,189 clinical texts
- Validation: 574 clinical texts
- Test: 597 clinical texts
Total: 3,360 tasks
Each task presents a clinical note that either contains a medical error or is correct. The agent must determine whether an error exists and, if so, identify the erroneous sentence and provide a correction.
Reward Structure
This is a single-turn environment with a weighted reward structure (0.0–1.0):
- 40% — Error detection accuracy (binary: did the agent correctly classify whether an error exists?)
- 30% — Error sentence identification (LLM-graded via gpt-5-mini: did the agent identify the correct sentence?)
- 30% — Correction quality (LLM-graded via gpt-5-mini: is the agent's correction medically equivalent to the reference?)
For notes without errors, the reward is 1.0 for correctly identifying no error, and 0.0 for a false positive.
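The weighting above can be sketched as a small scoring function. This is an illustrative reconstruction, not the environment's actual implementation: the function name and the assumption that a missed error zeroes out all three components are our own.

```python
# Hypothetical sketch of MEDEC's weighted reward (0.0-1.0).
# Assumption: if the note contains an error but the agent reports none,
# no partial credit is given for sentence identification or correction.
def medec_reward(error_exists: bool,
                 detected: bool,
                 sentence_correct: bool,
                 correction_correct: bool) -> float:
    if not error_exists:
        # Error-free notes are all-or-nothing on the detection flag.
        return 1.0 if not detected else 0.0
    if not detected:
        return 0.0  # missed error: detection component (and the rest) is 0
    reward = 0.4                                  # error detection
    reward += 0.3 if sentence_correct else 0.0    # sentence identification
    reward += 0.3 if correction_correct else 0.0  # correction quality
    return reward
```

For example, correctly flagging an error and identifying the sentence, but giving a wrong correction, would score 0.4 + 0.3 = 0.7 under this sketch.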
Data
The dataset consists of clinical notes from the MEDEC-MS collection. Each note is either error-free or contains exactly one medical error of one of five types. Data is stored as Parquet files on the OpenReward platform.
Source: MEDEC GitHub Repository
Tools
| Tool | Description |
|---|---|
| `submit_answer` | Submit error detection results: `error_detected` (bool), `error_sentence` (string), `corrected_sentence` (string). Returns weighted reward and detailed feedback. |
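As a concrete illustration, a `submit_answer` call for a note judged to contain an error might carry a payload like the one below. Only the three field names come from the table above; the clinical sentences are invented examples.

```python
import json

# Illustrative submit_answer payload (field names from the tool table;
# the sentence contents are made-up examples, not real dataset entries).
answer = {
    "error_detected": True,
    "error_sentence": "The patient was started on amoxicillin for suspected viral pharyngitis.",
    "corrected_sentence": "The patient was advised supportive care for suspected viral pharyngitis.",
}
payload = json.dumps(answer)
```

For an error-free note, `error_detected` would be `False` and the two sentence fields would presumably be left empty.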
Time Horizon
MEDEC is a single-turn environment. The agent receives a clinical note and submits one answer.
Environment Difficulty
MEDEC is a challenging benchmark. The original paper reports that Claude 3.5 Sonnet achieves 70.16% accuracy on error flag detection and 65.62% on error sentence detection, while medical doctors outperform all LLMs on these tasks.
Other Environment Requirements
- OpenAI API key: Required for LLM-based grading of sentence identification and correction quality. Pass via `secrets={"openai_api_key": "..."}`.
Safety
MEDEC evaluates an agent's ability to detect errors in clinical notes and should not be used as a substitute for professional medical review. The environment does not involve real patient care or clinical decision-making.
Citations
@inproceedings{BenAbacha2025MEDEC,
author = {Ben Abacha, Asma and Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Xia, Fei and Yetisgen, Meliha},
title = {MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
year = {2025},
url = {https://arxiv.org/abs/2412.19260}
}