ICUSepsis
Description
ICU-Sepsis is an environment for evaluating agents on a tabular Markov Decision Process (MDP) that models sepsis treatment in the intensive care unit. Agents select treatment actions representing combinations of vasopressor and IV fluid doses to maximize patient survival probability. The MDP has 716 discrete states and 25 discrete actions, with transition dynamics derived from the MIMIC-III clinical dataset.
Capabilities
- Sequential clinical treatment decision-making under uncertainty
- Balancing vasopressor and IV fluid dosing across 25 action combinations
- Optimizing sparse binary rewards (survival vs. death)
- Reasoning about admissible actions observed in real clinical data
- Monitoring patient severity via SOFA scores
Compute Requirements
Minimal. The environment is a tabular MDP with no GPU or significant memory requirements.
License
MIT License (original ICU-Sepsis package).
Tasks
There is one split:
- train: 1,000 tasks (seeds 0-999)
Each task uses a unique random seed that determines the initial patient state sampled from the learned initial state distribution. All tasks share the same underlying MDP dynamics (transition probabilities, reward structure).
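The link between a task seed and its initial patient state can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `sample_initial_state` and the four-state `toy_dist` are hypothetical stand-ins, not the package's API or the real 716-state distribution.

```python
import random

def sample_initial_state(seed: int, initial_dist: list[float]) -> int:
    """Draw one initial state index from `initial_dist` using a task-specific seed."""
    rng = random.Random(seed)  # a fresh generator per task keeps tasks reproducible
    states = range(len(initial_dist))
    return rng.choices(states, weights=initial_dist, k=1)[0]

# Hypothetical toy distribution over 4 states (the real one covers 716 states).
toy_dist = [0.1, 0.2, 0.3, 0.4]

# One initial state per task seed; re-running with the same seed gives the same state.
initial_states = [sample_initial_state(seed, toy_dist) for seed in range(5)]
```

Because the generator is rebuilt from the seed each time, two runs of the same task start from the same patient state, while the dynamics after that point remain stochastic.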
Reward Structure
This is a sparse, verifiable reward environment. All intermediate steps yield zero reward. Terminal rewards are:
- Patient survival (state 714): +1.0
- Patient death (state 713): 0.0
The discount factor is 1.0 (undiscounted returns). We do not use LLM graders for this environment.
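As a small worked illustration of this reward structure, the sketch below computes an episode return with a discount factor of 1.0. The function names are illustrative only; the state indices are the ones listed above.

```python
SURVIVAL_STATE = 714  # terminal state with reward +1.0
DEATH_STATE = 713     # terminal state with reward 0.0

def terminal_reward(state: int) -> float:
    """Reward for entering `state`: +1.0 for survival, 0.0 for anything else."""
    return 1.0 if state == SURVIVAL_STATE else 0.0

def episode_return(rewards: list[float], gamma: float = 1.0) -> float:
    """Discounted sum of rewards; with gamma = 1.0 this is the plain sum."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three zero-reward intermediate steps, then the patient survives.
survived = episode_return([0.0, 0.0, 0.0, terminal_reward(SURVIVAL_STATE)])
# Two zero-reward intermediate steps, then the patient dies.
died = episode_return([0.0, 0.0, terminal_reward(DEATH_STATE)])
```

With binary terminal rewards and no discounting, the average return over many episodes equals the fraction of episodes that end in survival.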
Data
MDP parameters (transition matrix, reward matrix, initial state distribution, expert policy) ship with the icu-sepsis pip package as a compressed NumPy archive (dynamics.npz). These parameters were derived from the MIMIC-III clinical dataset using the methodology of Komorowski et al. (2018).
Tools
Agents are given two tools:
- treat(vasopressor_level, iv_fluid_level): Administer treatment by choosing vasopressor dose (0-4) and IV fluid volume (0-4) independently. Returns the new patient state, SOFA score, list of admissible treatments, and current step count.
- info(): Display a reference of the state space, treatment parameters, reward structure, and other environment details.
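The two dose levels passed to treat together select one of the 25 discrete actions. How the pair maps to an action index is not stated here, so the sketch below assumes a row-major layout (index = 5 × vasopressor level + IV fluid level) purely for illustration; `encode_action` and `decode_action` are hypothetical helpers, not part of the environment.

```python
NUM_LEVELS = 5  # each of the two dose parameters takes values 0-4

def encode_action(vasopressor_level: int, iv_fluid_level: int) -> int:
    """Map a (vasopressor, IV fluid) pair to a single index in 0-24.

    ASSUMPTION: row-major ordering; the environment's real ordering may differ.
    """
    for name, level in (("vasopressor_level", vasopressor_level),
                        ("iv_fluid_level", iv_fluid_level)):
        if not 0 <= level < NUM_LEVELS:
            raise ValueError(f"{name} must be in 0-{NUM_LEVELS - 1}, got {level}")
    return NUM_LEVELS * vasopressor_level + iv_fluid_level

def decode_action(action: int) -> tuple[int, int]:
    """Inverse of encode_action: recover (vasopressor_level, iv_fluid_level)."""
    if not 0 <= action < NUM_LEVELS * NUM_LEVELS:
        raise ValueError(f"action must be in 0-24, got {action}")
    return divmod(action, NUM_LEVELS)

# Every pair of levels maps to a distinct index, covering all 25 actions.
all_actions = sorted(encode_action(v, f)
                     for v in range(NUM_LEVELS) for f in range(NUM_LEVELS))
```

Whatever the real ordering is, an agent should still restrict its choices to the admissible treatments returned by each treat call.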
Time Horizon
ICU-Sepsis is a multi-turn environment. Episodes terminate when the patient reaches a survival or death state, or when the maximum step limit (20 steps) is reached. Based on baseline evaluations from the original paper, episodes typically last 9-11 steps.
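The termination rule can be sketched as a simple interaction loop. Here `step_fn` and `policy` are hypothetical callables standing in for the environment transition and the agent, and the sketch assumes that an episode cut off at the step limit receives a return of 0.0, since no terminal reward was reached.

```python
SURVIVAL_STATE = 714
DEATH_STATE = 713
MAX_STEPS = 20

def run_episode(step_fn, policy, initial_state, max_steps=MAX_STEPS):
    """Roll out one episode; return (episode_return, number_of_steps_taken)."""
    state, steps = initial_state, 0
    # Stop on a terminal state or once the step limit is reached.
    while steps < max_steps and state not in (SURVIVAL_STATE, DEATH_STATE):
        state = step_fn(state, policy(state))
        steps += 1
    # ASSUMPTION: a truncated episode earns 0.0, like any non-survival outcome.
    return (1.0 if state == SURVIVAL_STATE else 0.0), steps

def toy_step(state, action):
    """Hypothetical deterministic dynamics: 0 -> 1 -> 2 -> 3 -> survival."""
    return state + 1 if state < 3 else SURVIVAL_STATE

finished = run_episode(toy_step, lambda s: 0, initial_state=0)
truncated = run_episode(lambda s, a: s, lambda s: 0, initial_state=0)
```

In the toy rollout the patient reaches the survival state after four steps, while the second call never leaves its starting state and is cut off at the 20-step limit.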
Environment Difficulty
From the original paper (Choudhary et al., 2024):
| Policy | Avg. Return | Avg. Episode Length |
|---|---|---|
| Random | 0.78 | 9.45 |
| Expert (clinician-derived) | 0.78 | 9.22 |
| Optimal (value iteration) | 0.88 | 10.99 |
The random and expert policies both reach an average return of about 0.78, compared with about 0.88 for the optimal policy. This gap of roughly 0.10 indicates room for improvement, while the high random baseline reflects that most patients survive regardless of treatment in the underlying data.
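To show how an optimal baseline of this kind can be obtained from a tabular MDP, the sketch below runs value iteration on a hypothetical four-state toy problem with undiscounted, survival-only rewards. The transition table and state labels are invented for illustration; this is not the ICU-Sepsis dynamics or the paper's code.

```python
def value_iteration(P, terminal_reward, tol=1e-10, max_iters=10_000):
    """Undiscounted value iteration for an episodic tabular MDP.

    P[s][a] is a list of (probability, next_state) pairs for non-terminal s.
    terminal_reward[s2] is the reward received on entering terminal state s2.
    """
    V = {s: 0.0 for s in P}  # terminal states are absent from P and have value 0

    def q(s, a):
        return sum(p * (terminal_reward.get(s2, 0.0) + V.get(s2, 0.0))
                   for p, s2 in P[s][a])

    for _ in range(max_iters):
        delta = 0.0
        for s in P:
            best = max(q(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
    return V, policy

# Hypothetical toy MDP: states 0 and 1 are patients, 2 is death, 3 is survival.
toy_P = {
    0: {0: [(0.5, 1), (0.5, 2)], 1: [(0.8, 1), (0.2, 2)]},
    1: {0: [(0.9, 3), (0.1, 2)], 1: [(0.6, 3), (0.4, 2)]},
}
toy_rewards = {3: 1.0, 2: 0.0}

values, greedy_policy = value_iteration(toy_P, toy_rewards)
```

In this toy problem the optimal value of state 0 is 0.8 × 0.9 = 0.72, whereas choosing actions uniformly at random yields 0.65 × 0.75 = 0.4875; each value is the survival probability from that state, which is the same quantity the table reports as average return.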
Other Environment Requirements
There are no further environment requirements; ICU-Sepsis works out of the box without any secrets or API keys.
Safety
Agents interact only with a tabular MDP simulation derived from anonymized clinical records (MIMIC-III). There is no access to real patient data, external systems, or the internet during task execution.
Citations
@inproceedings{choudhary2024icusepsis,
title={{ICU-Sepsis}: A Benchmark {MDP} Built from Real Medical Data},
author={Kartik Choudhary and Dhawal Gupta and Philip S. Thomas},
booktitle={Reinforcement Learning Conference},
year={2024},
url={https://arxiv.org/abs/2406.05646}
}