HealthBench
HealthBench
Description
HealthBench is an environment for evaluating agents on healthcare conversations. It is based on the HealthBench benchmark from OpenAI, consisting of 5,000 multi-turn conversations between a model and a user or healthcare professional. Each conversation is graded against physician-created rubric criteria spanning diverse health contexts and behavioral dimensions. An LLM grader (gpt-4.1) evaluates each rubric item independently.
Capabilities
- Answering healthcare questions across diverse medical contexts
- Multi-turn medical dialogue comprehension
- Handling safety-critical health scenarios
- Demonstrating accuracy, communication quality, and instruction following
Compute Requirements
HealthBench does not require a sandbox. It has minimal compute requirements.
License
MIT.
Tasks
There is one split: test (5,000 tasks). Each task presents a multi-turn conversation prompt containing user and system messages. The agent must provide a response that is graded against multiple rubric criteria (48,562 unique criteria total, median 11 per task). Each rubric item has an associated point value and tags for categorization.
Reward Structure
This is a sparse reward environment with continuous scoring. The agent calls the answer tool once with its response, and the environment grades it using an LLM grader (gpt-4.1). Each rubric item is graded independently for whether its criterion is met. The overall score is the fraction of achieved points out of total possible positive points:
Scores range from 0.0 to 1.0.
We do not use LLM graders from the gpt-5-mini family for this task; grading uses gpt-4.1 to match the original HealthBench evaluation methodology.
Data
Conversations are sourced from the HealthBench benchmark by OpenAI, which includes 5,000 conversations with 48,562 unique rubric criteria created by 262 physician evaluators. Data files are stored on the OpenReward platform.
Tools
Agents are given a single tool:
answer: Submit an answer to the healthcare conversation. The answer is graded by the LLM grader against the rubric criteria. Returns the overall score. This tool can only be called once per task.
Time Horizon
HealthBench is a single-turn environment. The agent receives a conversation prompt and submits one answer. Each task requires exactly one tool call.
Environment Difficulty
Model performance on HealthBench from the original paper:
| Model | Score |
|---|---|
| GPT-3.5 Turbo | 16% |
| GPT-4o | 32% |
| o1 | 42% |
| GPT-4.1 | 48% |
| o3 | 60% |
Frontier models have improved significantly over time, but substantial headroom remains. On HealthBench Hard (a challenging 1,000-example subset).
Other Environment Requirements
HealthBench requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
Safety
Agents in HealthBench are asked to respond to healthcare conversations. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet. However, the content involves medical topics and responses should be evaluated in that context.
Citations
@article{arora2025healthbench,
title={HealthBench: Evaluating Large Language Models Towards Improved Human Health},
author={Arora, Rahul K. and Wei, Jason and Hicks, Rebecca Soskin and Bowman, Preston and Qui{\~n}onero-Candela, Joaquin and Tsimpourlas, Foivos and Sharman, Michael and Shah, Meghan and Vallone, Andrea and Beutel, Alex and Heidecke, Johannes and Singhal, Karan},
journal={arXiv preprint arXiv:2505.08775},
year={2025}
}