HealthBench

Description

HealthBench is an environment for evaluating agents on healthcare conversations. It is based on the HealthBench benchmark from OpenAI, consisting of 5,000 multi-turn conversations between a model and a user or healthcare professional. Each conversation is graded against physician-created rubric criteria spanning diverse health contexts and behavioral dimensions. An LLM grader (gpt-4.1) evaluates each rubric item independently.

Capabilities

Answering healthcare questions across diverse medical contexts
Multi-turn medical dialogue comprehension
Handling safety-critical health scenarios
Demonstrating accuracy, communication quality, and instruction following

Compute Requirements

HealthBench does not require a sandbox. It has minimal compute requirements.

License

MIT.

Tasks

There is one split: test (5,000 tasks). Each task presents a multi-turn conversation prompt containing user and system messages. The agent must provide a response that is graded against multiple rubric criteria (48,562 unique criteria total, median 11 per task). Each rubric item has an associated point value and tags for categorization.

Reward Structure

This is a sparse reward environment with continuous scoring. The agent calls the answer tool once with its response, and the environment grades it using an LLM grader (gpt-4.1). Each rubric item is graded independently for whether its criterion is met. The overall score is the fraction of achieved points out of total possible positive points:

$\text{Reward} = \frac{\text{achieved points}}{\text{total possible points}}$

Scores range from 0.0 to 1.0.

We do not use LLM graders from the gpt-5-mini family for this task; grading uses gpt-4.1 to match the original HealthBench evaluation methodology.

Data

Conversations are sourced from the HealthBench benchmark by OpenAI, which includes 5,000 conversations with 48,562 unique rubric criteria created by 262 physician evaluators. Data files are stored on the OpenReward platform.

Tools

Agents are given a single tool:

answer: Submit an answer to the healthcare conversation. The answer is graded by the LLM grader against the rubric criteria. Returns the overall score. This tool can only be called once per task.

Time Horizon

HealthBench is a single-turn environment. The agent receives a conversation prompt and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Model performance on HealthBench from the original paper:

Model	Score
GPT-3.5 Turbo	16%
GPT-4o	32%
o1	42%
GPT-4.1	48%
o3	60%

Frontier models have improved significantly over time, but substantial headroom remains. On HealthBench Hard (a challenging 1,000-example subset).

Other Environment Requirements

HealthBench requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.

Safety

Agents in HealthBench are asked to respond to healthcare conversations. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet. However, the content involves medical topics and responses should be evaluated in that context.

Citations

@article{arora2025healthbench,
  title={HealthBench: Evaluating Large Language Models Towards Improved Human Health},
  author={Arora, Rahul K. and Wei, Jason and Hicks, Rebecca Soskin and Bowman, Preston and Qui{\~n}onero-Candela, Joaquin and Tsimpourlas, Foivos and Sharman, Michael and Shah, Meghan and Vallone, Andrea and Beutel, Alex and Heidecke, Johannes and Singhal, Karan},
  journal={arXiv preprint arXiv:2505.08775},
  year={2025}
}

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

HealthBench

GeneralReasoning/HealthBench

HealthBench

Description

Capabilities

Compute Requirements

License

Tasks

Reward Structure

Data

Tools

Time Horizon

Environment Difficulty

Other Environment Requirements

Safety

Citations

Tools

Compute Configuration

Estimated Cost

Examples