HealthBench

Description

HealthBench is a rubric-driven benchmark for evaluating LLMs and agentic, RAG-based clinical support assistants on their ability to give high-quality, accurate, situationally aware answers to open-ended clinical questions. Responses are scored along behavioral axes such as accuracy, completeness, instruction-following, contextual reasoning, and uncertainty handling. The benchmark consists of expert-annotated, open-ended health conversations, including a Hard subset of 1,000 challenging examples, and is designed for behavior-level, rubric-based scoring.
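
To make "rubric-based scoring" concrete, here is a minimal sketch of how a single response might be scored against a rubric. It assumes each rubric criterion carries a point value (positive for desired behavior, negative for penalized behavior) and a met/not-met judgment from a grader, and that the score is earned points divided by the maximum positive points, clipped to [0, 1]. The `Criterion` class, the `rubric_score` function, the example criteria, and the per-response clipping are all illustrative assumptions, not taken from this page or from any HealthBench implementation.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One hypothetical rubric item (names and fields are illustrative)."""
    description: str
    points: int  # positive = desired behavior, negative = penalized behavior
    met: bool    # grader's judgment for this response


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points over maximum positive points, clipped to [0, 1] (assumed scheme)."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up criteria for a single response; not real benchmark data.
example = [
    Criterion("Advises seeking emergency care for chest pain", 10, True),
    Criterion("Asks about relevant medical history", 5, False),
    Criterion("States a specific dosage without enough context", -5, True),
]
score = rubric_score(example)
print(f"{score:.3f}")
```

In this made-up case the response earns 10 points, loses 5 for the penalized behavior, and is measured against a maximum of 15, so a met negative criterion lowers the score without changing the denominator.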

Implementations (1)
Environment                   Stars  Last Updated
GeneralReasoning/HealthBench  1      1 month ago