LAB-Bench

Description

LAB-Bench (Language Agent Biology Benchmark) is an environment for evaluating language agents on practical scientific research tasks. It contains over 2,400 multiple-choice questions across 8 sub-environments designed to assess capabilities including literature search, protocol planning, data analysis, figure interpretation, database navigation, and sequence manipulation. Unlike traditional science benchmarks focused on textbook knowledge, LAB-Bench measures performance on real-world research tasks that would make an AI system useful as a scientific assistant.

Capabilities

  • Scientific literature retrieval and RAG
  • Supplementary material interpretation
  • Scientific figure and table comprehension
  • Biological database query and navigation
  • Protocol analysis and troubleshooting
  • DNA/RNA sequence analysis and manipulation
  • Molecular cloning workflow planning

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY-SA 4.0.

Tasks

LAB-Bench contains 8 sub-environments:

| Sub-Environment | Description |
| --- | --- |
| LitQA2 | Scientific literature RAG questions |
| SuppQA | Supplementary material interpretation |
| FigQA | Scientific figure comprehension |
| TableQA | Scientific table interpretation |
| DbQA | Biological database queries (10 subtasks) |
| ProtocolQA | Protocol analysis and troubleshooting |
| SeqQA | Sequence analysis (15 subtasks) |
| CloningScenarios | Molecular cloning workflows |

Each sub-environment has a test split.

Reward Structure

This is a single-turn environment. The agent submits an answer via the answer tool. An LLM grader evaluates correctness against the reference answer. Reward is binary: 1.0 if correct, 0.0 if incorrect.
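The binary reward can be illustrated with a minimal sketch. Note that the environment itself uses an LLM grader; the code below swaps in plain letter matching as a stand-in so the example runs offline. The function names `extract_choice` and `binary_reward` are hypothetical and not part of the environment.

```python
import re
from typing import Optional


def extract_choice(submission: str) -> Optional[str]:
    """Return the option letter if the submission is a bare letter such as
    "B", "(b)" or "B.", otherwise None. (Stand-in for the LLM grader.)"""
    match = re.fullmatch(r"\(?([A-D])\)?[.:]?", submission.strip().upper())
    return match.group(1) if match else None


def binary_reward(submission: str, reference: str) -> float:
    """Return 1.0 if the submitted letter equals the reference letter, else 0.0."""
    choice = extract_choice(submission)
    if choice is not None and choice == reference.strip().upper():
        return 1.0
    return 0.0
```

Because the episode ends as soon as an answer is submitted, there is no partial credit and no second attempt: each task yields exactly one reward of 0.0 or 1.0.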

Data

Data consists of JSONL files sourced from the Hugging Face dataset futurehouse/lab-bench. Each task is a multiple-choice question whose options are presented in randomized order. Data is stored on the OpenReward platform.
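As a rough sketch of how a JSONL record might be turned into a question with randomized options: the field names `question`, `ideal`, and `distractors` are assumptions about the record layout, and `build_task` and `load_tasks` are hypothetical helpers, not the environment's own loader.

```python
import json
import random


def build_task(record: dict, rng: random.Random) -> dict:
    """Mix the correct answer in with the distractors and label the options A, B, C, ..."""
    # Assumed fields: "question", "ideal" (correct answer), "distractors" (wrong answers).
    options = [record["ideal"]] + list(record["distractors"])
    rng.shuffle(options)
    letters = [chr(ord("A") + i) for i in range(len(options))]
    return {
        "question": record["question"],
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(record["ideal"])],
    }


def load_tasks(jsonl_text: str, seed: int = 0) -> list:
    """Parse one JSON object per non-empty line; a fixed seed keeps the shuffle reproducible."""
    rng = random.Random(seed)
    return [
        build_task(json.loads(line), rng)
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
```

Seeding the random generator is a design choice of this sketch: it keeps the option order stable across runs, which makes results comparable between agents.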

Tools

| Tool | Description |
| --- | --- |
| answer | Submit your multiple-choice answer (A, B, C, or D). Ends the episode. |

Time Horizon

Single-turn. The agent reads the scientific question and submits one answer.

Environment Difficulty

LAB-Bench evaluates practical scientific research capabilities rather than textbook recall. The original paper reports the following results for six of the eight sub-environments (DbQA and SeqQA are not listed here):

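| Sub-Environment | Human | Claude-3.5-Sonnet | GPT-4o |
| --- | --- | --- | --- |
| LitQA2 (Precision) | 73.8% | 37.7% | 44.6% |
| SuppQA | 86% | 75% | 47% |
| FigQA | 82% | 54% | 30% |
| TableQA | 87% | 90% | 75% |
| ProtocolQA | 87% | 66% | 56% |
| CloningScenarios | 73% | 54% | 37% |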

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."}.
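One way to build the secrets mapping without hard-coding the key is to read it from the process environment. This is a minimal sketch: the variable name `OPENAI_API_KEY` and the helper `build_secrets` are assumptions, and how the resulting dictionary is handed to the OpenReward client is not shown here.

```python
import os


def build_secrets(env=os.environ) -> dict:
    """Build the secrets mapping from an environment-like mapping.

    Assumes the key is stored under OPENAI_API_KEY (hypothetical choice)."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before creating the environment")
    return {"openai_api_key": key}
```

The returned dictionary has the same shape as the secrets={"openai_api_key": "..."} argument described above.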

Safety

Agents in LAB-Bench answer scientific research questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{laurent2024lab,
  title={LAB-Bench: Measuring Capabilities of Language Models for Biology Research},
  author={Laurent, Jon M. and Janizek, Joseph D. and Ruzo, Michael and Hinks, Michaela M. and Hammerling, Michael J. and Narayanan, Siddharth and Ponnapati, Manvitha and White, Andrew D. and Rodriques, Samuel G.},
  journal={arXiv preprint arXiv:2407.10362},
  year={2024}
}