BixBench

Description

BixBench is an environment for evaluating AI agents on real-world bioinformatics computational analysis tasks. Its tasks are built from Code Ocean capsules containing published bioinformatics analyses. Agents are given access to the biological datasets and must answer hypothesis-driven research questions through multi-step analytical trajectories.

Capabilities

  • Exploring and analyzing biological datasets using CLI tools
  • Writing and executing bioinformatics analysis code
  • Interpreting results from genomic, transcriptomic, and proteomic analyses
  • Multi-step computational biology reasoning

Compute Requirements

Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
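
As a quick sanity check before starting an analysis, a script can confirm that the listed tools are actually on the PATH. The following is a minimal sketch, not part of the environment itself: the helper name `missing_tools` is hypothetical, and only the four tool names come from the text above.

```python
import shutil
from typing import Iterable, List

# Tool names taken from the Compute Requirements text above.
EXPECTED_TOOLS = ("samtools", "bcftools", "bedtools", "tabix")


def missing_tools(tools: Iterable[str] = EXPECTED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

Inside the sandbox the list should come back empty. On a local machine it simply reports which tools still need installing.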

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 205 bioinformatics analysis tasks

Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.

Reward Structure

This is a multi-turn environment with binary reward at submission:

  • 1.0 — Correct answer
  • 0.0 — Incorrect answer

Evaluation uses two modes depending on the task:

  • String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
  • Range verifier: Numeric proximity check with distractor-based tolerance

Exact matches are checked first to avoid unnecessary LLM calls.
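
The grading order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the environment's actual verifier code: the function names, the whitespace normalization, and the `semantic_judge` callable (standing in for the LLM fallback) are all hypothetical. How the tolerance is derived from distractors is not specified here, so the range check simply takes an already-computed interval.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.strip().lower().split())


def string_reward(
    submitted: str,
    target: str,
    semantic_judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Binary reward: exact case-insensitive match first, then optional fallback."""
    if normalize(submitted) == normalize(target):
        return 1.0  # exact match, so the judge is never called
    if semantic_judge is not None and semantic_judge(submitted, target):
        return 1.0
    return 0.0


def range_reward(submitted: str, low: float, high: float) -> float:
    """Binary reward: 1.0 if the submitted number lies in [low, high]."""
    try:
        value = float(submitted.strip())
    except ValueError:
        return 0.0  # non-numeric submissions score zero
    return 1.0 if low <= value <= high else 0.0
```

Keeping the judge as an injected callable means the exact-match path can be tested without any API access.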

Data

Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
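
To see which capsules are available, one option is to list the mount point. The sketch below assumes each capsule is a subdirectory of the mount, which is not confirmed by this README. The helper name `list_capsules` is hypothetical, and only the path comes from the text above.

```python
from pathlib import Path
from typing import List

# Production mount point quoted in the Data section above.
CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules/")


def list_capsules(root: Path = CAPSULE_ROOT) -> List[str]:
    """Return sorted subdirectory names under root (assumed one per capsule)."""
    if not root.is_dir():
        return []  # mount not present, e.g. outside production
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_capsules():
        print(name)
```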

Source: futurehouse/BixBench

Tools

Tool           Description
submit_answer  Submit your answer for binary evaluation.
bash           Execute shell commands.
glob           Find files by pattern.
grep           Search file contents.
ls             List directory contents.
read           Read file contents.
write          Write to files.
edit           Edit existing files.
multi_edit     Apply multiple edits to a file.
todo_write     Track task progress.

Time Horizon

BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.
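
The shape of such an episode can be sketched as a simple loop. Everything here is hypothetical scaffolding for illustration: `choose_action` stands in for the agent, `call_tool` stands in for the environment, the argument dictionaries are invented, and the `max_turns` cap is an assumption rather than a documented limit. Only the tool names `bash` and `submit_answer` come from the Tools table.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, Dict[str, str]]       # (tool name, arguments)
Turn = Tuple[str, Dict[str, str], str]    # (tool name, arguments, result)


def run_episode(
    choose_action: Callable[[List[Turn]], Action],
    call_tool: Callable[[str, Dict[str, str]], str],
    max_turns: int = 50,
) -> List[Turn]:
    """Run tool calls until submit_answer is called or max_turns is reached."""
    history: List[Turn] = []
    for _ in range(max_turns):
        tool, args = choose_action(history)
        result = call_tool(tool, args)
        history.append((tool, args, result))
        if tool == "submit_answer":
            break  # the episode ends at submission
    return history
```

In this sketch the returned history ends with the `submit_answer` turn when the agent submits. If it runs out of turns, no submission appears at all.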

Environment Difficulty

Model performance on BixBench from the original paper (open-answer setting):

Model              Accuracy
Claude 3.5 Sonnet  17%
GPT-4o             9%

In the multiple-choice setting, even these frontier models performed no better than random, indicating that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.

Other Environment Requirements

  • OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
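
Rather than hard-coding the key, the secrets mapping can be built from the process environment. This is a minimal sketch: the helper name `build_secrets` and the `OPENAI_API_KEY` variable name are assumptions, and only the `openai_api_key` dictionary key comes from the requirement above. The call that receives `secrets=` is not shown because its name is not given here.

```python
import os
from typing import Dict, Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the secrets mapping from an environment-like mapping."""
    # OPENAI_API_KEY is an assumed variable name, not mandated by BixBench.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"openai_api_key": key}
```

Failing early on a missing key surfaces the problem before the first string-verified task needs the LLM fallback.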

Safety

Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.

Citations

@article{mitchener2025bixbench,
  author    = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
  title     = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
  journal   = {arXiv preprint arXiv:2503.00096},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.00096}
}