Implementation of arXiv/bixbench

BixBench

Description

BixBench is an environment for evaluating AI agents on real-world bioinformatics computational analysis tasks. Tasks are built from Code Ocean capsules containing published bioinformatics analyses: agents are given access to the biological datasets and must answer hypothesis-driven research questions through multi-step analytical trajectories.

Capabilities

Exploring and analyzing biological datasets using CLI tools
Writing and executing bioinformatics analysis code
Interpreting results from genomic, transcriptomic, and proteomic analyses
Multi-step computational biology reasoning

Compute Requirements

Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
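As a quick sanity check inside the sandbox, the listed tools can be looked up on `PATH`. The following is a minimal sketch, not part of the environment itself; the function name `find_tools` and the constant `EXPECTED_TOOLS` are illustrative names introduced here.

```python
import shutil

# Tool names taken from the list above.
EXPECTED_TOOLS = ("samtools", "bcftools", "bedtools", "tabix")


def find_tools(names=EXPECTED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

Running this at the start of a session shows which tools are actually available before any analysis is attempted.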

License

Apache 2.0.

Tasks

There is one split in this environment:

Test: 205 bioinformatics analysis tasks

Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.

Reward Structure

This is a multi-turn environment with a binary reward assigned at submission:

1.0 — Correct answer
0.0 — Incorrect answer

Evaluation uses two modes depending on the task:

String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
Range verifier: Numeric proximity check with distractor-based tolerance

Exact matches are checked first to avoid unnecessary LLM calls.
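The two verification modes can be sketched roughly as follows. This is an illustration under stated assumptions, not the environment's actual grader: the function names, the `llm_judge` callback (standing in for the LLM semantic fallback), and the tolerance rule (half the distance from the target to the nearest distractor) are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def verify_string(
    answer: str,
    target: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Return 1.0 if the answer matches the target, else 0.0."""
    # Cheap exact check first (case-insensitive, surrounding whitespace ignored).
    if answer.strip().lower() == target.strip().lower():
        return 1.0
    # Only consult the semantic judge when the exact check fails.
    # `llm_judge` is a hypothetical callback standing in for the LLM call.
    if llm_judge is not None and llm_judge(answer, target):
        return 1.0
    return 0.0


def verify_range(answer: str, target: float, distractors: Sequence[float]) -> float:
    """Return 1.0 if the numeric answer is close enough to the target, else 0.0."""
    try:
        value = float(answer)
    except ValueError:
        return 0.0
    # Assumed tolerance rule: accept answers within half the distance
    # between the target and its nearest distractor.
    gaps = [abs(d - target) for d in distractors if d != target]
    tolerance = min(gaps) / 2 if gaps else 0.0
    return 1.0 if abs(value - target) <= tolerance else 0.0
```

Ordering the exact comparison before the judge call means a verbatim correct answer never triggers an LLM request.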

Data

Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
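To see which capsules are present at the mount point, a short directory listing is enough. The sketch below assumes each capsule appears as a subdirectory of the mount path, which the text above does not confirm; `list_capsules` is an illustrative helper, not part of the environment.

```python
from pathlib import Path

# Production mount point quoted above.
CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules")


def list_capsules(root=CAPSULE_ROOT):
    """Return the sorted names of subdirectories under `root`.

    Assumes one subdirectory per capsule; returns an empty list
    when the root does not exist (for example, outside production).
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```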

Source: futurehouse/BixBench

Tools

Tool	Description
`submit_answer`	Submit your answer for binary evaluation.
`bash`	Execute shell commands.
`glob`	Find files by pattern.
`grep`	Search file contents.
`ls`	List directory contents.
`read`	Read file contents.
`write`	Write to files.
`edit`	Edit existing files.
`multi_edit`	Apply multiple edits to a file.
`todo_write`	Track task progress.

Time Horizon

BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.

Environment Difficulty

Model performance on BixBench from the original paper (open-answer setting):

Model	Accuracy
Claude 3.5 Sonnet	17%
GPT-4o	9%

Even frontier models achieved no better than random accuracy in the multiple-choice setting, indicating that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.

Other Environment Requirements

OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
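One way to avoid hard-coding the key is to build the `secrets` dictionary from an environment variable. This is a minimal sketch: the variable name `OPENAI_API_KEY` and the helper `build_secrets` are assumptions made for illustration, and only the `{"openai_api_key": ...}` shape comes from the text above.

```python
import os


def build_secrets(env=os.environ):
    """Build the secrets mapping, failing early if the key is missing."""
    # OPENAI_API_KEY is an assumed variable name, not mandated by the environment.
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM fallback grading needs it.")
    return {"openai_api_key": key}
```

The returned dictionary can then be passed as the `secrets=` argument described above.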

Safety

Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.

Citations

@article{mitchener2025bixbench,
  author    = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
  title     = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
  journal   = {arXiv preprint arXiv:2503.00096},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.00096}
}

Repository

Source repository

EnvCommons/BixBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550

Examples

Session	Cost
5-minute session	$0.0165
1-hour session	$0.1980
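The example figures follow directly from the per-second rates: the combined rate multiplied by the session length in seconds. A small sketch of that arithmetic, where `session_cost` is an illustrative helper name:

```python
# Per-second rates from the cost table above (USD).
ENVIRONMENT_RATE = 0.0000320
SANDBOX_RATE = 0.0000230


def session_cost(seconds):
    """Total cost in USD for a session of the given length in seconds."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


print(f"${session_cost(5 * 60):.4f}")   # 5-minute session -> $0.0165
print(f"${session_cost(60 * 60):.4f}")  # 1-hour session   -> $0.1980
```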
