BrowseComp

OpenReward Environment

Description

BrowseComp is an environment for evaluating web search reasoning capabilities. Based on OpenAI's simple-evals benchmark, it contains 1,266 encrypted research questions that require multi-hop reasoning and cannot be answered without current web information. The environment provides built-in web search and URL fetching tools powered by Tavily.

Capabilities

  • Multi-hop web search reasoning
  • Information retrieval and synthesis
  • Research question answering
  • Confidence calibration

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

There is one split in this environment:

  • test: 1,266 encrypted research questions

Questions require multi-hop reasoning across multiple web searches. Example: "What was the name of the 1995 film starring the actress who played Victoria that married the final scorer of the World Cup 1998 winner?"
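The questions are distributed encrypted. In OpenAI's simple-evals, the plaintext is recovered by XORing the base64-decoded ciphertext against a keystream stretched from a SHA-256 digest of a per-example canary password. A minimal sketch of that scheme (the `encrypt` helper is not part of simple-evals; it is added here only to demonstrate the round trip):

```python
import base64
import hashlib

def derive_key(password: str, length: int) -> bytes:
    """Stretch the SHA-256 digest of the password into a keystream of `length` bytes."""
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    """XOR the base64-decoded ciphertext against the derived keystream."""
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()

def encrypt(plaintext: str, password: str) -> str:
    """Inverse of decrypt (XOR is symmetric); included only for a round-trip check."""
    raw = plaintext.encode()
    key = derive_key(password, len(raw))
    return base64.b64encode(bytes(a ^ b for a, b in zip(raw, key))).decode()
```

Because XOR is its own inverse, decrypting an encrypted question with the same canary string returns the original text.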

Reward Structure

This is a sparse reward environment with LLM-based grading:

  1. Agent receives a research question
  2. Agent uses web_search and fetch_url tools to gather information
  3. Agent submits answer with explanation, exact_answer, and confidence
  4. An LLM grader (gpt-5-mini) evaluates semantic equivalence
  5. Binary reward: 1.0 if correct, 0.0 if incorrect
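The grading step above can be sketched as a single LLM call that checks semantic equivalence and maps the verdict to a binary reward. The prompt wording and the `grade_answer` helper below are illustrative assumptions, not the environment's actual implementation; the client call follows the OpenAI chat completions interface:

```python
# Illustrative sketch of binary LLM grading; not the environment's actual code.
GRADER_PROMPT = """Question: {question}
Reference answer: {reference}
Submitted answer: {submitted}

Is the submitted answer semantically equivalent to the reference answer?
Reply with exactly "yes" or "no"."""

def grade_answer(client, question: str, reference: str, submitted: str) -> float:
    """Return 1.0 if the grader judges the answers equivalent, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-5-mini",  # grader model named in this README
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, submitted=submitted)}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```

Any answer the grader does not affirm scores 0.0, which is what makes the reward sparse: partial progress through the search earns nothing.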

Data

The questions are sourced from OpenAI's BrowseComp benchmark and stored on the OpenReward platform.

Tools

  • web_search: Search the web using Tavily (returns titles, URLs, snippets)
  • fetch_url: Fetch full content from a URL (truncated to 8,000 characters)
  • submit_answer: Submit answer with explanation, exact_answer, and confidence
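As a sketch, the three tools might be exposed to a tool-calling model with JSON schemas like the following. The parameter names and descriptions are assumptions for illustration; the environment's actual schemas may differ:

```python
# Hypothetical tool schemas; parameter names are assumptions, not OpenReward's spec.
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web via Tavily; returns titles, URLs, snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "fetch_url",
        "description": "Fetch page content from a URL, truncated to 8,000 characters.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
    {
        "name": "submit_answer",
        "description": "Submit the final answer and end the episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "exact_answer": {"type": "string"},
                "confidence": {"type": "string"},
            },
            "required": ["explanation", "exact_answer", "confidence"],
        },
    },
]
```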

Time Horizon

Multi-turn. Agents can perform multiple web searches before submitting a final answer.
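The multi-turn interaction can be sketched as a simple tool-use loop: the agent issues search and fetch calls until it calls submit_answer or exhausts a turn budget. Every name below (`env.execute`, the message shapes, the turn cap) is a hypothetical stand-in, not the OpenReward API:

```python
def run_episode(agent, env, max_turns: int = 30):
    """Drive one episode: search/fetch until submit_answer or the turn cap."""
    messages = [{"role": "user", "content": env.question}]
    for _ in range(max_turns):
        call = agent(messages)  # agent returns one tool call per turn
        if call["name"] == "submit_answer":
            return call["arguments"]  # explanation, exact_answer, confidence
        observation = env.execute(call["name"], call["arguments"])
        messages.append({"role": "tool", "name": call["name"], "content": observation})
    return None  # turn budget exhausted without a submission
```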

Environment Difficulty

  • Gemini 3.1 Pro (search, Python, browse): 85.9%
  • Claude Opus 4.6: 84.0%
  • Kimi K2.5 (agent swarm): 78.4%
  • MiniMax M2.5: 76.3%
  • GLM-5 (with context management): 75.9%

This benchmark requires persistent multi-hop web navigation to find hard-to-find, entangled information.

Other Environment Requirements

  • OpenAI API key required for LLM-based grading
  • Tavily API key required for web search

Pass via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.

Safety

Agents in BrowseComp perform web searches and answer research questions. The environment does not present direct safety risks. Per OpenAI's request, do not share decrypted questions publicly.

Citation

@article{wei2025browsecomp,
  title={BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
  author={Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia},
  journal={arXiv preprint arXiv:2504.12516},
  year={2025},
  url={https://arxiv.org/abs/2504.12516}
}