BrowseComp

OpenReward Environment

Description

BrowseComp is an environment for evaluating web search reasoning capabilities. Based on OpenAI's simple-evals benchmark, it contains 1,266 encrypted research questions that require multi-hop reasoning and cannot be answered without current web information. The environment provides built-in web search and URL fetching tools powered by Tavily.

Capabilities

  • Multi-hop web search reasoning
  • Information retrieval and synthesis
  • Research question answering
  • Confidence calibration

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

There is one split in this environment:

  • test: 1,266 encrypted research questions

Questions require multi-hop reasoning across multiple web searches. Example: "What was the name of the 1995 film starring the actress who played Victoria that married the final scorer of the World Cup 1998 winner?"
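The questions are distributed encrypted. In OpenAI's simple-evals, the plaintext is recovered by XORing the base64-decoded ciphertext against a keystream stretched from a SHA-256 digest of a per-example canary password. A minimal sketch of that scheme (the `encrypt` helper is not part of simple-evals; it is added here only to demonstrate the round trip):

```python
import base64
import hashlib

def derive_key(password: str, length: int) -> bytes:
    """Stretch the SHA-256 digest of the password into a keystream of `length` bytes."""
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    """XOR the base64-decoded ciphertext against the derived keystream."""
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()

def encrypt(plaintext: str, password: str) -> str:
    """Inverse of decrypt (XOR is symmetric); included only for a round-trip check."""
    raw = plaintext.encode()
    key = derive_key(password, len(raw))
    return base64.b64encode(bytes(a ^ b for a, b in zip(raw, key))).decode()
```

Because XOR is its own inverse, decrypting an encrypted question with the same canary string returns the original text.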

Reward Structure

This is a sparse reward environment with LLM-based grading:

  1. Agent receives a research question
  2. Agent uses web_search and fetch_url tools to gather information
  3. Agent submits answer with explanation, exact_answer, and confidence
  4. An LLM grader (gpt-5-mini) evaluates semantic equivalence
  5. Binary reward: 1.0 if correct, 0.0 if incorrect
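The grading step above can be sketched as a single LLM call that checks semantic equivalence and maps the verdict to a binary reward. The prompt wording and the `grade_answer` helper below are illustrative assumptions, not the environment's actual implementation; the client call follows the OpenAI chat completions interface:

```python
# Illustrative sketch of binary LLM grading; not the environment's actual code.
GRADER_PROMPT = """Question: {question}
Reference answer: {reference}
Submitted answer: {submitted}

Is the submitted answer semantically equivalent to the reference answer?
Reply with exactly "yes" or "no"."""

def grade_answer(client, question: str, reference: str, submitted: str) -> float:
    """Return 1.0 if the grader judges the answers equivalent, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-5-mini",  # grader model named in this README
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, submitted=submitted)}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```

Any answer the grader does not affirm scores 0.0, which is what makes the reward sparse: partial progress through the search earns nothing.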

Data

The questions are sourced from OpenAI's BrowseComp benchmark and stored on the OpenReward platform.

Tools

  • web_search: Search the web using Tavily (returns titles, URLs, snippets)
  • fetch_url: Fetch full content from a URL (truncated to 8,000 characters)
  • submit_answer: Submit answer with explanation, exact_answer, and confidence
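As a sketch, the three tools might be exposed to a tool-calling model with JSON schemas like the following. The parameter names and descriptions are assumptions for illustration; the environment's actual schemas may differ:

```python
# Hypothetical tool schemas; parameter names are assumptions, not OpenReward's spec.
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web via Tavily; returns titles, URLs, snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "fetch_url",
        "description": "Fetch page content from a URL, truncated to 8,000 characters.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
    {
        "name": "submit_answer",
        "description": "Submit the final answer and end the episode.",
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "exact_answer": {"type": "string"},
                "confidence": {"type": "string"},
            },
            "required": ["explanation", "exact_answer", "confidence"],
        },
    },
]
```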

Time Horizon

Multi-turn. Agents can perform multiple web searches before submitting a final answer.
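The multi-turn interaction can be sketched as a simple tool-use loop: the agent issues search and fetch calls until it calls submit_answer or exhausts a turn budget. Every name below (`env.execute`, the message shapes, the turn cap) is a hypothetical stand-in, not the OpenReward API:

```python
def run_episode(agent, env, max_turns: int = 30):
    """Drive one episode: search/fetch until submit_answer or the turn cap."""
    messages = [{"role": "user", "content": env.question}]
    for _ in range(max_turns):
        call = agent(messages)  # agent returns one tool call per turn
        if call["name"] == "submit_answer":
            return call["arguments"]  # explanation, exact_answer, confidence
        observation = env.execute(call["name"], call["arguments"])
        messages.append({"role": "tool", "name": call["name"], "content": observation})
    return None  # turn budget exhausted without a submission
```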

Environment Difficulty

  • Gemini 3.1 Pro (search, Python, browse): 85.9%
  • Claude Opus 4.6: 84.0%
  • Kimi K2.5 (agent swarm): 78.4%
  • MiniMax M2.5: 76.3%
  • GLM-5 (with context management): 75.9%

This benchmark requires persistent multi-hop web navigation to find hard-to-find, entangled information.

Other Environment Requirements

  • OpenAI API key required for LLM-based grading
  • Tavily API key required for web search

Pass via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.

Safety

Agents in BrowseComp perform web searches and answer research questions. The environment does not present direct safety risks. Per OpenAI's request, do not share decrypted questions publicly.

Citation

@article{wei2025browsecomp,
  title={BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
  author={Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia},
  journal={arXiv preprint arXiv:2504.12516},
  year={2025},
  url={https://arxiv.org/abs/2504.12516}
}