BrowseComp-ZH

Description

BrowseComp-ZH is a Chinese multi-hop reasoning benchmark in which agents answer challenging research questions in Chinese by searching the web, extracting content from URLs, and submitting answers that are graded by an LLM judge. The dataset spans 11 domains: economics, sports, entertainment, science, history, geography, culture, technology, politics, health, and education. Questions and answers are encrypted at rest with an XOR cipher and decrypted at runtime.

Capabilities

  • Chinese language comprehension and reasoning
  • Multi-hop web research across diverse domains
  • Synthesizing information from multiple web sources
  • Formulating precise answers in Chinese with supporting explanations

Compute Requirements

The environment does not provide a sandbox or filesystem access. Web search and URL fetching are handled through the Tavily API.

License

Apache 2.0.

Tasks

There are 289 tasks in a single test split. Each task pairs a Chinese research question with a topic label from one of 11 domains: economics, sports, entertainment, science, history, geography, culture, technology, politics, health, and education. Questions are designed to require multi-hop reasoning: agents must connect information from multiple web sources to arrive at the correct answer.

Reward Structure

This environment uses sparse, binary rewards. The agent receives a reward of 1.0 for a correct answer and 0.0 for an incorrect answer. Reward is issued only when the agent calls submit_answer, which ends the episode.

Correctness is determined by an LLM grader (gpt-5-mini) that compares the agent's submitted answer against the ground-truth answer. The grader accounts for Chinese language nuances including character variations, synonyms, different word orders, minor formatting differences, and small rounding differences for numerical answers.
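
The grading step described above can be pictured as a judge prompt plus a verdict-to-reward mapping. The following is a minimal sketch only: the prompt wording, the one-word CORRECT/INCORRECT verdict format, and the function names build_grader_prompt and verdict_to_reward are assumptions for illustration, not the environment's actual grader. The call to gpt-5-mini itself is deliberately left out.

```python
def build_grader_prompt(question: str, reference: str, submitted: str) -> str:
    """Assemble a judge prompt (hypothetical wording, not the real grader prompt)."""
    return (
        "You are grading an answer to a Chinese research question.\n"
        "Treat character variations, synonyms, different word orders, minor "
        "formatting differences, and small numerical rounding differences as "
        "equivalent.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Submitted answer: {submitted}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def verdict_to_reward(verdict: str) -> float:
    """Map the judge's one-word verdict (assumed format) to the binary reward."""
    return 1.0 if verdict.strip().upper() == "CORRECT" else 0.0
```

Anything other than an explicit CORRECT verdict maps to 0.0 in this sketch, which keeps the reward strictly binary.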

Data

The data consists of a single Parquet file (data.parquet) containing 289 Chinese research questions across 11 domains. Each row includes the topic, the encrypted question, the encrypted answer, and a canary field used for runtime decryption. Questions and answers are XOR-encrypted with a SHA-256-derived key to prevent data contamination.

Source: PALIN2018/BrowseComp-ZH
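
A SHA-256-derived XOR scheme of the kind described above might look like the sketch below. It assumes that the key is the SHA-256 digest of the canary string repeated to the length of the data, and that ciphertext is stored base64-encoded. Both details, along with the helper names derive_key, encrypt, and decrypt, are assumptions and may differ from what this environment actually does.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of the password until it covers `length` bytes."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    return (digest * (length // len(digest) + 1))[:length]


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """XOR the UTF-8 bytes with the derived key and base64-encode the result."""
    raw = plaintext.encode("utf-8")
    return base64.b64encode(_xor(raw, derive_key(password, len(raw)))).decode("ascii")


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Inverse of encrypt: base64-decode, XOR with the same key, decode UTF-8."""
    raw = base64.b64decode(ciphertext_b64)
    return _xor(raw, derive_key(password, len(raw))).decode("utf-8")
```

Because XOR is its own inverse, the same key derivation serves both directions. This kind of obfuscation deters accidental scraping into training corpora; it is not meant as strong cryptography.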

Tools

Tool           Description
web_search     Search the web via Tavily API. Returns up to 5 results with titles, URLs, and snippets.
fetch_url      Fetch and extract text content from a URL. Truncates to 8,000 characters.
submit_answer  Submit explanation, exact answer, and confidence score for LLM grading. Ends the episode.

The web_search and fetch_url tools depend on Tavily but are optional. To use a different search backend, exclude these two tools and supply external tools instead.
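
To make the tool contracts concrete, here is a minimal sketch of the two documented limits and the submit_answer payload. The field names explanation, exact_answer, and confidence, the confidence type, and the helper names are assumptions; the source only states which pieces of information are submitted and what the limits are.

```python
from dataclasses import dataclass

MAX_RESULTS = 5          # web_search returns up to 5 results
MAX_FETCH_CHARS = 8000   # fetch_url truncates to 8,000 characters


@dataclass
class SubmitAnswer:
    """Payload for submit_answer (hypothetical field names)."""
    explanation: str
    exact_answer: str
    confidence: float  # scale is not specified in the source


def cap_results(results: list[dict]) -> list[dict]:
    """Keep at most MAX_RESULTS search results."""
    return results[:MAX_RESULTS]


def truncate_content(text: str) -> str:
    """Keep at most MAX_FETCH_CHARS characters of extracted page text."""
    return text[:MAX_FETCH_CHARS]
```

The truncation here counts Python characters, so a Chinese page keeps up to 8,000 Chinese characters. Whether the environment counts characters or bytes is not stated.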

Time Horizon

BrowseComp-ZH is a multi-turn environment. The agent can issue multiple web_search and fetch_url calls to research the question before submitting a final answer with submit_answer. The episode ends when submit_answer is called.
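
The multi-turn flow can be sketched as a simple loop that dispatches tool calls until submit_answer is reached. Everything here is illustrative: the action dictionary shape, the policy, tools, and grade callables, and the max_steps cap are assumptions, not part of the environment's API.

```python
from typing import Any, Callable


def run_episode(
    policy: Callable[[list], dict],
    tools: dict[str, Callable[..., Any]],
    grade: Callable[[dict], float],
    max_steps: int = 20,  # assumed safety cap; the source defines no step limit
) -> float:
    """Run one episode and return its reward."""
    history: list = []
    for _ in range(max_steps):
        # The policy sees all (action, observation) pairs so far.
        action = policy(history)
        name, args = action["tool"], action["args"]
        if name == "submit_answer":
            # Reward is issued only here, and the episode ends.
            return grade(args)
        observation = tools[name](**args)
        history.append((action, observation))
    # No submission within the cap: no reward was ever issued.
    return 0.0
```

A policy that never calls submit_answer earns nothing in this sketch, which mirrors the sparse reward: research steps carry no intermediate credit.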

Environment Difficulty

The original paper benchmarks 20+ models and agentic search systems:

System                  Accuracy
OpenAI DeepResearch     42.9%
O1                      29.1%
Gemini-2.5-Pro          27.3%
Doubao (Deep Search)    26.0%
DeepSeek-R1             23.2%
Perplexity (Research)   22.6%
Claude-3.7-Sonnet       17.7%
QwQ-32B                 11.1%
GPT-4o                  6.2%

The table lists a selection of the evaluated systems; across the full set, most models achieve below 10% accuracy. Questions require multi-hop reasoning across multiple Chinese web sources, making this a challenging benchmark for web browsing agents.

Other Environment Requirements

  • OpenAI API key: Required for LLM-based answer grading via gpt-5-mini
  • Tavily API key: Required for web search and URL content extraction with the built-in web_search and fetch_url tools

Pass via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.
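
One way to assemble that secrets dictionary without hard-coding keys is to read them from environment variables. In this sketch the variable names OPENAI_API_KEY and TAVILY_API_KEY, the build_secrets helper, and the require_tavily flag are assumptions; only the dictionary keys openai_api_key and tavily_api_key come from the text above.

```python
import os
from typing import Mapping, Optional


def build_secrets(
    env: Optional[Mapping[str, str]] = None,
    require_tavily: bool = True,
) -> dict:
    """Build the secrets dict from environment variables (assumed names)."""
    env = os.environ if env is None else env

    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise KeyError("OPENAI_API_KEY is not set (needed for LLM grading)")
    secrets = {"openai_api_key": openai_key}

    tavily_key = env.get("TAVILY_API_KEY")
    if tavily_key:
        secrets["tavily_api_key"] = tavily_key
    elif require_tavily:
        raise KeyError("TAVILY_API_KEY is not set (needed for web_search and fetch_url)")
    return secrets
```

The resulting dictionary would then be passed as the secrets argument. Setting require_tavily=False fits the case where the built-in search tools are excluded in favor of external ones.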

Safety

BrowseComp-ZH is a read-only research question-answering environment. Agents interact only with web search and URL extraction tools and cannot modify any external systems. The environment does not present direct safety risks.

Citation

@misc{browsecomp-zh,
  title     = {BrowseComp-ZH: Chinese Multi-Hop Reasoning Benchmark},
  author    = {PALIN2018},
  year      = {2024},
  url       = {https://huggingface.co/datasets/PALIN2018/BrowseComp-ZH}
}