Implementation of arXiv/browsecomp-zh

BrowseComp-ZH

Description

BrowseComp-ZH is a Chinese multi-hop reasoning benchmark in which agents answer challenging research questions in Chinese by searching the web, extracting content from URLs, and submitting answers that are graded by an LLM judge. The dataset spans 11 domains: economics, sports, entertainment, science, history, geography, culture, technology, politics, health, and education. Questions are encrypted at rest with an XOR cipher and decrypted at runtime.

Capabilities

Chinese language comprehension and reasoning
Multi-hop web research across diverse domains
Synthesizing information from multiple web sources
Formulating precise answers in Chinese with supporting explanations

Compute Requirements

The environment does not provide a sandbox or filesystem access. Web search and URL fetching are handled through the Tavily API.

License

Apache 2.0.

Tasks

There are 289 tasks in a single test split. Each task consists of a Chinese research question paired with a topic label from one of 11 domains: economics, sports, entertainment, science, history, geography, culture, technology, politics, health, and education. Questions are designed to require multi-hop reasoning, meaning agents must connect information from multiple web sources to arrive at the correct answer.

Reward Structure

This environment uses sparse, binary rewards. The agent receives a reward of 1.0 for a correct answer and 0.0 for an incorrect answer. Reward is issued only when the agent calls submit_answer, which ends the episode.

Correctness is determined by an LLM grader (gpt-5-mini) that compares the agent's submitted answer against the ground-truth answer. The grader accounts for Chinese language nuances including character variations, synonyms, different word orders, minor formatting differences, and small rounding differences for numerical answers.
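
The last step of grading, turning the judge's verdict into the binary reward, can be sketched offline. This is a minimal illustration only: the `correct: yes` / `correct: no` verdict line and the `reward_from_verdict` helper are assumptions, not the environment's actual grader prompt or output format.

```python
import re


def reward_from_verdict(judge_output: str) -> float:
    """Map a judge's free-text reply to a sparse binary reward.

    Assumes (hypothetically) that the judge includes a line such as
    'correct: yes' or 'correct: no'. Anything else scores 0.0.
    """
    match = re.search(
        r"^correct:\s*(yes|no)\s*$",
        judge_output,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if match is None:
        # No recognizable verdict line: treat the answer as incorrect.
        return 0.0
    return 1.0 if match.group(1).lower() == "yes" else 0.0
```

Defaulting to 0.0 when no verdict is found keeps the reward strictly binary even if the judge's reply is malformed.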

Data

Data consists of a single Parquet file (data.parquet) containing 289 Chinese research questions across 11 domains. Each row includes the topic, the encrypted question, the encrypted answer, and a canary field used for runtime decryption. Questions and answers are XOR-encrypted with a SHA-256-derived key to prevent data contamination.
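
To make the encryption scheme concrete, here is a minimal sketch of SHA-256-derived XOR encryption. It assumes the canary string acts as the password, the key is the SHA-256 digest repeated to the message length, and ciphertext is stored as base64; none of these details are confirmed by this README, and the function names are hypothetical.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of the password until it covers `length` bytes."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    repeats = length // len(digest) + 1
    return (digest * repeats)[:length]


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Encrypt UTF-8 text and return base64 (assumed storage format)."""
    raw = plaintext.encode("utf-8")
    return base64.b64encode(xor_bytes(raw, derive_key(password, len(raw)))).decode("ascii")


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Invert encrypt(): base64-decode, XOR with the same key, decode UTF-8."""
    raw = base64.b64decode(ciphertext_b64)
    return xor_bytes(raw, derive_key(password, len(raw))).decode("utf-8")
```

Because XOR is its own inverse, decryption reuses the same key derivation. The scheme obscures the text from crawlers and training-data scrapers; it is not meant to provide cryptographic secrecy.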

Source: PALIN2018/BrowseComp-ZH

Tools

Tool	Description
`web_search`	Search the web via Tavily API. Returns up to 5 results with titles, URLs, and snippets.
`fetch_url`	Fetch and extract text content from a URL. Truncates to 8,000 characters.
`submit_answer`	Submit explanation, exact answer, and confidence score for LLM grading. Ends the episode.

The web_search and fetch_url tools depend on Tavily, but both are optional. To use a different search provider, exclude these tools and supply external tools instead.
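
If you swap in your own search stack, it may help to keep the same limits as the built-in tools. The sketch below wraps arbitrary backends so that search returns at most 5 results and fetched text is cut to 8,000 characters. `make_tools` and the backend signatures are hypothetical; how external tools are registered with the environment is not covered here.

```python
from typing import Callable

MAX_RESULTS = 5      # web_search returns up to 5 results
MAX_CHARS = 8_000    # fetch_url truncates to 8,000 characters


def make_tools(
    search_backend: Callable[[str], list[dict]],
    fetch_backend: Callable[[str], str],
) -> dict[str, Callable]:
    """Wrap user-supplied backends with the limits described in the table above."""

    def web_search(query: str) -> list[dict]:
        # Keep only the first MAX_RESULTS hits from the backend.
        return search_backend(query)[:MAX_RESULTS]

    def fetch_url(url: str) -> str:
        # Truncate by characters, not bytes, so Chinese text is never split mid-character.
        return fetch_backend(url)[:MAX_CHARS]

    return {"web_search": web_search, "fetch_url": fetch_url}
```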

Time Horizon

BrowseComp-ZH is a multi-turn environment. The agent can issue multiple web_search and fetch_url calls to research the question before submitting a final answer with submit_answer. The episode ends when submit_answer is called.
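
The multi-turn flow can be pictured as a simple loop: the agent keeps calling research tools and the episode stops at the first submit_answer. This is an illustrative sketch, not the environment's client API; `run_episode`, the `policy` callable, and the `max_turns` safety cap are assumptions introduced here.

```python
def run_episode(policy, tools, max_turns=20):
    """Drive one episode until the policy submits an answer.

    policy: callable taking the last observation and returning (tool_name, args).
    tools:  dict mapping research tool names to callables.
    max_turns: hypothetical safety cap; the README does not state a turn limit.
    """
    observation = None
    for turn in range(max_turns):
        name, args = policy(observation)
        if name == "submit_answer":
            # Submitting ends the episode; grading would happen at this point.
            return {"turns": turn + 1, "submission": args}
        observation = tools[name](**args)
    # The policy never submitted, so there is nothing to grade.
    return {"turns": max_turns, "submission": None}


def scripted_policy(script):
    """Return a policy that replays a fixed list of (tool_name, args) steps."""
    steps = iter(script)
    return lambda observation: next(steps)
```

A scripted policy like this is handy for smoke-testing tool wiring before plugging in a real model.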

Environment Difficulty

The original paper benchmarks 20+ models and agentic search systems:

System	Accuracy
OpenAI DeepResearch	42.9%
O1	29.1%
Gemini-2.5-Pro	27.3%
Doubao (Deep Search)	26.0%
DeepSeek-R1	23.2%
Perplexity (Research)	22.6%
Claude-3.7-Sonnet	17.7%
QwQ-32B	11.1%
GPT-4o	6.2%

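According to the paper, most of the evaluated models score below 10% accuracy; the table above shows selected results, including the best-performing system, OpenAI DeepResearch, at 42.9%. Questions require multi-hop reasoning across multiple Chinese web sources, making this a challenging benchmark for web browsing agents.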

Other Environment Requirements

OpenAI API key: Required for LLM-based answer grading via gpt-5-mini
Tavily API key: Required for the built-in web_search and fetch_url tools

Pass via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.
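
One way to build that dictionary without hard-coding keys is to read them from environment variables. The variable names `OPENAI_API_KEY` and `TAVILY_API_KEY` and the `load_secrets` helper are assumptions for this sketch; only the dictionary keys come from the line above.

```python
import os


def load_secrets(env=os.environ) -> dict[str, str]:
    """Build the secrets dict from environment variables, failing early if any is unset."""
    # Maps the secrets key expected by the environment to an assumed env var name.
    required = {
        "openai_api_key": "OPENAI_API_KEY",
        "tavily_api_key": "TAVILY_API_KEY",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}
```

The result can then be passed as `secrets=load_secrets()`.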

Safety

BrowseComp-ZH is a read-only research question-answering environment. Agents interact only with web search and URL extraction tools and cannot modify any external systems. The environment does not present direct safety risks.

Citation

@misc{browsecomp-zh,
  title     = {BrowseComp-ZH: Chinese Multi-Hop Reasoning Benchmark},
  author    = {PALIN2018},
  year      = {2024},
  url       = {https://huggingface.co/datasets/PALIN2018/BrowseComp-ZH}
}

Repository

Source repository

EnvCommons/BrowseComp-ZH

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
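
The example figures follow directly from the per-second rate. A small sketch of the arithmetic, using `Decimal` to avoid floating-point noise; the `session_cost` helper is hypothetical and covers only the environment component, since no sandbox is configured.

```python
from decimal import Decimal

ENVIRONMENT_RATE = Decimal("0.0000320")  # USD per second, from the cost table


def session_cost(seconds: int) -> Decimal:
    """Cost in USD of an active session lasting `seconds` seconds."""
    return ENVIRONMENT_RATE * seconds


five_minutes = session_cost(5 * 60)   # 300 s  * 0.0000320 = 0.0096
one_hour = session_cost(60 * 60)      # 3600 s * 0.0000320 = 0.1152
```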