LongFact
Description
LongFact is an environment for evaluating long-form factual accuracy. Based on Google DeepMind's LongFact benchmark, it presents agents with open-ended questions that require detailed factual responses. Evaluation uses the SAFE (Search-Augmented Factuality Evaluator) pipeline: responses are decomposed into atomic facts, each fact is verified via web search, and the response is scored by factual precision.
Capabilities
- Long-form factual question answering
- Generating detailed and accurate responses across 38 subject areas
Compute Requirements
Agents in LongFact are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
One split: test (2,280 tasks) spanning 38 subject areas.
Reward Structure
Single-turn evaluation. Agent submits a long-form response via submit_answer. The response is decomposed into atomic facts using gpt-5-mini, then each fact is verified via web search. Reward is factual precision: supported_facts / relevant_facts, ranging from 0.0 to 1.0. Irrelevant facts are excluded from the denominator.
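The reward computation above can be sketched as a small function. This is an illustrative reimplementation, not the environment's actual code; the verdict label strings are assumptions.

```python
def factual_precision(verdicts):
    """Compute SAFE-style factual precision from per-fact verdicts.

    verdicts: list of strings, each one of "supported",
    "not_supported", or "irrelevant". Irrelevant facts are
    excluded from the denominator, matching the reward above.
    """
    relevant = [v for v in verdicts if v != "irrelevant"]
    if not relevant:
        return 0.0
    supported = sum(1 for v in relevant if v == "supported")
    return supported / len(relevant)

# Example: 3 supported, 1 not supported, 1 irrelevant -> 3/4 = 0.75
print(factual_precision(
    ["supported", "supported", "not_supported", "irrelevant", "supported"]
))
```

A response with no relevant facts scores 0.0 rather than raising a division error, which keeps the reward bounded in [0.0, 1.0] as stated.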
Data
longfact_data.parquet sourced from HuggingFace claserken/longfact. Stored on the OpenReward platform.
Tools
Single tool: submit_answer — submit a long-form factual response for SAFE evaluation.
Time Horizon
Single-turn.
Environment Difficulty
The original paper evaluates 13 models across four model families (Gemini, GPT, Claude, PaLM-2) using F1@K metrics. Top performers were GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF. Larger models consistently achieve higher factual precision than smaller variants within the same family.
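The F1@K metric from the original paper combines factual precision with a recall term capped at K supported facts. A minimal sketch, following the paper's definition (precision over checked facts, recall min(S/K, 1)):

```python
def f1_at_k(supported, not_supported, k):
    """F1@K as defined in the LongFact/SAFE paper (sketch).

    supported: number of facts verified as supported (S).
    not_supported: number of facts verified as not supported.
    k: the target number of supported facts rewarded for recall.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall_k = min(supported / k, 1.0)
    return 2 * precision * recall_k / (precision + recall_k)

# 50 supported out of 100 checked facts, K = 100:
# precision = 0.5, recall = 0.5, so F1@100 = 0.5
print(f1_at_k(50, 50, 100))
```

Note that this environment's reward is precision only; F1@K is what the original paper reports when comparing model families.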
Other Environment Requirements
OpenAI API key required for fact decomposition and web-search-based verification. Pass via secrets={"openai_api_key": "..."}.
Safety
Agents in LongFact generate factual responses in a standard environment. The environment does not present direct safety risks.
Citation
@inproceedings{wei2024longfact,
  title={Long-form factuality in large language models},
  author={Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V.},
  booktitle={NeurIPS},
  year={2024}
}