SourceQualityTrain

OpenReward Environment

Description

SourceQualityTrain is a training environment for systematic review source quality assessment, based on FutureHouse's SourceQuality benchmark. Agents are given questions about why specific studies were excluded from systematic reviews, and must use web search to identify the verbatim exclusion reason.

Capabilities

  • Understanding systematic review methodology and exclusion criteria
  • Locating Cochrane systematic reviews on PubMed Central
  • Extracting specific information from excluded studies tables
  • Multi-hop reasoning: finding the review, then locating the specific exclusion entry

Compute Requirements

No sandbox or special compute requirements.

License

MIT

Tasks

There are 1,000 training tasks distributed across 10 medical domains:

| Domain | Count | Description |
| --- | --- | --- |
| Cardiology | 100 | Heart disease, cardiovascular interventions |
| Gastroenterology/Hepatology | 100 | Digestive disorders, liver disease |
| Infectious Disease | 100 | Bacterial, viral, parasitic infections |
| Musculoskeletal/Rheumatology | 100 | Joint disorders, arthritis, bone disease |
| Neurology/Psychiatry | 100 | Brain disorders, mental health conditions |
| Obstetrics/Gynecology | 100 | Pregnancy, women's health |
| Oncology | 100 | Cancer treatment and prevention |
| Other | 100 | Miscellaneous medical specialties |
| Pediatrics/Neonatal | 100 | Child and newborn health |
| Respiratory | 100 | Lung disease, breathing disorders |

Each task asks why a specific study was excluded from a Cochrane systematic review. Questions include the study reference and the review's research question, requiring agents to locate the source and extract the verbatim exclusion reason.

Reward Structure

Sparse, binary reward:

  • 1.0 for correct answers (as judged by the LLM grader)
  • 0.0 for incorrect or unsure answers

Grading uses semantic equivalence checking: answers that capture the same exclusion reason are accepted, even if phrased differently (e.g., "Not RCT" matches "Not a randomised controlled trial"). The grader is based on LAB-Bench's structured evaluation methodology.

We do not use exact string matching. The LLM grader (gpt-5-mini) evaluates whether the submitted answer captures the core factual content of the expected answer.
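
As a concrete illustration, the sketch below shows what this grading step could look like using the official OpenAI Python client. The prompt wording and the strict YES/NO protocol are assumptions for illustration, not the environment's actual grader implementation.

```python
# Minimal sketch of LLM-based semantic-equivalence grading. The prompt and
# YES/NO protocol are illustrative assumptions, not the real grader.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_answer(question: str, expected: str, submitted: str) -> float:
    """Return 1.0 if the model judges the answers semantically equivalent."""
    prompt = (
        "You are grading a systematic-review QA task.\n"
        f"Question: {question}\n"
        f"Expected exclusion reason: {expected}\n"
        f"Submitted answer: {submitted}\n"
        "Does the submitted answer capture the same exclusion reason, even if "
        "phrased differently? Reply with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0
```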

Data

Ground-truth data consists of QA pairs derived from systematic reviews on PubMed Central. Each task includes:

  • A question referencing a specific excluded study
  • The expected exclusion reason (verbatim from the review)
  • The source review URL
  • The excluded study reference and domain
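
For illustration, a single task record might look like the following Python dict. The field names and placeholder values are assumptions inferred from the list above, not the platform's actual schema.

```python
# Hypothetical task record; field names and values are illustrative only.
task = {
    "question": (
        "Why was <study reference> excluded from the Cochrane review "
        "addressing <research question>?"
    ),
    "expected_answer": "<verbatim exclusion reason from the review>",
    "source_url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC.../",
    "study_reference": "<excluded study reference>",
    "domain": "Cardiology",
}
```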

Data is stored on the OpenReward platform.

Tools

Agents have access to three tools:

| Tool | Description |
| --- | --- |
| web_search | Search the web using Tavily. Returns titles, URLs, and snippets. |
| fetch_url | Fetch full-text content from a URL using Tavily extract. Supports pagination for long documents. |
| submit_answer | Submit a final answer with an explanation. Triggers LLM grading and ends the episode. |

Note that the fetch_url and web_search tools depend on Tavily but are optional: if you prefer a different search provider, exclude these tools and supply your own external tools instead.
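
For reference, the sketch below shows how the two Tavily-backed tools might map onto the tavily-python client. The wiring here is an assumption; the environment's actual implementation may differ.

```python
# Possible Tavily-backed implementations of web_search and fetch_url,
# assuming the tavily-python client; the environment's wiring may differ.
import os

from tavily import TavilyClient

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])


def web_search(query: str) -> list[dict]:
    """Return title/URL/snippet triples for a search query."""
    results = tavily.search(query)["results"]
    return [
        {"title": r["title"], "url": r["url"], "snippet": r["content"]}
        for r in results
    ]


def fetch_url(url: str) -> str:
    """Fetch full-text content for a single URL via Tavily extract."""
    extracted = tavily.extract(urls=[url])
    return extracted["results"][0]["raw_content"]
```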

Time Horizon

SourceQualityTrain is a multi-turn environment. Agents typically search for the relevant systematic review, fetch the review page from PubMed Central, locate the excluded studies table, and extract the specific exclusion reason.
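
A hypothetical episode, using the web_search/fetch_url sketches above with a stubbed submit_answer, might unfold as follows; the call format and parameter names are assumptions.

```python
# Hypothetical four-step trajectory; real agents choose their own tool calls.
def submit_answer(answer: str, explanation: str) -> None:
    print(f"ANSWER: {answer}\nWHY: {explanation}")  # stand-in for the real tool


hits = web_search('Cochrane review "excluded studies" <research question>')
page = fetch_url(hits[0]["url"])  # 1. fetch the PMC review page
# 2. locate the "Characteristics of excluded studies" table in `page`
# 3. find the row matching the target study reference
submit_answer(
    answer="<verbatim exclusion reason>",
    explanation="Taken from the excluded studies table of the source review.",
)
```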

Environment Difficulty

[Statistics on environment difficulty here]

Other Environment Requirements

This environment requires the following API keys passed via the secrets parameter:

  • openai_api_key: For LLM-based answer grading
  • tavily_api_key: For web search and URL content extraction
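
A minimal sketch of supplying these keys is shown below. Only the two secret names come from the list above; the client call is a hypothetical placeholder, since this README does not show the OpenReward client API.

```python
# Only the secret names are from this README; the client call is hypothetical.
import os

secrets = {
    "openai_api_key": os.environ["OPENAI_API_KEY"],
    "tavily_api_key": os.environ["TAVILY_API_KEY"],
}
# e.g. client.run("GeneralReasoning/SourceQualityTrain", secrets=secrets)
```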

Safety

SourceQualityTrain focuses on factual information retrieval from publicly available medical literature. The environment does not involve medical decision-making or patient data. All source reviews are publicly accessible.

Citations

@article{laurent2024labbench,
  title     = {LAB-Bench: Measuring Capabilities of Language Models for Biology Research},
  author    = {Laurent, Jon M. and Janizek, Joseph D. and Ruzo, Michael and Hinks, Michaela M. and Hammerling, Michael J. and Narayanan, Siddharth and Ponnapati, Manvitha and White, Andrew D. and Rodriques, Samuel G.},
  year      = {2024},
  eprint    = {2407.10362},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}

@dataset{GRSourceQualityTrain,
  author    = {General Reasoning Inc. Team},
  title     = {SourceQualityTrain},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/SourceQualityTrain}
}