NCBIGenomeTrain

OpenReward Environment

Description

NCBIGenomeTrain is an ORS training environment for genome-level question answering about the hg38 human reference genome. Each question requires retrieving or computing verifiable facts from the GRCh38/hg38 assembly, such as reference DNA sequences at specific coordinates, GC content of genes, chromosome sizes, cytoband mappings, and restriction enzyme site counts. Answers are structured JSON objects, and questions are designed so that answering them requires querying NCBI or UCSC Genome Browser databases via web search.

This environment complements RefSeqTrain, which focuses on gene/transcript/protein metadata, by instead focusing on genome-level coordinate-based queries and sequence computations.

Capabilities

  • Retrieving reference DNA sequences from specific hg38 genomic coordinates
  • Computing sequence properties (GC content, nucleotide frequencies, motif counts)
  • Looking up gene genomic coordinates and spans in the hg38 assembly
  • Querying chromosome sizes, cytobands, and assembly-level statistics
  • Multi-step research: searching databases, fetching genomic data, and computing derived values
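
The sequence computations listed above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's reference implementation: the function names are hypothetical, and the choice to exclude N (and other non-ACGT symbols) from the GC denominator and to count overlapping motif hits are assumptions that the source does not specify.

```python
from collections import Counter


def nucleotide_counts(seq: str) -> dict:
    """Count A, C, G and T case-insensitively; other symbols (e.g. N) are ignored."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq: str) -> float:
    """Fraction of G+C among the A/C/G/T bases (assumption: N is excluded)."""
    counts = nucleotide_counts(seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (counts["G"] + counts["C"]) / total


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of motif on the given strand, including overlapping hits."""
    seq, motif = seq.upper(), motif.upper()
    if not motif:
        raise ValueError("motif must be non-empty")
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)  # advance by one base to allow overlaps
    return count
```

For a palindromic recognition site such as EcoRI's GAATTC, scanning one strand is enough because the reverse complement is the same string; a non-palindromic motif would also need a scan of the reverse complement.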

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There is one split, train, with 1,000 tasks spanning 10 genome-level domains (100 tasks per domain):

Domain                   Count  Description
reference_sequence       100    Direct DNA sequence retrieval from hg38 coordinates
gc_content               100    GC content of genes or genomic regions
gene_genomic_span        100    Total genomic span (bp) of genes in hg38
chromosome_stats         100    Chromosome lengths, size comparisons, rankings
cytoband_mapping         100    Map genomic positions to cytobands and vice versa
intergenic_distance      100    Distance between neighboring genes
exon_properties          100    Exon counts and individual exon lengths
nucleotide_composition   100    A/T/G/C counts in specific genomic regions
coding_noncoding_ratio   100    Intronic percentage of gene spans
sequence_motif           100    Restriction enzyme site counts in genomic regions
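
Two of the coordinate-based domains reduce to interval arithmetic. The sketch below shows one plausible way to compute an intronic percentage and an intergenic distance. The function names, the 0-based half-open coordinate convention, and the exact definitions (intronic bases = gene span minus merged exon bases; distance = gap between the end of one gene and the start of the next) are assumptions for illustration, since the source does not state how the reference answers define them.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open: [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intronic_percentage(gene: Interval, exons: List[Interval]) -> float:
    """Percentage of the gene span not covered by any exon (exons assumed inside the gene)."""
    span = gene[1] - gene[0]
    if span <= 0:
        raise ValueError("gene span must be positive")
    exonic = sum(end - start for start, end in merge_intervals(exons))
    return 100.0 * (span - exonic) / span


def intergenic_distance(gene_a: Interval, gene_b: Interval) -> int:
    """Bases between two genes on the same chromosome; 0 if they overlap or touch."""
    first, second = sorted([gene_a, gene_b])
    return max(0, second[0] - first[1])
```

Merging exons first matters when exon coordinates are pooled from several transcripts of the same gene, where the same bases can appear in more than one exon record.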

Each task provides a question about the hg38 genome. Answers are JSON objects (e.g., {"reference_sequence": "ATCG..."}, {"gc_content": "0.61"}, {"motif_count": "15"}). The agent must find the answer through web search and database queries.

Reward Structure

Reward is sparse and binary, emitted only when the agent calls submit_answer (which ends the episode). The web_search and fetch_url tools always return reward 0.0 and do not end the episode.

On submission, the agent's answer is evaluated using programmatic grading tailored to the question domain:

  • Sequence match: Case-insensitive exact DNA sequence comparison
  • Exact match: Exact string/numeric comparison for counts, coordinates, and names
  • Numeric tolerance: Accepts values within a specified tolerance for computed quantities (e.g., GC content +/- 0.02, intronic percentage +/- 1.0)
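
The three grading modes above could look roughly like the following. This is a sketch under stated assumptions: the `grade` function and the mode labels `sequence_match`, `exact_match`, and `numeric_tolerance` are hypothetical names chosen for illustration, and the environment's actual grader may normalize answers differently.

```python
def grade(submitted: str, reference: str, grading_type: str, tolerance: float = 0.0) -> float:
    """Return 1.0 if the submitted value matches the reference under the given mode, else 0.0."""
    if grading_type == "sequence_match":
        # Case-insensitive exact DNA sequence comparison.
        ok = submitted.strip().upper() == reference.strip().upper()
    elif grading_type == "exact_match":
        # Exact string comparison for counts, coordinates, and names.
        ok = submitted.strip() == reference.strip()
    elif grading_type == "numeric_tolerance":
        # Accept values within +/- tolerance of the reference.
        try:
            ok = abs(float(submitted) - float(reference)) <= tolerance
        except ValueError:
            ok = False  # a non-numeric submission cannot match
    else:
        raise ValueError(f"unknown grading type: {grading_type}")
    return 1.0 if ok else 0.0
```

For example, with a GC content reference of "0.61" and a tolerance of 0.02, a submission of "0.62" would score 1.0 and "0.65" would score 0.0.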

If programmatic grading fails (e.g., non-standard answer format), an LLM grader (gpt-5-mini) is used as a fallback. In either case, the final reward is one of:

  • 1.0: Submitted answer matches the reference answer within the domain-specific criteria
  • 0.0: Submitted answer is incorrect, missing, or malformed

Data

Data consists of a single JSONL file containing 1,000 QA pairs derived from the hg38 human reference genome assembly. Each row contains a question, JSON-formatted answer, domain, source coordinates/genes, grading type, and tolerance. Answers were computed programmatically from the UCSC Genome Browser REST API and NCBI E-utilities, ensuring deterministic correctness. Data is stored on the OpenReward platform.
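
As an illustration of the kind of query behind a reference_sequence answer, the sketch below builds a request URL for the UCSC Genome Browser REST API sequence endpoint. The helper name `ucsc_sequence_url` is hypothetical, and this is not necessarily how the dataset was generated. It assumes the question gives 1-based inclusive coordinates; the UCSC API takes a 0-based start and a 1-based end, so only the start is shifted.

```python
from urllib.parse import urlencode

# Public UCSC REST API endpoint for reference sequence retrieval.
UCSC_SEQUENCE_ENDPOINT = "https://api.genome.ucsc.edu/getData/sequence"


def ucsc_sequence_url(chrom: str, start_1based: int, end_1based: int, genome: str = "hg38") -> str:
    """Build a sequence request URL from 1-based inclusive coordinates."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    params = {
        "genome": genome,
        "chrom": chrom,
        "start": start_1based - 1,  # UCSC start is 0-based
        "end": end_1based,          # UCSC end is 1-based, so it is unchanged
    }
    return f"{UCSC_SEQUENCE_ENDPOINT}?{urlencode(params)}"
```

Fetching the resulting URL should return a JSON document whose `dna` field holds the sequence, though the response layout is worth confirming against the current UCSC API documentation before relying on it.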

Tools

Tool            Description
web_search      Search the web using Tavily API. Returns up to 5 results with titles, URLs, and snippets.
fetch_url       Fetch full text content from a URL. Supports pagination for long documents.
submit_answer   Submit a final JSON answer with explanation for grading. Ends the episode.

Note that the fetch_url and web_search tools depend on Tavily but are optional. To use a different search provider, exclude these tools and supply external tools instead.

Time Horizon

Multi-turn. Agents can perform multiple web searches and URL fetches before submitting a final answer.
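
The episode flow can be sketched as a simple loop. Everything here is a hypothetical stand-in rather than the OpenReward client API: tools are modeled as plain callables returning an assumed `(observation, reward, done)` tuple, and `policy` is any function that picks the next tool call from the history so far.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape of a tool result: (observation text, reward, episode finished?).
StepResult = Tuple[str, float, bool]
Tool = Callable[[dict], StepResult]
Policy = Callable[[List[tuple]], Tuple[str, dict]]


def run_episode(policy: Policy, tools: Dict[str, Tool], max_turns: int = 10) -> float:
    """Call tools until one ends the episode; return the final reward."""
    history: List[tuple] = []
    for _ in range(max_turns):
        name, args = policy(history)
        observation, reward, done = tools[name](args)
        history.append((name, args, observation))
        if done:  # only submit_answer is expected to end the episode
            return reward
    return 0.0  # no submission within the turn budget
```

In this model, web_search and fetch_url would always return a reward of 0.0 with `done` set to False, so the only non-zero reward comes from the final submit_answer call.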

Environment Difficulty

[To be determined]

Other Environment Requirements

  • OpenAI API key required for LLM-based grading fallback. Pass via secrets={"openai_api_key": "..."}.
  • Tavily API key required for web search and URL fetching. Pass via secrets={"tavily_api_key": "..."}.

Safety

Agents in NCBIGenomeTrain answer genome informatics questions using web search in a standard environment. The environment focuses on factual information retrieval and computation from publicly available hg38 reference genome data. It does not involve access to non-public data or personal genomic information. The environment does not present direct safety risks.

Citations

NCBIGenomeTrain uses data derived from the GRCh38 human reference genome assembly. Please cite the Genome Reference Consortium:

@article{church2011modernizing,
  title={Modernizing reference genome assemblies},
  author={Church, Deanna M and Schneider, Valerie A and Graves, Tina and others},
  journal={PLoS Biology},
  volume={9},
  number={7},
  pages={e1001091},
  year={2011},
  publisher={Public Library of Science}
}

@dataset{GRNCBIGenomeTrain,
  author    = {General Reasoning Inc. Team},
  title     = {NCBIGenomeTrain},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/NCBIGenomeTrain}
}