NCBIGenomeTrain
Description
NCBIGenomeTrain is an ORS training environment for genome-level question answering about the hg38 human reference genome. Each question requires retrieving or computing verifiable facts from the GRCh38/hg38 assembly, such as reference DNA sequences at specific coordinates, GC content of genes, chromosome sizes, cytoband mappings, and restriction enzyme site counts. Answers are structured JSON objects, and questions are designed to require querying NCBI or UCSC Genome Browser databases via web search.
This environment complements RefSeqTrain, which focuses on gene/transcript/protein metadata, by instead focusing on genome-level coordinate-based queries and sequence computations.
Capabilities
- Retrieving reference DNA sequences from specific hg38 genomic coordinates
- Computing sequence properties (GC content, nucleotide frequencies, motif counts)
- Looking up gene genomic coordinates and spans in the hg38 assembly
- Querying chromosome sizes, cytobands, and assembly-level statistics
- Multi-step research: searching databases, fetching genomic data, and computing derived values
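The sequence computations listed above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's own code: the function names are hypothetical, the GC fraction is taken over A/C/G/T bases only (an assumption about how ambiguous bases such as N are handled), and motif counting scans the given strand only and allows overlapping matches.

```python
from collections import Counter


def nucleotide_counts(seq: str) -> dict:
    """Count A, C, G and T in a sequence, ignoring case; other symbols are skipped."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases (assumption: N is excluded)."""
    counts = nucleotide_counts(seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of a motif on the given strand, allowing overlaps."""
    seq, motif = seq.upper(), motif.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


# Toy sequence (not taken from hg38) containing two EcoRI sites (GAATTC).
example = "GAATTCGGCCGAATTC"
print(nucleotide_counts(example))      # {'A': 4, 'C': 4, 'G': 4, 'T': 4}
print(gc_content(example))             # 0.5
print(count_motif(example, "GAATTC"))  # 2
```

For palindromic restriction sites such as GAATTC, scanning one strand is enough, since every site on the reverse strand appears at the same position on the forward strand.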
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There is one split, train, with 1,000 tasks spanning 10 genome-level domains (100 tasks each):
| Domain | Count | Description |
|---|---|---|
| reference_sequence | 100 | Direct DNA sequence retrieval from hg38 coordinates |
| gc_content | 100 | GC content of genes or genomic regions |
| gene_genomic_span | 100 | Total genomic span (bp) of genes in hg38 |
| chromosome_stats | 100 | Chromosome lengths, size comparisons, rankings |
| cytoband_mapping | 100 | Map genomic positions to cytobands and vice versa |
| intergenic_distance | 100 | Distance between neighboring genes |
| exon_properties | 100 | Exon counts and individual exon lengths |
| nucleotide_composition | 100 | A/T/G/C counts in specific genomic regions |
| coding_noncoding_ratio | 100 | Intronic percentage of gene spans |
| sequence_motif | 100 | Restriction enzyme site counts in genomic regions |
Each task provides a question about the hg38 genome. Answers are JSON objects (e.g., {"reference_sequence": "ATCG..."}, {"gc_content": "0.61"}, {"motif_count": "15"}). The agent must find the answer through web search and database queries.
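To make the coordinate-based domains concrete, the sketch below derives an intronic percentage from a gene span and its exon intervals, then serializes the result as a JSON answer. It rests on several assumptions: the coordinates are invented for illustration, intervals are treated as 0-based half-open, the key name `intronic_percentage` is hypothetical, and the value is written as a string only because the example answers above use strings.

```python
import json


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intronic_percentage(gene_start, gene_end, exons):
    """Percentage of the gene span that is not covered by any exon."""
    span = gene_end - gene_start
    if span <= 0:
        raise ValueError("gene_end must be greater than gene_start")
    exonic = sum(end - start for start, end in merge_intervals(exons))
    return 100.0 * (span - exonic) / span


# Invented coordinates: a 1,000 bp span with 400 bp of exons.
exons = [(1000, 1200), (1500, 1600), (1900, 2000)]
pct = intronic_percentage(1000, 2000, exons)

# Hypothetical answer key; the real key for this domain is not given in this document.
answer = json.dumps({"intronic_percentage": f"{pct:.1f}"})
print(answer)  # {"intronic_percentage": "60.0"}
```

Merging the exon intervals first avoids double counting when exons from different transcripts of the same gene overlap.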
Reward Structure
Reward is sparse and binary, emitted only when the agent calls submit_answer (which ends the episode). The web_search and fetch_url tools always return reward 0.0 and do not end the episode.
On submission, the agent's answer is evaluated using programmatic grading tailored to the question domain:
- Sequence match: Case-insensitive exact DNA sequence comparison
- Exact match: Exact string/numeric comparison for counts, coordinates, and names
- Numeric tolerance: Accepts values within a specified tolerance for computed quantities (e.g., GC content +/- 0.02, intronic percentage +/- 1.0)
If programmatic grading fails (e.g., because the answer is in a non-standard format), an LLM grader (gpt-5-mini) is used as a fallback. The final reward takes one of two values:
- 1.0: Submitted answer matches the reference answer within the domain-specific criteria
- 0.0: Submitted answer is incorrect, missing, or malformed
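The three programmatic grading modes described above could look roughly like the following. This is a sketch under stated assumptions, not the environment's grader: the function name and the grading-type labels (`sequence_match`, `exact_match`, `numeric_tolerance`) are hypothetical, and the LLM fallback is omitted entirely.

```python
def grade(submitted: str, reference: str, grading_type: str, tolerance: float = 0.0) -> float:
    """Return a binary reward (1.0 or 0.0) for one submitted value."""
    if grading_type == "sequence_match":
        # Case-insensitive exact comparison of DNA sequences.
        ok = submitted.strip().upper() == reference.strip().upper()
    elif grading_type == "numeric_tolerance":
        # Accept values within an absolute tolerance of the reference.
        try:
            ok = abs(float(submitted) - float(reference)) <= tolerance
        except ValueError:
            ok = False
    elif grading_type == "exact_match":
        # Exact string comparison for counts, coordinates, and names.
        ok = submitted.strip() == reference.strip()
    else:
        raise ValueError(f"unknown grading type: {grading_type}")
    return 1.0 if ok else 0.0


print(grade("atcg", "ATCG", "sequence_match"))            # 1.0
print(grade("0.60", "0.61", "numeric_tolerance", 0.02))   # 1.0
print(grade("0.70", "0.61", "numeric_tolerance", 0.02))   # 0.0
print(grade("15", "16", "exact_match"))                   # 0.0
```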
Data
Data consists of a single JSONL file containing 1,000 QA pairs derived from the hg38 human reference genome assembly. Each row contains a question, JSON-formatted answer, domain, source coordinates/genes, grading type, and tolerance. Answers were computed programmatically from the UCSC Genome Browser REST API and NCBI E-utilities, ensuring deterministic correctness. Data is stored on the OpenReward platform.
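For orientation, a row of such a file might be read as shown below. The field names (`question`, `answer`, `domain`, `grading_type`, `tolerance`) are guesses based on the description above rather than the actual schema, and the two rows are invented placeholders, not real dataset entries.

```python
import json
from collections import Counter

# Invented placeholder rows; field names are assumptions, not the real schema.
sample_jsonl = "\n".join([
    json.dumps({
        "question": "placeholder question 1",
        "answer": {"gc_content": "0.50"},
        "domain": "gc_content",
        "grading_type": "numeric_tolerance",
        "tolerance": 0.02,
    }),
    json.dumps({
        "question": "placeholder question 2",
        "answer": {"motif_count": "15"},
        "domain": "sequence_motif",
        "grading_type": "exact_match",
        "tolerance": 0.0,
    }),
])


def load_rows(text: str) -> list:
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


rows = load_rows(sample_jsonl)
domain_counts = Counter(row["domain"] for row in rows)
print(len(rows))            # 2
print(dict(domain_counts))  # {'gc_content': 1, 'sequence_motif': 1}
```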
Tools
| Tool | Description |
|---|---|
| web_search | Search the web using Tavily API. Returns up to 5 results with titles, URLs, and snippets. |
| fetch_url | Fetch full text content from a URL. Supports pagination for long documents. |
| submit_answer | Submit a final JSON answer with explanation for grading. Ends the episode. |
The web_search and fetch_url tools depend on Tavily but are optional. To use a different search provider, exclude these tools and supply external tools instead.
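As an illustration of the kind of URL an agent might pass to fetch_url when retrieving a reference sequence, the sketch below builds a request for the UCSC Genome Browser REST API. It assumes the public `getData/sequence` endpoint with `genome`, `chrom`, `start`, and `end` parameters, and that the API expects 0-based half-open coordinates; both should be checked against the UCSC documentation. The helper name and the example region are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UCSC REST API for sequence retrieval.
UCSC_SEQUENCE_ENDPOINT = "https://api.genome.ucsc.edu/getData/sequence"


def ucsc_sequence_url(chrom: str, start_1based: int, end_1based: int, genome: str = "hg38") -> str:
    """Build a sequence URL for a 1-based inclusive region.

    The start is shifted by one to give 0-based half-open coordinates,
    which is the convention this sketch assumes the API uses.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    params = {
        "genome": genome,
        "chrom": chrom,
        "start": start_1based - 1,
        "end": end_1based,
    }
    return f"{UCSC_SEQUENCE_ENDPOINT}?{urlencode(params)}"


# Hypothetical region: chr1:1,000,001-1,000,100 (100 bp).
url = ucsc_sequence_url("chr1", 1_000_001, 1_000_100)
print(url)
# https://api.genome.ucsc.edu/getData/sequence?genome=hg38&chrom=chr1&start=1000000&end=1000100
```

Off-by-one errors between 1-based browser coordinates and 0-based API coordinates are easy to make, and a one-base shift would fail an exact sequence match, so the conversion is kept in one place.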
Time Horizon
Multi-turn. Agents can perform multiple web searches and URL fetches before submitting a final answer.
Environment Difficulty
[To be determined]
Other Environment Requirements
- OpenAI API key required for LLM-based grading fallback. Pass via secrets={"openai_api_key": "..."}.
- Tavily API key required for web search and URL fetching. Pass via secrets={"tavily_api_key": "..."}.
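One way to assemble the secrets mapping without hard-coding keys is to read them from environment variables. The helper below is a hypothetical sketch: the variable names `OPENAI_API_KEY` and `TAVILY_API_KEY` are assumptions, and how the resulting dict is handed to the environment depends on the client in use, which is not specified here.

```python
import os


def build_secrets(env=None) -> dict:
    """Collect the API keys named in this section from environment variables.

    Keys that are unset or empty are left out, so an optional dependency
    (for example Tavily, if the search tools are excluded) can be skipped.
    """
    env = os.environ if env is None else env
    mapping = (
        ("openai_api_key", "OPENAI_API_KEY"),  # assumed variable name
        ("tavily_api_key", "TAVILY_API_KEY"),  # assumed variable name
    )
    secrets = {}
    for secret_name, env_name in mapping:
        value = env.get(env_name)
        if value:
            secrets[secret_name] = value
    return secrets


# Example with an explicit mapping instead of the real process environment.
print(build_secrets({"OPENAI_API_KEY": "sk-example"}))  # {'openai_api_key': 'sk-example'}
```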
Safety
Agents in NCBIGenomeTrain answer genome informatics questions using web search in a standard environment. The environment focuses on factual information retrieval and computation from publicly available hg38 reference genome data. It does not involve access to non-public data or personal genomic information. The environment does not present direct safety risks.
Citations
NCBIGenomeTrain uses data derived from the GRCh38 human reference genome assembly. Please cite the Genome Reference Consortium:
@article{church2011modernizing,
title={Modernizing reference genome assemblies},
author={Church, Deanna M and Schneider, Valerie A and Graves, Tina and others},
journal={PLoS Biology},
volume={9},
number={7},
pages={e1001091},
year={2011},
publisher={Public Library of Science}
}

@dataset{GRNCBIGenomeTrain,
author = {General Reasoning Inc. Team},
title = {NCBIGenomeTrain},
year = {2026},
publisher = {OpenReward},
url = {https://openreward.ai/GeneralReasoning/NCBIGenomeTrain}
}