SciClaimEval
Description
SciClaimEval is a single-turn fact-checking environment built from the SciClaimEval Shared Task (Task 1) dev release. Each task presents a scientific claim plus the exact figure or table from the original research paper that was curated as evidence. The agent must read the caption, contextual snippet, and visual evidence to decide whether the claim is supported by the cited figure/table.
Capabilities
- Multimodal fact checking grounded in scientific papers
- Careful reading of tables, plots, and charts (747 curated evidence snippets)
- Integrating textual context, captions, and operations noted by annotators
Compute Requirements
Standard stateless environment (no sandbox / filesystem access required).
License
CC BY 4.0. Individual figures/tables inherit the licenses specified in the dataset metadata.
Tasks
SciClaimEval currently exposes a single split:
- task_1_dev (validation) – 747 annotated claims drawn from ML, NLP, computer vision, and other STEM papers. Labels are roughly balanced between supported (395) and not supported (352) once the Supported/Refuted tags from the Hugging Face dataset are collapsed to the binary scheme used here.
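The label collapse described above can be sketched as a small lookup. This is a minimal illustration, not the environment's actual code: the helper names `collapse_label` and `label_counts` are hypothetical, and the exact casing of the raw tags is an assumption.

```python
from collections import Counter

# Assumed raw tag spellings; the README only states that the source
# dataset uses Supported/Refuted tags.
RAW_TO_BINARY = {
    "supported": "supported",
    "refuted": "not supported",
}


def collapse_label(raw_tag: str) -> str:
    """Map a raw dataset tag onto the binary scheme (hypothetical helper)."""
    key = raw_tag.strip().lower()
    if key not in RAW_TO_BINARY:
        raise ValueError(f"unexpected label tag: {raw_tag!r}")
    return RAW_TO_BINARY[key]


def label_counts(raw_tags) -> Counter:
    """Count collapsed labels across an iterable of raw tags."""
    return Counter(collapse_label(tag) for tag in raw_tags)
```

Run over the full split, `label_counts` would be expected to report 395 supported and 352 not supported, which sum to the 747 claims listed above.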
Each task provides:
- Claim ID, paper ID, and scientific domain
- Figure/table caption and (when relevant) a textual context paragraph
- The operation description recorded by the annotators (e.g., swap rows, change value)
- The original PNG rendering of the evidence figure or table
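The fields listed above could be represented roughly as follows. This is only a sketch: the class name `ClaimTask`, every field name, and the prompt wording are assumptions made for illustration and may not match the keys in the released JSON or the prompt the environment actually builds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimTask:
    """One task record; all field names here are hypothetical."""
    claim_id: str
    paper_id: str
    domain: str
    claim: str
    caption: str
    context: Optional[str]    # textual context paragraph, when relevant
    operation: Optional[str]  # annotator-recorded operation, e.g. "swap rows"
    image_path: str           # PNG rendering of the evidence figure or table


def render_prompt(task: ClaimTask) -> str:
    """Assemble the text part of a prompt; the PNG is attached separately."""
    parts = [f"Claim: {task.claim}", f"Caption: {task.caption}"]
    if task.context:
        parts.append(f"Context: {task.context}")
    if task.operation:
        parts.append(f"Operation: {task.operation}")
    parts.append("Answer with 'supported' or 'not supported'.")
    return "\n".join(parts)
```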
Reward Structure
Single-turn deterministic grading. The agent must call classify with either supported or not supported (a justification string can follow). The environment normalizes the response and compares it against the ground-truth label. Reward is 1.0 for a correct classification and 0.0 otherwise.
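A minimal sketch of this grading rule is shown below. The README does not specify the normalization, so the rules here are assumptions (lowercasing, treating underscores and hyphens as spaces, and reading the label from the start of the response so a justification can follow). The names `normalize_answer` and `grade` are hypothetical.

```python
import re
from typing import Optional

# "not supported" is listed first so it is never misread as "supported".
_LABEL_RE = re.compile(r"^(not supported|supported)\b")


def normalize_answer(response: str) -> Optional[str]:
    """Return the leading label of a response, or None if none is found."""
    # Lowercase and collapse whitespace, underscores, and hyphens to one space.
    text = re.sub(r"[\s_-]+", " ", response.strip().lower())
    match = _LABEL_RE.match(text)
    return match.group(1) if match else None


def grade(response: str, gold: str) -> float:
    """Binary reward: 1.0 for a correct classification, 0.0 otherwise."""
    return 1.0 if normalize_answer(response) == gold else 0.0
```

Under these assumptions, `grade("Not Supported: row 3 shows 71.2", "not supported")` yields 1.0, while an unparseable answer such as `"maybe"` yields 0.0 against either label.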
Data
Ground-truth annotations, PNG figures, and rendered tables are sourced from alabnii/sciclaimeval-shared-task. Files are stored under sciclaimeval/data/ (or /orwd_data/sciclaimeval/data in production) with the following structure:
sciclaimeval/
  data/
    dev_task1_release.json
    figures/dev/*.png
    tables_png/dev/*.png
No secrets or external APIs are required to use the environment.
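Locating and reading these files might look like the sketch below. The function names are hypothetical, and checking the production path before the local one is an assumption, as is the top-level shape of the JSON file (it is simply returned as parsed).

```python
import json
from pathlib import Path

# Both roots come from the layout described above.
LOCAL_ROOT = Path("sciclaimeval/data")
PROD_ROOT = Path("/orwd_data/sciclaimeval/data")
ANNOTATION_FILE = "dev_task1_release.json"


def resolve_data_root(candidates=(PROD_ROOT, LOCAL_ROOT)) -> Path:
    """Return the first candidate directory that holds the annotation file."""
    for root in candidates:
        if (root / ANNOTATION_FILE).is_file():
            return root
    raise FileNotFoundError(f"{ANNOTATION_FILE} not found in {list(candidates)}")


def load_annotations(root: Path):
    """Parse the annotation JSON; its schema is not assumed here."""
    with (root / ANNOTATION_FILE).open(encoding="utf-8") as fh:
        return json.load(fh)


def list_evidence_images(root: Path):
    """Return sorted lists of figure PNGs and table PNGs for the dev split."""
    figures = sorted((root / "figures" / "dev").glob("*.png"))
    tables = sorted((root / "tables_png" / "dev").glob("*.png"))
    return figures, tables
```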
Tools
| Tool | Description |
|---|---|
| classify | Submit supported or not supported plus an optional textual explanation. Returns deterministic reward and ends the episode. |
Time Horizon
One decision per task. The agent reads the prompt, inspects the PNG figure/table, and makes a single classify call.
Environment Difficulty
Tasks come from a competitive shared task on scientific claim verification. Claims span multiple STEM venues and include subtle perturbations (cell swaps, altered values, etc.), so visual grounding and careful numerical reading are necessary. Performance above ~70% accuracy on this dev split typically required fine-tuned multimodal models in the original leaderboard, making it a challenging benchmark for automated agents.
Other Environment Requirements
No additional requirements.
Safety
The environment only exposes publicly available scientific papers and focuses on classifying textual claims using static evidence. There is no capability to issue network calls, change files, or interact with real-world systems, so direct safety risks are minimal.
Citation
@InProceedings{HoWKXBGA2026,
title = {SciClaimEval: Cross-modal Claim Verification in Scientific Papers},
author = {Xanh Ho and Yun-Ang Wu and Sunisth Kumar and Tian Cheng Xia and Florian Boudin and Andre Greiner-Petter and Akiko Aizawa},
booktitle = {Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)},
month = may,
year = {2026},
address = {Palma de Mallorca, Spain},
publisher = {ELRA Language Resources Association},
url = {https://arxiv.org/abs/2602.07621}
}