Implementation of

arXiv/sciclaimeval


SciClaimEval

Description

SciClaimEval is a single-turn fact-checking environment built from the SciClaimEval Shared Task (Task 1) dev release. Each task presents a scientific claim plus the exact figure or table from the original research paper that was curated as evidence. The agent must read the caption, contextual snippet, and visual evidence to decide whether the claim is supported by the cited figure/table.

Capabilities

Multimodal fact checking grounded in scientific papers
Careful reading of tables, plots, and charts (747 curated evidence snippets)
Integrating textual context, captions, and operations noted by annotators

Compute Requirements

Standard stateless environment (no sandbox / filesystem access required).

License

CC BY 4.0. Individual figures/tables inherit the licenses specified in the dataset metadata.

Tasks

SciClaimEval currently exposes a single split:

task_1_dev (validation): 747 annotated claims drawn from ML, NLP, computer vision, and other STEM papers. Labels are roughly balanced between supported (395) and not supported (352) after the Supported/Refuted tags in the Hugging Face release are collapsed to the binary scheme used here.
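
The label collapse described above can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function names `collapse_label` and `label_counts` are hypothetical, and the mapping assumes the upstream tags are exactly the strings "Supported" and "Refuted".

```python
from collections import Counter

# Assumed mapping from the upstream tags to the binary scheme used here.
LABEL_MAP = {"supported": "supported", "refuted": "not supported"}


def collapse_label(raw: str) -> str:
    """Map an upstream tag to 'supported' or 'not supported' (hypothetical helper)."""
    key = raw.strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unexpected label: {raw!r}")
    return LABEL_MAP[key]


def label_counts(raw_labels) -> Counter:
    """Count binary labels after collapsing each upstream tag."""
    return Counter(collapse_label(raw) for raw in raw_labels)
```

Applied to the full split, the two counts should sum to 747 (395 + 352).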

Each task provides:

Claim ID, paper ID, and scientific domain
Figure/table caption and (when relevant) a textual context paragraph
The operation description recorded by the annotators (e.g., swap rows, change value)
The original PNG rendering of the evidence figure or table

Reward Structure

Single-turn deterministic grading. The agent must call classify with either supported or not supported (a justification string can follow). The environment normalizes the response and compares it against the ground-truth label. Reward is 1.0 for a correct classification and 0.0 otherwise.
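
The grading step might look roughly like the sketch below. The exact normalization rules are not documented here, so the behaviour shown (lowercasing, collapsing whitespace, treating underscores as spaces, and accepting a trailing justification) is an assumption, and `normalize_answer` and `grade` are hypothetical names.

```python
from typing import Optional


def normalize_answer(response: str) -> Optional[str]:
    """Reduce a raw response to 'supported', 'not supported', or None (assumed rules)."""
    # Lowercase, treat underscores as spaces, and collapse runs of whitespace.
    text = " ".join(response.strip().lower().replace("_", " ").split())
    if text.startswith("not supported"):
        return "not supported"
    if text.startswith("supported"):
        return "supported"
    return None  # unrecognized answers never match a ground-truth label


def grade(response: str, ground_truth: str) -> float:
    """Return 1.0 for a correct classification and 0.0 otherwise."""
    return 1.0 if normalize_answer(response) == ground_truth else 0.0
```

Under these assumed rules, an answer such as "Supported: the table matches the claim" is accepted, while a free-form answer like "unsupported" earns no reward because it matches neither label.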

Data

Ground-truth annotations, PNG figures, and rendered tables are sourced from alabnii/sciclaimeval-shared-task. Files are stored under sciclaimeval/data/ (or /orwd_data/sciclaimeval/data in production) with the following structure:

sciclaimeval/
  data/
    dev_task1_release.json
    figures/dev/*.png
    tables_png/dev/*.png

No secrets or external APIs are required to use the environment.
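
As a rough illustration of the layout above, the sketch below locates the data directory and enumerates the evidence images. The helper names are hypothetical, and the schema of dev_task1_release.json is not described here, so the JSON is loaded without assuming any field names.

```python
import json
from pathlib import Path

# The two locations mentioned above: local checkout first, then production.
DEFAULT_ROOTS = (Path("sciclaimeval/data"), Path("/orwd_data/sciclaimeval/data"))


def resolve_data_root(candidates=DEFAULT_ROOTS) -> Path:
    """Return the first candidate directory that contains the annotation file."""
    for root in candidates:
        root = Path(root)
        if (root / "dev_task1_release.json").is_file():
            return root
    raise FileNotFoundError("dev_task1_release.json not found in any candidate root")


def list_evidence_images(root: Path) -> dict:
    """Collect the PNG evidence files for figures and rendered tables."""
    return {
        "figures": sorted((root / "figures" / "dev").glob("*.png")),
        "tables": sorted((root / "tables_png" / "dev").glob("*.png")),
    }


def load_annotations(root: Path):
    """Load the annotation file as-is; its structure is not assumed here."""
    with open(root / "dev_task1_release.json", encoding="utf-8") as fh:
        return json.load(fh)
```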

Tools

Tool	Description
`classify`	Submit `supported` or `not supported` plus an optional textual explanation. Returns deterministic reward and ends the episode.

Time Horizon

One decision per task. The agent reads the prompt, inspects the PNG figure/table, and makes a single classify call.

Environment Difficulty

Tasks come from a competitive shared task on scientific claim verification. Claims span multiple STEM venues and include subtle perturbations (cell swaps, altered values, etc.), so visual grounding and careful numerical reading are necessary. Performance above ~70% accuracy on this dev split typically required fine-tuned multimodal models in the original leaderboard, making it a challenging benchmark for automated agents.

Other Environment Requirements

No additional requirements.

Safety

The environment only exposes publicly available scientific papers and focuses on classifying textual claims using static evidence. There is no capability to issue network calls, change files, or interact with real-world systems, so direct safety risks are minimal.

Citation

@InProceedings{HoWKXBGA2026,
  title     = {SciClaimEval: Cross-modal Claim Verification in Scientific Papers},
  author    = {Xanh Ho and Yun-Ang Wu and Sunisth Kumar and Tian Cheng Xia and Florian Boudin and Andre Greiner-Petter and Akiko Aizawa},
  booktitle = {Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)},
  month     = may,
  year      = {2026},
  address   = {Palma de Mallorca, Spain},
  publisher = {ELRA Language Resources Association},
  url       = {https://arxiv.org/abs/2602.07621}
}

Repository

Source repository

EnvCommons/SciClaimEval


Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

Session length	Cost
5-minute session	$0.0096
1-hour session	$0.1152
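
The example figures follow directly from the per-second rate. A small sketch of the arithmetic, assuming billing is strictly linear in session time (the function name `session_cost` is illustrative):

```python
# Environment rate from the cost table; the sandbox is not configured,
# so it contributes nothing to the total.
RATE_PER_SECOND = 0.0000320


def session_cost(seconds: float, rate: float = RATE_PER_SECOND) -> float:
    """Estimated cost in dollars for a session of the given length."""
    return seconds * rate


five_minutes = round(session_cost(5 * 60), 4)   # 300 s * $0.0000320
one_hour = round(session_cost(60 * 60), 4)      # 3600 s * $0.0000320
```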

SciClaimEval

Xanh/SciClaimEval
