putnamgap

Description

PutnamGAP is a benchmark for assessing how robust LLMs' mathematical reasoning is. It pairs each original competition-level math problem with multiple mathematically equivalent variants, such as surface renamings and parametric changes. Comparing a model's performance on the originals with its performance on the variants measures how sensitive the model is to non-mathematical perturbations.
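One way to picture the comparison described above is as an accuracy gap between the original problems and each variant family. The following is a minimal sketch, not PutnamGAP's official scoring: the record layout `(problem_id, variant_kind, is_correct)`, the variant labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_variant(results):
    """Return accuracy per variant kind.

    `results` is an iterable of (problem_id, variant_kind, is_correct)
    tuples. This layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _problem_id, kind, is_correct in results:
        totals[kind] += 1
        correct[kind] += int(is_correct)
    return {kind: correct[kind] / totals[kind] for kind in totals}


def robustness_gaps(results, baseline="original"):
    """Return baseline accuracy minus each variant family's accuracy.

    A larger positive gap means the model is more sensitive to that
    kind of perturbation.
    """
    acc = accuracy_by_variant(results)
    return {kind: acc[baseline] - a for kind, a in acc.items() if kind != baseline}


# Made-up toy data, only to show the shape of the input.
sample = [
    ("p1", "original", True),
    ("p1", "surface_renaming", True),
    ("p1", "parametric", False),
    ("p2", "original", True),
    ("p2", "surface_renaming", False),
    ("p2", "parametric", False),
]

gaps = robustness_gaps(sample)
print(gaps)
```

On this toy data the model answers every original correctly, half of the renamed variants, and none of the parametric ones, so the parametric gap is the largest. A gap near zero would indicate that the perturbation did not change the model's behavior.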

