PutnamGAP
Description
PutnamGAP is a benchmark for assessing the robustness of LLMs' mathematical reasoning. It stress-tests models on competition-level math problems by pairing each original problem with multiple mathematically equivalent variants that differ linguistically or parametrically (e.g., surface renaming and parametric changes). The variants are used to measure a model's sensitivity to non-mathematical perturbations.
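One way to picture the measurement described above is as a per-variant accuracy drop relative to the original problems. The following is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the record layout, the variant labels (`original`, `surface_renaming`, `parametric`), the function names, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_variant(results):
    """Return accuracy per variant label.

    `results` is an iterable of (problem_id, variant, is_correct) tuples.
    This layout is an assumption made for this sketch, not a documented format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _problem_id, variant, is_correct in results:
        totals[variant] += 1
        correct[variant] += int(is_correct)
    return {variant: correct[variant] / totals[variant] for variant in totals}


def robustness_gaps(results, baseline="original"):
    """Return baseline accuracy minus each variant's accuracy.

    A positive gap means the model did worse on the equivalent variant
    than on the original problems.
    """
    accuracy = accuracy_by_variant(results)
    base = accuracy[baseline]
    return {v: base - a for v, a in accuracy.items() if v != baseline}


# Made-up grading records for two problems; not real benchmark results.
toy_results = [
    ("p1", "original", True),
    ("p1", "surface_renaming", True),
    ("p1", "parametric", False),
    ("p2", "original", True),
    ("p2", "surface_renaming", False),
    ("p2", "parametric", False),
]

gaps = robustness_gaps(toy_results)
print(gaps)
```

Because the variants are mathematically equivalent to the originals, a model that reasons robustly would be expected to show gaps near zero under this kind of comparison; larger gaps suggest sensitivity to the perturbation rather than to the underlying mathematics.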