critpt

Description

CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point") is the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks across modern physics subfields. It comprises 71 composite research challenges, decomposed into 190 simpler checkpoint tasks, created by 50+ active researchers. Each problem is hand-curated to admit a guess-resistant, machine-verifiable answer and is graded by a heavily customized automated pipeline, so the benchmark measures how well models handle full research-scale challenges.
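To make the challenge/checkpoint structure concrete, here is a minimal sketch of how a composite challenge with machine-verifiable numeric answers might be represented and graded. Everything in it is an assumption for illustration: the `Checkpoint` and `Challenge` classes, the `grade_numeric` and `checkpoint_accuracy` helpers, and the relative tolerance are hypothetical and are not taken from CritPt's actual schema or grading pipeline.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Checkpoint:
    """One simpler sub-task with a numeric reference answer (hypothetical schema)."""
    prompt: str
    expected: float
    rel_tol: float = 1e-6  # assumed tolerance, not a CritPt setting


@dataclass
class Challenge:
    """A composite research challenge and the checkpoints it decomposes into."""
    title: str
    final_expected: float
    checkpoints: list[Checkpoint] = field(default_factory=list)


def grade_numeric(submitted: float, expected: float, rel_tol: float = 1e-6) -> bool:
    """Return True when the submitted value matches the reference within tolerance."""
    return math.isclose(submitted, expected, rel_tol=rel_tol)


def checkpoint_accuracy(challenge: Challenge, answers: list[float]) -> float:
    """Fraction of checkpoints answered correctly; answers are matched by position."""
    if not challenge.checkpoints:
        return 0.0
    graded = [
        grade_numeric(answer, cp.expected, cp.rel_tol)
        for answer, cp in zip(answers, challenge.checkpoints)
    ]
    # Missing answers count as incorrect because we divide by the full checkpoint count.
    return sum(graded) / len(challenge.checkpoints)
```

Comparing against a reference value within a tolerance is one simple way to get answers that are hard to guess yet checkable by a program; a real pipeline would likely also need to handle symbolic expressions and units.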

Implementations (1)
Environment: GeneralReasoning/CritPt
Stars: 0
Last Updated: 1 month ago