BouncerBench
Description
BouncerBench is a benchmark for assessing whether LLM-based software agents can refuse to act on ill-defined bug reports and abstain when their own code patches are likely incorrect. It targets two failure points: underspecified issue descriptions, and patches that are logically or functionally incorrect. For each, it measures whether a system can distinguish actionable tickets from vague ones and valid patches from untrustworthy ones. The benchmark includes a dataset, a leaderboard, and a baseline input/output bouncer.
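To make the two abstention points concrete, the following is a minimal sketch of an agent wrapped by an input check and an output check. It is an illustration under stated assumptions, not BouncerBench's actual baseline or API: `run_with_bouncers`, `Decision`, and the three `toy_*` functions are hypothetical names, and the heuristics (word count, non-empty patch) are placeholders for whatever judgment a real bouncer would apply.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    """Outcome of one ticket: where it stopped and the patch, if any."""
    status: str  # "rejected_input", "rejected_output", or "accepted"
    patch: Optional[str] = None


def run_with_bouncers(
    ticket: str,
    input_bouncer: Callable[[str], bool],
    agent: Callable[[str], str],
    output_bouncer: Callable[[str, str], bool],
) -> Decision:
    """Hypothetical pipeline: gate the ticket, run the agent, gate the patch."""
    # First abstention point: refuse tickets judged too vague to act on.
    if not input_bouncer(ticket):
        return Decision("rejected_input")
    patch = agent(ticket)
    # Second abstention point: withhold patches judged untrustworthy.
    if not output_bouncer(ticket, patch):
        return Decision("rejected_output")
    return Decision("accepted", patch)


# Toy stand-ins (assumptions for illustration only, not real bouncers).
def toy_input_bouncer(ticket: str) -> bool:
    # Treat very short tickets as underspecified.
    return len(ticket.split()) >= 5


def toy_agent(ticket: str) -> str:
    # Pretend the agent fails to produce a patch for "crash" tickets.
    return "" if "crash" in ticket else "fix: handle None in parse()"


def toy_output_bouncer(ticket: str, patch: str) -> bool:
    # Accept any non-empty patch; a real check would assess correctness.
    return bool(patch.strip())


if __name__ == "__main__":
    for t in (
        "it is broken",
        "parse() raises TypeError when input is None",
        "app crash happens on startup every time",
    ):
        print(t, "->", run_with_bouncers(
            t, toy_input_bouncer, toy_agent, toy_output_bouncer).status)
```

Keeping the two checks as separate callables mirrors the description's split between vague tickets and untrustworthy patches, so each decision can be scored independently.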