bouncerbench

Name: arXiv/bouncerbench
Author: arXiv

Description

BouncerBench is a benchmark that assesses whether LLM-based software agents can refuse to act on ill-defined bug reports or abstain when their own code patches are likely incorrect. It targets two failure points: underspecified issue descriptions, and patches that are logically or functionally incorrect. It measures whether a system can distinguish actionable tickets from vague ones and valid patches from untrustworthy ones. The benchmark includes a baseline input/output bouncer, a dataset, and a leaderboard.
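The two decision points described above can be illustrated with a minimal sketch. This is not the benchmark's baseline bouncer: the names `input_bouncer`, `output_bouncer`, and `handle_ticket`, the numeric scores, and the 0.5 thresholds are all hypothetical, chosen only to show where an agent might refuse a ticket or withhold a patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Outcome of a single bouncer check (hypothetical structure)."""
    proceed: bool
    reason: str


def input_bouncer(issue_clarity: float, threshold: float = 0.5) -> Decision:
    """Refuse tickets whose description looks underspecified.

    `issue_clarity` is an assumed score in [0, 1]; how it is produced
    (for example, by an LLM judge) is outside this sketch.
    """
    if issue_clarity < threshold:
        return Decision(False, "issue description is too vague to act on")
    return Decision(True, "issue description is actionable")


def output_bouncer(patch_confidence: float, threshold: float = 0.5) -> Decision:
    """Abstain when the generated patch is likely incorrect.

    `patch_confidence` is an assumed score in [0, 1].
    """
    if patch_confidence < threshold:
        return Decision(False, "patch is likely incorrect")
    return Decision(True, "patch looks trustworthy")


def handle_ticket(issue_clarity: float, patch_confidence: float) -> str:
    """Run both checks in order and report what the agent would do."""
    if not input_bouncer(issue_clarity).proceed:
        return "rejected_input"   # never attempt a fix for a vague ticket
    if not output_bouncer(patch_confidence).proceed:
        return "rejected_output"  # a fix was attempted but is withheld
    return "submitted"


if __name__ == "__main__":
    print(handle_ticket(issue_clarity=0.2, patch_confidence=0.9))
    print(handle_ticket(issue_clarity=0.8, patch_confidence=0.3))
    print(handle_ticket(issue_clarity=0.8, patch_confidence=0.9))
```

Running the input check first means a vague ticket is rejected before any patch is generated, while the output check only applies to tickets that passed the first gate.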

Links: arXiv, GitHub, HuggingFace
