dualgauge-bench
Description
DUALGAUGE is the first fully automated benchmarking framework for jointly evaluating the security and functional correctness of LLM-generated code. It combines agentic sandboxed execution with an LLM-based evaluator to assess both vulnerabilities and expected behavior. It includes DUALGAUGE-BENCH, a curated suite of diverse coding tasks, each paired with manually validated security and functionality tests, which enables scalable, reproducible joint evaluation.
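To make the idea of joint evaluation concrete, the following is a minimal sketch and not the DUALGAUGE implementation or API. It assumes a task can be modeled as a pair of test lists (functionality and security) and that a candidate counts as jointly passing only when both lists pass. All names (`Task`, `Result`, `evaluate`, the greeting task) are hypothetical, and the sandboxed execution and LLM-based evaluator described above are deliberately left out.

```python
import html
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; not taken from DUALGAUGE.
Candidate = Callable[[str], str]
Test = Callable[[Candidate], bool]


@dataclass
class Task:
    name: str
    functional_tests: List[Test]
    security_tests: List[Test]


@dataclass
class Result:
    functional: bool
    secure: bool

    @property
    def joint(self) -> bool:
        # Assumed definition: joint success needs both dimensions to pass.
        return self.functional and self.secure


def run_tests(tests: List[Test], candidate: Candidate) -> bool:
    """Return True only if every test passes; an exception counts as a failure."""
    for test in tests:
        try:
            if not test(candidate):
                return False
        except Exception:
            return False
    return True


def evaluate(task: Task, candidate: Candidate) -> Result:
    return Result(
        functional=run_tests(task.functional_tests, candidate),
        secure=run_tests(task.security_tests, candidate),
    )


# A toy task: render a greeting as an HTML paragraph.
task = Task(
    name="render_greeting",
    functional_tests=[lambda f: f("Ada") == "<p>Hello, Ada</p>"],
    security_tests=[lambda f: "<script>" not in f("<script>alert(1)</script>")],
)


def naive_greeting(name: str) -> str:
    # Functionally correct, but inserts untrusted input verbatim.
    return f"<p>Hello, {name}</p>"


def escaped_greeting(name: str) -> str:
    # Escapes HTML metacharacters before interpolation.
    return f"<p>Hello, {html.escape(name)}</p>"


naive_result = evaluate(task, naive_greeting)
escaped_result = evaluate(task, escaped_greeting)
```

In this sketch `naive_greeting` passes the functionality test but fails the security test, so it does not pass jointly, while `escaped_greeting` passes both. Scoring the two dimensions separately and then combining them is what distinguishes this setup from a functionality-only benchmark.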