dualgauge-bench

Name: arXiv/dualgauge-bench
Author: arXiv

Description

DUALGAUGE is the first fully automated benchmarking framework for jointly evaluating the security and functional correctness of LLM-generated code, combining agentic sandboxed execution with an LLM-based evaluator to assess both vulnerabilities and expected behavior. It includes DUALGAUGE-BENCH, a curated suite of diverse coding tasks where each task is paired with manually validated security and functionality tests to enable scalable, reproducible joint evaluation.
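
To make the idea of joint evaluation concrete, the following is a minimal sketch of how paired functionality and security test outcomes could be combined into a single score. It is not the DUALGAUGE implementation or its official metric: the `TaskResult` schema, the `summarize` helper, the task identifiers, and the rule that a task counts only when every test in both suites passes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one generated solution on one task (hypothetical schema)."""
    task_id: str
    functional_passed: int
    functional_total: int
    security_passed: int
    security_total: int

    def functionally_correct(self) -> bool:
        # Assumption: a solution is correct only if every functional test passes.
        return self.functional_total > 0 and self.functional_passed == self.functional_total

    def secure(self) -> bool:
        # Assumption: a solution is secure only if every security test passes.
        return self.security_total > 0 and self.security_passed == self.security_total


def summarize(results):
    """Return the fraction of tasks that are functional, secure, and both."""
    n = len(results)
    if n == 0:
        return {"functional": 0.0, "secure": 0.0, "joint": 0.0}
    functional = sum(r.functionally_correct() for r in results)
    secure = sum(r.secure() for r in results)
    joint = sum(r.functionally_correct() and r.secure() for r in results)
    return {"functional": functional / n, "secure": secure / n, "joint": joint / n}


# Made-up results for four illustrative tasks, not benchmark data.
results = [
    TaskResult("task-a", 5, 5, 3, 3),  # functional and secure
    TaskResult("task-b", 5, 5, 2, 3),  # functional, fails one security test
    TaskResult("task-c", 4, 5, 3, 3),  # secure, fails one functional test
    TaskResult("task-d", 5, 5, 4, 4),  # functional and secure
]
summary = summarize(results)
print(summary)  # {'functional': 0.75, 'secure': 0.75, 'joint': 0.5}
```

In this toy data the joint rate (0.5) is lower than either individual rate (0.75), which illustrates why measuring security and functionality separately can overstate how often generated code is both correct and safe.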
