skillsbench

Description

SkillsBench is a benchmark of 86 tasks across 11 domains, each paired with curated Skills and a deterministic verifier, for measuring whether Agent Skills improve LLM agents at inference time. Each task is evaluated under three conditions (no Skills, curated Skills, and self-generated Skills) across seven agent-model configurations, yielding 7,308 trajectories, to quantify pass-rate changes and to compare curated with self-authored Skills.
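The comparison described above comes down to a pass rate per condition and its difference from the no-Skills baseline. The following is a minimal sketch of that arithmetic, not the benchmark's own scoring code: the record layout (`condition`, `passed`), the condition labels, and the helper names `pass_rates` and `deltas_vs_baseline` are assumptions made for illustration, and the `toy` records are invented placeholders rather than SkillsBench results.

```python
from collections import defaultdict

# Hypothetical labels for the three evaluation conditions.
CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")


def pass_rates(trajectories):
    """Return {condition: fraction of trajectories whose verifier passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for t in trajectories:
        total[t["condition"]] += 1
        passed[t["condition"]] += int(t["passed"])
    # Skip conditions with no trajectories to avoid dividing by zero.
    return {c: passed[c] / total[c] for c in CONDITIONS if total[c]}


def deltas_vs_baseline(rates, baseline="no_skills"):
    """Return the pass-rate change of each condition relative to the baseline."""
    return {c: r - rates[baseline] for c, r in rates.items() if c != baseline}


# Invented toy records (assumed format): one verifier outcome per trajectory.
toy = [
    {"task": "t1", "condition": "no_skills", "passed": False},
    {"task": "t2", "condition": "no_skills", "passed": True},
    {"task": "t1", "condition": "curated_skills", "passed": True},
    {"task": "t2", "condition": "curated_skills", "passed": True},
    {"task": "t1", "condition": "self_generated_skills", "passed": False},
    {"task": "t2", "condition": "self_generated_skills", "passed": True},
]

rates = pass_rates(toy)
deltas = deltas_vs_baseline(rates)
print(rates)
print(deltas)
```

In practice the same aggregation would likely be repeated per agent-model configuration or per domain by adding those fields to the grouping key.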

Implementations (1)
Environment | Stars | Last Updated
benchflow/skillsbench (benchflow) | 13 | 1 month ago