swe-mera

Description

SWE-MERA is a dynamic, continuously updated benchmark for evaluating LLMs on real-world software engineering tasks. An automated pipeline collects GitHub issues and applies rigorous quality validation to minimize data contamination. The pipeline has identified roughly 10,000 potential tasks, of which 300 samples are currently available. Evaluated with the Aider coding agent, the benchmark shows strong discriminative power across recent models.

