LiveRepoReflection Leaderboard

Metrics Explanation

Pass@1 (P1)
Percentage of tasks an LLM completes correctly on its first attempt, directly reflecting one-shot coding accuracy.
Pass@2 (P2)
Percentage of tasks an LLM completes correctly within two attempts. After a failed first attempt, the LLM can view its previous code and the error messages before trying again, so this metric measures the capacity to improve via immediate feedback.
Well Format (WF)
Percentage of tasks where the LLM strictly follows the edit format specified in the system prompt.
Fix Weight (FW)
Defined as (Pass@2 - Pass@1) / Pass@2. It represents the fraction of tasks solved within two attempts that were solved only on the second attempt, i.e. the share of Pass@2 successes that came from fixing a failed first attempt.
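
The four metrics above can be computed from per-task outcomes roughly as follows. This is a minimal sketch, not the benchmark's actual evaluation code: the `TaskResult` record, its field names, and the `leaderboard_metrics` function are hypothetical, and it assumes P1, P2, and WF are reported as percentages while FW is kept as a plain fraction.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    passed_first: bool    # solved on the first attempt
    passed_second: bool   # solved on the retry after seeing code + errors
    well_formatted: bool  # output followed the required edit format


def leaderboard_metrics(results):
    """Return P1, P2, WF (percentages) and FW (fraction) for a list of results."""
    n = len(results)
    if n == 0:
        raise ValueError("at least one task result is required")

    first = sum(r.passed_first for r in results)
    # A task counts toward Pass@2 if it passed on either attempt.
    within_two = sum(r.passed_first or r.passed_second for r in results)
    formatted = sum(r.well_formatted for r in results)

    p1 = 100.0 * first / n
    p2 = 100.0 * within_two / n
    wf = 100.0 * formatted / n
    # FW = (Pass@2 - Pass@1) / Pass@2; assumed to be 0 when nothing passes.
    fw = (p2 - p1) / p2 if p2 > 0 else 0.0
    return {"P1": p1, "P2": p2, "WF": wf, "FW": fw}


# Made-up example: 10 tasks, 4 pass immediately, 2 pass on the retry, 4 fail.
example = (
    [TaskResult(True, False, True)] * 4
    + [TaskResult(False, True, True)] * 2
    + [TaskResult(False, False, True)] * 3
    + [TaskResult(False, False, False)] * 1
)
metrics = leaderboard_metrics(example)
```

With these made-up results, 4 of 10 tasks pass on the first attempt and 6 of 10 pass within two attempts, so FW is (60 - 40) / 60, or one third: a third of the eventual successes came from the second attempt.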
Model categories: Closed-Source LLMs, Open-Source LLMs, RepoReflectionCoder (Ours)
Full-file code generation
Model Size P1 P2 WF FW
Patch-based incremental edits
Model Size P1 P2 WF FW