Percentage of tasks an LLM completes correctly on its first attempt, directly reflecting the one-shot coding accuracy.
Pass@2 (P2)
After a failed attempt, LLMs can view their previous code and error messages before trying again, measuring the capacity to improve via immediate feedback.
Well Format (WF)
Percentage of tasks where the LLM strictly follows the edit format specified in the system prompt.
Fix Weight (FW)
Defined as (Pass@2-Pass@1)/Pass@2, represents the fraction of successful second-attempt fixes among all second-attempt successes.