Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, thereby improving development efficiency and accessibility. While benchmarks such as HumanEval and LiveCodeBench evaluate code generation and real-world relevance, prior work largely ignores the scenario of modifying code within repositories. To address the remaining challenges of improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. We further create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, and use it to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. Finally, we build a leaderboard that evaluates over 40 LLMs to track model performance on repository-based code reflection.
Code large language models (LLMs) represent a significant advancement in comprehending and producing code across numerous programming languages. Fueled by extensive training on massive code repositories, LLMs empower developers by offering intelligent feedback, identifying potential bugs, and revising code snippets according to human instructions. Code reflection refers to the ability of LLMs to examine and modify their previous responses. Through reflection, LLMs can streamline the development process, boost efficiency, and make programming more accessible to a wider range of developers.
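As a concrete illustration, code reflection can be viewed as a generate–test–repair loop. The sketch below is a minimal, hypothetical harness; the `generate` and `run_tests` callables are placeholders for an LLM call and a unit-test runner, not part of any specific framework.

```python
# Minimal sketch of a code-reflection loop: generate, run tests, feed errors back.
# `generate` and `run_tests` are hypothetical placeholders (an LLM call and a
# unit-test runner); they are illustrative assumptions, not a specific API.

def reflect(prompt: str, generate, run_tests, max_turns: int = 2) -> str:
    code = generate(prompt)                      # turn 1: initial code generation
    for _ in range(max_turns - 1):
        ok, error_log = run_tests(code)          # execute the repository's unit tests
        if ok:
            break
        # turn 2+: error-driven repair, conditioning on the failing test output
        code = generate(
            f"{prompt}\n\nPrevious attempt:\n{code}\n\n"
            f"Test errors:\n{error_log}\n\nFix the code."
        )
    return code
```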
Comparison between general code generation tasks and LiveRepoReflection tasks.
A challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages (Python, Java, C++, Rust, Go, JavaScript) to ensure diversity, correctness, and high difficulty.
A large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, including automatically generated examples, and constructed with quality filtering and decontamination techniques.
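The exact filtering rules are not spelled out here; as one plausible sketch, decontamination is often implemented as n-gram overlap matching between candidate training examples and benchmark test cases. The n-gram size and threshold below are illustrative assumptions, not the paper's settings.

```python
# Illustrative n-gram overlap decontamination filter (parameters are assumptions).

def ngrams(text: str, n: int = 10) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, benchmark_docs: list[str],
                    n: int = 10, threshold: float = 0.1) -> bool:
    ex = ngrams(example, n)
    if not ex:
        return False
    for doc in benchmark_docs:
        overlap = len(ex & ngrams(doc, n)) / len(ex)
        if overlap >= threshold:
            return True
    return False

# Keep only training examples with no significant overlap against the benchmark:
# clean = [e for e in candidates if not is_contaminated(e, benchmark_docs)]
```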
A specialized LLM trained on the RepoReflection-Instruct dataset through a two-turn dialogue process involving code generation and error-driven repair, with enhanced capabilities for repository-based code reflection.
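As a rough illustration of what a two-turn training sample might look like (the field names and contents are hypothetical, not the paper's exact schema): the first turn asks for an implementation, and the second turn supplies failing-test output and requests a repair.

```python
# Hypothetical two-turn dialogue sample for instruction tuning.
two_turn_example = [
    {"role": "user",
     "content": "Implement `rate_limiter.py` so that `tests/test_rate_limiter.py` passes."},
    {"role": "assistant",
     "content": "<initial implementation of rate_limiter.py>"},
    {"role": "user",
     "content": "Running the tests fails:\n"
                "AssertionError: expected 429 after 10 requests, got 200.\n"
                "Please fix the code."},
    {"role": "assistant",
     "content": "<repaired implementation of rate_limiter.py>"},
]
```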
Systematic evaluation of over 40 LLMs on the LiveRepoReflection benchmark, with a dynamic leaderboard that tracks model performance and measures the alignment between model-generated responses and human preferences.
Comparison of dataset scales and structures between Aider Polyglot Benchmark and LiveRepoReflection.
The paper introduces an automatic construction pipeline that dynamically updates the large-scale instruction corpus and the evaluation benchmark to avoid data contamination and benchmark hacking:
Overview of Construction Pipeline.
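At a high level, such a refresh cycle can be outlined as in the sketch below. All stage functions are passed in as callables; their names and behavior are placeholders used only to show the flow, not the paper's actual implementation.

```python
# High-level outline of a dynamic corpus/benchmark refresh cycle (names are placeholders).

def refresh_benchmark(recent_repos, training_corpus, *,
                      generate_tasks, passes_tests, is_high_quality, is_contaminated):
    candidates = generate_tasks(recent_repos)                  # auto-generate repo-level tasks + unit tests
    candidates = [t for t in candidates if passes_tests(t)]    # keep only executable, verifiable tasks
    candidates = [t for t in candidates if is_high_quality(t)] # diversity / difficulty filtering
    return [t for t in candidates                              # drop anything overlapping training data
            if not is_contaminated(t, training_corpus)]
```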
Two evaluation edit formats of LiveRepoReflection.
Performance comparison between LiveRepoReflection and the Aider Polyglot Benchmark.
Pass@k (k=1–10) curves for nine LLMs.
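For reference, Pass@k is typically computed with the unbiased estimator of Chen et al. (2021), where n samples are drawn per problem and c of them pass all unit tests; a minimal implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    with n samples per problem and c of them passing all unit tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```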
Polyglot repository code file structure examples.