RepoSim: Evaluating Prompt Strategies for Code Completion via User Behavior Simulation

Abstract

Large language models (LLMs) have revolutionized code completion tasks. IDE plugins such as Copilot can generate code recommendations, saving developers significant time and effort. However, current evaluation methods for code completion rely on static code benchmarks, which capture neither human interaction nor the evolution of repositories. This paper proposes RepoSim, a novel benchmark designed to evaluate code completion by simulating the evolving process of repositories and incorporating user behaviors. RepoSim leverages data collected from an IDE plugin, recording and replaying user behaviors to provide a realistic programming context for evaluation. This allows for the assessment of more complex prompt strategies, such as utilizing recently visited files and incorporating user editing history. Additionally, RepoSim proposes a new metric based on users’ acceptance or rejection of predictions, offering a user-centric evaluation criterion. Our preliminary evaluation demonstrates that incorporating users’ recent edit history into prompts significantly improves the quality of LLM-generated code, highlighting the importance of temporal context in code completion. RepoSim thus offers a realistic, user-focused framework for benchmarking code completion performance.
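
The abstract leaves implementation details to the paper, but the two ideas it highlights, prompts enriched with recently visited files and edit history, and an acceptance-based metric, can be sketched roughly as follows. This is a minimal illustration under assumptions, not RepoSim's actual interface; the names `EditEvent`, `CompletionAttempt`, `build_prompt`, and `acceptance_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditEvent:
    """One recorded user action (edit or file visit) that the simulator can replay."""
    file_path: str
    snippet: str      # code the user recently typed or viewed
    timestamp: float


@dataclass
class CompletionAttempt:
    """One completion shown to the (simulated) user."""
    prediction: str
    accepted: bool


def build_prompt(prefix: str, recent_events: List[EditEvent], max_events: int = 5) -> str:
    """Prepend the most recent edits / visited files as commented context
    ahead of the code prefix being completed."""
    newest_first = sorted(recent_events, key=lambda e: e.timestamp, reverse=True)
    context_lines = []
    for event in newest_first[:max_events]:
        context_lines.append(f"# recently edited: {event.file_path}")
        context_lines.append(event.snippet)
    return "\n".join(context_lines + [prefix])


def acceptance_rate(attempts: List[CompletionAttempt]) -> float:
    """User-centric metric: fraction of shown completions the user accepted."""
    if not attempts:
        return 0.0
    return sum(a.accepted for a in attempts) / len(attempts)
```

A replay-style evaluation would iterate over the recorded session, call `build_prompt` at each completion point, query the model, and log each result as a `CompletionAttempt` to compute the acceptance-based score.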

Publication
In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)
Chao Peng
Senior Researcher

My research interests include Software Testing, Program Repair and Compilers.