LLM4SE_SLR.xlsx

	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R
1	year	Author	Title	Url	Abstract Note	VENUE	RANK	SE Task (-1,0,1)	LLM (-1,0,1)	Survey (-1,0,1)	QAC1: high-level venues？ (0,1,2,3)	QAC2: motivations clearly？ (0,1,2,3)	QAC3: clearly described proposed techniques？ (0,1,2,3)	QAC4: setups described in detail? (0,1,2,3)	QAC5: clearly validate results? (0,1,2,3)	QAC6: listed contributions and limitations? (0,1,2,3)	QAC7: contribute to academic or industrial community? (0,1,2,3)	Sum QAC scores
2	2023	Huang, Qing; Zhu, Jiahui; Xing, Zhenchang; Jin, Huan; Wang, Changjing; Xu, Xiwei	A Chain of AI-based Solutions for Resolving FQNs and Fixing Syntax Errors in Partial Code		API documentation, technical blogs and programming Q&A sites contain numerous partial code that can be reused in programming tasks, but often these code are uncompilable due to unresolved names and syntax errors. To facilitate partial code reuse, we propose the Partial Code Reuse Chain (PCR-Chain) for resolving fully-qualified names (FQNs) and fixing last-mile syntax errors in partial code based on a giant large language model (LLM) like ChatGPT. Methodologically, PCR-Chain is backed up by the underlying global-level prompt architecture (which combines three design ideas: hierarchical task breakdown, prompt composition, and a mix of prompt-based AI and non-AI units) and the local-level prompt design. Technically, we propose PCR-Chain, which employs in-context learning rather than symbolic, costly training methods. Experimental results demonstrate that in dynamically-typed languages (Python), PCR-Chain outperforms current state-of-the-art (SOTA) 5% accuracy like RING. For statically-type languages (Java), our approach achieves high accuracy of 80.5% in resolving both non-FQNs and last-mile syntax errors, surpassing SOTA methods (RING) that can only address last-mile syntax errors. The correct execution of the unit, module, and PCR-Chain demonstrates the effectiveness of the prompt design, composition, and architecture and opens up possibilities for building software engineering tools based on LLMs, replacing traditional program analysis methods.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
3	2022	Wang, Yawen; Shi, Lin; Li, Mingyang; Wang, Qing; Yang, Yun	A Deep Context-wise Method for Coreference Detection in Natural Language Requirements(detecting coreferent entities in natural language requirements)		Requirements are usually written in natural language and evolve continuously during the process of software development, which involves a large number of stakeholders. Stakeholders with diverse backgrounds and skills might refer to the same real-world entity with different linguistic expressions in the natural-language requirements, resulting in requirement inconsistency. We define this phenomenon as Entity Coreference (EC) in the Requirement Engineering (RE) area. It can lead to misconception about technical terminologies, and harm the readability and long-term maintainability of the requirements. In this paper, we propose a DEEP context-wise method for entity COREFerence detection, named DeepCoref. First, we truncate corresponding contexts surrounding entities. Then, we construct a deep context-wise neural network for coreference classification. The network consists of one fine-tuning BERT model for context representation, a Word2Vec-based network for entity representation, and a multi-layer perceptron in the end to fuse and make a trade-off between two representations. Finally, we cluster and normalize coreferent entities. We evaluate our method, respectively, on coreference classification and clustering with 1853 industry data on 21 projects. The former evaluation shows that DeepCoref outperforms three baselines with average precision and recall of 96.10% and 96.06%, respectively. The latter evaluation on six metrics shows that DeepCoref can cluster coreferent entities more accurately. We also conduct ablation experiments with three variants to demonstrate the performance enhancement brought by different components of neural network designed for coreference classification.	RE	B_SE	1	1	-1	2	3	3	3	3	3	3	20
4	2023	Lee, Jaehyung; Han, Kisun; Yu, Hwanjo	a light bug triage framework for applying large pre-trained language model	https://dl.acm.org/doi/10.1145/3551349.3556898	Assigning appropriate developers to the bugs is one of the main challenges in bug triage. Demands for automatic bug triage are increasing in the industry, as manual bug triage is labor-intensive and time-consuming in large projects. The key to the bug triage task is extracting semantic information from a bug report. In recent years, large Pre-trained Language Models (PLMs) including BERT [4] have achieved dramatic progress in the natural language processing (NLP) domain. However, applying large PLMs to the bug triage task for extracting semantic information has several challenges. In this paper, we address the challenges and propose a novel framework for bug triage named LBT-P, standing for Light Bug Triage framework with a Pre-trained language model. It compresses a large PLM into small and fast models using knowledge distillation techniques and also prevents catastrophic forgetting of PLM by introducing knowledge preservation fine-tuning. We also develop a new loss function exploiting representations of earlier layers as well as deeper layers in order to handle the overthinking problem. We demonstrate our proposed framework on the real-world private dataset and three public real-world datasets [11]: Google Chromium, Mozilla Core, and Mozilla Firefox. The result of the experiments shows the superiority of LBT-P.	ASE	A_SE	1	1	-1	2	3	3	3	3	2	3	19
5	2023	Mohammed Latif Siddiq, Beatrice Casey, Joanna C. S. Santos	A Lightweight Framework for High-Quality Code Generation		In recent years, the use of automated source code generation utilizing transformer-based generative models has expanded, and these models can generate functional code according to the requirements of the developers. However, recent research revealed that these automatically generated source codes can contain vulnerabilities and other quality issues. Despite researchers' and practitioners' attempts to enhance code generation models, retraining and fine-tuning large language models is time-consuming and resource-intensive. Thus, we describe FRANC, a lightweight framework for recommending more secure and high-quality source code derived from transformer-based code generation models. FRANC includes a static filter to make the generated code compilable with heuristics and a quality-aware ranker to sort the code snippets based on a quality score. Moreover, the framework uses prompt engineering to fix persistent quality issues. We evaluated the framework with five Python and Java code generation models and six prompt datasets, including a newly created one in this work (SOEval). The static filter improves 9% to 46% Java suggestions and 10% to 43% Python suggestions regarding compilability. The average improvement over the NDCG@10 score for the ranking system is 0.0763, and the repairing techniques repair the highest 80% of prompts. FRANC takes, on average, 1.98 seconds for Java; for Python, it takes 0.08 seconds.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
6	2023	Sghaier, Oussama Ben; Sahraoui, Houari	A Multi-Step Learning Approach to Assist Code Review		Modern code review is a process for early detection and reduction of issues, which assists in ensuring the quality of the source code, detecting anomalies, and identifying potential improvements. However, this is a highly manual activity that requires a lot of resources and time. Recent research has addressed these problems by attempting to entirely automate this task (i.e., generating code reviews). However, we do believe that dismissing the reviewer from this process is not the best option in terms of its optimal functioning, especially considering the high error rates in the proposed approaches. Furthermore, this full automation is still too far to achieve given the complexity of the task that requires human intelligence. In this work, we aim to assist the reviewer in the code review process. We propose an approach for detecting the type of issue and locating parts of the code that need to be revised by developers. In the first phase, we propose a meta-learner that combines a learning-based model and a knowledge-based model to predict the type of issue from the review comment. Then, we use this component to create and label a large dataset composed of quadruplets . We use this data set to finetune a pre-trained language model to predict the types of issues (e.g., naming, resource handling, etc.), that need to be addressed in the original code snippet. Furthermore, we fine-tune another pre-trained language model to locate these issues in the source code submitted by developers. We evaluate the performance of our approach using a test set not considered during the training. Our results show that our model accurately locates and predicts the types of issues.	SANER	B_SE	1	1	-1	2	3	3	3	3	3	3	20
7	2023	Charalambous, Yiannis; Tihanyi, Norbert; Jain, Ridhi; Sun, Youcheng; Ferrag, Mohamed Amine; Cordeiro, Lucas C.	a new era in software security: towards self-healing software via large language models and formal verification	http://arxiv.org/abs/2305.14752	In this paper we present a novel solution that combines the capabilities of Large Language Models (LLMs) with Formal Verification strategies to verify and automatically repair software vulnerabilities. Initially, we employ Bounded Model Checking (BMC) to locate the software vulnerability and derive a counterexample. The counterexample provides evidence that the system behaves incorrectly or contains a vulnerability. The counterexample that has been detected, along with the source code, are provided to the LLM engine. Our approach involves establishing a specialized prompt language for conducting code debugging and generation to understand the vulnerability's root cause and repair the code. Finally, we use BMC to verify the corrected version of the code generated by the LLM. As a proof of concept, we create ESBMC-AI based on the Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained Transformer model, specifically gpt-3.5-turbo, to detect and fix errors in C programs. Our experimentation involved generating a dataset comprising 1000 C code samples, each consisting of 20 to 50 lines of code. Notably, our proposed method achieved an impressive success rate of up to 80\\% in repairing vulnerable code encompassing buffer overflow and pointer dereference failures. We assert that this automated approach can effectively incorporate into the software development lifecycle's continuous integration and deployment (CI/CD) process.	W	W	1	1	-1	0	3	3	3	3	3	3	18
8	2023	White, Jules; Fu, Quchen; Hays, Sam; Sandborn, Michael; Olea, Carlos; Gilbert, Henry; Elnashar, Ashraf; Spencer-Smith, Jesse; Schmidt, Douglas C.	a prompt pattern catalog to enhance prompt engineering with chatgpt	http://arxiv.org/abs/2302.11382	Prompt engineering is an increasingly important skill set needed to converse effectively with large language models (LLMs), such as ChatGPT. Prompts are instructions given to an LLM to enforce rules, automate processes, and ensure specific qualities (and quantities) of generated output. Prompts are also a form of programming that can customize the outputs and interactions with an LLM. This paper describes a catalog of prompt engineering techniques presented in pattern form that have been applied to solve common problems when conversing with LLMs. Prompt patterns are a knowledge transfer method analogous to software patterns since they provide reusable solutions to common problems faced in a particular context, i.e., output generation and interaction when working with LLMs. This paper provides the following contributions to research on prompt engineering that apply LLMs to automate software development tasks. First, it provides a framework for documenting patterns for structuring prompts to solve a range of problems so that they can be adapted to different domains. Second, it presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. Third, it explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns.	W	W	1	1	-1	0	3	3	3	2	2	3	16
9	2023	Ding, Hantian and Kumar, Varun and Tian, Yuchen and Wang, Zijian and Kwiatkowski, Rob and Li, Xiaopeng and Ramanathan, Murali Krishna and Ray, Baishakhi and Bhatia, Parminder and Sengupta, Sudipta and others	A Static Evaluation of Code Completion by Large Language Models		Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
10	2023	Cao, Jialun; Li, Meiziniu; Wen, Ming; Cheung, Shing-chi	a study on prompt design, advantages and limitations of chatgpt for deep learning program repair	http://arxiv.org/abs/2304.08191	ChatGPT has revolutionized many research and industrial fields. ChatGPT has shown great potential in software engineering to boost various traditional tasks such as program repair, code understanding, and code generation. However, whether automatic program repair (APR) applies to deep learning (DL) programs is still unknown. DL programs, whose decision logic is not explicitly encoded in the source code, have posed unique challenges to APR. While to repair DL programs, an APR approach needs to not only parse the source code syntactically but also needs to understand the code intention. With the best prior work, the performance of fault localization is still far less than satisfactory (only about 30\\\%). Therefore, in this paper, we explore ChatGPT's capability for DL program repair by asking three research questions. (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT's repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? On top of that, we categorize the common aspects useful for prompt design for DL program repair. Also, we propose various prompt templates to facilitate the performance and summarize the advantages and disadvantages of ChatGPT's abilities such as detecting bad code smell, code refactoring, and detecting API misuse/deprecation.	W	W	0	1	-1	0	3	2	3	2	2	3	15
11	2023	Yang, Guang; Zhou, Yu; Chen, Xiang; Zhang, Xiangyu; Xu, Yiran; Han, Tingting; Chen, Taolue	a syntax-guided multi-task learning approach for turducken-style code generation	http://arxiv.org/abs/2303.05061	Due to the development of pre-trained language models, automated code generation techniques have shown great promise in recent years. However, the generated code is difficult to meet the syntactic constraints of the target language, especially in the case of Turducken-style code, where declarative code snippets are embedded within imperative programs. In this study, we summarize the lack of syntactic constraints into three significant challenges: (1) the efficient representation of syntactic constraints, (2) the effective integration of syntactic information, and (3) the scalable syntax-first decoding algorithm. To address these challenges, we propose a syntax-guided multi-task learning approach TurduckenGen. Specifically, we first explicitly append the type information to the code tokens to capture the representation of syntactic constraints. Then we formalize code generation with syntactic constraint representation as an auxiliary task to enable the model to learn the syntactic constraints of the code. Finally, the syntactically correct code is selected accurately from the multiple candidates with the help of the compiler feedback. Extensive experiments and comprehensive analysis demonstrate the effectiveness and general applicability of our approach after being compared with six state-of-the-art baselines on two Turducken-style code datasets. Finally, we conducted a human study and found the code quality generated by our approach is better than baselines in terms of code readability and semantic similarity.	W	W	1	0	-1	0	3	2	3	3	3	3	17
12	2022	Xu, Frank F.; Alon, Uri; Neubig, Graham; Hellendoorn, Vincent Josua	A systematic evaluation of large language models of code		Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing opensource models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, that was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.	PLDI	A_SE	1	1	-1	3	3	3	3	3	3	3	21
13	2023	Laskar, Md Tahmid Rahman; Bari, M. Saiful; Rahman, Mizanur; Bhuiyan, Md Amran Hossen; Joty, Shafiq; Huang, Jimmy Xiangji	A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets		The development of large language models (LLMs) such as ChatGPT1 has brought a lot of attention recently. However, their evaluation in the benchmark academic datasets remains under-explored due to the difficulty of evaluating the generative outputs produced by this model against the ground truth. In this paper, we aim to present a thorough evaluation of ChatGPT’s performance on diverse academic datasets, covering tasks like questionanswering, text summarization, code generation, commonsense reasoning, mathematical problem-solving, machine translation, bias detection, and ethical considerations. Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets. This makes our work the largest evaluation of ChatGPT in NLP benchmarks. In short, our study aims to validate the strengths and weaknesses of ChatGPT in various tasks and provide insights for future research using LLMs. We also report a new emergent ability to follow multi-query instructions that we mostly found in ChatGPT and other instruction-tuned models. Our extensive evaluation shows that even though ChatGPT is capable of performing a wide variety of tasks, and may obtain impressive performance in several benchmark datasets, it is still far from achieving the ability to reliably solve many challenging tasks. By providing a thorough assessment of ChatGPT’s performance across diverse NLP tasks, this paper sets the stage for a targeted deployment of ChatGPT-like LLMs in real-world applications.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
14	2020	Biswas, Eeshita; Karabulut, Mehmet Efruz; Pollock, Lori; Vijay-Shanker, K.	achieving reliable sentiment analysis in the software engineering domain using bert	https://ieeexplore.ieee.org/document/9240599/	Researchers have shown that sentiment analysis of software artifacts can potentially improve various software engineering tools, including API and library recommendation systems, code suggestion tools, and tools for improving communication among software developers. However, sentiment analysis techniques applied to software artifacts still have not yet yielded very high accuracy. Recent adaptations of sentiment analysis tools to the software domain have reported some improvements, but the f-measures for the positive and negative sentences still remain in the 0.4-0.64 range, which deters their practical usefulness for software engineering tools.In this paper, we explore the potential effectiveness of customizing BERT, a language representation model, which has recently achieved very good results on various Natural Language Processing tasks on English texts, for the task of sentiment analysis of software artifacts. We describe our application of BERT to analyzing sentiments of sentences in Stack Overflow posts and compare the impact of a BERT sentiment classifier to state-of-the-art sentiment analysis techniques when used on a domain-specific data set created from Stack Overflow posts. We also investigate how the performance of sentiment analysis changes when using a much (3 times) larger data set than previous studies. Our results show that the BERT classifier achieves reliable performance for sentiment analysis of software engineering texts. BERT combined with the larger data set achieves an overall f-measure of 0.87, with the f-measures for the negative and positive sentences reaching 0.91 and 0.78 respectively, a significant improvement over the state-of-the-art.	ICSME	B_SE	1	1	-1	2	3	3	3	3	2	3	19
15	2023	Schäfer, Max; Nadi, Sarah; Eghbali, Aryaz; Tip, Frank	adaptive test generation using a large language model	http://arxiv.org/abs/2302.06527	Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. This paper presents TestPilot, an adaptive test generation technique that leverages Large Language Models (LLMs). TestPilot uses Codex, an off-the-shelf LLM, to automatically generate unit tests for a given program without requiring additional training or few-shot learning on examples of existing tests. In our approach, Codex is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. If a generated test fails, TestPilot's adaptive component attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We created an implementation of TestPilot for JavaScript and evaluated it on 25 npm packages with a total of 1,684 API functions to generate tests for. Our results show that the generated tests achieve up to 93.1\\% statement coverage (median 68.2\\%). Moreover, on average, 58.5\\% of the generated tests contain at least one assertion that exercises functionality from the package under test. Our experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. Finally, we find that TestPilot does not generate memorized tests: 92.7\\% of our generated tests have $\leq$ 50\\% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies.	W	W	1	1	-1	0	3	3	3	3	3	3	18
16	2023	Widjojo, Patricia; Treude, Christoph	Addressing Compiler Errors: Stack Overflow or Large Language Models?		Compiler error messages serve as an initial resource for programmers dealing with compilation errors. However, previous studies indicate that they often lack sufficient targeted information to resolve code issues. Consequently, programmers typically rely on their own research to fix errors. Historically, Stack Overflow has been the primary resource for such information, but recent advances in large language models offer alternatives. This study systematically examines 100 compiler error messages from three sources to determine the most effective approach for programmers encountering compiler errors. Factors considered include Stack Overflow search methods and the impact of model version and prompt phrasing when using large language models. The results reveal that GPT-4 outperforms Stack Overflow in explaining compiler error messages, the effectiveness of adding code snippets to Stack Overflow searches depends on the search method, and results for Stack Overflow differ significantly between Google and StackExchange API searches. Furthermore, GPT-4 surpasses GPT-3.5, with “How to fix” prompts yielding superior outcomes to “What does this error mean” prompts. These results offer valuable guidance for programmers seeking assistance with compiler error messages, underscoring the transformative potential of advanced large language models like GPT-4 in debugging and opening new avenues of exploration for researchers in AI-assisted programming.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
17	2023	Huang, Qing; Zou, Zhou; Xing, Zhenchang; Zuo, Zhenkang; Xu, Xiwei; Lu, Qinghua	ai chain on large language model for unsupervised control flow graph generation for statically-typed partial code	http://arxiv.org/abs/2306.00757	Control Flow Graphs (CFGs) are essential for visualizing, understanding and analyzing program behavior. For statically-typed programming language like Java, developers obtain CFGs by using bytecode-based methods for compilable code and Abstract Syntax Tree (AST)-based methods for partially uncompilable code. However, explicit syntax errors during AST construction and implicit semantic errors caused by bad coding practices can lead to behavioral loss and deviation of CFGs.To address the issue, we propose a novel approach that leverages the error-tolerant and understanding ability of pre-trained Large Language Models (LLMs) to generate CFGs. Our approach involves a Chain of Thought (CoT) with four steps: structure hierarchy extraction, nested code block extraction, CFG generation of nested code blocks, and fusion of all nested code blocks' CFGs. To address the limitations of the original CoT's single-prompt approach (i.e., completing all steps in a single generative pass), which can result in an ``epic'' prompt with hard-to-control behavior and error accumulation, we break down the CoT into an AI chain with explicit sub-steps. Each sub-step corresponds to a separate AI-unit, with an effective prompt assigned to each unit for interacting with LLMs to accomplish a specific purpose.Our experiments confirmed that our method outperforms existing CFG tools in terms of node and edge coverage, especially for incomplete or erroneous code. We also conducted an ablation experiment and confirmed the effectiveness of AI chain design principles: Hierarchical Task Breakdown, Unit Composition, and Mix of AI Units and Non-AI Units.Our work opens up new possibilities for building foundational software engineering tools based on LLMs, as opposed to traditional program analysis methods.	W	W	1	1	-1	0	3	3	3	3	3	3	18
18	2023	Rao, Nikitha; Tsay, Jason; Kate, Kiran; Hellendoorn, Vincent J.; Hirzel, Martin	ai for low-code for ai	http://arxiv.org/abs/2305.20015	Low-code programming allows citizen developers to create programs with minimal coding effort, typically via visual (e.g. drag-and-drop) interfaces. In parallel, recent AI-powered tools such as Copilot and ChatGPT generate programs from natural language instructions. We argue that these modalities are complementary: tools like ChatGPT greatly reduce the need to memorize large APIs but still require their users to read (and modify) programs, whereas visual tools abstract away most or all programming but struggle to provide easy access to large APIs. At their intersection, we propose LowCoder, the first low-code tool for developing AI pipelines that supports both a visual programming interface (LowCoder\\_VP) and an AI-powered natural language interface (LowCoder\\_NL). We leverage this tool to provide some of the first insights into whether and how these two modalities help programmers by conducting a user study. We task 20 developers with varying levels of AI expertise with implementing four ML pipelines using LowCoder, replacing the LowCoder\\_NL component with a simple keyword search in half the tasks. Overall, we find that LowCoder is especially useful for (i) Discoverability: using LowCoder\\_NL, participants discovered new operators in 75\\% of the tasks, compared to just 32.5\\% and 27.5\\% using web search or scrolling through options respectively in the keyword-search condition, and (ii) Iterative Composition: 82.5\\% of tasks were successfully completed and many initial pipelines were further successfully improved. Qualitative analysis shows that AI helps users discover how to implement constructs when they know what to do, but still fails to support novices when they lack clarity on what they want to accomplish. Overall, our work highlights the benefits of combining the power of AI with low-code programming.	W	W	1	1	-1	0	2	2	2	3	3	3	15
19	2023	Zhang, Kexun; Wang, Danqing; Xia, Jingtao; Wang, William Yang; Li, Lei	algo: synthesizing algorithmic programs with generated oracle verifiers	http://arxiv.org/abs/2305.14591	Large language models (LLMs) excel at implementing code from functionality descriptions, but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the creation and verify their correctness. ALGO first generates a probably correct but possibly slow reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the algorithms synthesized. Our study shows that the LLM-generated oracles are correct for 88\\% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get 1.3x better pass rate over the ChatGPT Code Interpreter on unseen problems.	W	W	1	1	-1	0	3	3	3	1	2	3	15
20	2022	Dibia, Victor; Fourney, Adam; Bansal, Gagan; Poursabzi-Sangdeh, Forough; Liu, Han; Amershi, Saleema	aligning offline metrics and human judgments of value of ai-pair programmers	http://arxiv.org/abs/2210.16494	Large language models trained on massive amounts of natural language data and code have shown impressive capabilities in automatic code generation scenarios. Development and evaluation of these models has largely been driven by offline functional correctness metrics, which consider a task to be solved if the generated code passes corresponding unit tests. While functional correctness is clearly an important property of a code generation model, we argue that it may not fully capture what programmers value when collaborating with their AI pair programmers. For example, while a nearly correct suggestion that does not consider edge cases may fail a unit test, it may still provide a substantial starting point or hint to the programmer, thereby reducing total needed effort to complete a coding task. To investigate this, we conduct a user study with (N=49) experienced programmers, and find that while both correctness and effort correlate with value, the association is strongest for effort. We argue that effort should be considered as an important dimension of evaluation in code generation scenarios. We also find that functional correctness remains better at identifying the highest-value generations; but participants still saw considerable value in code that failed unit tests. Conversely, similarity-based metrics are very good at identifying the lowest-value generations among those that fail unit tests. Based on these findings, we propose a simple hybrid metric, which combines functional correctness and similarity-based metrics to capture different dimensions of what programmers might value and show that this hybrid metric more strongly correlates with both value and effort. Our findings emphasize the importance of designing human-centered metrics that capture what programmers need from and value in their AI pair programmers.	ACL	A_AI	1	1	-1	3	2	1	3	3	3	3	18
21	2023	Sobania, Dominik; Briesch, Martin; Hanna, Carol; Petke, Justyna	an analysis of the automatic bug fixing performance of chatgpt	http://arxiv.org/abs/2301.08653	To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair, or navigate a search space of software edits to find test-suite passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair, but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare the performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive to the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs, outperforming state-of-the-art.	W	W	1	1	-1	0	3	2	2	2	3	3	15
22	2021	Mastropaolo, Antonio; Aghajani, Emad; Pascarella, Luca; Bavota, Gabriele	an empirical study on code comment completion	http://arxiv.org/abs/2107.10544	Code comments play a prominent role in program comprehension activities. However, source code is not always documented and code and comments not always co-evolve. To deal with these issues, researchers have proposed techniques to automatically generate comments documenting a given code at hand. The most recent works in the area applied deep learning (DL) techniques to support such a task. Despite the achieved advances, the empirical evaluations of these approaches show that they are still far from a performance level that would make them valuable for developers. We tackle a simpler and related problem: Code comment completion. Instead of generating a comment for a given code from scratch, we investigate the extent to which state-of-the-art techniques can help developers in writing comments faster. We present a large-scale study in which we empirically assess how a simple n-gram model and the recently proposed Text-To-Text Transfer Transformer (T5) architecture can perform in autocompleting a code comment the developer is typing. The achieved results show the superiority of the T5 model, despite the n-gram model being a competitive solution.	ICSME	B_SE	1	1	-1	2	3	3	2	3	2	3	18
23	2021	Ciniselli, Matteo; Cooper, Nathan; Pascarella, Luca; Mastropaolo, Antonio; Aghajani, Emad; Poshyvanyk, Denys; Di Penta, Massimiliano; Bavota, Gabriele	an empirical study on the usage of transformer models for code completion	https://ieeexplore.ieee.org/document/9616462/	Code completion aims at speeding up code writing by predicting the next code token(s) the developer is likely to write. Works in this field focused on improving the accuracy of the generated predictions, with substantial leaps forward made possible by deep learning ({DL}) models. However, code completion techniques are mostly evaluated in the scenario of predicting the next token to type, with few exceptions pushing the boundaries to the prediction of an entire code statement. Thus, little is known about the performance of state-of-the-art code completion approaches in more challenging scenarios in which, for example, an entire code block must be generated. We present a large-scale study exploring the capabilities of state-of-the-art Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks (e.g., the iterated block of a for loop). We experimented with several variants of two recently proposed Transformer-based models, namely {RoBERTa} and the Text-To-Text Transfer Transformer (T5), for the task of code completion. The achieved results show that Transformer-based models, and in particular the T5, represent a viable solution for code completion, with perfect predictions ranging from {\textbackslash}sim∼29\%, obtained when asking the model to guess entire blocks, up to {\textbackslash}sim∼69\%, reached in the simpler scenario of few tokens masked from the same code statement.	TSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
24	2022	Sharma, Rishab; Chen, Fuxiang; Fard, Fatemeh; Lo, David	an exploratory study on code attention in bert	https://dl.acm.org/doi/10.1145/3524610.3527921	Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve the state of the arts results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled.Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers' embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21--24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.	ICPC	B_SE	1	1	-1	2	3	3	3	3	2	3	19
25	2022	Zeng, Zhengran; Tan, Hanzhuo; Zhang, Haotian; Li, Jing; Zhang, Yuqun; Zhang, Lingming	an extensive study on pre-trained models for program understanding and generation	https://dl.acm.org/doi/10.1145/3533767.3534390	Automatic program understanding and generation techniques could significantly advance the productivity of programmers and have been widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop general-purpose pre-trained models which can be applied for a broad range of program understanding and generation tasks. Such pre-trained models, derived by self-supervised objectives on large unlabelled corpora, can be fine-tuned in downstream tasks (such as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior techniques, they seldom follow equivalent evaluation protocols, e.g., they are hardly evaluated on the identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive study of the pre-trained models on their effectiveness, versatility as well as the limitations to provide implications and guidance for the future development in this area. To this end, we first perform an extensive study of eight open-access pre-trained models over a large benchmark on seven representative code tasks to assess their reproducibility. We further compare the pre-trained models and domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we can in general replicate the original performance of the pre-trained models on their evaluated tasks and adopted benchmarks, subtle performance fluctuations can refute the findings in their original papers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform the first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues. At last, we also provide multiple practical guidelines for advancing future research on pre-trained models for program understanding and generation.	ISSTA	A_SE	0	1	-1	3	3	3	3	3	3	3	21
26	2023	Sadik, Ahmed R; Ceravola, Antonello; Joublin, Frank; Patra, Jibesh	Analysis of ChatGPT on Source Code		This paper explores the use of Large Language Models (LLMs) and in particular ChatGPT in programming, source code analysis, and code generation. LLMs and ChatGPT are built using machine learning and artificial intelligence techniques, and they offer several benefits to developers and programmers. While these models can save time and provide highly accurate results, they are not yet advanced enough to replace human programmers entirely. The paper investigates the potential applications of LLMs and ChatGPT in various areas, such as code creation, code documentation, bug detection, refactoring, and more. The paper also suggests that the usage of LLMs and ChatGPT is expected to increase in the future as they offer unparalleled benefits to the programming community.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
27	2023	Huang, Di; Nan, Ziyuan; Hu, Xing; Jin, Pengwei; Peng, Shaohui; Wen, Yuanbo; Zhang, Rui; Du, Zidong; Guo, Qi; Pu, Yewen; Chen, Yunji	anpl: compiling natural programs with interactive decomposition	http://arxiv.org/abs/2305.18498	The advents of Large Language Models (LLMs) have shown promise in augmenting programming using natural interactions. However, while LLMs are proficient in compiling common usage patterns into a programming language, e.g., Python, it remains a challenge how to edit and debug an LLM-generated program. We introduce ANPL, a programming system that allows users to decompose user-specific tasks. In an ANPL program, a user can directly manipulate sketch, which specifies the data flow of the generated program. The user annotates the modules, or hole with natural language descriptions offloading the expensive task of generating functionalities to the LLM. Given an ANPL program, the ANPL compiler generates a cohesive Python program that implements the functionalities in hole, while respecting the dataflows specified in sketch. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, showing it outperforms baseline programming systems that (a) without the ability to decompose tasks interactively and (b) without the guarantee that the modules can be correctly composed together. We obtain a dataset consisting of 300/400 ARC tasks that were successfully decomposed and grounded in Python, providing valuable insights into how humans decompose programmatic tasks. See the dataset at https://iprc-dip.github.io/DARC.	W	W	1	1	-1	0	3	3	3	2	3	3	17
28	2023	Huang, Qing; Sun, Yanbang; Xing, Zhenchang; Yu, Min; Xu, Xiwei; Lu, Qinghua	api entity and relation joint extraction from text via dynamic prompt-tuned language model	http://arxiv.org/abs/2301.03987	Extraction of Application Programming Interfaces (APIs) and their semantic relations from unstructured text (e.g., Stack Overflow) is a fundamental work for software engineering tasks (e.g., API recommendation). However, existing approaches are rule-based and sequence-labeling based. They must manually enumerate the rules or label data for a wide range of sentence patterns, which involves a significant amount of labor overhead and is exacerbated by morphological and common-word ambiguity. In contrast to matching or labeling API entities and relations, this paper formulates heterogeneous API extraction and API relation extraction task as a sequence-to-sequence generation task, and proposes AERJE, an API entity-relation joint extraction model based on the large pre-trained language model. After training on a small number of ambiguous but correctly labeled data, AERJE builds a multi-task architecture that extracts API entities and relations from unstructured text using dynamic prompts. We systematically evaluate AERJE on a set of long and ambiguous sentences from Stack Overflow. The experimental results show that AERJE achieves high accuracy and discrimination ability in API entity-relation joint extraction, even with zero or few-shot fine-tuning.	W	W	1	1	-1	0	3	3	3	2	3	3	17
29	2023	Paranjape, Bhargavi; Lundberg, Scott; Singh, Sameer; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Ribeiro, Marco Tulio	art: automatic multi-step reasoning and tool-use for large language models	http://arxiv.org/abs/2303.09014	Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.	W	W	1	1	-1	0	3	3	3	1	2	3	15
30	2022	Yang, chengran; Xu, Bowen; Khan, Junaed younus; Uddin, Gias; Han, Donggyun; Yang, Zhou; Lo, David	aspect-based api review classification: how far can pre-trained transformer model go?	http://arxiv.org/abs/2201.11327	APIs (Application Programming Interfaces) are reusable software libraries and are building blocks for modern rapid software development. Previous research shows that programmers frequently share and search for reviews of APIs on the mainstream software question and answer (Q\\&A) platforms like Stack Overflow, which motivates researchers to design tasks and approaches related to process API reviews automatically. Among these tasks, classifying API reviews into different aspects (e.g., performance or security), which is called the aspect-based API review classification, is of great importance. The current state-of-the-art (SOTA) solution to this task is based on the traditional machine learning algorithm. Inspired by the great success achieved by pre-trained models on many software engineering tasks, this study fine-tunes six pre-trained models for the aspect-based API review classification task and compares them with the current SOTA solution on an API review benchmark collected by Uddin et al. The investigated models include four models (BERT, RoBERTa, ALBERT and XLNet) that are pre-trained on natural languages, BERTOverflow that is pre-trained on text corpus extracted from posts on Stack Overflow, and CosSensBERT that is designed for handling imbalanced data. The results show that all the six fine-tuned models outperform the traditional machine learning-based tool. More specifically, the improvement on the F1-score ranges from 21.0\\% to 30.2\\%. We also find that BERTOverflow, a model pre-trained on the corpus from Stack Overflow, does not show better performance than BERT. The result also suggests that CosSensBERT also does not exhibit better performance than BERT in terms of F1, but it is still worthy of being considered as it achieves better performance on MCC and AUC.	SANER	B_SE	1	1	-1	2	3	3	3	3	2	3	19
31	2022	Gu, Jian; Salza, Pasquale; Gall, Harald C.	assemble foundation models for automatic code summarization	http://arxiv.org/abs/2201.05222	Automatic code summarization is beneficial to daily software development since it could help reduce the requirement of manual writing. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Thereby, we propose a flexible and robust approach for automatic code summarization, based on neural models. We assemble available foundation models, such as CodeBERT and GPT-2, into a single neural model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models.	SANER	B_SE	1	1	-1	2	3	3	3	3	2	3	19
32	2023	Jana, Prithwish; Jha, Piyush; Ju, Haoyang; Kishore, Gautham; Mahajan, Aryan; Ganesh, Vijay	Attention, Compilation, and Solver-based Symbolic Analysis are All You Need		In this paper we present a Java-to-Python (J2P) and Python-to-Java (P2J) backto-back code translation method, and associated tool called CoTran, based on large language models (LLMs). Our method leverages the attention mechanism of LLMs, compilation, and symbolic execution-based test generation for equivalence testing between the input and output programs. More precisely, we modify the typical LLM training loop to incorporate compiler and symbolic execution loss. Via extensive experiments comparing CoTran with 10 other transpilers and LLM-based translation tools over a benchmark of more than 57,000 Java-Python equivalent pairs, we show that CoTran outperforms them on relevant metrics such as compilation and runtime equivalence accuracy. For example, our tool gets 97.43% compilation accuracy and 49.66% runtime equivalence accuracy for J2P translation, whereas the nearest competing tool only gets 96.44% and 6.8% respectively.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
33	2022	Li, Lingwei; Yang, Li; Jiang, Huaxi; Yan, Jun; Luo, Tiejian; Hua, Zihan; Liang, Geng; Zuo, Chun	auger: automatically generating review comments with pre-training models	https://dl.acm.org/doi/10.1145/3540250.3549099	Code review is one of the best practices as a powerful safeguard for software quality. In practice, senior or highly skilled reviewers inspect source code and provide constructive comments, consider- ing what authors may ignore, for example, some special cases. The collaborative validation between contributors results in code being highly qualified and less chance of bugs. However, since personal knowledge is limited and varies, the efficiency and effectiveness of code review practice are worthy of further improvement. In fact, it still takes a colossal and time-consuming effort to deliver useful review comments. This paper explores a synergy of multiple practical review comments to enhance code review and proposes AUGER (AUtomatically GEnerating Review comments): a review comments generator with pre-training models. We first collect empirical review data from 11 notable Java projects and construct a dataset of 10,882 code changes. By leveraging Text-to-Text Transfer Transformer (T5) models, the framework synthesizes valuable knowledge in the training stage and effectively outperforms baselines by 37.38% in ROUGE-L. 29% of our automatic review comments are considered useful according to prior studies. The inference generates just in 20 seconds and is also open to training further. Moreover, the performance also gets improved when thoroughly analyzed in case study.	FSE/ESEC	A_SECURITY	1	1	-1	3	3	3	3	3	3	3	21
34	2021	Ghadhab, Lobna; Jenhani, Ilyes; Mkaouer, Mohamed Wiem; Ben Messaoud, Montassar	augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model	https://linkinghub.elsevier.com/retrieve/pii/S0950584921000495	Context: Analyzing software maintenance activities is very helpful in ensuring cost-effective evolution and development activities. The categorization of commits into maintenance tasks supports practitioners in making decisions about resource allocation and managing technical debt. Objective: In this paper, we propose to use a pre-trained language neural model, namely BERT (Bidirectional Encoder Representations from Transformers) for the classification of commits into three categories of mainte-nance tasks ? corrective, perfective and adaptive. The proposed commit classification approach will help the classifier better understand the context of each word in the commit message. Methods: We built a balanced dataset of 1793 labeled commits that we collected from publicly available datasets. We used several popular code change distillers to extract fine-grained code changes that we have incorporated into our dataset as additional features to BERT?s word representation features. In our study, a deep neural network (DNN) classifier has been used as an additional layer to fine-tune the BERT model on the task of commit classification. Several models have been evaluated to come up with a deep analysis of the impact of code changes on the classification performance of each commit category. Results and conclusions: Experimental results have shown that the DNN model trained on BERT?s word representations and Fixminer code changes (DNN@BERT+Fix\_cc) provided the best performance and achieved 79.66\% accuracy and a macro-average f1 score of 0.8. Comparison with the state-of-the-art model that combines keywords and code changes (RF@KW+CD\_cc) has shown that our model achieved approximately 8\% improvement in accuracy. Results have also shown that a DNN model using only BERT?s word representation features achieved an improvement of 5\% in accuracy compared to the RF@KW+CD\_cc model.	IST	B_SE	1	1	-1	2	3	3	3	3	3	3	20
35	2023	Hu, Jie; Zhang, Qian; Yin, Heng	Augmenting Greybox Fuzzing with Generative AI		Real-world programs expecting structured inputs often has a format-parsing stage gating the deeper program space. Neither a mutation-based approach nor a generative approach can provide a solution that is effective and scalable. Large language models (LLM) pre-trained with an enormous amount of natural language corpus have proved to be effective for understanding the implicit format syntax and generating format-conforming inputs. In this paper, propose CHATFUZZ, a greybox fuzzer augmented by generative AI. More specifically, we pick a seed in the fuzzer’s seed pool and prompt ChatGPT generative models to variations, which are more likely to be format-conforming and thus of high quality. We conduct extensive experiments to explore the best practice for harvesting the power of generative LLM models. The experiment results show that our approach improves the edge coverage by 12.77% over the SOTA greybox fuzzer (AFL++) on 12 target programs from three well-tested benchmarks. As for vulnerability detection, CHATFUZZ is able to perform similar to or better than AFL++ for programs with explicit syntax rules but not for programs with non-trivial syntax.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
36	2022	Ezzini S,Abualhaija S,Arora C,Sabetzadeh M	Automated Handling of Anaphoric Ambiguity in Requirements: A Multi-Solution Study	https://doi.org/10.1145/3510003.3510157	Ambiguity is a pervasive issue in natural-language requirements. A common source of ambiguity in requirements is when a pronoun is anaphoric. In requirements engineering, anaphoric ambiguity occurs when a pronoun can plausibly refer to different entities and thus be interpreted differently by different readers. In this paper, we develop an accurate and practical automated approach for handling anaphoric ambiguity in requirements, addressing both ambiguity detection and anaphora interpretation. In view of the multiple competing natural language processing (NLP) and machine learning (ML) technologies that one can utilize, we simultaneously pursue six alternative solutions, empirically assessing each using a collection of ≈1,350 industrial requirements. The alternative solution strategies that we consider are natural choices induced by the existing technologies; these choices frequently arise in other automation tasks involving natural-language requirements. A side-by-side empirical examination of these choices helps develop insights about the usefulness of different state-of-the-art NLP and ML technologies for addressing requirements engineering problems. For the ambiguity detection task, we observe that supervised ML outperforms both a large-scale language model, SpanBERT (a variant of BERT), as well as a solution assembled from off-the-shelf NLP coreference re-solvers. In contrast, for anaphora interpretation, SpanBERT yields the most accurate solution. In our evaluation, (1) the best solution for anaphoric ambiguity detection has an average precision of ≈60% and a recall of 100%, and (2) the best solution for anaphora interpretation (resolution) has an average success rate of ≈98%.	ICSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
37	2023	Fan, Zhiyu; Gao, Xiang; Mirchev, Martin; Roychoudhury, Abhik; Tan, Shin Hwei	automated repair of programs from large language models	http://arxiv.org/abs/2205.10583	Large language models such as Codex, have shown the capability to produce code for many programming tasks. However, the success rate of existing models is low, especially for complex programming tasks. One of the reasons is that language models lack awareness of program semantics, resulting in incorrect programs, or even programs which do not compile. In this paper, we systematically study whether automated program repair (APR) techniques can fix the incorrect solutions produced by language models in LeetCode contests. The goal is to study whether APR techniques can enhance reliability in the code produced by large language models. Our study revealed that: (1) automatically generated code shares common programming mistakes with human-crafted solutions, indicating APR techniques may have potential to fix auto-generated code; (2) given bug location information provided by a statistical fault localization approach, the newly released Codex edit mode, which supports editing code, is similar to or better than existing Java repair tools TBar and Recoder in fixing incorrect solutions. By analyzing the experimental results generated by these tools, we provide several suggestions: (1) enhancing APR tools to surpass limitations in patch space (e.g., introducing more flexible fault localization) is desirable; (2) as large language models can derive more fix patterns by training on more data, future APR tools could shift focus from adding more fix patterns to synthesis/semantics based approaches, (3) combination of language models with APR to curate patch ingredients, is worth studying.	ICSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
38	2023	Kou, Bonan; Chen, Muhao; Zhang, Tianyi	Automated Summarization of Stack Overflow Posts	https://arxiv.org/abs/2305.16680	Software developers often resort to Stack Overflow (SO) to fill their programming needs. Given the abundance of relevant posts, navigating them and comparing different solutions is tedious and time-consuming. Recent work has proposed to automatically summarize SO posts to concise text to facilitate the navigation of SO posts. However, these techniques rely only on information retrieval methods or heuristics for text summarization, which is insufficient to handle the ambiguity and sophistication of natural language. This paper presents a deep learning based framework called ASSORT for SO post summarization. ASSORT includes two complementary learning methods, ASSORT_S and ASSORT_{IS}, to address the lack of labeled training data for SO post summarization. ASSORT_S is designed to directly train a novel ensemble learning model with BERT embeddings and domainspecific features to account for the unique characteristics of SO posts. By contrast, ASSORT_{IS} is designed to reuse pre-trained models while addressing the domain shift challenge when no training data is present (i.e., zero-shot learning). Both ASSORT_S and ASSORT_{IS} outperform six existing techniques by at least 13% and 7% respectively in terms of the F1 score. Furthermore, a human study shows that participants significantly preferred summaries generated by ASSORT_S and ASSORT_{IS} over the best baseline, while the preference difference between ASSORT_S and ASSORT_{IS} was small.	ICSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
39	2022	Khan, Junaed Younus; Uddin, Gias	automatic detection and analysis of technical debts in peer-review documentation of r packages	http://arxiv.org/abs/2201.04241	Technical debt (TD) is a metaphor for code-related problems that arise as a result of prioritizing speedy delivery over perfect code. Given that the reduction of TDs can have long-term positive impact in the software engineering life-cycle (SDLC), TDs are studied extensively in the literature. However, very few of the existing research focused on the technical debts of R programming language despite its popularity and usage. Recent research by Codabux et al. [21] finds that R packages can have 10 diverse TD types analyzing peer-review documentation. However, the findings are based on the manual analysis of a small sample of R package review comments. In this paper, we develop a suite of Machine Learning (ML) classifiers to detect the 10 TDs automatically. The best performing classifier is based on the deep ML model BERT, which achieves F1-scores of 0.71 - 0.91. We then apply the trained BERT models on all available peer-review issue comments from two platforms, rOpenSci and BioConductor (13.5K review comments coming from a total of 1297 R packages). We conduct an empirical study on the prevalence and evolution of 10 TDs in the two R platforms. We discovered documentation debt is the most prevalent among all types of TD, and it is also expanding rapidly. We also find that R packages of generic platform (i.e. rOpenSci) are more prone to TD compared to domain-specific platform (i.e. BioConductor). Our empirical study findings can guide future improvements opportunities in R package documentation. Our ML models can be used to automatically monitor the prevalence and evolution of TDs in R package documentation.	SANER	B_SE	1	1	-1	2	3	3	3	2	2	3	18
40	2021	Khan, Junaed Younus; Tawkat Islam Khondaker, Md.; Uddin, Gias; Iqbal, Anindya	Automatic Detection of Five API Documentation Smells: Practitioners’ Perspectives		The learning and usage of an API is supported by official documentation. Like source code, API documentation is itself a software product. Several research results show that bad design in API documentation can make the reuse of API features difficult. Indeed, similar to code smells or code anti-patterns, poorly designed API documentation can also exhibit `smells'. Such documentation smells can be described as bad documentation styles that do not necessarily produce an incorrect documentation but nevertheless make the documentation difficult to properly understand and to use. Recent research on API documentation has focused on finding content inaccuracies in API documentation and to complement API documentation with external resources (e.g., crowd-shared code examples). We are aware of no research that focused on the automatic detection of API documentation smells. This paper makes two contributions. First, we produce a catalog of five API documentation smells by consulting literature on API documentation presentation problems. We create a benchmark dataset of 1,000 API documentation units by exhaustively and manually validating the presence of the five smells in Java official API reference and instruction documentation. Second, we conduct a survey of 21 professional software developers to validate the catalog. The developers agreed that they frequently encounter all five smells in API official documentation and 95.2% of them reported that the presence of the documentation smells negatively affects their productivity. The participants wished for tool support to automatically detect and fix the smells in API official documentation. We develop a suite of rule-based, deep and shallow machine learning classifiers to automatically detect the smells. The best performing classifier BERT, a deep learning model, achieves F1-scores of 0.75 - 0.97.	SANER	B_SE	1	1	-1	2	3	3	3	3	3	3	20
41	2023	Zhao, Xu; Xie, Yuxi; Kawaguchi, Kenji; He, Junxian; Xie, Qizhe	automatic model selection with large language models for reasoning	http://arxiv.org/abs/2305.14333	Chain-of-Thought and Program-Aided Language Models represent two distinct reasoning methods, each with its own strengths and weaknesses. We demonstrate that it is possible to combine the best of both worlds by using different models for different problems, employing a large language model (LLM) to perform model selection. Through a theoretical analysis, we discover that the performance improvement is determined by the differences between the combined methods and the success rate of choosing the correct model. On eight reasoning datasets, our proposed approach shows significant improvements. Furthermore, we achieve new state-of-the-art results on GSM8K and SVAMP with accuracies of 96.5\\% and 93.7\\%, respectively. Our code is publicly available at https://github.com/XuZhao0/Model-Selection-Reasoning.	W	W	0	1	-1	0	3	3	3	2	1	3	15
42	2023	Zhu, Jie; Li, Lingwei; Yang, Li; Ma, Xiaoxiao; Zuo, Chun	automating method naming with context-aware prompt-tuning	http://arxiv.org/abs/2303.05771	Method names are crucial to program comprehension and maintenance. Recently, many approaches have been proposed to automatically recommend method names and detect inconsistent names. Despite promising, their results are still sub-optimal considering the three following drawbacks: 1) These models are mostly trained from scratch, learning two different objectives simultaneously. The misalignment between two objectives will negatively affect training efficiency and model performance. 2) The enclosing class context is not fully exploited, making it difficult to learn the abstract function of the method. 3) Current method name consistency checking methods follow a generate-then-compare process, which restricts the accuracy as they highly rely on the quality of generated names and face difficulty measuring the semantic consistency. In this paper, we propose an approach named AUMENA to AUtomate MEthod NAming tasks with context-aware prompt-tuning. Unlike existing deep learning based approaches, our model first learns the contextualized representation(i.e., class attributes) of PL and NL through the pre-training model, then fully exploits the capacity and knowledge of large language model with prompt-tuning to precisely detect inconsistent method names and recommend more accurate names. To better identify semantically consistent names, we model the method name consistency checking task as a two-class classification problem, avoiding the limitation of previous similarity-based consistency checking approaches. The experimental results reflect that AUMENA scores 68.6\\%, 72.0\\%, 73.6\\%, 84.7\\% on four datasets of method name recommendation, surpassing the state-of-the-art baseline by 8.5\\%, 18.4\\%, 11.0\\%, 12.0\\%, respectively. And our approach scores 80.8\\% accuracy on method name consistency checking, reaching an 5.5\\% outperformance. All data and trained models are publicly available.	ICPC	B_SE	1	1	-1	2	3	3	3	3	3	3	20
43	2023	Schroder, Martin	AutoScrum: Automating Project Planning Using Large Language Models		Recent advancements in the field of large language models have made it possible to use language models for advanced reasoning. In this paper we leverage this ability for designing complex project plans based only on knowing the current state and the desired state. Two approaches are demonstrated - a scrum based approach and a shortcut plan approach. The scrum based approach executes an automated process of requirements gathering, user story mapping, feature identification, task decomposition and finally generates questions and search terms for seeking out domain specific information to assist with task completion. The shortcut approach looks at most recent snapshot of the current and desired state and generates the next most reasonable task to do in order to get to the desired state as quickly as possible. In this paper we automate everything using a novel concept of "Language Programs". These are programs written in natural language designed to process input data through the language model. Guidance language is used for all LLM programs. All demo source code for this paper is available at https://github.com/autoscrum/autoscrum	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
44	2022	Shen, Da; Chen, Xinyun; Wang, Chenguang; Sen, Koushik; Song, Dawn	benchmarking language models for code syntax understanding	http://arxiv.org/abs/2210.14473	Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works show that pre-trained language models can capture the syntactic rules of natural languages without finetuning on syntax understanding tasks. However, there is limited understanding of how well pre-trained models understand the code structure so far. In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pretrained on code still lack the understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.	EMNLP	B_AI	1	1	-1	2	3	2	3	3	3	3	19
45	2022	Zhang J,Liu S,Gong L,Zhang H,Huang Z,Jiang H	BEQAIN: An Effective and Efficient Identifier Normalization Approach with BERT and the Question Answering System	http://dx.doi.org/10.1109/TSE.2022.3227559	As one of the most important resources to express the semantics of source code, identifiers are usually composed of several common or domain-specific terms and abbreviations, thus heavily hindering developers from analyzing and comprehending source code. Hence, it is very necessary to normalize identifiers, which aims to align the vocabulary found in identifiers with natural language words found in other software artifacts. Even though researchers have proposed several identifier normalization approaches in the literature, these approaches only rely on the lexical information in identifiers and related source code entities to normalize identifiers, suffering from the lack of deep semantic understanding of identifiers. In this paper, we propose an effective and efficient identifier normalization approach BEQAIN to split identifiers into their composing words and expand the enclosed abbreviations. Specifically, BEQAIN employs a deep learning model, which is mainly composed of a Bidirectional Encoder Representation from Transformers (BERT) layer and a Conditional Random Fields (CRF) layer to embed identifiers into low-level vectors and learn the identifier splitting patterns. The BERT-CRF network is also combined with a pre-processing component and a post-processing component to resolve the problems of over-splitting and under-splitting so as to improve the identifier splitting performance. Furthermore, BEQAIN also employs a Question Answering (Q&A) system to learn the abbreviation expansion mappings and leverages the current programming context to determine the exactly correct expansion when there are multiple expansions for specific abbreviations. After BEQAIN is fully trained, it can be used to normalize identifiers. We conduct extensive experiments to validate the effectiveness and efficiency of BEQAIN over two publicly available datasets with nine projects. Experimental results show that BEQAIN achieves the overall average Accuracy of 80.20% and outperforms the existing state-of-the-art approach by 9.88% in normalizing identifiers. The pre-processing and post-processing components could improve the Accuracy of BEQAIN in identifier splitting by 11.70%. Employing the programming context information could improve the Accuracy of BEQAIN in abbreviation expansion by 11.15% on average. In addition, the average normalization time of BEQAIN is less than one second. Finally, we also discuss some observations for the road ahead for identifier normalization to inspire other researchers.	TSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
46	2023	Gomes, Luiz; Da Silva Torres, Ricardo; Côrtes, Mario Lúcio	bert- and tf-idf-based feature extraction for long-lived bug prediction in floss: a comparative study	https://linkinghub.elsevier.com/retrieve/pii/S095058492300071X	Context: The correct prediction of long-lived bugs could help maintenance teams to build their plan and to fix more bugs that often adversely affect software quality and disturb the user experience across versions in Free/Libre Open-Source Software (FLOSS). Machine Learning and Text Mining methods have been applied to solve many real-world prediction problems, including bug report handling. Objective: Our research aims to compare the accuracy of ML classifiers on long-lived bug prediction in FLOSS using Bidirectional Encoder Representations from Transformers (BERT)- and Term Frequency - Inverse Document Frequency (TF-IDF)-based feature extraction. Besides that, we aim to investigate BERT variants on the same task. Method: We collected bug reports from six popular FLOSS and used the Machine Learning classifiers to predict long-lived bugs. Furthermore, we compare different feature extractors, based on BERT and TF-IDF methods, in long-lived bug prediction. Results: We found that long-lived bug prediction using BERT-based feature extraction systematically outperformed the TF-IDF. The SVM and Random Forest outperformed other classifiers in almost all datasets using BERT. Furthermore, smaller BERT architectures show themselves as competitive. Conclusion: Our results demonstrated a promising avenue to predict long-lived bugs based on BERT contextual embedding features and fine-tuning procedures.	IST	B_SE	1	1	-1	2	3	3	2	2	3	3	18
47	2023	Workshop, BigScience; Scao, Teven Le; Fan, Angela; Akiki, Christopher; Pavlick, Ellie; Ilić, Suzana; Hesslow, Daniel; Castagné, Roman; Luccioni, Alexandra Sasha; Yvon, François; Gallé, Matthias; Tow, Jonathan; Rush, Alexander M.; Biderman, Stella; Webson, Albert; Ammanamanchi, Pawan Sasanka; Wang, Thomas; Sagot, Benoît; Muennighoff, Niklas; del Moral, Albert Villanova; Ruwase, Olatunji; Bawden, Rachel; Bekman, Stas; McMillan-Major, Angelina; Beltagy, Iz; Nguyen, Huu; Saulnier, Lucile; Tan, Samson; Suarez, Pedro Ortiz; Sanh, Victor; Laurençon, Hugo; Jernite, Yacine; Launay, Julien; Mitchell, Margaret; Raffel, Colin; Gokaslan, Aaron; Simhi, Adi; Soroa, Aitor; Aji, Alham Fikri; Alfassy, Amit; Rogers, Anna; Nitzav, Ariel Kreisberg; Xu, Canwen; Mou, Chenghao; Emezue, Chris; Klamm, Christopher; Leong, Colin; van Strien, Daniel; Adelani, David Ifeoluwa; Radev, Dragomir; Ponferrada, Eduardo González; Levkovizh, Efrat; Kim, Ethan; Natan, Eyal Bar; De Toni, Francesco; Dupont, Gérard; Kruszewski, Germán; Pistilli, Giada; Elsahar, Hady; Benyamina, Hamza; Tran, Hieu; Yu, Ian; Abdulmumin, Idris; Johnson, Isaac; Gonzalez-Dios, Itziar; de la Rosa, Javier; Chim, Jenny; Dodge, Jesse; Zhu, Jian; Chang, Jonathan; Frohberg, Jörg; Tobing, Joseph; Bhattacharjee, Joydeep; Almubarak, Khalid; Chen, Kimbo; Lo, Kyle; Von Werra, Leandro; Weber, Leon; Phan, Long; allal, Loubna Ben; Tanguy, Ludovic; Dey, Manan; Muñoz, Manuel Romero; Masoud, Maraim; Grandury, María; Šaško, Mario; Huang, Max; Coavoux, Maximin; Singh, Mayank; Jiang, Mike Tian-Jian; Vu, Minh Chien; Jauhar, Mohammad A.; Ghaleb, Mustafa; Subramani, Nishant; Kassner, Nora; Khamis, Nurulaqilla; Nguyen, Olivier; Espejel, Omar; de Gibert, Ona; Villegas, Paulo; Henderson, Peter; Colombo, Pierre; Amuok, Priscilla; Lhoest, Quentin; Harliman, Rheza; Bommasani, Rishi; López, Roberto Luis; Ribeiro, Rui; Osei, Salomey; Pyysalo, Sampo; Nagel, Sebastian; Bose, Shamik; Muhammad, Shamsuddeen Hassan; Sharma, Shanya; Longpre, Shayne; Nikpoor, Somaieh; Silberberg, Stanislav; Pai, Suhas; Zink, Sydney; Torrent, Tiago Timponi; Schick, Timo; Thrush, Tristan; Danchev, Valentin; Nikoulina, Vassilina; Laippala, Veronika; Lepercq, Violette; Prabhu, Vrinda; Alyafeai, Zaid; Talat, Zeerak; Raja, Arun; Heinzerling, Benjamin; Si, Chenglei; Taşar, Davut Emre; Salesky, Elizabeth; Mielke, Sabrina J.; Lee, Wilson Y.; Sharma, Abheesht; Santilli, Andrea; Chaffin, Antoine; Stiegler, Arnaud; Datta, Debajyoti; Szczechla, Eliza; Chhablani, Gunjan; Wang, Han; Pandey, Harshit; Strobelt, Hendrik; Fries, Jason Alan; Rozen, Jos; Gao, Leo; Sutawika, Lintang; Bari, M. Saiful; Al-shaibani, Maged S.; Manica, Matteo; Nayak, Nihal; Teehan, Ryan; Albanie, Samuel; Shen, Sheng; Ben-David, Srulik; Bach, Stephen H.; Kim, Taewoon; Bers, Tali; Fevry, Thibault; Neeraj, Trishala; Thakker, Urmish; Raunak, Vikas; Tang, Xiangru; Yong, Zheng-Xin; Sun, Zhiqing; Brody, Shaked; Uri, Yallow; Tojarieh, Hadar; Roberts, Adam; Chung, Hyung Won; Tae, Jaesung; Phang, Jason; Press, Ofir; Li, Conglong; Narayanan, Deepak; Bourfoune, Hatim; Casper, Jared; Rasley, Jeff; Ryabinin, Max; Mishra, Mayank; Zhang, Minjia; Shoeybi, Mohammad; Peyrounette, Myriam; Patry, Nicolas; Tazi, Nouamane; Sanseviero, Omar; von Platen, Patrick; Cornette, Pierre; Lavallée, Pierre François; Lacroix, Rémi; Rajbhandari, Samyam; Gandhi, Sanchit; Smith, Shaden; Requena, Stéphane; Patil, Suraj; Dettmers, Tim; Baruwa, Ahmed; Singh, Amanpreet; Cheveleva, Anastasia; Ligozat, Anne-Laure; Subramonian, Arjun; Névéol, Aurélie; Lovering, Charles; Garrette, Dan; Tunuguntla, Deepak; Reiter, Ehud; Taktasheva, Ekaterina; Voloshina, Ekaterina; Bogdanov, Eli; Winata, Genta Indra; Schoelkopf, Hailey; Kalo, Jan-Christoph; Novikova, Jekaterina; Forde, Jessica Zosa; Clive, Jordan; Kasai, Jungo; Kawamura, Ken; Hazan, Liam; Carpuat, Marine; Clinciu, Miruna; Kim, Najoung; Cheng, Newton; Serikov, Oleg; Antverg, Omer; van der Wal, Oskar; Zhang, Rui; Zhang, Ruochen; Gehrmann, Sebastian; Mirkin, Shachar; Pais, Shani; Shavrina, Tatiana; Scialom, Thomas; Yun, Tian; Limisiewicz, Tomasz; Rieser, Verena; Protasov, Vitaly; Mikhailov, Vladislav; Pruksachatkun, Yada; Belinkov, Yonatan; Bamberger, Zachary; Kasner, Zdeněk; Rueda, Alice; Pestana, Amanda; Feizpour, Amir; Khan, Ammar; Faranak, Amy; Santos, Ana; Hevia, Anthony; Unldreaj, Antigona; Aghagol, Arash; Abdollahi, Arezoo; Tammour, Aycha; HajiHosseini, Azadeh; Behroozi, Bahareh; Ajibade, Benjamin; Saxena, Bharat; Ferrandis, Carlos Muñoz; McDuff, Daniel; Contractor, Danish; Lansky, David; David, Davis; Kiela, Douwe; Nguyen, Duong A.; Tan, Edward; Baylor, Emi; Ozoani, Ezinwanne; Mirza, Fatima; Ononiwu, Frankline; Rezanejad, Habib; Jones, Hessie; Bhattacharya, Indrani; Solaiman, Irene; Sedenko, Irina; Nejadgholi, Isar; Passmore, Jesse; Seltzer, Josh; Sanz, Julio Bonis; Dutra, Livia; Samagaio, Mairon; Elbadri, Maraim; Mieskes, Margot; Gerchick, Marissa; Akinlolu, Martha; McKenna, Michael; Qiu, Mike; Ghauri, Muhammed; Burynok, Mykola; Abrar, Nafis; Rajani, Nazneen; Elkott, Nour; Fahmy, Nour; Samuel, Olanrewaju; An, Ran; Kromann, Rasmus; Hao, Ryan; Alizadeh, Samira; Shubber, Sarmad; Wang, Silas; Roy, Sourav; Viguier, Sylvain; Le, Thanh; Oyebade, Tobi; Le, Trieu; Yang, Yoyo; Nguyen, Zach; Kashyap, Abhinav Ramesh; Palasciano, Alfredo; Callahan, Alison; Shukla, Anima; Miranda-Escalada, Antonio; Singh, Ayush; Beilharz, Benjamin; Wang, Bo; Brito, Caio; Zhou, Chenxi; Jain, Chirag; Xu, Chuxin; Fourrier, Clémentine; Periñán, Daniel León; Molano, Daniel; Yu, Dian; Manjavacas, Enrique; Barth, Fabio; Fuhrimann, Florian; Altay, Gabriel; Bayrak, Giyaseddin; Burns, Gully; Vrabec, Helena U.; Bello, Imane; Dash, Ishani; Kang, Jihyun; Giorgi, John; Golde, Jonas; Posada, Jose David; Sivaraman, Karthik Rangasai; Bulchandani, Lokesh; Liu, Lu; Shinzato, Luisa; de Bykhovetz, Madeleine Hahn; Takeuchi, Maiko; Pàmies, Marc; Castillo, Maria A.; Nezhurina, Marianna; Sänger, Mario; Samwald, Matthias; Cullan, Michael; Weinberg, Michael; De Wolf, Michiel; Mihaljcic, Mina; Liu, Minna; Freidank, Moritz; Kang, Myungsun; Seelam, Natasha; Dahlberg, Nathan; Broad, Nicholas Michio; Muellner, Nikolaus; Fung, Pascale; Haller, Patrick; Chandrasekhar, Ramya; Eisenberg, Renata; Martin, Robert; Canalli, Rodrigo; Su, Rosaline; Su, Ruisi; Cahyawijaya, Samuel; Garda, Samuele; Deshmukh, Shlok S.; Mishra, Shubhanshu; Kiblawi, Sid; Ott, Simon; Sang-aroonsiri, Sinee; Kumar, Srishti; Schweter, Stefan; Bharati, Sushil; Laud, Tanmay; Gigant, Théo; Kainuma, Tomoya; Kusa, Wojciech; Labrak, Yanis; Bajaj, Yash Shailesh; Venkatraman, Yash; Xu, Yifan; Xu, Yingxin; Xu, Yu; Tan, Zhe; Xie, Zhongli; Ye, Zifan; Bras, Mathilde; Belkada, Younes; Wolf, Thomas	BLOOM: A 176B-Parameter Open-Access Multilingual Language Model		Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
48	2023	Zhang, Quanjun; Fang, Chunrong; Sun, Weisong; Liu, Yan; He, Tieke; Hao, Xiaodong; Chen, Zhenyu	boosting automated patch correctness prediction via pre-trained language model	http://arxiv.org/abs/2301.12453	Automated program repair (APR) aims to fix software bugs automatically without human debugging efforts and plays a crucial role in software development and maintenance. Despite the recent significant progress, APR is still challenged by a long-standing overfitting problem (i.e., the generated patch is plausible but overfitting). Various techniques have thus been proposed to address the overfitting problem. Among them, leveraging deep learning approaches to predict patch correctness is emerging along with the available large-scale patch benchmarks recently. However, existing learning-based techniques mainly rely on manually-designed code features, which can be extremely costly and challenging to construct in practice. In this paper, we propose APPT, a pre-trained model-based automated patch correctness assessment technique, which treats the source code as token sequences without extra overhead to design hand-crafted features. In particular, APPT adopts a pre-trained model as the encoder stack, followed by an LSTM stack and a deep learning classifier. Although our idea is general and can be built on various pre-trained models, we implemente APPT based on the BERT model. We conduct an extensive experiment on 1,183 Defects4J patches and the results show that APPT achieves prediction accuracy of 79.0\\% and recall of 81.3\\%, outperforming the state-of-the-art technique CACHE by 3.6\\% and 4.8\\%. Our additional investigation on 49,694 real-world patches shows that APPT achieves the optimum performance (exceeding 99\\% in five common metrics for assessing patch classification techniques) compared with existing representation learning techniques. We also prove that adopting code pre-trained models can further provide substantial advancement (e.g., GraphCodeBERT-based APPT improves BERT-based APPT by 3.0\\% and 2.6\\% in precision and recall, respectively), highlighting the generalizability of APPT.	W	W	1	1	1	0	3	3	3	3	3	3	18
49	2023	Vikram, Vasudev; Lemieux, Caroline; Padhye, Rohan	Can Large Language Models Write Good Property-Based Tests?		Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for property-based tests. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we explore the potential of using LLMs to synthesize property-based tests. We call our approach PBT-GPT, and propose three different strategies of prompting the LLM for PBT. We characterize various failure modes of PBT-GPT and detail an evaluation methodology for automatically synthesized property-based tests. PBT-GPT achieves promising results in our preliminary studies on sample Python library APIs in numpy, networkx, and datetime.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
50	2022	Jones, Erik; Steinhardt, Jacob	capturing failures of large language models via human cognitive biases	http://arxiv.org/abs/2202.12299	Large language models generate complex, open-ended outputs: instead of outputting a class label they write summaries, generate dialogue, or produce working code. In order to asses the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.	NeurIPS	A_AI	1	1	-1	3	3	3	3	2	3	3	20
51	2023	Li, Zongjie; Wang, Chaozheng; Liu, Zhibo; Wang, Haoxuan; Chen, Dong; Wang, Shuai; Gao, Cuiyun	cctest: testing and repairing code completion systems	http://arxiv.org/abs/2208.08289	Code completion, a highly valuable topic in the software development domain, has been increasingly promoted for use by recent advances in large language models (LLMs). To date, visible LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open source code. As the paramount component and the cornerstone in daily programming tasks, code completion has largely boosted professionals' efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date, an automated testing and enhancement framework for code completion systems is not available. This research proposes CCTEST, a framework to test and repair code completion systems in blackbox settings. CCTEST features a set of novel mutation strategies, namely program structure-correlated (PSC) mutations, to generate mutated code completion inputs. Then, it detects inconsistent outputs, representing possibly erroneous cases, from all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that mostly reflects the "average" appearance of all output cases, as the final output of the code completion systems. We detected a total of 33,540 inputs (with a true positive rate of 86\\%) that can trigger erroneous cases from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40\\% and 67\\% with respect to BLEU score and Levenshtein edit similarity.	ICSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
52	2022	Zan, Daoguang; Chen, Bei; Yang, Dejian; Lin, Zeqi; Kim, Minsu; Guan, Bei; Wang, Yongji; Chen, Weizhu; Lou, Jian-Guang	cert: continual pre-training on sketches for library-oriented code generation	https://www.ijcai.org/proceedings/2022/329	Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Since it is a common practice for programmers to reuse third-party libraries, in which case the text-code paired data are harder to obtain due to the huge number of libraries. We observe that library-oriented code snippets are more likely to share similar code sketches. Hence, we present {CERT} with two steps: a sketcher generates the sketch, then a generator fills the details in the sketch. Both the sketcher and generator are continually pre-trained upon a base model using unlabelled data. Also, we carefully craft two benchmarks to evaluate library-oriented code generation named {PandasEval} and {NumpyEval}. Experimental results have shown the impressive performance of {CERT}. For example, it surpasses the base model by an absolute 15.67\% improvement in terms of pass@1 on {PandasEval}. Our work is available at https://github.com/microsoft/{PyCodeGPT}.	IJCAI	A_AI	1	1	-1	3	3	3	2	2	3	3	19
53	2023	Ronanki, Krishna and Cabrero-Daniel, Beatriz and Berger, Christian	ChatGPT as a tool for User Story Quality Evaluation: Trustworthy Out of the Box?		In Agile software development, user stories play a vital role in capturing and conveying end-user needs, prioritizing features, and facilitating communication and collaboration within development teams. However, automated methods for evaluating user stories require training in NLP tools and can be time-consuming to develop and integrate. This study explores using ChatGPT for user story quality evaluation and compares its performance with an existing benchmark. Our study shows that ChatGPT's evaluation aligns well with human evaluation, and we propose a ``best of three'' strategy to improve its output stability. We also discuss the concept of trustworthiness in AI and its implications for non-experts using ChatGPT's unprocessed outputs. Our research contributes to understanding the reliability and applicability of AI in user story evaluation and offers recommendations for future research.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
54	2023	Amos Azaria, Rina Azoulay, Shulamit Reches	ChatGPT is a Remarkable Tool—For Experts		This paper investigates the capabilities of ChatGPT as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style. Furthermore, we highlight the potential risks associated with excessive reliance on ChatGPT in these fields. These limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyrights and privacy violation. We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. In light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, ChatGPT should be used with a strategic methodology. By drawing from comprehensive experimental studies, we offer methods and flow charts for effectively using ChatGPT. Our recommendations emphasize iterative interaction with ChatGPT and independent verification of its outputs. Considering the importance of utilizing ChatGPT judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.	L_W	L_W	1	1	0	0	3	3	3	3	3	3	18
55	2023	White, Jules; Hays, Sam; Fu, Quchen; Spencer-Smith, Jesse; Schmidt, Douglas C.	chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design	http://arxiv.org/abs/2303.07839	This paper presents prompt design techniques for software engineering, in the form of patterns, to solve common problems when using large language models (LLMs), such as ChatGPT to automate common software engineering activities, such as ensuring code is decoupled from third-party libraries and simulating a web application API before it is implemented. This paper provides two contributions to research on using LLMs for software engineering. First, it provides a catalog of patterns for software engineering that classifies patterns according to the types of problems they solve. Second, it explores several prompt patterns that have been applied to improve requirements elicitation, rapid prototyping, code quality, refactoring, and system design.	W	W	1	1	0	0	3	2	3	2	2	3	15
56	2023	Tang, Yutian and Liu, Zhijie and Zhou, Zhichao and Luo, Xiapu	ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation		Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases compared to EvoSuite, this work provides valuable insights into the performance of LLMs in solving software engineering problems. Overall, our findings underscore the potential of LLMs in software engineering and pave the way for further research in this area.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
57	2023	Sridhara, Giriprasad; G., Ranjani H.; Mazumdar, Sourav	chatgpt: a study on its utility for ubiquitous software engineering tasks	http://arxiv.org/abs/2305.16837	ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot launched by OpenAI on November 30, 2022. OpenAI's GPT-3 family of large language models serve as the foundation for ChatGPT. ChatGPT is fine-tuned with both supervised and reinforcement learning techniques and has received widespread attention for its articulate responses across diverse domains of knowledge. In this study, we explore how ChatGPT can be used to help with common software engineering tasks. Many of the ubiquitous tasks covering the breadth of software engineering such as ambiguity resolution in software requirements, method name suggestion, test case prioritization, code review, log summarization can potentially be performed using ChatGPT. In this study, we explore fifteen common software engineering tasks using ChatGPT. We juxtapose and analyze ChatGPT's answers with the respective state of the art outputs (where available) and/or human expert ground truth. Our experiments suggest that for many tasks, ChatGPT does perform credibly and the response from it is detailed and often better than the human expert output or the state of the art output. However, for a few other tasks, ChatGPT in its present form provides incorrect answers and hence is not suited for such tasks.	W	W	1	1	1	0	3	2	3	2	3	3	16
58	2023	Xie, Zhuokui; Chen, Yinghao; Zhi, Chen; Deng, Shuiguang; Yin, Jianwei	chatunitest: a chatgpt-based automated unit test generation tool	http://arxiv.org/abs/2305.04764	Unit testing is a crucial, yet often tedious and time-consuming task. To relieve developers from this burden, automated unit test generation techniques are developed. Existing automated unit test generation tools, such as program-analysis-based tools like EvoSuite and Randoop, lack program comprehension, resulting in unit tests with poor readability and limited assertions. Language-model-based tools, such as AthenaTest and A3Test, have limitations in the generation of correct unit tests. In this paper, we introduce ChatUniTest, a ChatGPT-based automated unit test generation tool developed under the Generation-Validation-Repair framework. ChatUniTest generates tests by parsing the project, extracting essential information, and creating an adaptive focal context that includes the focal method and its dependencies within the pre-defined maximum prompt token limit. The context is incorporated into a prompt and subsequently submitted to ChatGPT. Once ChatGPT's response is received, ChatUniTest proceeds to extract the raw test from the response. It then validates the test and employs rule-based repair to fix syntactic and simple compile errors, followed by ChatGPT-based repair to address challenging errors. Our rigorous evaluation demonstrates that ChatUniTest outperforms EvoSuite in branch and line coverage, surpasses AthenaTest and A3Test in focal method coverage, and effectively generates assertions while utilizing mock objects and reflection to achieve test objectives.	W	W	1	1	-1	0	3	3	3	3	2	3	17
59	2022	Yuan, Wei; Zhang, Quanjun; He, Tieke; Fang, Chunrong; Hung, Nguyen Quoc Viet; Hao, Xiaodong; Yin, Hongzhi	circle: continual repair across programming languages	https://dl.acm.org/doi/10.1145/3533767.3534219	Automatic Program Repair (APR) aims at fixing buggy source code with less manual debugging efforts, which plays a vital role in improving software reliability and development productivity. Recent APR works have achieved remarkable progress via applying deep learning (DL), particularly neural machine translation (NMT) techniques. However, we observe that existing DL-based APR models suffer from at least two severe drawbacks: (1) Most of them can only generate patches for a single programming language, as a result, to repair multiple languages, we have to build and train many repairing models. (2) Most of them are developed offline. Therefore, they won’t function when there are new-coming requirements. To address the above problems, a T5-based APR framework equipped with continual learning ability across multiple programming languages is proposed, namely ContInual Repair aCross Programming LanguagEs (CIRCLE). Specifically, (1) CIRCLE utilizes a prompting function to narrow the gap between natural language processing (NLP) pre-trained tasks and APR. (2) CIRCLE adopts a difficulty-based rehearsal strategy to achieve lifelong learning for APR without access to the full historical data. (3) An elastic regularization method is employed to strengthen CIRCLE’s continual learning ability further, preventing it from catastrophic forgetting. (4) CIRCLE applies a simple but effective re-repairing method to revise generated errors caused by crossing multiple programming languages. We train CIRCLE for four languages (i.e., C, JAVA, JavaScript, and Python) and evaluate it on five commonly used benchmarks. The experimental results demonstrate that CIRCLE not only effectively and efficiently repairs multiple programming languages in continual learning settings, but also achieves state-of-the-art performance (e.g., fixes 64 Defects4J bugs) with a single repair model.	ISSTA	A_SE	1	1	-1	3	3	3	3	3	3	3	21
60	2023	Du, Xueying; Liu, Mingwei; Wang, Kaixin; Wang, Hanlin; Liu, Junwei; Chen, Yixuan; Feng, Jiayi; Sha, Chaofeng; Peng, Xin; Lou, Yiling	ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation		In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation compared to on standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominate superior than other LLMs on class-level code generation, and the second-tier models includes Instruct-Starcoder, Instruct-Codegen, and Wizardcoder with very similar performance. Third, we find that generating the entire class all at once (i.e. holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is better strategies for the other models with limited ability of understanding long instructions and utilizing the middle information. Lastly, we find the limited model ability of generating method-dependent code and discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
61	2022	Wei M,Harzevili NS,Huang Y,Wang J,Wang S	CLEAR: COntrastive LeArning for API REcommendation	https://doi.org/10.1145/3510003.3510159	Automatic API recommendation has been studied for years. There are two orthogonal lines of approaches for this task, i.e., information-retrieval-based (IR-based) and neural-based methods. Although these approaches were reported having remarkable performance, our observation shows that existing approaches can fail due to the following two reasons: 1) most IR-based approaches treat task queries as bag-of-words and use word embedding to represent queries, which cannot capture the sequential semantic information. 2) both the IR-based and the neural-based approaches are weak at distinguishing the semantic difference among lexically similar queries.In this paper, we propose CLEAR, which leverages BERT sentence embedding and contrastive learning to tackle the above two issues. Specifically, CLEAR embeds the whole sentence of queries and Stack Overflow (SO) posts with a BERT-based model rather than the bag-of-word-based word embedding model, which can preserve the semantic-related sequential information. In addition, CLEAR uses contrastive learning to train the BERT-based embedding model for learning precise semantic representation of programming terminologies regardless of their lexical information. CLEAR also builds a BERT-based re-ranking model to optimize its recommendation results. Given a query, CLEAR first selects a set of candidate SO posts via the BERT sentence embedding-based similarity to reduce search space. CLEAR further leverages a BERT-based re-ranking model to rank candidate SO posts and recommends the APIs from the ranked top SO posts for the query.Our experiment results on three different test datasets confirm the effectiveness of CLEAR for both method-level and class-level API recommendation. Compared to the state-of-the-art API recommendation approaches, CLEAR improves the MAP by 25%-187% at method-level and 10%-100% at class-level.	ICSE	A_SE	1	1	-1	3	3	3	3	3	3	3	21
62	2022	Bareiß, Patrick; Souza, Beatriz; d'Amorim, Marcelo; Pradel, Michael	code generation tools (almost) for free? a study of few-shot, pre-trained language models on code	http://arxiv.org/abs/2206.01335	Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par (test oracle generation), or even outperform their respective traditionally built tool (test case generation), while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.	W	W	1	1	-1	0	3	3	3	3	3	3	18
63	2023	Murali, Vijayaraghavan; Maddila, Chandra; Ahmad, Imad; Bolin, Michael; Cheng, Daniel; Ghorbani, Negar; Fernandez, Renuka; Nagappan, Nachiappan	codecompose: a large-scale industrial deployment of ai-assisted code authoring	http://arxiv.org/abs/2305.12050	The rise of large language models (LLMs) has unlocked various applications of this technology in software development. In particular, generative LLMs have been shown to effectively power AI-based code authoring tools that can suggest entire statements or blocks of code during code authoring. In this paper we present CodeCompose, an AI-assisted code authoring tool developed and deployed at Meta internally. CodeCompose is based on the InCoder LLM that merges generative capabilities with bi-directionality. We have scaled up CodeCompose to serve tens of thousands of developers at Meta, across 10+ programming languages and several coding surfaces. We discuss unique challenges in terms of user experience and metrics that arise when deploying such tools in large-scale industrial settings. We present our experience in making design decisions about the model and system architecture for CodeCompose that addresses these challenges. Finally, we present metrics from our large-scale deployment of CodeCompose that shows its impact on Meta's internal code authoring experience over a 15-day time window, where 4.5 million suggestions were made by CodeCompose. Quantitative metrics reveal that (i) CodeCompose has an acceptance rate of 22\\% across several languages, and (ii) 8\\% of the code typed by users of CodeCompose is through accepting code suggestions from CodeCompose. Qualitative feedback indicates an overwhelming 91.5\\% positive reception for CodeCompose. In addition to assisting with code authoring, CodeCompose is also introducing other positive side effects such as encouraging developers to generate more in-code documentation, helping them with the discovery of new APIs, etc.	W	W	1	1	-1	0	3	3	3	3	3	3	18
64	2022	Izadi, Maliheh; Gismondi, Roberta; Gousios, Georgios	codefill: multi-token code completion by jointly learning from structure and naming sequences	https://dl.acm.org/doi/10.1145/3510003.3510172	Code completion is an essential feature of IDEs, yet current auto-completers are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n = 4 tokens). We publicly release our source code and datasets.	ICSE	A_SE	1	1	-1	3	3	3	3	2	3	3	20
65	2023	Zheng, Qinkai; Xia, Xiao; Zou, Xu; Dong, Yuxiao; Wang, Shan; Xue, Yufei; Wang, Zihan; Shen, Lei; Wang, Andi; Li, Yang; Su, Teng; Yang, Zhilin; Tang, Jie	codegeex: a pre-trained model for code generation with multilingual evaluations on humaneval-x	http://arxiv.org/abs/2303.17568	Large pre-trained code generation models, such as OpenAI Codex, can generate syntax- and function-correct code, making the coding of programmers more productive and our pursuit of artificial general intelligence closer. In this paper, we introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation. CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale for both the tasks of code generation and translation on HumanEval-X. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing the solutions in C++, Java, JavaScript, and Go. In addition, we build CodeGeeX-based extensions on Visual Studio Code, JetBrains, and Cloud Studio, generating 4.7 billion tokens for tens of thousands of active users per week. Our user study demonstrates that CodeGeeX can help to increase coding efficiency for 83.4\\% of its users. Finally, CodeGeeX is publicly accessible and in Sep. 2022, we open-sourced its code, model weights (the version of 850B tokens), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.	W	W	1	1	-1	0	3	3	3	3	3	3	18
66	2023	Nijkamp, Erik; Hayashi, Hiroaki; Xiong, Caiming; Savarese, Silvio; Zhou, Yingbo	CodeGen2: Lessons for Training LLMs on Programming and Natural Languages		Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
67	2023	Li, Peng; Sun, Tianxiang; Tang, Qiong; Yan, Hang; Wu, Yuanbin; Huang, Xuanjing; Qiu, Xipeng	codeie: large code generation models are better few-shot information extractors	http://arxiv.org/abs/2305.05711	Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks. A common practice is to recast the task into a text-to-text format such that generative LLMs of natural language (NL-LLMs) like GPT-3 can be prompted to solve it. However, it is nontrivial to perform information extraction (IE) tasks with NL-LLMs since the output of the IE task is usually structured and therefore is hard to be converted into plain text. In this paper, we propose to recast the structured output in the form of code instead of natural language and utilize generative LLMs of code (Code-LLMs) such as Codex to perform IE tasks, in particular, named entity recognition and relation extraction. In contrast to NL-LLMs, we show that Code-LLMs can be well-aligned with these IE tasks by designing code-style prompts and formulating these IE tasks as code generation tasks. Experiment results on seven benchmarks show that our method consistently outperforms fine-tuning moderate-size pre-trained models specially designed for IE tasks (e.g., UIE) and prompting NL-LLMs under few-shot settings. We further conduct a series of in-depth analyses to demonstrate the merits of leveraging Code-LLMs for IE tasks.	ACL	A_AI	1	1	-1	3	3	3	3	2	3	3	20
68	2023	Yu, Hao; Shen, Bo; Ran, Dezhi; Zhang, Jiaxin; Zhang, Qi; Ma, Yuchi; Liang, Guangtai; Li, Ying; Xie, Tao; Wang, Qianxiang	codereval: a benchmark of pragmatic code generation with generative pre-trained models	http://arxiv.org/abs/2302.00288	Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To validate the performance of these models, multiple existing benchmarks (e.g., AiXBench and HumanEval) are proposed, including only cases of generating a standalone function, i.e., a function that invokes or accesses only built-in functions and standard libraries. However, standalone functions constitute only about 30\\\% of functions from real open-source projects. To assess a model's performance for pragmatic code generation (i.e., code generation for real settings of open source or proprietary code), in this paper, we propose a benchmark named CoderEval of pragmatic code generation with generative pre-trained models. Compared with the widely-used HumanEval benchmark from OpenAI, CoderEval can be used to assess the performance of models against pragmatic code generation beyond just generating standalone functions. Through the evaluation of three public available models (CodeGen, PanGu-Coder, and Codex) on CoderEval, we analyze and discuss the current progress and future directions of pragmatic code generation with a generative pre-trained model.	W	W	1	1	-1	0	3	3	3	2	1	3	15
69	2023	Wang, Yue; Le, Hung; Gotmare, Akhilesh Deepak; Bui, Nghi D. Q.; Li, Junnan; Hoi, Steven C. H.	codet5+: open code large language models for code understanding and generation	http://arxiv.org/abs/2305.07922	Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.	W	W	1	1	-1	0	3	3	3	3	1	3	16
70	2023	Bui, Nghi D. Q.; Le, Hung; Wang, Yue; Li, Junnan; Gotmare, Akhilesh Deepak; Hoi, Steven C. H.	codetf: one-stop transformer library for state-of-the-art code llm	http://arxiv.org/abs/2306.00029	Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier for the model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, key modules and components, and compare with other related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.	W	W	1	1	-1	0	3	2	2	3	2	3	15
71	2022	Zhang, Jiyang; Panthaplackel, Sheena; Nie, Pengyu; Li, Junyi Jessy; Gligoric, Milos	coditt5: pretraining for source code and natural language editing	https://dl.acm.org/doi/10.1145/3551349.3556955	Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming standard generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a standard generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.	ASE	A_SE	1	1	-1	2	3	3	3	3	3	3	20
72	2023	Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, Maosong Sun	Communicative Agents for Software Development		Software engineering is a domain characterized by intricate decision-making processes, often relying on nuanced intuition and consultation. Recent advancements in deep learning have started to revolutionize software engineering practices through elaborate designs implemented at various stages of software development. In this paper, we present an innovative paradigm that leverages large language models (LLMs) throughout the entire software development process, streamlining and unifying key processes through natural language communication, thereby eliminating the need for specialized models at each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered software development company that mirrors the established waterfall model, meticulously dividing the development process into four distinct chronological stages: designing, coding, testing, and documenting. Each stage engages a team of agents, such as programmers, code reviewers, and test engineers, fostering collaborative dialogue and facilitating a seamless workflow. The chat chain acts as a facilitator, breaking down each stage into atomic subtasks. This enables dual roles, allowing for proposing and validating solutions through context-aware communication, leading to efficient resolution of specific subtasks. The instrumental analysis of ChatDev highlights its remarkable efficacy in software generation, enabling the completion of the entire software development process in under seven minutes at a cost of less than one dollar. It not only identifies and alleviates potential vulnerabilities but also rectifies potential hallucinations while maintaining commendable efficiency and cost-effectiveness. The potential of ChatDev unveils fresh possibilities for integrating LLMs into the realm of software development.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
73	2023	Nascimento, Nathalia; Alencar, Paulo; Cowan, Donald	comparing software developers with chatgpt: an empirical investigation	http://arxiv.org/abs/2305.11837	The advent of automation in particular Software Engineering (SE) tasks has transitioned from theory to reality. Numerous scholarly articles have documented the successful application of Artificial Intelligence to address issues in areas such as project management, modeling, testing, and development. A recent innovation is the introduction of ChatGPT, an ML-infused chatbot, touted as a resource proficient in generating programming codes and formulating software testing strategies for developers and testers respectively. Although there is speculation that AI-based computation can increase productivity and even substitute software engineers in software development, there is currently a lack of empirical evidence to verify this. Moreover, despite the primary focus on enhancing the accuracy of AI systems, non-functional requirements including energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions, considering various evaluation criteria, is pivotal in fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding task suitability for humans or AI. Furthermore, it facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. This paper conducts an empirical investigation, contrasting the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics. The empirical study includes a case of assessing ChatGPT-generated code versus code produced by developers and uploaded in Leetcode.	W	W	1	1	-1	0	3	3	3	3	3	3	18
74	2023	Gao, Shuzheng; Wen, Xin-Cheng; Gao, Cuiyun; Wang, Wenxuan; Lyu, Michael R.	constructing effective in-context demonstration for code intelligence tasks: an empirical study	http://arxiv.org/abs/2304.07575	Pre-trained models of code have gained widespread popularity in many code intelligence tasks. Recently, with the scaling of the model and corpus size, large language models have shown the ability of in-context learning. These models employ task instructions and a few demonstration examples as prompts to learn the semantics of the task and make predictions for test samples. This new learning paradigm is training-free and has shown impressive performance in various natural language processing and code intelligence tasks. However, the performance of in-context learning heavily relies on the quality of demonstration, and there has been no systematic investigation into how to construct a good demonstration for code-related tasks with in-context learning. In this paper, by analyzing the design space of in-context demonstration, we empirically explore the impact of three key factors on the performance of in-context learning in code intelligence tasks: the selection of demonstration examples, the order of demonstration examples, and the number of demonstration examples. We conduct extensive experiments on three code intelligence tasks including bug fixing, code summarization, and program synthesis. Our experimental results demonstrate that all the above three factors dramatically impact the performance of in-context learning in code intelligence tasks. Additionally, we summarize our findings and provide takeaway suggestions on how to construct effective demonstrations, taking into account these three perspectives. We show that a well-constructed demonstration can lead to significant improvements over simple demonstrations and previous fine-tuned state-of-the-art models, e.g., improving EM, BLEU-4, and EM by at least 11.91\\%, 36.88\\%, and 37.18\\% on code summarization, bug fixing and program synthesis, respectively.	W	W	1	1	-1	0	3	3	3	3	3	3	18
75	2023	Xia, Chunqiu Steven; Zhang, Lingming	conversational automated program repair	http://arxiv.org/abs/2301.13246	Automated Program Repair (APR) can help developers automatically generate patches for bugs. Due to the impressive performance obtained using Large Pre-Trained Language Models (LLMs) on many code related tasks, researchers have started to directly use LLMs for APR. However, prior approaches simply repeatedly sample the LLM given the same constructed input/prompt created from the original buggy code, which not only leads to generating the same incorrect patches repeatedly but also miss the critical information in testcases. To address these limitations, we propose conversational APR, a new paradigm for program repair that alternates between patch generation and validation in a conversational manner. In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. As such, we leverage the long-term context window of LLMs to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test. We evaluate 10 different LLM including the newly developed ChatGPT model to demonstrate the improvement of conversational APR over the prior LLM for APR approach.	W	W	1	1	-1	0	3	3	2	2	3	2	15
76	2023	Tan, Chee Wei; Guo, Shangxin; Wong, Man Fai; Hang, Ching Nam	Copilot for Xcode: Exploring AI-Assisted Programming by Prompting Cloud-based Large Language Models		This paper presents an AI-assisted programming tool called Copilot for Xcode for program composition and design to support human software developers. By seamlessly integrating cloud-based Large Language Models (LLM) with Apple’s local development environment, Xcode, this tool enhances productivity and unleashes creativity for software development in Apple software ecosystem (e.g., iOS apps, macOS). Leveraging advanced natural language processing (NLP) techniques, Copilot for Xcode effectively processes source code tokens and patterns within code repositories, enabling features such as code generation, autocompletion, documentation, and error detection. Software developers can also query and make “small” decisions for program composition, some of which can be made simultaneously, and this is facilitated through prompt engineering in a chat interface of Copilot for Xcode. Finally, we present simple case studies as evidence of the effectiveness of utilizing NLP in Xcode to prompt popular LLM services like OpenAI ChatGPT for program composition and design.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
77	2022	Shi, Zejian; Xiong, Yun; Zhang, Xiaolong; Zhang, Yao; Li, Shanshan; Zhu, Yangyong	cross-modal contrastive learning for code search	https://ieeexplore.ieee.org/document/9978195/	Code search aims to retrieve code snippets from natural language queries, which serves as a core technology to improve development efficiency. Previous approaches have achieved promising results to learn code and query representations by using BERT-based pre-trained models which, however, leads to semantic collapse problems, i.e. native representations of code and query clustering in a high similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, to improve the representations of code and query by explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities. To maintain semantic consistency of code snippets with different names of functions and variables, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add comparisons between code and augmented code within modalities. Moreover, in order to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method can significantly improve the effectiveness of pre-trained models for code search.	ICSME	B_SE	1	1	-1	2	3	3	3	1	2	3	17
78	2023	Tang, Wei; Tang, Mingwei; Ban, Minchao; Zhao, Ziguo; Feng, Mingjun	csgvd: a deep learning approach combining sequence and graph embedding for source code vulnerability detection	https://linkinghub.elsevier.com/retrieve/pii/S0164121223000183	In order to secure software, it is critical to detect potential vulnerabilities. The performance of traditional static vulnerability detection methods is limited by predefined rules, which rely heavily on the expertise of developers. Existing deep learning-based vulnerability detection models usually use only a single sequence or graph embedding approach to extract vulnerability features. Sequence embedding-based models ignore the structured information inherent in the code, and graph embedding-based models lack effective node and graph embedding methods. As a result, we propose a novel deep learning-based approach, CSGVD (Combining Sequence and Graph embedding for Vulnerability Detection), which considers function-level vulnerability detection as a graph binary classification task. Firstly, we propose a PE-BL module, which inherits and enhances the knowledge from the pre-trained language model. It extracts the code’s local semantic features as node embedding in the control flow graph by using sequence embedding. Secondly, CSGVD uses graph neural networks to extract the structured information of the graph. Finally, we propose a mean biaffine attention pooling, M-BFA, to better aggregate node information as a graph’s feature representation. The experimental results show that CSGVD outperforms the existing state-of-the-art models and obtains 64.46% accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection.	JSS	B_SE	1	1	-1	2	3	3	2	2	3	3	18
79	2023	Kolthoff, Kristian; Bartelt, Christian; Ponzetto, Simone Paolo	data-driven prototyping via natural-language-based gui retrieval	https://link.springer.com/10.1007/s10515-023-00377-x	Rapid GUI prototyping has evolved into a widely applied technique in early stages of software development to facilitate the clarification and refinement of requirements. Especially high-fidelity GUI prototyping has shown to enable productive discussions with customers and mitigate potential misunderstandings, however, the benefits of applying high-fidelity GUI prototypes are accompanied by the disadvantage of being expensive and time-consuming in development and requiring experience to create. In this work, we show RaWi, a data-driven GUI prototyping approach that effectively retrieves GUIs for reuse from a large-scale semi-automatically created GUI repository for mobile apps on the basis of Natural Language (NL) searches to facilitate GUI prototyping and improve its productivity by leveraging the vast GUI prototyping knowledge embodied in the repository. Retrieved GUIs can directly be reused and adapted in the graphical editor of RaWi. Moreover, we present a comprehensive evaluation methodology to enable (i) the systematic evaluation of NL-based GUI ranking methods through a novel high-quality gold standard and conduct an in-depth evaluation of traditional IR and state-of-the-art BERT-based models for GUI ranking, and (ii) the assessment of GUI prototyping productivity accompanied by an extensive user study in a practical GUI prototyping environment.	ASE_J	B_SE	0	0	-1	2	3	3	3	3	3	3	20
80	2023	Olausson, Theo X.; Inala, Jeevana Priya; Wang, Chenglong; Gao, Jianfeng; Solar-Lezama, Armando	Demystifying GPT Self-Repair for Code Generation		Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair—in which the model debugs and fixes mistakes in its own code—has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
81	2023	Koide, Takashi; Fukushi, Naoki; Nakano, Hiroki; Chiba, Daiki	Detecting Phishing Sites Using ChatGPT		The rise of large language models (LLMs) has had a significant impact on various domains, including natural language processing and artificial intelligence. While LLMs such as ChatGPT have been extensively researched for tasks such as code generation and text synthesis, their application in detecting malicious web content, particularly phishing sites, has been largely unexplored. To combat the rising tide of automated cyber attacks facilitated by LLMs, it is imperative to automate the detection of malicious web content, which requires approaches that leverage the power of LLMs to analyze and classify phishing sites.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
82	2023	Li Ke and Hong Sheng and Fu Cai and Zhang Yunhe and Liu Ming	Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis		The ubiquitous adoption of Large Language Generation Models (LLMs) in programming has underscored the importance of differentiating between human-written code and code generated by intelligent models. This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans. Our investigation reveals disparities in programming style, technical level, and readability between these two sources. Consequently, we develop a discriminative feature set for differentiation and evaluate its efficacy through ablation experiments. Additionally, we devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets and to secure high-caliber, uncontaminated datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code. The salient contributions of our research include: proposing a discriminative feature set yielding high accuracy in differentiating ChatGPT-generated code from human-authored code in binary classification tasks; devising methods for generating extensive ChatGPT-generated codes; and introducing a dataset cleansing strategy that extracts immaculate, high-grade code datasets from open-source repositories, thus achieving exceptional accuracy in code authorship attribution tasks.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
83	2023	Chen, Yizheng; Ding, Zhoujie; Chen, Xinyun; Wagner, David	diversevul: a new vulnerable source code dataset for deep learning based vulnerability detection	http://arxiv.org/abs/2304.00409	We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.	W	W	1	1	-1	0	3	3	2	2	3	3	16
84	2022	Li, Yao; Zhang, Tao; Luo, Xiapu; Cai, Haipeng; Fang, Sen; Yuan, Dawei	do pre-trained language models indeed understand software engineering tasks?	http://arxiv.org/abs/2211.10623	Artificial intelligence (AI) for software engineering (SE) tasks has recently achieved promising performance. In this paper, we investigate to what extent the pre-trained language model truly understands those SE tasks such as code search, code summarization, etc. We conduct a comprehensive empirical study on a board set of AI for SE (AI4SE) tasks by feeding them with variant inputs: 1) with various masking rates and 2) with sufficient input subset method. Then, the trained models are evaluated on different SE tasks, including code search, code summarization, and duplicate bug report detection. Our experimental results show that pre-trained language models are insensitive to the given input, thus they achieve similar performance in these three SE tasks. We refer to this phenomenon as overinterpretation, where a model confidently makes a decision without salient features, or where a model finds some irrelevant relationships between the final decision and the dataset. Our study investigates two approaches to mitigate the overinterpretation phenomenon: whole word mask strategy and ensembling. To the best of our knowledge, we are the first to reveal this overinterpretation phenomenon to the AI4SE community, which is an important reminder for researchers to design the input for the models and calls for necessary future work in understanding and implementing AI4SE tasks.	W	W	1	1	-1	0	3	3	3	3	3	3	18
85	2022	Lai, Yuhang; Li, Chengxi; Wang, Yiming; Zhang, Tianyi; Zhong, Ruiqi; Zettlemoyer, Luke; Yih, Scott Wen-tau; Fried, Daniel; Wang, Sida; Yu, Tao	ds-1000: a natural and reliable benchmark for data science code generation	http://arxiv.org/abs/2211.11501	We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accept, only 1.8\\% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3\\% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.	W	W	0	1	-1	0	3	3	3	2	1	3	15
86	2021	Isotani, Haruna; Washizaki, Hironori; Fukazawa, Yoshiaki; Nomoto, Tsutomu; Ouji, Saori; Saito, Shinobu	duplicate bug report detection by using sentence embedding and fine-tuning	https://ieeexplore.ieee.org/document/9609125/	Industrial software maintenance devotes much time and effort to find duplicate bug reports. In this paper, we propose an automated duplicate bug report detection system to improve software maintenance efficiency. Our system detects duplicate reports by vectorizing the contents of each report item by deep-learning-based sentence embedding and calculating the similarity of the whole report from those of the item vectors. The Sentence-BERT fine-tuned with report texts is used for sentence embedding. Finally, we verify that the combination of processing separately by item and Sentence-BERT fine-tuned with reports effectively detects duplicate bug reports in industrial experiments that compare the performance of existing methods.	ICSME	B_SE	1	1	-1	2	3	3	2	2	3	3	18
87	2023	Khanfir, Ahmed; Degiovanni, Renzo; Papadakis, Mike; Traon, Yves Le	efficient mutation testing via pre-trained language models	http://arxiv.org/abs/2301.03543	Mutation testing is an established fault-based testing technique. It operates by seeding faults into the programs under test and asking developers to write tests that reveal these faults. These tests have the potential to reveal a large number of faults -- those that couple with the seeded ones -- and thus are deemed important. To this end, mutation testing should seed faults that are both "natural" in a sense easily understood by developers and strong (have high chances to reveal faults). To achieve this we propose using pre-trained generative language models (i.e. CodeBERT) that have the ability to produce developer-like code that operates similarly, but not exactly, as the target code. This means that the models have the ability to seed natural faults, thereby offering opportunities to perform mutation testing. We realise this idea by implementing $μ$BERT, a mutation testing technique that performs mutation testing using CodeBert and empirically evaluated it using 689 faulty program versions. Our results show that the fault revelation ability of $μ$BERT is higher than that of a state-of-the-art mutation testing (PiTest), yielding tests that have up to 17\\% higher fault detection potential than that of PiTest. Moreover, we observe that $μ$BERT can complement PiTest, being able to detect 47 bugs missed by PiTest, while at the same time, PiTest can find 13 bugs missed by $μ$BERT.	W	W	1	1	-1	0	3	3	3	3	3	3	18
88	2023	Li, Jia; Li, Ge; Li, Yongmin; Jin, Zhi	enabling programming thinking in large language models toward code generation	http://arxiv.org/abs/2305.06599	Large Language Models (LLMs) (e.g., ChatGPT) have shown impressive performance in code generation. A large-scale study released that writing programs requires programming thinking, i.e., analyzing and implementing requirements in programming logic (e.g., sequence, branch, loop). Existing studies use LLMs to generate programs from requirements directly and do not explicitly introduce the programming thinking. This paper explores how to unlock the programming thinking of LLMs in code generation and proposes an approach named TiP. Our idea is to decompose code generation into two steps and progressively lead LLMs to analyze\\&implement requirements in programming logic. Specifically, TiP first generates a code sketch, which provides a high-level solving process using programming logic but omits implementation details (e.g., APIs). Then, TiP implements the sketch into a program using specific programming languages. We conduct extensive experiments on three public benchmarks (i.e., HumanEval, MBPP, and MBCPP). (1) TiP outperforms the state-of-the-art baseline - ChatGPT by up to 17.5\\% in Pass@1, 11.02\\% in Pass@3, and 9.84\\% in Pass@5. (2) Human evaluation shows that TiP outperforms ChatGPT in three aspects (i.e., correctness, code quality, and maintainability). (3) TiP is effective for different LLMs. (4) We explore multiple choices (e.g., chain-of-thought) for the code sketch and validate the superiority of our design. (5) We discuss the complementarity between TiP and post-processing approaches (e.g., CodeT).	W	W	1	1	-1	0	3	3	3	3	3	3	18
89	2023	Paul, Rishov; Hossain, Md Mohib; Siddiq, Mohammed Latif; Hasan, Masum; Iqbal, Anindya; Santos, Joanna C. S.	Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering		Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset. Some recent studies also demonstrated strong empirical evidence that code review could improve the program repair further. Large language models, trained with Natural Language (NL) and Programming Language (PL), can contain inherent knowledge of both. In this study, we investigate if this inherent knowledge of PL and NL can be utilized to improve automated program repair. We applied PLBART and CodeT5, two state-of-the-art language models that are pretrained with both PL and NL, on two such natural languagebased program repair datasets and found that the pre-trained language models fine-tuned with datasets containing both code review and subsequent code changes notably outperformed each of the previous models. With the advent of code generative models like Codex and GPT-3.5-Turbo, we also performed zero-shot and few-shots learning-based prompt engineering to assess their performance on these datasets. However, the practical application of using LLMs in the context of automated program repair is still a long way off based on our manual analysis of the generated repaired codes by the learning models.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
90	2022	Zhu, Jianfei; Xiao, Guanping; Zheng, Zheng; Sui, Yulei	Enhancing Traceability Link Recovery with Unlabeled Data		Traceability link recovery (TLR) is an important software engineering task for developing trustworthy and reliable software systems. Recently proposed deep learning (DL) models have shown their effectiveness compared to traditional information retrieval-based methods. DL often heavily relies on sufficient labeled data to train the model. However, manually labeling traceability links is time-consuming, labor-intensive, and requires specific knowledge from domain experts. As a result, typically only a small portion of labeled data is accompanied by a large amount of unlabeled data in real-world projects. Our hypothesis is that artifacts are semantically similar if they have the same linked artifact(s). This paper presents TRACEFUN, a new approach to enhance traceability link recovery with unlabeled data. TRACEFUN first measures the similarities between unlabeled and labeled artifacts using two similarity prediction methods (i.e., vector space model and contrastive learning). Then, based on the similarities, newly labeled links are generated between the unlabeled artifacts and the linked objects of the labeled artifacts. Generated links are further used for TLR model training. We have evaluated TRACEFUN on three GitHub projects with two state-of-the-art DL models (i.e., Trace BERT and TraceNN). The results show that TRACEFUN is effective in terms of a maximum improvement of F1-score up to 21% and 1,088%, respectively for Trace BERT and TraceNN.	ISSRE	B_SE	1	1	-1	2	3	3	3	3	3	3	20
91	2023	Wang, Jian; Liu, Shangqing; Xie, Xiaofei; Li, Yi	evaluating aigc detectors on code content	http://arxiv.org/abs/2304.05193	Artificial Intelligence Generated Content (AIGC) has garnered considerable attention for its impressive performance, with ChatGPT emerging as a leading AIGC model that produces high-quality responses across various applications, including software development and maintenance. Despite its potential, the misuse of ChatGPT poses significant concerns, especially in education and safetycritical domains. Numerous AIGC detectors have been developed and evaluated on natural language data. However, their performance on code-related content generated by ChatGPT remains unexplored. To fill this gap, in this paper, we present the first empirical study on evaluating existing AIGC detectors in the software domain. We created a comprehensive dataset including 492.5K samples comprising code-related content produced by ChatGPT, encompassing popular software activities like Q\\&A (115K), code summarization (126K), and code generation (226.5K). We evaluated six AIGC detectors, including three commercial and three open-source solutions, assessing their performance on this dataset. Additionally, we conducted a human study to understand human detection capabilities and compare them with the existing AIGC detectors. Our results indicate that AIGC detectors demonstrate lower performance on code-related data compared to natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain; but generalization remains a challenge. The human evaluation reveals that detection by humans is quite challenging.	W	W	1	1	-1	0	3	3	3	3	3	3	18
92	2023	Ochs, Marcel; Narasimhan, Krishna; Mezini, Mira	evaluating and improving transformers pre-trained on asts for code completion	https://ieeexplore.ieee.org/document/10123540/	Automatic code completion is one of the most popular developer assistance features which is usually performed using static program analysis. But, studies have shown that static analysis can be riddled with false positives. Another issue with static-analysis based code completion is that the recommendation does not take into account the history/ context in which the software is operating and rely on type/ alphabetical ordering of suggestions. A recent development that has shown to be promising in this direction is the use of language models such as transformers that are trained on real-world code from Github to provide context sensitive, accurate code completion suggestions. Studies on transformer-based code completion have shown that such restrictions can be leveraged; i.e; training transformers on structural representations of code (specifically ASTs) could have a positive impact on the accuracy of code completion. To this end, the work by Kim et al. implemented and evaluated TravTrans which is based on the powerful text prediction language model GPT-2 by training on abstract syntax trees instead of treating code as plain texts. Using alternative source code representation such as AST provides the already potent language model with an additional layer of program semantic-awareness. But, TravTrans has adapted several rigid choices regarding various components of the transformer architecture like embedding sizes, sliding windows etc. TravTrans also suffers from issues related to running out of vocabulary. In this paper, we reproduce the TravTrans model and perform a deeper, fine-grained analysis of the impact of various architectural and code-level settings on the prediction. As a result of our fine-grained analysis, we also identify several aspects that need improvements like the fact that the model performs particularly poorly with code involving dictionaries and lists. We also offer solutions to a few of the issues like the out-of-vocabulary issue. Finally, our results motivates the need for a customizable-transformer architecture for coding tasks.	SANER	B_SE	1	1	-1	2	3	3	3	1	2	3	17
93	2023	Adish Singla	Evaluating ChatGPT and GPT-4 for Visual Programming		Generative AI and large language models have the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. Recent works have studied the capabilities of these models for different programming education scenarios; however, these works considered only text-based programming, in particular, Python programming. Consequently, they leave open the question of how well these models would perform in visual programming domains popularly used for K-8 programming education. The main research question we study is: Do state-of-the-art generative models show advanced capabilities in visual programming on par with their capabilities in text-based Python programming? In our work, we evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios and assess performance using expert-based annotations. In particular, we base our evaluation using reference tasks from the domains of Hour of Code: Maze Challenge by Code-dot-org and Karel. Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming. These results also provide exciting directions for future work on developing techniques to improve the performance of generative models in visual programming.	L_W	L_W	1	1	-1	0	3	3	3	3	3	3	18
94	2021	Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas; Brockman, Greg; Ray, Alex; Puri, Raul; Krueger, Gretchen; Petrov, Michael; Khlaaf, Heidy; Sastry, Girish; Mishkin, Pamela; Chan, Brooke; Gray, Scott; Ryder, Nick; Pavlov, Mikhail; Power, Alethea; Kaiser, Lukasz; Bavarian, Mohammad; Winter, Clemens; Tillet, Philippe; Such, Felipe Petroski; Cummings, Dave; Plappert, Matthias; Chantzis, Fotios; Barnes, Elizabeth; Herbert-Voss, Ariel; Guss, William Hebgen; Nichol, Alex; Paino, Alex; Tezak, Nikolas; Tang, Jie; Babuschkin, Igor; Balaji, Suchir; Jain, Shantanu; Saunders, William; Hesse, Christopher; Carr, Andrew N.; Leike, Jan; Achiam, Josh; Misra, Vedant; Morikawa, Evan; Radford, Alec; Knight, Matthew; Brundage, Miles; Murati, Mira; Mayer, Katie; Welinder, Peter; McGrew, Bob; Amodei, Dario; McCandlish, Sam; Sutskever, Ilya; Zaremba, Wojciech	evaluating large language models trained on code	http://arxiv.org/abs/2107.03374	We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8\\% of the problems, while GPT-3 solves 0\\% and GPT-J solves 11.4\\%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2\\% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.	W	W	1	1	-1	0	3	3	3	3	3	3	18
95	2020	Tian, Haoye; Liu, Kui; Kaboré, Abdoul Kader; Koyuncu, Anil; Li, Li; Klein, Jacques; Bissyandé, Tegawendé F.	evaluating representation learning of code changes for predicting patch correctness in program repair	https://dl.acm.org/doi/10.1145/3324884.3416532	A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or that rely on manually-crafted heuristics, we study the benefit of learning code representations in order to learn deep features that may encode the properties of patch correctness. Our empirical work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in the prediction of patch correctness on a deduplicated dataset of 1000 labeled patches. Our investigations show that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.	ASE	A_SE	1	0	-1	2	3	3	3	3	3	3	20
96	2023	Yetiştiren, Burak; Özsoy, Işık; Ayerdem, Miray; Tüzün, Eray	Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT	http://arxiv.org/abs/2304.10778	Context: AI-assisted code generation tools have become increasingly prevalent in software engineering, offering the ability to generate code from natural language prompts or partial code inputs. Notable examples of these tools include GitHub Copilot, Amazon CodeWhisperer, and OpenAI's ChatGPT. Objective: This study aims to compare the performance of these prominent code generation tools in terms of code quality metrics, such as Code Validity, Code Correctness, Code Security, Code Reliability, and Code Maintainability, to identify their strengths and shortcomings. Method: We assess the code generation capabilities of GitHub Copilot, Amazon CodeWhisperer, and ChatGPT using the benchmark HumanEval Dataset. The generated code is then evaluated based on the proposed code quality metrics. Results: Our analysis reveals that the latest versions of ChatGPT, GitHub Copilot, and Amazon CodeWhisperer generate correct code 65.2\\%, 46.3\\%, and 31.1\\% of the time, respectively. In comparison, the newer versions of GitHub CoPilot and Amazon CodeWhisperer showed improvement rates of 18\\% for GitHub Copilot and 7\\% for Amazon CodeWhisperer. The average technical debt, considering code smells, was found to be 8.9 minutes for ChatGPT, 9.1 minutes for GitHub Copilot, and 5.6 minutes for Amazon CodeWhisperer. Conclusions: This study highlights the strengths and weaknesses of some of the most popular code generation tools, providing valuable insights for practitioners. By comparing these generators, our results may assist practitioners in selecting the optimal tool for specific tasks, enhancing their decision-making process.	W	W	1	1	-1	0	3	3	3	3	3	3	18
97	2022	Alhamed, Mohammed; Storer, Tim	evaluation of context-aware language models and experts for effort estimation of software maintenance issues	https://ieeexplore.ieee.org/document/9978209/	Reflecting upon recent advances in Natural Language Processing (NLP), this paper evaluates the effectiveness of context-aware NLP models for predicting software task effort estimates. Term Frequency–Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers (BERT) were used as feature extraction methods; Random forest and BERT feed-forward linear neural networks were used as classifiers. Using three datasets drawn from open-source projects and one from a commercial project, the paper evaluates the models and compares the best performing model with expert estimates from both kinds of datasets. The results suggest that BERT as feature extraction and classifier shows slightly better performance than other combinations, but that there is no significant difference between the presented methods. On the other hand, the results show that expert and Machine Learning (ML) estimate performances are similar, with the experts’ performance being slightly better. Both findings confirmed existing literature, but using substantially different experimental settings.	ICSME	B_SE	1	0	-1	2	3	2	2	3	2	3	17
98	2022	Pearce, Hammond; Tan, Benjamin; Ahmad, Baleegh; Karri, Ramesh; Dolan-Gavitt, Brendan	examining zero-shot vulnerability repair with large language models	http://arxiv.org/abs/2112.02125	Human developers can produce code with cybersecurity bugs. Can emerging 'smart' code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI's Codex and AI21's Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information - both semantically and syntactically - with natural languages. We perform a large scale study of five commercially available, black-box, "off-the-shelf" LLMs, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. Our experiments demonstrate that while the approach has promise (the LLMs could collectively repair 100\\% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the model's performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code.	S&P	A_SECURITY	1	1	-1	3	3	3	3	3	3	3	21
99	2023	Kang, Sungmin; Chen, Bei; Yoo, Shin; Lou, Jian-Guang	explainable automated debugging via large language model-driven scientific debugging	http://arxiv.org/abs/2304.02195	Automated debugging techniques have the potential to reduce developer effort in debugging, and have matured enough to be adopted by industry. However, one critical issue with existing techniques is that, while developers want rationales for the provided automatic debugging results, existing techniques are ill-suited to provide them, as their deduction process differs significantly from that of human developers. Inspired by the way developers interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reach conclusions prior to patch generation. By aligning the reasoning of automated debugging more closely with that of human developers, we aim to produce intelligible explanations of how a specific patch has been generated, with the hope that the explanation will lead to more efficient and accurate developer decisions. Our empirical analysis on three program repair benchmarks shows that AutoSD performs competitively with other program repair baselines, and that it can indicate when it is confident in its results. Furthermore, we perform a human study with 20 participants, including six professional developers, to evaluate the utility of explanations from AutoSD. Participants with access to explanations could judge patch correctness in roughly the same time as those without, but their accuracy improved for five out of six real-world bugs studied: 70\\% of participants answered that they wanted explanations when using repair tools, while 55\\% answered that they were satisfied with the Scientific Debugging presentation.	W	W	1	1	-1	0	3	3	3	3	3	3	18
100	2023	Arakelyan, Shushan; Das, Rocktim Jyoti; Mao, Yi; Ren, Xiang	exploring distributional shifts in large language models for code analysis	http://arxiv.org/abs/2303.09128	We systematically study the capacity of two large language models for code - CodeT5 and Codex - to generalize to out-of-domain data. In this study, we consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. This makes recognition of in-domain vs out-of-domain data at the time of deployment trivial. We establish that samples from each new domain present both models with a significant challenge of distribution shift. We study how well different established methods can adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. In fact, according to our experiments, this solution can outperform direct finetuning for very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that in the case of code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to each domain individually.	W	W	1	1	-1	0	3	3	3	3	3	3	18