Search terms: manual, review, annotate, author, inspect
human, user study, participants, recruitment, subjects, pilot, survey, feedback
developer, deployment, github, sourceforge, pull request, industry, bug report, open source, merge
quality, tool, system, framework
case study, evaluate, correct, validate
Columns: Paper | URL | Year | Including in paper analysis | Dissertation | Meta / position paper / not app / proposal / benchmarks | Has Evaluation of Repairs, Tool or System | Has Case Study of tool | Manual Review of tool-produced patches (paper authors) | Manual Review has methodological details | Number of Annotators | Question asked to annotators | Human Study (non-authors, needs IRB) | Number of participants | Participant population | Study / Task description | Question Asked | Involves developers (industry or github) | Evaluates patch quality beyond creation
R. Corchuelo, Repairing syntax errors in LR parsers, ACM Transactions on Programming Languages and Systems, vol.24, pp.698-710, 2002.
C. Nentwich, W. Emmerich, and A. Finkelstein, Consistency Management with Repair Actions, Proceedings of the 25th International Conference on Software Engineering, pp.455-464, 2003.
S. Sidiroglou and A. D. Keromytis, Countering Network Worms Through Automatic Patch Generation, In: Security & Privacy, vol.3, pp.41-49, 2005.
B. Jobstmann, A. Griesmayer, and R. Bloem, Program Repair As a Game, Computer Aided Verification, pp.226-238, 2005.
L. A. Dennis, R. Monroy, and P. Nogueira, Proof-directed Debugging and Repair, Seventh Symposium on Trends in Functional Programming, pp.131-140, 2006.
W. Weimer, Patches as better bug reports, Proceedings of the International Conference on Generative Programming and Component Engineering, 2006.
N/A - 1 paper author
"In this study manual inspection found that the bug reports addressed using explanatory patches did not introduce new bugs with respect to any other safety policy we were aware of."
A. Kalyanpur, Repairing Unsatisfiable Concepts in OWL Ontologies, The Semantic Web: Research and Applications, vol.4011, pp.170-184, 2006.
At least 1 yr OWL experience
We selected two OWL Ontologies – University.owl and miniTambis.owl and asked each subject to fix all the unsatisfiable classes in a particular ontology using the debugging techniques seen in [6] (case 1), and in the other ontology using the repair techniques described in this paper (case 2). The subjects were randomly assigned to the two cases, ...
fix all unsatisfiable classes
A. Griesmayer, R. Bloem, and B. Cook, Repair of Boolean Programs with An Application to C, Computer Aided Verification, pp.358-371, 2006.
S. Thomas and L. Williams, Using Automated Fix Generation to Secure SQL Statements, Proceedings of the Third International Workshop on Software Engineering for Secure Systems, p.9, 2007.
Z. Lin, AutoPaG: Towards Automated Software Patch Generation with Source Code Root Cause Identification and Repair, Proceedings of the 2nd ACM Symposium on Information, pp.329-340, 2007.
N/A - 5 paper authors
"We manually examine the source code in the benchmark and the results confirms with the automated output from AutoPaG"
A. Arcuri, On the Automation of Fixing Software Bugs, Companion of the 30th International Conference on Software Engineering, pp.1003-1006, 2008.
A. M. Memon, Automatically Repairing Event Sequence-based GUI Test Suites for Regression Testing, ACM Transactions on Software Engineering and Methodology, vol.18, p.4, 2008.
A. Arcuri and X. Yao, A Novel Co-evolutionary Approach to Automatic Software Bug Fixing, Proceedings of the IEEE Congress on Evolutionary Computation, pp.162-168, 2008.
F. Wang and C. Cheng, Program Repair Suggestions From Graphical State-Transition Specifications, Proceedings of FORTE 2008, 2008.
D. Jeffrey, BugFix: a Learning-based Tool to Assist Developers in Fixing Bugs, pp.70-79, 2009.
V. Dallmeier, A. Zeller, and B. Meyer, Generating Fixes From Object Behavior Anomalies, Proceedings of the International Conference on Automated Software Engineering, 2009.
B. Daniel, ReAssert: Suggesting Repairs for Broken Unit Tests, Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering, pp.433-444, 2009.
13 graduate, 3 undergraduate, 2 industry professionals
Half had tool, half did not - has both quantitative + qualitative results.
Task 1: Write some unit tests of their own to test previously untested functionality.
Task 2: Implement a requirement change which could potentially cause some of their tests to fail.
Task 3: Repair all failing tests.
Task 4: Implement another requirement change which would cause some of the initially provided tests to fail.
Task 5: Repair all failing tests
Non-tool half given questionnaire
1. Useful for the study
2. Use for own projects
3. Include in Eclipse
4. Recommend to others
S. Forrest, A Genetic Programming Approach to Automated Software Repair, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pp.947-954, 2009.
W. Weimer, Automatically Finding Patches Using Genetic Programming, Proceedings of the International Conference on Software Engineering, 2009.
Y. Qi, X. Mao, and Y. Lei, Program Repair As Sound Optimization of Broken Programs, International Symposium on Theoretical Aspects of Software Engineering, 2009.
Y. Xiong, Supporting Automatic Model Inconsistency Fixing, Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009.
N/A - 6 paper authors
"Beanbag only ensures the correctness of the output updates, but does not ensure the existence of an output. It is up to the programmers to ensure the primitive constraints and functions are composed correctly so that the fixing function will not return ⊥ for a proper input. " - Not quality, not correctness
C. Kern and J. Esparza, Automatic Error Correction of Java Programs, Formal Methods for Industrial Critical Systems, pp.67-81, 2010.
E. Fast, Designing Better Fitness Functions for Automated Program Repair, Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp.965-972, 2010.
E. Schulte, S. Forrest, and W. Weimer, Automated Program Repair Through the Evolution of Assembly Code, Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pp.313-316, 2010.
M. Silva, Towards Automated Inconsistency Handling in Design Models, Proceedings of the 22nd International Conference on Advanced Information Systems Engineering, pp.348-362, 2010.
T. V. Nguyen, Automating Program Verification and Repair Using Invariant Analysis and Test Input Generation, 2010.
V. Debroy and W. E. Wong, Using Mutation to Automatically Suggest Fixes for Faulty Programs, Proceedings of the International Conference on Software Testing, Verification and Validation, pp.65-74, 2010.
W. Weimer, Automatic Program Repair with Evolutionary Computation, Communications of the ACM, vol.53, p.109, 2010.
Y. Wei, Automated Fixing of Programs with Contracts, Proceedings of the International Symposium on Software Testing and Analysis, 2010.
N/A - 7 paper authors
"we manually inspected the top five valid fixes for each fault, according
to the ranking criterion of AutoFix-E ... find at least one “proper” fix among the top five."
"We selected a few faults
and asked two experienced programmers from Eiffel Software to write their own fixes for the faults. In 4 out of 6 of
the cases, the programmers submitted fixes which are identical (or semantically equivalent) to the best fixes produced automatically by AutoFix-E."
D. Gopinath, M. Z. Malik, and S. Khurshid, Specificationbased Program Repair Using SAT, Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2011.
N/A - 3 paper authors
"The repaired statements were manually verified for accuracy. They were considered to be correct if they were semantically similar to the statements in the correct implementation of the respective algorithms."
A. Arcuri, Evolutionary Repair of Faulty Software, Applied Soft Computing, vol.11, pp.3494-3514, 2011.
G. Jin, Automated Atomicity-violation Fixing, Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, pp.389-400, 2011.
N/A - 4 paper authors
" both merged and unmerged patches successfully fix all eight bugs. We also manually checked all these patches. Our findings are consistent with the random testing results. In those cases with 0% failure rates, the bugs are all truly fixed."
Readability": "Manual inspection shows that all merging
decisions made by AFix improve readability
M. Z. Malik, J. H. Siddiqi, and S. Khurshid, Constraint-Based Program Debugging Using Data Structure Repair, International Conference on Software Testing, Verification and Validation, pp.190-199, 2011.
N. Lazaar, A. Gotlieb, and Y. Lebbah, A Framework for the Automatic Correction of Constraint Programs, Proceedings of the International Conference on Software Testing, Verification and Validation, pp.319-326, 2011.
R. Könighofer and R. Bloem, Automated Error Localization and Correction for Imperative Programs, Formal Methods in Computer-Aided Design (FMCAD), pp.91-100, 2011.
S. Kalvala and R. Warburton, A Formal Approach to Fixing Bugs, Formal Methods, Foundations and Applications, pp.172-187, 2011.
T. Ackling, B. Alexander, and I. Grunert, Evolving Patches for Software Repair, Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pp.1427-1434, 2011.
N/A - 3 paper authors
"We observed no spurious changes in any of the large number individuals we inspected."
Y. Pei, Code-Based Automated Program Fixing, 2011.
N/A - 5 paper authors
"manual inspection confirmed to be adequate
beyond the correctness criterion
provided by the contracts and tests available."
C. L. Goues, GenProg: a Generic Method for Automatic Software Repair, IEEE Transactions on Software Engineering, vol.38, pp.54-72, 2012.
N/A - 4 paper authors
"summary of the effect of the final repair, as judged by manual inspection"
"manual inspection suggests that the produced patches are acceptable"
C. Le Goues, A Systematic Study of Automated Program Repair: Fixing 55 Out of 105 Bugs for $8 Each, Proceedings of the International Conference on Software Engineering, pp.3-13, 2012.
F. Logozzo and T. Ball, Modular and Verified Automatic Program Repair, Proceedings of the 27th ACM International Conference on Object Oriented Programming Systems Languages and Applications, 2012.
N/A - 2 paper authors
N/A: just says "We manually inspected some of the repairs generated by Cccheck and discovered new bugs in shipped and very well-tested libraries
H. Samimi, Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving, Proceedings of ICSE, pp.277-287, 2012.
P. Liu and C. Zhang, Axis: Automatically Fixing Atomicity Violations Through Solving Control Constraints, Proceedings of the 2012 International Conference on Software Engineering, pp.299-309, 2012.
U. Repinski, Combining dynamic slicing and mutation operators for ESL correction, 17th IEEE European Test Symposium, pp.1-6, 2012.
Y. Qi, More Efficient Automatic Repair of Large-scale Programs Using Weak Recompilation, Science China Information Sciences, vol.55, issue.12, pp.2785-2799, 2012.
Z. P. Fry, B. Landau, and W. Weimer, A Human Study of Patch Maintainability, Proceedings of the International Symposium on Software Testing and Analysis, pp.177-187, 2012.
2+ annotators
At least two annotators verified each participant’s answers to mitigate grading errors or ambiguities due to the use of free-form text.
Used Amazon's Mechanical Turk ; "First, participants were required to give answers for all questions and complete the exit survey fully. Second, participants who scored more than one standard deviation below the average student’s score were removed from consideration."
"In the study, humans perform tasks that demonstrate their understanding of the control flow, state, and maintainability aspects of code patches."
Participants were asked to complete three tasks for each code segment:
• Answer the code understanding question (in free form text).
• Give a subjective judgment of how confident they were in their answer (using a 1–5 Likert scale).
• Give a subjective judgment of how maintainable they felt the code in question was (using a 1–5 Likert scale). Note that “maintainability” was not defined for participants; they were forced to use their own intuitions.
C. Le Goues, S. Forrest, and W. Weimer, Current Challenges in Automatic Software Repair, Software Quality Journal, vol.21, pp.421-443, 2013.
C. Liu, R2Fix: Automatically Generating Bug Fixes From Bug Reports, Proceedings of the International Conference on Software Testing, Verification and Validation, pp.282-291, 2013.
Developers who discussed the bug report and top committers for the project
Sent a survey to developers which contained a link to a bug report and a patch generated by their tool. Asked two questions:
Q1: "Would a patch automatically generated by R2Fix save developers’ time in fixing the bug?"
Q2: "Would a patch automatically generated by R2Fix prompt a quicker response to the bug report?"
D. Kim, Automatic Patch Generation Learned From Human-Written Patches, Proceedings of ICSE, 2013.
85 (S1), 168 (S2)
S1: 17 grad students, 68 devs from Stack Overflow, CodeRanch, and Daum (a Korean software company), all with Java experience
S2: 72 students and 96 developers
Study 1: shown 5 bugs, each with three patches (human, Par, genprog). Participants asked to rank patches
Study 2: "Each survey session showed a pair of anonymized patches (one from human and the other from PAR or GenProg for the same bug) along with corresponding bug information. Participants were asked to select more acceptable patches if they were patch reviewers. In addition, participants were given the choice of both are acceptable or not sure if they could not determine acceptable patches"
S1: "asked to compare them as a patch reviewer and to report their rankings according to acceptability"
S2: asked for two patches if "both were acceptable", one was more acceptable, or neither was acceptable
C. L. Goues, Automatic Program Repair Using Genetic Programming, 2013.
F. Logozzo and M. Martel, Automatic Repair of Overflowing Expressions with Abstract Interpretation, Semantics, Abstract Interpretation, and Reasoning About Programs: Essays Dedicated to David A. Schmidt on the Occasion of His Sixtieth Birthday, vol.129, pp.341-357, 2013.
H. D. T. Nguyen, SemFix: Program Repair via Semantic Analysis, Proceedings of the International Conference on Software Engineering, 2013.
NOTE: Has manual involvement, but only looking at bugs SemFix didn't solve, not the patches themselves
M. Leotta, Repairing Selenium Test Cases: an Industrial Case Study about Web Page Element Localization, International Conference on Software Testing, Verification and Validation, pp.487-488, 2013.
R. Singh, S. Gulwani, and A. Solar-lezama, Automated Feedback Generation for Introductory Programming Assignments, In: ACM SIGPLAN Notices, vol.48, pp.15-26, 2013.
S. Son, K. S. McKinley, and V. Shmatikov, Fix Me Up: Repairing Access-Control Bugs in Web Applications, Proceedings of the Network and Distributed System Security Symposium, 2013.
V. Balachandran, Fix-it: An extensible code auto-fix component in review bot, IEEE 13th International Working Conference on Source Code Analysis and Manipulation, SCAM 2013, pp.167-172, 2013.
Experienced java developer
In an attempt to estimate the effort reduction due to Fix-it, we requested an experienced developer who is proficient in Java and XML technologies to check whether fixes can be provided for various Checkstyle rules enabled in Review Bot:
• For each of the rules enabled, is it possible to provide a fix with the current features in Fix-it?
• For rules where a fix is not possible, is it possible to provide a fix if Fix-it supports complex refactorings involving multiple files and manipulating the source code text directly?
W. Weimer, Z. P. Fry, and S. Forrest, Leveraging program equivalence for adaptive program repair: Models and first results, International Conference on Automated Software Engineering, pp.356-366, 2013.
Y. Khmelevsky, M. C. Rinard, and S. Sidiroglou-Douskos, A Source-to-source Transformation Tool for Error Fixing, Proceedings of CASCON, pp.147-160, 2013.
N/A - 3 paper authors
"Validation Runs: We've built and deployed the original and updated programs to confirm the reported bugs and
check the results of the bug fixing by the
Y. Qi, X. Mao, and Y. Lei, Efficient Automated Program Repair Through Fault-Recorded Testing Prioritization, Proceedings of ICSM, 2013.
Z. Coker and M. Hafiz, Program Transformations to Fix C Integers, Proceedings of the International Conference on Software Engineering, pp.792-801, 2013.
M. Monperrus and B. Baudry, Research Report Dagstuhl Seminar 13061 "Fault Prediction, Localization, and Repair", Schloss Dagstuhl - Leibniz Center for Informatics, p.5, 2013.
Couldn't find PDF
M. Monperrus, A Critical Review of "Automatic Patch Generation Learned from Human-Written Patches": Essay on the Problem Statement and the Evaluation of Automatic Software Repair, International Conference on Software Engineering, pp.234-242, 2014.
D. Gopinath, Data-guided Repair of Selection Statements, Proceedings of the 36th International Conference on Software Engineering, pp.243-253, 2014.
N/A - 4 paper authors
"The usefulness of the generated repair suggestions is summarized in the last column of Table 3. Except for Ex4, the repair suggestions were close to manual (ideal) fixes for the bugs."
A. Zeller, Automated Fixing of Programs with Contracts, IEEE Transactions on Software Engineering, vol.40, pp.427-449, 2014.
1 ("Classification of fixes into proper and improper was done manually by the first author. While this may have introduced a classification bias, it also ensured that the classification was done by someone familiar with the code bases,")
Q1: "We manually inspected the valid fixes and determined how many of them can be considered proper, that is genuine corrections that remove the root of the error (see Section 4.5)."
Not patch related: "To understand the limitations of our technique, we manually analyzed all the faults for which AutoFix always failed, and identified four scenarios that prevent success. "
A. Shaw, D. Doggett, and M. Hafiz, Automatically Fixing C Buffer Overflows Using Program Transformations, International Conference on Dependable Systems and Networks, pp.124-135, 2014.
F. Demarco, Automatic Repair of Buggy If Conditions and Missing Preconditions with SMT, Proceedings of the 6th International Workshop on Constraints in Software Testing, Verification, and Analysis, 2014.
F. Long, Sound Input Filter Generation for Integer Overflow Errors, ACM SIGPLAN Notices, vol.49, pp.439-452, 2014.
N/A - 4 paper authors
"We also manually examined the root cause of each vulnerability and confirmed that the generated filters completely nullified the vulnerability — if an input passes the filter, it will not trigger the overflow error that enables the vulnerability"
F. S. Ocariza, K. Pattabiraman, and A. Mesbah, Vejovis: Suggesting Fixes for JavaScript Faults, Proceedings of the 36th International Conference on Software Engineering, 2014.
Manual review doesn't involve patch correctness
M. Martinez, Extraction and Analysis of Knowledge for Automatic Software Repair, 2014.
M. Martinez, W. Weimer, and M. Monperrus, Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches, ICSE -36th IEEE International Conference on Software Engineering, 2014.
R. Samanta, O. Olivo, and E. Emerson, Cost-aware Automatic Program Repair, International Static Analysis Symposium, pp.268-284, 2014.
S. Kaleeswaran, Minthint: Automated Synthesis of Repair Hints, Proceedings of the International Conference on Software Engineering, pp.266-276, 2014.
8 devs, 2 grad students with dev experience
"We performed fault localization on a set of programs from the Siemens suite [19], which consists of programs with multiple faulty versions. Table 5 lists the programs and the number of faulty versions that were selected as tasks. The tasks represent a diverse collection of faults (see Table 6). In the user study, each user was required to work on two independent tasks. To keep each task manageable within 2h, we presented to the user only the top 5 statements identified by Zoltar as potentially faulty. For each of the chosen tasks, the actual faulty statement belonged to this list. For each program and candidate faulty statement, MintHint obtained the state transformers for the failing tests through symbolic execution with a timeout of 5m per test. For one of the candidate tasks, replace-v18, symbolic execution of many failing tests timed out. In comparison, there were many more passing tests—potentially making data from failing tests statistically insignificant. To avoid this, only half of the passing tests were used for deriving the state transforme"
"We performed the user study in two phases. In the control phase, the users were given the fault localization information and the test suite. In the experimental phase they were also given the repair hints. Each user worked on a single task per phase and was given 2h to complete that task. We considered a task to be complete if the repaired program passed all the tests. The users chose the programs for the control phase by drawing lots. We mapped each task in the control phase to a task in the experimental phase to make sure that a user would not work on the same task, or on another faulty version of the same program, in both phases.
To solve the bug, and also a qualitative rating of task difficulty: difficulty of localization and difficulty of repair
T. Wang, C. Song, and W. Lee, Diagnosis and Emergency Patch Generation for Integer Overflow Exploits, Detection of Intrusions and Malware, and Vulnerability Assessment, pp.255-275, 2014.
N/A - 3 paper authors
"We manually verified that SoupInt correctly
found the error handling branch by using the heuristics in Section 3.2"
Y. Lin and S. Kulkarni, Automatic Repair for Multi-threaded Programs with Deadlock/Livelock Using Maximum Satisfiability, Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp.237-247, 2014.
Y. Qi, The Strength of Random Search on Automated Program Repair, Proceedings of the 36th International Conference on Software Engineering, pp.254-265, 2014.
Y. Tao, Automatically Generated Patches As Debugging Aids: a Human Study, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp.64-74, 2014.
44 CS graduate students, 28 software engineers of varying experience levels, and 23 Amazon Mechanical Turk workers (including 20 developers, two undergraduate students, and one IT manager) with 1-14 (on average 5.7) years of Java experience
MTurk criteria: "to safeguard worker quality, we prepared a buggy code revised from Apache Commons Collection Issue 359 and asked workers to describe how to fix it. Only those who passed this qualifying test could proceed to our debugging tasks."
five bugs and ten generated patches used to create five debugging tasks
Debugging task:
- Provided detailed bug descriptions from Mozilla and Apache bug reports to participants as their starting point. Also provided developer-written test cases, which included the failed ones that reproduced the bugs. Although participants were encouraged to fix as many bugs as they could, they were free to skip any bug as the completion of all five tasks was not mandatory.
Debugging system:
- Created/provided a web-based online debugging system that provided similar features to the Eclipse workbench.
Exit survey:
- Asked the participants to rate the difficulty of each bug and the helpfulness of the provided aids on a 5-point Likert scale.
- Asked them to share their opinions on using auto-generated patches in free-text form.
- Asked Engr and MTurk participants to self-report their Java programming experience and debugging time for each task.
- Asked MTurk participants to report their occupations.
Y. Xiong, Range Fixes: Interactive Error Resolution for Software Configuration, IEEE Transactions on Software Engineering, vol.41, pp.603-619, 2015.
R. Just, D. Jalali, and M. D. Ernst, Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs, Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA'14), 2014.
E. K. Smith, Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair, Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 2015.
C. L. Goues, The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs, IEEE Transactions on Software Engineering (TSE), 2015.
A. Dhar, CLOTHO: Saving Programs from Malformed Strings and Incorrect String-handling, Foundations of Software Engineering, pp.555-566, 2015.
NOTE: have manual involvement, but regards developer patches, not tool patches only. "On manual investigation, we found that for some benchmarks the developers significantly changed the code structure and introduced several new sets of constraints in the patched version, which resulted in low PPI (< 0.7) for those benchmarks."
S. H. Tan and A. Roychoudhury, Relifix: Automated Repair of Software Regressions, Proceedings of ICSE, 2015.
X. B. D. Le, T. D. B. Le, and D. Lo, Should Fixing These Failures be Delegated to Automated Program Repair?, Proceedings of the IEEE International Symposium on Software Reliability Engineering, pp.427-437, 2015.
E. Kneuss, M. Koukoutos, and V. Kuncak, Deductive Program Repair, International Conference on Computer Aided Verification, pp.217-233, 2015.
N/A - 3 paper authors
D. Gopinath, Systematic techniques for more effective fault localization and program repair, 2016.
B. Cornu, Automatic Analysis and Repair of Exception Bugs for Java Programs, 2015.
F. Long and M. C. Rinard, Prophet: Automatic Patch Generation via Learning From Successful Patches, Proceedings of the Symposium on Principles of Programming Languages, 2016.
N/A - 2 paper authors
"We manually analyze each generated patch to determine whether the generated patch is a correct patch or just a plausible but incorrect patch that produces correct outputs for all of the inputs in the test suite"
"Our manual code analysis indicates that each of the generated correct patches in our experiments is semantically equivalent"
F. Long and M. C. Rinard, Staged Program Repair with Condition Synthesis, Proceedings of ESEC/FSE, 2015.
N/A - 2 paper authors
"We consider a generated repair correct if 1) the repair completely eliminates the defect exposed by the negative test cases so that no test case will be able to trigger the defect, and 2) the repair does not introduce any new defects."
"We also analyze the developer patch (when available) for each of the defects/changes for which SPR generated plausible repairs. Our analysis indicates that the developer patches are consistent with our correctness analysis: 1) if our analysis indicates that the SPR repair is correct, then the repair has the same semantics as the developer patch and 2) if our analysis indicates that the SPR repair is not correct, then the repair has different semantics from the patch."
"We acknowledge that, in general, determining whether a specific repair corrects a specific defect can be difficult (or in some cases not even well defined). We emphasize that this is not the case for the repairs and defects that we consider in this paper. The correct behavior for all of the defects is clear, as is repair correctness and incorrectness."
P. Muntean, Automated Generation of Buffer Overflows Quick Fixes Using Symbolic Execution and SMT, International Conference on Computer Safety, Reliability & Security (SAFECOMP'15), 2015.
Q. Gao, Fixing Recurring Crash Bugs via Analyzing Q&A Sites, Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, 2015.
"Then we manually verified the correctness of the patches by comparing each generated patch with each patch written by developers. N"
Q. Gao, Safe Memory-leak Fixing for C Programs, Proceedings of the 37th International Conference on Software Engineering, pp.459-470, 2015.
N/A - 8 paper authors
"We also manually check if there are useless fixes. During the manual check, we try to identify any of the two cases where a fix can be useless: (1) the deallocated expression is always a null pointer, and (2) the inserted deallocation is on a dead path. These conditions usually can be easily falsified by seeking for counter-examples."
S. Lamelas and M. Monperrus, Automatic Repair of Infinite Loops, 2015.
N/A - 2 paper authors
Yes, manual evaluation of repair appropriateness. They show all patches found and explain them.
S. Mechtaev, J. Yi, and A. Roychoudhury, DirectFix: Looking for Simple Program Repairs, Proceedings of the 37th International Conference on Software Engineering, 2015.
T. Durieux, Automatic Repair of Real Bugs: An Experience Report on the Defects4J Dataset, 2015.
"Our manual analysis of all 84 generated patches shows that 11/84 are correct, 61/84 are incorrect, and 12/84 require a domain expertise, which we do not have"
"For correctness assesment, we manually examine the generated patches. For each patch, one of the authors (called thereafter an“analyst”) analyzed the patch correctness, readability, and the difficulty of validating the correctness. The correctness of a patch can be correct, incorrect, or unknown. The term “correct” denotes that a patch is exactly the same or equivalent to the patch that is written by developers. The equivalence is assessed according to the"
"In our experiment, since manual analysis is very tedious, we did not cross analysis (more than one analyst per patch)
Y. Ke, Repairing Programs with Semantic Code Search, Proceedings of the International Conference on Automated Software Engineering, 2015.