A Survey on Software Clone Detection Research
- SCHOOL OF COMPUTING TR 2007-541, QUEEN’S UNIVERSITY, 2007
Cited by 32 (7 self)
Code duplication, that is, copying a code fragment and reusing it by pasting with or without modifications, is a well-known code smell in software maintenance. Several studies show that about 5% to 20% of a software system can consist of duplicated code, which is basically the result of copying existing code fragments and reusing them by pasting with or without minor modifications. One of the major shortcomings of such duplicated fragments is that if a bug is detected in a code fragment, all the other fragments similar to it should be investigated to check for the same bug. Refactoring of the duplicated code is another prime issue in software maintenance, although several studies claim that refactoring of certain clones is not desirable and that there is a risk in removing them. However, it is also widely agreed that clones should at least be detected. In this paper, we survey the state of the art in clone detection research. First, we describe the clone terms commonly used in the literature along with their corresponding mappings to the commonly used clone types. Second, we provide a review of the existing
Deckard: Scalable and accurate tree-based detection of code clones
- In ICSE, 2007
Cited by 29 (3 self)
Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that DECKARD is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.
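The idea of characterizing subtrees as numerical vectors and comparing them by Euclidean distance can be illustrated with a minimal sketch. This is not DECKARD itself: the vector definition here (counts of syntax-tree node types, taken from Python's standard `ast` module), the helper names, and the sample functions are all assumptions made for illustration, and the clustering step is reduced to a direct pairwise distance.

```python
import ast
import math
from collections import Counter


def characteristic_vector(node):
    """Count how often each AST node type occurs in the subtree at `node`.

    Assumption: node-type counts stand in for the paper's vectors.
    """
    return Counter(type(n).__name__ for n in ast.walk(node))


def euclidean_distance(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def function_vectors(source):
    """Map each function name in `source` to its characteristic vector."""
    tree = ast.parse(source)
    return {
        n.name: characteristic_vector(n)
        for n in ast.walk(tree)
        if isinstance(n, ast.FunctionDef)
    }


# Hypothetical input: `total` and `summed` differ only in identifier names.
SAMPLE = '''
def total(xs):
    acc = 0
    for x in xs:
        acc = acc + x
    return acc

def summed(values):
    result = 0
    for v in values:
        result = result + v
    return result

def greet(name):
    print("hello", name)
'''

vectors = function_vectors(SAMPLE)
d_clone = euclidean_distance(vectors["total"], vectors["summed"])
d_other = euclidean_distance(vectors["total"], vectors["greet"])
```

Because renaming identifiers does not change node-type counts, the two renamed functions map to the same vector (distance zero), while the unrelated function lies farther away. A threshold on this distance would then play the role of cluster membership.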
Mining specifications of malicious behavior
- In Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2007
Cited by 26 (7 self)
Malware detectors require a specification of malicious behavior. Typically, these specifications are manually constructed by investigating known malware. We present an automatic technique to overcome this laborious manual process. Our technique derives such a specification by comparing the execution behavior of a known malware against the execution behaviors of a set of benign programs. In other words, we mine the malicious behavior present in a known malware that is not present in a set of benign programs. The output of our algorithm can be used by malware detectors to detect malware variants. Since our algorithm provides a succinct description of malicious behavior present in a malware, it can also be used by security analysts for understanding the malware. We have implemented a prototype based on our algorithm and tested it on several malware programs. Experimental results obtained from our prototype indicate that our algorithm is effective in extracting malicious behaviors that can be used to detect malware variants.
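The core comparison described above, keeping behavior that appears in the malware but in none of the benign programs, can be sketched in a heavily simplified form. The abstract does not say how execution behavior is represented, so this sketch assumes a behavior is a short sequence of consecutive event names (an n-gram) from a trace. The function names and the traces are hypothetical.

```python
def ngrams(trace, n):
    """Return the set of length-n windows of consecutive events in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def mine_specification(malware_trace, benign_traces, n=2):
    """Keep the malware's n-grams that occur in no benign trace.

    Assumption: n-grams of event names stand in for "execution behavior".
    """
    benign = set()
    for trace in benign_traces:
        benign |= ngrams(trace, n)
    return ngrams(malware_trace, n) - benign


# Hypothetical traces, invented for illustration only.
malware = ["open", "read", "connect", "send", "close"]
benign = [
    ["open", "read", "close"],
    ["connect", "recv", "close"],
]

spec = mine_specification(malware, benign)
```

Here `spec` retains only the event pairs unique to the malware trace; the pair `("open", "read")` is discarded because a benign program also exhibits it.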
Scalable Detection of Semantic Clones
Cited by 17 (1 self)
Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques have scaled to real world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
Unifying clones with a generative programming technique: a case study
- Journal of Software Maintenance and Evolution: Research and Practice, John Wiley & Sons, Volume 18, Issue 4, July/August 2006
Cited by 8 (1 self)
Software clones, similar program structures repeated in variant forms, increase the risk of update anomalies, blow up the program size and complexity, possibly contributing to high maintenance costs. Yet, programs are often polluted by clones. In this paper, we present a case study of cloning in the Java Buffer library, JDK 1.5. We found that at least 68% of code in the Buffer library was contained in cloned classes or class methods. Close analysis of program situations that led to cloning revealed difficulties in eliminating clones with conventional program design techniques. As a possible solution, we applied a generative technique of XVCL to represent similar classes and methods in generic, adaptable form. Concrete buffer classes could be automatically produced from the generic structures. We argue, on analytical and empirical grounds, that unifying clones reduced conceptual complexity, and enhanced changeability of the Buffer library at rates proportional to code size reduction (68%). We evaluated our solution in qualitative and quantitative ways, and conducted a controlled experiment to support this claim. The approach presented in the paper can be used to enhance genericity and changeability of any program, independently of an application domain or programming language. As the solution is not without pitfalls, we discuss trade-offs involved in its project application. KEY WORDS: class libraries; Object-Oriented methods; maintainability; reusability; generative programming
Genericity - a “Missing in Action” Key to Software Simplification and Reuse (to appear)
- Eng. Conference, 2006
Cited by 4 (0 self)
Similarities are inherent in software. The aim of generic design is to avoid repetitions for the sake of conceptual simplicity, ease of maintenance and reuse. In the paper, we focus on the many types of repetitions that cannot be avoided with conventional generic design techniques, and engineering benefits missed because of that. We show how the problem can be solved by complementing programming language mechanisms and conventional design with a generative technique of XVCL. Generic structures built with XVCL represent abstractions and design information useful in program understanding, maintenance and reuse. Being fully integrated with code, the design and code evolve together, avoiding the problem of external design documentation which often becomes disconnected from code after changes. In the paper, we discuss engineering goals that can be achieved by complementing conventional programming techniques with XVCL, and evaluate trade-offs involved in adoption of the approach.
Reuse without Compromising Performance: Industrial Experience from RPG Software Product Line for Mobile Devices
Cited by 4 (1 self)
It is often believed that reusable solutions, being generic, must necessarily compromise performance. In this paper, we consider a family of Role-Playing Games (RPGs). We analyzed similarities and differences among four RPGs. By applying a reuse technique of XVCL, we built an RPG product line architecture (RPG-PLA) from which we could derive any of the four RPGs. We built into the RPG-PLA a number of performance optimization strategies that could benefit any of the four (and possibly other similar) RPGs. By comparing the original RPGs with the new ones derived from the RPG-PLA, we demonstrated that reuse allowed us to achieve improved performance, in both speed and memory utilization, as compared to each game developed individually. At the same time, our solution facilitated rapid development of new games for new mobile devices, as well as ease of evolving the RPG-PLA and custom games already in use with new features.
Scalable Detection of Similar Code: Techniques and Applications
- 2009
Similar code, also known as cloned code, commonly exists in large software. Studies show that code duplication can incur higher software maintenance cost and more software defects. Thus, detecting similar code and tracking its migration have many important applications, including program understanding, refactoring, optimization, and bug detection. This dissertation presents novel, general techniques for detecting and analyzing both syntactic and semantic code clones. The techniques can scalably and accurately detect clones based on various similarity definitions, including trees, graphs, and functional behavior. They also have the general capability to help reduce software defects and advance code reuse. Specifically, this dissertation makes the following main contributions: First, it presents Deckard, a tree-based clone detection technique and tool. The key insight is that we accurately represent syntax trees and dependency graphs of a program as characteristic vectors in the Euclidean space and apply hashing algorithms to cluster similar vectors efficiently. Experiments show that Deckard scales to millions of lines of code with few false positives. In addition, Deckard is language-agnostic and easily parallelizable,
Program Behavior Discovery and Verification: A Graph . . .
- IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2009
Discovering program behaviors and functionalities can ease program comprehension and verification. Existing program analysis approaches have used text mining algorithms to infer behavior patterns or formal models from program execution. When one tries to identify the hierarchical composition of a program behavior at different abstraction levels, textual descriptions are not informative and expressive enough. To address this, we present a semi-automatic graph grammar approach to retrieving the hierarchical structure of the program behavior. The hierarchical structure is built on recurring substructures in a bottom-up fashion. We formulate the behavior discovery and verification problem as a graph grammar induction and parsing problem, i.e., automatically and iteratively mining qualified patterns and then constructing graph rewriting rules. Furthermore, using the induced grammar to parse the behavioral structure of a new program can verify whether the program has the same behavioral properties specified by the grammar.
Clone Miner & Clone Analyzer Technology Summary: A Technology for Detection and Analysis of Design-Level Software Similarities
- 2009
Similarities are inherent in software and lead to repetitions, so-called code clones. We find clones within and across software systems. Removing clones from programs is often neither possible nor even desirable: some clones play a useful role in a program (e.g., for performance or reliability reasons); other clones occur because of the limitations of a programming language; removal of some clones might conflict with design or business goals that cannot be compromised; clones often result from beneficial standardization of software design, due to application of company standards, or standardized architectures and pattern-driven development on modern component platforms (e.g., .NET™ and JEE™); and uniformity of design is desirable despite inducing high levels of cloning. Companies often manage families of similar software systems, formed by multiple variants (releases) of a system deployed to different customers during software evolution, or software Product Lines developed with reuse strategies. Such systems display high levels of cloning. In current programming paradigms, whether we are in a single- or multiple-system development/maintenance situation, we are bound to programs containing significant cloning. While cloning cannot be classified as inherently good or bad, the knowledge of clones, especially in large software, is useful for program understanding, maintenance, reuse and quality assessment.

