Winnowing: Local Algorithms for Document Fingerprinting
- Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data 2003
, 2003
Cited by 129 (2 self)
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing’s performance is within 33% of the lower bound. Finally, we give experimental results on Web data and report experience with MOSS, a widely used plagiarism detection service.
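The abstract names winnowing but does not describe its selection rule. As a rough sketch, assuming the usual formulation (hash every k-gram, slide a window of w consecutive hashes, and keep the minimum hash of each window, taking the rightmost one on ties), the core step might look like this in Python. The function names, the MD5-based hash, and the choice of k and w are illustrative assumptions, not the authors' implementation.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-character substring of text (hash choice is an assumption)."""
    return [
        int.from_bytes(
            hashlib.md5(text[i:i + k].encode("utf-8")).digest()[:4], "big"
        )
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, w):
    """Select fingerprints as a set of (position, hash) pairs.

    For each window of w consecutive hashes, keep the minimum hash;
    if the minimum occurs more than once, keep the rightmost occurrence.
    """
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((start + offset, smallest))
    return fingerprints


def fingerprint(text, k=5, w=4):
    """Convenience wrapper: fingerprints of a document under assumed k and w."""
    return winnow(kgram_hashes(text, k), w)
```

Two documents can then be compared by intersecting the hash values of their fingerprint sets. Because the same hash is usually the minimum of several overlapping windows, the selected set is much smaller than the full list of k-gram hashes.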
Compiler Techniques for Code Compaction
, 2000
Cited by 83 (17 self)
This article explores the use of compiler techniques to accomplish code compaction to yield smaller executables. The main contribution of this article is to show that careful, aggressive, interprocedural optimization, together with procedural abstraction of repeated code fragments, can yield significantly better reductions in code size than previous approaches, which have generally focused on abstraction of repeated instruction sequences. We also show how "equivalent" code fragments can be detected and factored out using conventional compiler techniques, without having to resort to purely linear treatments of code sequences as in suffix-tree-based approaches, thereby setting up a framework for code compaction that can be more flexible in its treatment of what code fragments are considered equivalent. Our ideas have been implemented in the form of a binary-rewriting tool that reduces the size of executables by about 30% on average.
Code compression
- In Proc. Conf. on Programming Languages Design and Implementation
, 1997
Cited by 80 (11 self)
Current research in compiler optimization counts mainly CPU time and perhaps the first cache level or two. This view has been important but is becoming myopic, at least from a system-wide viewpoint, as the ratio of network and disk speeds to CPU speeds grows exponentially. For example, we have seen the CPU idle for most of the time during paging, so compressing pages can increase total performance even though the CPU must decompress or interpret the page contents. Another profile shows that many functions are called just once, so reduced paging could pay for their interpretation overhead. This paper describes: (1) measurements that show how code compression can save space and total time in some important real-world scenarios; (2) a compressed executable representation that is roughly the same size as gzipped x86 programs and can be interpreted without decompression, and that can also be compiled to high-quality machine code at 2.5 megabytes per second on a 120 MHz Pentium processor; and (3) a compressed “wire” representation that must be decompressed before execution but is, for example, roughly 21% the size of SPARC code when compressing gcc.
Compiler Techniques for Code Compression
- In Workshop on Compiler Support for System Software
, 1999
Cited by 33 (4 self)
In recent years there has been an increasing trend towards the incorporation of computers into a variety of devices where the amount of available memory is limited. This makes it desirable to reduce the size of applications where possible. This paper explores the use of compiler techniques to accomplish code compression to yield smaller executables. The main contribution of this paper is that, by showing how "equivalent" code fragments can be detected and factored out without having to resort to purely linear treatments of code sequences as in suffix-tree-based approaches, it sets up a framework for code compression that can be more flexible in its treatment of what code fragments are considered "equivalent." Our ideas have been implemented in the form of a binary-rewriting tool that is able to achieve significantly better compression than previous approaches.
A Survey on Software Clone Detection Research
- SCHOOL OF COMPUTING TR 2007-541, QUEEN’S UNIVERSITY
, 2007
Cited by 32 (7 self)
Code duplication, that is, copying a code fragment and then reusing it by pasting with or without modifications, is a well-known code smell in software maintenance. Several studies show that about 5% to 20% of a software system can consist of duplicated code, which is basically the result of copying existing code fragments and using them by pasting with or without minor modifications. One of the major shortcomings of such duplicated fragments is that if a bug is detected in a code fragment, all the other fragments similar to it should be investigated to check for the same bug. Refactoring of duplicated code is another prime issue in software maintenance, although several studies claim that refactoring of certain clones is not desirable and that there is a risk in removing them. However, it is also widely agreed that clones should at least be detected. In this paper, we survey the state of the art in clone detection research. First, we describe the clone terms commonly used in the literature along with their corresponding mappings to the commonly used clone types. Second, we provide a review of the existing
BMAT - A Binary Matching Tool for Stale Profile Propagation
- The Journal of Instruction-Level Parallelism
, 2002
Cited by 26 (1 self)
A major challenge of applying profile-based optimization on large real-world applications is how to capture adequate profile information. A large program, especially a GUI-based application, may be used in a large variety of ways by different users on different machines. Extensive collection of profile data is necessary to fully characterize this type of program behavior. Unfortunately, in a realistic software production environment, many developers and testers need fast access to the latest build, leaving little time for collecting profiles. To address this dilemma, we would like to re-use stale profile information from a prior program build. In this paper we present BMAT, a fast and effective tool that matches two versions of a binary program without knowledge of source code changes. BMAT enables the propagation of profile information from an older, extensively profiled build to a newer build, thus greatly reducing or even eliminating the need for re-profiling. We use two m...
Plagiarism in natural and programming languages: an overview of current tools and technologies
, 2000
Cited by 22 (0 self)
This report discusses in detail methods of plagiarism and its detection in both natural and programming languages. The increase of material now available in electronic form and improved access to this via the Internet is allowing, with greater ease than ever before, plagiarism that is either intentional or unintentional. Due to the increased availability of online material, people checking for plagiarism are finding the task increasingly difficult. Techniques for detecting plagiarism in both natural and programming languages are discussed in this report to provide the reader with a comprehensive introduction to this area. Also provided are examples of common techniques used in natural language plagiarism.
Malware Phylogeny Generation using Permutations of Code
- JOURNAL IN COMPUTER VIROLOGY
, 2005
Cited by 21 (3 self)
Malicious programs, such as viruses and worms, are frequently related to previous programs through evolutionary relationships. Discovering those relationships and constructing a phylogeny model is expected to be helpful for analyzing new malware and for establishing a principled naming scheme. Matching permutations of code may help build better models in cases where malware evolution does not keep things in the same order. We describe a method for constructing phylogeny models that uses features called n-perms to match possibly permuted code. An experiment was performed to compare the relative effectiveness of vector similarity measures using n-perms and n-grams when comparing permuted variants of programs. The similarity measures using n-perms maintained a greater separation between the similarity scores of permuted families of specimens versus unrelated specimens. A subsequent study using a tree generated through n-perms suggests that phylogeny models based on n-perms may help forensic analysts investigate new specimens, and assist in reconciling malware naming inconsistencies.
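To make the n-gram versus n-perm distinction concrete, here is a minimal Python sketch. It assumes an n-perm is an n-gram with the order of its elements ignored (modelled here by sorting each window) and uses cosine similarity as one possible vector similarity measure. The opcode sequences and function names are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngrams(seq, n):
    """Count every window of n consecutive items, order preserved."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def nperms(seq, n):
    """Count every window of n consecutive items, order ignored.

    Sorting each window maps all permutations of the same items
    to a single feature (an assumed encoding of an n-perm).
    """
    return Counter(tuple(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))


def cosine(a, b):
    """Cosine similarity between two feature-count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical opcode sequences: the variant swaps the first two instructions.
original = ["push", "mov", "add", "call", "ret"]
variant = ["mov", "push", "add", "call", "ret"]

gram_score = cosine(ngrams(original, 3), ngrams(variant, 3))
perm_score = cosine(nperms(original, 3), nperms(variant, 3))
```

On this toy pair the n-perm score comes out higher than the n-gram score, because a window whose instructions were merely reordered still counts as a match.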
Sifting out the Mud: Low Level C++ Code Reuse
, 2002
Cited by 15 (8 self)
... where the available amount of memory is limited. This contrasts with the increasing need for additional functionality and the need for rapid application development. While object-oriented programming languages, providing mechanisms such as inheritance and templates, allow fast development of complex applications, they have a detrimental effect on program size. This paper introduces new techniques to reuse the code of whole procedures at the binary level and a supporting technique for data reuse. These techniques benefit specifically from program properties originating from the use of templates and inheritance. Together with our previous work on code abstraction at lower levels of granularity, they achieve additional code size reductions of up to 38% on already highly optimized and compacted binaries, without sacrificing execution speed. We have incorporated these techniques in Squeeze++, a prototype link-time binary rewriter for the Alpha architecture, and extensively evaluate them on a suite of 8 real-life C++ applications. The total code size reductions achieved post link-time (i.e. without requiring any change to the compiler) range from 27 to 70%, averaging at around 43%.

