Results 1 - 10 of 127
On the Resemblance and Containment of Documents
- In Compression and Complexity of Sequences (SEQUENCES’97)
, 1997
Cited by 254 (5 self)
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document.
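As a rough illustration of the set-intersection view described in this abstract, the sketch below computes resemblance and containment over sets of word shingles, and then estimates resemblance from a fixed-size sample of hashed shingles. The shingle width `w`, the sample size `s`, the whitespace tokenization, and the use of SHA-1 as the hash are all assumptions made for this example; the abstract itself does not specify them.

```python
import hashlib

def shingles(text, w=3):
    """Return the set of contiguous w-word sequences in text (w is an assumed parameter)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=3):
    """|S(A) & S(B)| / |S(A) | S(B)| over shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def containment(a, b, w=3):
    """|S(A) & S(B)| / |S(A)|: how much of A appears in B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0

def _hash(shingle):
    # Hash choice is an assumption; any well-mixed hash would do for this sketch.
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(text, s=64, w=3):
    """Fixed-size sample: the s smallest shingle hashes, computed per document."""
    return sorted({_hash(sh) for sh in shingles(text, w)})[:s]

def estimate_resemblance(sketch_a, sketch_b, s=64):
    """Estimate resemblance from two sketches without access to the documents."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 1.0
    common = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in common) / len(merged)

doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower which is a rose"
print(resemblance(doc_a, doc_b), containment(doc_a, doc_b))
```

Each sketch is built independently of the other document, which is what makes the sampling approach attractive for large collections: only the sketches need to be stored and compared.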
A Low-bandwidth Network File System
, 2001
Cited by 240 (3 self)
This paper presents LBFS, a network file system designed for low-bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique, LBFS achieves up to two orders of magnitude reduction in bandwidth utilization on common workloads, compared to traditional network file systems.
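The idea of not resending data the other side already holds can be sketched as a hash exchange: the sender names each piece of a file by a hash, and transfers only the pieces the receiver reports as missing. This is a toy model, not the LBFS protocol. The `Server` class, the `upload` helper, the fixed-size chunking, and the SHA-256 hash are all hypothetical choices made for this example, since the abstract does not say how files are divided or identified.

```python
import hashlib

def chunk_id(chunk: bytes) -> str:
    """Name a chunk by a hash of its contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()

class Server:
    """Hypothetical receiver that remembers every chunk it has seen."""

    def __init__(self):
        self.chunks = {}

    def missing(self, ids):
        """Return the ids the server does not already hold."""
        return [i for i in ids if i not in self.chunks]

    def store(self, chunk: bytes):
        self.chunks[chunk_id(chunk)] = chunk

    def assemble(self, ids) -> bytes:
        """Rebuild a file from its ordered list of chunk ids."""
        return b"".join(self.chunks[i] for i in ids)

def upload(server, data: bytes, chunk_size=8):
    """Send only chunks the server lacks; return (chunk ids, payload bytes sent)."""
    # Fixed-size chunks are a simplification chosen for this sketch.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    ids = [chunk_id(c) for c in chunks]
    needed = set(server.missing(ids))
    sent = 0
    for chunk, cid in zip(chunks, ids):
        if cid in needed:
            server.store(chunk)
            needed.discard(cid)  # a chunk repeated within the file is sent once
            sent += len(chunk)
    return ids, sent

server = Server()
v1 = b"AAAAAAAABBBBBBBBAAAAAAAA"
v2 = b"AAAAAAAABBBBBBBBCCCCCCCC"
ids1, sent1 = upload(server, v1)
ids2, sent2 = upload(server, v2)
print(sent1, sent2)
```

In this model the second version of the file costs only the bytes of the one chunk that changed, plus the (small) list of hashes.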
Automated worm fingerprinting
- In OSDI
, 2004
Cited by 239 (6 self)
Network worms are a clear and growing threat to the security of today’s Internet-connected hosts and networks. The combination of the Internet’s unrestricted connectivity and widespread software homogeneity allows network pathogens to exploit tremendous parallelism in their propagation. In fact, modern worms can spread so quickly, and so widely, that no human-mediated reaction can hope to contain an outbreak. In this paper, we propose an automated approach for quickly detecting previously unknown worms and viruses based on two key behavioral characteristics – a common exploit sequence together with a range of unique sources generating infections and destinations being targeted. More importantly, our approach – called “content sifting” – automatically generates precise signatures that can then be used to filter or moderate the spread of the worm elsewhere in the network. Using a combination of existing and novel algorithms, we have developed a scalable content sifting implementation with low memory and CPU requirements. Over months of active use at UCSD, our Earlybird prototype system has automatically detected and generated signatures for all pathogens known to be active on our network as well as for several new worms and viruses which were unknown at the time our system identified them. Our initial experience suggests that, for a wide range of network pathogens, it may be practical to construct fully automated defenses – even against so-called “zero-day” epidemics.
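The two behavioral characteristics named in this abstract (a repeated content sequence, and many distinct sources and destinations) can be illustrated with a small in-memory tracker. This is only a sketch of the idea, not the Earlybird implementation: the `ContentSifter` class, the substring length, and all thresholds are hypothetical, and exact per-substring tables like these would not scale the way the paper's algorithms are said to.

```python
from collections import defaultdict

class ContentSifter:
    """Toy tracker: flag substrings that recur across many sources and destinations."""

    def __init__(self, length=8, min_count=3, min_sources=3, min_dests=3):
        # All parameters are assumed values chosen for illustration.
        self.length = length
        self.min_count = min_count
        self.min_sources = min_sources
        self.min_dests = min_dests
        self.count = defaultdict(int)
        self.sources = defaultdict(set)
        self.dests = defaultdict(set)

    def observe(self, src, dst, payload: bytes):
        """Record one packet; return the substrings that now look suspicious."""
        flagged = set()
        seen_in_packet = set()
        for i in range(len(payload) - self.length + 1):
            sub = payload[i:i + self.length]
            if sub in seen_in_packet:
                continue  # count each substring once per packet
            seen_in_packet.add(sub)
            self.count[sub] += 1
            self.sources[sub].add(src)
            self.dests[sub].add(dst)
            if (self.count[sub] >= self.min_count
                    and len(self.sources[sub]) >= self.min_sources
                    and len(self.dests[sub]) >= self.min_dests):
                flagged.add(sub)
        return flagged

sifter = ContentSifter()
first = sifter.observe("10.0.0.1", "10.0.1.1", b"xx EXPLOIT!! aa")
second = sifter.observe("10.0.0.2", "10.0.1.2", b"yyy EXPLOIT!! b")
third = sifter.observe("10.0.0.3", "10.0.1.3", b"EXPLOIT!! zzzz")
print(sorted(third))
```

Requiring address dispersion as well as repetition is what separates this from a plain frequency counter: content repeated between a single pair of hosts never crosses the source and destination thresholds.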
Venti: A New Approach to Archival Storage
, 2002
Cited by 198 (0 self)
This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems.
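A minimal in-memory sketch of content-addressed storage, as described in this abstract, is shown below. The `BlockStore` class and its method names are hypothetical, and SHA-256 is an assumed hash choice (the abstract only says "a unique hash"). The point is that using the hash as the block identifier gives write-once behavior and duplicate coalescing almost for free.

```python
import hashlib

class BlockStore:
    """Toy content-addressed store: a block's identifier is the hash of its contents."""

    def __init__(self):
        self._blocks = {}

    def write(self, data: bytes) -> str:
        """Store a block and return its identifier.

        Writing identical contents twice yields the same identifier and
        keeps a single copy; an existing block is never overwritten.
        """
        key = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(key, bytes(data))
        return key

    def read(self, key: str) -> bytes:
        """Fetch a block and verify that it still matches its identifier."""
        data = self._blocks[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored block does not match its identifier")
        return data

    def __len__(self):
        return len(self._blocks)

store = BlockStore()
k1 = store.write(b"archival block")
k2 = store.write(b"archival block")   # duplicate: coalesced
k3 = store.write(b"another block")
print(k1 == k2, len(store))
```

Because the identifier depends only on the contents, a client cannot modify a block in place: changed contents produce a different identifier, and the original block remains readable under the old one.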
A Language Independent Approach for Detecting Duplicated Code
, 1999
Cited by 154 (19 self)
Code duplication is one of the factors that severely complicates the maintenance and evolution of large software systems. Techniques for detecting duplicated code exist but rely mostly on parsers, technology that has proven to be brittle in the face of different languages and dialects. In this paper we show that it is possible to circumvent this hindrance by applying a language independent and visual approach, i.e. a tool that requires no parsing, yet is able to detect a significant amount of code duplication. We validate our approach on a number of case studies, involving four different implementation languages and ranging from 256 K up to 13 Mb of source code size.
Keywords: Software maintenance, code duplication detection, code visualization
1. Code Duplication Detection
Duplicated code is a phenomenon that occurs frequently in large systems. The reasons why programmers duplicate code are manifold (see [9, 2] for a thorough discussion) and include the following reasons: (a) Making a ...
The PARSEC benchmark suite: Characterization and architectural implications
- IN PRINCETON UNIVERSITY
, 2008
Cited by 150 (1 self)
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.
On Finding Duplication and Near-Duplication in Large Software Systems
, 1995
Cited by 144 (1 self)
This paper describes how a program called dup can be used to locate instances of duplication or near-duplication in a software system. Dup reports both textually identical sections of code and sections that are the same textually except for systematic substitution of one set of variable names and constants for another. Further processing locates longer sections of code that are the same except for other small modifications. Experimental results from running dup on millions of lines from two large software systems show dup to be both effective at locating duplication and fast. Applications could include identifying sections of code that should be replaced by procedures, elimination of duplication during reengineering of the system,
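The notion of code that is "the same textually except for systematic substitution" of names and constants can be illustrated by rewriting each fragment into a canonical form in which every identifier or numeric constant is replaced by a placeholder numbered by first occurrence. Two fragments then match exactly when one can be obtained from the other by a consistent one-to-one renaming. This is a simplified sketch of that idea, not dup's algorithm; the tokenizer, the `KEYWORDS` set, and the function names are assumptions made for the example.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
PARAM = re.compile(r"[A-Za-z_]\w*|\d+")
# Assumed, deliberately tiny keyword list; keywords are never treated as parameters.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}

def normalize(fragment):
    """Replace each identifier/constant with a placeholder numbered by first use."""
    mapping = {}
    out = []
    for tok in TOKEN.findall(fragment):
        if PARAM.fullmatch(tok) and tok not in KEYWORDS:
            if tok not in mapping:
                mapping[tok] = "P%d" % len(mapping)
            out.append(mapping[tok])
        else:
            out.append(tok)
    return " ".join(out)

def is_parameterized_match(a, b):
    """True if b equals a up to a consistent one-to-one renaming of parameters."""
    return normalize(a) == normalize(b)

original = "x = y + 1; if (x > y) return x;"
renamed = "a = b + 2; if (a > b) return a;"
swapped = "a = b + 2; if (b > a) return a;"
print(is_parameterized_match(original, renamed))
print(is_parameterized_match(original, swapped))
```

The `swapped` fragment uses the same names as `renamed` but in a different pattern, so no single substitution maps it onto `original`, and the comparison correctly rejects it.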
Optimizing the migration of virtual computers
- In Proceedings of the 5th Symposium on Operating Systems Design and Implementation
, 2002
Cited by 142 (4 self)
This paper shows how to quickly move the state of a running computer across a network, including the state in its disks, memory, CPU registers, and I/O devices. We call this state a capsule. Capsule state is hardware state, so it
Copy Detection Mechanisms for Digital Documents
- In Proceedings of the ACM SIGMOD Annual Conference
, 1995
Cited by 138 (9 self)
In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity. In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.
1 Introduction
Digital libraries are a conc...
Winnowing: Local Algorithms for Document Fingerprinting
- Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data 2003
, 2003
Cited by 129 (2 self)
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing’s performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
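The abstract does not spell out the winnowing procedure, so the sketch below follows the commonly described form of the technique: hash every k-character substring, slide a window of w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties) together with its position. The parameter values, the use of CRC-32 as the hash, and the function names are assumptions made for this example.

```python
import zlib

def kgram_hashes(text, k):
    """Hash every k-byte substring (CRC-32 is an assumed, convenient hash)."""
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]

def winnow_hashes(hashes, w):
    """Select (hash, position) pairs: the rightmost minimum of each window of w hashes."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost occurrence
        selected.add((smallest, start + offset))
    return selected

def fingerprint(text, k=5, w=4):
    """Return the set of selected hash values for a document."""
    return {h for h, _ in winnow_hashes(kgram_hashes(text, k), w)}

example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(sorted(winnow_hashes(example, 4), key=lambda pair: pair[1]))

doc1 = "xxxx the quick brown fox yyyy"
doc2 = "zz the quick brown fox qq"
print(len(fingerprint(doc1) & fingerprint(doc2)) > 0)
```

The selection is "local" in the sense that each choice depends only on the contents of one window. As a result, any shared substring of at least w + k - 1 characters contains a full window of identical hashes in both documents, so the two fingerprints are guaranteed to share at least one value.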

