Results 1 - 10 of 648,558
Filtering near-duplicate documents
- Proc. FUN 98, 1998
"... Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch” for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the orde ..."
Cited by 156 (7 self)
is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words
Adaptive Near-Duplicate Detection via Similarity Learning
- In Proceedings of SIGIR ’10, 2010
"... In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine si ..."
Cited by 11 (2 self)
Near-duplicate Detection by Instance-level Constrained Clustering
- In Proceedings of the 29th ACM Conference on Research and Development in Information Retrieval (SIGIR-06), 2006
"... For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated d ..."
Cited by 33 (6 self)
On Finding Duplication and Near-Duplication in Large Software Systems
- 1995
"... This paper describes how a program called dup can be used to locate instances of duplication or near-duplication in a software system. Dup reports both textually identical sections of code and sections that are the same textually except for systematic substitution of one set of variable names and co ..."
Cited by 240 (1 self)
Community detection in graphs
- 2009
"... The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining vertices of th ..."
Cited by 801 (1 self)
Dynamics of Random Early Detection
- In Proceedings of ACM SIGCOMM, 1997
"... In this paper we evaluate the effectiveness of Random Early Detection (RED) over traffic types categorized as nonadaptive, fragile and robust, according to their responses to congestion. We point out that RED allows unfair bandwidth sharing when a mixture of the three traffic types shares a link. Thi ..."
Cited by 464 (1 self)
DART: Directed automated random testing
- In Programming Language Design and Implementation (PLDI), 2005
"... We present a new tool, named DART, for automatically testing software that combines three main techniques: (1) automated extraction of the interface of a program with its external environment using static source-code parsing; (2) automatic generation of a test driver for this interface that performs ..."
Cited by 823 (41 self)
techniques constitute Directed Automated Random Testing, or DART for short. The main strength of DART is thus that testing can be performed completely automatically on any program that compiles – there is no need to write any test driver or harness code. During testing, DART detects standard errors
An intrusion-detection model
- IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1987
"... A model of a real-time intrusion-detection expert system capable of detecting break-ins, penetrations, and other forms of computer abuse is described. The model is based on the hypothesis that security violations can be detected by monitoring a system's audit records for abnormal patterns of sy ..."
Cited by 632 (0 self)
Attention and the detection of signals
- Journal of Experimental Psychology: General, 1980
"... Detection of a visual signal requires information to reach a system capable of eliciting arbitrary responses required by the experimenter. Detection latencies are reduced when subjects receive a cue that indicates where in the visual field the signal will occur. This shift in efficiency appears to b ..."
Cited by 532 (2 self)