Shared Information and Program Plagiarism Detection
 IEEE Trans. Inform. Theory
Cited by 68 (2 self)
A fundamental question in information theory and in computer science is how to measure similarity, or the amount of shared information, between two sequences. We have proposed a metric based on Kolmogorov complexity to answer this question and have proven it to be universal. We apply this metric to measure the amount of shared information between two computer programs, enabling plagiarism detection.
Information Assurance through Kolmogorov Complexity
, 2001
Cited by 22 (9 self)
The problem of Information Assurance is approached from the point of view of Kolmogorov Complexity and Minimum Message Length criteria. Several theoretical results are obtained, possible applications are discussed, and a new metric for measuring complexity is introduced. The use of Kolmogorov-Complexity-like metrics as conserved parameters to detect abnormal system behavior is explored. Data and process vulnerabilities are put forward as two different dimensions of vulnerability that can be discussed in terms of Kolmogorov Complexity. Finally, these results are used to conduct complexity-based vulnerability analysis.

1. Introduction. Information security (or lack thereof) is too often dealt with after security has been lost. Back doors are opened, Trojan horses are placed, passwords are guessed, and firewalls are broken down; in general, security is lost as barriers to hostile attackers are breached and one is put in the undesirable position of detecting and patching holes.
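The idea of treating a complexity-like metric as a conserved parameter can be sketched with an ordinary compressor standing in for the (uncomputable) Kolmogorov complexity. This is a minimal illustration, not the paper's metric; the `complexity_rate` helper and the synthetic "traffic" samples are assumptions made for the example:

```python
import zlib

def complexity_rate(data: bytes) -> float:
    # Compressed length per input byte: a rough, compressor-based
    # stand-in for the Kolmogorov complexity of the data stream.
    return len(zlib.compress(data, 9)) / len(data)

# Regular, highly redundant activity compresses well (low complexity rate).
normal = b"GET /index.html 200\n" * 200

# An entropy spike (here, deterministic pseudo-random bytes) raises the rate,
# which is the kind of deviation a conserved-parameter monitor would flag.
abnormal = normal[:2000] + bytes((i * 131 + 7) % 256 for i in range(2000))

assert complexity_rate(abnormal) > complexity_rate(normal)
```

A monitor built on this idea would track the rate over time and alert when it drifts from its usual baseline.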
Independent Minimum Length Programs to Translate Between Given Strings
Cited by 20 (0 self)
A string p is called a program to compute y given x if U(p, x) = y, where U denotes a universal programming language. The Kolmogorov complexity K(y|x) of y relative to x is defined as the minimum length of a program to compute y given x. Let K(x) denote K(x | empty string) (the Kolmogorov complexity of x), and let I(x : y) = K(x) + K(y) − K(⟨x, y⟩) (the amount of mutual information in x, y). In the present paper we answer in the negative the following question posed in [1]: Is it true that for any strings x, y there are independent minimum-length programs p, q to translate between x and y? That is, is it true that for any x, y there are p, q such that U(p, x) = y, U(q, y) = x, the length of p is K(y|x), the length of q is K(x|y), and I(p : q) = 0 (where the last three equalities hold up to an additive O(log(K(x|y) + K(y|x))) term)?
Algorithmic clustering of music
 Computer Music Journal
, 2004
Cited by 16 (4 self)
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without change, be used in different areas such as linguistic classification, literature, and genomics. Indeed, it can be used to simultaneously cluster objects from completely different domains, like with like. It is based on an ideal theory of the information content in individual objects (Kolmogorov complexity), information distance, and a universal similarity metric. The approximation to the universal similarity metric obtained using standard data compressors is called the "normalized compression distance" (NCD). Experiments using our CompLearn software tool show that the method distinguishes between various musical genres and can even cluster pieces by composer.
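The NCD itself is easy to sketch with any standard compressor. Below, zlib stands in for the compressors used in practice, and the toy "melody" strings are invented for illustration; this is not the CompLearn implementation:

```python
import zlib

def compressed_len(data: bytes) -> int:
    # Length of the zlib-compressed data: a practical stand-in
    # for the (uncomputable) Kolmogorov complexity C(data).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

melody_a = b"C D E C C D E C E F G E F G " * 10
melody_b = b"C D E C C D E C E F G E F G " * 9 + b"G A G F E C G A G F E C "
noise = bytes((i * 97 + 13) % 256 for i in range(400))  # unrelated bytes

# Similar sequences end up closer under NCD than unrelated ones.
assert ncd(melody_a, melody_b) < ncd(melody_a, noise)
```

Hierarchical clustering then amounts to computing the pairwise NCD matrix and feeding it to any standard linkage algorithm.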
Information Distance in Multiples
, 2009
Cited by 9 (4 self)
Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.
Algorithms for estimating information distance with application to bioinformatics and linguistics
 in Proc. Canadian Conference on Electrical and Computer Engineering. IEEE
The Normalized Compression Distance Is Resistant to Noise
, 2007
Cited by 6 (2 self)
This correspondence studies the influence of noise on the normalized compression distance (NCD), a measure that uses compressors to compute the degree of similarity of two files. This influence is approximated by a first-order differential equation that gives rise to a complex effect, which explains the NCD values greater than 1 observed by other authors. The model is tested experimentally with good fit. Finally, the influence of noise on the clustering of files of different types is explored; the NCD is found to perform well even in the presence of quite high noise levels.
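The basic effect is easy to reproduce: corrupting more of a file pushes its NCD to the original upward. The sketch below uses zlib and a deterministic byte-flipping "noise" model, both of which are assumptions made for the example rather than the noise model of the correspondence:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance with zlib as the compressor.
    c = lambda b: len(zlib.compress(b, 9))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

original = b"the normalized compression distance is resistant to noise " * 20

def corrupt(data: bytes, every: int) -> bytes:
    # Deterministically flip one byte out of every `every` bytes.
    out = bytearray(data)
    for i in range(0, len(out), every):
        out[i] ^= 0xFF
    return bytes(out)

light = ncd(original, corrupt(original, 64))  # sparse corruption
heavy = ncd(original, corrupt(original, 4))   # dense corruption
assert light < heavy  # more noise drives the NCD higher
```

With dense enough corruption the numerator can approach or exceed the denominator, which is the regime where NCD values above 1 appear.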
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance
Cited by 5 (2 self)
Abstract. Automatic plagiarism detection with a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is a narrow set on which any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, on which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a prior search-space reduction stage, based on the symmetric Kullback-Leibler distance, reduces the search time dramatically. Additionally, it improves the precision and recall obtained by a search strategy based on exhaustive comparison of word n-grams.
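The reduction step can be sketched as ranking reference documents by the symmetric Kullback-Leibler (Jeffreys) divergence between smoothed word distributions and keeping only the closest ones. The smoothing scheme and the toy documents below are assumptions for illustration, not the paper's exact setup:

```python
import math
from collections import Counter

def distribution(text: str, vocab: set, eps: float = 1e-6) -> dict:
    # Smoothed unigram distribution over a shared vocabulary.
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def symmetric_kl(p: dict, q: dict) -> float:
    # Jeffreys divergence: KL(p||q) + KL(q||p) = sum((p - q) * log(p / q)).
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)

suspicious = "compression based similarity measures detect plagiarism"
references = {
    "doc_a": "plagiarism detection with similarity measures and compression",
    "doc_b": "recipes for baking sourdough bread at home",
}

vocab = set(suspicious.lower().split())
for ref in references.values():
    vocab |= set(ref.lower().split())

p = distribution(suspicious, vocab)
ranked = sorted(references,
                key=lambda k: symmetric_kl(p, distribution(references[k], vocab)))
assert ranked[0] == "doc_a"  # the topically closer document ranks first
```

Only the top-ranked documents would then be passed on to the expensive exhaustive n-gram comparison.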
Multi-Document Summarization by Information Distance
Cited by 4 (1 self)
Abstract—We now live in a world where information grows and updates quickly. Knowledge can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach to multi-document update summarization. The best summary is defined as the one that has the minimal information distance to the entire document set, and the best update summary has the minimal conditional information distance to a document cluster given that a prior document cluster has already been read. We propose two methods to approximate the information distance between two documents, one based on compression and the other on coding theory. Experiments on the DUC 2007 and TAC 2008 datasets show that our method correlates closely with human-written summaries and outperforms LexRank in many categories under the ROUGE evaluation criterion.
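The compression-based approximation can be sketched as follows: estimate the conditional information K(docs | summary) by how much extra compressed length the document set needs once the summary is known, and pick the candidate that minimizes it. The `cond_info` approximation and the toy candidates are assumptions for illustration, not the paper's exact scoring function:

```python
import zlib

def compressed_len(data: bytes) -> int:
    # zlib-compressed length as a stand-in for Kolmogorov complexity.
    return len(zlib.compress(data, 9))

def cond_info(target: bytes, given: bytes) -> int:
    # Approximate K(target | given) by C(given + target) - C(given):
    # the extra compressed length the target needs once `given` is known.
    return compressed_len(given + target) - compressed_len(given)

docs = (b"normalized compression distance measures similarity between documents. "
        b"information distance supports clustering and classification of documents. ") * 4
candidates = [
    b"compression distance measures similarity and supports clustering of documents.",
    b"sourdough bread needs a long and slow fermentation to develop flavour.",
]

# The better summary leaves less of the document set unexplained.
best = min(candidates, key=lambda s: cond_info(docs, s))
assert best == candidates[0]
```

An update summary would be scored the same way, but conditioned on the previously read cluster as well.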