Information Assurance through Kolmogorov Complexity, 2001
Cited by 19 (7 self)
The problem of Information Assurance is approached from the point of view of Kolmogorov Complexity and Minimum Message Length criteria. Several theoretical results are obtained, possible applications are discussed and a new metric for measuring complexity is introduced. Utilization of Kolmogorov Complexity-like metrics as conserved parameters to detect abnormal system behavior is explored. Data and process vulnerabilities are put forward as two different dimensions of vulnerability that can be discussed in terms of Kolmogorov Complexity. Finally, these results are utilized to conduct complexity-based vulnerability analysis.
1. Introduction
Information security (or lack thereof) is too often dealt with after security has been lost. Back doors are opened, Trojan horses are placed, passwords are guessed and firewalls are broken down -- in general, security is lost as barriers to hostile attackers are breached and one is put in the undesirable position of detecting and patching holes. In ...
Independent Minimum Length Programs to Translate Between Given Strings
Cited by 15 (0 self)
A string $p$ is called a program to compute $y$ given $x$ if $U(p, x) = y$, where $U$ denotes a universal programming language. The Kolmogorov complexity $K(y|x)$ of $y$ relative to $x$ is defined as the minimum length of a program to compute $y$ given $x$. Let $K(x)$ denote $K(x|\text{empty string})$ (the Kolmogorov complexity of $x$) and let $I(x : y) = K(x) + K(y) - K(\langle x, y \rangle)$ (the amount of mutual information in $x, y$). In the present paper we answer in the negative the following question posed in [1]: Is it true that for any strings $x, y$ there are independent minimum length programs $p, q$ to translate between $x, y$; that is, is it true that for any $x, y$ there are $p, q$ such that $U(p, x) = y$, $U(q, y) = x$, the length of $p$ is $K(y|x)$, the length of $q$ is $K(x|y)$, and $I(p : q) = 0$ (where the last three equalities hold up to an additive $O(\log(K(x|y) + K(y|x)))$ term)?
Algorithmic clustering of music - Computer Music Journal, 2004
Cited by 11 (1 self)
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without change, be used in different areas like linguistic classification, literature, and genomics. Indeed, it can be used to simultaneously cluster objects from completely different domains, like with like. It is based on an ideal theory of the information content in individual objects (Kolmogorov complexity), information distance, and a universal similarity metric. The approximation to the universal similarity metric obtained using standard data compressors is called the “normalized compression distance” (NCD). Experiments using our CompLearn software tool show that the method distinguishes between various musical genres and can even cluster pieces by composer.
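As an illustration of the normalized compression distance mentioned in this abstract, the following is a minimal sketch that uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for this example; this is not the CompLearn implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: zlib approximates the ideal compressor; any real
    compressor gives only an approximation of the underlying metric.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that one input adds little information beyond the other, while values near 1 suggest the inputs share little structure. A pairwise matrix of such values could then be passed to any hierarchical clustering routine.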
Algorithms for Estimating Information Distance with Application to Bioinformatics and Linguistics - Literary and Linguistic Computing, 2004
Cited by 6 (0 self)
After reviewing unnormalized and normalized information distances based on incomputable notions of Kolmogorov complexity, we discuss how Kolmogorov complexity can be approximated by data compression algorithms. We argue that optimal algorithms for data compression with side information can be successfully used to approximate the normalized distance. Next, we discuss an alternative information distance, which is based on relative entropy rate (also known as Kullback-Leibler divergence), and compression-based algorithms for its estimation. Based on available biological and linguistic data, we arrive at the unexpected conclusion that in Bioinformatics and Computational Linguistics this alternative distance is more relevant and important than the ones based on Kolmogorov complexity.
The Normalized Compression Distance Is Resistant to Noise, 2007
Cited by 3 (1 self)
This correspondence studies the influence of noise on the normalized compression distance (NCD), a measure based on the use of compressors to compute the degree of similarity of two files. This influence is approximated by a first-order differential equation that gives rise to a complex effect, which explains the observation by other authors that the NCD may give values greater than 1. The model is tested experimentally and fits well. Finally, the influence of noise on the clustering of files of different types is explored, finding that the NCD performs well even in the presence of quite high noise levels.
Multi-Document Summarization by Information Distance
Cited by 2 (1 self)
We are now living in a world where information is growing and updating quickly. Knowledge can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined as the one that has the minimal information distance to the entire document set, and the best update summary is the one that has the minimal conditional information distance to a document cluster given that a prior document cluster has already been read. We propose two methods to approximate information distance between two documents, one by compression and the other by coding theory. Experiments on the DUC 2007 dataset and the TAC 2008 dataset have proved that our method closely correlates with the human-written summaries and outperforms LexRank in many categories under the ROUGE evaluation criterion.
Tsinghua University at the summarization track of TAC 2008
"... The three authors should be all first-authors. ..."
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance
Cited by 2 (0 self)
Automatic plagiarism detection considering a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential source. Publications on this task often assume that the search space (the set of reference documents) is a narrow set where any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, for which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a prior search space reduction stage, based on the Kullback-Leibler symmetric distance, reduces the search process time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
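The search space reduction idea in this abstract can be sketched roughly as follows: build a word distribution for each document, compute a symmetric Kullback-Leibler distance between the suspicious text and every reference document, and keep only the closest references for the later exhaustive comparison. The whitespace tokenization, unigram features, additive smoothing (`alpha`), the `keep` cutoff and all function names are assumptions for illustration; the abstract does not specify how the paper builds its distributions.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary.

    Assumption: additive smoothing keeps every probability non-zero so
    the logarithm below is always defined.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def symmetric_kl(p, q):
    """Symmetric KL distance: KL(p||q) + KL(q||p), written as one sum."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def reduce_search_space(suspicious, references, keep=2):
    """Return the `keep` reference documents closest to the suspicious text."""
    vocabulary = set(suspicious.lower().split())
    for ref in references:
        vocabulary.update(ref.lower().split())
    p = word_distribution(suspicious, vocabulary)
    scored = [
        (symmetric_kl(p, word_distribution(ref, vocabulary)), ref)
        for ref in references
    ]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [ref for _, ref in scored[:keep]]
```

Only the documents returned by `reduce_search_space` would then go through the more expensive word n-gram comparison.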
An Application of Information Theory to Intrusion Detection - Proceedings of the Fourth IEEE International Workshop on Information Assurance (IWIA’06), 2006
Cited by 1 (0 self)
Zero-day attacks, new (anomalous) attacks exploiting previously unknown system vulnerabilities, are a serious threat. Defending against them is no easy task, however. Having identified “degree of system knowledge” as one difference between legitimate and illegitimate users, theorists have drawn on information theory as a basis for intrusion detection. In particular, Kolmogorov complexity (K) has been used successfully. In this work, we consider information distance (Observed K − Expected K) as a method of detecting system scans. Observed K is computed directly; Expected K is taken from compression tests shared herein. Results are encouraging. Observed scan traffic has an information distance at least an order of magnitude greater than the threshold value we determined for normal Internet traffic. With 320 KB packet blocks, separation between distributions appears to exceed 4σ.
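The quantity Observed K − Expected K described in this abstract can be illustrated with a rough compression-based sketch. Here K is estimated by the zlib-compressed length of a traffic block, and Expected K is derived from the mean compression ratio of baseline blocks. The compressor, the ratio-based baseline and the function names are all assumptions for this example; the abstract does not state how the authors compute either term, and no threshold value is given here.

```python
import zlib


def estimate_k(block: bytes) -> int:
    """Compressed length as a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(block, 9))


def expected_k(block_size: int, baseline_blocks) -> float:
    """Expected compressed length for a block of `block_size` bytes.

    Assumption: the expectation is the mean compression ratio observed
    on baseline (normal) traffic, scaled to the block size.
    """
    ratios = [estimate_k(b) / len(b) for b in baseline_blocks]
    return block_size * sum(ratios) / len(ratios)


def information_distance(block: bytes, baseline_blocks) -> float:
    """Signed difference Observed K - Expected K for one block."""
    return estimate_k(block) - expected_k(len(block), baseline_blocks)
```

A detector built on this sketch would compare the magnitude of `information_distance` against a threshold calibrated on normal traffic.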
A Review Selection Approach for Accurate Feature Rating Estimation
Cited by 1 (0 self)
In this paper, we propose a review selection approach towards accurate estimation of feature ratings for services on participatory websites where users write textual reviews for these services. Our approach selects reviews that comprehensively talk about a feature of a service by using information distance of the reviews on the feature. The rating estimation of the feature for these selected reviews using machine learning techniques provides more accurate results than that for other reviews. The average of these estimated feature ratings also better represents an accurate overall rating for the feature of the service, which provides useful feedback for other users when choosing services that satisfy them.

