Results 1 - 6 of 6
Shared Information and Program Plagiarism Detection
- IEEE TRANS. INFORM. TH
Cited by 45 (1 self)
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric, based on Kolmogorov complexity, to answer this question, and have proven it to be universal. We apply this metric in measuring the amount of shared information between two computer programs, to enable plagiarism detection. We have
Automatic Meaning Discovery Using Google
- Manuscript, CWI, 2004; http://arxiv.org/abs/cs.CL/0412098
Cited by 29 (2 self)
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; we conduct a massive experiment in understanding WordNet categories; and finally we demonstrate the ability to do a simple automatic English-Spanish translation.
Using Data Compressors to Construct Rank Tests
, 2007
Cited by 1 (1 self)
New nonparametric rank tests for homogeneity and component independence are proposed, which are based on data compressors. For homogeneity testing the idea is to compress the binary string obtained by ordering the two joint samples and writing 0 if the element is from the first sample and 1 if it is from the second sample and breaking ties by randomization (extension to the case of multiple samples is straightforward). H0 should be rejected if the string is compressed (to a certain degree) and accepted otherwise. We show that such a test obtained from an ideal data compressor is valid against all alternatives. Component independence is reduced to homogeneity testing by constructing two samples, one of which is the first half of the original and the other is the second half with one of the components randomly permuted.
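The homogeneity construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: zlib stands in for the "ideal data compressor", the function names are hypothetical, and the rejection threshold is left to the caller because the abstract only says the string must be compressed "to a certain degree".

```python
import random
import zlib


def label_string(first, second, rng=None):
    """Order the joint sample and write 0 for the first sample, 1 for the second.

    Ties between equal values are broken by randomization, as in the abstract.
    """
    rng = rng or random.Random(0)
    tagged = [(x, rng.random(), "0") for x in first]
    tagged += [(y, rng.random(), "1") for y in second]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return "".join(t[2] for t in tagged)


def compressed_bits(bits):
    """Length in bits of the zlib-compressed label string (zlib is an assumption)."""
    packed = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big") if bits else b""
    return 8 * len(zlib.compress(packed, 9))


def reject_h0(bits, threshold_bits):
    """Reject homogeneity when the label string compresses below the threshold.

    threshold_bits is a caller-supplied assumption, not a value from the paper.
    """
    return compressed_bits(bits) <= threshold_bits
```

Two well-separated samples give a label string such as 000...111, which compresses far better than the interleaved string produced by samples from a common distribution.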
Exploring Automated Music Genre Classification
, 2009
This report explores the classification and clustering of audio files. The classification is based on feature extraction using the MARSYAS tool. The performance of both the Logistic Regression and Support Vector Machine models is evaluated. Within Logistic Regression, information is produced based on the probabilities of classification that may be useful to the end-user. The clustering is based on the use of the Normalised Compression Distance, and the use of an agglomerative hierarchical clustering technique. The results show that SVMs work the best out of all considered techniques. The report suggests why this might be so, and offers suggestions on improving both clustering and classification techniques.
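For readers unfamiliar with the clustering step mentioned here, the Normalised Compression Distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The sketch below assumes zlib as the compressor and a naive average-linkage merge rule; the report's actual compressor, linkage, and audio representation are not stated in this abstract, and the function names are hypothetical.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def agglomerate(items, n_clusters):
    """Naive average-linkage agglomerative clustering over pairwise NCD."""
    n = len(items)
    dist = {(i, j): ncd(items[i], items[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Average distance over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        a, b = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Calling `agglomerate(files, 2)` on a list of byte strings returns index groups; real compressors only approximate Kolmogorov complexity, so distances on short inputs are noisy.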
Diploma of Advanced Studies Doctorate in Computer Science and Digital
I would like to express my gratitude to Ramon López de Mántaras and Josep Lluís Arcos. In the first place for offering me a great place to work, to study, and to meet people at the Artificial Intelligence Research Institute (IIIA) in Barcelona. Secondly, for sharing their experience through invaluable advice, throughout the time I have worked with them. And I also want to mention Lieveke van Heck. We have been together for a lot of years now and she showed patience, soothed doubts and raised
“Tell Me More”: Finding Related Items from User Provided Feedback.
The results returned by a search, datamining or database engine often contain an overload of potentially interesting information. A daunting and challenging problem for a user is to pick out the useful information. In this paper we propose an interactive framework to efficiently explore and (re)rank the objects retrieved by such an engine, according to feedback provided on part of the initially retrieved objects. In particular, given a set of objects, a similarity measure applicable to the objects and an initial set of objects that are of interest to the user, our algorithm computes the k most similar objects. This problem, previously coined as ‘cluster on demand’ [10], is solved by transforming the data into a weighted graph. On this weighted graph we compute a relevance score between the initial set of nodes and the remaining nodes based upon random walks with restart in graphs. We apply our algorithm “Tell Me More” (TMM) on text, numerical and zero/one data. The results show that TMM for almost every experiment significantly outperforms a k-nearest neighbor approach.
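The relevance-scoring idea in this abstract, random walks with restart on a weighted graph, can be illustrated with a generic power iteration. This is a textbook sketch under stated assumptions, not the TMM algorithm itself: the restart probability, the column normalisation, the iteration count, and the function names are all illustrative choices that the abstract does not specify.

```python
def rwr_scores(weights, seeds, restart=0.15, iters=100):
    """Random walk with restart on a weighted graph given as a square matrix.

    weights[i][j] is the edge weight between nodes i and j; seeds is the set
    of nodes the user marked as interesting. The walker restarts uniformly
    at a seed node with probability `restart` (an assumed default).
    """
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    e = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    r = e[:]
    for _ in range(iters):
        r = [
            (1 - restart) * sum(
                weights[i][j] / col_sums[j] * r[j]
                for j in range(n) if col_sums[j] > 0
            ) + restart * e[i]
            for i in range(n)
        ]
    return r


def top_k(weights, seeds, k, **kwargs):
    """Return the k non-seed nodes with the highest relevance score."""
    scores = rwr_scores(weights, seeds, **kwargs)
    candidates = [i for i in range(len(weights)) if i not in seeds]
    return sorted(candidates, key=lambda i: -scores[i])[:k]
```

With a similarity matrix as `weights` and the user's chosen objects as `seeds`, `top_k(weights, seeds, k)` ranks the remaining objects by how often the restarting walker visits them.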

