Results 1 - 10
of
58
Clustering by compression
- IEEE Transactions on Information Theory
, 2005
"... Abstract—We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the l ..."
Abstract
-
Cited by 120 (12 self)
- Add to MetaCart
Abstract—We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorovcomplexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms—Heterogenous data analysis, hierarchical unsupervised clustering, Kolmogorovcomplexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance. I.
P.M.B.: The Google similarity distance
- IEEE Transactions on Knowledge and Data Engineering
, 2007
"... Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarit ..."
Abstract
-
Cited by 98 (4 self)
- Add to MetaCart
Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in an a mean agreement of 87 percent with the expert crafted WordNet categories. Index Terms—Accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the Web, parameter-free data mining, universal similarity metric. Ç 1
Anomaly Detection: A Survey
, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract
-
Cited by 69 (1 self)
- Add to MetaCart
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the di®erent directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Streaming Pattern Discovery in Multiple Time-Series
- In VLDB
, 2005
"... In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream col ..."
Abstract
-
Cited by 50 (14 self)
- Add to MetaCart
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection.
Hot sax: Efficiently finding the most unusual time series subsequence
, 2005
"... In this work, we introduce the new problem of finding time series discords. Time series discords are subsequences of a longer time series that are maximally different to all the rest of the time series subsequences. They thus capture the sense of the most unusual subsequence within a time series. Ti ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
In this work, we introduce the new problem of finding time series discords. Time series discords are subsequences of a longer time series that are maximally different to all the rest of the time series subsequences. They thus capture the sense of the most unusual subsequence within a time series. Time series discords have many uses for data mining, including improving the quality of clustering, data cleaning, summarization, and anomaly detection. As we will show, discords are particularly attractive as anomaly detectors because they only require one intuitive parameter (the length of the subsequence) unlike most anomaly detection algorithms that typically require many parameters. We evaluate our work with a comprehensive set of experiments. In particular, we demonstrate the utility of discords with objective experiments on domains as diverse as Space Shuttle telemetry monitoring, medicine, surveillance, and industry, and we demonstrate the effectiveness of our discord discovery algorithm with more than one million experiments, on 82 different datasets from diverse domains.
Trajectory Clustering: A Partition-and-Group Framework
- In SIGMOD
, 2007
"... Existing trajectory clustering algorithms group similar trajectories as a whole, thus discovering common trajectories. Our key observation is that clustering trajectories as a whole could miss common sub-trajectories. Discovering common sub-trajectories is very useful in many applications, especiall ..."
Abstract
-
Cited by 42 (9 self)
- Add to MetaCart
Existing trajectory clustering algorithms group similar trajectories as a whole, thus discovering common trajectories. Our key observation is that clustering trajectories as a whole could miss common sub-trajectories. Discovering common sub-trajectories is very useful in many applications, especially if we have regions of special interest for analysis. In this paper, we propose a new partition-and-group framework for clustering trajectories, which partitions a trajectory into a set of line segments, and then, groups similar line segments together into a cluster. The primary advantage of this framework is to discover common sub-trajectories from a trajectory database. Based on this partition-and-group framework, we develop a trajectory clustering algorithm TRA-CLUS. Our algorithm consists of two phases: partitioning and grouping. For the first phase, we present a formal trajectory partitioning algorithm using the minimum description length (MDL) principle. For the second phase, we present a density-based line-segment clustering algorithm. Experimental results demonstrate that TRACLUS correctly discovers common sub-trajectories from real trajectory data.
Spam filtering using statistical data compression models
- Journal of Machine Learning Research
, 2006
"... Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task call ..."
Abstract
-
Cited by 33 (12 self)
- Add to MetaCart
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms; dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
SMArTIC: Toward building an accurate, robust and scalable specification miner
- In SIGSOFT FSE
, 2006
"... Improper management of software evolution, compounded by imprecise, and changing requirements, along with the “short time to market ” requirement, commonly leads to a lack of up-to-date specifications. This can result in software that is characterized by bugs, anomalies and even security threats. So ..."
Abstract
-
Cited by 29 (12 self)
- Add to MetaCart
Improper management of software evolution, compounded by imprecise, and changing requirements, along with the “short time to market ” requirement, commonly leads to a lack of up-to-date specifications. This can result in software that is characterized by bugs, anomalies and even security threats. Software specification mining is a new technique to address this concern by inferring specifications automatically. In this paper, we propose a novel API specification mining architecture called SMArTIC (Specification Mining Architecture with Trace fIltering and Clustering) to improve the accuracy, robustness and scalability of specification miners. This architecture is constructed based on two hypotheses: (1) Erroneous traces should be pruned from the input traces to a miner, and (2) Clustering related traces will localize inaccuracies and reduce over-generalizationin learning. Correspondingly, SMArTIC comprises four components: an erroneous-trace filtering block, a related-trace clustering block, a learner, and a merger. We show through experiments that the quality of specification mining can be significantly improved using SMArTIC.
Experiencing SAX: a Novel Symbolic Representation of Time Series
, 2007
"... Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality ..."
Abstract
-
Cited by 21 (7 self)
- Add to MetaCart
Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our
Modeling Multiple Time Series for Anomaly Detection
- FIFTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2005
"... Our goal is to generate comprehensible and accurate models from multiple time series for anomaly detection. The models need to produce anomaly scores in an online manner for real-life monitoring tasks. We introduce three algorithms that work in a constructed feature space and evaluate them with a re ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
Our goal is to generate comprehensible and accurate models from multiple time series for anomaly detection. The models need to produce anomaly scores in an online manner for real-life monitoring tasks. We introduce three algorithms that work in a constructed feature space and evaluate them with a real data set from the NASA shuttle program. Our offline and online evaluations indicate that our algorithms can be more accurate than two existing algorithms.

