Results 1-10 of 44
On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration
SIGKDD'02, 2002
Cited by 220 (50 self)
... mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real-world datasets, or the variance that would have been observed by changing minor (unstated) implementation details. To illustrate our point
Towards Parameter-Free Data Mining
In: Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004
Cited by 122 (20 self)
Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
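To make the "dozen or so lines of code" claim concrete, here is a minimal sketch of a compression-based dissimilarity built on Python's `zlib`. The abstract does not spell out the measure, so the ratio used here (compressed size of the concatenation divided by the sum of the individual compressed sizes) is an assumption about one common formulation, and the names `compressed_size` and `cdm` are illustrative only.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Assumed dissimilarity: C(xy) / (C(x) + C(y)).

    Smaller values mean the compressor found structure shared by x and y.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


# Toy inputs: two identical periodic sequences and one pseudo-random sequence.
rng = random.Random(0)
periodic = b"0123456789" * 200
also_periodic = b"0123456789" * 200
noise = bytes(rng.randrange(256) for _ in range(2000))

print(cdm(periodic, also_periodic))  # related inputs: lower ratio
print(cdm(periodic, noise))          # unrelated inputs: ratio close to 1
```

The only tunable choice is the compressor itself, which is what makes the approach nearly parameter-free. Real-valued time series would first have to be serialized or discretized into bytes, a step this sketch leaves out.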
Finding Motifs in Time Series
2002
Cited by 72 (15 self)
The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs," because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this work we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real-world datasets.
A Unified Framework for Model-based Clustering
Journal of Machine Learning Research, 2003
Cited by 56 (6 self)
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the Expectation-Maximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize critical/important distinctions between similarity-based approaches and model-based approaches.
Experiencing SAX: A Novel Symbolic Representation of Time Series
Data Mining and Knowledge Discovery Journal, 2007
Cited by 51 (13 self)
Many high-level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction,
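As an illustration of how a symbolic representation can reduce dimensionality, the sketch below follows the usual textbook description of SAX: z-normalize the series, average it over equal-width segments, then map each segment mean to a letter using breakpoints that split a standard normal distribution into equiprobable regions. The abstract above does not give these details, so every step and every name here (`znormalize`, `paa`, `sax_word`) should be read as an assumption rather than the authors' exact procedure.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale to zero mean and unit standard deviation (constant series map to zeros)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(value - mu) / sigma for value in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    if len(series) % segments:
        raise ValueError("series length must be a multiple of the segment count")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax_word(series, segments, alphabet="abcd"):
    """Map a numeric series to a word of `segments` letters."""
    k = len(alphabet)
    # Breakpoints that cut the standard normal into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    reduced = paa(znormalize(series), segments)
    return "".join(alphabet[bisect_right(breakpoints, m)] for m in reduced)


# A 16-point ramp is reduced to a 4-letter word.
print(sax_word(list(range(16)), segments=4))
```

Here a 16-point series shrinks to four symbols, which is the kind of dimensionality reduction the abstract refers to. The segment count and alphabet size are the two knobs a user would still have to choose.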
Mining Motifs in Massive Time Series Databases
In Proceedings of IEEE International Conference on Data Mining (ICDM'02), 2002
Cited by 30 (0 self)
The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs," because of their close analogy to their discrete counterparts in computational biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification.
Mixtures of ARMA Models for Model-Based Time Series Clustering
In Proceedings of the IEEE International Conference on Data Mining, 2002
Cited by 19 (1 self)
Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-dimensional representations of data patterns. In this paper, we study the clustering of data patterns that are represented as sequences or time series possibly of different lengths. We propose a model-based approach to this problem using mixtures of autoregressive moving average (ARMA) models. We derive an expectation-maximization (EM) algorithm for learning the mixing coefficients as well as the parameters of the component models. The algorithm can determine the number of clusters in the data automatically. Experiments were conducted on a number of simulated and real datasets. Results from the experiments show that our method compares favorably with another method recently proposed by others for similar time series clustering problems.
A Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series
In Proc. Workshop on Clustering High Dimensionality Data and Its Applications, 2003
Cited by 19 (3 self)
The emergence of the field of data mining in the last decade has sparked an increasing interest in clustering of time series. Although there has been much research on clustering in general, most classic machine learning and data mining algorithms do not work well for time series due to their unique structure. In particular, the high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time series data present a difficult challenge. In this work we address these challenges by introducing a novel anytime version of the k-Means clustering algorithm for time series. The algorithm works by leveraging the multi-resolution property of wavelets. In particular, an initial clustering is performed with a very coarse resolution representation of the data. The results obtained from this "quick and dirty" clustering are used to initialize a clustering at a slightly finer level of approximation. This process is repeated until the clustering results stabilize or until the "approximation" is the raw data. In addition to casting k-Means as an anytime algorithm, our approach has two other very unintuitive properties. The quality of the clustering is often better than the batch algorithm, and even if the algorithm is run to completion, the time taken is typically much less than the time taken by the original algorithm. We explain, and empirically demonstrate, these surprising and desirable properties with comprehensive experiments on several publicly available real data sets.
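The coarse-to-fine loop described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes all series share a power-of-two length, uses block averages (the Haar approximation coefficients up to scaling) as the multi-resolution representation, and carries centers to the next level by repeating each coordinate twice. The function names and the seeded random initialization are hypothetical choices made for the example.

```python
import random


def haar_approximation(series, length):
    """Average consecutive equal-sized blocks so that `length` values remain."""
    block = len(series) // length
    return [sum(series[i * block:(i + 1) * block]) / block for i in range(length)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the supplied centers."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def coarse_to_fine_kmeans(dataset, k, seed=0):
    """Yield (resolution, labels) after each level, from 2 values up to the raw length."""
    n = len(dataset[0])  # assumed to be a power of two, at least 2
    rng = random.Random(seed)
    length = 2
    coarse = [haar_approximation(s, length) for s in dataset]
    centers = [list(p) for p in rng.sample(coarse, k)]
    while True:
        labels, centers = kmeans(coarse, centers)
        yield length, labels  # anytime property: a usable answer at every level
        if length == n:
            break
        length *= 2
        coarse = [haar_approximation(s, length) for s in dataset]
        # Project the centers to the finer level by duplicating each coordinate.
        centers = [[v for v in center for _ in range(2)] for center in centers]


dataset = [[0.0] * 8, [1.0] * 8, [10.0] * 8, [11.0] * 8]
for resolution, labels in coarse_to_fine_kmeans(dataset, k=2):
    print(resolution, labels)
```

Because the function is a generator, a caller can stop consuming results as soon as the labels stop changing between levels, which corresponds to the "until the clustering results stabilize" stopping rule in the abstract.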
Time-series Bitmaps: A Practical Visualization Tool for Working with Large Time Series Databases
In Proceedings of SIAM International Conference on Data Mining (SDM '05)
Cited by 19 (8 self)
The increasing interest in time series data mining in the last decade has resulted in the introduction of a variety of similarity measures, representations, and algorithms. Surprisingly, this massive research effort has had little impact on real-world applications. Real-world practitioners who work with time series on a daily basis rarely take advantage of the wealth of tools that the data mining community has made available. In this work, we attempt to address this problem by introducing a simple parameter-light tool that allows users to efficiently navigate through large collections of time series. Our system has the unique advantage that it can be embedded directly into any standard graphical user interface, such as Microsoft Windows, thus making deployment easier. Our approach extracts features from a time series of arbitrary length and uses information about the relative frequency of its features to color a bitmap in a principled way. By visualizing the similarities and differences within a collection of bitmaps, a user can quickly discover clusters, anomalies, and other regularities within their data collection. We demonstrate the utility of our approach with a set of comprehensive experiments on real datasets from a variety of domains.
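A rough sketch of the frequency-to-bitmap idea follows. The abstract does not say which features are counted or how cells are laid out, so this example assumes the series has already been discretized into a string over a four-letter alphabet, counts overlapping subwords of a fixed length, and places each subword in a grid by recursive quadrant splitting. `ALPHABET`, `subword_frequencies`, and `bitmap` are hypothetical names, and the resulting intensities would still need a color map to become an image.

```python
from collections import Counter

ALPHABET = "abcd"  # assumed four-symbol alphabet


def subword_frequencies(symbols, length=2):
    """Count every overlapping subword of the given length."""
    return Counter(symbols[i:i + length] for i in range(len(symbols) - length + 1))


def bitmap(symbols, length=2):
    """Return a square grid of intensities in [0, 1], one cell per possible subword."""
    side = 2 ** length
    grid = [[0.0] * side for _ in range(side)]
    counts = subword_frequencies(symbols, length)
    peak = max(counts.values(), default=0)
    for word, count in counts.items():
        row = col = 0
        for ch in word:
            idx = ALPHABET.index(ch)
            # Each symbol selects one quadrant: a, b on top; c, d on the bottom.
            row = row * 2 + idx // 2
            col = col * 2 + idx % 2
        grid[row][col] = count / peak  # relative frequency drives the intensity
    return grid


for row in bitmap("abab"):
    print(row)
```

Two series can then be compared by looking at, or differencing, their grids. Since the grid size depends only on the subword length, series of different lengths produce bitmaps of the same shape.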