Results 1–10 of 17
Probabilistic discovery of time series motifs
, 2003
Cited by 119 (21 self)
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
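The biosequence-inspired approach the abstract alludes to can be sketched as random projection over a discretized time series: repeatedly mask a few symbol positions and hash the rest, so subsequences that match everywhere except at noisy ("don't care") positions still collide. A minimal illustrative sketch; all function names and parameters are ours, not the paper's:

```python
import random
from collections import defaultdict

def discretize(series, word_len=8, alphabet="abcd"):
    """Crude SAX-like discretization: split a subsequence into word_len
    segments and map each segment mean to a symbol by its rank."""
    seg = len(series) // word_len
    means = [sum(series[i*seg:(i+1)*seg]) / seg for i in range(word_len)]
    lo, hi = min(means), max(means)
    span = (hi - lo) or 1.0
    return "".join(alphabet[min(int((m - lo) / span * len(alphabet)),
                                len(alphabet) - 1)] for m in means)

def projection_motif_candidates(series, sub_len=16, iters=10, mask_k=2, seed=0):
    """Randomly mask mask_k symbol positions per iteration and hash the
    remaining symbols; subsequence pairs that collide often are likely
    motif candidates, even with noise at the masked positions."""
    rng = random.Random(seed)
    words = [discretize(series[i:i+sub_len])
             for i in range(len(series) - sub_len + 1)]
    collisions = defaultdict(int)
    for _ in range(iters):
        keep = sorted(rng.sample(range(len(words[0])), len(words[0]) - mask_k))
        buckets = defaultdict(list)
        for idx, w in enumerate(words):
            buckets["".join(w[p] for p in keep)].append(idx)
        for bucket in buckets.values():
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    if abs(bucket[a] - bucket[b]) >= sub_len:  # skip trivial matches
                        collisions[(bucket[a], bucket[b])] += 1
    # most frequent colliders first; these are the candidate motif pairs
    return sorted(collisions.items(), key=lambda kv: -kv[1])
```

Because a pair of truly similar subsequences collides in almost every iteration while chance collisions are rare, candidates surface early and sharpen over time, which is what gives the method its anytime character.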
SpeedBoost: Anytime Prediction with Uniform Near-Optimality
Cited by 5 (0 self)
We present SpeedBoost, a natural extension of functional gradient descent, for learning anytime predictors, which automatically trade computation time for predictive accuracy by selecting from a set of simpler candidate predictors. These anytime predictors not only generate approximate predictions rapidly, but are capable of using extra resources at prediction time, when available, to improve performance. We also demonstrate how our framework can be used to select weak predictors which target certain subsets of the data, allowing for efficient use of computational resources on difficult examples. We also show that variants of the SpeedBoost algorithm produce predictors which are provably competitive with any possible sequence of weak predictors with the same total complexity.
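The trade of computation time for accuracy can be illustrated by a greedy schedule that repeatedly adds the weak predictor with the best marginal loss reduction per unit cost; truncating the schedule at any prediction-time deadline yields an anytime predictor. A sketch of the idea under squared loss, not the paper's exact algorithm:

```python
def squared_loss(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def speedboost_schedule(weak, costs, xs, ys, budget):
    """Greedily order (predictor, cost) pairs by marginal loss reduction
    per unit cost until the budget is spent. The ensemble prediction is
    the sum of the selected weak predictions; the returned ordering can
    be cut off at any deadline."""
    current = [0.0] * len(xs)              # running ensemble predictions
    base = squared_loss(current, ys)
    schedule, spent = [], 0.0
    remaining = list(range(len(weak)))
    while remaining:
        best = None
        for i in remaining:
            if spent + costs[i] > budget:  # cannot afford this predictor
                continue
            trial = [c + weak[i](x) for c, x in zip(current, xs)]
            gain = (base - squared_loss(trial, ys)) / costs[i]
            if best is None or gain > best[0]:
                best = (gain, i, trial)
        if best is None or best[0] <= 0:   # nothing affordable helps
            break
        _, i, current = best
        base = squared_loss(current, ys)
        schedule.append(i)
        spent += costs[i]
        remaining.remove(i)
    return schedule
```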
Fast best-match shape searching in rotation-invariant metric spaces
 IEEE Transactions on Multimedia
Cited by 3 (2 self)
Object recognition and content-based image retrieval systems rely heavily on the accurate and efficient identification of shapes. A fundamental requirement in the shape analysis process is that shape similarities should be computed invariantly to basic geometric transformations, e.g. scaling, shifting, and most importantly, rotations. While scale and shift invariance are easily achievable through a suitable shape representation, rotation invariance is much harder to deal with. In this work we explore the metric properties of rotation-invariant distance measures and propose an algorithm for fast similarity search in the shape space. The algorithm can be utilized in a number of important data mining tasks such as shape clustering and classification, or for discovering motifs and discords in image collections. The technique is demonstrated to introduce a dramatic speedup over the current approaches, and is guaranteed to introduce no false dismissals.
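The rotation-invariant distance at the heart of the problem is typically the minimum, over all circular shifts of one shape signature, of an ordinary distance. The quadratic brute force below is the baseline such search methods accelerate (an illustrative sketch):

```python
def rotation_invariant_dist(a, b):
    """Rotation-invariant distance between two equal-length shape
    signatures: the minimum Euclidean distance over all circular shifts
    of b. O(n^2) brute force; fast search methods aim to prune most of
    these comparisons without introducing false dismissals."""
    n = len(a)
    best = float("inf")
    for s in range(n):
        d = sum((a[i] - b[(i + s) % n]) ** 2 for i in range(n)) ** 0.5
        best = min(best, d)
    return best
```

Any rotation of a shape's signature is a circular shift of the sequence, so minimizing over shifts makes the comparison independent of the starting point of the contour.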
Polishing the Right Apple: Anytime Classification Also Benefits Data Streams with Constant Arrival Times
Cited by 3 (1 self)
Classification of items taken from data streams requires algorithms that operate in time-sensitive and computationally constrained environments. Often, the available time for classification is not known a priori and may change as a consequence of external circumstances. Many traditional algorithms are unable to provide satisfactory performance while supporting the highly variable response times that exemplify such applications. In such contexts, anytime algorithms, which are amenable to trading time for accuracy, have been found to be exceptionally useful and constitute an area of increasing research activity. Previous techniques for improving anytime classification have generally been concerned with optimizing the probability of correctly classifying individual objects. However, as we shall see, serially optimizing the probability of correctly classifying individual objects K times generally gives inferior results to batch optimizing the probability of correctly classifying K objects. In this work, we show that this simple observation can be exploited to improve overall classification performance by using an anytime framework to allocate resources among a set of objects buffered from a fast arriving stream. Our ideas are independent of object arrival behavior; and, perhaps unintuitively, even in data streams with constant arrival rates our technique exhibits a marked improvement in performance. The utility of our approach is demonstrated with extensive experimental evaluations conducted on a wide range of diverse datasets. Keywords: anytime algorithms; classification; nearest neighbor; streaming data
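The batch idea can be sketched as interleaving nearest-neighbor scans for all buffered objects and always spending the next distance computation on the object that is currently most uncertain, rather than finishing objects one at a time. The allocation heuristic below is ours, for illustration only, not the paper's policy:

```python
def batch_anytime_knn(batch, train, steps):
    """Allocate a fixed number of distance computations among a batch of
    buffered objects. Each object runs an interruptible 1-NN scan over
    the training data; the next computation goes to the object whose
    current nearest neighbor is farthest (i.e., most uncertain)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    state = [{"i": 0, "best": float("inf"), "label": None} for _ in batch]
    for _ in range(steps):
        # unfinished objects only
        live = [k for k, s in enumerate(state) if s["i"] < len(train)]
        if not live:
            break
        # most uncertain object gets the next distance computation
        k = max(live, key=lambda k: state[k]["best"])
        s = state[k]
        x, y = train[s["i"]]
        d = dist(batch[k], x)
        if d < s["best"]:
            s["best"], s["label"] = d, y
        s["i"] += 1
    return [s["label"] for s in state]
```

Spending the whole budget on one object polishes an answer that is likely already correct; spreading it across the batch improves the objects where extra computation actually changes the outcome.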
Indexing density models for incremental learning and anytime classification on data streams
 In 12th EDBT/ICDT
, 2009
Cited by 3 (0 self)
Classification of streaming data faces three basic challenges: it has to deal with huge amounts of data, the varying time between two stream data items must be used as well as possible (anytime classification), and additional training data must be incrementally learned (anytime learning) for the classifier to be applied consistently to fast data streams. In this work, we propose a novel index-based technique that can handle all three of the above challenges using the established Bayes classifier on effective kernel density estimators. Our novel Bayes tree automatically generates a hierarchy of mixture densities, adapted efficiently to the individual object to be classified, that represent kernel density estimators at successively coarser levels. Our probability density queries, together with novel classification improvement strategies, provide the necessary information for very effective classification at any point of interruption. Moreover, we propose a novel evaluation method for anytime classification using Poisson streams and demonstrate the anytime learning performance of the Bayes tree.
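The statistical core the Bayes tree builds on is Bayes classification over per-class kernel density estimates. The flat, uninterruptible version can be sketched as follows; the tree's contribution is answering this query at successively coarser mixture levels so it can be interrupted at any time:

```python
import math

def kde_bayes_classify(x, train, bandwidth=1.0):
    """Bayes classification of a scalar x with per-class Gaussian kernel
    density estimates: pick the class maximizing prior * density.
    This is the exact, flat computation the hierarchical index
    approximates; names and layout here are illustrative."""
    classes = {}
    for xi, y in train:
        classes.setdefault(y, []).append(xi)
    n = len(train)
    best_label, best_score = None, -1.0
    for y, pts in classes.items():
        prior = len(pts) / n
        # average of Gaussian kernels centered on the class's points
        dens = sum(math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
                   for p in pts) / (len(pts) * bandwidth * math.sqrt(2 * math.pi))
        if prior * dens > best_score:
            best_label, best_score = y, prior * dens
    return best_label
```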
Anytime learning of anycost classifiers
Cited by 2 (0 self)
The classification of new cases using a predictive model incurs two types of costs: testing costs and misclassification costs. Recent research efforts have resulted in several novel algorithms that attempt to produce learners that simultaneously minimize both types. In many real-life scenarios, however, we cannot afford to conduct all the tests required by the predictive model. For example, a medical center might have a fixed predetermined budget for diagnosing each patient. For cost-bounded classification, decision trees are considered attractive as they measure only the tests along a single path. In this work we present an anytime framework for producing decision-tree-based classifiers that can make accurate decisions within a strict bound on testing costs. These bounds can be known to the learner, known to the classifier but not to the learner, or not predetermined. Extensive experiments with a variety of datasets show that our proposed framework produces trees with lower misclassification costs along a wide range of testing cost bounds.
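Classification under a hard testing budget with a cost-annotated decision tree can be sketched as: traverse while the next test is affordable, otherwise fall back to the current node's majority label. The dict-based node layout is an illustrative assumption, not the paper's data structure:

```python
def classify_within_budget(node, example, budget):
    """Classify with a cost-annotated decision tree under a hard testing
    budget. Internal nodes carry a test, its cost, and a majority label;
    traversal stops and returns the majority label as soon as the next
    test would exceed the remaining budget."""
    while "test" in node:
        if node["cost"] > budget:
            return node["majority"]      # cannot afford the next test
        budget -= node["cost"]           # pay for the test we run
        branch = node["test"](example)
        node = node["children"][branch]
    return node["label"]                 # reached a leaf within budget
```

Because only the tests along a single root-to-leaf path are charged, a tree built to keep expensive tests deep degrades gracefully as the budget shrinks.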
Time Series Classification under More Realistic Assumptions
Cited by 2 (2 self)
Most literature on time series classification assumes that the beginning and ending points of the pattern of interest can be correctly identified, both during the training phase and later deployment. In this work, we argue that this assumption is unjustified, and this has in many cases led to unwarranted optimism about the performance of the proposed algorithms. As we shall show, the task of correctly extracting individual gait cycles, heartbeats, gestures, behaviors, etc., is generally much more difficult than the task of actually classifying those patterns. We propose to mitigate this problem by introducing an alignment-free time series classification framework. The framework requires only very weakly annotated data, such as “in this ten minutes of data, we see mostly normal heartbeats...,” and by generalizing the classic machine learning idea of data editing to streaming/continuous data, allows us to build robust, fast and accurate classifiers. We demonstrate on several diverse real-world problems that beyond removing unwarranted assumptions and requiring essentially no human intervention, our framework is both significantly faster and significantly more accurate than current state-of-the-art approaches.
Mining Massive Archives of Mice Sounds with Symbolized Representations
Cited by 1 (1 self)
Many animals produce long sequences of vocalizations best described as “songs.” In some animals, such as crickets and frogs, these songs are relatively simple and repetitive chirps or trills. However, animals as diverse as whales, bats, birds and even the humble mice considered here produce intricate and complex songs. These songs are worthy of study in their own right. For example, the study of bird songs has helped to cast light on various questions in the nature vs. nurture debate. However, there is a particular reason why the study of mice songs can benefit mankind. The house mouse (Mus musculus) has long been an important model organism in biology and medicine, and it is by far the most commonly used genetically altered laboratory mammal to address human diseases. While there have been significant recent efforts to analyze mice songs, advances in sensor technology have created a situation where our ability to collect data far outstrips our ability to analyze it. In this work we argue that the time is ripe for archives of mice songs to fall into the purview of data mining. We show a novel technique for mining mice vocalizations directly in the visual (spectrogram) space that practitioners currently use. Working in this space allows us to bring an arsenal of data mining tools to bear on this important domain, including similarity search, classification, motif discovery and contrast set mining.
The Asymmetric Approximate Anytime Join: A New Primitive with Applications to Data Mining
Cited by 1 (0 self)
It has long been noted that many data mining algorithms can be built on top of join algorithms. This has led to a wealth of recent work on efficiently supporting such joins with various indexing techniques. However, there are many applications which are characterized by two special conditions: first, the two datasets to be joined are of radically different sizes, a situation we call an asymmetric join. Second, the two datasets are not, and possibly cannot be, indexed for some reason. In such circumstances the time complexity is proportional to the product of the number of objects in each of the two datasets, an untenable proposition in most cases. In this work we make two contributions to mitigate this situation. We argue that for many applications, an exact solution to the problem is not required, and we show that by framing the problem as
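An anytime treatment of the asymmetric join can be sketched as streaming through the large, unindexed dataset in random order while maintaining a monotonically improving best-match table for the small one; interrupting at any point yields a usable approximate join. An illustrative sketch, where the deadline callable is our stand-in for any external interruption:

```python
import random

def anytime_join(small, large, time_left, seed=0):
    """Asymmetric anytime join: keep a (distance, index) best-match
    entry for each item of the small dataset and stream through the
    large dataset in random order. The answer only improves over time
    and is usable whenever time_left() first returns False."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = [(float("inf"), None)] * len(small)
    order = list(range(len(large)))
    random.Random(seed).shuffle(order)   # random order avoids ordering bias
    for j in order:
        if not time_left():
            break                        # interrupted: return current table
        for i, a in enumerate(small):
            d = dist(a, large[j])
            if d < best[i][0]:
                best[i] = (d, j)
    return best
```

Run to completion this is the exact nested-loop join; the point is that the cost is paid incrementally, with the most useful work (for a randomly ordered scan) done early.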
A Novel Approximation to Dynamic Time Warping allows Anytime Clustering of Massive Time Series Datasets
Cited by 1 (0 self)
Given the ubiquity of time series data, the data mining community has spent significant time investigating the best time series similarity measure to use for various tasks and domains. After more than a decade of extensive efforts, there is increasing evidence that Dynamic Time Warping (DTW) is very difficult to beat. Given that, recent efforts have focused on making the intrinsically slow DTW algorithm faster. For the similarity-search task, an important subroutine in many data mining algorithms, significant progress has been made by replacing the vast majority of expensive DTW calculations with cheap-to-compute lower bound calculations. However, these lower-bound-based optimizations do not directly apply to clustering, and thus for some realistic problems, clustering with DTW can take days or weeks. In this work, we show that we can mitigate this untenable lethargy by casting DTW clustering as an anytime algorithm. At the heart of our algorithm is a novel data-adaptive approximation to DTW which can be quickly computed, and which produces approximations to DTW that are much better than the best currently known linear-time approximations. We demonstrate our ideas on real-world problems showing that we can get virtually all the accuracy of a batch DTW clustering algorithm in a fraction of the time.
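For reference, the exact quadratic-time DTW that makes batch clustering so slow, and that any fast approximation stands in for, is the classic dynamic program:

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance.
    D[i][j] is the cheapest cost of warping a[:i] onto b[:j]; each cell
    extends the best of the three adjacent alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m] ** 0.5
```

Each pairwise distance costs O(nm), and clustering needs many such pairs, which is why cheap, accurate approximations make an anytime formulation attractive.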