Results 1 - 10 of 82
Robust and fast similarity search for moving object trajectories
In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005
"... An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The ex-isting distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, dis ..."
Cited by 155 (14 self)
Abstract: An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR), which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods.
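The EDR recurrence can be sketched in a few lines. The following Python is an illustrative reading of the abstract's definition: two points match (cost 0) when they differ by at most a threshold epsilon, and every substitution or gap costs one edit. The 1-D simplification and the epsilon default are our assumptions, not the paper's exact formulation.

    # Minimal sketch of Edit Distance on Real sequence (EDR) for 1-D series.
    def edr(r, s, epsilon=0.5):
        m, n = len(r), len(s)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i  # deleting a whole prefix costs 1 per point
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                subcost = 0 if abs(r[i - 1] - s[j - 1]) <= epsilon else 1
                dp[i][j] = min(dp[i - 1][j - 1] + subcost,  # match / substitute
                               dp[i - 1][j] + 1,            # gap
                               dp[i][j - 1] + 1)            # gap
        return dp[m][n]

    print(edr([1.0, 2.0, 3.0], [1.1, 2.9]))  # -> 1: small noise absorbed, one gap

Because the threshold quantizes each comparison to 0 or 1, a single noisy outlier changes the distance by at most one edit, which is the robustness property the abstract emphasizes.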
Towards parameter-free data mining
In Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2004
"... Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorit ..."
Cited by 145 (19 self)
Abstract: Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive with or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series/DNA/text/video datasets.
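The "dozen or so lines of code" reads roughly like the sketch below, which implements the compression-based dissimilarity measure CDM(x, y) = C(xy) / (C(x) + C(y)) from the paper; zlib is just one convenient off-the-shelf compressor, and the test strings are our own toy data.

    # Minimal sketch of compression-based dissimilarity: the compressed
    # size C(.) stands in for the (uncomputable) Kolmogorov complexity.
    import zlib

    def c(s: bytes) -> int:
        return len(zlib.compress(s, 9))

    def cdm(x: bytes, y: bytes) -> float:
        # Near 0.5 when x and y share structure (compressing them together
        # is cheap); near 1.0 when they are unrelated.
        return c(x + y) / (c(x) + c(y))

    a = b"ACGTACGTACGTACGT" * 8
    t = b"TTGGTTGGCCAACCAA" * 8
    print(cdm(a, a), cdm(a, t))  # the self-pair scores markedly lower

Beyond the choice of compressor, the measure needs no parameters at all, which is the paper's central point.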
The function space of an activity
In Proc. Comput. Vis. Pattern Recognit.
"... An activity consists of an actor performing a series of actions in a pre-defined temporal order. An action is an individual atomic unit of an activity. Different instances of the same activity may consist of varying relative speeds at which the various actions are executed, in addition to other intr ..."
Cited by 61 (11 self)
Abstract: An activity consists of an actor performing a series of actions in a pre-defined temporal order. An action is an individual atomic unit of an activity. Different instances of the same activity may consist of varying relative speeds at which the various actions are executed, in addition to other intra- and inter-person variabilities. Most existing algorithms for activity recognition are not very robust to intra- and inter-personal changes of the same activity, and are extremely sensitive to warping of the temporal axis due to variations in speed profile. In this paper, we provide a systematic approach to learn the nature of such time warps while simultaneously allowing for the variations in descriptors for actions. For each activity we learn an ‘average’ sequence that we denote as the nominal activity trajectory. We also learn a function space of time warpings for each activity separately. The model can be used to learn individual-specific warping patterns so that it may also be used for activity-based person identification. The proposed model leads us to algorithms for learning a model for each activity, clustering activity sequences, and activity recognition that are robust to temporal, intra-, and inter-person variations. We provide experimental results using two datasets.
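As a toy illustration of what a time warp does to a nominal trajectory, the sketch below resamples a sequence through a monotone warp gamma mapping [0, 1] to [0, 1]; the quadratic gamma(t) = t ** 2 is only an example, not a member of the function space the paper actually learns, and the helper name is ours.

    # Minimal sketch: apply a monotone time warp to a nominal trajectory
    # by resampling it with linear interpolation at warped time instants.
    def warp_sequence(nominal, gamma, num_samples=None):
        n = len(nominal)
        num_samples = num_samples or n
        out = []
        for k in range(num_samples):
            t = gamma(k / (num_samples - 1))  # warped time in [0, 1]
            pos = t * (n - 1)
            lo = int(pos)
            hi = min(lo + 1, n - 1)
            frac = pos - lo
            out.append((1 - frac) * nominal[lo] + frac * nominal[hi])
        return out

    nominal = [0.0, 1.0, 4.0, 9.0, 16.0]
    # gamma(t) = t**2 spends more time on the early actions ("slow start").
    print(warp_sequence(nominal, lambda t: t ** 2))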
Three Myths about Dynamic Time Warping Data Mining
In Proceedings of the SIAM International Conference on Data Mining, 2005
"... The Dynamic Time Warping (DTW) distance measure is a technique that has long been known in speech recognition community. It allows a non-linear mapping of one signal to another by minimizing the distance between the two. A decade ago, DTW was introduced into Data Mining community as a utility for va ..."
Cited by 43 (14 self)
Abstract: The Dynamic Time Warping (DTW) distance measure is a technique that has long been known in the speech recognition community. It allows a non-linear mapping of one signal to another by minimizing the distance between the two. A decade ago, DTW was introduced into the Data Mining community as a utility for various time series tasks, including classification, clustering, and anomaly detection. The technique has flourished, particularly in the last three years, and has been applied to a variety of problems in various disciplines. In spite of DTW’s great success, there are still several persistent “myths” about it. These myths have caused confusion and led to much wasted research effort. In this work, we will dispel these myths with the most comprehensive set of time series experiments ever conducted.
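For readers new to the measure, a textbook DTW with an optional Sakoe-Chiba-style window is sketched below; this generic formulation is ours and is not the experimental code behind the paper.

    # Minimal DTW sketch: dp[i][j] is the cheapest alignment cost of the
    # first i points of x with the first j points of y; w restricts how
    # far the warping path may stray from the diagonal (Sakoe-Chiba band).
    import math

    def dtw(x, y, w=None):
        m, n = len(x), len(y)
        w = max(w if w is not None else max(m, n), abs(m - n))
        dp = [[math.inf] * (n + 1) for _ in range(m + 1)]
        dp[0][0] = 0.0
        for i in range(1, m + 1):
            for j in range(max(1, i - w), min(n, i + w) + 1):
                cost = (x[i - 1] - y[j - 1]) ** 2
                dp[i][j] = cost + min(dp[i - 1][j - 1],  # advance both
                                      dp[i - 1][j],      # repeat y[j-1]
                                      dp[i][j - 1])      # repeat x[i-1]
        return math.sqrt(dp[m][n])

    # The second series is a time-stretched copy; DTW aligns them exactly.
    print(dtw([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1], w=2))  # -> 0.0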
A Brief Survey on Sequence Classification
"... Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selectio ..."
Cited by 31 (1 self)
Abstract: Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection. Unlike the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high, and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize sequence classification in terms of methodologies and application domains. We also provide a review of several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.
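One common way around the missing-features problem the abstract describes is to build explicit feature vectors from k-grams; the sketch below pairs such counts with a 1-nearest-neighbor rule purely for illustration (the survey covers many stronger methods, and the toy data is ours).

    # Minimal sketch: represent a sequence by its k-gram counts, then
    # classify with 1-NN under L1 distance between the count vectors.
    from collections import Counter

    def kgram_features(seq, k=2):
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def l1(f, g):
        return sum(abs(f[key] - g[key]) for key in set(f) | set(g))

    def classify(query, labeled, k=2):
        q = kgram_features(query, k)
        return min(labeled,
                   key=lambda pair: l1(q, kgram_features(pair[0], k)))[1]

    train = [("ACGTACGT", "familyA"), ("TTTTAAAA", "familyB")]
    print(classify("ACGTCCGT", train))  # -> familyA

Note that the k-gram space grows exponentially with k, which is exactly the high-dimensionality difficulty the survey highlights.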
An Efficient and Accurate Method for Evaluating Time Series Similarity
2007
"... A variety of techniques currently exist for measuring the similarity between time series datasets. Of these techniques, the methods whose matching criteria is bounded by a specified ǫ threshold value, such as the LCSS and the EDR techniques, have been shown to be robust in the presence of noise, tim ..."
Cited by 25 (1 self)
Abstract: A variety of techniques currently exist for measuring the similarity between time series datasets. Of these techniques, the methods whose matching criterion is bounded by a specified ε threshold value, such as the LCSS and the EDR techniques, have been shown to be robust in the presence of noise, time shifts, and data scaling. Our work proposes a new algorithm, called the Fast Time Series Evaluation (FTSE) method, which can be used to evaluate such threshold value techniques, including LCSS and EDR. Using FTSE, we show that these techniques can be evaluated faster than using either traditional dynamic programming or even warp-restricting methods such as the Sakoe-Chiba band and the Itakura Parallelogram. We also show that FTSE can be used in a framework that can evaluate a richer range of ε threshold-based scoring techniques, of which EDR and LCSS are just two examples. This framework, called Swale, extends the ε threshold-based scoring techniques to include arbitrary match rewards and gap penalties. Through extensive empirical evaluation, we show that Swale can obtain greater accuracy than existing methods.
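A plain dynamic program over the kind of scoring the abstract describes might look like the sketch below. The parameter names reward_m and gap_c are ours, and the whole point of FTSE is to avoid exactly this quadratic table, so this is a reference formulation only.

    # Minimal sketch of epsilon-threshold scoring with a match reward and
    # a gap penalty, in the spirit of Swale: a match within epsilon earns
    # reward_m; every skipped element pays gap_c.
    def swale_score(r, s, epsilon=0.5, reward_m=50, gap_c=-1):
        m, n = len(r), len(s)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i * gap_c
        for j in range(n + 1):
            dp[0][j] = j * gap_c
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if abs(r[i - 1] - s[j - 1]) <= epsilon:   # epsilon match
                    dp[i][j] = dp[i - 1][j - 1] + reward_m
                else:                                     # best gap choice
                    dp[i][j] = max(dp[i - 1][j] + gap_c,
                                   dp[i][j - 1] + gap_c)
        return dp[m][n]

    # With reward_m = 1 and gap_c = 0 the score reduces to LCSS length.
    print(swale_score([1.0, 2.0, 3.0], [1.1, 2.9, 3.1],
                      reward_m=1, gap_c=0))  # -> 2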
Time series knowledge mining
2006
"... An important goal of knowledge discovery is the search for patterns in data that can help explain the underlying process that generated the data. The patterns are required to be new, useful, and understandable to humans. In this work we present a new method for the understandable description of loca ..."
Cited by 20 (2 self)
Abstract: An important goal of knowledge discovery is the search for patterns in data that can help explain the underlying process that generated the data. The patterns are required to be new, useful, and understandable to humans. In this work we present a new method for the understandable description of local temporal relationships in multivariate data, called Time Series Knowledge Mining (TSKM). We define the Time Series Knowledge Representation (TSKR) as a new language for expressing temporal knowledge. The patterns have a hierarchical structure, where each level corresponds to a single temporal concept. On the lowest level, intervals are used to represent duration. Overlapping parts of intervals represent coincidence on the next level. Several such blocks of intervals are connected with a partial order relation on the highest level. Each pattern element consists of a semiotic triple to connect syntactic and semantic information with pragmatics. The patterns are very compact, but offer details for each element on demand. In comparison with related approaches, the TSKR is shown to have advantages in robustness, expressivity, and comprehensibility. Efficient algorithms for the discovery of the patterns are proposed. The search for coincidence as well as partial order can be formulated as variants of the well-known frequent itemset problem. One of the best-known algorithms for this problem is therefore adapted for our purposes. Human interaction is used during the mining to analyze and validate partial results as early as possible and to guide further processing steps. The efficacy of the methods is demonstrated using several data sets. In an application to sports medicine, the results were recognized as valid and useful by an expert in the field.
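The middle level of the TSKR hierarchy, coincidence, boils down to intersecting duration intervals. The toy helper below (our naming, our example labels) shows only that operation; the paper's actual miner searches for frequently co-occurring overlaps with a frequent-itemset-style algorithm.

    # Minimal sketch: the coincidence of two duration intervals is their
    # overlap, if any. Symbolic labels stand in for interval meanings.
    def coincidence(a, b):
        start = max(a[0], b[0])
        end = min(a[1], b[1])
        return (start, end) if start < end else None

    knee_bent = (2, 8)
    arm_raised = (5, 11)
    print(coincidence(knee_bent, arm_raised))  # -> (5, 8)
    print(coincidence((0, 3), (4, 6)))         # -> None, no coincidence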
Finding time series discords based on Haar transform
In Proceedings of the 2nd International Conference on Advanced Data Mining and Applications, 2006
"... Abstract. The problem of finding anomaly has received much attention recently. However, most of the anomaly detection algorithms depend on an explicit definition of anomaly, which may be impossible to elicit from a domain expert. Using discords as anomaly detectors is useful since less parameter set ..."
Cited by 13 (1 self)
Abstract: The problem of finding anomalies has received much attention recently. However, most of the anomaly detection algorithms depend on an explicit definition of anomaly, which may be impossible to elicit from a domain expert. Using discords as anomaly detectors is useful since less parameter setting is required. Keogh et al. proposed an efficient method for solving this problem. However, their algorithm requires users to choose the word size for the compression of subsequences. In this paper, we propose an algorithm which can dynamically determine the word size for compression. Our method is based on some properties of the Haar wavelet transformation. Our experiments show that this method is highly effective.
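The Haar transform the method builds on can be computed with pairwise averages and differences, as the sketch below illustrates; the discord search and the paper's dynamic word-size selection are not reproduced here.

    # Minimal sketch of the Haar wavelet transform for a sequence whose
    # length is a power of two: repeatedly replace adjacent pairs by their
    # average, collecting the pairwise differences as detail coefficients.
    def haar_transform(x):
        assert len(x) & (len(x) - 1) == 0, "length must be a power of two"
        data = list(x)
        coeffs = []
        while len(data) > 1:
            averages = [(data[i] + data[i + 1]) / 2
                        for i in range(0, len(data), 2)]
            details = [(data[i] - data[i + 1]) / 2
                       for i in range(0, len(data), 2)]
            coeffs = details + coeffs  # finer scales end up later
            data = averages
        return data + coeffs  # [overall average, coarse..fine details]

    print(haar_transform([9, 7, 3, 5]))  # -> [6.0, 2.0, 1.0, -1.0]

Keeping only a prefix of this output gives a coarser approximation of the subsequence, which is the kind of compression whose resolution the paper tunes dynamically.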