Probabilistic discovery of time series motifs, 2003
Cited by 119 (21 self)
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
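The abstract does not spell out the algorithm, so the following is only a sketch of one technique from biosequence pattern discovery, random projection, which may or may not match the paper in detail. It assumes the subsequences have already been discretized into equal-length words; the function name and parameters are hypothetical. Each iteration keeps a random subset of positions (the dropped positions act as "don't care" symbols), buckets the words by what remains, and counts which pairs collide. The counts can be inspected after any number of iterations, which is what gives the approach its anytime flavor.

```python
import random
from collections import defaultdict
from itertools import combinations


def random_projection_candidates(words, mask_size, iterations, seed=0):
    """Rank pairs of words by how often they collide under random masks.

    words: equal-length strings (assumed to be discretized subsequences).
    mask_size: number of positions kept in each iteration; the other
        positions are ignored, like "don't care" symbols.
    Returns a list of ((i, j), collision_count), most frequent first.
    """
    rng = random.Random(seed)
    length = len(words[0])
    collisions = defaultdict(int)
    for _ in range(iterations):
        kept = sorted(rng.sample(range(length), mask_size))
        buckets = defaultdict(list)
        for idx, word in enumerate(words):
            key = "".join(word[p] for p in kept)
            buckets[key].append(idx)
        # Every pair sharing a bucket agrees on all kept positions.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                collisions[(i, j)] += 1
    return sorted(collisions.items(), key=lambda kv: -kv[1])
```

Pairs with high collision counts are only candidates; a full method would still verify them against the original real-valued subsequences.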
PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search
Cited by 3 (0 self)
We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search. The PP-Index belongs to the family of permutation-based indexes, which represent any indexed object by “its view of the surrounding world”, i.e., a list of the elements of a set of reference objects sorted by their distance from the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency, treating effectiveness as a secondary aspect. We show how effectiveness can easily reach optimal levels just by adopting two “boosting” strategies: multiple index search and multiple query search. Such strategies have nice parallelization properties that allow the search process to be distributed in order to keep efficiency high. We study both the efficiency and the effectiveness properties of the PP-Index. We report experiments on collections of up to one hundred million images, represented in a very high-dimensional similarity space based on the combination of five MPEG-7 visual descriptors.
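As a rough illustration of the permutation-prefix idea, the sketch below represents each object by the identifiers of its closest reference objects and groups objects that share the same prefix. It is a simplification under stated assumptions: a plain dictionary stands in for whatever structure the actual index uses, the function names are hypothetical, and Euclidean distance via `math.dist` is chosen only for the example.

```python
import math


def permutation_prefix(obj, references, distance, prefix_len):
    """Ids of the prefix_len reference objects closest to obj, nearest first."""
    order = sorted(range(len(references)),
                   key=lambda r: (distance(obj, references[r]), r))
    return tuple(order[:prefix_len])


def build_prefix_index(objects, references, distance, prefix_len):
    """Map each permutation prefix to the ids of the objects that have it."""
    index = {}
    for obj_id, obj in enumerate(objects):
        key = permutation_prefix(obj, references, distance, prefix_len)
        index.setdefault(key, []).append(obj_id)
    return index


def candidates(query, index, references, distance, prefix_len):
    """Objects whose prefix equals the query's prefix (approximate result)."""
    key = permutation_prefix(query, references, distance, prefix_len)
    return index.get(key, [])


references = [(0, 0), (10, 0), (0, 10)]
objects = [(2, 1), (3, 1), (9, 1), (1, 9)]
index = build_prefix_index(objects, references, math.dist, prefix_len=2)
print(candidates((2.5, 0.5), index, references, math.dist, prefix_len=2))
```

An exact-prefix lookup like this can miss true neighbors that fall into a different bucket, which is the kind of effectiveness gap the multiple-index and multiple-query strategies in the abstract are meant to close.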
Speeding Up Permutation Based Indexing with Indexing
Cited by 2 (0 self)
A recent probabilistic approach for searching in high dimensional metric spaces is based on predicting the distances between database elements according to how they order their distances towards some set of distinguished elements, called permutants. In the preprocessing phase a set of permutants is chosen, and these are sorted (permuted) by their distances to every database element. The permutations form the index. When a query is given, its corresponding permutation is computed and, as similar elements will (probably) have a similar permutation, the database is compared in the order induced by the similarity between permutations. This works well but has relatively high CPU time due to computing the distances between permutations and (partially) sorting the database by the similarity. We improve this by identifying and solving this as another metric space problem. This avoids many distance computations between the permutants. The experimental results show that this works extremely well in practice. Keywords: metric space indexing; probabilistic algorithms; indexing permutations
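The baseline procedure described above (before the paper's speed-up) can be sketched as follows. The abstract does not say which similarity between permutations is used; the Spearman footrule below is one common choice and is an assumption here, as are the function names.

```python
import math


def permutation(obj, permutants, distance):
    """Permutant ids sorted by increasing distance from obj."""
    return sorted(range(len(permutants)),
                  key=lambda p: (distance(obj, permutants[p]), p))


def footrule(perm_a, perm_b):
    """Spearman footrule: total displacement of each permutant's position."""
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum(abs(i - pos_b[p]) for i, p in enumerate(perm_a))


def search_order(query, database, permutants, distance):
    """Database ids in the order in which they would be compared to the query."""
    # Preprocessing: one permutation per database element forms the index.
    index = [permutation(x, permutants, distance) for x in database]
    q_perm = permutation(query, permutants, distance)
    # Query time: sort the whole database by permutation dissimilarity.
    return sorted(range(len(database)),
                  key=lambda i: (footrule(q_perm, index[i]), i))
```

The sort in `search_order` touches every stored permutation, which is the CPU cost the abstract identifies and proposes to reduce by indexing the permutations themselves.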
Approximate Variable-Length Time Series Motif Discovery Using Grammar Inference
In Proceedings of the Tenth International Workshop on Multimedia Data Mining, 2010
Cited by 2 (0 self)
The problem of identifying frequently occurring patterns, or motifs, in time series data has received a lot of attention in the past few years. Most existing work on finding time series motifs requires that the length of the patterns be known in advance. However, such information is not always available. In addition, motifs of different lengths may coexist in a time series dataset. In this work, we propose a novel approach, based on grammar induction, for approximate variable-length time series motif discovery. Our algorithm offers the advantage of discovering hierarchical structure, regularity and grammar from the data. The preliminary results are promising. They show that the grammar-based approach is able to find some important motifs, and suggest that the new direction of using grammar-based algorithms for time series pattern discovery might be worth exploring.
... human life. Some examples of such data include speech, electrocardiogram (ECG) signals, radar signals, seismic activity, etc. In addition to the conventional definition of time series, i.e., measurements taken over time, it has recently been shown that certain other multimedia data, e.g., images and shapes [48, 49], and XML [19], can be converted to time series and mined with promising results. Figure 1 shows an example of how shapes can be converted to time series.
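To give a feel for why grammar induction yields variable-length patterns, here is a minimal sketch that is not the paper's algorithm: it uses a simple Re-Pair-style pair replacement, chosen only because it is short. It assumes the time series has already been discretized into a symbol sequence, and the rule names `R1`, `R2`, ... are hypothetical (input symbols must not use them).

```python
from collections import Counter


def induce_grammar(symbols):
    """Replace the most frequent adjacent pair with a new rule until none repeats.

    Returns (compressed_sequence, rules), where rules maps a rule name
    to the pair of symbols (terminals or rule names) it stands for.
    """
    seq = list(symbols)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{len(rules) + 1}"
        rules[name] = pair
        # Rewrite the sequence left to right, replacing non-overlapping matches.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the terminal symbols it covers."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Because rules can contain other rules, their expansions have different lengths, so repeated patterns of several lengths can surface from a single pass without fixing the motif length in advance.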
Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases
Cited by 1 (1 self)
In recent years, index structures for managing multidimensional data have become increasingly important. Due to heterogeneous systems and specific use cases, it is a complex challenge to find an appropriate index structure for specific problems, such as finding similar fingerprints or micro traces in a database. One aspect that should be considered in general is the dimensionality and the related curse of dimensionality. However, the dimensionality of the data is just one component that has to be considered. To address the challenges of finding the appropriate index, we motivate the necessity of a framework to evaluate indexes for specific use cases. Furthermore, we discuss core components of a framework that supports users in finding the most appropriate index structure for their use case.
Speeding up Spatial Approximation Search in Metric Spaces
Proximity searching consists in retrieving from a database those elements that are similar to a query object. The usual model for proximity searching is a metric space where the distance, which models the proximity, is expensive to compute. An index uses precomputed distances to speed up query processing. Among all the known indices, the baseline for performance for about twenty years has been AESA. This index uses an iterative procedure, where at each iteration it first chooses the next promising element (“pivot”) to compare to the query, and then it discards database elements that can be proved not relevant to the query using the pivot. The next pivot in AESA is chosen as the one minimizing the sum of lower bounds to the distance to the query proved by previous pivots. In this paper we introduce the new index iAESA, which establishes a new performance baseline for metric space searching. The difference from AESA is the method used to select the next pivot. In iAESA, each candidate sorts previous pivots by closeness to it, and the next pivot is the candidate whose order is most similar to that of the query. We also propose a modification to AESA-like algorithms to turn them into probabilistic algorithms. Our empirical results confirm a consistent improvement in query performance. For example, we perform as few as 60% of the distance evaluations of AESA in a database of documents, a
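The AESA loop described in this abstract (pick the pivot with the smallest sum of lower bounds, then discard elements the pivot rules out) can be sketched for range queries as below. This is only an illustration of the baseline AESA idea, not of iAESA; the function name and the use of a full precomputed distance matrix passed in as `dist_matrix` are assumptions of the example.

```python
def aesa_range_search(query, database, dist_matrix, distance, radius):
    """Range query in the style of AESA.

    dist_matrix[i][j] holds the precomputed distance between database
    elements i and j. Returns (sorted ids within radius, number of
    query-to-element distance evaluations).
    """
    candidates = set(range(len(database)))
    lower_sum = {i: 0.0 for i in candidates}
    results, evaluations = [], 0
    while candidates:
        # Next pivot: the candidate with the smallest sum of lower bounds.
        pivot = min(candidates, key=lambda i: (lower_sum[i], i))
        candidates.remove(pivot)
        d_qp = distance(query, database[pivot])
        evaluations += 1
        if d_qp <= radius:
            results.append(pivot)
        for u in list(candidates):
            # Triangle inequality: |d(q,p) - d(p,u)| <= d(q,u).
            bound = abs(d_qp - dist_matrix[pivot][u])
            if bound > radius:
                candidates.remove(u)  # u cannot be within the radius
            else:
                lower_sum[u] += bound
    return sorted(results), evaluations
```

Only the pivot selection line would change for a variant that ranks candidates differently, which is where the abstract locates the difference between AESA and iAESA.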
Searching by Similarity and Classifying Images on a Very Large Scale
In the demonstration we will show a system for searching by similarity and automatically classifying images in a very large dataset. The demonstrated techniques are based on the use of the MI-File (Metric Inverted File) as the access method for executing similarity search efficiently. The MI-File is an access method based on inverted files that relies on a space transformation that uses the notion of perspective to decide about the similarity between two objects. More specifically, if two objects are close to each other, the view of the space from their positions is also similar. Leveraging this space transformation, it is possible to use inverted files to execute approximate similarity search. In order to test the scalability of this access method, we inserted 106 million images from the CoPhIR dataset and created an online search engine that allows everybody to search in this dataset. In addition, we also used this access method to perform automatic classification on this very large image dataset. More specifically, we reformulated the classification problem, as resulting from the use of SVM with RBF kernel, as a complex approximate similarity search problem. In this way, instead of comparing every single image against the classifier, the best images belonging to a class are directly obtained as the result of a complex approximate similarity search query. Keywords: similarity search; image content based retrieval; image classification
Finding Good Permutants for Proximity Searching in Metric Spaces
The permutation index has been shown to be very effective in medium and high dimensional metric spaces, even in difficult problems, for instance, when solving reverse k-nearest neighbor queries. Nevertheless, there is currently no study of which features are desirable in a permutant set, or of how to select good permutants. Similar to the case of pivots, our experimental results show that, compared with a randomly chosen set, a good permutant set yields faster query response or reduces the amount of space used by the index. In this paper we start by characterizing permutants and studying their discrimination power, and then we propose an effective heuristic to select a good permutant candidate set. We also show empirical evidence that supports our technique. Keywords: component; metric space indexing; probabilistic algorithms; indexing permutations
Brigham and Women's Hospital
Time series motifs are pairs of individual time series, or subsequences of a longer time series, which are very similar to each other. As with their discrete analogues in computational biology, this similarity hints at structure which has been conserved for some reason and may therefore be of interest. Since the formalization of time series motifs in 2002, dozens of researchers have used them for diverse applications in many different domains. Because the obvious algorithm for computing motifs is quadratic in the number of items, more than a dozen approximate algorithms to discover motifs have been proposed in the literature. In this work, for the first time, we show a tractable exact algorithm to find time series motifs. As we shall show through extensive experiments, our algorithm is up to three orders of magnitude faster than brute-force search in large datasets. We further show that our algorithm is fast enough to be used as a subroutine in higher level data mining algorithms for anytime classification, near-duplicate detection and summarization, and we consider detailed case studies in domains as diverse as electroencephalograph interpretation and entomological telemetry data mining.
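For reference, the quadratic brute-force baseline that the abstract compares against can be sketched in a few lines. This is not the paper's exact algorithm. It assumes plain Euclidean distance without normalization and skips overlapping subsequences so that a subsequence is not matched with a shifted copy of itself; both choices, and the function name, are assumptions of the example.

```python
import math


def brute_force_motif(series, m):
    """Closest pair of non-overlapping length-m subsequences.

    Returns (distance, i, j) with i < j the start positions of the pair,
    or (inf, -1, -1) if the series is too short to contain such a pair.
    """
    n = len(series) - m + 1  # number of subsequences
    best = (math.inf, -1, -1)
    for i in range(n):
        for j in range(i + m, n):  # j >= i + m rules out overlap
            d = math.dist(series[i:i + m], series[j:j + m])
            if d < best[0]:
                best = (d, i, j)
    return best


series = [0, 1, 5, 2, 9, 7, 1, 5, 2, 3, 8]
print(brute_force_motif(series, 3))  # → (0.0, 1, 6)
```

Every pair of subsequences is compared, so the number of distance computations grows quadratically with the series length, which is the cost that motivates both the approximate methods and the exact algorithm mentioned in the abstract.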