Results 1–10 of 30
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces
, 1997
"... A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects ..."
Abstract

Cited by 508 (37 self)
A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced; several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described. Results from extensive experimentation with a prototype system are reported, considering as the performance criteria the number of page I/O's and the number of distance computations. The results demonstrate that the M-tree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.
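The pruning rule that makes metric indexes like the M-tree work can be sketched with a single pivot object: by the triangle inequality, |d(q,p) - d(p,o)| is a lower bound on d(q,o), so precomputed pivot distances let many exact distance computations be skipped. This is a minimal illustration, not the M-tree's actual node structure; `dist` and `range_query` are names chosen for the sketch.

```python
import math

def dist(a, b):
    """Euclidean distance; any metric satisfying positivity,
    symmetry, and the triangle inequality would work here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(objects, query, pivot, radius):
    """Answer a range query using one precomputed pivot distance per
    object: since |d(q,p) - d(p,o)| <= d(q,o), any object whose bound
    exceeds the radius is pruned without computing d(q,o)."""
    d_qp = dist(query, pivot)
    results, computed = [], 0
    for obj, d_po in objects:          # (object, precomputed d(pivot, obj))
        if abs(d_qp - d_po) > radius:  # lower bound already too large
            continue                   # pruned: no distance computation
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed

# Example: 2 of 4 candidate distances are computed; results are exact.
pivot = (0, 0)
points = [(1, 0), (5, 0), (0, 2), (10, 10)]
objects = [(p, dist(pivot, p)) for p in points]
res, n = range_query(objects, (1, 1), pivot, 1.5)
```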
Optimal Multi-Step k-Nearest Neighbor Search
, 1998
"... For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects, such as images, molecules and mechanical parts, the complexity of the similarity models increases more and more. Whereas algorithm ..."
Abstract

Cited by 166 (19 self)
For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects, such as images, molecules and mechanical parts, the complexity of the similarity models increases more and more. Whereas algorithms that are directly based on indexes work well for simple medium-dimensional similarity distance functions, they do not meet the efficiency requirements of complex high-dimensional and adaptable distance functions. The use of a multi-step query processing strategy is recommended in these cases, and our investigations substantiate that the number of candidates which are produced in the filter step and exactly evaluated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the state-of-the-art algorithm for k-nearest neighbor search [Korn et al. 1996], we present a novel multi-step algorithm which is guaranteed to produce the minim...
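The filter-and-refine idea the abstract describes can be sketched as follows: candidates are scanned in ascending order of a cheap lower-bounding filter distance, refined with the exact distance, and the scan stops once the next filter distance exceeds the current k-th exact distance. With a correct lower bound this yields no false dismissals. The sketch below is a simplified version of that scheme, not the paper's optimal algorithm; the function names are invented.

```python
import heapq

def multistep_knn(candidates, k, d_lb, d_exact):
    """Multi-step k-NN: d_lb must lower-bound d_exact.  Objects are
    visited in d_lb order; a size-k max-heap of exact distances is
    maintained, and the scan stops when d_lb alone rules out all
    remaining candidates."""
    ranked = sorted(candidates, key=d_lb)
    result = []  # heap of (-exact_dist, obj), size <= k
    for obj in ranked:
        if len(result) == k and d_lb(obj) > -result[0][0]:
            break  # no remaining object can beat the k-th exact distance
        heapq.heappush(result, (-d_exact(obj), obj))
        if len(result) > k:
            heapq.heappop(result)  # drop the current worst of the k
    return sorted((-d, o) for d, o in result)

# Example: 1-D points, query at 0; the filter distance halves the
# exact distance, a valid lower bound.
res = multistep_knn([1, -2, 3, 5], 2,
                    d_lb=lambda x: abs(x) / 2,
                    d_exact=lambda x: abs(x))
```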
Efficient User-Adaptable Similarity Search in Large Multimedia Databases
 IN PROCEEDINGS OF THE INT. CONF. ON VERY LARGE DATA BASES
, 1997
"... Efficient user-adaptable similarity search is becoming increasingly important for multimedia and spatial database systems. As a general similarity model for multidimensional vectors that is adaptable to application requirements and user preferences, we use quadratic form distance functions w ..."
Abstract

Cited by 61 (4 self)
Efficient user-adaptable similarity search is becoming increasingly important for multimedia and spatial database systems. As a general similarity model for multidimensional vectors that is adaptable to application requirements and user preferences, we use quadratic form distance functions, which have been successfully applied to color histograms in image databases [Fal+ 94]. The components a_ij of the matrix A denote the similarity of the components i and j of the vectors. Beyond the Euclidean distance, which produces spherical query ranges, the similarity distance defines a new query type, the ellipsoid query. We present new algorithms to efficiently support ellipsoid query processing for various user-defined similarity matrices on existing precomputed indexes. By adapting techniques for reducing the dimensionality and employing a multi-step query processing architecture, the method is extended to high-dimensional data spaces. In particular, from our algorithm to reduce the simila...
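The quadratic form distance the abstract refers to is d_A(x, y) = sqrt((x-y)^T A (x-y)); with A the identity it reduces to the Euclidean distance, and off-diagonal entries a_ij make the query region an ellipsoid. A minimal sketch:

```python
def quadratic_form_dist(x, y, A):
    """Quadratic form distance d_A(x, y) = sqrt((x-y)^T A (x-y)).
    Off-diagonal entries a_ij of A encode cross-component similarity;
    the range-query region {y : d_A(q, y) <= r} is an ellipsoid."""
    diff = [a - b for a, b in zip(x, y)]
    # (x-y)^T A (x-y) expanded as a double sum
    s = sum(diff[i] * A[i][j] * diff[j]
            for i in range(len(diff)) for j in range(len(diff)))
    return s ** 0.5

# With the identity matrix this is the ordinary Euclidean distance.
d_euclid = quadratic_form_dist((1, 2), (4, 6), [[1, 0], [0, 1]])
# A cross-term matrix stretches the distance along correlated axes.
d_cross = quadratic_form_dist((3, 4), (0, 0), [[1, 0.5], [0.5, 1]])
```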
Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases
 In ICDE
, 2000
"... We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our index ..."
Abstract

Cited by 39 (4 self)
We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our indexing technique uses a disk-based suffix tree as an index structure and employs lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and thus accelerate the query processing, we convert sequences of continuous values to sequences of discrete values via a categorization method and store only a subset of suffixes whose first values are different from their preceding values. The experimental results reveal that our proposed technique can be a few orders of magnitude faster than sequential scanning.
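The time warping distance mentioned above is computed by the standard dynamic program; unlike the Euclidean distance it is defined for sequences of different lengths, because elements may be matched many-to-one. A plain O(nm) sketch, without the paper's indexing or lower-bound machinery:

```python
def dtw(s, t):
    """Dynamic time warping distance between numeric sequences s and
    t: D[i][j] is the cheapest warping of s[:i] onto t[:j], extended
    by matching, inserting, or deleting one element at a time."""
    INF = float("inf")
    n, m = len(s), len(t)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # repeat t element
                                 D[i][j - 1],      # repeat s element
                                 D[i - 1][j - 1])  # one-to-one match
    return D[n][m]
```

For example, a repeated sample costs nothing under warping even though the lengths differ, whereas the Euclidean distance is undefined for such a pair.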
Matching and Indexing Sequences of Different Lengths
 In Proc. of the CIKM, Las Vegas
, 1997
"... In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research is concentrated on similarity matching and retrieval of sequences of the same length using Euclidean distance metric. For similarity matching of sequences, we u ..."
Abstract

Cited by 38 (3 self)
In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research has concentrated on similarity matching and retrieval of sequences of the same length using the Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is based entirely on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method. 1 Introduction The problem of matching sequences with respect to a similarity measure is encountered in a varie...
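The classic edit distance underlying the approach, plus a majority-match acceptance rule in the spirit of the abstract, can be sketched as follows. The `majority_match` threshold is an invented illustration, not the paper's modified distance function:

```python
def edit_distance(s, t):
    """Levenshtein edit distance (unit-cost insert, delete,
    substitute) via the standard dynamic program."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete from s
                          D[i][j - 1] + 1,        # insert into s
                          D[i - 1][j - 1] + sub)  # substitute / match
    return D[n][m]

def majority_match(s, t, frac=0.5):
    """Hypothetical majority rule: accept the pair when the edit
    distance is small relative to the longer sequence, implying that
    most elements match (and hence comparable lengths)."""
    return edit_distance(s, t) <= (1 - frac) * max(len(s), len(t))
```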
Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales
 Proc. 7th ACM Int. Conf. on Information and Knowledge Management
, 1998
"... An important investigation of time series involves searching for "movement" patterns, such as "going up" or "going down" or some combinations of them. Movement patterns can be in various scales: a large scale pattern may cover a long time period, while a small scale pattern usually covers a short ti ..."
Abstract

Cited by 36 (2 self)
An important investigation of time series involves searching for "movement" patterns, such as "going up" or "going down" or some combinations of them. Movement patterns can be in various scales: a large scale pattern may cover a long time period, while a small scale pattern usually covers a short time period. This paper considers such scale requirements. More specifically, a pattern is defined as a regular expression of letters, where each letter describes a movement direction and covers a specified length of time (called the pattern unit length). To find if a time series (or a part of it) matches a pattern, the time series is first partitioned into consecutive subseries of the unit length, and for each subseries, the direction of its best fitting line is taken as the movement direction of the subseries if the distance between the best fitting line and the subseries is within a specified tolerance (the tolerance requirement). A direct implementation of pattern search will undoubtedly yield poor performance if the number of time series or their length is large. This paper introduces a precomputation and indexing method to facilitate fast evaluation of pattern queries in user-specified scales. An efficient precomputation algorithm is given to find the movement directions for all the subseries that satisfy the tolerance requirement. Bounding triangles are used to represent clusters of subseries. A relational database is then used to store these bounding triangles, and relational operations are employed to facilitate the evaluation of pattern queries. The paper also reports some experiments performed on a real-life data set to show the efficiency and the scalability of the algorithms.
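The per-subseries step described above (fit a line, check the tolerance requirement, read off the direction) can be sketched directly; the function name, the use of the maximum residual as the distance, and the three direction labels are assumptions made for this illustration:

```python
def movement_direction(subseries, tolerance):
    """Fit a least-squares line to one unit-length subseries and
    return 'up', 'down', or 'flat' from its slope, but only when the
    worst residual is within the tolerance (the paper's tolerance
    requirement); otherwise return None."""
    n = len(subseries)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(subseries) / n
    # least-squares slope and intercept (n >= 2 assumed)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, subseries))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    worst = max(abs(y - (slope * x + intercept))
                for x, y in zip(xs, subseries))
    if worst > tolerance:
        return None  # line does not represent the subseries well enough
    if slope > 0:
        return "up"
    if slope < 0:
        return "down"
    return "flat"
```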
The Haar Wavelet Transform in the Time Series Similarity Paradigm
 In proceedings of Principles of Data Mining and Knowledge Discovery, 3rd European Conference, Prague, Czech Republic
, 1999
"... Abstract. Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the ..."
Abstract

Cited by 26 (1 self)
Abstract. Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the rapid evaluation of similarity between time series in large databases. It is based on the orthonormal decomposition of the time series into the Haar basis. We demonstrate that this approach is capable of providing estimates of the local slope of the time series in the sequence of multiresolution steps. The Haar representation and a number of related representations derived from it are suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations closely corresponds to the subjective feeling of similarity between the time series. In order to test the validity of subjective criteria, we test the records of currency exchanges, finding convincing levels of correlation.
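The Haar decomposition the abstract relies on can be sketched in a few lines: each level keeps pairwise averages and emits pairwise half-differences, and those differences estimate the local slope at that resolution. This unnormalised variant is one common convention, assumed here for simplicity:

```python
def haar_transform(series):
    """Full (unnormalised) Haar decomposition of a series whose
    length is a power of two.  Returns [overall average, coarsest
    detail, ..., finest details]; each detail is a pairwise
    half-difference, i.e. a local-slope estimate at that scale."""
    assert len(series) & (len(series) - 1) == 0, "length must be 2^k"
    coeffs = []
    cur = list(series)
    while len(cur) > 1:
        avgs = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        dets = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        coeffs = dets + coeffs  # prepend so coarser levels come first
        cur = avgs
    return cur + coeffs
```

Two series can then be compared by a distance between their (possibly truncated) coefficient vectors rather than the raw values.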
Improving Adaptable Similarity Query Processing by Using Approximations
 PROC. 24TH INT. CONF. ON VERY LARGE DATA BASES (VLDB)
, 1998
"... Similarity search and content-based retrieval are becoming more and more important for an increasing number of applications including multimedia, medical imaging, 3D molecular and CAD database systems. As a general similarity model that is particularly adaptable to user preferences and, theref ..."
Abstract

Cited by 21 (6 self)
Similarity search and content-based retrieval are becoming more and more important for an increasing number of applications including multimedia, medical imaging, 3D molecular and CAD database systems. As a general similarity model that is particularly adaptable to user preferences and, therefore, fits the subjective character of similarity, quadratic form distance functions have been successfully employed, e.g. for color histograms as well as for 2D and 3D shape histograms. Although efficient algorithms for processing adaptable similarity queries using multidimensional index structures are available, the quadratic nature of the distance function strongly affects the CPU time, which in turn represents a high percentage of the overall runtime. The basic idea of our approach is to reduce the number of exact distance computations by adapting conservative approximation techniques to similarity range query processing and, in addition, to extend the concepts to k-nearest neighbor search. As part of a detailed analysis, we show that our methods guarantee no false drops. Experiments
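One well-known conservative approximation for quadratic form distances, used here as a plausible illustration of the "no false drops" idea rather than as the paper's specific technique, is the eigenvalue bound d_A(q,p)^2 >= lambda_min(A) * ||q - p||^2: the cheap Euclidean distance, scaled by the smallest eigenvalue, lower-bounds the expensive exact distance, so filtering with it never drops a true answer. The sketch is restricted to 2x2 matrices so the eigenvalue has a closed form:

```python
import math

def lambda_min_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]];
    for larger matrices an eigensolver would be used instead."""
    return (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)

def conservative_range_filter(points, query, radius, a, b, c):
    """Range query under d_A with a conservative filter step: points
    with lambda_min * ||q - p||^2 > radius^2 cannot qualify and are
    dropped cheaply; survivors are refined with the exact distance."""
    lam = lambda_min_2x2(a, b, c)
    def d2_euclid(p):
        return (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    def d2_exact(p):
        dx, dy = p[0] - query[0], p[1] - query[1]
        return a * dx * dx + 2 * b * dx * dy + c * dy * dy
    survivors = [p for p in points if lam * d2_euclid(p) <= radius ** 2]
    return [p for p in survivors if d2_exact(p) <= radius ** 2]

# A = [[2, 0], [0, 0.5]] (lambda_min = 0.5); (3, 3) is pruned in the
# filter step and (1, 0) is rejected only during refinement.
res = conservative_range_filter(
    [(0.5, 0), (1, 0), (0, 1.2), (3, 3)], (0, 0), 1.0, 2, 0, 0.5)
```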
Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases
 IEEE Transactions on Knowledge and Data Engineering
, 2004
"... Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the prese ..."
Abstract

Cited by 15 (6 self)
Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the presence of more items being added into the database. Because of the mostly append-only nature of updating time-series data, incremental mining would be very effective and efficient. Several algorithms for incremental mining of partial periodic patterns in time-series databases are proposed and analyzed empirically. The new algorithms allow for online adaptation of the thresholds in order to produce interactive mining of partial periodic patterns. The storage overhead of the incremental online mining algorithms is analyzed. Results show that the storage overhead of storing the intermediate data structures pays off, as the incremental online mining of partial periodic patterns proves to be significantly more efficient than the non-incremental, non-online versions. Moreover, a new problem, termed merge mining, is introduced as a generalization of incremental mining. Merge mining can be defined as merging the discovered patterns of two or more databases that are mined independently of each other. An algorithm for merge mining of partial periodic patterns in time-series databases is proposed and analyzed.
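A partial periodic pattern is commonly written as a string whose length is the period, with a don't-care symbol at unconstrained positions, and scored by the fraction of full periods it matches. The sketch below illustrates that scoring step only, not the paper's incremental or merge algorithms; the confidence definition is the usual one but is an assumption here:

```python
def pattern_confidence(series, pattern):
    """Confidence of a partial periodic pattern over a series: the
    pattern's length is the period, '*' matches anything, and
    confidence = matching periods / total full periods.  The pattern
    is 'partial' because only some positions are constrained."""
    period = len(pattern)
    full = len(series) // period  # ignore a trailing incomplete period
    hits = 0
    for k in range(full):
        seg = series[k * period:(k + 1) * period]
        if all(p == "*" or p == s for p, s in zip(pattern, seg)):
            hits += 1
    return hits / full
```

Incremental maintenance then amounts to updating these per-pattern counts as new periods are appended, rather than rescanning the whole series.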
Dynamically optimizing high-dimensional index structures
 In Proc. Int. Conf. on Extending Database Technology (EDBT)
, 2000
"... Abstract. In high-dimensional query processing, the optimization of the logical page size of index structures is an important research issue. Even very simple query processing techniques such as the sequential scan are able to outperform indexes which are not suitably optimized. Page-size optimizati ..."
Abstract

Cited by 15 (4 self)
Abstract. In high-dimensional query processing, the optimization of the logical page size of index structures is an important research issue. Even very simple query processing techniques such as the sequential scan are able to outperform indexes which are not suitably optimized. Page-size optimization based on a cost model faces the problem that the optimum depends not only on static schema information, such as the dimension of the data space, but also on dynamically changing parameters, such as the number of objects stored in the database and the degree of clustering and correlation in the current data set. Therefore, we propose a method for adapting the page size of an index dynamically during insert processing. Our solution, called the DABS-tree, uses a flat directory whose entries consist of an MBR, a pointer to the data page, and the size of the data page. Before splitting pages in insert operations, a cost model is consulted to estimate whether the split operation is beneficial. Otherwise, the split is avoided and the logical page size is adapted instead. A similar rule applies for merging when performing delete operations. We present an algorithm for the management of data pages with varying page sizes in an index and show that all restructuring operations are locally restricted. We show in our experimental evaluation that the DABS-tree outperforms the X-tree by a factor of up to 4.6 and the sequential scan by a factor of up to 6.6.