Results 1 - 10 of 26
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces
1997
Cited by 447 (36 self)
A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced - several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described. Results from extensive experimentation with a prototype system are reported, considering as the performance criteria the number of page I/O's and the number of distance computations. The results demonstrate that the M-tree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in case of growing files.
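The role of the triangle inequality in a metric index can be illustrated with a minimal sketch. This is not the M-tree itself: it assumes a single hypothetical pivot object whose distances to all data objects are precomputed, and it shows how the bound d(q, o) >= |d(q, p) - d(p, o)| lets a range query skip distance computations. The names `range_query`, `pivot_dists`, and `abs_dist` are illustrative and do not come from the paper.

```python
def range_query(objects, pivot, pivot_dists, dist, query, radius):
    """Return (hits, n_computed) for objects within `radius` of `query`.

    pivot_dists[i] must equal dist(pivot, objects[i]), precomputed.
    """
    d_qp = dist(query, pivot)
    hits = []
    computed = 1  # counts the query-to-pivot distance
    for obj, d_po in zip(objects, pivot_dists):
        # Triangle inequality: d(q, o) >= |d(q, p) - d(p, o)|.
        if abs(d_qp - d_po) > radius:
            continue  # cannot lie within the radius; skip the computation
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


def abs_dist(a, b):
    """A simple metric on numbers, used only for the example."""
    return abs(a - b)


objects = [1, 4, 9, 15, 20]
pivot = 0
pivot_dists = [abs_dist(pivot, o) for o in objects]
hits, computed = range_query(objects, pivot, pivot_dists, abs_dist, 10, 2)
```

In this toy run only the object 9 survives the pivot filter, so two distances are computed instead of six. A tree-structured index applies the same bound to whole subtrees rather than to single objects.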
Optimal Multi-Step k-Nearest Neighbor Search
1998
Cited by 146 (14 self)
For an increasing number of modern database applications, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithms that are directly based on indexes work well for simple medium-dimensional similarity distance functions, they do not meet the efficiency requirements of complex high-dimensional and adaptable distance functions. The use of a multi-step query processing strategy is recommended in these cases, and our investigations substantiate that the number of candidates which are produced in the filter step and exactly evaluated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the state-of-the-art algorithm for k-nearest neighbor search [Korn et al. 1996], we present a novel multi-step algorithm which is guaranteed to produce the minim...
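The filter-and-refine idea can be sketched generically as follows. This is an illustration under stated assumptions, not the algorithm of the paper (whose description is truncated above): candidates are ranked by a cheap filter distance that is assumed to lower-bound the exact distance, and refinement stops once the next filter distance exceeds the current k-th exact distance. A full sort stands in for an index-based incremental ranking, and the names `multistep_knn`, `lower_bound`, and `exact` are hypothetical.

```python
import heapq
import math


def multistep_knn(items, query, k, lower_bound, exact):
    """k-NN via filter-and-refine; needs lower_bound(q, x) <= exact(q, x)."""
    # Filter step: rank candidates by the cheap lower-bounding distance.
    ranked = sorted(items, key=lambda x: lower_bound(query, x))
    best = []  # max-heap via negated distances: (-exact_dist, index, item)
    refined = 0
    for idx, item in enumerate(ranked):
        if len(best) == k and lower_bound(query, item) > -best[0][0]:
            break  # no remaining candidate can beat the current k-th distance
        d = exact(query, item)  # refinement step: expensive exact distance
        refined += 1
        if len(best) < k:
            heapq.heappush(best, (-d, idx, item))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, idx, item))
    result = sorted((-nd, item) for nd, _, item in best)
    return result, refined


def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def first_coord(p, q):
    """Difference in the first coordinate only; never exceeds euclid(p, q)."""
    return abs(p[0] - q[0])


points = [(0, 0), (1, 5), (2, 1), (6, 0), (7, 7)]
result, refined = multistep_knn(points, (0, 0), 2, first_coord, euclid)
neighbors = [item for _, item in result]
```

Here three of the five points are refined before the stopping condition fires, which shows why the number of candidates passed to the refinement step drives the overall cost.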
Efficient User-Adaptable Similarity Search in Large Multimedia Databases
In Proceedings of the Int. Conf. on Very Large Data Bases, 1997
Cited by 61 (4 self)
Efficient user-adaptable similarity search more and more increases in its importance for multimedia and spatial database systems. As a general similarity model for multi-dimensional vectors that is adaptable to application requirements and user preferences, we use quadratic form distance functions which have been successfully applied to color histograms in image databases [Fal+ 94]. The components a_ij of the matrix A denote similarity of the components i and j of the vectors. Beyond the Euclidean distance which produces spherical query ranges, the similarity distance defines a new query type, the ellipsoid query. We present new algorithms to efficiently support ellipsoid query processing for various user-defined similarity matrices on existing precomputed indexes. By adapting techniques for reducing the dimensionality and employing a multi-step query processing architecture, the method is extended to high-dimensional data spaces. In particular, from our algorithm to reduce the simila...
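A quadratic form distance is commonly written as sqrt((x - y)^T A (x - y)). The sketch below evaluates that standard form directly; it assumes A is symmetric and positive semidefinite so that the value under the square root is non-negative, and the function name and example matrices are illustrative only.

```python
import math


def quadratic_form_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a similarity matrix A given as rows."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    total = 0.0
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            # A[i][j] weights the interaction of components i and j.
            total += di * A[i][j] * dj
    return math.sqrt(total)


identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces to the Euclidean distance
similar = [[1.0, 0.5], [0.5, 1.0]]   # components 0 and 1 treated as similar
d_euclid = quadratic_form_distance([1.0, 0.0], [0.0, 1.0], identity)
d_adapted = quadratic_form_distance([1.0, 0.0], [0.0, 1.0], similar)
```

With the identity matrix the two vectors are sqrt(2) apart. Declaring the two components similar shrinks the distance to 1, which corresponds to stretching the spherical query range into an ellipsoid.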
Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases
In ICDE, 2000
Cited by 35 (4 self)
We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our indexing technique uses a disk-based suffix tree as an index structure and employs lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and thus accelerate the query processing, we convert sequences of continuous values to sequences of discrete values via a categorization method and store only a subset of suffixes whose first values are different from their preceding values. The experimental results reveal that our proposed technique can be a few orders of magnitude faster than sequential scanning.
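For readers unfamiliar with time warping distances, the following is a minimal sketch of the classic dynamic-programming formulation, using |a - b| as the base cost. It is an assumption that this matches the exact definition used in the paper; the suffix-tree index and the lower-bound functions are not shown.

```python
def time_warping_distance(s, t):
    """Classic dynamic-programming time warping distance, base cost |a - b|."""
    inf = float("inf")
    n, m = len(s), len(t)
    # dp[i][j] holds the warping distance between s[:i] and t[:j].
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # An element may be matched once or stretched along either axis.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


d_same_shape = time_warping_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])
d_shifted = time_warping_distance([1, 2, 3], [2, 3, 4])
```

The first pair has distance 0 although the sequences differ in length, which is the property that makes the measure suitable for different sampling rates. The quadratic cost of each comparison is what motivates cheap lower bounds for filtering.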
Matching and Indexing Sequences of Different Lengths
In Proc. of the CIKM, Las Vegas, 1997
Cited by 34 (3 self)
In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most previous research has concentrated on similarity matching and retrieval of sequences of the same length using the Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is based entirely on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method.
1 Introduction
The problem of matching sequences with respect to a similarity measure is encountered in a varie...
Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales
Proc. 7th ACM Int. Conf. on Information and Knowledge Management, 1998
Cited by 34 (2 self)
An important investigation of time series involves searching for "movement" patterns, such as "going up" or "going down" or some combination of them. Movement patterns can be in various scales: a large scale pattern may cover a long time period, while a small scale pattern usually covers a short time period. This paper considers such a scale requirement. More specifically, a pattern is defined as a regular expression of letters, where each letter describes a movement direction and covers a specified length of time (called the pattern unit length). To find if a time series (or a part of it) matches a pattern, the time series is first partitioned into consecutive sub-series of the unit length, and for each sub-series, the direction of its best fitting line is taken as the movement direction of the sub-series if the distance between the best fitting line and the sub-series is within a specified tolerance (the tolerance requirement). A direct implementation of pattern search will undoubtedly yield poor performance if the number of time series or their length is large. This paper introduces a pre-computation and indexing method to facilitate fast evaluation of pattern queries in user-specified scales. An efficient pre-computation algorithm is given to find the movement directions for all the sub-series that satisfy the tolerance requirement. Bounding triangles are used to represent clusters of sub-series. A relational database is then used to store these bounding triangles, and relational operations are employed to facilitate the evaluation of pattern queries. The paper also reports some experiments performed on a real-life data set to show the efficiency and the scalability of the algorithms.
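The direct (unindexed) evaluation described above can be sketched as follows. Several details are assumptions made for illustration: the best fitting line is an ordinary least-squares fit over positions 0..n-1, the distance to the sub-series is taken as the largest vertical deviation, and a zero slope is labeled "level". The function names are hypothetical, and the pre-computation and bounding-triangle index are not shown.

```python
def movement_direction(values, tolerance):
    """Return 'up', 'down', 'level', or None if the fit exceeds `tolerance`.

    Requires at least two values.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Assumed distance: largest vertical deviation from the fitted line.
    deviation = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, values))
    if deviation > tolerance:
        return None  # tolerance requirement not met
    if slope > 0:
        return "up"
    if slope < 0:
        return "down"
    return "level"


def directions(series, unit_length, tolerance):
    """Partition `series` into consecutive sub-series and label each one."""
    return [
        movement_direction(series[i:i + unit_length], tolerance)
        for i in range(0, len(series) - unit_length + 1, unit_length)
    ]


labels = directions([1, 2, 3, 4, 4, 3, 2, 1, 0, 9, 0, 9], 4, 0.5)
```

The third sub-series oscillates too strongly to be described by a line within the tolerance, so it receives no direction. Repeating this fit for every series and every unit length is the cost that a pre-computation scheme is meant to avoid.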
Improving Adaptable Similarity Query Processing by Using Approximations
Proc. 24th Int. Conf. on Very Large Data Bases (VLDB), 1998
Cited by 21 (6 self)
Similarity search and content-based retrieval are becoming more and more important for an increasing number of applications including multimedia, medical imaging, 3D molecular and CAD database systems. As a general similarity model that is particularly adaptable to user preferences and, therefore, fits the subjective character of similarity, quadratic form distance functions have been successfully employed, e.g. for color histograms as well as for 2D and 3D shape histograms. Although efficient algorithms for processing adaptable similarity queries using multidimensional index structures are available, the quadratic nature of the distance function strongly affects the CPU time which in turn represents a high percentage of the overall runtime. The basic idea of our approach is to reduce the number of exact distance computations by adapting conservative approximation techniques to similarity range query processing and, in addition, to extend the concepts to k-nearest neighbor search. As part of a detailed analysis, we show that our methods guarantee no false drops. Experiments
The Haar Wavelet Transform in the Time Series Similarity Paradigm
In Proceedings of Principles of Data Mining and Knowledge Discovery, 3rd European Conference, Prague, Czech Republic, 1999
Cited by 17 (1 self)
Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the rapid evaluation of similarity between time series in large databases. It is based on the orthonormal decomposition of the time series into the Haar basis. We demonstrate that this approach is capable of providing estimates of the local slope of the time series in the sequence of multi-resolution steps. The Haar representation and a number of related representations derived from it are suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations closely corresponds to the subjective feeling of similarity between the time series. In order to test the validity of subjective criteria, we test the records of currency exchanges, finding convincing levels of correlation.
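An orthonormal Haar decomposition can be sketched in a few lines. The version below assumes the series length is a power of two and uses the usual 1/sqrt(2) normalization; the function names are illustrative, and the related representations mentioned in the abstract are not reproduced.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar decomposition; len(signal) must be a power of two."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture local differences (slope-like information).
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        # Approximation coefficients are scaled local averages.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        coeffs = details + coeffs  # coarser levels go in front
    return approx + coeffs


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


x = [4.0, 6.0, 10.0, 12.0]
y = [5.0, 5.0, 9.0, 13.0]
hx, hy = haar_transform(x), haar_transform(y)
```

Because the transform is orthonormal, `euclidean(hx, hy)` equals `euclidean(x, y)` up to rounding, so distances can be evaluated on the coefficients. Keeping only the leading coefficients gives a coarser, cheaper comparison.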
Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 14 (5 self)
Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the presence of more items being added into the database. Because of the mostly append-only nature of updating time-series data, incremental mining would be very effective and efficient. Several algorithms for incremental mining of partial periodic patterns in time-series databases are proposed and are analyzed empirically. The new algorithms allow for online adaptation of the thresholds in order to produce interactive mining of partial periodic patterns. The storage overhead of the incremental online mining algorithms is analyzed. Results show that the storage overhead for storing the intermediate data structures pays off as the incremental online mining of partial periodic patterns proves to be significantly more efficient than the non-incremental, non-online versions. Moreover, a new problem, termed merge mining, is introduced as a generalization of incremental mining. Merge mining can be defined as merging the discovered patterns of two or more databases that are mined independently of each other. An algorithm for merge mining of partial periodic patterns in time-series databases is proposed and analyzed.
Dynamically optimizing high-dimensional index structures
In Proc. Int. Conf. on Extending Database Technology (EDBT), 2000
Cited by 12 (3 self)
In high-dimensional query processing, the optimization of the logical page-size of index structures is an important research issue. Even very simple query processing techniques such as the sequential scan are able to outperform indexes which are not suitably optimized. Page-size optimization based on a cost model faces the problem that the optimum not only depends on static schema information such as the dimension of the data space but also on dynamically changing parameters such as the number of objects stored in the database and the degree of clustering and correlation in the current data set. Therefore, we propose a method for adapting the page size of an index dynamically during insert processing. Our solution, called DABS-tree, uses a flat directory whose entries consist of an MBR, a pointer to the data page and the size of the data page. Before splitting pages in insert operations, a cost model is consulted to estimate whether the split operation is beneficial. Otherwise, the split is avoided and the logical page-size is adapted instead. A similar rule applies for merging when performing delete operations. We present an algorithm for the management of data pages with varying page-sizes in an index and show that all restructuring operations are locally restricted. We show in our experimental evaluation that the DABS-tree outperforms the X-tree by a factor up to 4.6 and the sequential scan by a factor up to 6.6.

