A Cost Model for Similarity Queries in Metric Spaces
In Proc. 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98), 1998
Cited by 49 (12 self)
We consider the problem of estimating CPU (distance computations) and I/O costs for processing range and k-nearest neighbors queries over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performance of multidimensional access methods, in a generic metric space there is no such possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method [10]. Our results rely on the assumption that the indexed dataset comes from a metric space which is "homogeneous" enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the model over both real and synthetic datasets, and sho...
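The idea stated in the abstract above, using the distance distribution to predict query costs, can be illustrated roughly as follows. This is a minimal sketch and not the paper's M-tree cost model: it only estimates the expected result size of a range query as n * F(r), where F is an empirical distribution of pairwise distances. The function names, the 2-D Euclidean sample data, and the use of one global F for every query object (the homogeneity assumption) are all assumptions made for the example.

```python
import bisect
import itertools
import math
import random


def empirical_distance_distribution(sample, dist):
    """Return the sorted pairwise distances of a sample (an empirical F)."""
    return sorted(dist(a, b) for a, b in itertools.combinations(sample, 2))


def F(sorted_dists, r):
    """Estimated probability that two random objects lie within distance r."""
    return bisect.bisect_right(sorted_dists, r) / len(sorted_dists)


def expected_range_results(n_objects, sorted_dists, radius):
    """Expected size of range(Q, radius), assuming a homogeneous space,
    i.e. that the global F is a good proxy for the distribution seen from Q."""
    return n_objects * F(sorted_dists, radius)


# Hypothetical data: 200 random points in the unit square.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
dists = empirical_distance_distribution(points, math.dist)
estimate = expected_range_results(len(points), dists, 0.1)
```

The same F could in principle be evaluated at node-level radii to estimate how many index nodes a query touches, but that step depends on details of the index that the abstract does not give.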
The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces
In Proc. of VLDB, 2003
Cited by 14 (8 self)
Similarity searches in multidimensional Non-ordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases.
A Query-sensitive Cost Model for Similarity Queries with M-tree
In Proc. of the 10th ADC, 1999
Cited by 6 (2 self)
We introduce a cost model for the M-tree access method [Ciaccia et al., 1997] which provides estimates of CPU (distance computations) and I/O costs for the execution of similarity queries as a function of each single query. This model is said to be query-sensitive, since it takes into account, by relying on the novel notion of "witness", the "position" of the query point inside the metric space indexed by the M-tree. We describe the basic concepts underlying the model along with different methods which can be used for its implementation; finally, we experimentally validate the model over both real and synthetic datasets.
1 Introduction. Modern advanced database applications, such as sequence comparison in molecular biology [Chen and Aberer, 1997], shape matching [Huttenlocher et al., 1993], fingerprint recognition [Maio and Maltoni, 1996], and many others which typically occur in multimedia environments, often require the efficient evaluation of similarity (range and nearest neighbors) queries over a set of objects drawn from an arbitrary metric space. A metric space M = (U, d) is defined by a value domain U and a metric d, satisfying the axioms of non-negativity, symmetry, and triangular inequality (d(O_i, O_j) ≤ d(O_i, O_k) + d(O_k, O_j), ∀ O_i, O_j, O_k ∈ U), which measures the distance (dissimilarity) of points (objects) of U. As a particular case, metric spaces include multidimensional vector spaces, where objects are usually compared using the L_p distances, such as Euclidean (L_2) or Manhattan (L_1), but they are far more general. As an example, the domain S of text strings endowed with the edit (Levenshtein) distance d_edit, which counts the minimal number of changes (insertions, deletions, substitutions) needed to transform a string into another one, is a metric space (S, d_edit). This work has been funded by...
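The edit (Levenshtein) distance mentioned in the introduction above can be sketched with the standard dynamic-programming recurrence. This is a generic textbook implementation with unit costs for insertion, deletion, and substitution, not code from the paper; the function name is chosen for the example.

```python
def edit_distance(s, t):
    """Minimal number of insertions, deletions, and substitutions
    needed to transform string s into string t (unit costs)."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The metric axioms can be spot-checked on a few strings.
a, b, c = "kitten", "sitting", "mitten"
assert edit_distance(a, a) == 0
assert edit_distance(a, b) == edit_distance(b, a)
assert edit_distance(a, b) <= edit_distance(a, c) + edit_distance(c, b)
```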
String matching with metric trees using an approximate distance
In SPIRE, LNCS 2476, 2002
Cited by 5 (0 self)
Searching a large data set for the strings that are most similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance of up to 90% with respect to the basic case. This, along with the fact that our solution is independent of both the distance used in the pre-test and the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, such as multimedia, data mining and pattern recognition, but also for the string matching problem.
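The filtering idea in this abstract can be sketched as follows. The bag distance used here (the larger of the two multiset differences between the character bags) never exceeds the edit distance, so a string whose bag distance to the query is above the search radius can be discarded without computing the expensive distance. This sketch applies the filter to a plain linear scan rather than to an M-tree, and the function names and sample words are assumptions for illustration only.

```python
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance with unit costs (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bag_distance(s, t):
    """Cheap lower bound on the edit distance, computed from character counts."""
    cs, ct = Counter(s), Counter(t)
    # Counter subtraction keeps only positive counts (multiset difference).
    return max(sum((cs - ct).values()), sum((ct - cs).values()))


def range_search(strings, query, radius):
    """Return (matches, number of edit-distance computations performed)."""
    matches, computed = [], 0
    for s in strings:
        if bag_distance(s, query) > radius:
            continue  # safely discarded: edit distance is at least this large
        computed += 1
        if edit_distance(s, query) <= radius:
            matches.append(s)
    return matches, computed


words = ["kitten", "sitting", "mitten", "banana", "bitten"]
matches, computed = range_search(words, "kitten", 1)
```

In this toy run the filter skips two of the five edit-distance computations; the actual savings depend on the data and the radius.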
Index-based approach to similarity search in protein and nucleotide databases
DATESO, 2007
Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
With the rapid development of the life sciences and the flood of genomic information, the need for faster and more scalable search methods has become urgent. One approach that has been investigated is indexing. Indexing methods fall into three categories: length-based index algorithms, transformation-based algorithms, and mixed-technique algorithms. In this research, we focus on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up index building and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution that saves both time and space. The results show that the index is smaller than the dataset when the N-gram size is 5 or 6. The results for the parallel N-gram transformation algorithm indicate that parallel programming with large datasets is promising and can be improved further.
Keywords: Biological sequence, Database index, N-gram indexing, Parallel computing, Sequence retrieval.
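An N-gram inverted index of the general kind described in this abstract can be sketched as below. This is a sequential, in-memory illustration and not the authors' algorithm: the parallel build step is omitted, and the function names, the sample sequences, and the ranking of candidates by the number of shared N-grams are assumptions made for the example.

```python
from collections import defaultdict


def ngrams(seq, n):
    """All overlapping substrings of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(sequences, n):
    """Map each N-gram to the set of ids of sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index


def candidates(index, query, n):
    """Rank sequence ids by how many distinct query N-grams they share."""
    hits = defaultdict(int)
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical protein-like sequences.
sequences = ["MKVLAAGIV", "GIVKLMNPQ", "TTTTTTTT"]
index = build_index(sequences, 3)
ranked = candidates(index, "AAGIVK", 3)
```

Because each sequence's N-grams are extracted independently, the build loop is a natural place to split work across processes, which is presumably where a parallel variant would gain its speedup.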