OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
In VLDB, 2003
Cited by 32 (4 self)
A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employs heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable.
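For contrast with heuristic tools, an exhaustive local-alignment score never misses a target, at the cost of examining every cell of the alignment table. The sketch below is the standard Smith-Waterman recurrence with a linear gap penalty, not the OASIS algorithm itself. The function name and the scoring values (match=2, mismatch=-1, gap=-2) are illustrative assumptions, not taken from the paper.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between query and target (Smith-Waterman).

    Scoring values are illustrative assumptions; real searches typically use
    a substitution matrix and affine gap penalties.
    """
    prev = [0] * (len(target) + 1)  # scores for the previous query row
    best = 0
    for q in query:
        cur = [0]  # column 0: an alignment may start anywhere, so score 0
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "TTACGTAA"))
```

The table has len(query) x len(target) cells, which is why exhaustive scoring over a large database is slow and why heuristics or indexes are attractive.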
A linear size index for approximate pattern matching
In Proc. 17th Annual Symposium on Combinatorial Pattern Matching, 2006
Cited by 13 (1 self)
Abstract. This paper revisits the problem of indexing a text S[1..n] to support searching substrings in S that match a given pattern P[1..m] with at most k errors. A naive solution either has a worst-case matching time complexity of Ω(m^k) or requires Ω(n^k) space. Devising a solution with better performance has been a challenge until Cole et al. [5] showed an O(n log^k n)-space index that can support k-error matching in O(m + occ + log^k n log log n) time, where occ is the number of occurrences. Motivated by the indexing of DNA, we investigate in this paper the feasibility of devising a linear-size index that still has a time complexity linear in m. In particular, we give an O(n)-space index that supports k-error matching in O(m + occ + (log n)^{k(k+1)} log log n) worst-case time. Furthermore, the index can be compressed from O(n) words into O(n) bits with a slight increase in the time complexity.
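To make the k-error matching problem concrete, here is a minimal index-free baseline: the classic dynamic-programming scan (often attributed to Sellers) that reports every position in S where some substring ending there matches P with at most k edit errors. It is not the index described in the abstract; it runs in O(nm) time per query, which is the cost an index is meant to avoid. The function name and the choice of edit distance as the error measure are assumptions made for illustration.

```python
def k_error_matches(text, pattern, k):
    """Return 1-based end positions in text where some substring matches
    pattern with at most k errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    # col[i] = fewest errors aligning pattern[:i] to a substring of text
    # that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i-1] in the previous column
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_error_matches("GATTACA", "TTA", 0))  # [5]
    print(k_error_matches("GATTACA", "TTA", 1))  # [4, 5, 6]
```

Only one column of the table is kept, so the scan uses O(m) working space regardless of the text length.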
A Comprehensive Trainable Error Model for Sung Music Queries
Journal of Artificial Intelligence Research, 2004
Cited by 7 (1 self)
We propose a model for errors in sung queries, a variant of the hidden Markov model (HMM). This is a solution to the problem of identifying the degree of similarity between a (typically error-laden) sung query and a potential target in a database of musical works, an important problem in the field of music information retrieval. Similarity metrics are a critical component of “query-by-humming” (QBH) applications which search audio and multimedia databases for strong matches to oral queries. Our model comprehensively expresses the types of error or variation between target and query: cumulative and non-cumulative local errors, transposition, tempo and tempo changes, insertions, deletions and modulation. The model is not only expressive, but automatically trainable, or able to learn and generalize from query examples. We present results of simulations, designed to assess the discriminatory potential of the model, and tests with real sung queries, to demonstrate relevance to real-world applications.
BFT: Bit Filtration Technique for Approximate String Join in Biological Databases
In SPIRE, 2003
Cited by 4 (1 self)
Abstract. Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole-genome comparison into an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation and furthermore apply DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a preprocessing filtration step, which effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that effectively prunes non-relevant portions of the database, incurring no false negatives, with up to 50 times faster running time compared with traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a preprocessing step for any of the well-known sequence search heuristics such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed tradeoffs.
Compressed indexes for approximate string matching
In Proceedings of the European Symposium on Algorithms, 2006
Cited by 2 (0 self)
Abstract. We revisit the problem of indexing a string S[1..n] to support searching all substrings in S that match a given pattern P[1..m] with at most k errors. Previous solutions either require an index of size exponential in k or need Ω(m^k) time for searching. Motivated by the indexing of DNA sequences, we investigate space-efficient indexes that occupy only O(n) space. For k = 1, we give an index to support matching in O(m + occ + log n log log n) time. The previously best solution achieving this time complexity requires an index of size O(n log n). This new index can be used to improve existing indexes for k ≥ 2 errors. Among others, it can support matching with k = 2 errors in O(m log n log log n + occ) time.
On the Suffix Automaton with mismatches
2007
Cited by 2 (1 self)
In this paper we focus on the construction of the minimal deterministic finite automaton S_k that recognizes the set of suffixes of a word w up to k errors. We present an algorithm that makes use of the automaton S_k in order to accept in an efficient way the language of all suffixes of w up to k errors in every window of size r, where r is the value of the repetition index of w. Moreover, we give some experimental results on some well-known words, like prefixes of Fibonacci and Thue-Morse words, and we make a conjecture on the size of the suffix automaton with mismatches.
Using Transformation Techniques Towards Efficient Filtration of String Proximity Search of Biological Sequences
2003
Cited by 1 (0 self)
The problem of proximity search in biological databases is addressed. We study vector transformations and apply DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques to DNA sequence proximity search to reduce the search time of range queries. Our empirical results on a number of Prokaryote and Eukaryote DNA contig databases demonstrate up to a 50-fold filtration ratio of the search space and up to 13 times faster filtration. The proposed transformation techniques may easily be integrated as a preprocessing phase on top of existing similarity search heuristics such as BLAST [1], PatternHunter [11], FastA [17], and QUASAR [4] to efficiently prune non-relevant sequences. We study the precision of applying dimensionality reduction techniques for faster and more efficient range query searches, and discuss the imposed tradeoffs.
BFT: A Relational-based Bit Filtration Technique for Efficient Approximate String Joins in Biological Databases (Extended Version)
UCSB Technical Report, 2003
Cited by 1 (1 self)
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality.