Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
- GENOME BIOLOGY
, 2009
An Efficient Index Structure for String Databases
- In VLDB
, 2001
Cited by 59 (8 self)
We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems. In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients. We then index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function that is a lower bound on the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes a significant amount of the database (typically 50-95%), thus reducing both the disk I/O cost and the CPU cost significantly.
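The pruning idea in this abstract (filter with a cheap distance that never exceeds the edit distance, then verify only the survivors) can be sketched as follows. This is a minimal illustration, not the paper's wavelet and MBR construction: it swaps in a plain character-frequency bound, and the names `frequency_distance` and `range_query` are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def frequency_distance(a, b):
    """Lower bound on edit distance from character counts alone.

    One edit changes the surplus and the deficit by at most 1 each,
    so max(surplus, deficit) can never exceed the edit distance.
    """
    fa, fb = Counter(a), Counter(b)
    surplus = sum((fa - fb).values())  # Counter subtraction keeps positive counts only
    deficit = sum((fb - fa).values())
    return max(surplus, deficit)


def range_query(db, query, radius):
    """Return (hits, number of strings that needed the expensive check)."""
    hits, verified = [], 0
    for s in db:
        if frequency_distance(s, query) > radius:
            continue  # safe to prune: the true distance is at least this large
        verified += 1
        if edit_distance(s, query) <= radius:
            hits.append(s)
    return hits, verified


db = ["ACGT", "AAAA", "ACGA", "TTTT", "ACCT"]
hits, verified = range_query(db, "ACGT", 1)
```

In this toy run two of the five strings are discarded without computing an edit distance. A real index would store the bounds in a spatial structure instead of scanning the whole database.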
Musical Information Retrieval Using Musical Parameters
- In Proceedings of the 1998 International Computer Music Conference
, 1998
Cited by 25 (8 self)
The application domain for automatic retrieval of melodic excerpts in musical collections is wide; e.g., it would facilitate the work of a music researcher trying to find specific features in music. In this paper we consider several parts of the retrieval process. We present our representation for musical data. This inner representation is converted from MIDI files. For the matching we use a particular encoding (two-dimensional relative code), which is formed out of the inner representation. This encoding can be interpreted differently depending on the way the key is given. Furthermore, in the matching phase we use an efficient indexing structure, well known in string pattern matching, called the suffix trie.
1 Introduction
In earlier research concerning musical data representation, researchers seemed to be rather sensitive to the delicate details of different styles of music. One example of such a meticulous approach is Leo Plenckers' encoding system for Spanish med...
Approximate Pattern Matching with Samples
- In Proc. of ISAAC'94
, 1994
Cited by 22 (1 self)
In this paper we simplify the algorithm by Chang and Lawler for the approximate string matching problem by adopting the concept of sampling. We give a more general analysis of the expected time of the simplified algorithm for the one-dimensional case under a non-uniform probability distribution, and we show that our method can easily be generalized to the two-dimensional approximate pattern matching problem with sublinear expected time.
1 Introduction
Since the inaugural papers on string matching algorithms were published by Knuth, Morris and Pratt [11] and Boyer and Moore [5], the problem has diversified in various directions. Let us call string matching one-dimensional pattern matching. One direction is two-dimensional pattern matching and another is approximate pattern matching, where up to k differences are allowed for a match. Yet another theme is two-dimensional approximate pattern matching. There are numerous papers in these new research areas. We cite just a few of them to compare...
Tries for Approximate String Matching
- IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 22 (1 self)
Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments i...
H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Email: {shang, tim}@cs.mcgill.ca
SEMEX - An Efficient Music Retrieval Prototype
- In First International Symposium on Music Information Retrieval (ISMIR’2000)
, 2000
Cited by 20 (2 self)
We present an efficient prototype for music information retrieval. The prototype uses bit-parallel algorithms for locating transposition-invariant matches of monophonic query melodies within monophonic or polyphonic music stored in a database. When dealing with monophonic music, we employ a fast approximate bit-parallel algorithm with special edit distance metrics.
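To make "transposition-invariant" concrete: a melody and any transposed copy of it share the same sequence of pitch intervals, so matching on intervals ignores the key. The sketch below assumes monophonic melodies given as lists of MIDI pitch numbers and uses a naive exact scan. It is not the prototype's bit-parallel or approximate algorithm, and the function names are hypothetical.

```python
def intervals(pitches):
    """Successive pitch differences; unchanged when every pitch shifts by a constant."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def find_transposed(query, melody):
    """Start indices in `melody` where `query` occurs in some transposition."""
    if not query:
        return []
    q, m = intervals(query), intervals(melody)
    if not q:
        # A single-note query matches every note under transposition.
        return list(range(len(melody)))
    return [i for i in range(len(m) - len(q) + 1) if m[i:i + len(q)] == q]


c_major_scale = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI pitches C4..C5
query = [67, 69, 71]                               # two whole steps, starting on G4
matches = find_transposed(query, c_major_scale)
```

Here the query is found starting at C, F and G, because each of those notes is followed by two whole steps in the major scale.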
Search Algorithms for Biosequences Using Random Projection
, 2001
Cited by 8 (2 self)
Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases
- In Third IEEE Symposium on BioInformatics and BioEngineering (BIBE’03)
, 2003
Cited by 7 (0 self)
We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compare their (a) approximation quality for k-Nearest Neighbor (k-NN) queries, (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of the transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
On-line approximate string searching algorithms: survey and experimental results
- International Journal of Computer Mathematics
, 2002
Cited by 7 (5 self)
The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well-known sequential approximate string searching algorithms. We consider algorithms based on different approaches, including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.
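The two problem classes named in this abstract differ in what counts as an error: k mismatches allows only substitutions (Hamming distance), while k differences also allows insertions and deletions (edit distance). A minimal sketch of both follows, using a naive scan for mismatches and the standard dynamic-programming column update for differences. It does not reproduce any particular algorithm from the survey, and the function names are hypothetical.

```python
def search_k_mismatches(pattern, text, k):
    """Start positions where pattern matches a text window with at most k substitutions."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(p != t for p, t in zip(pattern, text[i:i + m])) <= k]


def search_k_differences(pattern, text, k):
    """End positions j such that some substring of text ending at j is
    within edit distance k of pattern (insertions and deletions allowed)."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, tc in enumerate(text):
        new = [0]             # a match may start anywhere in the text
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,                          # skip text char
                           new[i - 1] + 1,                      # skip pattern char
                           col[i - 1] + (pattern[i - 1] != tc)))  # match or substitute
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


# "ac" is one deletion away from "abc", but no 3-character window of
# "xacx" is within one substitution of "abc".
mismatch_hits = search_k_mismatches("abc", "xacx", 1)
difference_hits = search_k_differences("abc", "xacx", 1)
```

The example text contains an occurrence under the k-differences model that the k-mismatches model cannot report, which is why the two classes are studied separately.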
Boyer-Moore strategy to efficient approximate string matching
, 2007
Cited by 6 (1 self)
We propose a simple but efficient algorithm for searching all occurrences of a pattern or a class of patterns (length m) in a text (length n) with at most k mismatches. This algorithm relies on the Shift-Add algorithm of Baeza-Yates and Gonnet [6], which represents the current state of the search as a bit number and uses the ability of programming languages to handle bit words. The state representation should therefore not exceed the word size $\omega$, that is, $m(\lceil \log_2 (k+1) \rceil + 1) \le \omega$. The algorithm consists of a preprocessing step and a searching step. It is linear and performs 3n operations during the searching step. Notions of shift and character skip, found in the Boyer-Moore (BM) [9] approach, are introduced in this algorithm. Provided that the considered alphabet is large enough (compared to the pattern length), the average number of operations performed by our algorithm during the searching step becomes $n\left(2 + \frac{k+4}{m-k}\right)$.
1 Introduction
Our purpose is approximate m...
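The Shift-Add state described in the abstract packs one small mismatch counter per pattern position into a single integer, with each counter given ⌈log2(k+1)⌉ + 1 bits. The sketch below illustrates only that underlying idea, without the Boyer-Moore shifts or character skips the paper adds. The function name and the separate overflow word are choices made for this illustration, and Python's unbounded integers mean the word-size limit is not enforced here.

```python
def shift_add_search(pattern, text, k):
    """Start positions where pattern occurs in text with at most k mismatches."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    b = k.bit_length() + 1           # ceil(log2(k+1)) + 1 bits per counter
    full = (1 << (b * m)) - 1        # keeps exactly m counters
    ov_mask = sum(1 << (b * i + b - 1) for i in range(m))  # top bit of each counter
    # Preprocessing: for each character, a 1 in counter i if pattern[i] differs.
    table = {c: sum(1 << (b * i) for i, p in enumerate(pattern) if p != c)
             for c in set(pattern) | set(text)}
    state = overflow = 0
    hits = []
    for j, c in enumerate(text):
        # Counter i moves to slot i+1 and absorbs the new comparison.
        state = ((state << b) + table[c]) & full
        # Remember counters that exceeded their range, then clear their
        # top bits so the next addition cannot carry into a neighbour.
        overflow = ((overflow << b) | (state & ov_mask)) & full
        state &= ~ov_mask
        # The last counter holds the mismatches for the window ending at j.
        if j >= m - 1 and ((state + overflow) >> (b * (m - 1))) <= k:
            hits.append(j - m + 1)
    return hits


one_error = shift_add_search("abc", "abdabc", 1)
exact = shift_add_search("ab", "xabab", 0)
```

With k = 1 each counter takes 2 bits, so a 3-character pattern needs 6 bits of state, in line with the word-size condition quoted above.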

