A Guided Tour to Approximate String Matching
- ACM Computing Surveys, 1999
Cited by 306 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices in each case. We conclude with some future work directions and open problems.
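As a rough illustration of online searching under edit distance, as described in the abstract above, the following sketch uses the classical column-by-column dynamic program in which a match may start at any text position. The function name `approx_find` and its parameters are hypothetical, and this is only one of the simplest approaches, not necessarily one the survey recommends.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions in `text` where some substring ending
    there is within edit distance `k` of `pattern` (hypothetical helper)."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any text
    # substring ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # old col[i - 1], needed for the diagonal move
        # col[0] stays 0 so that a match can start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # pattern character left unmatched
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            hits.append(j)
    return hits


print(approx_find("abc", "xxabcxx", 1))  # → [3, 4, 5]
```

The sketch runs in time proportional to the product of the pattern and text lengths and keeps only one column of the table in memory.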
Complete inverted files for efficient text retrieval and analysis
- Journal of the ACM, 1987
Cited by 57 (1 self)
Given a finite set of texts S = {w_1, ..., w_k} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of times w occurs in S, and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
The Computational Power and Complexity of Constraint Handling Rules
- In Second Workshop on Constraint Handling Rules, at ICLP05, 2005
Cited by 47 (21 self)
Constraint Handling Rules (CHR) is a high-level rule-based programming language which is increasingly used for general purposes. We introduce the CHR machine, a model of computation based on the operational semantics of CHR. Its computational power and time complexity properties are compared to those of the well-understood Turing machine and Random Access Memory machine. This allows us to prove the interesting result that every algorithm can be implemented in CHR with the best known time and space complexity. We also investigate the practical relevance of this result and the constant factors involved. Finally we expand the scope of the discussion to other (declarative) programming languages.
Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)
- Proc. 3rd South American Workshop on String Processing (WSP'96), 1996
Cited by 46 (1 self)
String matching over a long text can be significantly sped up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n / log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2
Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Cited by 43 (6 self)
We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in sublinear expected time for any regular expression. This is the first such algorithm to be found with this complexity.
Sparse Suffix Trees
- In Proc. 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS v. 1090, 1996
Cited by 30 (1 self)
A sparse suffix tree is a suffix tree that represents only a subset of the suffixes of the text. This is in contrast to the standard suffix tree that represents all suffixes. By selecting a small enough subset, a sparse suffix tree can be made to fit the available storage, unfortunately at the cost of increased search times. The idea of sparse suffix trees goes back to PATRICIA tries. Evenly spaced sparse suffix trees represent every kth suffix of the text. In the paper, we give general construction and search algorithms for evenly spaced sparse suffix trees, and present their run time analysis, both in the worst and in the average case. The algorithms are further improved by using so-called dual suffix trees.
1 Introduction. Finding an index for a long text that makes fast string matching possible is one of the very central problems of text processing systems. Suffix trees offer a theoretically time-optimal solution. A suffix tree is a trie-like data structure that represents all su...
Finding Approximate Matches in Large Lexicons
- Software - Practice and Experience, 1995
Cited by 27 (5 self)
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.
Key words: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
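One way to picture the combination of an n-gram lexicon index with a string distance measure, as mentioned in the abstract above, is the sketch below. It is an assumption-laden illustration rather than the paper's method: the `$` padding character, the `min_shared` threshold, the use of plain edit distance for ranking, and all function names are hypothetical choices.

```python
from collections import defaultdict


def ngrams(word: str, n: int = 2) -> set[str]:
    """Set of n-grams of `word`, padded with '$' (an assumed marker)."""
    padded = f"${word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(lexicon, n: int = 2):
    """Inverted index mapping each n-gram to the lexicon words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def edit_distance(a: str, b: str) -> int:
    """Standard row-by-row edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def approximate_matches(query, index, n: int = 2, min_shared: int = 2, limit: int = 5):
    """Coarse step: keep words sharing at least `min_shared` n-grams with the
    query. Fine step: rank the survivors by edit distance, then alphabetically."""
    counts = defaultdict(int)
    for gram in ngrams(query, n):
        for word in index.get(gram, ()):
            counts[word] += 1
    candidates = [w for w, c in counts.items() if c >= min_shared]
    return sorted(candidates, key=lambda w: (edit_distance(query, w), w))[:limit]


index = build_index(["smith", "smyth", "smithe", "jones", "johnson"])
print(approximate_matches("smit", index))  # → ['smith', 'smithe', 'smyth']
```

The index limits the expensive distance computation to a small candidate set, which is the general motivation for pairing the two techniques.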
Lexicographical Indices for Text: Inverted files vs. PAT trees
- 1991
Cited by 23 (0 self)
We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. To appear in Information Retrieval: Data Structures and Algorithms, R.A. Baeza-Yates and W. Frakes, eds., Prentice-Hall.
1 Introduction. Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing (for example, signature files [FC87]). In this report we discuss lexicographical indices, in particular, two main data structures: inverted files and Pat trees. Our aim is to build an index for the text of size similar to or smaller than the text. Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated to each keyword. This ...
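To make the idea of a sorted index over text positions more concrete, here is a minimal suffix array sketch. It assumes every character position is an index point and builds the array by naive sorting, which is far slower than practical constructions; the function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.
    Naive construction by sorting suffix slices; fine for small inputs only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern` in `text`, via two binary searches."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # → [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # → [1, 3]
```

Because all suffixes sharing a prefix are adjacent in the sorted order, every occurrence of the pattern lies in one contiguous slice of the array.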
Scalable High-Speed Prefix Matching
- ACM Transactions on Computer Systems, 2001
Cited by 20 (4 self)
Finding the longest matching prefix from a database of keywords is an old problem with a number of applications, ranging from dictionary searches to advanced memory management to computational geometry. But perhaps today's most frequent best matching prefix lookups occur in the Internet, when forwarding packets from router to router. Internet traffic volume and link speeds are rapidly increasing; at the same time, a growing user population is increasing the size of routing tables against which packets must be matched. Both factors make router prefix matching extremely performance critical. In this paper, we introduce a taxonomy for prefix matching technologies, which we use as a basis for describing, categorizing, and comparing existing approaches. We then present in detail a fast scheme using binary search over hash tables, which is especially suited for matching long addresses, such as the 128-bit addresses proposed for use in the next generation Internet Protocol, IPv6. We also present optimizations that exploit the structure of existing databases to further improve access time and reduce storage space.
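The phrase "binary search over hash tables" in the abstract above can be sketched roughly as follows: one hash table per prefix length, a binary search over the sorted lengths, and marker entries that tell the search when a longer match may exist. This is a simplified illustration under several assumptions, not the paper's implementation: prefixes and addresses are Python strings of '0' and '1', each table entry stores the best real prefix known for that key, and all names are hypothetical.

```python
def build_tables(prefixes: dict[str, str]):
    """`prefixes` maps a bit-string prefix to a next hop (assumed layout).
    Returns the sorted prefix lengths and one hash table per length."""
    lengths = sorted({len(p) for p in prefixes})
    tables = {n: {} for n in lengths}

    def longest_real_prefix(bits: str):
        # Build-time helper: longest stored prefix that is a prefix of `bits`.
        for n in range(len(bits), -1, -1):
            if bits[:n] in prefixes:
                return bits[:n]
        return None

    for p in prefixes:
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            n = lengths[mid]
            if n < len(p):
                # A lookup passing through this length must be steered toward
                # longer prefixes, so leave a marker carrying the best match.
                tables[n].setdefault(p[:n], longest_real_prefix(p[:n]))
                lo = mid + 1
            elif n == len(p):
                tables[n][p] = p  # a real prefix is its own best match
                break
            else:
                hi = mid - 1
    return lengths, tables


def lookup(addr: str, lengths, tables, prefixes):
    """Next hop of the longest stored prefix of `addr`, or None."""
    best = None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        n = lengths[mid]
        table = tables[n]
        key = addr[:n]
        if key in table:
            if table[key] is not None:
                best = table[key]  # remember the best match seen so far
            lo = mid + 1           # hit: try longer prefixes
        else:
            hi = mid - 1           # miss: try shorter prefixes
    return prefixes[best] if best is not None else None


routes = {"1": "A", "10": "B", "1011": "C", "0110": "D"}
lengths, tables = build_tables(routes)
print(lookup("10110000", lengths, tables, routes))  # → C
print(lookup("11000000", lengths, tables, routes))  # → A
```

Each lookup probes a number of tables logarithmic in the number of distinct prefix lengths, which is why the approach suits long addresses.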
Fast Mergeable Integer Maps
- In Workshop on ML, 1998
Cited by 19 (1 self)
Finite maps are ubiquitous in many applications, but perhaps nowhere more so than in compilers and other language processors. In these applications, three operations on finite maps dominate all others: looking up the value associated with a key, inserting a new binding, and merging two finite maps. Most implementations of finite maps in functional languages are based on balanced binary search trees, which perform well on the first two, but poorly on the third. We describe an implementation of finite maps with integer keys that performs well in practice on all three operations. This data structure is not new -- indeed, it is thirty years old this year -- but it deserves to be more widely known.

