Results 1  10
of
61
A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 404 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
Complete inverted files for efficient text retrieval and analysis
 Journal of the ACM
, 1987
"... Abstract. Given a finite set of texts S = (wi, *.., wk) over some fixed finite alphabet 2, a complete inverted tile for S is an abstract data type that provides the functionsfind ( which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of t ..."
Abstract

Cited by 59 (1 self)
 Add to MetaCart
Abstract. Given a finite set of texts S = (wi, *.., wk) over some fixed finite alphabet 2, a complete inverted tile for S is an abstract data type that provides the functionsfind ( which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of times w occurs in S, and locations(w), which returns the set of positions where w occurs in S. A data structure. that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniformcost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the s&ix tree.
The Computational Power and Complexity of Constraint Handling Rules
 In Second Workshop on Constraint Handling Rules, at ICLP05
, 2005
"... Constraint Handling Rules (CHR) is a highlevel rulebased programming language which is increasingly used for general purposes. We introduce the CHR machine, a model of computation based on the operational semantics of CHR. Its computational power and time complexity properties are compared to thos ..."
Abstract

Cited by 50 (21 self)
 Add to MetaCart
Constraint Handling Rules (CHR) is a highlevel rulebased programming language which is increasingly used for general purposes. We introduce the CHR machine, a model of computation based on the operational semantics of CHR. Its computational power and time complexity properties are compared to those of the wellunderstood Turing machine and Random Access Memory machine. This allows us to prove the interesting result that every algorithm can be implemented in CHR with the best known time and space complexity. We also investigate the practical relevance of this result and the constant factors involved. Finally we expand the scope of the discussion to other (declarative) programming languages.
Fast Text Searching for Regular Expressions or Automaton Searching on Tries
"... We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in subline ..."
Abstract

Cited by 49 (6 self)
 Add to MetaCart
We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in sublinear expected time for any regular expression. This is the first such algorithm to be found with this complexity.
LempelZiv parsing and sublinearsize index structures for string matching (Extended Abstract)
 Proc. 3rd South American Workshop on String Processing (WSP'96
, 1996
"... String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinearsize index structure. The new structure is based on LempelZiv parsing ..."
Abstract

Cited by 48 (1 self)
 Add to MetaCart
String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinearsize index structure. The new structure is based on LempelZiv parsing of the text and has size linear in N, the size of the LempelZiv parse. For a text of length n, N = O(n = log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2
Sparse Suffix Trees
 In Proc. 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS v. 1090
, 1996
"... . A sparse suffix tree is a suffix tree that represents only a subset of the suffixes of the text. This is in contrast to the standard suffix tree that represents all suffixes. By selecting a small enough subset, a sparse suffix tree can be made to fit the available storage, unfortunately at the cos ..."
Abstract

Cited by 30 (1 self)
 Add to MetaCart
. A sparse suffix tree is a suffix tree that represents only a subset of the suffixes of the text. This is in contrast to the standard suffix tree that represents all suffixes. By selecting a small enough subset, a sparse suffix tree can be made to fit the available storage, unfortunately at the cost of increased search times. The idea of sparse suffix trees goes back to PATRICIA tries. Evenly spaced sparse suffix trees represent every kth suffix of the text. In the paper, we give general construction and search algorithms for evenly spaced sparse suffix trees, and present their run time analysis, both in the worst and in the average case. The algorithms are further improved by using socalled dual suffix trees. 1 Introduction Finding an index for a long text that makes fast string matching possible is one of the very central problems of text processing systems. Suffix trees offer a theoretically timeoptimal solution. A suffix tree is a trielike data structure that represents all su...
Finding Approximate Matches in Large Lexicons
 SOFTWARE  PRACTICE AND EXPERIENCE
, 1995
"... Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including ngrams and p ..."
Abstract

Cited by 30 (5 self)
 Add to MetaCart
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including ngrams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching. KEY WORDS: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
Matchmaking for online games and other latencysensitive P2P systems
 In SIGCOMM
, 2009
"... ABSTRACT – The latency between machines on the Internet can dramatically affect users ’ experience for many distributed applications. Particularly, in multiplayer online games, players seek to cluster themselves so that those in the same session have low latency to each other. A system that predicts ..."
Abstract

Cited by 27 (2 self)
 Add to MetaCart
ABSTRACT – The latency between machines on the Internet can dramatically affect users ’ experience for many distributed applications. Particularly, in multiplayer online games, players seek to cluster themselves so that those in the same session have low latency to each other. A system that predicts latencies between machine pairs allows such matchmaking to consider many more machine pairs than can be probed in a scalable fashion while users are waiting. Using a farreaching trace of latencies between players on over 3.5 million game consoles, we designed Htrae, a latency prediction system for game matchmaking scenarios. One novel feature of Htrae is its synthesis of geolocation with a network coordinate system. It uses geolocation to select reasonable initial network coordinates for new machines joining the system, allowing it to converge more quickly than standard network coordinate systems and produce substantially lower prediction error than stateoftheart latency prediction systems. For instance, it produces 90th percentile errors less than half those of iPlane and Pyxida. Our design is general enough to make it a good fit for other latencysensitive peertopeer applications besides game matchmaking.
Fast Mergeable Integer Maps
 In Workshop on ML
, 1998
"... Finite maps are ubiquitous in many applications, but perhaps nowhere more so than in compilers and other language processors. In these applications, three operations on finite maps dominate all others: looking up the value associated with a key, inserting a new binding, and merging two finite maps. ..."
Abstract

Cited by 23 (1 self)
 Add to MetaCart
Finite maps are ubiquitous in many applications, but perhaps nowhere more so than in compilers and other language processors. In these applications, three operations on finite maps dominate all others: looking up the value associated with a key, inserting a new binding, and merging two finite maps. Most implementations of finite maps in functional languages are based on balanced binary search trees, which perform well on the first two, but poorly on the third. We describe an implementation of finite maps with integer keys that performs well in practice on all three operations. This data structure is not new  indeed, it is thirty years old this year  but it deserves to be more widely known.
Lexicographical Indices for Text: Inverted files vs. PAT trees
, 1991
"... We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. to appear in Information Retrieval: Data Structures and Algori ..."
Abstract

Cited by 23 (0 self)
 Add to MetaCart
We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. to appear in Information Retrieval: Data Structures and Algorithms, R.A. BaezaYates and W. Frakes, eds., PrenticeHall. 1 1 Introduction Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing (for example, signature files [FC87]). In this report we discuss lexicographical indices, in particular, two main data structures: inverted files and Pat trees. Our aim is to build an index for the text of size similar to or smaller than the text. Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated to each keyword. This ...