Results 11 - 20 of 111
Text Retrieval: Theory and Practice
In 12th IFIP World Computer Congress, volume I, 1992
Cited by 43 (14 self)
We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and issues involved. We survey recently published results for text searching and we explore the gap between theoretical vs. practical algorithms. The main observation is that simpler ideas are better in practice.
1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1
1 Introduction
Full text retrieval systems are becoming a popular way of providing support for on-line text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of on-line documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...
Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Cited by 43 (6 self)
We present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in sublinear expected time for any regular expression. This is the first such algorithm to be found with this complexity.
Dynamical Sources in Information Theory: A General Analysis of Trie Structures
ALGORITHMICA, 1999
Cited by 43 (6 self)
Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of array-tries, list tries, and bst-tries ("ternary search tries"). The size and the search costs of the corresponding representations are analysed precisely in the average case, while a complete distributional analysis of height of tries is given. The unifying data model used is that of dynamical sources and it encompasses classical models like those of memoryless sources with independent symbols, of finite Markov chains, and of nonuniform densities. The probabilistic behaviour of the main parameters, namely size, path length, or height, appears to be determined by two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics are themselves related in a natural way to spectral properties of specific transfer operators of the Ruelle type.
FASTER SUFFIX SORTING
1999
Cited by 42 (2 self)
We propose a fast and memory efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has important applications in data compression as well as string matching. Our
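To make the problem statement concrete, here is a minimal Python sketch of suffix sorting and of its use for string matching. It is a naive baseline that sorts by comparing whole suffixes, not the algorithm proposed in the paper; the function names `suffix_array` and `find` are illustrative assumptions.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive baseline: comparing whole suffixes makes this far slower and more
    memory hungry than dedicated suffix sorting algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return all start positions of `pattern` in `text`, using suffix array `sa`."""
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Matching suffixes are contiguous in the sorted order.
    hits = []
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


sa = suffix_array("banana")
print(sa)                         # [5, 3, 1, 0, 4, 2]
print(find("banana", sa, "ana"))  # [1, 3]
```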
An Optimal Algorithm for Generating Minimal Perfect Hash Functions
Information Processing Letters, 1992
Cited by 37 (0 self)
A new algorithm for generating order preserving minimal perfect hash functions is presented. The algorithm is probabilistic, involving generation of random graphs. It uses expected linear time and requires a linear number of words to represent the hash function, and thus is optimal up to constant factors. It runs very fast in practice.
Keywords: Data structures, probabilistic algorithms, analysis of algorithms, hashing, random graphs
Large Text Searching Allowing Errors
1997
Cited by 35 (17 self)
We present a full inverted index for exact and approximate string matching in large texts. The index is composed of a table containing the vocabulary of words of the text and a list of positions in the text corresponding to each word. The size of the table of words is usually much less than 1% of the text size and hence can be kept in main memory, where most query processing takes place. The text, on the other hand, is not accessed at all. The algorithm permits a large number of variations of the exact and approximate string search problem, such as phrases, string matching with sets of characters (range and arbitrary set of characters, complements, wild cards), approximate search with nonuniform costs and arbitrary regular expressions. The whole index can be built in linear time, in a single sequential pass over the text, takes near 1/3 the space of the text, and retrieval times are near $O(\sqrt{n})$ for typical cases. Experimental results show that the algorithm works well in practice...
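The structure described here, a vocabulary table with a position list per word, can be illustrated with a short Python sketch. It is a simplified illustration and not the paper's implementation: the tokenization rule, the function names, and the plain unit-cost edit distance used for approximate matching are assumptions. The point it shows is that an approximate query scans only the vocabulary and never the text.

```python
import re
from collections import defaultdict


def build_index(text):
    """Single pass over the text: map each word to its list of start positions."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    return index


def edit_distance(a, b):
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def search(index, word, k=0):
    """Return {vocabulary word: positions} for words within k errors of `word`."""
    return {w: pos for w, pos in index.items() if edit_distance(w, word) <= k}


index = build_index("the cat sat on the mat")
print(index["the"])                     # [0, 15]
print(sorted(search(index, "cat", 1)))  # ['cat', 'mat', 'sat']
```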
Efficient Implementation of Suffix Trees
1995
Cited by 32 (3 self)
In this article we discuss how the suffix tree can be used for string searching
Improved Behaviour of Tries by Adaptive Branching
Cited by 30 (8 self)
We introduce and analyze a method to reduce the search cost in tries. Traditional trie structures use branching factors at the nodes that are either fixed or a function of the number of elements. Instead, we let the distribution of the elements guide the choice of branching factors. This is accomplished in a strikingly simple way: in a binary trie, the i highest complete levels are replaced by a single node of degree $2^i$; the compression is repeated in the subtries. This structure, the level-compressed trie, inherits the good properties of binary tries with respect to neighbour and range searches, while the external path length is significantly decreased. It also has the advantage of being easy to implement. Our analysis shows that the expected depth of a stored element is $\Theta(\log^* n)$ for uniformly distributed data.
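The level-compression step can be sketched in a few lines of Python. This is a minimal illustration under assumptions not stated in the abstract: keys are distinct bit strings of equal length, shared prefixes are skipped first (path compression) so that every internal node has at least two non-empty children, and nodes are plain tuples. The names `build` and `depth` are hypothetical.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def build(keys, pos=0):
    """Build a level-compressed trie over distinct, equal-length bit strings.

    A leaf is the key itself; an internal node is (skip, i, children),
    where children has 2**i entries indexed by the next i bits.
    """
    if len(keys) == 1:
        return keys[0]
    # Path compression: skip the bits shared by every key in this subtrie.
    skip = len(commonprefix([k[pos:] for k in keys]))
    pos += skip
    # Level compression: find the largest i such that all 2**i
    # bit patterns of length i occur, i.e. the i highest levels are complete.
    i = 1
    while len({k[pos:pos + i + 1] for k in keys}) == 2 ** (i + 1):
        i += 1
    groups = {}
    for k in keys:
        groups.setdefault(k[pos:pos + i], []).append(k)
    children = [build(groups[format(v, f"0{i}b")], pos + i) for v in range(2 ** i)]
    return (skip, i, children)


def depth(node, key):
    """Return the number of internal nodes visited to reach `key`, or None if absent."""
    pos = steps = 0
    while not isinstance(node, str):
        skip, i, children = node
        pos += skip
        node = children[int(key[pos:pos + i], 2)]
        pos += i
        steps += 1
    return steps if node == key else None


trie = build(["000", "001", "010", "011", "100", "111"])
print(trie[1])             # 2: the two highest levels are complete, so the root has degree 4
print(depth(trie, "100"))  # 1
print(depth(trie, "000"))  # 2
```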
A New Efficient Radix Sort
1994
Cited by 29 (7 self)
We present new improved algorithms for the sorting problem. The algorithms are not only efficient but also clear and simple. First, we introduce Forward Radix Sort which combines the advantages of traditional left-to-right and right-to-left radix sort in a simple manner. We argue that this algorithm will work very well in practice. Adding a preprocessing step, we obtain an algorithm with attractive theoretical properties. For example, n binary strings can be sorted in $\Theta\left(n \log\left(\frac{B}{n \log n} + 2\right)\right)$ time, where B is the minimum number of bits that have to be inspected to distinguish the strings. This is an improvement over the previously best known result by Paige and Tarjan. The complexity may also be expressed in terms of H, the entropy of the input: n strings from a stationary ergodic process can be sorted in $\Theta\left(n \log\left(\frac{1}{H} + 1\right)\right)$ time, an improvement over the result recently presented by Chen and Reif.
Improved Probabilistic Verification by Hash Compaction
In Advanced Research Working Conference on Correct Hardware Design and Verification Methods, 1995
Cited by 28 (7 self)
We present and analyze a probabilistic method for verification by explicit state enumeration, which improves on the "hashcompact" method of Wolper and Leroy. The hashcompact method maintains a hash table in which compressed values for states instead of full state descriptors are stored. This method saves space but allows a non-zero probability of omitting states during verification, which may cause verification to miss design errors (i.e. verification may produce "false positives"). Our method improves on Wolper and Leroy's by calculating the hash and compressed values independently, and by using a specific hashing scheme that requires a low number of probes in the hash table. The result is a large reduction in the probability of omitting a state. Hence, we can achieve a given upper bound on the probability of omitting a state using fewer bits per compressed state. For example, we can reduce the number of bytes stored for each state from the eight recommended by Wolper and Leroy to o...

