## Text Retrieval: Theory and Practice (1992)

Venue: | 12th IFIP World Computer Congress, volume I |

Citations: | 46 - 14 self |

### BibTeX

@INPROCEEDINGS{Baeza-yates92textretrieval:,

author = {Ricardo A. Baeza-Yates},

title = {Text Retrieval: Theory and Practice},

booktitle = {12th IFIP World Computer Congress, volume I},

year = {1992},

pages = {465--476},

publisher = {Elsevier Science}

}

### Abstract

We present the state of the art of the main component of text retrieval systems: the searching engine. We outline the main lines of research and the issues involved. We survey recently published results for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.

1597 Shaks. Lover's Compl. 2 From off a hill whose concaue wombe reworded A plaintfull story from a sistring vale. OED2, reword, sistering 1

1 Introduction

Full text retrieval systems are becoming a popular way of providing support for on-line text. Their main advantage is that they avoid the complicated and expensive process of semantic indexing. From the end-user point of view, full text searching of on-line documents is appealing because a valid query is just any word or sentence of the document. However, when the desired answer cannot be obtained with a simple query, the user must perform his/her own semantic processing to guess w...

### Citations

644 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1990
Citation Context ... the whole index degenerates into a single array (PAT array) of external nodes ordered lexicographically by sistrings. With some additions this idea was independently discovered by Manber and Myers [MM90], where it is called suffix arrays. The size of the index is now 4n bytes, where n is the number of index points. For the OED we have approximately 475Mb, if we only index word beginnings. There is a s... |

627 | Fast pattern matching in strings
- Knuth, Morris, et al.
- 1977
Citation Context ...seems that other computational models are as good or perhaps better than just using character comparisons. The first important result on string searching is the so-called Knuth-Morris-Pratt algorithm [KMP77]. This algorithm, discovered around 1970, was the first to achieve linear worst-case time. In [KMP77] there is a very nice historical account of this algorithm. The second result is due to Boyer and M... |

572 | A Fast String-Searching Algorithm
- Boyer, Moore
- 1977
Citation Context ...s algorithm, discovered around 1970, was the first to achieve linear worst-case time. In [KMP77] there is a very nice historical account of this algorithm. The second result is due to Boyer and Moore [BM77], who developed a fast algorithm on average, achieving O(log mn/m) comparisons. After that, several papers were published on the topic. However, recently it seems that there are an increasing number of... |

224 | A new approach to text searching
- Baeza-Yates, Gonnet
- 1992
Citation Context ...b] and can be easily extended to string matching with classes (every element of the pattern is a set of symbols, rather than one symbol), string matching with mismatches and multiple string searching [BYG89b]. Recently [WM91, WM92] the idea has been extended to string matching with errors, and implemented as "agrep", the fastest tool for searching through files, even when we allow errors. 3.3 Counting Algo... |

162 | Handbook of Algorithms and Data Structures
- Gonnet
- 1984
Citation Context ... the number of index-points or sistrings. This variant is called a Patricia tree. This data structure easily supports prefix and range searching, as well as proximity and regular expression searching [GBY91]. Although a Patricia tree needs O(n) space, the actual constant is important in practical applications. We need at least two pointers for every internal node, and one pointer for every external node.... |

153 | Approximate string matching with q-grams and maximal matches - Ukkonen - 1992 |

130 | agrep - A Fast Approximate Pattern-Matching Tool - Wu, Manber - 1991 |

107 | A very fast substring search algorithm - Sunday - 1990 |

103 | Signature files: an access method for documents and its analytical performance evaluation - Faloutsos, Christodoulakis - 1984 |

77 | An improved algorithm for approximate string matching - Galil, Park - 1990 |

66 | Two algorithms for approximate string matching in static texts - Jokinen, Ukkonen - 1991 |

53 | Two-way string matching
- Crochemore, Perrin
- 1991
Citation Context ...e, the main results are: • improvements to the Boyer-Moore algorithm, through alphabet transformations [BY89b, BY89a], sensitivity to the distribution of the text [BY89b, BY89c, Sun90], word theory [CP91], adaptivity [Smi91], and a taxonomy of string searching algorithms [HS91]; • fast algorithms for long patterns based on n-grams [KST91]; • several new algorithms for string matching with mismatch... |

48 | Efficient pattern matching with scaling - Amir, Landau, et al. - 1992 |

48 | Theoretical and empirical comparisons of approximate string matching algorithms - Chang, Lampe - 1992 |

46 | Approximate string matching in sublinear expected time - Chang, Lawler - 1990 |

46 | Fast string searching
- Hume, Sunday
- 1991
Citation Context ...rough alphabet transformations [BY89b, BY89a], sensitivity to the distribution of the text [BY89b, BY89c, Sun90], word theory [CP91], adaptivity [Smi91], and a taxonomy of string searching algorithms [HS91]; • fast algorithms for long patterns based on n-grams [KST91]; • several new algorithms for string matching with mismatches, based on the Boyer-Moore approach, which are faster and more practical... |

42 | Description and performance analysis of signature file methods for office filing - Faloutsos, Christodoulakis - 1987 |

40 | An algorithm for string matching with a sequence of don’t cares
- Manber, Baeza-Yates
- 1991
Citation Context ...sion searching, the time increases by a factor of O(log n) [BYG89a]. Details about proximity searching (searching for the occurrence of a string near another string) are given by Manber and Baeza-Yates [MBY91]. For this problem, it is possible to know the number of occurrences in logarithmic time, and all the answers in O(n^{1/4}) time plus the number of occurrences. In both cases, O(n) extra space is neede... |

40 | Multikey access methods based on superimposed coding techniques - Sacks-Davis, Kent, et al. - 1987 |

37 | A comparison of approximate string matching algorithms - Jokinen, Tarhio, et al. - 1996 |

31 | Fast and practical approximate pattern matching
- Baeza-Yates, Perleberg
- 1996
Citation Context ... Figure 5: Searching example for the counting approach. This idea is implicit in [MW89], where it is used for string matching with insertions and deletions only. It was independently discovered in [BYP91], which presents the algorithm mentioned above for mismatches. This technique is also very fast and does not use character comparisons. 4 Retrieval Algorithms for Indexed Text In this section we struc... |

27 | Fast parallel and serial multidimensional approximate array matching - Amir, Landau - 1991 |

27 | Fast text searching with errors - Wu, Manber - 1991 |

24 | Adaptive dictionary matching
- Amir, Farach
- 1991
Citation Context ...ariations of approximate string matching [WMM91]; • the derivation of classical and improved algorithms through formal proofs of program correctness [Col91b]; • adaptive multiple string searching [AM91]. On the practical side, the main results are: • improvements to the Boyer-Moore algorithm, through alphabet transformations [BY89b, BY89a], sensitivity to the distribution of the text [BY89b, BY89c... |

24 | Efficient Text Searching - Baeza-Yates - 1989 |

24 | Tight Bounds on the Complexity of the Boyer-Moore String Matching Algorithm
- Cole
- 1994
Citation Context ...rithms or proving some previously open problems. Among the theoretical results, we should mention: • a 3n upper and lower bound for the worst case number of comparisons of the Boyer-Moore algorithm [Col91a] and several papers dealing with the average case analysis of this algorithm [BYGR90, Sch88]; • the (4/3)n upper bound in the worst case number of comparisons with a (1 + 1/(2m))n lower bound for any ... |

24 | Simple and efficient string matching with k mismatches - Grossi, Luccio - 1989 |

20 | Efficient text searching of regular expressions
- Baeza-Yates, Gonnet
- 1989
Citation Context ... (sistring). This represents a significant economy in space at the cost of a modest deterioration in access time [GBYS92]. For regular expression searching, the time increases by a factor of O(log n) [BYG89a]. Details about proximity searching (searching for the occurrence of a string near another string) are given by Manber and Baeza-Yates [MBY91]. For this problem, it is possible to know the number of occ... |

19 | Boyer-Moore approach to approximate string matching - Tarhio, Ukkonen - 1990 |

18 | Efficient 2-dimensional approximate matching of nonrectangular figures - Amir, Farach - 1991 |

18 | Fast string matching with mismatches - Baeza-Yates, Gonnet - 1994 |

18 | Experiments with a very fast substring search algorithm
- Smith
- 1991
Citation Context ... are: • improvements to the Boyer-Moore algorithm, through alphabet transformations [BY89b, BY89a], sensitivity to the distribution of the text [BY89b, BY89c, Sun90], word theory [CP91], adaptivity [Smi91], and a taxonomy of string searching algorithms [HS91]; • fast algorithms for long patterns based on n-grams [KST91]; • several new algorithms for string matching with mismatches, based on the Boy... |

18 | A technique for two-dimensional pattern matching - Zhu, Takaoka - 1989 |

17 | Correctness and efficiency of the pattern matching algorithms
- Colussi
- 1991
Citation Context ... that is still open; • improvement of several variations of approximate string matching [WMM91]; • the derivation of classical and improved algorithms through formal proofs of program correctness [Col91b]; • adaptive multiple string searching [AM91]. On the practical side, the main results are: • improvements to the Boyer-Moore algorithm, through alphabet transformations [BY89b, BY89a], sensitivit... |

15 | Fast algorithms for two-dimensional and multiple pattern matching - Baeza-Yates, Régnier - 1990 |

15 | Incremental alignment algorithms and their applications - Myers - 1986 |

13 | On the exact complexity of string matching
- Colussi, Galil, et al.
Citation Context ...e case analysis of this algorithm [BYGR90, Sch88]; • the (4/3)n upper bound in the worst case number of comparisons with a (1 + 1/(2m))n lower bound for any comparison-based string matching algorithm [CGG90]; • some results concerning the maximal number of states of a Boyer-Moore type automaton [BYR90, Cho90, Bru91], a problem that is still open; • improvement of several variations of approximate str... |

12 | Unstructured data bases or very efficient text searching
- Gonnet
- 1983
Citation Context ...date the index. Text can be viewed as a very long string of data. Often text has little or no structure, and in many applications we wish to process the text without concern for the structure. Gonnet [Gon83] used the term unstructured database to refer to this type of data. Examples of such collections are: dictionaries, legal cases, articles on wire services, scientific papers, etc. Text can instead be ... |

10 | Improved string searching - Baeza-Yates - 1989 |

10 | Examples of PAT applied to the Oxford English Dictionary - Gonnet - 1987 |

10 | Fast string matching using an n-gram algorithm
- Kim, Shawe-Taylor
- 1994
Citation Context ...he distribution of the text [BY89b, BY89c, Sun90], word theory [CP91], adaptivity [Smi91], and a taxonomy of string searching algorithms [HS91]; • fast algorithms for long patterns based on n-grams [KST91]; • several new algorithms for string matching with mismatches, based on the Boyer-Moore approach, which are faster and more practical [BY89b, BYG92, GL89, TU90]; • algorithms based on partitionin... |

9 | String searching algorithms revisited - Baeza-Yates - 1989 |

9 | All-against-all sequence matching - Baeza-Yates, Gonnet - 1990 |

8 | Efficient searching of text and pictures (extended abstract)
- Gonnet
- 1988
Citation Context ...te String Model Let us assume that the text to be searched is a single string and padded at its right end with an infinite number of null (or any special) characters. A semi-infinite string (sistring) [Gon88] is the sequence of characters starting at any position of the text and continuing to the right. For example, if the text is The traditional approach for searching a text is ... the following are some... |

7 | On the expected sublinearity of the Boyer–Moore algorithm - Schaback - 1988 |

6 | A Text Searching System --- PAT 3.3 User's Guide - Fawcett - 1989 |

5 | The New Oxford English Dictionary Project at the University of Waterloo, in Computational Lexicology and Lexicography: Special Issue Dedicated to Bernard Quemada (edited by
- Berg, Gonnet, et al.
- 1991
Citation Context ...ges again. For example, this is the case when searching the Oxford English Dictionary (OED). The computerization of this dictionary was carried out between 1985 and 1990 at the University of Waterloo [BGT88]. On the other hand, dynamic text is text that changes too frequently to justify preprocessing, for example, in text-editing applications. In this case, we must use search algorithms that scan the tex... |

5 | Analysis of Boyer-Moore-Type String Searching Algorithms - Baeza-Yates, Gonnet, et al. - 1990 |

5 | Signature files: An integrated access method for text and attributes, suitable for optical disk storage
- Faloutsos
- 1988
Citation Context ...arch achieves reasonable answer time, hashing is used to build a compact version of the text (between 10% and 20% of the original size). For this task, the text is divided into words and blocks of words [Fal88], transforming the text into a sequence of bits. To search in this index, first the query is transformed to a bit string, and that string is sequentially searched in the index (or signature file). Bec... |

5 | Playing detective with full text searching software
- Raymond, Fawcett
- 1990
Citation Context ...about the command syntax. Thus, it is possible to solve many fact-finding problems using a full text system, even though it may provide inadequate recall, especially when the data itself is redundant [RF90]. The main component of a free-text retrieval system is the text searching engine. Formally, the text searching problem can be defined as follows: Given a text string t and a query (pattern) q, locate... |
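
The citation contexts above describe a sistring as the sequence of characters starting at a text position and continuing to the right, and a PAT array as the index points ordered lexicographically by their sistrings. The following is a minimal Python sketch of that idea, not the paper's implementation. The function names `build_pat_array` and `prefix_search` are hypothetical, and treating every regex word start as an index point is an assumption made for illustration (the contexts only say that word beginnings may be indexed).

```python
import re


def build_pat_array(text):
    """Return index points (word starts) ordered by their sistrings.

    Assumption: an index point is any position matched by r"\\b\\w".
    Sorting by text[i:] is simple but not efficient for large texts.
    """
    index_points = [m.start() for m in re.finditer(r"\b\w", text)]
    return sorted(index_points, key=lambda i: text[i:])


def prefix_search(text, pat_array, prefix):
    """Return, in text order, the index points whose sistring starts with prefix.

    Two binary searches find the contiguous range of matching entries.
    Comparing only the first len(prefix) characters of each sistring
    preserves the sorted order, so binary search remains valid.
    """
    m = len(prefix)

    def head(k):
        start = pat_array[k]
        return text[start:start + m]

    # Lower bound: first entry whose truncated sistring is >= prefix.
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if head(mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first entry whose truncated sistring is > prefix.
    hi = len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if head(mid) <= prefix:
            lo = mid + 1
        else:
            hi = mid

    return sorted(pat_array[first:lo])


sample = "the traditional approach for searching a text is the traditional one"
pat = build_pat_array(sample)
print(prefix_search(sample, pat, "tr"))
```

Each query costs a logarithmic number of string comparisons in the number of index points, and the index stores one integer per index point, which is consistent with the 4n-byte figure quoted in the context for 4-byte positions.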