Results 1 -
7 of
7
Finding Approximate Matches in Large Lexicons
- SOFTWARE - PRACTICE AND EXPERIENCE
, 1995
"... Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and p ..."
Abstract
-
Cited by 27 (5 self)
- Add to MetaCart
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching. KEY WORDS: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
- ACM Transactions on Information Systems
, 1989
"... : The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Tr'esor de la Langue Fran¸caise on a CD-ROM is examined in this paper. The text alone of this dat ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
: The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Tr'esor de la Langue Fran¸caise on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishible from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files
- Proc. International Conference on Very Large Databases
, 1993
"... There are several advantages to be gained by storing the lexicon of a full text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. Our experiments show that this method ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
There are several advantages to be gained by storing the lexicon of a full text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. Our experiments show that this method provides an effective compromise between speed and space, running orders of magnitude faster than brute force search, but requiring less memory than other pattern-matching data structures; indeed, in some cases requiring less memory than would be consumed by a single pointer to each string. The pattern search method is based on text indexing techniques and is a successful adaptation of inverted files to main memory databases.
Searching BWT compressed text with the Boyer-Moore algorithm and binary search
- Proceedings, IEEE Data Compression Conference, 2002
, 2002
"... Abstract: This paper explores two techniques for on-line exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. We investigate two approaches. The first is an application of the Boyer-Moore algorithm (Boyer & Moore 1977) to a transformed string. The second app ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
Abstract: This paper explores two techniques for on-line exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. We investigate two approaches. The first is an application of the Boyer-Moore algorithm (Boyer & Moore 1977) to a transformed string. The second approach is based on the observation that the transform effectively contains a sorted list of all substrings of the original text, which can be exploited for very rapid searching using a variant of binary search. Both methods are faster than a decompress-and-search approach for small numbers of queries, and binary search is much faster even for large numbers of queries. 1
Using Bitmaps for Medium Sized Information Retrieval Systems
- Information Processing & Management
, 1990
"... : We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, effici ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
: We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, efficient and, relative to the customary concordance approach, inexpensive in storage costs. 1. Introduction Our ability to control textual information is being strongly influenced by a variety of technological advances. These include new means of storing and sharing information that makes possible and realistic an information system model in which large bodies of full text are compactly stored, widely distributed, and shared by a large number of interested persons. Such changes require a careful search for techniques that promise convenient and effective access to such textual databases. The research that is required in this environment differs from that traditional in Information Retrieval (IR)...
An Algorithm for Full Text Indexing
, 1992
"... A fast B-tree based indexing algorithm is presented. In some applications, such as full text indexing or indexing of very large tables, the new algorithm can be orders of magnitude faster than conventional B-tree insertion algorithms, while still allowing concurrent access. A similar algorithm c ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
A fast B-tree based indexing algorithm is presented. In some applications, such as full text indexing or indexing of very large tables, the new algorithm can be orders of magnitude faster than conventional B-tree insertion algorithms, while still allowing concurrent access. A similar algorithm can be used for deletion.
Comparing Inverted Files and Signature Files for Searching a Large Lexicon p
- Communications of the ACM
, 1996
"... Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.

