Results 1 - 6 of 6
Finding Approximate Matches in Large Lexicons
- SOFTWARE - PRACTICE AND EXPERIENCE
, 1995
Cited by 27 (5 self)
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching. KEY WORDS: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex
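The combination described in this abstract (a lexicon index used for coarse candidate selection, followed by a string distance for fine ranking) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the padding character, the function names, and the `min_shared` and `max_dist` thresholds are all assumptions made for the example, and plain Levenshtein distance stands in for whichever similarity measures the paper evaluates.

```python
from collections import defaultdict

def ngrams(word, n=2):
    """Set of character n-grams of word, padded with '#' at both ends (assumed convention)."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(lexicon, n=2):
    """Inverted index mapping each n-gram to the lexicon words that contain it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def approximate_matches(query, index, n=2, min_shared=2, max_dist=2):
    """Coarse step: words sharing >= min_shared n-grams. Fine step: rank by edit distance."""
    shared = defaultdict(int)
    for gram in ngrams(query, n):
        for word in index.get(gram, ()):
            shared[word] += 1
    candidates = [w for w, count in shared.items() if count >= min_shared]
    scored = [(edit_distance(query, w), w) for w in candidates]
    return sorted((d, w) for d, w in scored if d <= max_dist)

lexicon = ["receive", "recipe", "deceive", "relieve", "banana"]
index = build_index(lexicon)
matches = approximate_matches("recieve", index)
```

Only words that share n-grams with the query are ever passed to the distance function, which is the point of pairing an index with a more expensive matcher.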
An Evaluation of Phonetic Spell Checkers
- Mechanisms of Radiation Effects in Electronic Materials
, 2001
Cited by 12 (0 self)
In the work reported here, we describe a phonetic spell-checking algorithm, Phonetex, which integrates aspects of Soundex and its extension Phonix. It is designed to provide a phonetic component for an existing typographic spell checker. We increase the number of letter codes compared to Soundex and Phonix. We also integrate phonetic rules but use far fewer than Phonix, which was designed for South African name matching, or Rogers and Willett's Phonix extension, which was designed for 17th century spellings, as these include many rules that are redundant in a contemporary word-based domain. We evaluate our algorithm by comparing it to phonetic spell checkers, Soundex and Editex, and four benchmark spell checkers (Agrep, MS Word 97 & 2000 and UNIX 'ispell') using a list of phonetic spelling errors. We find that our approach has superior recall (accuracy) to the alternative approaches, although the higher recall is at the expense of precision (number of possible matches retrieved). We intend to integrate it into an existing spell checker, so precision will be improved by integration; thus high recall is the aim for our approach in this paper. Keywords: Data Cleaning, Phonetic Spell Checker, Phonetic Code Generation.
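For orientation, the Soundex baseline that this abstract compares against can be sketched as follows. This is the commonly documented American Soundex scheme, not Phonetex, whose letter codes and phonetic rules are not given in the abstract; the function name and the handling of non-letter characters are assumptions of this sketch.

```python
# Letter groups of classic Soundex; vowels and H, W, Y carry no code.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)

def soundex(name):
    """Return the 4-character Soundex code of name (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]  # assumption: drop non-letters
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]
```

Names that sound alike, such as Robert and Rupert, map to the same code, which is what makes the scheme usable as a coarse phonetic key.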
On the Development of Name Search Techniques for Arabic
, 2003
Cited by 5 (0 self)
The need for effective identity matching systems has led to extensive research in the area of name search. For the most part, such work has been limited to English and other Latin-based languages. Consequently, algorithms such as Soundex and n-gram matching are of limited utility for languages such as Arabic, which has vastly different morphologic features that rely heavily on phonetic information. The dearth of work in this field is partly caused by the lack of standardized test data. Consequently, we have built a collection of 7,939 Arabic names, along with 50 training queries and 111 test queries. We use this collection to evaluate a variety of algorithms, including a derivative of Soundex tailored to Arabic (ASOUNDEX), measuring effectiveness by using standard information retrieval measures. Our results show an improvement of 70% over existing approaches.
A Cross-Language Approach to Historic Document Retrieval
- In Proceedings 28th European Conference on Information Retrieval (ECIR 2006), LNCS 3936
, 2006
Cited by 1 (0 self)
Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives, like DigiCULT [2], make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience. Natural languages evolve over time, changing in pronunciation and spelling, and new words are introduced continuously, while older words
Fast retrieval of electronic messages that contain mistyped words or spelling errors
- IEEE Transactions on Systems, Man, and Cybernetics
, 1996
Cited by 1 (1 self)
This paper presents an index structure for retrieving electronic messages that contain mistyped words or spelling errors. Given a query string (e.g., a search key), we want to find those messages that approximately contain the query, i.e., certain inserts, deletes and mismatches are allowed when matching the query with a word (or phrase) in the messages. Our approach is to store the messages sequentially in a database and hash their "fingerprints" into a number of "fingerprint files." When the query is given, its fingerprints are also hashed into the files and a histogram of votes is constructed on the messages. We derive a lower bound, based on which one can prune a large number of nonqualifying messages (i.e., those whose votes are below the lower bound) during searching. The paper presents some experimental results, which demonstrate the effectiveness of the index structure and the lower bound.
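The vote-and-prune idea in this abstract can be illustrated with a rough sketch. The abstract does not define the paper's fingerprints or its lower bound, so the sketch substitutes two stand-ins and should not be read as the authors' method: character q-grams play the role of fingerprints, and the well-known q-gram lemma supplies the bound (a query with m - q + 1 q-grams that matches with at most k edits keeps at least m - q + 1 - k*q of them intact). A Python dict stands in for the hashed fingerprint files, and all names are hypothetical.

```python
from collections import defaultdict

def qgrams(text, q=2):
    """All overlapping character q-grams of text, in order, with repeats."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def build_fingerprint_index(messages, q=2):
    """Map each q-gram to the ids of the messages containing it (stand-in for fingerprint files)."""
    index = defaultdict(set)
    for msg_id, text in enumerate(messages):
        for gram in set(qgrams(text.lower(), q)):
            index[gram].add(msg_id)
    return index

def candidate_messages(query, index, k=1, q=2):
    """Vote per message, then prune messages whose votes fall below the q-gram lemma bound."""
    grams = qgrams(query.lower(), q)
    threshold = len(grams) - k * q  # (m - q + 1) - k*q
    if threshold <= 0:
        raise ValueError("bound is vacuous: query too short for this k and q")
    votes = defaultdict(int)
    for gram in grams:  # one vote per query q-gram position found in the message
        for msg_id in index.get(gram, ()):
            votes[msg_id] += 1
    return {m: v for m, v in votes.items() if v >= threshold}

messages = [
    "Please recieve the attached report",
    "Lunch at noon tomorrow",
    "Did you receive my message",
]
index = build_fingerprint_index(messages)
survivors = candidate_messages("receive", index, k=2)
```

Messages that survive pruning are only candidates; a real system would still verify each one with an approximate matcher, since the bound rules messages out but does not confirm a match.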
Privacy-preserving record linkage using Bloom filters
, 2009
"... © 2009 Schnell et al; licensee BioMed Central Ltd. ..."

