Results 11 - 20 of 22
Comparing Inverted Files and Signature Files for Searching a Large Lexicon p
- Communications of the ACM, 1996
Cited by 1 (0 self)
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
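The bit-sliced idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bigram size, the `^`/`$` boundary markers, the use of `zlib.crc32` as the hash, and all function names are assumptions made for illustration. Each n-gram sets exactly one bit in a term's signature, and the index is stored as one bit slice per signature position.

```python
import zlib
from typing import List


def ngrams(text: str, n: int = 2) -> List[str]:
    """Return all overlapping n-grams of text (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bit_for(gram: str, width: int) -> int:
    """Map an n-gram to a single signature bit (assumed hash: CRC32 mod width)."""
    return zlib.crc32(gram.encode("utf-8")) % width


def build_slices(lexicon: List[str], width: int, n: int = 2) -> List[int]:
    """Build bit slices: slices[b] has bit t set if term t sets signature bit b."""
    slices = [0] * width
    for t, term in enumerate(lexicon):
        # Hypothetical boundary markers so prefixes and suffixes are indexed too.
        for gram in ngrams(f"^{term}$", n):
            slices[bit_for(gram, width)] |= 1 << t
    return slices


def search(fragment: str, lexicon: List[str], slices: List[int],
           width: int, n: int = 2) -> List[str]:
    """Find terms containing fragment: AND the relevant slices, then verify."""
    candidates = (1 << len(lexicon)) - 1  # start with every term as a candidate
    for gram in ngrams(fragment, n):
        candidates &= slices[bit_for(gram, width)]
    # Hash collisions can produce false matches, so check each candidate directly.
    return [term for t, term in enumerate(lexicon)
            if (candidates >> t) & 1 and fragment in term]
```

With a lexicon such as `["lexicon", "signature", "inverted", "index"]` and a width of 16, `search("ex", ...)` narrows the candidates through the slices and then confirms them by a direct substring check, so collisions affect only speed, not correctness.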
MAL4:6- Using Data Mining for Record Linkage
Cited by 1 (1 self)
This paper presents a first attempt at using pedigree-based data to improve record linkage. It describes a composite metric for similarity and a mechanism to extract relevant generational features. Results on a large data set demonstrate promise.
Information Access to Historical Documents from the Early New High German Period
Cited by 1 (0 self)
With the new interest in historical documents, insight grew that electronic access to these texts causes many specific problems. In the first part of the paper we survey the present role of digital historical documents. After collecting central facts and observations on historical language change, we comment on the difficulties that result for retrieval and data mining on historical texts. In the second part of the paper we report on our own work in the area, with a focus on special matching strategies that help to relate modern language keywords with old variants. The basis of our studies is a collection of documents from the Early New High German period. These texts come with a very rich spectrum of word variants and spelling variations.
Similarity Searching in the CORDIS Text Database, 2001
Similarity searching in text databases with multiple field types is still an open problem. We focus our attention on the "COmmunity Research and Development Information Service" (CORDIS) database of the European Union and we evaluate the effectiveness of many text retrieval methods in terms of precision, recall and ranking quality. Our experiments indicate that different field types should be handled by different retrieval methods.
Finding Variants of Out-of-Vocabulary Words in Arabic
Transliteration of a word into another language often leads to multiple spellings. Unless an information retrieval system recognises different forms of transliterated words, a significant number of documents will be missed when users specify only one spelling variant. Using two different datasets, we evaluate several approaches to finding variants of foreign words in Arabic, and show that the longest common subsequence (LCS) technique is the best overall.
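For readers unfamiliar with the LCS technique named in this abstract, the following is a minimal Python sketch of the classic dynamic-programming computation. The abstract does not say how the authors score or normalise matches, so the `lcs_similarity` function and its normalisation by the combined length are assumptions for illustration only, as are the Latin-script example spellings.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (standard DP)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 2 * LCS / (len(a) + len(b)), in the range [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

For two spellings such as `"mohammed"` and `"muhamad"`, the longest common subsequence is `"mhamd"`, so `lcs_length` returns 5. A retrieval system could rank candidate variants by such a score, though the threshold and ranking scheme would need to be tuned on real data.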
Utilizing Stacking for Feature Reduction in Graph-Based Genealogical Record Linkage
Genealogy research is centered on collecting records about an individual from various sources and combining the information to gain a larger historical perspective about that individual, commonly in the form of a pedigree. Data extraction, the internet, and other technological advancements have made large amounts of digital genealogical data more accessible. Discovering the relevancy of a digital record to a given pedigree involves determining if the individual described in the record is in actuality an individual within the pedigree. This process is called Genealogical Record Linkage (GRL). GRL can be automated through data mining techniques by creating machine-learned models from hand-labeled comparisons. In this paper, we compare two such models, a tabular approach and a graph-based stacking approach, and report the successful application of both on a large, post-blocking database. We also note the successful integration of these approaches in an open source distributed genealogy program that finds relevant matches to a given pedigree from multiple online repositories.

