Prototype extraction and adaptive OCR
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999
Cited by 30 (8 self)
Abstract: To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print. Index Terms: Optical character recognition, adaptive classification, template matching, segmentation, document image analysis, text reader.
The Effects of Query-Based Sampling on Automatic Database Selection Algorithms
2000
Cited by 24 (11 self)
Database selection algorithms need to know the subject areas covered by each text database, but this metadata can be difficult to acquire in multi-party environments, such as the Internet, where each party has different interests and capabilities. Query-based sampling is a relatively new technique in which metadata is inferred by interacting with each text database and observing the outcomes. Query-based sampling has been proposed as a solution to the problem of discovering the contents of each database in multi-party environments, but its generality and effectiveness had not been tested under a wide range of conditions. This paper investigates the generality and effectiveness of query-based sampling with three well-known database selection algorithms (gGlOSS, CORI, CVV). Experimental results support the generality of query-based sampling as a solution for acquiring database descriptions in multi-party environments. The experiments also compare the effectiveness of the database selection algorithms under different conditions.
Fast Searching on Compressed Text Allowing Errors
1998
Cited by 23 (13 self)
We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding and the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression times are close to the times of Compress and approximately half the times of Gzip, and decompression times are lower than those of Gzip and one third of those of Compress. The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boyer-Moore-type filter. It concomitantly uses the vocabulary of the text available as part of the Huffman coding data. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is twice as fast as running Agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than Agrep. We also mention the impact of our technique in inverted files pointing to documents or logical blocks, such as Glimpse.
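For readers unfamiliar with the shift-or technique named in this abstract, the following is a minimal sketch of the classic character-level shift-or (bit-parallel) exact matcher on plain text. It is not the paper's word-oriented variant over compressed text; the function name `shift_or_search` and the choice to return start offsets are assumptions made for illustration only.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Return the start offsets of every exact occurrence of pattern in text.

    Classic shift-or: bit i of `state` is 0 exactly when pattern[0..i]
    matches the text ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    all_ones = (1 << m) - 1

    # For each pattern character, a mask with a 0 bit at every position
    # where that character occurs in the pattern.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    state = all_ones  # no prefix matched yet
    hits = []
    for pos, ch in enumerate(text):
        # Shift in a 0 (the empty prefix always matches), then OR in the
        # mask so that only prefixes consistent with `ch` stay at 0.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits
```

Because the state is a single integer updated with a shift and an OR per text symbol, the same idea carries over to any symbol alphabet, which is presumably what makes a word-oriented version possible; the details of that version are not given in the abstract.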
Lightweight natural language text compression
Information Retrieval, 2007
Cited by 22 (18 self)
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
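The abstract does not spell out how End-Tagged Dense Code assigns codewords, so the sketch below follows the scheme as it is commonly described: words are ranked by decreasing frequency, and the word of rank r receives a byte sequence in which only the last byte has its high bit set. The function names (`etdc_encode`, `etdc_decode`, `build_ranks`, `compress`) and the tie-breaking rule are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank`.

    The last byte lies in 128..255 (high bit set marks the end of the
    codeword); every preceding byte lies in 0..127.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]          # final byte, tagged with the high bit
    rank //= 128
    while rank > 0:
        rank -= 1                     # dense numbering: no codeword is wasted
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single codeword."""
    rank = 0
    for b in code[:-1]:
        rank = (rank + b + 1) * 128
    return rank + code[-1] - 128


def build_ranks(words: list[str]) -> dict[str, int]:
    """Semistatic model: rank words by decreasing frequency (ties by word)."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered)}


def compress(words: list[str]) -> tuple[bytes, dict[str, int]]:
    """Encode a word sequence; returns the byte stream and the rank table."""
    ranks = build_ranks(words)
    return b"".join(etdc_encode(ranks[w]) for w in words), ranks
```

Since a codeword depends only on the rank, no code tree needs to be built, which is consistent with the abstract's claim that Dense Codes are simpler to program and faster to generate than Huffman codes.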
Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
ACM Transactions on Information Systems, 1989
Cited by 21 (3 self)
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
Indexing Compressed Text
Proceedings of the 4th South American Workshop on String Processing, 1997
Cited by 20 (8 self)
We present a technique to build an index based on suffix arrays for compressed texts. We also propose a compression scheme for textual databases based on words that generates a compression code that preserves the lexicographical ordering of the text words. As a consequence it permits the sorting of the compressed strings to generate the suffix array without decompressing. As the compressed text is under 30% of the size of the original text we are able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index and search times are reduced to approximately half the time. We also present analytical and experimental results for different variations of the word-oriented compression paradigm.
Fast Approximate String Matching in a Dictionary
In Proc. SPIRE'98, 1998
Cited by 20 (8 self)
A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).
1 Introduction
Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem can be stated as follows: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the segments (or "occurrences") of the text whose edit distance to the pattern is at most k. The edit distance ed() between two strings is defined as the minimum number of character insertions, deletions and replacements needed to make them equal. I...
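The edit distance and the "at most k errors" search problem defined in this excerpt can be sketched with the textbook dynamic programming formulation. This is only an illustration of the definitions, not the metric-space index the paper proposes; the function names `edit_distance` and `approx_occurrences` are assumptions, and occurrences are reported by their 1-based end position in the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements
    needed to turn a into b."""
    prev = list(range(len(b) + 1))    # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]


def approx_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """End positions (1-based) of text segments whose edit distance
    to pattern is at most k.

    Same recurrence as edit_distance, but the cell for the empty pattern
    prefix is always 0, so an occurrence may start anywhere in the text.
    """
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            new.append(min(col[i - 1] + cost, col[i] + 1, new[i - 1] + 1))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

This online scan costs O(mn) time for a text of length n and a pattern of length m, which is the kind of per-query cost that indexing the vocabulary aims to reduce.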
Word spotting for historical documents
International Journal on Document Analysis and Recognition, 2007
Cited by 20 (1 self)
Searching and indexing historical handwritten collections is a very challenging problem. We describe an approach called word spotting which involves grouping word images into clusters of similar words by using image matching to find similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be built automatically. Image similarities computed using a number of different techniques including dynamic time warping are compared. The word similarities are then used for clustering
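Dynamic time warping, one of the matching techniques this abstract mentions, can be sketched as follows for two one-dimensional feature sequences. The abstract does not say which features are extracted from the word images or which local cost is used, so the scalar features and the absolute-difference cost here are assumptions, as is the function name `dtw_distance`.

```python
def dtw_distance(x: list[float], y: list[float]) -> float:
    """Cost of the cheapest monotone alignment between sequences x and y.

    Each element of one sequence may be matched to one or more consecutive
    elements of the other, which absorbs local stretching or compression.
    """
    inf = float("inf")
    n, m = len(x), len(y)
    # table[i][j] = best cost of aligning x[:i] with y[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])          # assumed local cost
            table[i][j] = cost + min(table[i - 1][j],      # x[i-1] repeats a match
                                     table[i][j - 1],      # y[j-1] repeats a match
                                     table[i - 1][j - 1])  # advance both
    return table[n][m]
```

A pairwise distance of this kind could then feed any standard clustering method; the abstract is cut off before it says which one the authors use.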
Execution Performance Issues in Full-Text Information Retrieval
1995
Cited by 18 (0 self)
The task of an information retrieval system is to identify documents that will satisfy a user's information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and measuring similarity between the two. The maturity and proven effectiveness of these systems have resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size, and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% wit...
Is Huffman Coding Dead?
Computing, 1993
Cited by 16 (3 self)
In recent publications about data compression, arithmetic codes are often suggested as the state of the art, rather than the more popular Huffman codes. While it is true that Huffman codes are not optimal in all situations, we show that the advantage of arithmetic codes in compression performance is often negligible. Referring also to other criteria, we conclude that for many applications, Huffman codes should still remain a competitive choice.
1. Introduction
It is paradoxical that, as the technology for storing and transmitting information has gotten cheaper and more effective, interest in data compression has increased. There are many explanations, but most conspicuous is that improvements in media have expanded our sense of what we wish to store. For example, CD-ROM technology allows us to store whole libraries instead of records describing individual items; but the requirements of storing full text easily exceed the capabilities even of the optical format. Similarly, there is ...

