Results 1 - 10 of 25
Content-Based Retrieval for Music Collections, 1999
Cited by 35 (1 self)
A content-based retrieval model for tackling the mismatch problems specific to music data is proposed and implemented. The system uses a pitch profile encoding for queries in any key and an n-note indexing method for approximate matching in sub-linear time. A distinct function that extracts key melodies for query suggestion is developed. The Web-based system provides a flexible user interface for query formulation and result browsing. Users can search the system by entering a short sequence of notes, by uploading a file created by singing, or by clicking suggested key melodies without any input. Experiments show that the pitch profile encoding and 3-note indexing overcome the key mismatch problem and the random errors caused by pitch errors, note deletions and note insertions. The use of extracted key melodies improves performance over direct search of the music database. For burst mismatches, a query expansion approach is applied.
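The two ideas in this abstract, a key-independent encoding and an index over short note windows, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes melodies are given as MIDI note numbers, uses plain semitone intervals as a stand-in for the "pitch profile encoding", and ranks songs by counting shared 3-note windows. All function names and the sample data are hypothetical.

```python
from collections import defaultdict

def intervals(pitches):
    """Encode a melody as semitone steps between consecutive notes (key-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def note_ngrams(pitches, n=3):
    """Return every window of n notes, each stored as its n-1 intervals."""
    steps = intervals(pitches)
    width = n - 1
    return [tuple(steps[i:i + width]) for i in range(len(steps) - width + 1)]

def build_index(songs, n=3):
    """Map each n-note window to the set of songs that contain it."""
    index = defaultdict(set)
    for title, pitches in songs.items():
        for gram in note_ngrams(pitches, n):
            index[gram].add(title)
    return index

def search(index, query, n=3):
    """Rank songs by how many of the query's n-note windows they share."""
    votes = defaultdict(int)
    for gram in note_ngrams(query, n):
        for title in index.get(gram, ()):
            votes[title] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical two-song collection, pitches as MIDI note numbers (assumption).
songs = {
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "frere_jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}
index = build_index(songs)

# The opening of the first song, sung five semitones higher.
query = [69, 69, 70, 72, 72, 70]
results = search(index, query)
```

Because only intervals are stored, the transposed query still matches, and a single wrong note in a query spoils only the windows that overlap it, so the remaining windows can still vote for the right song.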
Imaged document text retrieval without OCR - IEEE Trans. Pattern Analysis and Machine Intelligence, 2002
Cited by 22 (9 self)
Abstract: We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely, the Vertical Traverse Density (VTD) and Horizontal Traverse Density (HTD), are extracted. An n-gram based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from UW1 database confirms the validity of the proposed method. Index Terms: Document image analysis, document vector, text similarity, text retrieval.
Holistic Word Recognition for Handwritten Historical Documents, 2004
Cited by 21 (9 self)
Most offline handwriting recognition approaches proceed by segmenting words into smaller pieces (usually characters) which are recognized separately. The recognition result of a word is then the composition of the individually recognized parts. Inspired by results in cognitive psychology, researchers have begun to focus on holistic word recognition approaches. Here we present a holistic word recognition approach for single-author historical documents, which is motivated by the fact that for severely degraded documents a segmentation of words into characters will produce very poor results. The quality of the original documents does not allow us to recognize them with high accuracy. Our goal here is to produce transcriptions that will allow successful retrieval of images, which has been shown to be feasible even in such noisy environments. We believe that this is the first systematic approach to recognizing words in historical manuscripts with extensive experiments. Our experiments show a recognition accuracy of 65%, which exceeds the performance of other systems that operate on non-degraded input images (non-historical documents).
Information retrieval in document image databases - IEEE Transactions on Knowledge and Data Engineering
Cited by 15 (4 self)
Abstract: With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how relevant one word image is to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval. Index Terms: Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement.
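The abstract does not say which inexact string matching technique is used. As an illustration only, the sketch below swaps in plain edit distance, plus a variant that lets the shorter string match anywhere inside the longer one, which is one common way to test whether one string is a portion of another. The letter strings stand in for primitive strings, and all names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Scale edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def partial_distance(part, whole):
    """Smallest edit distance between `part` and any substring of `whole`."""
    prev = [0] * (len(whole) + 1)  # zero cost to start anywhere in `whole`
    for i, cp in enumerate(part, 1):
        cur = [i]
        for j, cw in enumerate(whole, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (cp != cw)))
        prev = cur
    return min(prev)  # the match may also end anywhere in `whole`
```

A low `partial_distance` relative to the length of `part` suggests that the shorter word image is contained in the longer one, even when the two strings differ in a few primitives.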
Correcting broken characters in the recognition of historical printed documents - Joint Conference on Digital Libraries, 2003
Cited by 14 (0 self)
This paper presents a new technique for dealing with broken characters, one of the major challenges in the optical character recognition (OCR) of degraded historical printed documents. A technique based on graph combinatorics is used to rejoin the appropriate connected components. It has been applied to real data with successful results.
Term selection for searching printed Arabic - Proceedings ACM SIGIR'2002, 2002
Cited by 12 (2 self)
Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
Large Scale Parallel Document Mining for Machine Translation
Cited by 7 (0 self)
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
Document Image Retrieval Techniques for Chinese - Proceedings of the Fourth Symposium on Document Image Understanding Technology, 2001
Cited by 5 (1 self)
In this paper we present experimental results for retrieval from a collection of scanned article clippings from Chinese newspapers. The test collection consists of 8,438 articles from China, Taiwan and Hong Kong in a mix of traditional and simplified Chinese. A commercial OCR system was used to produce errorful text. Exhaustive relevance assessment was performed over the entire collection for 30 Chinese queries by multiple judges. Indexing a combination of unigrams and overlapping bigrams was found to outperform overlapping bigram indexing alone, and byte length normalization was found to outperform cosine normalization. No improvement resulted from the addition of query expansion using blind relevance feedback on the same collection.
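The combined indexing terms described above can be sketched in a few lines. This is an illustration under assumptions, not the paper's system: the function name is hypothetical, whitespace is simply dropped before terms are formed, and weighting and normalization are left out.

```python
def unigrams_and_bigrams(text):
    """Return single characters followed by overlapping character bigrams."""
    # Assumption: whitespace carries no meaning here, so bigrams may span
    # characters that were separated by a space or line break in the OCR text.
    chars = [c for c in text if not c.isspace()]
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams

terms = unigrams_and_bigrams("文件检索")
```

For the four-character string in the example, the function yields four unigrams and three overlapping bigrams. One plausible reading of the reported result is that, when OCR corrupts a character, every bigram containing it is lost while unigrams for the neighbouring characters still match.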
Text Retrieval from Document Images based on N-Gram Algorithm - Text and Web Mining Workshop, 6th Pacific Rim International Conference on Artificial Intelligence, Publisher, 2000
Cited by 4 (3 self)
In this paper, we propose a method of text retrieval from document images using a similarity measure based on an N-Gram algorithm. We directly extract image features instead of using optical character recognition. Character image objects are first extracted from document images based on connected components, and an unsupervised classifier is then used to classify these objects. All objects are encoded according to one unified class set and each document image is represented by one stream of object codes. Next, we retrieve N-Gram slices from these streams and build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the results for the ASCII versions of those documents based on the N-Gram algorithm for text documents.
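The last two steps of this pipeline, N-Gram slices over a stream of object codes and a scalar product of the resulting vectors, can be sketched as follows. The sketch assumes the classifier has already turned each document into a list of integer class codes, and it normalizes vectors to unit length, which the abstract does not state. The function names and sample streams are hypothetical.

```python
import math
from collections import Counter

def ngram_vector(codes, n=3):
    """Count the n-gram slices of a code stream and scale to unit length."""
    grams = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = math.sqrt(sum(count * count for count in grams.values()))
    # Unit-length scaling is an assumption, made so that scores fall in [0, 1].
    return {gram: count / norm for gram, count in grams.items()} if norm else {}

def scalar_product(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())

# Hypothetical object-code streams for three document images.
doc_a = [1, 2, 3, 4, 5]
doc_b = [1, 2, 3, 4, 9]   # differs from doc_a in the last object only
doc_c = [9, 8, 7, 6]      # shares no 3-gram with doc_a

sim_ab = scalar_product(ngram_vector(doc_a), ngram_vector(doc_b))
sim_ac = scalar_product(ngram_vector(doc_a), ngram_vector(doc_c))
```

Here `doc_a` and `doc_b` share two of their three slices, so their score is high but below 1, while `doc_a` and `doc_c` share none and score 0.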
Polyphonic Music Retrieval: The N-gram Approach, 2004
Cited by 4 (1 self)
This Music Information Retrieval (MIR) study investigates the use of n-grams and textual Information Retrieval (IR) approaches for the retrieval and access of polyphonic music data. IR, synonymous with text IR, implies the task of retrieving documents or texts with information content that is relevant to a user’s information need. With music retrieval, the use of n-grams has largely been confined to monophonic musical sequences. The few studies that have investigated its use with polyphonic music collections typically reduce a polyphonic file into a monophonic sequence for n-gram construction. Techniques for full-music indexing of polyphonic music data with n-grams are investigated. A method to obtain n-grams from polyphonic music data is introduced. The information content of ‘musical n-grams’ is extended to include rhythmic information in addition to intervallic information. For this, ratios of onset times between two adjacent pairs of pitch events are used. To encode ‘musical n-grams’ to obtain ‘musical words’ for indexing, a function that maps interval classes to text characters is formulated, and ranges of ratio bins are defined. These encoding approaches enable encoding of the pitch and rhythm information at vari-

