Word Image Matching Using Dynamic Time Warping
2002
Cited by 150 (14 self)
Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labour and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating "interesting" clusters, an index can be built automatically. We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.
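As a rough illustration of the matching step described above, the following is a minimal sketch of dynamic time warping between two 1-dimensional feature sequences. It is not the authors' implementation: the function name `dtw_distance`, the squared-difference local cost, and the lack of path-length normalization or warping-band constraints are all assumptions made for brevity.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences.

    Hypothetical helper for illustration; the local cost (squared
    difference) is an assumption, not taken from the paper.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because one sample may be matched to several samples of the other sequence, a stretched copy of a profile (for example the same word written wider) can still align at zero cost, which a sample-by-sample comparison would not allow.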
Word spotting for historical documents
International Journal on Document Analysis and Recognition, 2007
Cited by 82 (8 self)
Searching and indexing historical handwritten collections is a very challenging problem. We describe an approach called word spotting which involves grouping word images into clusters of similar words by using image matching to find similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be built automatically. Image similarities computed using a number of different techniques including dynamic time warping are compared. The word similarities are then used for clustering
Features for Word Spotting in Historical Manuscripts
Cited by 56 (4 self)
For the transition from traditional to digital libraries, the large number of handwritten manuscripts that exist poses a great challenge. Easy access to such collections requires an index, which is currently created manually at great cost. Because automatic handwriting recognizers fail on historical manuscripts, the word spotting technique has been developed: the words in a collection are matched as images and grouped into clusters which contain all instances of the same word. By annotating "interesting" clusters, an index that links words to the locations where they occur can be built automatically.
Robust anisotropic Gaussian fitting for volumetric characterization of pulmonary nodules in multislice CT
IEEE Trans. Med. Imag., 2005
Cited by 31 (8 self)
This article proposes a robust statistical estimation and verification framework for characterizing the ellipsoidal (anisotropic) geometrical structure of pulmonary nodules in multislice X-ray CT images. Given a marker indicating a rough location of a target, the proposed solution estimates the target’s center location, ellipsoidal boundary approximation, volume, maximum/average diameters, and isotropy by robustly and efficiently fitting an anisotropic Gaussian intensity model. We propose a novel multi-scale joint segmentation and model fitting solution which extends the robust mean shift-based analysis to the linear scale-space theory. The design aims to enhance robustness against margin truncation induced by neighboring structures, data with large deviations from the chosen model, and marker location variability. A chi-square-based statistical verification and analytical volumetric measurement solutions are also proposed to complement this estimation framework. Experiments with synthetic 1D and 2D data clearly demonstrate the advantage of our solution in comparison with the γ-normalized Laplacian approach [1] and the standard sample estimation approach [2, p.179]. A quasi real-time 3D nodule characterization system is developed using this framework and validated with two clinical data sets of thin-section chest CT images. Our experiments with 1310 nodules resulted in i) robustness against intra- and inter-operator variability due to varying marker locations, ii) 81% correct estimation rate, iii) 3% false acceptance and 5% false rejection rates, and
Text Alignment with Handwritten Documents
Cited by 26 (2 self)
Today's digital libraries increasingly include not only printed text but also scanned handwritten pages and other multimedia material. There are, however, few tools available for manipulating handwritten pages. Here, we propose an algorithm based on dynamic time warping (DTW) for a word by word alignment of handwritten documents with their (ASCII) transcripts. We see at least three uses for such alignment algorithms. First, alignment algorithms allow us to produce displays (for example on the web) that allow a person to easily find their place in the manuscript when reading a transcript. Second, such alignment algorithms will allow us to produce large quantities of ground truth data for evaluating handwriting recognition algorithms. Third, such algorithms allow us to produce indices in a straightforward manner for handwriting material. We provide experimental results of our algorithm on a set of 70 pages of historical handwritten material - specifically the writings of George Washington. Our method achieves 74.5% accuracy on line by line alignment and 60.5% accuracy when aligning whole pages at a time.
Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map
Cited by 23 (0 self)
This paper presents an algorithm using an adaptive local connectivity map for retrieving text lines from complex handwritten documents such as handwritten historical manuscripts. The algorithm is designed for solving the particularly complex problems seen in handwritten documents. These problems include fluctuating text lines, touching or crossing text lines, and low-quality images that do not lend themselves easily to binarization. The algorithm is based on connectivity features similar to local projection profiles, which can be directly extracted from gray scale images. The proposed technique is robust and has been tested on a set of complex historical handwritten documents such as Newton’s and Galileo’s manuscripts. Preliminary testing shows a successful location rate of above 95% for the test set.
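The abstract compares its connectivity features to projection profiles computed directly on gray scale pixels. The sketch below shows only a plain global row projection profile, not the adaptive local connectivity map itself. The function name `row_projection_profile` and the convention that 0 is black ink and 255 is white paper are assumptions.

```python
def row_projection_profile(gray):
    """Sum of "ink" per image row, computed without binarization.

    `gray` is a list of rows of 8-bit gray values (assumed: 0 = black
    ink, 255 = white background). Hypothetical helper for illustration.
    """
    return [sum(255 - pixel for pixel in row) for row in gray]


# Tiny 3x2 example: a blank row, a half-inked row, a fully inked row.
profile = row_projection_profile([[255, 255], [0, 255], [0, 0]])
```

Rows with high profile values are likely to cross a text line, while valleys suggest gaps between lines. A global profile like this breaks down on fluctuating or touching lines, which is the kind of difficulty the abstract says motivates a local, adaptive variant.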
Indexing of Handwritten Historical Documents - Recent Progress
Cited by 22 (2 self)
Indexing and searching collections of handwritten archival documents and manuscripts has always been a challenge because handwriting recognizers do not perform well on such noisy documents. Given a collection of documents written by a single author (or a few authors), one can apply a technique called word spotting. The approach is to cluster word images based on their visual appearance, after segmenting them from the documents. Annotation can then be performed for clusters rather than documents.
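The cluster-then-annotate idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which clustering method is used, so simple threshold linking (single linkage via union-find) stands in for it, and the names `cluster_by_threshold` and `build_index` are hypothetical.

```python
def cluster_by_threshold(dist, threshold):
    """Group items whose pairwise distance is at most `threshold`.

    `dist` is a symmetric matrix of word-image distances. Threshold
    linking is an assumed stand-in for the clustering step.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def build_index(clusters, annotations, locations):
    """Map each annotated cluster's word to the locations of its members."""
    index = {}
    for cluster_id, members in enumerate(clusters):
        word = annotations.get(cluster_id)
        if word is None:  # clusters nobody annotated are left out of the index
            continue
        index.setdefault(word, []).extend(locations[m] for m in members)
    return index
```

With three word images where the first two are close, `cluster_by_threshold` yields two clusters, and annotating only the first cluster produces an index entry covering both of its occurrences. This is why annotation effort scales with the number of clusters rather than the number of word images.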
Using Corner Feature Correspondences to Rank Word Images by Similarity
Cited by 18 (3 self)
Libraries contain enormous amounts of handwritten historical documents which cannot be made available on-line because they do not have a searchable index. The wordspotting idea has previously been proposed as a solution to creating indexes for such documents and collections by matching word images. In this paper we present an algorithm which compares whole word-images based on their appearance. This algorithm recovers correspondences of points of interest in two images, and then uses these correspondences to construct a similarity measure. This similarity measure can then be used to rank word-images in order of their closeness to a querying image. We achieved an average precision of 62.57% on a set of 2372 images of reasonable quality and an average precision of 15.49% on a set of 3262 images from documents of poor quality that are hard even for humans to read.
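For readers unfamiliar with the metric, the sketch below computes average precision for a single ranked result list using the standard information-retrieval definition. Whether the paper averages exactly this quantity over its queries is an assumption, and the function name `average_precision` is hypothetical.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision of one ranked list.

    `ranked_relevance` holds one boolean per returned item, best-ranked
    first. `total_relevant` is the number of relevant items in the
    collection; if omitted, the relevant items found in the list are used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0
```

For a ranking that is relevant at positions 1 and 3 only, the precisions at those ranks are 1/1 and 2/3, so the average precision is their mean, roughly 0.83.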
Boosted decision trees for word recognition in handwritten document retrieval
In: 28th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005
Cited by 17 (3 self)
Recognition and retrieval of historical handwritten material is an unsolved problem. We propose a novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step. Decision trees with normalized pixels as features form the basis of a highly accurate AdaBoost classifier, trained on a corpus of word images that have been resized and sampled at a pyramid of resolutions. To stem problems from the highly skewed distribution of class frequencies, word classes with very few training samples are augmented with stochastically altered versions of the originals. This increases recognition performance substantially. On a standard corpus of 20 pages of handwritten material from the George Washington collection, recognition performance improves substantially over previously published results (75% vs. 65%). Following word recognition, retrieval is done using a language model over the recognized words. Retrieval performance also improves substantially over previously published results on this database. Recognition/retrieval results on a more challenging database of 100 pages from the George Washington collection are also presented.
Aligning transcripts to automatically segmented handwritten manuscripts
Proceedings of the 7th IAPR Workshop on Document Analysis Systems, 2006
Cited by 9 (0 self)
Training and evaluation of techniques for handwriting recognition and retrieval is a challenge given that it is difficult to create large ground-truthed datasets. This is especially true for historical handwritten datasets. In many instances the ground truth has to be created by manually transcribing each word, which is a very labor-intensive process. Sometimes transcriptions are available for some manuscripts. These transcriptions were created for other purposes and hence correspondence at the word, line, or sentence level may not be available. To be useful for training and evaluation, a word-level correspondence must be available between the segmented handwritten word images and the ASCII transcriptions. Creating this correspondence or alignment is challenging because the segmentation is often error-prone and the ASCII transcription may also contain errors. Very little work has been done on the alignment of handwritten data to transcripts. Here, a novel Hidden Markov Model based automatic alignment algorithm is described and tested. The algorithm produces an average alignment accuracy of about 72.8% when aligning whole pages at a time on a set of 70 pages of the George Washington collection. This outperforms, by about 12%, a dynamic time warping alignment algorithm previously reported in the literature and tested on the same collection.