A Novel Connectionist System for Unconstrained Handwriting Recognition
, 2008
Cited by 6 (1 self)
Recognising lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to exploit surrounding context, has led to low recognition rates for even the best current recognisers. Most recent progress in the field has been made either through improved preprocessing, or through advances in language modelling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labelling tasks where the data is hard to segment and contains long-range, bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, our approach achieves word recognition accuracies of 79.7% on online data and 74.1% on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, we demonstrate the network’s robustness to lexicon size, measure the individual influence of its hidden layers, and analyse its use of context. Lastly, we provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network’s superior performance.
Adaptation of Statistical Language Models for Automatic Speech Recognition
, 1999
Cited by 5 (0 self)
Statistical language models encode linguistic information in such a way as to be useful to systems which process human language. Such systems include those for optical character recognition and machine translation. Currently, however, the most common application of language modelling is in automatic speech recognition, and it is this that forms the focus of this thesis. Most current speech recognition systems are dedicated to one specific task (for example, the recognition of broadcast news), and thus use a language model which has been trained on text which is appropriate to that task. If, however, one wants to perform recognition on more general language, then creating an appropriate language model is far from straightforward. A task-specific language model will often perform very badly on language from a different domain, whereas a model trained on text from many diverse styles of language might perform better in general, but will not be especially well suited to any particular domai...
Automatic Acquisition of Word Classification using Distributional Analysis of Content Words with Respect to Function Words
, 2002
Cited by 4 (3 self)
This project describes a method which can automatically infer word classification. Previous systems designed to assign parts-of-speech to words relied on training data or were built upon rules devised by experts in linguistics. The report details the use of an unsupervised approach that can significantly reduce the reliance on prior linguistic intuition. The study looks into how words behave relative to the function words. As these are the most common words, there is a great deal of information that can be obtained from them. It was possible to analyse how the content words from a given body of text were distributed with respect to the function words. This information could be used as a profile, and therefore content words with a similar profile against the function words could be assumed to be of similar word class. Agglomerative hierarchical clustering techniques were applied to partition words into different clusters. Words that were deemed similar were grouped together, and thus each cluster should contain words that possess the same part-of-speech. This project performed many experiments to investigate how the many factors affected the overall clustering performance, in order to find the optimal parameters. The results report an accuracy of 87% when performed on the LOB corpus. Experiments were also carried out with an alternative Spanish corpus, and the clustering accuracy achieved was 85%. Semantic clustering was also observed, indicating the effectiveness of the described approach for the task of automatically acquiring word classification.
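The profiling idea in this abstract can be illustrated with a minimal sketch. It is not the report's implementation: the function-word list, the choice of immediate left/right neighbours as the profile, and cosine similarity as the comparison are all assumptions made for illustration, and the agglomerative clustering step is left out.

```python
from collections import Counter
from math import sqrt

# Hypothetical function-word list; the report's actual list is not given here.
FUNCTION_WORDS = {"the", "a", "to", "is"}

def profiles(tokens):
    """Count, for each content word, the function words directly to its left and right."""
    prof = {}
    for i, w in enumerate(tokens):
        if w in FUNCTION_WORDS:
            continue
        p = prof.setdefault(w, Counter())
        if i > 0 and tokens[i - 1] in FUNCTION_WORDS:
            p[("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
            p[("R", tokens[i + 1])] += 1
    return prof

def cosine(p, q):
    """Cosine similarity between two profiles (Counters return 0 for missing keys)."""
    dot = sum(p[k] * q[k] for k in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

tokens = "the cat is happy the dog is sad a cat likes to run a dog likes to jump".split()
prof = profiles(tokens)
```

In this toy text, "cat" and "dog" end up with identical profiles, while "cat" and "run" share no context at all; a clustering step would then merge the most similar profiles first.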
Statistical Language Processing based on Self-Organising Word Classification
, 1994
Cited by 4 (2 self)
An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a type of simulated annealing which employs an average class mutual information metric. Resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags: unique n-bit numbers whose most significant bit patterns incorporate class information. Therefore, access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of two natural languages, from the phonemic to the semantic level. The system has been favourably compared, both directly and indirectly, with other word classification systems. Class-based interpolated language models have been constructed to exploit the extra information supplied by structural...
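The structural-tag representation described above can be sketched in a few lines. This is only an illustration under assumptions: the 8-bit width and the example tag values are invented, and the abstract does not say how tags are assigned.

```python
N_BITS = 8  # assumed tag width for this sketch

def class_at_level(tag: int, level: int, n_bits: int = N_BITS) -> int:
    """Class id at a given depth of the hierarchy: the top `level` bits of the tag."""
    if not 0 <= level <= n_bits:
        raise ValueError("level must be between 0 and n_bits")
    return tag >> (n_bits - level)

def shared_depth(a: int, b: int, n_bits: int = N_BITS) -> int:
    """Number of leading bits two tags share, i.e. their deepest common class level."""
    return n_bits - (a ^ b).bit_length()

# Hypothetical tags: "cat" and "dog" share a 5-bit prefix, "run" differs at the first bit.
tags = {"cat": 0b10110010, "dog": 0b10110111, "run": 0b01100001}
```

Because every classification level is a prefix of the same number, a single shift recovers the class at any granularity, which is the property the abstract highlights.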
Word-to-Category Backoff Language Models
, 1996
Cited by 4 (2 self)
A language model combining word-based and category-based n-grams within a backoff framework is presented. Word n-grams conveniently capture sequential relations between particular words, while the category model, which is based on part-of-speech classifications and allows ambiguous category membership, is able to generalise to unseen word sequences and is therefore appropriate in backoff situations. Experiments on the LOB, Switchboard and WSJ0 corpora demonstrate that the technique greatly improves language model perplexities for sparse training sets, and offers significantly improved complexity versus performance tradeoffs when compared with standard trigram models.
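A rough sketch of backing off from a word bigram to a category model follows. It is not the model from the report: it assumes exactly one category per word (the report allows ambiguous membership), uses absolute discounting and add-one smoothing as stand-in techniques, and runs on an invented toy corpus.

```python
from collections import Counter

# Toy tagged corpus (hypothetical data); one category per word is assumed here.
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"),
          ("a", "DET"), ("cat", "NOUN"), ("ran", "VERB")]
words = [w for w, _ in tagged]
cats = [c for _, c in tagged]
cat_of = dict(tagged)
vocab = sorted(set(words))
categories = sorted(set(cats))

bigram = Counter(zip(words, words[1:]))    # word bigram counts
hist = Counter(words[:-1])                 # how often each word occurs as a history
word_count = Counter(words)
cat_count = Counter(cats)
cat_bigram = Counter(zip(cats, cats[1:]))  # category bigram counts
cat_hist = Counter(cats[:-1])

def p_cat(w, v):
    """Category-based estimate: P(cat(w) | cat(v)) * P(w | cat(w)), add-one smoothed."""
    c_prev, c = cat_of[v], cat_of[w]
    p_trans = (cat_bigram[(c_prev, c)] + 1) / (cat_hist[c_prev] + len(categories))
    p_emit = word_count[w] / cat_count[c]
    return p_trans * p_emit

def p_backoff(w, v, discount=0.5):
    """Discounted word bigram if seen; otherwise the reserved mass shared out by p_cat."""
    h = hist[v]
    if h == 0:
        return p_cat(w, v)
    if bigram[(v, w)] > 0:
        return (bigram[(v, w)] - discount) / h
    n_seen = sum(1 for x in vocab if bigram[(v, x)] > 0)
    reserved = discount * n_seen / h
    unseen_mass = sum(p_cat(x, v) for x in vocab if bigram[(v, x)] == 0)
    return reserved * p_cat(w, v) / unseen_mass
```

For example, the bigram "a dog" never occurs in the toy corpus, yet `p_backoff("dog", "a")` is non-zero and larger than `p_backoff("sat", "a")`, because a noun is a likelier successor to a determiner than a verb is.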
Detecting Errors in Discontinuous Structural Annotation
- In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
, 2005
Cited by 4 (1 self)
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in positional annotation (e.g., part-of-speech) and continuous structural annotation (e.g., syntactic constituency), no approach has yet been developed for automatically detecting annotation errors in discontinuous structural annotation. This is significant since the annotation of potentially discontinuous stretches of material is increasingly relevant, from treebanks for free-word order languages to semantic and discourse annotation. In this paper we discuss how the variation n-gram error detection approach (Dickinson and Meurers, 2003a) can be extended to discontinuous structural annotation. We exemplify the approach by showing how it successfully detects errors in the syntactic annotation of the German TIGER corpus (Brants et al., 2002).
Writer identification for smart meeting room systems
- In Proc. 7th IAPR Workshop on Document Analysis Systems, volume 3872 of LNCS
, 2006
Cited by 3 (2 self)
In this paper we present a text-independent on-line writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets previously used in signature verification systems. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the stroke-based feature set outperforms the point-based feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.
Diverse Classifiers for NLP Disambiguation Tasks: Comparison, Optimization, Combination, and Evolution
, 2000
Cited by 3 (0 self)
In this paper we report preliminary results from an ongoing study that investigates the performance of machine learning classifiers on a diverse set of Natural Language Processing (NLP) tasks. First, we compare a number of popular existing learning methods (Neural networks, Memory-based learning, Rule induction, Decision trees, Maximum Entropy, Winnow Perceptrons, Naive Bayes and Support Vector Machines), and discuss their properties vis-à-vis typical NLP data sets. Next, we turn to methods to optimize the parameters of single learning methods through cross-validation and evolutionary algorithms. Then we investigate how we can get the best of all single methods through combination of the tested systems in classifier ensembles. Finally, we discuss new and more thorough methods of automatically constructing ensembles of classifiers based on the techniques used for parameter optimization.
Comparative Evaluation of Word- and Category-Based Language Models
, 1996
Cited by 2 (1 self)
Conventional n-gram language models employ the occurrence counts of word n-tuples to calculate probabilities for word sequences. It has been demonstrated, however, that language models using n-tuples of word-categories rather than words exhibit certain advantages, such as the intrinsic ability to generalise to unseen word sequences, and attractive size versus performance tradeoffs. This document compares the behaviour of word- and category-based language models in detail, and among the significant findings are that the category-based model is less likely to deliver very small probability estimates, that it performs better in situations where the word-model backs off, and that the category-based model is less sensitive to changes in the character of the test-text.
Part-of-Speech Tagging from "Small" Data Sets
, 1996
Probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the Lancaster-Oslo-Bergen (LOB) corpus. Training on a 900,000 token training corpus, the hidden Markov model (HMM) method easily achieves a 95 per cent success rate on a 100,000 token test corpus. However, even such large corpora contain relatively few words and new words are subsequently encountered in test corpora. For example, the million-token LOB contains only about 45,000 different words, most of which occur only once or twice. We find that 3-4 per cent of tokens in a disjoint test corpus are unseen, that is, unknown to the tagger after training, and cause a significant proportion of errors. A corpus representative of all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. Experimental results confirm that this extreme course is not necessary. Vari...
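The unseen-token rate mentioned in this abstract is simple to measure. The sketch below shows one way to compute it; the function name and the toy sentences are invented for illustration and have nothing to do with the LOB figures quoted above.

```python
def unseen_token_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word type never occurs in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)

# Toy example (hypothetical data): "dog" and "rug" are unseen in training.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()
rate = unseen_token_rate(train, test)
```

Here two of the six test tokens are unknown to the training vocabulary, so the rate is one third.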

