Results 11 - 20 of 96
Error-tolerant finite state recognition with applications to morphological analysis and spelling correction
Computational Linguistics, 1996
Cited by 18 (0 self)
This paper presents the notion of error-tolerant recognition with finite-state recognizers along with results from some applications. Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite-state recognizer. Such recognition has applications to error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: in the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of the agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology has been fully captured by a single (and possibly very large) finite-state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate candidate correct forms from a given misspelled string within a certain edit distance. Error-tolerant recognition can be applied to spelling correction for any language, if (a) it has a word list comprising all inflected forms, or (b) its morphology has been fully described by a finite-state transducer. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages (English, Dutch, French, German, and Italian, among others) with very large...
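The candidate-generation idea in this abstract (enumerating correct forms within a certain edit distance of a misspelled string) can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a plain word list stored as a trie (a simple acyclic deterministic recognizer) and plain Levenshtein distance (insertion, deletion, substitution only). The names `build_trie` and `candidates` and the `None` end-of-word key are hypothetical choices made for illustration.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # hypothetical end-of-word convention
    return root


def candidates(trie, query, max_dist):
    """Return sorted (word, distance) pairs within max_dist edits of query."""
    results = []

    def walk(node, prefix, prev_row):
        # prev_row[i] = edit distance between prefix and query[:i]
        if None in node and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch is None:
                continue
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,          # skip a query character
                               prev_row[i] + 1,         # extra character in the word
                               prev_row[i - 1] + cost)) # match or substitute
            # Prune: no extension of this prefix can get back under the threshold.
            if min(row) <= max_dist:
                walk(child, prefix + ch, row)

    walk(trie, "", list(range(len(query) + 1)))
    return sorted(results)


lexicon = build_trie(["recognize", "recognizer", "recognized", "organize"])
print(candidates(lexicon, "reconize", 1))
```

The pruning step is what keeps the search from visiting the whole lexicon: a branch is abandoned as soon as every entry in its distance row exceeds the threshold.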
Compressed Storage of Sparse Finite-State Transducers
Workshop on Implementing Automata WIA99 - Pre-Proceedings, 1999
Cited by 12 (0 self)
This paper presents an eclectic approach for compressing weighted finite-state automata and transducers, with minimal impact on performance. The approach is eclectic in the sense that various complementary methods have been employed: row-indexed storage of sparse matrices, dictionary compression, bit manipulation, and lossless omission of data. The compression rate is over 83% with respect to the current Bell Labs FSM library.
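Of the methods listed in this abstract, row-indexed storage of a sparse matrix is the easiest to illustrate. The sketch below is an assumption-laden illustration, not the paper's format: it handles an unweighted deterministic transition table only, and the names `pack` and `step` are hypothetical. Each state is a row, and only the transitions that exist are stored, in three flat arrays.

```python
from bisect import bisect_left


def pack(transitions, num_states):
    """Pack {(state, label): target} into row-indexed flat arrays."""
    row_start = [0] * (num_states + 1)
    labels, targets = [], []
    # Sorting by (state, label) groups each row and orders labels within it.
    for (state, label), target in sorted(transitions.items()):
        row_start[state + 1] += 1
        labels.append(label)
        targets.append(target)
    # Prefix sums turn per-row counts into row start offsets.
    for s in range(num_states):
        row_start[s + 1] += row_start[s]
    return row_start, labels, targets


def step(packed, state, label):
    """Follow one transition; return the target state or None if absent."""
    row_start, labels, targets = packed
    lo, hi = row_start[state], row_start[state + 1]
    i = bisect_left(labels, label, lo, hi)  # binary search inside the row
    if i < hi and labels[i] == label:
        return targets[i]
    return None


packed = pack({(0, "a"): 1, (0, "b"): 2, (2, "a"): 0}, num_states=3)
print(step(packed, 0, "b"), step(packed, 1, "a"))
```

Storage here grows with the number of transitions rather than with states times alphabet size, at the cost of a binary search per lookup.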
Efficient Submatch Addressing for Regular Expressions
2001
Cited by 10 (0 self)
String pattern matching in its different forms is an important topic in theoretical computer science. This thesis concentrates on the problem of regular expression matching with submatch addressing, where the position and extent of the substrings matched by given subexpressions must be provided. The algorithms in widespread use at the time either take exponential worst-case time to find a match, can handle only a subset of all regular expressions, or use space proportional to the length of the input string where constant space would suffice. This thesis proposes a new method for solving the submatch addressing problem using nondeterministic finite automata with transitions augmented by copy-on-write update operations. The resulting algorithm makes a single pass over the input string, always using time linearly proportional to the input. Space consumption depends only on the regular expression used, and not on the input string. To the author's knowledge, this is a new result. A prototype of a POSIX.2 compatible regular expression matcher using the algorithm was implemented. Benchmarking results indicate that the prototype compares favorably against some popular implementations. Furthermore, the absence of exponential or polynomial time worst cases makes it possible to use any regular expression without performance problems, which is not the case with previous implementations or algorithms.
The Role of Lexicalization and Pruning for Base Noun Phrase Grammars
In AAAI 99, 1999
Cited by 10 (0 self)
This paper explores the role of lexicalization and pruning of grammars for base noun phrase identification.
Meta-Learning for Phonemic Annotation of Corpora
Stanford University, 2000
Cited by 9 (2 self)
We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to achieve the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at word level) for single classifiers is boosted significantly with additional error reductions of 31% and 38% respectively using combination of classifiers, and a further 5% using combination of meta-learners, bringing overall word level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.
Parallel Replacement in Finite State Calculus
1996
Cited by 9 (0 self)
This paper extends the calculus of regular expressions
Grapheme-To-Phone Using Finite-State Transducers
In: Proc. 2002 IEEE Workshop on Speech Synthesis. Volume, 2002
Cited by 8 (1 self)
Several approaches have been adopted over the years for grapheme-to-phone conversion for European Portuguese: hand derived rules, neural networks, classification and regression trees, etc. This paper describes different approaches implemented as Weighted Finite State Transducers (WFSTs), motivated by their flexibility in integrating multiple sources of information and other interesting properties such as inversion. We describe and compare rule-based, data-driven and hybrid approaches. Best results were obtained with the rule-based approach, but one should take into account the fact that the data-driven one was trained with automatically transcribed material.
Domain-Adaptive Information Extraction
Core System for Real World German Text Processing, in Proceedings of ANLP, 1998
Cited by 8 (3 self)
We present in this paper the methodology developed within the PARADIME (Parameterizable Domain-Adaptive Information and Message Extraction) project for designing an Information Extraction (IE) system easily adaptable to new domains of application. For this we went for a strict separation of the (shallow) linguistic processing modules on the one hand and the domain-modeling modules on the other hand, thus looking for the maximal degree of reusability of common linguistic resources shared by all domains of application. The tools used for the domain-modeling allow a declarative description of the domain under consideration and a simple (abstract) mapping to the output of the Natural Language (NL) analysis, thus requiring only a small amount of very general linguistic knowledge for the adaptation of the IE-system to new applications. We describe a real scale experiment on a fast adaptation cycle of the system to a new domain -- the soccer domain -- and present the first results obtained.
Finite State Transducers Approximating Hidden Markov Models
1997
"... This paper describes the conversion of a ..."