Speech Recognition by Composition of Weighted Finite Automata
- FINITE-STATE LANGUAGE PROCESSING
, 1996
Cited by 103 (11 self)
We present a general framework based on weighted finite automata and weighted finite-state transducers for describing and implementing speech recognizers. The framework allows us to represent uniformly the information sources and data structures used in recognition, including context-dependent units, pronunciation dictionaries, language models and lattices. Furthermore, general but efficient algorithms can be used for combining information sources in actual recognizers and for optimizing their application. In particular, a single composition algorithm is used both to combine in advance information sources such as language models and dictionaries, and to combine acoustic observations and information sources dynamically during recognition.
Designing Statistical Language Learners: Experiments on Noun Compounds
, 1995
Cited by 65 (0 self)
Statistical language learning research takes the view that many traditional natural language processing tasks can be solved by training probabilistic models of language on a sufficient volume of training data. The design of statistical language learners therefore involves answering two questions: (i) Which of the multitude of possible language models will most accurately reflect the properties necessary to a given task? (ii) What will constitute a sufficient volume of training data? Regarding the first question, though a variety of successful models have been discovered, the space of possible designs remains largely unexplored. Regarding the second, exploration of the design space has so far proceeded without an adequate answer. The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: it identifies a new class of designs by providing a novel theory of statistical natural language processing, and it presents the foundations for a predictive theory of data requirements to assist in future design explorations. The first of these contributions is called the meaning distributions theory. This theory
Weighted Automata in Text and Speech Processing
- IN ECAI-96 WORKSHOP
, 1996
Cited by 63 (30 self)
Finite-state automata are a very effective tool in natural language processing. However, in a variety of applications and especially in speech processing, it is necessary to consider more general machines in which arcs are assigned weights or costs. We briefly describe some of the main theoretical and algorithmic aspects of these machines. In particular, we describe an efficient composition algorithm for weighted transducers, and give examples illustrating the value of determinization and minimization algorithms for weighted automata.
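As an illustration of what composing weighted transducers means, here is a minimal sketch. It is not the algorithm from the paper: it assumes epsilon-free transducers and the tropical semiring (weights are costs that add along a path), and the dictionary layout and the name `compose` are hypothetical choices made for this example.

```python
from collections import deque


def compose(t1, t2):
    """Compose two epsilon-free weighted transducers (tropical semiring).

    Assumed layout (hypothetical, for illustration only):
      {'start': state,
       'finals': {state: final_weight},
       'arcs': {state: [(in_label, out_label, weight, next_state), ...]}}
    """
    start = (t1['start'], t2['start'])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        q1, q2 = state
        arcs[state] = []
        # A pair state is final only if both components are final.
        if q1 in t1['finals'] and q2 in t2['finals']:
            finals[state] = t1['finals'][q1] + t2['finals'][q2]
        for a, b, w1, n1 in t1['arcs'].get(q1, []):
            for b2, c, w2, n2 in t2['arcs'].get(q2, []):
                # Match the output label of t1 with the input label of t2.
                if b == b2:
                    nxt = (n1, n2)
                    arcs[state].append((a, c, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {'start': start, 'finals': finals, 'arcs': arcs}


# Toy example: t1 rewrites "a" as "x"; t2 rewrites "x" as "y" or "z".
t1 = {'start': 0, 'finals': {1: 0.0},
      'arcs': {0: [('a', 'x', 1.0, 1)]}}
t2 = {'start': 0, 'finals': {1: 0.0},
      'arcs': {0: [('x', 'y', 0.5, 1), ('x', 'z', 2.0, 1)]}}
composed = compose(t1, t2)
```

Only pair states reachable from the start pair are built, which keeps the result small when the two machines share few matching labels. Handling epsilon transitions correctly requires additional filtering that this sketch omits.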
Automatic Discovery of Non-Compositional Compounds in Parallel Data
, 1997
Cited by 58 (1 self)
Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of noncompositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
A Rational Design for a Weighted Finite-State Transducer Library
- LECTURE NOTES IN COMPUTER SCIENCE
, 1998
A Compression-based Algorithm for Chinese Word Segmentation
- Computational Linguistics
Cited by 48 (7 self)
This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model...
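The idea of inserting spaces wherever they make the text cheaper to encode can be sketched as follows. This is not the paper's method: it swaps in a character bigram model with add-one smoothing for the adaptive compression model described above, and the names `train_bigram` and `segment` are hypothetical. With a one-character context, each gap between characters can be decided independently.

```python
import math
from collections import Counter


def train_bigram(segmented_text):
    """Return a cost function giving the code length in bits of `sym` after `prev`.

    Assumption: a simple add-one-smoothed character bigram model stands in
    for the adaptive text compression model.
    """
    pair_counts = Counter(zip(segmented_text, segmented_text[1:]))
    context_counts = Counter(segmented_text[:-1])
    alphabet_size = len(set(segmented_text))

    def cost(prev, sym):
        p = (pair_counts[(prev, sym)] + 1) / (context_counts[prev] + alphabet_size)
        return -math.log2(p)

    return cost


def segment(text, cost):
    """Insert a space wherever doing so shortens the encoded text."""
    if not text:
        return ""
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        joined = cost(prev, cur)                      # encode cur right after prev
        split = cost(prev, " ") + cost(" ", cur)      # encode a space in between
        if split < joined:
            out.append(" ")
        out.append(cur)
    return "".join(out)


# Toy usage: train on already segmented text, then segment unspaced input.
cost = train_bigram("ab ab ab ab ab")
segmented = segment("abab", cost)
```

A higher-order model would need a search over segmentations (for example, dynamic programming over contexts) because inserting a space changes the context used for later characters.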
A Trainable Rule-based Algorithm for Word Segmentation
, 1997
Cited by 46 (0 self)
This paper presents a trainable rule-based algorithm for performing word segmentation.
Developing Guidelines and Ensuring Consistency for Chinese Text Annotation
- In Proceedings of the Second Language Resources and Evaluation Conference
, 2000
Cited by 39 (10 self)
With growing interest in Chinese Language Processing, numerous NLP tools (e.g. word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore comparisons are difficult. As a first step towards addressing this issue, we have been preparing a 100-thousand-word bracketed corpus since late 1998 and plan to release it to the public in summer 2000. In this paper, we will address several challenges in building the corpus, namely, creating annotation guidelines, ensuring annotation accuracy and maintaining a high level of community involvement.
An Unsupervised Iterative Method for Chinese New Lexicon Extraction
- International Journal of Computational Linguistics & Chinese Language Processing
, 1997
Cited by 32 (3 self)
An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input ...
Multilingual Text Analysis for Text-to-Speech Synthesis
, 1996
Cited by 28 (7 self)
We present a model of text analysis for text-to-speech (TTS) synthesis based on (weighted) finite-state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian,

