Profer: Predictive, Robust Finite-State Parsing for Spoken Language
- In Proceedings of ICASSP, 1999
Cited by 16 (3 self)
The natural language processing component of a speech understanding system is commonly a robust, semantic parser, implemented as either a chart-based transition network or a generalized left-right (GLR) parser. In contrast, we are developing a robust, semantic parser that is a single, predictive finite-state machine. Our approach is motivated by our belief that such a finite-state parser can ultimately provide an efficient vehicle for tightly integrating higher-level linguistic knowledge into speech recognition. We report on our development of this parser, with an example of its use, and a description of how it compares to both finite-state predictors and chart-based semantic parsers, while combining the elements of
Hierarchical search for large vocabulary conversational speech recognition
- IEEE Signal Processing Magazine, 1999
Cited by 15 (5 self)
Speaker-independent speech recognition technology has made significant progress from the days of isolated word recognition. Today, state-of-the-art systems are capable of performing large vocabulary continuous speech recognition (LVCSR) on audio streams derived from complex information sources such as broadcast news and two-way telephone dialogs. A significant contribution to this advancement in technology is the development of search techniques that find suboptimal but accurate solutions in problems involving large search spaces and extremely complex statistical models. Moreover, these search strategies are capable of dynamically integrating information from a number of diverse knowledge sources to determine the correct word hypothesis, and limit the scope of the search by using a hierarchical search strategy. We refer to this problem as the decoding or search problem. This paper describes the complexity associated with decoding using hierarchical representations for linguistic and acoustic knowledge sources. An extensible object-oriented decoder, available in the public domain, that leverages current state-of-the-art technology is described to illustrate these concepts. This decoder supports efficient handling of acoustic models for cross-word context-dependent phones, multiple pronunciations of words using lexical trees, and rescoring of word graphs based on N-gram language models in a single pass. It employs a state-of-the-art Viterbi-style dynamic programming algorithm, and is equipped with several heuristic pruning criteria to minimize the consumption of computational resources while maintaining good accuracy.
A Framework and Toolkit for the Construction of Multimodal Learning Interfaces
- 1998
Cited by 7 (0 self)
Multimodal human-computer interaction, in which the computer accepts input from multiple channels or modalities, is more flexible, natural, and powerful than unimodal interaction with input from a single modality. Many research studies ([Hauptmann89], [Nakagawa94], [Nishimoto94], [Oviatt97b], [Chu97], to name a few) have reported that the combination of human communication means such as speech, gestures, handwriting, eye movement, etc. enjoys strong preference among users. Unfortunately, the development of multimodal applications is difficult and still suffers from a lack of generality, such that a lot of duplicated effort is wasted when implementing different applications sharing some common aspects. The research presented in this dissertation aims to provide a partial solution to the difficult problem of developing multimodal applications by creating a modular, distributed, and customizable infrastructure to facilitate the construction of such applications. This dissertation contribu...
Mx: A package for rapid mathematical prototyping and algorithm development with application to speech and speaker recognition
- Tech. rep., Dept. CSE, Oregon Graduate Institute, 1998
Cited by 2 (2 self)
Table of contents (excerpt): Preface. Part I, Overview. 1 Introduction: 1.1 Introduction; 1.2 Background; 1.3 Software architecture; 1.4 Versions; 1.5 Prerequisites; 1.6 Notation. 2 Using Mx: 2.1 Matrices; 2.2 Getting Started; 2.3 Commands and arguments; 2.4 Using variables; 2.5 Specifying matrix input; 2.6 Accessing a submatrix; 2.7 ...
Robust, Finite-State Parsing for Spoken Language Understanding
Cited by 1 (1 self)
Human understanding of spoken language appears to integrate the use of contextual expectations with acoustic-level perception in a tightly coupled, sequential fashion. Yet computer speech understanding systems typically pass the transcript produced by a speech recognizer into a natural language parser with no integration of acoustic and grammatical constraints. One reason for this is the complexity of implementing that integration. To address this issue, we have created a robust, semantic parser as a single finite-state machine (FSM). As such, its run-time action is less complex than that of other robust parsers based on either chart or generalized left-right (GLR) architectures. Therefore, we believe it is ultimately more amenable to direct integration with a speech decoder.
Variations on Statistical Phoneme Recognition -- A Hybrid Approach
- 1997
Automatic speech recognition (ASR) is rapidly becoming a mature technology, leading to an increasing number of commercial applications. Although great advances have been made in the state of the art of speech recognition over the last 10 years, the holy grail of ASR, namely large-vocabulary, speaker-independent continuous speech recognition with an error rate of less than 1%, still eludes researchers. At the heart of most modern speech recognition systems lies an HMM-based phoneme recognition engine, which segments and classifies the incoming acoustic signal into a sequence of phonemes. These phonemes are concatenated to form word models, which are processed further to arrive at a transcription of the linguistic message encoded in the speech signal. The final recognition accuracy of the speech recognition system can thus be directly linked to the recognition accuracy of the underlying phoneme recogniser. Two types of features extracted from the speech signal are commonly used for phoneme recognition. These are the supra-segmental knowledge-based features derived from phonetic and phonological theory, and the widely used frame-based cepstral features. Until now, these features have been used separately by researchers, resulting in the loss of valuable discriminative information.
A Diphone-Based Digit Recognition System Using Neural Networks
- In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1997
In exploring new ways of looking at speech data, we have developed an alternative method of segmentation for training a neural-network-based digit-recognition system. Whereas previous methods segment the data into monophones, biphones, or triphones and train on each sub-phone unit in several broad-category contexts, our new method uses modified diphones to train on the regions of greatest spectral change as well as the regions of greatest stability. Although we account for regions of spectral stability, we do not require their presence in our word models. Empirical evidence for the advantage of this new method is seen by the 13% reduction in word-level error that was achieved on a test set of the OGI Numbers corpus. Comparison was made to a baseline system that used context-independent monophones and context-dependent biphones and triphones. 1. INTRODUCTION Previous methods for training neural-network recognizers divide each phone for training into one, two, or three parts (mo...
Natural Language Processing: A Human-Computer Interaction Perspective
- 1998
Natural language processing has been in existence for more than fifty years. During this time, it has significantly contributed to the field of human-computer interaction in terms of theoretical results and practical applications. As computers continue to become more affordable and accessible, the importance of user interfaces that are effective, robust, unobtrusive, and user-friendly, regardless of user expertise or impediments, becomes more pronounced. Since natural language usually provides for effortless and effective communication in human-human interaction, its significance and potential in human-computer interaction should not be overlooked. Whether spoken or typewritten, it may effectively complement other available modalities, such as windows, icons, menus, and pointing; in some cases, such as for users with disabilities, natural language may even be the only applicable modality. This chapter examines the field of natural language processing as it relates to humanc...

