Training Mixture Density HMMs with SOM and LVQ
, 1997
Cited by 4 (2 self)
The objective of this paper is to present experiments and discussions of how some neural network algorithms can help the phoneme recognition with mixture density hidden Markov models (MDHMMs). In MDHMMs the modeling of the stochastic observation processes associated with the states is based on the estimation of the probability density function of the short-time observations in each state as a mixture of Gaussian densities. The Learning Vector Quantization (LVQ) is used to increase the discrimination between different phoneme models both during the initialization of the Gaussian codebooks and during the actual MDHMM training. The Self-Organizing Map (SOM) is applied to provide a suitably smoothed mapping of the training vectors to accelerate the convergence of the actual training. The obtained codebook topology can also be exploited in the recognition phase to speed up the calculations to approximate the observation probabilities. The experiments with LVQ and SOMs show reductions both...
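As an illustration of the mixture density idea in the abstract above, the following minimal sketch evaluates the log of a state's observation density as a weighted sum of Gaussians. It is not the paper's implementation: the function names, the diagonal-covariance assumption, and the log-sum-exp evaluation are all assumptions made for this example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_density(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), evaluated with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtract the largest term for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

With a single component of weight 1, the result reduces to the plain Gaussian log density, which is a quick sanity check on the mixture sum.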
Hidden Model Sequence Models for Automatic Speech Recognition
, 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
A Survey on Chinese Speech Recognition
- Communications of COLIPS
, 1996
Cited by 3 (0 self)
This paper gives a comprehensive survey of the recognition techniques that have been applied to Chinese speech. Speech recognition systems for multi-syllable languages such as English have been successfully constructed in the last two decades. In recent years, Chinese speech recognition has become a fast developing area of research. This paper attempts to review and summarize techniques and results and highlight the following areas: the characteristics of Chinese spoken language (especially Mandarin and Cantonese), which have a significant bearing on the approaches taken for recognition; the speech recognition framework based on the bisyllabic and tonal nature of Chinese spoken language; the recognition techniques based on the bisyllabic nature; the recognition of Chinese tones; and techniques based on language processing of word and sentence hypotheses. The results of various techniques are reported, compared and commented upon. Keywords: Chinese Speech Recognition, Surveys...
Using Location Information From Speech Recognition Of Television News Broadcasts
- Proceedings of the ESCA ETRW Workshop on Accessing Information in Spoken Audio
, 1999
Cited by 2 (0 self)
The Informedia Digital Video Library system extracts information from digitized video sources and allows full content search and retrieval over all extracted data. This extracted 'metadata' enables users to rapidly find interesting news stories and to quickly identify whether a retrieved TV news story is indeed relevant to their query. Through the extraction of named entity information from broadcast news we can determine what people, organizations, dates, times and monetary amounts are mentioned in the broadcast. With respect to location data, we have been able to use location analysis derived from the speech transcripts to allow the user to visually follow the action in the news story on a map and also allow queries for news stories by graphically selecting a region on the map.
1. The Informedia Digital Video Library Project
The Informedia Digital Video Library project [1], initiated in 1994, uniquely utilizes integrated speech, image and natural language understanding to process b...
Large vocabulary continuous speech recognition using linguistic features and constraints
, 2005
Cited by 1 (0 self)
Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and organization of their meanings (semantics, pragmatics, etc). The other governs how speech signals are related to words, a process often termed “lexical access”. This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework and its application to continuous speech from an open vocabulary. The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints. In the original Huttenlocher-Zue model, the authors had proposed six linguistic features motivated by the manner of pronunciation.
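The first-pass pruning described in the abstract above can be pictured with a toy sketch: each word's phoneme string is reduced to a sequence of broad manner classes, and only words whose sequence matches the observed one survive to the second pass. The class names, the phoneme-to-class table, and the exact-match rule below are hypothetical choices for illustration, not the feature set or matching procedure used in the thesis.

```python
# Hypothetical toy mapping from phoneme symbols to broad manner classes.
BROAD_CLASS = {
    "s": "strong_fric", "z": "strong_fric",
    "f": "weak_fric", "th": "weak_fric",
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop",
    "m": "nasal", "n": "nasal",
    "l": "liquid_glide", "r": "liquid_glide", "w": "liquid_glide",
    "ae": "vowel", "ih": "vowel", "iy": "vowel", "aa": "vowel",
}


def broad_signature(phonemes):
    """Reduce a phoneme sequence to its sequence of broad classes."""
    return tuple(BROAD_CLASS[p] for p in phonemes)


def prune_lexicon(lexicon, observed_classes):
    """Keep only words whose broad-class signature equals the observed one."""
    target = tuple(observed_classes)
    return sorted(
        word for word, phonemes in lexicon.items()
        if broad_signature(phonemes) == target
    )


# Toy lexicon (assumed pronunciations) used only to show the pruning effect.
TOY_LEXICON = {
    "cat": ["k", "ae", "t"],
    "pat": ["p", "ae", "t"],
    "sat": ["s", "ae", "t"],
    "man": ["m", "ae", "n"],
}
```

Calling `prune_lexicon(TOY_LEXICON, ["stop", "vowel", "stop"])` leaves only "cat" and "pat" as candidates, which a detailed second pass would then have to tell apart.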
Performance of the IBM Large Vocabulary Continuous Speech Recognition System on the ARPA Wall Street Journal Task
- Proc. ICASSP
In this paper we discuss various experimental results using our continuous speech recognition system on the Wall Street Journal task. Experiments with different feature extraction methods, varying amounts and type of training data, and different vocabulary sizes are reported.
1 INTRODUCTION
Large vocabulary continuous speech recognition is an area that is of great current interest, and to this end, several speech recognition systems have evolved that are capable of dealing with such recognition tasks [2, 4, 5, 6, 7, 9]. The ARPA sponsored Wall Street Journal task represents a standardized database that enables the evaluation of the features specific to these different systems on a common platform. In this paper, we present the performance of the IBM continuous speech recognition system on this task. We will concentrate on the speaker-independent portion of the database. The test data used in the experiments is read speech recorded using a Sennheiser microphone. We report experimental ...

