Multi-Stream Segmentation of Meetings
, 2004
Cited by 4 (0 self)
This paper investigates the automatic segmentation of meetings into a sequence of group actions or phases. Our work is based on a corpus of multiparty meetings collected in a meeting room instrumented with video cameras, lapel microphones and a microphone array. We have extracted a set of feature streams, in this case extracted from the audio data, based on speaker turns, prosody and a transcript of what was spoken. We have related these signals to the higher level semantic categories via a multistream statistical model based on dynamic Bayesian networks (DBNs). We report on a set of experiments in which different DBN architectures are compared, together with the different feature streams. The resultant system has an action error rate of 9%.
Decision tree-based training of probabilistic concatenation models for corpus-based speech synthesis
- in Proc. Interspeech
, 2006
Cited by 2 (2 self)
The measure of the goodness, or cost, of concatenating synthesis units plays an important role in concatenative speech synthesis. In this paper, we present a probabilistic approach to concatenation modeling in which the goodness of concatenation is represented as the conditional probability of observing the spectral shape of a unit given the previous unit and the current phonetic context. This conditional probability is modeled by a conditional Gaussian density whose mean vector takes the form of a linear transform of the past spectral shape. Phonetic decision-tree based parameter tying is performed to achieve robust training that balances model complexity against the amount of training data available. The concatenation models are implemented in a corpus-based speech synthesizer trained with a CMU Arctic database, and the effectiveness of the proposed method was confirmed by a subjective listening test. Index Terms: speech synthesis, unit selection, join costs.
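As a rough illustration of the conditional Gaussian idea in this abstract, the sketch below scores a join as the negative log-density of the current spectral vector given a mean that is a linear transform of the previous one. The function name `concatenation_cost`, the diagonal covariance, the use of the negative log-density as the cost, and the omission of phonetic-context tying are all assumptions made for this example, not details taken from the paper.

```python
import math

def concatenation_cost(prev, curr, A, b, var):
    """Negative log-density of curr under N(A @ prev + b, diag(var)).

    Hypothetical helper: diagonal covariance is assumed for simplicity,
    and no decision-tree parameter tying is modeled here.
    """
    # Predicted mean: a linear transform of the previous unit's spectral vector.
    mean = [sum(a_ij * p for a_ij, p in zip(row, prev)) + b_i
            for row, b_i in zip(A, b)]
    cost = 0.0
    for x, m, v in zip(curr, mean, var):
        # Per-dimension Gaussian negative log-density.
        cost += 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return cost

identity = [[1.0, 0.0], [0.0, 1.0]]
smooth_join = concatenation_cost([1.0, 2.0], [1.0, 2.0], identity, [0.0, 0.0], [1.0, 1.0])
rough_join = concatenation_cost([1.0, 2.0], [2.0, 2.0], identity, [0.0, 0.0], [1.0, 1.0])
print(smooth_join < rough_join)
```

A lower value means the current unit is closer to what the previous unit predicts, so a unit-selection search would prefer it under these assumptions.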
The 2001 GMTK-Based Spine ASR System
, 2002
Cited by 2 (1 self)
This paper provides a detailed description of the University of Washington automatic speech recognition (ASR) system for the 2001 DARPA SPeech In Noisy Environments (SPINE) task. Our system makes heavy use of the graphical modeling toolkit (GMTK), a general purpose graphical modeling-based ASR system that allows arbitrary parameter tying, flexible deterministic and stochastic dependencies between variables, and a generalized maximum likelihood parameter estimation algorithm. In our SPINE system, GMTK was used for acoustic model training whereas feature extraction, speaker adaptation, and first-pass decoding were performed by HTK. Our integrated GMTK/HTK system demonstrates the relative merits provided by each tool. Novel aspects of our SPINE system include the capturing of correlations among feature vectors via a globally-shared factored sparse inverse covariance matrix and generalized EM training.
Factored Language Models Tutorial
, 2008
Cited by 1 (0 self)
The Factored Language Model (FLM) is a flexible framework for incorporating various information sources, such as morphology and part-of-speech, into language modeling. FLMs have so far been successfully applied to tasks such as speech recognition and machine translation; they have the potential to be used in a wide variety of problems in estimating probability tables from sparse data. This tutorial serves as a comprehensive description of FLMs and related algorithms. We document the FLM functionalities as implemented in the SRI Language Modeling toolkit and provide an introductory walk-through using FLMs on an actual dataset. Our goal is to provide an easy-to-understand tutorial and reference for researchers interested in applying FLMs to their problems.
Overview of the Tutorial
We first describe the factored language model (Section 1) and generalized backoff (Section 2), two complementary techniques that attempt to improve statistical estimation (i.e., reduce parameter variance) in language models, and that also attempt to better describe the way in which language (and sequences of words) might be produced. Researchers familiar with the algorithms behind FLMs may skip to Section 3, which describes the FLM programs and file formats in the publicly-available SRI Language Modeling (SRILM) toolkit. Section 4 is a step-by-step walkthrough
Time Adjustable Mixture Weights for Speaking Rate Fluctuation
, 2003
One of the most serious problems in spontaneous speech recognition is the degradation of recognition accuracy due to speaking rate fluctuation within an utterance. This paper proposes a method for adjusting the mixture weights of an HMM frame by frame depending on the local speaking rate. The proposed method is implemented using the Bayesian network framework. A hidden variable representing the variation of the "mode" of the speaking rate is introduced, and its value controls the mixture weights of the Gaussian mixtures. Model training and maximum probability assignment of the variables are conducted using the EM/GEM and inference algorithms for Bayesian networks. The Bayesian network is used to rescore the acoustic likelihood of the hypotheses in N-best lists. Experimental results show that the proposed method improves word accuracy by 1.6% absolute on meeting speech given the speaking rate information, whereas the improvement from a regression HMM is less significant.
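The role of the hidden "mode" variable can be pictured with a small sketch in which each mode selects its own set of mixture weights over shared Gaussian components, and the frame likelihood sums over modes. The function names, the scalar observation, and the assumption that a per-frame distribution over modes is already available are all hypothetical simplifications for illustration. The paper's actual Bayesian network structure and training procedure are not reproduced here.

```python
import math

def gaussian(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def frame_likelihood(x, mode_probs, weights_by_mode, means, variances):
    """p(x) = sum_m P(mode=m) * sum_k w[m][k] * N(x; mean_k, var_k).

    mode_probs is an assumed per-frame distribution over speaking-rate modes;
    the Gaussian components are shared and only the weights depend on the mode.
    """
    component = [gaussian(x, mu, v) for mu, v in zip(means, variances)]
    total = 0.0
    for p_mode, weights in zip(mode_probs, weights_by_mode):
        total += p_mode * sum(w * c for w, c in zip(weights, component))
    return total

means, variances = [0.0, 1.0], [1.0, 1.0]
weights_by_mode = [[0.9, 0.1], [0.2, 0.8]]  # row 0: "slow" mode, row 1: "fast" mode
slow = frame_likelihood(0.0, [1.0, 0.0], weights_by_mode, means, variances)
fast = frame_likelihood(0.0, [0.0, 1.0], weights_by_mode, means, variances)
mixed = frame_likelihood(0.0, [0.5, 0.5], weights_by_mode, means, variances)
print(slow > fast)
```

Because the components are shared, changing the mode distribution from frame to frame only reweights them, which is the effect the abstract describes.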
Hidden Dynamic Models for Speech Processing Applications
© Leo Jingyu Lee 2004. I hereby declare that I am the sole author of this thesis. I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research.
Computation of ASR Systems
, 2003
This paper proposes a novel technique to reduce the likelihood computation in ASR systems that use continuous density HMMs. Based on the nature of dynamic features and the numerical properties of Gaussian mixture distributions, we approximate the observation likelihood computation to achieve a speedup. Although the technique does not show appreciable benefit in an isolated word task, it yields significant improvements in continuous speech recognition. For example, 50% of the computation can be saved on the TIMIT database with only a negligible degradation in system performance.
A Multi-Resolution Hidden Markov Model using Class-Specific Features
We apply the PDF projection theorem to generalize the hidden Markov model (HMM) to accommodate multiple simultaneous segmentations of the raw data and multiple feature extraction transformations. Different segment sizes and feature transformations are assigned to each state. The algorithm averages over all allowable segmentations by mapping the segmentations to a “proxy” HMM and using the forward procedure.
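For readers unfamiliar with the forward procedure mentioned in this abstract, here is a minimal sketch for an ordinary discrete-observation HMM. It is the textbook recursion only. The multi-resolution extension and the proxy HMM construction from the paper are not shown, and the function name and toy parameters are assumptions chosen for the example.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Total probability of a non-empty observation sequence under a discrete HMM.

    initial[s]        : P(state_0 = s)
    transition[r][s]  : P(state_t = s | state_{t-1} = r)
    emission[s][o]    : P(obs_t = o | state_t = s)
    """
    n_states = len(initial)
    # Initialization: joint probability of the first observation and each state.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    # Recursion: sum over predecessor states, then apply the emission term.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n_states)) * emission[s][obs]
            for s in range(n_states)
        ]
    # Termination: marginalize over the final state.
    return sum(alpha)

initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.5, 0.5], [0.1, 0.9]]
likelihood = forward_likelihood(initial, transition, emission, [0, 1])
print(likelihood)
```

The recursion sums over all state paths in time linear in the sequence length, which is what makes averaging over many segmentations tractable once they are mapped onto HMM states.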

