Results 1 - 10
of
10
From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition
, 1996
"... ..."
Speech Recognition System Design Based on Automatically Derived Units
, 1999
"... In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
A study on music genre classification based on universal acoustic models
- ISMIR 2006
, 2006
"... Classification of musical genres gives a useful measure of similarity and is often the most useful descriptor of a musical piece. Previous techniques to use hidden Markov models (HMMs) for automatic genre classification have used a single HMM to model an entire song or genre. This paper provides a f ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
Classification of musical genres gives a useful measure of similarity and is often the most useful descriptor of a musical piece. Previous techniques to use hidden Markov models (HMMs) for automatic genre classification have used a single HMM to model an entire song or genre. This paper provides a framework to give finer segmentation of HMMs through acoustic segment modeling. Modeling each of these acoustic segments with an HMM builds a timbral dictionary in the same fashion that one would create a phonetic dictionary for speech. A symbolic transcription is created by finding the most likely sequence of symbols. These transcriptions then serve as inputs into an efficient text classifier utilized to provide a solution to the genre classification problem. This paper demonstrates that language-ignorant approaches provide results that are consistent with the current state-of-the-art for the genre classification problem. However, the finer segmentation potentially allows for “musical language”-based syntactic rules to enhance performance.
LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals
"... Abstract—We present LyricAlly, a prototype that automatically aligns acoustic musical signals with their corresponding textual lyrics, in a manner similar to manually-aligned karaoke. We tackle this problem based on a multimodal approach, using an appropriate pairing of audio and text processing to ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Abstract—We present LyricAlly, a prototype that automatically aligns acoustic musical signals with their corresponding textual lyrics, in a manner similar to manually-aligned karaoke. We tackle this problem based on a multimodal approach, using an appropriate pairing of audio and text processing to create the resulting prototype. LyricAlly’s acoustic signal processing uses standard audio features but constrained and informed by the musical nature of the signal. The resulting detected hierarchical rhythm structure is utilized in singing voice detection and chorus detection to produce results of higher accuracy and lower computational costs than their respective baselines. Text processing is employed to approximate the length of the sung passages from the lyrics. Results show an average error of less than one bar for per-line alignment of the lyrics on a test bed of 20 songs (sampled from CD audio and carefully selected for variety). We perform a comprehensive set of system-wide and per-component tests and discuss their results. We conclude by outlining steps for further development. Index Terms—Acoustic signal detection, acoustic signal processing, music, text processing. I.
Zue, “A Comparison of Broad Phonetic and Acoustic Units for Noise Robust Segment-Based Phonetic Recognition
- in Proc. Interspeech
, 2008
"... In this paper, we compare speech recognition performance using broad phonetically- and acoustically-motivated units as a pre-processor in designing a novel noise robust landmark detection and segmentation algorithm. We introduce a cluster evaluation method to measure acoustic unit cluster quality. O ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
In this paper, we compare speech recognition performance using broad phonetically- and acoustically-motivated units as a pre-processor in designing a novel noise robust landmark detection and segmentation algorithm. We introduce a cluster evaluation method to measure acoustic unit cluster quality. On the noisy TIMIT task, we find that the acoustic and phonetic segmentation approaches offer significant improvements over two baseline methods used in the SUMMIT segment-based speech recognizer, a sinusoidal model method and a spectral change approach. In addition, we find that the acoustic method has much faster computation time in stationary noises, while the phonetic approach is faster in non-stationary noise conditions. 1.
Competing hidden markov models on the self-organizing map
- Piscataway, NJ. Helsinki Univ of Technology, IEEE
"... This paper presents an unsupervised segmentation method for feature sequences based on competitivelearning hidden Markov models. Models associated with the nodes of the Self-Organizing Map learn to become selective to the segments of temporal input sequences. Input sequences may have arbitrary lengt ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
This paper presents an unsupervised segmentation method for feature sequences based on competitivelearning hidden Markov models. Models associated with the nodes of the Self-Organizing Map learn to become selective to the segments of temporal input sequences. Input sequences may have arbitrary lengths. Segment models emerge then on the map through an unsupervised learning process. The method was tested in speech recognition, where the performance of the emergent segment models was as good as the performance of the traditionally used linguistic speech segment models. The benefits of the proposed method are the use of unsupervised learning for obtaining the state models for temporal data and the convenient visualization of the state space on the two-dimensional map. 1.
A survey on automatic speech recognition with an illustrative example on continuous speech recognition
- of Mandarin,” Computat. Linguistics Chinese Language Processing
, 1996
"... For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally we elaborate the requirements in designing successful real-world applications and address technical challenges that need to be harnessed in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.
An Unsupervised Framework for Action Recognition Using Actemes
"... Abstract. In speech recognition, phonemes have demonstrated their efficacy to model the words of a language. While they are well defined for languages, their extension to human actions is not straightforward. In this paper, we study such an extension and propose an unsupervised framework to find pho ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract. In speech recognition, phonemes have demonstrated their efficacy to model the words of a language. While they are well defined for languages, their extension to human actions is not straightforward. In this paper, we study such an extension and propose an unsupervised framework to find phoneme-like units for actions, which we call actemes, using 3D data and without any prior assumptions. To this purpose, build on an earlier proposed framework in speech literature to automatically find actemes in the training data. We experimentally show that actions defined in terms of actemes and actions defined by whole units give similar recognition results. We define actions out of the training set in terms of these actemes to see whether the actemes generalize to unseen actions. The results show that although the acteme definitions of the actions are not always semantically meaningful, they yield optimal recognition accuracy and constitute a promising direction of research for action modeling. 1
WS96 Project Report Automatic Learning of Word Pronunciation from Data
"... Today's recognizers are primarily based on single pronunciations for most words. This means that the burden of modeling phonetic variability falls entirely on acoustic modeling. In addition, certain types of pronunciation variation (phone deletion/reduction, dialect) are impossible to model well at ..."
Abstract
- Add to MetaCart
Today's recognizers are primarily based on single pronunciations for most words. This means that the burden of modeling phonetic variability falls entirely on acoustic modeling. In addition, certain types of pronunciation variation (phone deletion/reduction, dialect) are impossible to model well at the acoustic level. We suspect that one of the difficulties in recognizing conversational speech (compared to read speech) is the greater variability of pronunciation. We propose to capture this variability by modeling the pronunciations for each word. The goal of this project is to automatically learn a model of word pronunciation from data. We focus on frequent words that appear many times in the Switchboard and Callhome corpora, since a small number of words make up a large fraction of the total errors. We can hope to learn these pronunciations automatically since these words occur many times in the training data. All past attempts in this area have treated pronunciation variants as mutua...

