Minimum cut model for spoken lecture segmentation
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006)
, 2006
Cited by 35 (7 self)
We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies. Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors.
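As a rough illustration of the normalized cut criterion named in this abstract, the sketch below scores every contiguous two-way split of a small sentence-similarity graph and keeps the split with the lowest score. It is not the paper's algorithm: the function names, the toy similarity matrix, and the restriction to a single boundary are all assumptions made for illustration, and the matrix is assumed to be symmetric with positive off-diagonal weights.

```python
def normalized_cut(weights, part_a):
    """Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B).

    weights is a symmetric similarity matrix (list of lists);
    part_a holds the node indices in A, and B is every other node.
    """
    n = len(weights)
    a = set(part_a)
    b = set(range(n)) - a
    # Total weight of the edges that cross the partition.
    cut = sum(weights[i][j] for i in a for j in b)
    # Volume: total weight from a side to every node in the graph.
    vol_a = sum(weights[i][j] for i in a for j in range(n))
    vol_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / vol_a + cut / vol_b


def best_two_way_split(weights):
    """Return (score, k): the boundary k that minimizes the normalized cut
    when sentences 0..k-1 form one segment and k..n-1 the other."""
    n = len(weights)
    return min((normalized_cut(weights, range(k)), k) for k in range(1, n))


# Hypothetical similarities: sentences 0-1 cohere, and so do sentences 2-3.
toy_weights = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.8, 0.0],
]
score, boundary = best_two_way_split(toy_weights)
print(boundary, round(score, 4))
```

Because every pair of sentences contributes to the score, a boundary is judged by cohesion across the whole text rather than only between adjacent sentences.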
Unsupervised word acquisition from speech using pattern discovery
- In Proceedings of ICASSP
, 2006
Cited by 7 (3 self)
In this paper, we present an unsupervised method for automatically discovering words from speech using a combination of acoustic pattern discovery, graph clustering, and baseform searching. The algorithm we propose represents an alternative to traditional methods of speech recognition and makes use of the acoustic similarity of multiple realizations of the same words or phrases. On a set of three academic lectures on different subjects, we show that the clustering component of the algorithm is able to successfully generate word clusters that have good coverage of subject-relevant words. Moreover, we illustrate how to use the cluster nodes to retrieve the word identity of each cluster from a large baseform dictionary. Results indicate that this algorithm may prove useful for applications such as vocabulary initialization, speech summarization, or augmentation of existing recognition systems.
Unsupervised Spoken Keyword Spotting via Segmental DTW on Gaussian Posteriorgrams
Cited by 7 (4 self)
In this paper, we present an unsupervised learning framework to address the problem of detecting spoken ...
A Computational Model of Language Acquisition: the Emergence of Words
, 2009
Cited by 5 (1 self)
In this paper, we discuss a computational model that is able to detect and build word-like representations on the basis of sensory input. The model is designed and tested with a further aim to investigate how infants may learn to communicate by means of spoken language. The computational model makes use of a memory, a perception module, and the concept of ‘learning drive’. Learning takes place within a communicative loop between a ‘caregiver’ and the ‘learner’. Experiments carried out on three European languages with different genetic backgrounds (Finnish, Swedish, and Dutch) show that a robust word representation can be learned using fewer than 100 acoustic tokens (examples) of that word. The model is inspired by the memory structure that is assumed functional for human cognitive processing.
TOWARDS MULTI-SPEAKER UNSUPERVISED SPEECH PATTERN DISCOVERY
Cited by 3 (1 self)
In this paper, we explore the use of a Gaussian posteriorgram-based representation for unsupervised discovery of speech patterns. Compared with our previous work, the new approach provides significant improvement towards speaker independence. The framework consists of three main procedures: a Gaussian posteriorgram generation procedure which learns an unsupervised Gaussian mixture model and labels each speech frame with a Gaussian posteriorgram representation; a segmental dynamic time warping procedure which locates pairs of similar sequences of Gaussian posteriorgram vectors; and a graph clustering procedure which groups similar sequences into clusters. We demonstrate the viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus. Index Terms: unsupervised learning, language acquisition
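To make the middle step of this pipeline more concrete, the sketch below aligns two short posteriorgram sequences with plain dynamic time warping. It is a simplification under stated assumptions: it runs whole-sequence DTW rather than the segmental variant the abstract describes, the posteriorgrams are hand-written toy vectors rather than Gaussian mixture model outputs, and the frame distance (negative log of the inner product) and all function names are illustrative choices, not details confirmed by the abstract.

```python
import math


def frame_distance(p, q):
    """Assumed frame distance: -log of the inner product of two
    posterior vectors, clamped so the result is finite and non-negative."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(min(1.0, max(dot, 1e-12)))


def dtw_cost(seq_x, seq_y):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    n, m = len(seq_x), len(seq_y)
    inf = float("inf")
    # acc[i][j] is the best cost of aligning the first i frames of seq_x
    # with the first j frames of seq_y.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_x[i - 1], seq_y[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy two-component posteriorgrams (each frame sums to 1).
word = [[1.0, 0.0], [0.0, 1.0]]
same_word_slower = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
other_word = [[0.0, 1.0], [1.0, 0.0]]

print(dtw_cost(word, same_word_slower))  # time-stretched copy aligns cheaply
print(dtw_cost(word, other_word))        # mismatched frames cost much more
```

Pairs of sequences with low alignment cost would then become edges in the graph that the clustering step groups into word-like clusters.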
Rapid Evaluation of Speech Representations for Spoken Term Discovery
Cited by 3 (3 self)
Acoustic front-ends are typically developed for supervised learning tasks and are thus optimized to minimize word error rate, phone error rate, etc. However, in recent efforts to develop zero-resource speech technologies, the goal is not to use transcribed speech to train systems but instead to discover the acoustic structure of the spoken language automatically. For this new setting, we require a framework for evaluating the quality of speech representations without coupling to a particular recognition architecture. Motivated by the spoken term discovery task, we present a dynamic time warping-based framework for quantifying how well a representation can associate words of the same type spoken by different speakers. We benchmark the quality of a wide range of speech representations using multiple frame-level distance metrics and demonstrate that our performance metrics can also accurately predict phone recognition accuracies. Index Terms: evaluation methods, acoustic front-end, spoken term discovery, zero resource
Integration of Asynchronous Knowledge Sources in a Novel Speech Recognition Framework
- Hugo Van hamme and Lou Boves
, 2008
Cited by 2 (2 self)
Hidden Markov Models have been essential in obtaining today’s successes in speech recognition. However, some limitations of HMMs become clear: for example, it is difficult to successfully exploit features that are measured at different time scales than the centisecond scale at which the spectral features are measured. Little success has been achieved in integrating utterance level information such as prosody, segmental information and finer detail such as voice onset times. In this paper, we apply latent semantic analysis (LSA) techniques known from the text processing field to histograms of acoustic event co-occurrence (HAC) to propose a novel speech recognition framework. We show that the HAC method can deal with correlated information and exploit knowledge sources that are asynchronous. Index Terms: speech recognition, information discovery, information integration, latent semantic analysis, co-occurrence statistics.
Single speaker segmentation and inventory selection using dynamic time warping self organization and joint multigram mapping
- in SSW06 ISCA Workshop
, 2007
Cited by 2 (1 self)
In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self organisation techniques can be applied to build an inventory based on collected acoustic data together with the constraints of a synthesis lexicon. In this paper we will describe a prototype inventory creation method using dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emerged units. We initially examined two symbol sets: (1) a baseline of standard phones and (2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35 ms. Mapping units onto phones resulted in a higher RMSE of 103 ms. This error increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Results for orthographic matching had a higher RMSE of 125 ms. To conclude, we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system. Index Terms: speech synthesis, unit selection.
AN INNER-PRODUCT LOWER-BOUND ESTIMATE FOR DYNAMIC TIME WARPING
Cited by 2 (1 self)
In this paper, we present a lower-bound estimate for dynamic time warping (DTW) on time series consisting of multi-dimensional posterior probability vectors known as posteriorgrams. We develop a lower-bound estimate based on the inner-product distance that has been found to be an effective metric for computing similarities between posteriorgrams. In addition to deriving the lower-bound estimate, we show how it can be efficiently used in an admissible K nearest neighbor (KNN) search for spotting matching sequences. We quantify the amount of computational savings achieved by performing a set of unsupervised spoken keyword spotting experiments using Gaussian mixture model posteriorgrams. In these experiments the proposed lower-bound estimate eliminates 89% of the previously required DTW calculations without affecting overall keyword detection performance. Index Terms: dynamic time warping, posteriorgram
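The idea of an admissible KNN search can be sketched as follows: a candidate is skipped whenever a cheap lower bound on its DTW cost already reaches the cost of the current k-th best match, so the result is the same as an exhaustive search. The bound used here is not the paper's inner-product lower bound, which the abstract does not spell out. It is a much simpler stand-in (the sum of the first-frame and last-frame distances, which every warping path must pay), and the function names and toy data are assumptions for illustration only.

```python
import math


def frame_distance(p, q):
    """Assumed inner-product distance, -log(p . q), clamped to be >= 0."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(min(1.0, max(dot, 1e-12)))


def dtw_cost(seq_x, seq_y):
    """Full DTW accumulated cost (the expensive computation)."""
    n, m = len(seq_x), len(seq_y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_x[i - 1], seq_y[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def endpoint_lower_bound(seq_x, seq_y):
    """Stand-in bound: every warping path contains the first and last cells,
    and all frame distances are non-negative, so their sum cannot exceed
    the DTW cost."""
    first = frame_distance(seq_x[0], seq_y[0])
    if len(seq_x) == 1 and len(seq_y) == 1:
        return first  # first and last cell coincide; count it once
    return first + frame_distance(seq_x[-1], seq_y[-1])


def knn_with_pruning(query, candidates, k):
    """Return the k lowest (cost, index) pairs and the number of DTW calls."""
    best = []  # kept sorted by cost, at most k entries
    dtw_calls = 0
    for idx, cand in enumerate(candidates):
        # Safe to skip: the true cost is at least the bound, which already
        # fails to beat the current k-th best cost.
        if len(best) == k and endpoint_lower_bound(query, cand) >= best[-1][0]:
            continue
        dtw_calls += 1
        best.append((dtw_cost(query, cand), idx))
        best.sort()
        del best[k:]
    return best, dtw_calls


query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
matches, calls = knn_with_pruning(query, candidates, k=1)
print(matches, calls)
```

How much computation is saved depends entirely on how tight the bound is. A loose bound like this one prunes only clear mismatches.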
Summarizing multiple spoken documents: finding evidence from untranscribed audio
Cited by 1 (0 self)
This paper presents a model for summarizing multiple untranscribed spoken documents. Without assuming the availability of transcripts, the model modifies a recently proposed unsupervised algorithm to detect re-occurring acoustic patterns in speech and uses them to estimate similarities between utterances, which are in turn used to identify salient utterances and remove redundancies. This model is of interest due to its independence from spoken language transcription, an error-prone and resource-intensive process, its ability to integrate multiple sources of information on the same topic, and its novel use of acoustic patterns that extends previous work on low-level prosodic feature detection. We compare the performance of this model with that achieved using manual and automatic transcripts, and find that this new approach is roughly equivalent to having access to ASR transcripts with word error rates in the 33–37% range without actually having to do the ASR, and that it better handles utterances with out-of-vocabulary words.

