Feature-based Pronunciation Modeling with Trainable Asynchrony Probabilities
In ICSLP, 2004
Cited by 11 (6 self)
We report on ongoing work on a pronunciation model based on explicit representation of the evolution of multiple linguistic feature streams. In this type of model, most pronunciation variation is viewed as the result of asynchrony between features and changes in feature values. We have implemented such a model using dynamic Bayesian networks. In this paper, we extend our previous work with a mechanism for learning feature asynchrony probabilities from data. We present experimental results on a word classification task using phonetic transcriptions of utterances from the Switchboard corpus.
Analysis of Sound Features for Music Timbre Recognition
In Proceedings of the IEEE CS International Conference on Multimedia and Ubiquitous Engineering (MUE 2007), April 26-28, 2007, Seoul, Korea
Cited by 6 (4 self)
Recently, advances in communication, digital music creation, and computer storage technology have led to rapid growth of online music repositories in both number and size, where automatic content-based indexing is critical for users to identify possible favorite music pieces. Timbre recognition is one of the important subtasks for such an indexing purpose. Much research has been carried out in exploring new sound features to describe the characteristics of a musical sound. The Moving Picture Experts Group (MPEG) provides a standard set of multimedia features, including low-level acoustical features based on the latest research in this area. This paper introduces our newly designed temporal features used for automatic indexing of musical sounds and evaluates them with MPEG-7 descriptors and other popular features.
Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
2005
Cited by 5 (0 self)
Speech recognition has been a very active area of research over the past twenty years. Despite evident progress, it is generally agreed by the practitioners of the field that the performance of current speech recognition systems is rather suboptimal and new approaches are needed. The motivation behind the undertaken research is the observation that the notion of representation of objects and concepts, which was once considered central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of a predominantly statistical approach to the speech recognition problem, and due to the numeric, feature vector-based nature of representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of the latter limitation, it is doubtful that the gap between speech recognition and linguistic research can be bridged by numeric representations. This thesis investigates an alternative, structural approach to spoken language representation and categorisa-
Dependency parsing with dynamic Bayesian network
In AAAI, 20th National Conference on Artificial Intelligence, 2005
Cited by 5 (0 self)
Exact parsing with finite state automata is deemed inappropriate because of the unbounded non-locality languages overwhelmingly exhibit. We propose a way to structure the parsing task in order to make it amenable to local classification methods. This allows us to build a Dynamic Bayesian Network which uncovers the syntactic dependency structure of English sentences. Experiments with the Wall Street Journal demonstrate that the model successfully learns from labeled data.
Reaching over the gap: A review of efforts to link human and automatic speech recognition research
2007
A note on join and autointersection of n-ary rational relations
Proc. Eindhoven FASTAR Days, number 04–40 in TU/e CS TR, 2004
Cited by 3 (3 self)
A finite-state machine with n tapes describes a rational (or regular) relation on n strings. It is more expressive than a relational database table with n columns, which can only describe a finite relation. We describe some basic operations on n-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of “auto-intersection” and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.
Capturing fine-phonetic variation in speech through automatic classification of articulatory features
In Proceedings of the Workshop on Speech Recognition and Intrinsic Variation, 2006
Cited by 3 (2 self)
The ultimate goal of our research is to develop a computational model of human speech recognition that is able to capture the effects of fine-grained acoustic variation on speech recognition behaviour. As part of this work we are investigating automatic feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. In the experiments reported here, we compared support vector machines (SVMs) with multilayer perceptrons (MLPs). MLPs have been widely (and rather successfully) used for the task of multi-value articulatory feature classification, while (to the best of our knowledge) SVMs have not. This paper compares the performance of the two classifiers and analyses the results in order to better understand the articulatory representations. It was found that the MLPs outperformed the SVMs, but it is concluded that both classifiers exhibit similar behaviour in terms of patterns of errors.
An SVM Front-End Landmark Speech Recognition System
2008
Cited by 3 (1 self)
Support vector machines (SVMs) can be trained to detect manner transitions between phones and to identify the manner and place of articulation of any given phone. The SVMs can perform these tasks with high accuracy using a variety of acoustic representations. The SVMs generalize well to unseen test data if these data were created under identical conditions to the training corpus. Unseen acoustic data from different corpora present a problem for the SVM, even if these acoustic data were generated under similar conditions. The discriminant outputs of these SVMs are used to create both a hybrid SVM/HMM (hidden Markov model) phone recognition system and a hybrid SVM/HMM word recognition system. There is a significant improvement in both phone and word recognition accuracy when these SVM discriminant features are used instead of mel frequency cepstral coefficients (MFCCs).
Signal Separation of Similar Pitches and Instruments in a Noisy Polyphonic Domain
In Foundations of Intelligent Systems, Proceedings of ISMIS 2006, F. Esposito et, 2006
Cited by 1 (1 self)
In our continuing work on "Blind Signal Separation," this paper focuses on extending our previous work [1] by creating a data set that can successfully perform blind separation of polyphonic signals containing similar instruments playing similar notes in a noisy environment. Upon isolating and subtracting the dominant signal from a base signal containing varying types and amounts of noise, even though we purposefully excluded any identical matches in the data set, the signal separation system successfully built a resulting foreign set of synthesized sounds that the classifier correctly recognized. This paper presents a system that classifies and separates two harmonic signals with added noise. This novel methodology incorporates Knowledge Discovery, MPEG-7-based segmentation, and Inverse Fourier Transforms.
Graphical model representations of word lattices
IEEE/ACL 2006 Workshop on Spoken Language Technology (SLT), 2006
Cited by 1 (1 self)
We introduce a method for expressing word lattices within a dynamic graphical model. We describe a variety of choices for doing this, including a technique to relax the time information associated with lattice nodes in a way that trades off hypothesis expansion with presumed segmentation boundary accuracy. Our approach uses a set of time-inhomogeneous and algorithmically expressed conditional probability tables to encode the lattice. The approach was implemented as part of the graphical model toolkit, and word error rate improvements on the Switchboard corpus indicate that our technique is a viable means to incorporate large state space speech recognition systems into a graphical model. Index Terms — word lattice, graphical model, DBN, dynamic Bayesian network, dynamic graphical network, GMTK

