Results 1–10 of 38
A general regression technique for learning transductions
In Proceedings of ICML, 2005
Cited by 30 (1 self)
Abstract:
The problem of learning a transduction, that is, a string-to-string mapping, is a common problem arising in natural language processing and computational biology. Previous methods proposed for learning such mappings are based on classification techniques. This paper presents a new and general regression technique for learning transductions and reports the results of experiments showing its effectiveness. Our transduction learning consists of two phases: the estimation of a set of regression coefficients and the computation of the pre-image corresponding to this set of coefficients. A novel and conceptually cleaner formulation of kernel dependency estimation provides a simple framework for estimating the regression coefficients, and an efficient algorithm for computing the pre-image from the regression coefficients extends the applicability of kernel dependency estimation to output sequences. We report the results of a series of experiments illustrating the application of our regression technique for learning transductions.
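The two phases this abstract describes — estimating regression coefficients and then computing a pre-image — can be sketched roughly as follows. This is a minimal illustration, not the paper's efficient pre-image algorithm: the ridge formulation, function names, and the brute-force candidate search are assumptions.

```python
import numpy as np

def fit_kde(K_x, Phi_y, lam=1e-2):
    """Phase 1: kernel ridge regression coefficients mapping input
    kernel columns to output feature vectors.
    K_x: (n, n) input Gram matrix; Phi_y: (n, d) output features."""
    n = K_x.shape[0]
    # Closed-form ridge solution: C = (K_x + lam*I)^-1 Phi_y
    return np.linalg.solve(K_x + lam * np.eye(n), Phi_y)

def predict_preimage(k_test, C, candidates, phi):
    """Phase 2: pre-image by exhaustive search over a candidate set
    (the paper contributes an efficient algorithm for sequences;
    brute force is shown here only for clarity)."""
    target = k_test @ C  # predicted point in output feature space
    return min(candidates, key=lambda y: np.linalg.norm(phi(y) - target))
```

With a linear kernel and identical input/output features, the predicted pre-image of a training point recovers that point.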
Semi-supervised classification for extracting protein interaction sentences using dependency parsing
In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007
Cited by 19 (4 self)
Abstract:
We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names. Using these similarity functions, we investigate the performance of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semi-supervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively. Significant improvement over previous results in the literature is reported, and a new benchmark dataset is introduced. The semi-supervised algorithms outperform their supervised versions by a wide margin, especially when the amount of labeled data is limited.
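As a rough illustration of the two path similarity functions the abstract mentions, here is a minimal sketch of a word-level edit distance and a cosine similarity between dependency paths. The exponential mapping from distance to similarity is one common choice and is an assumption, not necessarily the paper's exact kernel.

```python
from collections import Counter
import math

def edit_distance(p, q):
    """Word-level Levenshtein distance between two dependency paths."""
    m, n = len(p), len(q)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (p[i - 1] != q[j - 1]))  # substitution
    return d[m][n]

def edit_kernel(p, q, gamma=0.5):
    """Turn the distance into a similarity (an illustrative choice)."""
    return math.exp(-gamma * edit_distance(p, q))

def cosine_kernel(p, q):
    """Cosine similarity between bag-of-words vectors of the paths."""
    a, b = Counter(p), Counter(q)
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))
```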
Augmented Statistical Models for Classifying Sequence Data
2006
Cited by 18 (0 self)
Abstract:
Declaration: This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a technical report [65]. The length of this thesis, including appendices, bibliography, footnotes, tables and equations, is approximately 60,000 words. This thesis contains 27 figures and 20 tables.
Discriminative models for speech recognition
In Information Theory and Applications Workshop, 1997
Cited by 16 (6 self)
Abstract:
The vast majority of automatic speech recognition systems use Hidden Markov Models (HMMs) as the underlying acoustic model. Initially these models were trained based on the maximum likelihood criterion. Significant performance gains have been obtained by using discriminative training criteria, such as maximum mutual information and minimum phone error. However, the underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions, and classification is based on Bayes' decision rule. Recently, there has been interest in examining discriminative, or direct, models for speech recognition. This paper briefly reviews the forms of discriminative models that have been investigated. These include maximum entropy Markov models, hidden conditional random fields, and conditional augmented models. The relationships between the various models and issues with applying them to large vocabulary continuous speech recognition are discussed.
Weighted decomposition kernels
In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, 2005
Cited by 14 (3 self)
Abstract:
We introduce a family of kernels on discrete data structures within the general class of decomposition kernels. A weighted decomposition kernel (WDK) is computed by dividing objects into substructures indexed by a selector. Two substructures are then matched if their selectors satisfy an equality predicate, while the importance of the match is determined by a probability kernel on local distributions fitted on the substructures. Under reasonable assumptions, a WDK can be computed efficiently and can avoid combinatorial explosion of the feature space. We report experimental evidence that the proposed kernel is highly competitive with respect to more complex state-of-the-art methods on a set of problems in bioinformatics.
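A toy version of the WDK idea on strings might look like the following: the selector is the character itself, the local distribution is a character histogram in a small window, and the probability kernel is the Bhattacharyya kernel. All of these concrete choices are illustrative assumptions, not the paper's exact construction.

```python
from collections import Counter
import math

def local_dist(s, i, r=2):
    """Normalized character histogram in a radius-r window around position i."""
    win = s[max(0, i - r): i + r + 1]
    c = Counter(win)
    n = sum(c.values())
    return {ch: v / n for ch, v in c.items()}

def prob_kernel(p, q):
    """Bhattacharyya kernel between two discrete distributions."""
    return sum(math.sqrt(p[ch] * q.get(ch, 0.0)) for ch in p)

def wdk(x, y, r=2):
    """Selector = the character itself; matched substructures are
    weighted by a probability kernel on their local distributions."""
    k = 0.0
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            if a == b:  # equality predicate on selectors
                k += prob_kernel(local_dist(x, i, r), local_dist(y, j, r))
    return k
```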
Nonextensive Information Theoretic Kernels on Measures
2009
Cited by 9 (3 self)
Abstract:
Positive definite kernels on probability measures have recently been applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a kth-order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which include the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of ...
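For reference, the classical JS divergence, the Tsallis q-entropy, and an equal-weight Jensen-Tsallis q-difference can be sketched as below. The exponent-q weighting in the q-difference follows the construction the abstract describes, but the exact equal-weight form shown here is an assumption.

```python
import math

def shannon(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def js_divergence(p, q):
    """Classical Jensen-Shannon divergence with equal weights."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return shannon(m) - (shannon(p) + shannon(q)) / 2

def tsallis(p, q_param):
    """Nonextensive Tsallis q-entropy; recovers Shannon entropy as q -> 1."""
    if q_param == 1.0:
        return shannon(p)
    return (1.0 - sum(pi ** q_param for pi in p)) / (q_param - 1.0)

def jt_q_difference(p, q, q_param):
    """Jensen-Tsallis q-difference: the JS construction with Shannon's
    entropy replaced by Tsallis q-entropy and the mixture weights
    entering with exponent q (equal weights 1/2 shown here)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return tsallis(m, q_param) - (0.5 ** q_param) * (
        tsallis(p, q_param) + tsallis(q, q_param))
```

At q = 1 the JT q-difference reduces to the classical JS divergence.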
Acoustic modelling using continuous rational kernels
In MLSP, 2005
Cited by 9 (5 self)
Abstract:
There has been significant interest in developing alternatives to hidden Markov models (HMMs) for speech recognition. In particular, interest has focused upon models that allow additional dependencies to be incorporated. One such model is the Augmented Statistical Model. Here a local exponential approximation, based upon derivatives of a base distribution, is made about some distribution of the base model. Augmented statistical models can be trained using a maximum margin criterion, which may be implemented using an SVM with a generative kernel. Calculating derivatives of the base distribution, in particular higher-order derivatives, to form the generative kernel requires complex dynamic programming algorithms. In this paper a new form of rational kernel, a continuous rational kernel, is proposed. This allows elements of the generative kernel, including those based on higher-order derivatives, to be computed using standard forms of transducer within a rational kernel framework. In addition, the derivatives are shown to be a principled method of defining marginalised kernels. Continuous rational kernels are evaluated on a large vocabulary continuous speech recognition (LVCSR) task.
A comparison of classifiers for detecting emotion from speech
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2005
Cited by 6 (2 self)
Abstract:
Accurate detection of emotion from speech has clear benefits for the design of more natural human-machine speech interfaces or for the extraction of useful information from large quantities of speech data. The task consists of assigning, ...
Maximum Mutual Information Multiphone Units in Direct Modeling
Cited by 6 (2 self)
Abstract:
This paper introduces a class of discriminative features for use in maximum entropy speech recognition models. The features we propose are acoustic detectors for discriminatively determined multiphone units. The multiphone units are found by computing the mutual information between the phonetic subsequences that occur in the training lexicon and the word labels. This quantity is a function of an error model governing our ability to detect phone sequences accurately (an otherwise informative sequence which cannot be reliably detected is not so useful). We show how to compute this mutual information quantity efficiently under a class of error models, in one pass over the data, for all phonetic subsequences in the training data. After this computation, detectors are created for a subset of highly informative units. We then define two novel classes of features based on these units: associative and transductive. Incorporating these features in a maximum entropy based direct model for Voice Search outperforms the baseline by 24% in sentence error rate.
Index Terms: speech recognition, direct model, maximum mutual information, features
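The core quantity — mutual information between a detected unit and the word label, degraded by a detection-error model — can be sketched as follows. The symmetric error model and the dict-based joint distribution are illustrative assumptions; the paper's one-pass computation over all subsequences is not reproduced here.

```python
import math

def mutual_information(joint):
    """I(U; W) from a joint distribution over (unit-detected, word)
    pairs, given as a dict {(u, w): p}."""
    pu, pw = {}, {}
    for (u, w), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pw[w] = pw.get(w, 0.0) + p
    return sum(p * math.log(p / (pu[u] * pw[w]))
               for (u, w), p in joint.items() if p > 0)

def noisy_joint(joint, e):
    """Apply a simple symmetric detection-error model with rate e to
    the binary 'detected' variable u (an illustrative assumption)."""
    out = {}
    for (u, w), p in joint.items():
        out[(u, w)] = out.get((u, w), 0.0) + p * (1 - e)
        out[(1 - u, w)] = out.get((1 - u, w), 0.0) + p * e
    return out
```

A perfectly detected, perfectly informative unit yields I = log 2 for a balanced binary word label; as the detection-error rate grows, the mutual information, and hence the unit's usefulness, shrinks toward zero.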
3-Way Composition of Weighted Finite-State Transducers
2008
Cited by 4 (2 self)
Abstract:
Composition of weighted transducers is a fundamental algorithm used in many applications, including computing complex edit-distances between automata or string kernels in machine learning, and combining different components of a speech recognition, speech synthesis, or information extraction system. We present a generalization of the composition of weighted transducers, 3-way composition, which is dramatically faster in practice than the standard composition algorithm when combining more than two transducers. The worst-case complexity of our algorithm for composing three transducers T1, T2, and T3, resulting in T, is O(|T|_Q min(d(T1) d(T3), d(T2)) + |T|_E), where |·|_Q denotes the number of states, |·|_E the number of transitions, and d(·) the maximum out-degree. As in regular composition, the use of perfect hashing requires a pre-processing step with linear-time expected complexity in the size of the input transducers. In many cases, this approach significantly improves on the complexity of standard composition. Our algorithm also leads to a dramatically faster composition in practice. Furthermore, standard composition can be obtained as a special case of our algorithm. We report the results of several experiments demonstrating this improvement. These theoretical and empirical improvements significantly enhance performance in the applications already mentioned.
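For context, the standard pairwise composition that 3-way composition generalizes can be sketched as below, assuming epsilon-free transducers and tropical (additive) weights. The tuple-based transducer representation is an illustrative assumption.

```python
from collections import deque

def compose(t1, t2):
    """Standard pairwise composition of two epsilon-free weighted
    transducers in the tropical semiring (weights add along a path).
    A transducer is (start, finals, arcs) with
    arcs: {state: [(in_label, out_label, weight, next_state), ...]}."""
    s1, f1, a1 = t1
    s2, f2, a2 = t2
    start = (s1, s2)
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        q1, q2 = queue.popleft()
        if q1 in f1 and q2 in f2:
            finals.add((q1, q2))
        for i1, o1, w1, n1 in a1.get(q1, []):
            for i2, o2, w2, n2 in a2.get(q2, []):
                if o1 == i2:  # match T1's output against T2's input
                    arcs.setdefault((q1, q2), []).append(
                        (i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        queue.append((n1, n2))
    return start, finals, arcs
```

Composing three transducers with this algorithm requires two passes and materializes an intermediate result; avoiding that intermediate machine is where the 3-way algorithm gains its speed.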