Results 1–10 of 30
A general regression technique for learning transductions
 Proceedings of ICML 2005
, 2005
Cited by 28 (0 self)
The problem of learning a transduction, that is, a string-to-string mapping, is a common problem arising in natural language processing and computational biology. Previous methods proposed for learning such mappings are based on classification techniques. This paper presents a new and general regression technique for learning transductions and reports the results of experiments showing its effectiveness. Our transduction learning consists of two phases: the estimation of a set of regression coefficients and the computation of the preimage corresponding to this set of coefficients. A novel and conceptually cleaner formulation of kernel dependency estimation provides a simple framework for estimating the regression coefficients, and an efficient algorithm for computing the preimage from the regression coefficients extends the applicability of kernel dependency estimation to output sequences. We report the results of a series of experiments illustrating the application of our regression technique for learning transductions.
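The two-phase scheme described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's algorithm: it uses a bag-of-bigrams dot-product kernel for both inputs and outputs, ridge-style dual coefficients for the regression phase, and an exhaustive search over a small finite candidate set for the preimage phase (the paper's contribution is an efficient preimage algorithm that avoids such enumeration). All function names and the toy data are invented.

```python
import numpy as np

def ngram_feats(s, n=2):
    """Bag-of-n-grams representation of a string."""
    f = {}
    for i in range(len(s) - n + 1):
        g = s[i:i + n]
        f[g] = f.get(g, 0) + 1
    return f

def k_str(a, b, n=2):
    """n-gram dot-product kernel between two strings."""
    fa, fb = ngram_feats(a, n), ngram_feats(b, n)
    return float(sum(v * fb.get(g, 0) for g, v in fa.items()))

def fit(X, lam=0.1):
    """Phase 1: dual regression coefficients B = (K_x + lam*I)^{-1}."""
    Kx = np.array([[k_str(a, b) for b in X] for a in X])
    return np.linalg.inv(Kx + lam * np.eye(len(X)))

def predict(x, X, Y, B, candidates):
    """Phase 2: preimage = candidate closest to the predicted point
    sum_i alpha_i * phi(y_i) in the output feature space."""
    alpha = B @ np.array([k_str(x, xi) for xi in X])
    def dist2(c):
        # ||phi(c) - sum_i alpha_i phi(y_i)||^2 with the constant term dropped
        return k_str(c, c) - 2 * sum(a * k_str(c, yi) for a, yi in zip(alpha, Y))
    return min(candidates, key=dist2)

# toy transduction: map a lowercase string to its uppercase form
X = ["ab", "bc", "cd"]
Y = ["AB", "BC", "CD"]
B = fit(X)
print(predict("ab", X, Y, B, candidates=["AB", "BC", "CD", "ZZ"]))  # AB
```

The preimage step scores each candidate by its feature-space distance to the regression output, which is exactly where an efficient decoding algorithm is needed once the output space is the set of all sequences.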
Semi-supervised classification for extracting protein interaction sentences using dependency parsing
 In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
, 2007
Cited by 19 (4 self)
We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names. Using these similarity functions, we investigate the performances of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semi-supervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively. We report significant improvement over previous results in the literature and introduce a new benchmark dataset. Semi-supervised algorithms outperform their supervised versions by a wide margin, especially when the amount of labeled data is limited.
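The two path-similarity functions can be illustrated directly. The sketch below is a hedged rendering, not the authors' exact kernels: a path is treated as a token sequence between the two protein mentions, cosine similarity views it as a bag of tokens, and the edit-distance similarity uses one common exponential transform of the token-level Levenshtein distance. The example paths are invented.

```python
import math
from collections import Counter

def cosine_sim(p1, p2):
    """Cosine similarity between two paths viewed as bags of tokens."""
    c1, c2 = Counter(p1), Counter(p2)
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def edit_distance(p1, p2):
    """Token-level Levenshtein distance between two paths."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def edit_sim(p1, p2):
    """One common way to turn the distance into a similarity in (0, 1]."""
    return math.exp(-edit_distance(p1, p2))

# hypothetical dependency paths between two tagged protein mentions
path_a = ["PROT1", "nsubj", "interacts", "prep_with", "PROT2"]
path_b = ["PROT1", "nsubj", "binds", "prep_to", "PROT2"]
print(cosine_sim(path_a, path_b))    # 0.6
print(edit_distance(path_a, path_b)) # 2
```

Either similarity can then serve as the kernel for an SVM or as the edge weight in a graph-based semi-supervised method such as harmonic functions.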
Augmented Statistical Models for Classifying Sequence Data
, 2006
Cited by 18 (0 self)
Declaration This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a technical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables.
Discriminative models for speech recognition
 In Information Theory and Applications Workshop
, 1997
Cited by 16 (6 self)
The vast majority of automatic speech recognition systems use Hidden Markov Models (HMMs) as the underlying acoustic model. Initially these models were trained based on the maximum likelihood criterion. Significant performance gains have been obtained by using discriminative training criteria, such as maximum mutual information and minimum phone error. However, the underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions, and classification is based on Bayes' decision rule. Recently, there has been interest in examining discriminative, or direct, models for speech recognition. This paper briefly reviews the forms of discriminative models that have been investigated. These include maximum entropy Markov models, hidden conditional random fields and conditional augmented models. The relationships between the various models and issues with applying them to large vocabulary continuous speech recognition will be discussed.
Weighted decomposition kernels
 In ICML '05: Proceedings of the 22nd International Conference on Machine Learning
, 2005
Cited by 14 (3 self)
We introduce a family of kernels on discrete data structures within the general class of decomposition kernels. A weighted decomposition kernel (WDK) is computed by dividing objects into substructures indexed by a selector. Two substructures are then matched if their selectors satisfy an equality predicate, while the importance of the match is determined by a probability kernel on local distributions fitted on the substructures. Under reasonable assumptions, a WDK can be computed efficiently and can avoid combinatorial explosion of the feature space. We report experimental evidence that the proposed kernel is highly competitive with respect to more complex state-of-the-art methods on a set of problems in bioinformatics.
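A minimal instance of this scheme for plain strings, with invented choices throughout: each position is a substructure, the selector is the character at that position (matched by equality), the local distribution is a normalized character histogram over a small window, and the probability kernel is the Bhattacharyya (probability product) kernel. This is a toy sketch of the WDK pattern, not the paper's sequence or graph kernels.

```python
import math
from collections import Counter

def local_dist(s, i, r=1):
    """Normalized character histogram of the radius-r window around position i."""
    window = s[max(0, i - r): i + r + 1]
    counts = Counter(window)
    total = sum(counts.values())
    return {ch: v / total for ch, v in counts.items()}

def prob_kernel(p, q):
    """Bhattacharyya (probability product) kernel on discrete distributions."""
    return sum(math.sqrt(p[ch] * q.get(ch, 0.0)) for ch in p)

def wdk(s1, s2, r=1):
    """Sum over substructure pairs whose selectors (center characters) match,
    each match weighted by a probability kernel on the local distributions."""
    total = 0.0
    for i, a in enumerate(s1):
        for j, b in enumerate(s2):
            if a == b:  # selector equality predicate
                total += prob_kernel(local_dist(s1, i, r), local_dist(s2, j, r))
    return total

print(wdk("abc", "abd"))  # 'a' and 'b' match, weighted by window overlap
```

The double loop makes the quadratic cost explicit; the efficiency claims in the abstract rest on smarter indexing of the selector matches.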
Acoustic modelling using continuous rational kernels
 in MLSP
, 2005
Cited by 9 (5 self)
There has been significant interest in developing alternatives to hidden Markov models (HMMs) for speech recognition. In particular, interest has been focused upon models that allow additional dependencies to be incorporated. One such model is the Augmented Statistical Model. Here a local exponential approximation, based upon derivatives of a base distribution, is made about some distribution of the base model. Augmented statistical models can be trained using a maximum margin criterion, which may be implemented using an SVM with a generative kernel. Calculating derivatives of the base distribution, in particular higher-order derivatives, to form the generative kernel requires complex dynamic programming algorithms. In this paper a new form of rational kernel, the continuous rational kernel, is proposed. This allows elements of the generative kernel, including those based on higher-order derivatives, to be computed using standard forms of transducer within a rational kernel framework. In addition, the derivatives are shown to be a principled method of defining marginalised kernels. Continuous rational kernels are evaluated using a large vocabulary continuous speech recognition (LVCSR) task.
Nonextensive Information Theoretic Kernels on Measures
, 2009
Cited by 9 (3 self)
Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of …
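The central quantities are straightforward to compute for finite distributions. The sketch below implements the Tsallis entropy S_q(p) = (1 - Σ_i p_i^q)/(q - 1) and a Jensen-Tsallis q-difference of the form S_q(Σ_i w_i p_i) - Σ_i w_i^q S_q(p_i), which reduces to the Jensen-Shannon divergence (in nats) at q = 1. The exact weighting convention is a plausible reading of the abstract, not a transcription of the paper.

```python
import math

def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum_i p_i^q) / (q - 1); Shannon entropy (nats) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

def jt_q_difference(p1, p2, q, w=(0.5, 0.5)):
    """Jensen-Tsallis q-difference S_q(w1*p1 + w2*p2) - (w1^q S_q(p1) + w2^q S_q(p2));
    at q = 1 this is the weighted Jensen-Shannon divergence."""
    m = [w[0] * a + w[1] * b for a, b in zip(p1, p2)]
    return tsallis_entropy(m, q) - (w[0] ** q * tsallis_entropy(p1, q)
                                    + w[1] ** q * tsallis_entropy(p2, q))

# disjoint distributions: the q = 1 case recovers JS divergence = ln 2
print(jt_q_difference([1.0, 0.0], [0.0, 1.0], q=1.0))  # 0.6931...
```

Varying q away from 1 interpolates the nonextensive family while keeping the same two building blocks, the convex mixture and the entropy.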
Maximum Mutual Information Multiphone Units in Direct Modeling
Cited by 6 (2 self)
This paper introduces a class of discriminative features for use in maximum entropy speech recognition models. The features we propose are acoustic detectors for discriminatively determined multiphone units. The multiphone units are found by computing the mutual information between the phonetic subsequences that occur in the training lexicon and the word labels. This quantity is a function of an error model governing our ability to detect phone sequences accurately (an otherwise informative sequence which cannot be reliably detected is not so useful). We show how to compute this mutual information quantity under a class of error models efficiently, in one pass over the data, for all phonetic subsequences in the training data. After this computation, detectors are created for a subset of highly informative units. We then define two novel classes of features based on these units: associative and transductive. Incorporating these features in a maximum entropy based direct model for VoiceSearch outperforms the baseline by 24% in sentence error rate. Index Terms: speech recognition, direct model, maximum mutual information, features
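As a toy illustration of the selection criterion, one can compute the mutual information between a binary "unit occurs in the pronunciation" event and the word label over a small lexicon, under a uniform word prior and no error model (the paper's contribution is precisely the efficient computation under an error model, which this sketch omits). The lexicon and the candidate unit are invented.

```python
import math

def unit_word_mi(lexicon, unit):
    """Mutual information (nats) between the event 'unit occurs in the
    pronunciation' and the word label, assuming a uniform prior over words."""
    n = len(lexicon)
    p_w = 1.0 / n
    p_u = sum(unit in phones for phones in lexicon.values()) / n
    mi = 0.0
    for phones in lexicon.values():
        occurs = unit in phones
        for present, p_event in ((True, p_u), (False, 1.0 - p_u)):
            # the word deterministically fixes the occurrence event
            p_joint = p_w if occurs == present else 0.0
            if p_joint > 0 and p_event > 0:
                mi += p_joint * math.log(p_joint / (p_event * p_w))
    return mi

# hypothetical lexicon; the unit "k ae" separates {cat, cab} from {dog}
lex = {"cat": "k ae t", "cab": "k ae b", "dog": "d ao g"}
print(unit_word_mi(lex, "k ae"))
```

Because the word determines the occurrence event here, the score reduces to the entropy of the occurrence probability; an error model would soften the deterministic joint distribution.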
A comparison of classifiers for detecting emotion from speech
 in IEEE International Conference on Acoustics, Speech and Signal Processing
, 2005
Cited by 5 (2 self)
Accurate detection of emotion from speech has clear benefits for the design of more natural human-machine speech interfaces or for the extraction of useful information from large quantities of speech data. The task consists of assigning, …
Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting
, 2006
Cited by 4 (2 self)
We develop kernels for measuring the similarity between relational instances using background knowledge expressed in first-order logic. The method allows us to bridge the gap between traditional inductive logic programming (ILP) representations and statistical approaches to supervised learning. Logic programs are first used to generate proofs of given visitor programs that use predicates declared in the available background knowledge. A kernel is then defined over pairs of proof trees. The method can be used for supervised learning tasks and is suitable for classification as well as regression. We report positive empirical results on Bongard-like and M-of-N problems that are difficult or impossible to solve with traditional ILP techniques, as well as on real bioinformatics and chemoinformatics data sets.