Results 1–10 of 39
Kernel-Based Learning of Hierarchical Multilabel Classification Models
Journal of Machine Learning Research, 2006
Abstract

Cited by 53 (6 self)
We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic-programming-based algorithm that computes best update directions in the feasible set. Experiments show …
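The inner loop described here, conditional gradient ascent over a feasible set, can be illustrated on the simplest such set, the probability simplex. This is a generic sketch of the Frank-Wolfe technique, not the paper's marginal-dual algorithm: the quadratic objective and all names are invented for illustration.

```python
import numpy as np

def frank_wolfe_simplex(A, b, steps=200):
    """Maximize the concave quadratic f(mu) = b.mu - 0.5 mu.A.mu over the
    probability simplex by conditional gradient (Frank-Wolfe) ascent."""
    n = len(b)
    mu = np.full(n, 1.0 / n)             # start at the barycentre
    for _ in range(steps):
        g = b - A @ mu                   # gradient of f at mu
        s = np.zeros(n)
        s[int(np.argmax(g))] = 1.0       # linear maximizer over the simplex = a vertex
        d = s - mu                       # feasible ascent direction
        curv = d @ A @ d
        if curv <= 1e-12:
            step = 1.0 if g @ d > 0 else 0.0
        else:
            step = min(1.0, max(0.0, (g @ d) / curv))   # exact line search, clipped
        mu = mu + step * d               # convex combination stays feasible
    return mu
```

Each update moves toward a vertex of the feasible set, so every iterate remains a valid (dual) distribution without any projection step.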
Structured prediction, dual extragradient and Bregman projections
Journal of Machine Learning Research, 2006
Abstract

Cited by 45 (2 self)
We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
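The extragradient idea, take a projected look-ahead gradient step and then re-step from the look-ahead point, can be sketched on a toy bilinear saddle-point problem with box constraints, so that the projections reduce to clipping. This is a generic illustration of the method, not the paper's structured formulation; the toy objective is invented.

```python
import numpy as np

def extragradient_saddle(a, eta=0.2, steps=500):
    """Projected extragradient sketch for the toy bilinear problem
    min_{x in [0,1]^n} max_{y in [-1,1]^n} (x - a) . y,
    whose saddle point is (a, 0) when a lies strictly inside the box."""
    n = len(a)
    x, y = np.full(n, 0.5), np.full(n, 0.2)
    proj_x = lambda v: np.clip(v, 0.0, 1.0)   # projections are just clipping here
    proj_y = lambda v: np.clip(v, -1.0, 1.0)
    for _ in range(steps):
        # look-ahead step from the current point ...
        xh = proj_x(x - eta * y)              # grad_x L = y
        yh = proj_y(y + eta * (x - a))        # grad_y L = x - a
        # ... then the actual update uses gradients at the look-ahead point
        x = proj_x(x - eta * yh)
        y = proj_y(y + eta * (xh - a))
    return x, y
```

Plain gradient descent-ascent diverges on bilinear objectives; the look-ahead step is what makes the iterates spiral inward to the saddle point.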
Web-Scale N-gram Models for Lexical Disambiguation
Abstract

Cited by 31 (4 self)
Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combine information from multiple and overlapping segments of context. On the tasks of preposition selection and context-sensitive spelling correction, the supervised system reduces disambiguation error by 20–24% over the current state-of-the-art.
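The core idea, scoring a candidate word by pooling counts from every overlapping n-gram of context rather than a single fixed span, can be sketched with an invented toy count table (a real system would query a web-scale n-gram corpus):

```python
import math

# Toy stand-in for a web-scale n-gram count table (all counts invented).
COUNTS = {
    ("interested", "in"): 90000, ("in", "learning"): 400000,
    ("interested", "in", "learning"): 25000,
    ("interested", "on"): 300, ("on", "learning"): 9000,
    ("interested", "on", "learning"): 5,
}

def best_preposition(left, right, candidates=("in", "on")):
    """Pick the candidate whose overlapping context n-grams (both bigrams
    and the trigram spanning the slot) have the largest pooled log-count."""
    def pooled(prep):
        grams = [(left, prep), (prep, right), (left, prep, right)]
        return sum(math.log1p(COUNTS.get(g, 0)) for g in grams)
    return max(candidates, key=pooled)
```

Summing log-counts over overlapping segments lets evidence from short and long contexts reinforce each other instead of relying on one span alone.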
A Bayesian model for supervised clustering with the Dirichlet process prior
Journal of Machine Learning Research, 2005
Abstract

Cited by 26 (0 self)
We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics.
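The basic MCMC move behind Dirichlet-process inference is a collapsed-Gibbs reassignment under the Chinese restaurant process prior, which a minimal sketch can show. The likelihood terms below are deliberately simplified stand-ins (unit-variance Gaussians, an N(0, 1) base measure), not the paper's reference-type model:

```python
import math, random

def crp_gibbs_step(i, data, assign, alpha=1.0):
    """One collapsed-Gibbs reassignment of item i under a Chinese restaurant
    process prior. Likelihoods are simplified stand-ins: unit-variance
    Gaussians around each cluster's current mean, and the N(0, 2) prior
    predictive (base measure N(0, 1)) for opening a new cluster."""
    z_old = assign.pop(i)
    members = {}
    for j, z in assign.items():
        members.setdefault(z, []).append(data[j])
    weights, labels = [], []
    for z, pts in members.items():
        mu = sum(pts) / len(pts)
        weights.append(len(pts) * math.exp(-0.5 * (data[i] - mu) ** 2))
        labels.append(z)                  # existing table: weight ∝ its size
    weights.append(alpha * math.exp(-data[i] ** 2 / 4.0) / math.sqrt(2.0))
    labels.append(max(list(members) + [z_old]) + 1)   # fresh label for a new table
    r, acc = random.random() * sum(weights), 0.0
    for w, z in zip(weights, labels):
        acc += w
        if r <= acc:
            assign[i] = z
            break
    return assign[i]
```

Because the new-table weight stays strictly positive, the sampler can always create clusters on demand, which is how the countably infinite model is explored with finite computation.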
Semantic annotation of unstructured and ungrammatical text
In International Joint Conference on Artificial Intelligence (IJCAI), 2005
Abstract

Cited by 15 (4 self)
The Semantic Web will revolutionize the use of the internet, but the idea faces some major challenges. First, construction of the Semantic Web requires a lot of extra markup on documents, but this work should not be forced upon everyday
Maximum Margin Coresets for Active and Noise Tolerant Learning
Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2006
Abstract

Cited by 12 (0 self)
We study the problem of learning large-margin halfspaces in various settings using coresets to show that coresets are a widely applicable tool for large-margin learning. A large-margin coreset is a subset of the input data sufficient for approximating the true maximum-margin solution. In this work, we provide a direct algorithm and analysis for constructing large-margin coresets. We show various applications, including a novel coreset-based analysis of large-margin active learning and a polynomial-time (in the number of input data and the amount of noise) algorithm for agnostic learning in the presence of outlier noise. We also highlight a simple extension to multiclass classification problems and structured output learning.
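The greedy flavor of coreset construction, repeatedly adding the point that most violates the margin of a model trained on the current subset, can be sketched as follows. This uses a margin perceptron as a crude subset solver and carries none of the paper's approximation guarantees; all names are illustrative.

```python
import numpy as np

def margin_perceptron(X, y, target=0.5, epochs=200):
    """Crude stand-in for the subset solver: perceptron updates until every
    subset point clears a fixed functional margin."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        clean = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) < target:
                w = w + yi * xi
                clean = False
        if clean:
            break
    return w

def greedy_coreset(X, y, k=4):
    """Grow a candidate coreset by repeatedly adding the point with the
    worst geometric margin under the model trained on the current subset."""
    idx = [0, int(np.argmax(y != y[0]))]        # seed with one point per class
    while len(idx) < k:
        w = margin_perceptron(X[idx], y[idx])
        margins = y * (X @ w) / (np.linalg.norm(w) + 1e-12)
        worst = int(np.argmin(margins))
        if worst in idx:                        # subset already covers the worst case
            break
        idx.append(worst)
    return idx
```

The same "query the worst-margin point" loop is what connects coresets to active learning: the point the subset model is least sure about is exactly the one worth labeling next.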
Maximum Entropy Discrimination Markov Networks
2008
Abstract

Cited by 11 (6 self)
Standard max-margin structured prediction methods concentrate directly on the input-output mapping, and the lack of an elegant probabilistic interpretation causes limitations. In this paper, we present a novel framework called Maximum Entropy Discrimination Markov Networks (MaxEntNet) to do Bayesian max-margin structured learning by using expected margin constraints to define a feasible distribution subspace and applying the maximum entropy principle to choose the best distribution from this subspace. We show that MaxEntNet subsumes the standard max-margin Markov networks (M³N) as a special case where the predictive model is assumed to be linear and the parameter prior is a standard normal. Based on this understanding, we propose the Laplace max-margin Markov networks (LapM³N), which use the Laplace prior instead of the standard normal. We show that the adoption of a Laplace prior for the parameters makes LapM³N enjoy properties expected from a sparsified M³N. Unlike L1-regularized maximum likelihood estimation, which sets small weights to zero to achieve sparsity, LapM³N posteriorly weights the parameters, and features with smaller weights are shrunk more. This posterior weighting effect makes LapM³N more stable with respect to the magnitudes of the regularization coefficients and more generalizable. To …
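The shrinkage contrast drawn at the end can be seen in the familiar MAP estimates for the two priors. This is the classic ridge/lasso picture used only to build intuition, not the LapM³N posterior itself:

```python
import numpy as np

w = np.array([4.0, 1.0, 0.1])       # weights of very different magnitude
lam = 0.5

# Gaussian prior (ridge MAP): every weight keeps the same fraction of itself.
ridge = w / (1.0 + lam)

# Laplace prior (lasso MAP): a constant amount is subtracted, so smaller
# weights lose proportionally more and may be zeroed outright.
lasso = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

Per the abstract, LapM³N's posterior weighting sits between these extremes: smaller weights are shrunk more, as in the Laplace case, but without being clipped exactly to zero.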
Structured Prediction with Reinforcement Learning
Machine Learning Journal, 2008
Abstract

Cited by 7 (3 self)
We formalize the problem of structured prediction as a reinforcement learning task. We first define a Structured Prediction Markov Decision Process (SP-MDP), an instantiation of Markov decision processes for structured prediction, and show that learning an optimal policy for this SP-MDP is equivalent to minimizing the empirical loss. This link between the supervised learning formulation of structured prediction and reinforcement learning (RL) allows us to use approximate RL methods for learning the policy. The proposed model makes weak hypotheses both about the nature of the structured prediction problem and about the supervision process. It does not make any assumption about the decomposition of loss functions, the data encoding, or the availability of optimal policies for training. It thus allows us to cope with a large range of structured prediction problems. Besides, it scales well and can be used for solving both complex and large-scale real-world problems. We describe two series of experiments. The first compares the model with state-of-the-art algorithms on classical sequence prediction benchmarks. The second introduces a complex tree transformation problem. The proposed algorithm is evaluated on this problem using general large-scale datasets.
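The MDP view can be made concrete for sequence labeling: the state is the prefix of labels emitted so far, each action appends one label, and the episode's return is the negative task loss. A minimal sketch, with an invented hand-written policy standing in for the learned one:

```python
def rollout(words, policy):
    """Decode left-to-right as an MDP episode: the state is the prefix of
    predictions made so far, and each action appends one more label."""
    prefix = []
    for t in range(len(words)):
        prefix.append(policy(words, t, tuple(prefix)))
    return prefix

def hamming_reward(pred, gold):
    """Terminal reward = negative Hamming loss, the quantity the policy
    should maximize under the loss-minimization equivalence."""
    return -sum(p != g for p, g in zip(pred, gold))

# A hypothetical hand-written policy (not a learned one): tag a word
# ending in "s" as a verb when it directly follows a noun.
def toy_policy(words, t, prefix):
    if words[t].endswith("s") and prefix and prefix[-1] == "N":
        return "V"
    return "N"
```

Because the reward is only a function of the completed episode, nothing here requires the loss to decompose over positions, which is the flexibility the abstract emphasizes.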
Acclimatizing Taxonomic Semantics for Hierarchical Content Classification
Abstract

Cited by 5 (2 self)
Hierarchical models have been shown to be effective in content classification. However, we observe through empirical study that the performance of a hierarchical model varies with given taxonomies; even a semantically sound taxonomy has potential to change its structure for better classification. By scrutinizing typical cases, we elucidate why a given semantics-based hierarchy does not work well in content classification, and how it could be improved for accurate hierarchical classification. With these understandings, we propose effective localized solutions that modify the given taxonomy for accurate classification. We conduct extensive experiments on both toy and real-world data sets, report improved performance and interesting findings, and provide further analysis of algorithmic issues such as time complexity, robustness, and sensitivity to the number of features.
An introduction to structured discriminative learning
2006
Abstract

Cited by 5 (2 self)
We provide a tutorial overview of supervised learning with structured outputs. Taking the perspective of linear classification, we describe several recently discussed approaches in a unified framework, in which particular instantiations differ only in the choice of loss function. We describe in detail the problems of parameter estimation and inference in these models and discuss nonparametric variants that are based on the use of kernels.
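The unified view (same linear score, different loss) can be sketched for a toy problem whose output set is small enough to enumerate; real structured models compute the max and the log-partition with dynamic programming rather than an explicit list:

```python
import math

def structured_hinge(scores, losses, gold):
    """Margin-rescaled hinge: max_y [score(y) + loss(y, gold)] - score(gold)."""
    return max(s + l for s, l in zip(scores, losses)) - scores[gold]

def structured_log_loss(scores, gold):
    """CRF-style negative log-likelihood: log Z - score(gold)."""
    return math.log(sum(math.exp(s) for s in scores)) - scores[gold]
```

Both losses consume the same vector of linear scores over candidate outputs; swapping one function for the other is exactly the "choice of loss function" on which the surveyed approaches differ.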