Results 1–10 of 15
Using Maximum Entropy for Text Classification (1999)
"... This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, partofspeech tagging, and text segmentation. The underlying principl ..."
Abstract

Cited by 261 (5 self)
 Add to MetaCart
This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that, without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution, which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the re...
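The conditional model this abstract describes has the familiar exponential form p(c | d) ∝ exp(Σᵢ λᵢ fᵢ(d, c)). A minimal sketch of evaluating that distribution follows; the word features, weights, and class names are purely illustrative assumptions, and the paper's training procedure (improved iterative scaling) is not reproduced here:

```python
import math

def maxent_conditional(features, weights):
    """p(c|d) ∝ exp(sum_i lambda_i * f_i(d, c)): score each class, normalize."""
    classes = {c for (_, c) in weights}
    scores = {c: sum(weights.get((f, c), 0.0) * v for f, v in features.items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {c: math.exp(s) / z for c, s in scores.items()}

# Illustrative binary word features for a toy document and two toy classes.
doc = {"ball": 1.0, "goal": 1.0}
lam = {("ball", "sports"): 1.2, ("goal", "sports"): 0.8, ("goal", "finance"): 0.3}
probs = maxent_conditional(doc, lam)
```

Because the weights here are hand-picked rather than fit to data, the sketch only shows the shape of the conditional distribution, not the estimation step.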
Logistic Regression, AdaBoost and Bregman Distances (2000)
"... We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt al ..."
Abstract

Cited by 203 (43 self)
 Add to MetaCart
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
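As a concrete reference point for the sequential (one-parameter-at-a-time) update style the abstract contrasts with parallel updates, here is a minimal AdaBoost loop on toy 1-D data; the stump pool and data are illustrative assumptions, not from the paper:

```python
import math

def adaboost(X, y, stumps, rounds=5):
    """Sequential update: each round selects one weak hypothesis and
    re-weights the examples multiplicatively, as in AdaBoost."""
    n = len(X)
    w = [1.0 / n] * n                      # example weights
    ensemble = []                          # (alpha, stump) pairs
    for _ in range(rounds):
        # pick the stump with the smallest weighted error
        errs = [sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
                for h in stumps]
        j = min(range(len(stumps)), key=errs.__getitem__)
        e = max(errs[j], 1e-12)
        if e >= 0.5:
            break
        alpha = 0.5 * math.log((1 - e) / e)
        ensemble.append((alpha, stumps[j]))
        # multiplicative re-weighting, then renormalization
        w = [wi * math.exp(-alpha * yi * stumps[j](xi))
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy 1-D data: label is +1 iff x > 2; stumps threshold at integers.
X, y = [0, 1, 3, 4], [-1, -1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(5)]
clf = adaboost(X, y, stumps)
```

A parallel-update variant would instead adjust all coordinates in each round; this sketch only illustrates the sequential extreme the abstract mentions.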
Information Distance (1997)
"... While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two pictures. We give several natural definitions of a universal inf ..."
Abstract

Cited by 36 (4 self)
 Add to MetaCart
While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two pictures. We give several natural definitions of a universal information metric, based on the length of shortest programs for either ordinary computations or reversible (dissipationless) computations. It turns out that these definitions are equivalent up to an additive logarithmic term. We show that the information distance is a universal cognitive similarity distance. We investigate the maximal correlation of the shortest programs involved, the maximal uncorrelation of programs (a generalization of the Slepian–Wolf theorem of classical information theory), and the density properties of the discrete metric spaces induced by the information distances. A related distance measures the amount of non-reversibility of a computation. Using the physical theo...
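The information distance itself is defined via shortest programs and is uncomputable. A widely used practical proxy from later work (not this paper) replaces Kolmogorov complexity with the output length of a real compressor, giving the normalized compression distance; a minimal sketch with zlib:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: a computable stand-in for the
    (uncomputable) information distance, with zlib approximating K(.)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"pack my box with five dozen liquor jugs " * 50
```

Identical objects compress jointly almost for free, so ncd(a, a) is near 0, while unrelated objects share little and score much higher.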
Learning to tag from open vocabulary labels (In ECML PKDD ’10, 2010)
"... Abstract. Most approaches to classifying media content assume a fixed, closed vocabulary of labels. In contrast, we advocate machine learning approaches which take advantage of the millions of freeform tags obtainable via online crowdsourcing platforms and social tagging websites. The use of such ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
Most approaches to classifying media content assume a fixed, closed vocabulary of labels. In contrast, we advocate machine learning approaches which take advantage of the millions of free-form tags obtainable via online crowdsourcing platforms and social tagging websites. The use of such open vocabularies presents learning challenges due to typographical errors, synonymy, and a potentially unbounded set of tag labels. In this work, we present a new approach that organizes these noisy tags into well-behaved semantic classes using topic modeling, and learns to predict tags accurately using a mixture of topic classes. This method can utilize an arbitrary open vocabulary of tags, reduces training time by 94% compared to learning from these tags directly, and achieves comparable performance for classification and superior performance for retrieval. We also demonstrate that on open vocabulary tasks, human evaluations are essential for measuring the true performance of tag classifiers, which traditional evaluation methods will consistently underestimate. We focus on the domain of tagging music clips, and demonstrate our results using data collected with a human computation game called TagATune.
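The "mixture of topic classes" prediction step can be sketched as a simple convex combination, p(tag | clip) = Σₖ p(topicₖ | clip) · p(tag | topicₖ); the topics and tag distributions below are invented for illustration and are not from the TagATune data:

```python
# Illustrative topic -> tag distributions (hypothetical, not the paper's).
topics = {
    "guitar-rock": {"guitar": 0.5, "rock": 0.4, "loud": 0.1},
    "classical":   {"piano": 0.6, "calm": 0.3, "strings": 0.1},
}

def tag_probs(topic_weights):
    """Mixture-of-topic-classes tag prediction for one clip:
    p(tag|clip) = sum_k p(topic_k|clip) * p(tag|topic_k)."""
    out = {}
    for topic, w in topic_weights.items():
        for tag, p in topics[topic].items():
            out[tag] = out.get(tag, 0.0) + w * p
    return out

clip = {"guitar-rock": 0.8, "classical": 0.2}   # topic weights for one clip
p = tag_probs(clip)
```

Because both factors are probability distributions, the mixture is one too, which is what lets a small number of topic classes stand in for an unbounded tag vocabulary.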
A Simple Converse of Burnashev’s Reliability Function
"... determined the reliability function of variablelength block codes over discrete memoryless channels (DMCs) with feedback. Subsequently, an alternative achievability proof was obtained by Yamamoto and Itoh via a particularly simple and instructive scheme. Their idea is to alternate between a communi ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
Burnashev determined the reliability function of variable-length block codes over discrete memoryless channels (DMCs) with feedback. Subsequently, an alternative achievability proof was obtained by Yamamoto and Itoh via a particularly simple and instructive scheme. Their idea is to alternate between a communication and a confirmation phase until the receiver detects the codeword used by the sender to acknowledge that the message is correct. We provide a converse that parallels the Yamamoto–Itoh achievability construction. Besides being simpler than the original, the proposed converse suggests that a communication and a confirmation phase are implicit in any scheme for which the probability of error decreases with the largest possible exponent. The proposed converse also makes it intuitively clear why the terms that appear in Burnashev’s exponent are necessary. Index Terms—Burnashev’s error exponent, discrete memoryless channels (DMCs), feedback, reliability function, variable-length communication.
Codeword distinguishability in minimum diversity decoding (submitted to Journal of Discrete Mathematical Sciences and Cryptography, 2005)
"... We retake a codingtheoretic notion which goes back to Cl. Shannon: codeword distinguishability. This notion is standard in zeroerror information theory, but its bearing is definitely wider and it may help to better understand new forms of coding, as we argue below. In our approach, the underlyin ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
We revisit a coding-theoretic notion which goes back to Cl. Shannon: codeword distinguishability. This notion is standard in zero-error information theory, but its bearing is definitely wider, and it may help to better understand new forms of coding, as we argue below. In our approach, the underlying decoding principle is very simple and very general: one decodes by trying to minimise the diversity (in the simplest case, the Hamming distance) between a codeword and the output sequence observed at the end of the noisy transmission channel. Symmetrically and equivalently, minimum-diversity decoders and codeword distinguishabilities may be replaced by maximum-similarity decoders and codeword confusabilities. The operational meaning of codeword distinguishability is made clear by a reliability criterion, which generalises the well-known criterion on minimum Hamming distances for error-correction codes. We investigate the formal properties of distinguishabilities versus diversities; these two notions are deeply related, and yet essentially different. An encoding theorem is put forward, which supports and suggests old and new code constructions. In a list of case studies, we examine channels with crossovers and erasures, or with crossovers, deletions and insertions, a channel of cryptographic interest, and the case of a few “odd distances” taken from DNA word design.
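In the simplest case the abstract mentions, diversity is the Hamming distance, and minimum-diversity decoding reduces to a nearest-codeword search; a minimal sketch (the repetition code is an illustrative choice, not from the paper):

```python
def hamming(u: str, v: str) -> int:
    """Diversity in the simplest case: the Hamming distance."""
    return sum(a != b for a, b in zip(u, v))

def min_diversity_decode(received: str, codebook: list[str]) -> str:
    """Decode to the codeword minimizing diversity from the channel output."""
    return min(codebook, key=lambda c: hamming(received, c))

code = ["00000", "11111"]   # illustrative 5-fold repetition code
decoded = min_diversity_decode("01000", code)
```

Swapping the diversity function (e.g. for an edit distance on channels with deletions and insertions) changes the decoder without changing this decoding principle.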
LOSSLESS COMPRESSION AND ALPHABET SIZE (2006)
"... Lossless data compression through exploiting redundancy in a sequence of symbols is a wellstudied field in computer science and information theory. One way to achieve compression is to statistically model the data and estimate model parameters. In practice, most general purpose data compression alg ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Lossless data compression through exploiting redundancy in a sequence of symbols is a well-studied field in computer science and information theory. One way to achieve compression is to statistically model the data and estimate model parameters. In practice, most general-purpose data compression algorithms model the data as stationary sequences of 8-bit symbols. While this model fits currently used computer architectures and the vast majority of information representation standards very well, other models may have both computational and information-theoretic merits, being more efficient in implementation or fitting some data more closely. In addition, compression algorithms based on the 8-bit symbol model perform very poorly on data represented by binary sequences not aligned with byte boundaries, either because the fixed symbol length is not a multiple of 8 bits (e.g. DNA sequences) or because the symbols of the source are encoded into bit sequences of variable length. Throughout this thesis, we assume that the source alphabet consists of blocks of equal size of elementary symbols (typically bits), and address the impact of this
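The byte-alignment issue can be made concrete by re-blocking a 4-symbol source into byte-aligned units; a minimal sketch, assuming the common 2-bits-per-base packing for DNA (the thesis's own models are not reproduced here):

```python
B = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2-bit codes for the 4-symbol alphabet
BASES = "ACGT"

def pack2(seq: str) -> bytes:
    """Pack 4 bases per byte (2 bits each), padding the tail with A's."""
    seq += "A" * (-len(seq) % 4)
    return bytes(
        (B[seq[i]] << 6) | (B[seq[i + 1]] << 4) | (B[seq[i + 2]] << 2) | B[seq[i + 3]]
        for i in range(0, len(seq), 4)
    )

def unpack2(data: bytes, n: int) -> str:
    """Inverse of pack2; n is the original sequence length (drops padding)."""
    return "".join(BASES[(b >> s) & 3] for b in data for s in (6, 4, 2, 0))[:n]

dna = "ACGT" * 1000
packed = pack2(dna)   # 2 bits per symbol instead of 8 bits per ASCII byte
```

An 8-bit model sees each ASCII base as a full byte; the packed form realigns the 2-bit symbols so that byte-oriented tools at least see byte-sized blocks of four symbols.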
A Simple Derivation of Burnashev’s Reliability Function (2006)
"... In a remarkable paper published in 1976, Burnashev determined the reliability function of variable length block codes over a discrete memoryless channel with feedback by providing a lower bound to the expected decoding time in terms of the size of the message set and the probability of error. We off ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
In a remarkable paper published in 1976, Burnashev determined the reliability function of variable-length block codes over a discrete memoryless channel with feedback by providing a lower bound to the expected decoding time in terms of the size of the message set and the probability of error. We offer an alternative derivation of this lower bound. This derivation is simpler than the original and relates the quantities that appear in the bound to uncertainty reduction and binary hypothesis testing. Furthermore, the derivation closely parallels that of an upper bound by Yamamoto and Itoh.
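For reference, the bound in question has the following standard shape, restated here from the general literature with lower-order terms omitted (the notation is ours: M is the message-set size, P_e the error probability, τ the decoding time):

```latex
\mathbb{E}[\tau] \;\gtrsim\; \frac{\log M}{C} \;+\; \frac{\log (1/P_e)}{C_1},
\qquad
C_1 \;=\; \max_{x,\,x'} D\!\left( P(\cdot \mid x) \,\|\, P(\cdot \mid x') \right),
```

where C is the channel capacity and C_1 is the largest divergence between output distributions of two inputs. Rearranging gives the reliability function E(R) = C_1 (1 - R/C) for 0 ≤ R ≤ C, whose two terms correspond to the communication and confirmation phases discussed above.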
Improvement of Upper Bound to the Optimal Average Cost of the Variable Length Binary Code (1999)
"... 9.96> x. For a tree T , denote its leaf set by #T := ({#}# {za : z # T,a # A})\T . (For example, T = {#, 0} is a tree and then #T = {00, 01, 1}.) Given probabilities D := {p(1),p(2),...,p(n)}, we want to design a tree T with n leaves and a permutation # of {1, 2,...,n}, suc h that ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
For a tree T, denote its leaf set by ∂T := ({λ} ∪ {za : z ∈ T, a ∈ A}) \ T. (For example, T = {λ, 0} is a tree and then ∂T = {00, 01, 1}.) Given probabilities D := {p(1), p(2), ..., p(n)}, we want to design a tree T with n leaves and a permutation π of {1, 2, ..., n}, such that an average cost defined below is minimized. To describe it formally, we denote the lexicographic order by ord : ∂T → {1, 2, ..., n}. (In the above exam...
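The leaf-set definition can be checked mechanically; a minimal sketch with A = {0, 1} and the empty word λ represented as the empty string:

```python
def leaf_set(tree, alphabet=("0", "1")):
    """Leaf set: ({lambda} union {za : z in T, a in A}) minus T,
    with lambda the empty string."""
    candidates = {""} | {z + a for z in tree for a in alphabet}
    return candidates - tree

T = {"", "0"}               # the example tree {lambda, 0}
leaves = leaf_set(T)        # {"00", "01", "1"}, matching the example above
```

Here a tree is represented as a prefix-closed set of binary strings, and its leaves are exactly the one-letter extensions of tree nodes that fall outside the tree.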