Exponentiated Gradient Versus Gradient Descent for Linear Predictors
- Information and Computation
, 1995
Cited by 196 (11 self)
... this paper, we concentrate on linear predictors. To any vector u ∈ R ...
An Information-Theoretic Approach to Traffic Matrix Estimation
- In Proc. ACM SIGCOMM
, 2003
Cited by 97 (12 self)
Traffic matrices are required inputs for many IP network management
On-line portfolio selection using multiplicative updates
- Mathematical Finance
, 1998
Cited by 67 (10 self)
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth. Our algorithm is very simple to implement and requires only constant storage and computing time per stock in each trading period. We tested the performance of our algorithm on real stock data from the New York Stock Exchange accumulated during a 22-year period. On this data, our algorithm clearly outperforms the best single stock as well as Cover's universal portfolio selection algorithm. We also present results for the situation in which the ...
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio investment strategy. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth [20]. Our algorithm is very simple to implement and its time and storage requirements grow linearly in the number of stocks.
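A multiplicative portfolio update of the kind this abstract describes can be sketched as follows. This is only an illustration under stated assumptions: the function name `eg_update`, the learning rate `eta`, and the sample price relatives are hypothetical, and the exponentiated-gradient form used here is one common way to write such an update, not necessarily the exact rule from the paper.

```python
import math

def eg_update(weights, price_relatives, eta=0.05):
    """One multiplicative portfolio update (hypothetical helper).

    weights: current portfolio, non-negative and summing to 1.
    price_relatives: closing price / opening price for each stock.
    eta: learning rate (assumed value, not taken from the abstract).
    """
    # Wealth factor earned by the current portfolio in this period.
    portfolio_return = sum(w * x for w, x in zip(weights, price_relatives))
    # Scale each weight by exp(eta * x_i / (w . x)).
    scaled = [w * math.exp(eta * x / portfolio_return)
              for w, x in zip(weights, price_relatives)]
    # Renormalise so the new weights again form a portfolio.
    total = sum(scaled)
    return [s / total for s in scaled]

# Made-up example: stock 0 rises by 10%, stock 1 falls by 5%.
weights = [0.5, 0.5]
new_weights = eg_update(weights, [1.10, 0.95])
```

Each stock needs one multiplication and one division per period, which matches the abstract's claim of constant storage and computing time per stock.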
Tracking the Best Disjunction
- Machine Learning
, 1995
Cited by 64 (11 self)
Littlestone developed a simple deterministic on-line learning algorithm for learning k-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the n variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule T is a sequence of disjunctions (one per trial) and the shift size is the total number of literals that are added/removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. This algorithm, which allows us to track the predictions of the best disjunction, is hardly more complex than the original version. However, the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our low...
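For orientation, the basic deterministic Winnow scheme mentioned in the abstract (one weight per variable, multiplicative updates on mistakes) might look like the sketch below. It is not the randomized or tracking variant developed in the paper. The names `winnow_predict` and `winnow_train`, the threshold `n`, and the factor `alpha = 2` are assumptions chosen for illustration.

```python
from itertools import product

def winnow_predict(weights, x, threshold):
    """Predict 1 if the total weight of the active variables reaches the threshold."""
    score = sum(w for w, xi in zip(weights, x) if xi)
    return 1 if score >= threshold else 0

def winnow_train(examples, n, alpha=2.0):
    """Run basic Winnow over (x, label) pairs; returns final weights and mistake count."""
    weights = [1.0] * n          # one weight per variable
    threshold = float(n)         # assumed threshold choice
    mistakes = 0
    for x, label in examples:
        if winnow_predict(weights, x, threshold) != label:
            mistakes += 1
            # Promote active weights on a missed positive, demote on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
    return weights, mistakes

# Toy target: the 2-literal disjunction x0 OR x1 over n = 4 variables.
examples = [(x, 1 if (x[0] or x[1]) else 0) for x in product([0, 1], repeat=4)]
weights, mistakes = winnow_train(examples * 5, n=4)
```

Weights of inactive variables are never touched, so irrelevant variables only influence the hypothesis through the trials in which they are active.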
Boosting as Entropy Projection
, 1999
Cited by 51 (8 self)
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
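The orthogonality constraint can be checked numerically in a simple special case. The sketch below assumes a weak hypothesis with outputs in {-1, +1}, encodes the mistake vector as +1 for a correct prediction and -1 for a mistake, and uses the standard AdaBoost step size. The function name and the sample data are hypothetical. In this special case the reweighted distribution is orthogonal to the mistake vector up to floating-point error.

```python
import math

def adaboost_reweight(dist, margins):
    """Standard AdaBoost reweighting for a {-1, +1}-valued weak hypothesis.

    dist: current distribution over the training examples.
    margins: +1 where the weak hypothesis is correct, -1 where it errs.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the weak hypothesis under the current distribution.
    eps = sum(d for d, u in zip(dist, margins) if u < 0)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Shrink the weight of correct examples, grow the weight of mistakes.
    scaled = [d * math.exp(-alpha * u) for d, u in zip(dist, margins)]
    z = sum(scaled)  # normalization factor
    return [s / z for s in scaled]

dist = [0.25, 0.25, 0.25, 0.25]
margins = [1, 1, 1, -1]   # the weak hypothesis errs on the last example only
new_dist = adaboost_reweight(dist, margins)
# Inner product of the new distribution with the mistake vector.
correlation = sum(d * u for d, u in zip(new_dist, margins))
```

Here the single misclassified example ends up with half of the total weight, so the weak hypothesis has weighted error 1/2 under the new distribution.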
Estimating Point-to-Point and Point-to-Multipoint Traffic Matrices: An Information-Theoretic Approach
- IEEE/ACM Trans. Netw
, 2005
Cited by 12 (5 self)
Traffic matrices are required inputs for many IP network management tasks, such as capacity planning, traffic engineering and network reliability analysis. However, it is difficult to measure these matrices directly in large operational IP networks, so there has been recent interest in inferring traffic matrices from link measurements and other more easily measured data. Typically, this inference problem is ill-posed, as it involves significantly more unknowns than data. Experience in many scientific and engineering fields has shown that it is essential to approach such ill-posed problems via "regularization". This paper presents a new approach to traffic matrix estimation using a regularization based on "entropy penalization". Our solution chooses the traffic matrix consistent with the measured data that is information-theoretically closest to a model in which source/destination pairs are stochastically independent. It applies to both point-to-point and point-to-multipoint traffic matrix estimation. We use fast algorithms based on modern convex optimization theory to solve for our traffic matrices. We evaluate our algorithm with real backbone traffic and routing data, and demonstrate that it is fast, accurate, robust, and flexible.
Learning of Depth Two Neural Networks with Constant Fan-in at the Hidden Nodes (Extended Abstract)
- In Proc. 9th Annu. Conf. on Comput. Learning Theory
, 1996
Cited by 9 (1 self)
We present algorithms for learning depth two neural networks where the hidden nodes are threshold gates with constant fan-in. The transfer function of the output node might be more general: we have results for the cases when the threshold function, the logistic function or the identity function is used as the transfer function at the output node. We give batch and on-line learning algorithms for these classes of neural networks and prove bounds on the performance of our algorithms. The batch algorithms work for real valued inputs whereas the on-line algorithms assume that the inputs are discretized. The hypotheses of our algorithms are essentially also neural networks of depth two. However, their number of hidden nodes might be much larger than the number of hidden nodes of the neural network that has to be learned. Our algorithms can handle such a large number of hidden nodes since they rely on multiplicative weight updates at the output node, and the performance of these algorithms s...
Groupwise point pattern registration using a novel CDF-based Jensen-Shannon divergence
- in: IEEE Computer Vision and Pattern Recognition
Cited by 4 (2 self)
In this paper, we propose a novel and robust algorithm for the groupwise non-rigid registration of multiple unlabeled point-sets with no bias toward any of the given pointsets. To quantify the divergence between multiple probability distributions each estimated from the given point sets, we develop a novel measure based on their cumulative distribution functions that we dub the CDF-JS divergence. The measure parallels the well known Jensen-Shannon divergence (defined for probability density functions) but is more regular than the JS divergence since its definition is based on CDFs as opposed to density functions. As a consequence, CDF-JS is more immune to noise and statistically more robust than the JS. We derive the analytic gradient of the CDF-JS divergence
Bayesian Methods: Applications in Information Aggregation and Image Data Mining
, 1999
Cited by 2 (0 self)
More accurate interpretation of remotely sensed data is based on a concept that synergistically combines signals, information or knowledge from different sources. The aim is information mining, extraction and presentation. A hierarchical structure of data fusion levels has been identified: on image signal level, on image features, on physical parameters extracted from images, on meta features resulting from image feature modelling, on feature grouping. The Bayesian perspective is discussed with regard to a variety of aspects. The power of the Bayesian approach lies, for example, in the possibility of uniformly analysing the uncertainties over scene parameters in data acquired from heterogeneous and incommensurable sources.
A Computational Theory of Surprise
Cited by 1 (0 self)
While eminently successful for the transmission of data, Shannon's theory of information does not address semantic and subjective dimensions of data, such as relevance and surprise. We propose an observer-dependent computational theory of surprise where surprise is defined by the relative entropy between the prior and the posterior distributions of an observer. Surprise requires integration over the space of models in contrast with Shannon's entropy, which requires integration over the space of data. We show how surprise can be computed exactly in a number of discrete and continuous cases using distributions from the exponential family with conjugate priors. We show that during sequential Bayesian learning, surprise decreases like 1/N and study how surprise differs from and complements Shannon's definition of information.
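As a small illustration of surprise as a relative entropy between prior and posterior, the sketch below uses a discrete model space with two candidate coin models. Everything specific here is an assumption: the helper names, the two models, the numbers, and in particular the direction of the divergence (prior relative to posterior), which the abstract does not spell out.

```python
import math

def posterior(prior, likelihoods):
    """Bayes' rule over a finite set of models."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def relative_entropy(p, q):
    """Relative entropy sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical models of a coin: fair, or biased towards heads (0.9).
prior = [0.5, 0.5]
likelihood_heads = [0.5, 0.9]   # probability of observing heads under each model

post = posterior(prior, likelihood_heads)
# Surprise of observing one head, measured over the space of models.
surprise = relative_entropy(prior, post)

# An observation that is equally likely under every model leaves the belief unchanged.
no_surprise = relative_entropy(prior, posterior(prior, [0.7, 0.7]))
```

The sum runs over models rather than over data outcomes, which is the contrast with Shannon's entropy drawn in the abstract.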

