Results 1  10
of
53
Clustering with Bregman Divergences
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2005
"... A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergence ..."
Abstract

Cited by 310 (52 self)
 Add to MetaCart
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroidbased parametric clustering approaches, such as classical kmeans and informationtheoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by ratedistortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an ecient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
Universal Discrete Denoising: Known Channel
 IEEE Trans. Inform. Theory
, 2003
"... A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given singleletter fidelity criterion, we pr ..."
Abstract

Cited by 79 (32 self)
 Add to MetaCart
A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given singleletter fidelity criterion, we propose a discrete denoising algorithm that does not assume knowledge of statistical properties of the input sequence. Yet, the algorithm is universal in the sense of asymptotically performing as well as the optimum denoiser that knows the input sequence distribution, which is only assumed to be stationary and ergodic. Moreover, the algorithm is universal also in a semistochastic setting, in which the input is an individual sequence, and the randomness is due solely to the channel noise.
Coding for Computing
 IEEE Transactions on Information Theory
, 1998
"... A sender communicates with a receiver who wishes to reliably evaluate a function of their combined data. We show that if only the sender can transmit, the number of bits required is a conditional entropy of a naturally defined graph. We also determine the number of bits needed when the communicators ..."
Abstract

Cited by 70 (0 self)
 Add to MetaCart
A sender communicates with a receiver who wishes to reliably evaluate a function of their combined data. We show that if only the sender can transmit, the number of bits required is a conditional entropy of a naturally defined graph. We also determine the number of bits needed when the communicators exchange two messages. 1 Introduction Let f be a function of two random variables X and Y . A sender PX knows X, a receiver PY knows Y , and both want PY to reliably determine f(X; Y ). How many bits must PX transmit? Embedding this communicationcomplexity scenario (Yao [22]) in the standard informationtheoretic setting (Shannon [17]), we assume that (1) f(X; Y ) must be determined for a block of many independent (X; Y )instances, (2) PX transmits after observing the whole block of X instances, (3) a vanishing block error probability is allowed, and (4) the problem's rate L f (XjY ) is the number of bits transmitted for the block, normalized by the number of instances. Two simple bou...
Side information aware coding strategies for sensor networks
 IEEE J. Selected Areas Commun
"... Abstract—We develop coding strategies for estimation under communication constraints in treestructured sensor networks. The strategies have a modular and decentralized architecture. This promotes the flexibility, robustness, and scalability that wireless sensor networks need to operate in uncertain ..."
Abstract

Cited by 27 (0 self)
 Add to MetaCart
Abstract—We develop coding strategies for estimation under communication constraints in treestructured sensor networks. The strategies have a modular and decentralized architecture. This promotes the flexibility, robustness, and scalability that wireless sensor networks need to operate in uncertain, changing, and resourceconstrained environments. The strategies are based on a generalization of Wyner–Ziv source coding with decoder side information. We develop solutions for general trees, and illustrate our results in serial (pipeline) and parallel (hubandspoke) networks. Additionally, the strategies can be applied to other network information theory problems. They have a successive coding structure that gives an inherently less complex way to attain a number of prior results, as well as some novel results, for the Chief Executive Officer problem, multiterminal source coding, and certain classes of relay channels. Index Terms—Chief Executive Officer (CEO) problems, data fusion, distributed detection, distributed estimation, multiterminal source coding, rate distortion theory, relay channels, sensor networks, side information, Wyner–Ziv coding. I.
Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression
 IEEE Trans. Inform. Theory
, 1999
"... We characterize the achievable pointwise redundancy rates for lossy data compression at a fixed distortion level. "Pointwise redundancy" refers to the difference between the description length achieved by an nthorder block code and the optimal nR(D) bits. For memoryless sources, we show that the be ..."
Abstract

Cited by 21 (13 self)
 Add to MetaCart
We characterize the achievable pointwise redundancy rates for lossy data compression at a fixed distortion level. "Pointwise redundancy" refers to the difference between the description length achieved by an nthorder block code and the optimal nR(D) bits. For memoryless sources, we show that the best achievable redundancy rate is of order O( p n) in probability. This follows from a secondorder refinement to the classical source coding theorem, in the form of a "onesided central limit theorem." Moreover, we show that, along (almost) any source realization, the description lengths of any sequence of block codes operating at distortion level D exceed nR(D) by at least as much as C p n log log n, infinitely often. Corresponding direct coding theorems are also given, showing that these rates are essentially achievable. The above rates are in sharp contrast with the expected redundancy rates of order O(log n) recently reported by various authors. Our approach is based on showing that...
Iterative decoding of a broadcast message
 in Proc. Allerton Conf. Commun., Contr., Comput
, 2003
"... We develop communication strategies for the rateconstrained interactive decoding of a message broadcast to a group of interested users. This situation differs from the relay channel in that all users are interested in the transmitted message, and from the broadcast channel because no user can decod ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
We develop communication strategies for the rateconstrained interactive decoding of a message broadcast to a group of interested users. This situation differs from the relay channel in that all users are interested in the transmitted message, and from the broadcast channel because no user can decode on its own. We focus on twouser scenarios, and describe a baseline strategy that uses ideas of coding with decoder side information. One user acts initially as a relay for the other. That other user then decodes the message and sends back random parity bits, enabling the first user to decode. We show how to improve on this scheme’s performance through a conversation consisting of multiple rounds of discussion. While there are now more messages, each message is shorter, lowering the overall rate of the conversation. Such multiround conversations can be more efficient because earlier messages serve as side information known at both encoder and decoder. We illustrate these ideas for binary erasure channels. We show that multiround conversations can decode using less overall rate than is possible with the singleround scheme. 1
Fast statistical spam filter by approximate classifications
 In Proc. ACM SIGMETRICS 2006
, 2006
"... Statisticalbased Bayesian filters have become a popular and important defense against spam. However, despite their effectiveness, their greater processing overhead can prevent them from scaling well for enterpriselevel mail servers. For example, the dictionary lookups that are characteristic of th ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
Statisticalbased Bayesian filters have become a popular and important defense against spam. However, despite their effectiveness, their greater processing overhead can prevent them from scaling well for enterpriselevel mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate, therefore relatively insensitive to increases in CPU speed. We address this scaling issue by proposing an acceleration technique that speeds up Bayesian filters based on approximate classification. The approximation uses two methods: hashbased lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure with an extension to support value retrieval. Lossy encoding is used to further compress the data structure. While both methods introduce additional errors to a strict Bayesian approach, we show how the errors can be both minimized and biased toward a false negative classification. We demonstrate a 6x speedup over two wellknown spam filters (bogofilter and qsf) while achieving an identical false positive rate and similar false negative rate to the original filters.
On the Whiteness of High Resolution Quantization Errors
, 2000
"... A common belief in quantization theory says that the quantization noise process resulting from uniform scalar quantization of a correlated discrete time process tends to be white in the limit of small distortion ("high resolution"). A rule of thumb for this property to hold is that the source sam ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
A common belief in quantization theory says that the quantization noise process resulting from uniform scalar quantization of a correlated discrete time process tends to be white in the limit of small distortion ("high resolution"). A rule of thumb for this property to hold is that the source samples have a "smooth" joint distribution. We give a precise statement of this property, and generalize it to nonuniform quantization and to vector quantization. We show that the quantization errors resulting from independent quantizations of dependent real random variables become asymptotically uncorrelated (although not necessarily statistically independent) if the joint Fisher information under translation of the two variables is finite and the quantization cells shrink uniformly as the distortion tends to zero.
The Information Lost in Erasures
, 2008
"... We consider sources and channels with memory observed through erasure channels. In particular, we examine the impact of sporadic erasures on the fundamental limits of lossless data compression, lossy data compression, channel coding, and denoising. We define the erasure entropy of a collection of ra ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
We consider sources and channels with memory observed through erasure channels. In particular, we examine the impact of sporadic erasures on the fundamental limits of lossless data compression, lossy data compression, channel coding, and denoising. We define the erasure entropy of a collection of random variables as the sum of entropies of the individual variables conditioned on all the rest. The erasure entropy measures the information content carried by each symbol knowing its context. The erasure entropy rate is shown to be the minimal amount of bits per erasure required to recover the lost information in the limit of small erasure probability. When we allow recovery of the erased symbols within a prescribed degree of distortion, the fundamental tradeoff is described by the erasure rate–distortion function which we characterize. We show that in the regime of sporadic erasures, knowledge at the encoder of the erasure locations does not lower the rate required to achieve a given distortion. When no additional encoded information is available, the erased information is reconstructed solely on the basis of its context by a denoiser. Connections between erasure entropy and discrete denoising are developed. The decrease of the capacity of channels with memory due to sporadic memoryless erasures is also characterized in wide generality.
Coordination Capacity
, 2009
"... We develop elements of a theory of cooperation and coordination in networks. Rather than considering a communication network as a means of distributing information, or of reconstructing random processes at remote nodes, we ask what dependence can be established among the nodes given the communicatio ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
We develop elements of a theory of cooperation and coordination in networks. Rather than considering a communication network as a means of distributing information, or of reconstructing random processes at remote nodes, we ask what dependence can be established among the nodes given the communication constraints. Specifically, in a network with communication rates {Ri,j} between the nodes, we ask what is the set of all achievable joint distributions p(x1,..., xm) of actions at the nodes on the network. Several networks are solved, including arbitrarily large cascade networks. Distributed cooperation can be the solution to many problems such as distributed games, distributed control, and establishing mutual information bounds on the influence of one part of a physical system on another.