Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 67 (13 self)
Abstract:
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
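As a point of reference for the quantity the paper approximates in sublinear space, the sketch below computes the exact (offline, linear-space) empirical entropy of a stream; the function name and the toy stream are illustrative only, not the paper's algorithm:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Entropy (in bits) of the empirical distribution of a stream.

    This is the exact offline quantity; the paper's contribution is
    approximating it in polylogarithmic space in a single pass.
    """
    counts = Counter(stream)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A uniform stream over 4 symbols has entropy log2(4) = 2 bits.
print(empirical_entropy(list("abcd") * 10))  # → 2.0
```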
Wikipedia-based semantic interpretation for natural language processing
In J. Artif. Int. Res.
Cited by 65 (5 self)
Abstract:
Adequate representation of natural language semantics requires access to vast amounts of common-sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
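As a rough illustration of the ESA idea (not the paper's implementation), the sketch below maps words to weighted Wikipedia concepts and compares texts by cosine similarity in the concept space. The tiny word-to-concept index and its weights are invented for the example; in the real model the weights are TF-IDF scores of each word in each Wikipedia article:

```python
import math
from collections import Counter

# Hypothetical inverted index from words to (concept, weight) pairs.
WORD_TO_CONCEPTS = {
    "jaguar": {"Jaguar (car)": 0.9, "Jaguar (animal)": 0.8},
    "engine": {"Jaguar (car)": 0.7, "Internal combustion engine": 0.9},
    "jungle": {"Jaguar (animal)": 0.9, "Rainforest": 0.8},
}

def esa_vector(text):
    """Map a text to a sparse vector over Wikipedia concepts."""
    vec = Counter()
    for word in text.lower().split():
        for concept, w in WORD_TO_CONCEPTS.get(word, {}).items():
            vec[concept] += w
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Both texts contain "jaguar" but lean toward different concepts,
# so their relatedness is positive yet well below 1.
print(cosine(esa_vector("jaguar engine"), esa_vector("jaguar jungle")))
```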
Using an Ensemble of One-Class SVM Classifiers to Harden Payload-based Anomaly Detection Systems
In Proceedings of the IEEE International Conference on Data Mining (ICDM’06), 2006
Cited by 44 (8 self)
Abstract:
Unsupervised or unlabeled learning approaches for network anomaly detection have recently been proposed. In particular, recent work on unlabeled anomaly detection focused on high-speed classification based on simple payload statistics. For example, PAYL, an anomaly IDS, measures the occurrence frequency of n-grams in the payload. A simple model of normal traffic is then constructed according to this description of the packets’ content. It has been demonstrated that anomaly detectors based on payload statistics can be “evaded” by mimicry attacks using byte substitution and padding techniques. In this paper we propose a new approach to constructing a high-speed payload-based anomaly IDS intended to be accurate and hard to evade. We propose a new technique to extract the features from the payload. We use a feature clustering algorithm originally proposed for text classification problems to reduce the dimensionality of the feature space. Accuracy and hardness of evasion are obtained by constructing our anomaly-based IDS using an ensemble of one-class SVM classifiers that work on different feature spaces.
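To make the kind of payload statistics concrete, here is a minimal sketch of PAYL-style n-gram frequency features. It shows only the feature-extraction step, with a function name of our own; the paper's contribution (feature clustering plus an ensemble of one-class SVMs over several feature spaces) is not reproduced:

```python
from collections import Counter

def ngram_frequencies(payload: bytes, n: int = 2):
    """Relative frequency of each byte n-gram in a packet payload.

    This mirrors the payload statistics that PAYL-style detectors
    model; a one-class classifier would then score how far a packet's
    frequency profile deviates from normal traffic.
    """
    grams = Counter(payload[i:i + n] for i in range(len(payload) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

feats = ngram_frequencies(b"GET /index.html HTTP/1.1", n=2)
print(round(sum(feats.values()), 6))  # → 1.0 (frequencies are normalized)
```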
Differential Entropic Clustering of Multivariate Gaussians
In Adv. in Neural Inf. Proc. Sys. (NIPS), 2006
Cited by 36 (3 self)
Abstract:
Gaussian data is pervasive, and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and a covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information-theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application.
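The information-theoretic formulation rests on the differential entropy of a Gaussian, h(X) = ½ ln((2πe)^d det Σ). A minimal sketch for the diagonal-covariance case (function name ours, not from the paper):

```python
import math

def gaussian_entropy(cov_diag):
    """Differential entropy (in nats) of a d-dimensional Gaussian with
    diagonal covariance: h = 0.5 * ln((2*pi*e)^d * det(Sigma)).

    For a diagonal Sigma, log det(Sigma) is the sum of the log variances.
    """
    d = len(cov_diag)
    log_det = sum(math.log(v) for v in cov_diag)
    return 0.5 * (d * math.log(2 * math.pi * math.e) + log_det)

# Standard 1-D Gaussian: h = 0.5 * ln(2*pi*e) ≈ 1.4189 nats.
print(round(gaussian_entropy([1.0]), 4))  # → 1.4189
```

Note that the entropy depends only on the covariance, not the mean; this is why clustering Gaussians information-theoretically naturally brings in matrix divergences over the covariances.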
Information theoretic clustering of sparse co-occurrence data
In Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03), 2003
An information-theoretic approach to detecting changes in multidimensional data streams
In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006
Cited by 30 (1 self)
Abstract:
An important problem in processing large data streams is detecting changes in the underlying distribution that generates the data. The challenge in designing change detection schemes is making them general, scalable, and statistically sound. In this paper, we take a general, information-theoretic approach to the change detection problem, which works for multidimensional as well as categorical data. We use relative entropy, also called the Kullback-Leibler distance, to measure the difference between two given distributions. The KL-distance is known to be related to the optimal error in determining whether the two distributions are the same, and draws on fundamental results in hypothesis testing. The KL-distance also generalizes traditional distance measures in statistics, and has invariance properties that make it ideally suited for comparing distributions. Our scheme is general; it is nonparametric and requires no assumptions on the underlying distributions. It employs a statistical inference procedure based on the theory of bootstrapping, which allows us to determine whether our measurements are statistically significant. The scheme is also quite flexible from a practical perspective; it can be implemented using any spatial partitioning scheme that scales well with dimensionality. In addition to providing change detections, our method generalizes Kulldorff's spatial scan statistic, allowing us to quantitatively identify specific regions in space where large changes have occurred. We provide a detailed experimental study that demonstrates the generality and efficiency of our approach with different kinds of multidimensional datasets, both synthetic and real.

1 Introduction
We are collecting and storing data in unprecedented quantities and varieties: streams, images, audio, text, metadata descriptions, and even simple numbers. Over time, these data streams change as the underlying processes that generate them change. Some changes are spurious and pertain to glitches in the data. Some are genuine, caused by changes in the underlying distributions. Some changes are gradual and some are more precipitous. We would like to detect changes in a variety of settings:
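The measure at the heart of the scheme can be sketched directly: a minimal KL-distance computation between two histograms over a common partition (names and data are illustrative; the paper's bootstrap significance test and spatial partitioning are not shown):

```python
import math

def kl_divergence(p, q):
    """KL-distance D(p || q) between two discrete distributions,
    e.g. normalized histogram counts over a spatial partition.
    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline = [0.25, 0.25, 0.25, 0.25]   # reference window
current  = [0.40, 0.30, 0.20, 0.10]   # most recent window
print(kl_divergence(baseline, baseline))       # → 0.0 (no change)
print(kl_divergence(current, baseline) > 0.0)  # → True (distribution shifted)
```

Note that the KL-distance is asymmetric: D(p || q) generally differs from D(q || p), which is why a change-detection scheme must fix which window plays the role of the reference distribution.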
Single-Histogram Class Models for Image Segmentation
In Proc. ICVGIP, 2006
Cited by 25 (3 self)
Abstract:
Histograms of visual words (or textons) have proved effective in tasks such as image classification and object class recognition. A common approach is to represent an object class by a set of histograms, each one corresponding to a training exemplar. Classification is then achieved by k-nearest-neighbour search over the exemplars. In this paper we introduce two novelties on this approach: (i) we show that new compact single-histogram models estimated optimally from the entire training set achieve an equal or superior classification accuracy. The benefit of the single histograms is that they are much more efficient both in terms of memory and computational resources; and (ii) we show that bag-of-visual-words histograms can provide an accurate pixel-wise segmentation of an image into object class regions. In this manner the compact models of visual object classes give simultaneous segmentation and recognition of image regions. The approach is evaluated on the MSRC database [5] and it is shown that performance equals or is superior to previous publications on this database.
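A minimal sketch of histogram-based nearest-class assignment, assuming a chi-square histogram distance (one common choice for visual-word histograms; the paper's exact distance may differ) and one model histogram per class, as in the single-histogram models above. Class names and data are invented:

```python
def chi_square(h1, h2):
    """Chi-square distance between two normalized histograms of visual words."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def nearest_class(query, class_histograms):
    """Assign the query histogram to the class whose model histogram is closest."""
    return min(class_histograms, key=lambda c: chi_square(query, class_histograms[c]))

# One compact model histogram per class (3 visual words, toy values).
models = {"grass": [0.7, 0.2, 0.1], "sky": [0.1, 0.2, 0.7]}
print(nearest_class([0.6, 0.3, 0.1], models))  # → grass
```

Applied per pixel (using the histogram of visual words in a window around each pixel), the same assignment yields the pixel-wise segmentation described in the abstract.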
Kullback-Leibler Divergence Estimation of Continuous Distributions
In Proceedings of IEEE International Symposium on Information Theory, 2008
Cited by 23 (0 self)
Abstract:
We present a method for estimating the KL divergence between continuous densities and we prove it converges almost surely. Divergence estimation is typically solved by estimating the densities first. Our main result shows this intermediate step is unnecessary, and that the divergence can be estimated either using the empirical cdf or using k-nearest-neighbour density estimation, which does not converge to the true measure for finite k. The convergence proof is based on describing the statistics of our estimator using waiting-times distributions, such as the exponential or Erlang. We illustrate the proposed estimators and show how they compare to existing methods based on density estimation, and we also outline how our divergence estimators can be used for solving the two-sample problem.
A Nearest-Neighbor Approach to Estimating Divergence between Continuous Random Vectors
2006
Cited by 21 (1 self)
Abstract:
A method for divergence estimation between multidimensional distributions based on nearest-neighbor distances is proposed. Given i.i.d. samples, both the bias and the variance of this estimator are proven to vanish as sample sizes go to infinity. In experiments on high-dimensional data, the nearest-neighbor approach generally exhibits faster convergence compared to previous algorithms based on partitioning.
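A minimal 1-D, 1-nearest-neighbor sketch of a divergence estimator in this spirit: for each sample of P, compare its distance to the nearest other sample of P with its distance to the nearest sample of Q. This is an illustration of the idea, not the paper's d-dimensional estimator; names and data are ours:

```python
import math, random

def knn_kl_estimate(x, y):
    """1-nearest-neighbor KL estimate for 1-D samples x ~ P, y ~ Q:
    (1/n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)),
    where rho_i is the distance from x_i to its nearest other point in x
    and nu_i its distance to the nearest point in y.
    Brute force, O(n^2); assumes no duplicate sample values.
    """
    n, m = len(x), len(y)
    total = 0.0
    for i, xi in enumerate(x):
        rho = min(abs(xi - xj) for j, xj in enumerate(x) if j != i)
        nu = min(abs(xi - yj) for yj in y)
        total += math.log(nu / rho)
    return total / n + math.log(m / (n - 1))

random.seed(0)
p = [random.gauss(0, 1) for _ in range(500)]
q = [random.gauss(0, 1) for _ in range(500)]
r = [random.gauss(3, 1) for _ in range(500)]
print(knn_kl_estimate(p, q))  # small: same distribution
print(knn_kl_estimate(p, r))  # large: true KL(N(0,1) || N(3,1)) is 4.5
```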
Learning Non-Redundant Codebooks for Classifying Complex Objects
Cited by 19 (2 self)
Abstract:
Codebook-based representations are widely employed in the classification of complex objects such as images and documents. Most previous codebook-based methods construct a single codebook via clustering that maps a bag of low-level features into a fixed-length histogram that describes the distribution of these features. This paper describes a simple yet effective framework for learning multiple non-redundant codebooks that produces surprisingly good results. In this framework, each codebook is learned in sequence to extract discriminative information that was not captured by preceding codebooks and their corresponding classifiers. We apply this framework to two application domains: visual object categorization and document classification. Experiments on large classification tasks show substantial improvements in performance compared to a single codebook or codebooks learned in a bagging style.
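A minimal sketch of the codebook-to-histogram mapping, with the histograms from several codebooks concatenated into one representation. The sequential, discriminative learning of the codebooks themselves (the paper's contribution) is not shown, and 1-D features stand in for real descriptors:

```python
def assign_histogram(features, codebook):
    """Map a bag of low-level features (here 1-D values) to a normalized
    histogram over codewords by nearest-centroid assignment."""
    hist = [0] * len(codebook)
    for f in features:
        hist[min(range(len(codebook)), key=lambda k: abs(f - codebook[k]))] += 1
    total = sum(hist)
    return [h / total for h in hist]

def multi_codebook_representation(features, codebooks):
    """Concatenate the histograms induced by several codebooks; in the
    paper each codebook is learned in sequence to capture information
    the preceding codebooks and classifiers missed."""
    rep = []
    for cb in codebooks:
        rep.extend(assign_histogram(features, cb))
    return rep

feats = [0.1, 0.2, 0.9, 1.1]
print(multi_codebook_representation(feats, [[0.0, 1.0], [0.2, 0.6, 1.0]]))
# → [0.5, 0.5, 0.5, 0.0, 0.5]
```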