Results 1 - 10
of
109
Database-friendly Random Projections
, 2001
"... A classic result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space | where k is logarithmic in n and independent of d | so that all pairwise distances are maintained within an arbitrarily small factor. Al ..."
Abstract
-
Cited by 113 (2 self)
- Add to MetaCart
A classic result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space | where k is logarithmic in n and independent of d | so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the n points onto a random k-dimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over k random attribute partitions.
Random projection in dimensionality reduction: Applications to image and text data
- in Knowledge Discovery and Data Mining
, 2001
"... Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction t ..."
Abstract
-
Cited by 99 (0 self)
- Add to MetaCart
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases, where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally signicantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
Toward privacy in public databases
- In TCC
, 2005
"... Abstract. We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respon ..."
Abstract
-
Cited by 66 (11 self)
- Add to MetaCart
Abstract. We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a specific functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely ” should not be learned at all. An important contribution of this work is a definition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy offered by specific sanitization techniques. We obtain several privacy results using two different sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering. 1
Mixture Density Estimation
- IN ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 12
, 1999
"... Gaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as component ..."
Abstract
-
Cited by 51 (1 self)
- Add to MetaCart
Gaussian mixtures (or so-called radial basis function networks) for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order 1/k of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components k that minimizes the risk bound.
The Global K-Means Clustering Algorithm
, 2003
"... We present the global k-means algorithm which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
We present the global k-means algorithm which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts.
Efficient greedy learning of Gaussian mixture models
- Neural Computation
, 2003
"... This paper concerns the greedy learning of Gaussian mixtures. In the greedy approach, mixture components are inserted into the mixture one after the other. We propose a heuristic for searching for the optimal component to insert. In a randomized manner a set of candidate new components is generated. ..."
Abstract
-
Cited by 40 (7 self)
- Add to MetaCart
This paper concerns the greedy learning of Gaussian mixtures. In the greedy approach, mixture components are inserted into the mixture one after the other. We propose a heuristic for searching for the optimal component to insert. In a randomized manner a set of candidate new components is generated. For each of these candidates we find the locally optimal new component. The best local optimum is then inserted into the existing mixture. The resulting algorithm resolves the sensitivity to initialization of state-of-the-art methods, like EM, and has running time linear in the number of data points and quadratic in the (final) number of mixture components. Due to its greedy nature the algorithm can be particularly useful when the optimal number of mixture components is unknown. Experimental results comparing the proposed algorithm to other methods on density estimation and texture segmentation are provided.
On the Impossibility of Dimension Reduction in l_1
- In Proc. 35th Annu. ACM Sympos. Theory Comput
, 2003
"... The Johnson-Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the L2 norm) may be mapped down to O((log n)/ep^2) dimensions such that no pairwise distance is distorted by more than a (1 ep) factor. Determining whether such dimension reduction is possible in L ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
The Johnson-Lindenstrauss Lemma shows that any n points in Euclidean space (with distances measured by the L2 norm) may be mapped down to O((log n)/ep^2) dimensions such that no pairwise distance is distorted by more than a (1 ep) factor. Determining whether such dimension reduction is possible in L1 has been an intriguing open question. Charikar and Sahai [7] recently showed lower bounds for dimension reduction in L1 that can be achieved by linear projections, and positive results for shortest path metrics of restricted graph families. However the question of general dimension reduction in L1 was still open. For example, it was not known whether it is possible to reduce the number of dimensions to O(log n) with 1 ep distortion. We show strong lower bounds for general dimension reduction in L1. We give an explicity family of n points in L1 such that any embedding with distortion d requires n^Omega(1/d^2) dimensions. This proves that there is no analog of the Johnson-Lindenstrauss Lemma for L1
On Spectral Learning of Mixtures of Distributions
"... We consider the problem of learning mixtures of distributions via spectral methods and derive a tight characterization of when such methods are useful. Specifically, given a mixture-sample, let i , C i , w i denote the empirical mean, covariance matrix, and mixing weight of the i-th component. We ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
We consider the problem of learning mixtures of distributions via spectral methods and derive a tight characterization of when such methods are useful. Specifically, given a mixture-sample, let i , C i , w i denote the empirical mean, covariance matrix, and mixing weight of the i-th component. We prove that a very simple algorithm, namely spectral projection followed by single-linkage clustering, properly classifies every point in the sample when each i is separated from all j by 2 (1/w i +1/w j ) plus a term that depends on the concentration properties of the distributions in the mixture. This second term is very small for many distributions, including Gaussians, Log-concave, and many others. As a result, we get the best known bounds for learning mixtures of arbitrary Gaussians in terms of the required mean separation. On the other hand, we prove that given any k means i and mixing weights w i , there are (many) sets of matrices C i such that each i is separated from all j by 2 (1/w i + 1/w j ) , but applying spectral projection to the corresponding Gaussian mixture causes it to collapse completely, i.e., all means and covariance matrices in the projected mixture are identical.
A Spectral Algorithm for Learning Mixtures of Distributions
- Journal of Computer and System Sciences
, 2002
"... We show that a simple spectral algorithm for learning a mixture of k spherical Gaussians in R works remarkably well --- it succeeds in identifying the Gaussians assuming essentially the minimum possible separation between their centers that keeps them unique (solving an open problem of [1]). The ..."
Abstract
-
Cited by 31 (3 self)
- Add to MetaCart
We show that a simple spectral algorithm for learning a mixture of k spherical Gaussians in R works remarkably well --- it succeeds in identifying the Gaussians assuming essentially the minimum possible separation between their centers that keeps them unique (solving an open problem of [1]). The sample complexity and running time are polynomial in both n and k. The algorithm also works for the more general problem of learning a mixture of "weakly isotropic" distributions (e.g. a mixture of uniform distributions on cubes).

