Results 1  10
of
26
FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS
"... Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for ..."
Abstract

Cited by 253 (6 self)
 Add to MetaCart
(Show Context)
Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing lowrank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired lowrank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition
An Improved Approximation Algorithm for the Column Subset Selection Problem
"... We consider the problem of selecting the “best ” subset of exactly k columns from an m × n matrix A. In particular, we present and analyze a novel twostage algorithm that runs in O(min{mn 2, m 2 n}) time and returns as output an m × k matrix C consisting of exactly k columns of A. In the first stag ..."
Abstract

Cited by 74 (13 self)
 Add to MetaCart
(Show Context)
We consider the problem of selecting the “best ” subset of exactly k columns from an m × n matrix A. In particular, we present and analyze a novel twostage algorithm that runs in O(min{mn 2, m 2 n}) time and returns as output an m × k matrix C consisting of exactly k columns of A. In the first stage (the randomized stage), the algorithm randomly selects O(k log k) columns according to a judiciouslychosen probability distribution that depends on information in the topk right singular subspace of A. In the second stage (the deterministic stage), the algorithm applies a deterministic columnselection procedure to select and return exactly k columns from the set of columns selected in the first stage. Let C be the m × k matrix containing those k columns, let PC denote the projection matrix onto the span of those columns, and let Ak denote the “best ” rankk approximation to the matrix A as computed with the singular value decomposition. Then, we prove that ‖A − PCA‖2 ≤ O k 3 4 log 1
Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions
, 2009
"... Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys recent research which demonstrates that randomization offers a powerful tool for performing l ..."
Abstract

Cited by 62 (4 self)
 Add to MetaCart
(Show Context)
Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys recent research which demonstrates that randomization offers a powerful tool for performing lowrank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. In particular, these techniques offer a route toward principal component analysis (PCA) for petascale data. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired lowrank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider
Unsupervised feature selection for multicluster data
 KDD
, 2010
"... In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection ..."
Abstract

Cited by 36 (1 self)
 Add to MetaCart
(Show Context)
In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection problem in unsupervised learning scenario, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. The feature selection problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional unsupervised feature selection methods address this issue by selecting the top ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus can not produce an optimal feature subset. Inspired from the recent developments on manifold learning and L1regularized models for subset selection, we propose in this paperanewapproach,calledMultiCluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multicluster structure of the data can be best preserved. The corresponding optimization problem can be efficiently solved since it only involves a sparse eigenproblem and a L1regularized least squares problem. Extensive experimental results over various reallife data sets have demonstrated the superiority of the proposed algorithm.
Unsupervised Feature Selection for the kmeans Clustering Problem
"... We present a novel feature selection algorithm for the kmeans clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ 2) features from a dataset of arbitrary dimensions. We prove that ..."
Abstract

Cited by 17 (10 self)
 Add to MetaCart
(Show Context)
We present a novel feature selection algorithm for the kmeans clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ 2) features from a dataset of arbitrary dimensions. We prove that, if we run any γapproximate kmeans algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)approximate partition with high probability. 1
Algorithmic and Statistical Perspectives on LargeScale Data Analysis
 Combinatorial Scientific Computing,” Chapman and Hall/CRC
, 2011
"... ar ..."
(Show Context)
Effective resistances, statistical leverage, and applications to linear equation solving
 CoRR
"... ar ..."
(Show Context)
Clustered subset selection and its applications on it service metrics
 In ACM CIKM
, 2008
"... Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing and extracting meaningful insights from the data. Termed Clustered Subset Selection (CSS), our method enables programguided data e ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
(Show Context)
Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing and extracting meaningful insights from the data. Termed Clustered Subset Selection (CSS), our method enables programguided data explorations of highdimensional data matrices. CSS combines clustering and subset selection into a coherent and intuitive method for data analysis. In addition to a general framework, we introduce a family of CSS algorithms with different clustering components such as kmeans and ClosetoRankOne (CRO) clustering, and Subset Selection components such as best rankone approximation and RankRevealing QR (RRQR) decomposition. From an empirical perspective, we illustrate that CSS is achieving significant improvements over existing Subset Selection methods in terms of approximation errors. Compared to existing Subset Selection techniques, CSS is also able to provide additional insight about clusters and cluster representatives. Finally, we present a casestudy of programguided data explorations using CSS on a large amount of IT service delivery data collection.
Discriminative codeword selection for image representation
 in: Proceedings of the 18th ACM International Conference on Multimedia, 2010
"... Bag of features (BoF) representation has attracted an increasing amount of attention in large scale image processing systems. BoF representation treats images as loose collections of local invariant descriptors extracted from them. The visual codebook is generally constructed by using an unsupervise ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
(Show Context)
Bag of features (BoF) representation has attracted an increasing amount of attention in large scale image processing systems. BoF representation treats images as loose collections of local invariant descriptors extracted from them. The visual codebook is generally constructed by using an unsupervised algorithm such as Kmeans to quantize the local descriptors into clusters. Images are then represented by the frequency histograms of the codewords contained in them. To build a compact and discriminative codebook, codeword selection has become an indispensable tool. However, most of the existing codeword selection algorithms are supervised and the human labeling may be very expensive. In this paper, we consider the problem of unsupervisedcodeword selection, and propose a novel algorithm called Discriminative Codeword Selection (DCS). Motivated from recent studies on discriminative clustering, the central idea of our proposed algorithm is to select those codewords so that the cluster structure of the image database can be best respected. Specifically, a multioutput linear function is fitted to model the relationship between the data matrix after codeword selection and the indicator matrix. The most discriminative codewords are thus defined as those leading to minimal fitting error. Experiments on image retrieval and clustering have demonstrated the effectiveness of the proposed method.