Matrix completion from a few entries
Let M be a random nα × n matrix of rank r ≪ n, and assume that a uniformly random subset E of its entries is observed. We describe an efficient algorithm that reconstructs M from |E| = O(rn) observed entries with relative root mean square error RMSE ≤ C(α)
A rank-revealing method with updating, downdating and applications
 SIAM J. Matrix Anal. Appl
Abstract. A new rank-revealing method is proposed. For a given matrix and a threshold for near-zero singular values, the method employs a globally convergent iterative scheme together with a deflation technique to calculate the approximate singular values below the threshold one by one. It returns the approximate rank of the matrix along with an orthonormal basis for the approximate null space. When a row or column is inserted or deleted, algorithms for updating/downdating the approximate rank and null space are straightforward, stable and efficient. Numerical results exhibiting the advantages of our code over existing packages based on two-sided orthogonal rank-revealing decompositions are presented. Also presented are applications of the new algorithm in numerical computation of the polynomial GCD as well as identification of nonisolated zeros of polynomial systems.
Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
, 2008
Nearest neighbor graphs are widely used in data mining and machine learning. The brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d-dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t depends on an internal parameter and is larger than one. Experiments show that a high quality graph usually requires a small t which is close to one. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs.
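For orientation, the brute-force baseline mentioned in the abstract can be sketched as follows. This is only the exact Θ(dn^2) reference computation, not the divide and conquer methods the paper proposes; the function name `knn_graph` and the list-of-neighbor-indices output format are assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Exact kNN graph by brute force (hypothetical helper, not the paper's code).

    Computes all pairwise Euclidean distances, which costs Theta(d * n^2)
    for n points in d dimensions, then keeps the k closest per point.
    """
    n = len(points)
    graph = []
    for i in range(n):
        # Distance from point i to every other point j.
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        dists.sort()  # ties are broken by the smaller index
        graph.append([j for _, j in dists[:k]])
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(knn_graph(points, 1))
```

The quadratic loop over all pairs is the cost that an approximate method aims to avoid when n is large.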
Internet Document Filtering Using Fourier Domain Scoring
 In Luc de Raedt and Arno Siebes, editors, Principles of Data Mining and Knowledge Discovery, number 2168 in Lecture Notes in Artificial Intelligence
, 2001
Most search engines return a lot of unwanted information. A more thorough filtering process can be performed on this information to sort out the relevant documents. A new method called Frequency Domain Scoring (FDS), which is based on the Fourier Transform, is proposed.
Lanczos Vectors versus Singular Vectors for Effective Dimension Reduction
, 2008
This paper takes an in-depth look at a technique for computing filtered matrix-vector (matvec) products which are required in many data analysis applications. In these applications the data matrix is multiplied by a vector and we wish to perform this product accurately in the space spanned by a few of the major singular vectors of the matrix. We examine the use of the Lanczos algorithm for this purpose. The goal of the method is identical with that of the truncated singular value decomposition (SVD), namely to preserve the quality of the resulting matvec product in the major singular directions of the matrix. The Lanczos-based approach achieves this goal by using a small number of Lanczos vectors, but it does not explicitly compute singular values/vectors of the matrix. The main advantage of the Lanczos-based technique is its low cost when compared with that of the truncated SVD. This advantage comes without sacrificing accuracy. The effectiveness of this approach is demonstrated on a few sample applications requiring dimension reduction, including information retrieval and face recognition. The proposed technique can be applied as a replacement to the truncated SVD technique whenever the problem can be formulated as a filtered matvec multiplication.
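One way to picture a Lanczos-based filtered matvec is to project the product Ab onto the span of a few Lanczos vectors of A A^T, without ever forming singular vectors. The sketch below follows that reading under several assumptions that do not come from the abstract: the all-ones starting vector, the use of full reorthogonalization, and the helper names (`lanczos_basis`, `filtered_matvec`). It should be taken as an illustration of the idea, not as the authors' implementation.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    # A x for a dense matrix stored as a list of rows.
    return [dot(row, x) for row in A]

def rmatvec(A, y):
    # A^T y without forming the transpose.
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def lanczos_basis(A, start, k):
    """Up to k orthonormal Lanczos vectors of A A^T (full reorthogonalization)."""
    norm = math.sqrt(dot(start, start))
    Q = [[s / norm for s in start]]
    while len(Q) < k:
        w = matvec(A, rmatvec(A, Q[-1]))  # apply A A^T implicitly
        for _ in range(2):                # two Gram-Schmidt passes for stability
            for q in Q:
                c = dot(w, q)
                w = [wi - c * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:                  # Krylov space exhausted
            break
        Q.append([wi / beta for wi in w])
    return Q

def filtered_matvec(A, b, k):
    """Approximate A b by its projection onto k Lanczos vectors of A A^T."""
    y = matvec(A, b)
    Q = lanczos_basis(A, [1.0] * len(A), k)  # assumed starting vector: all ones
    out = [0.0] * len(A)
    for q in Q:
        c = dot(q, y)
        out = [o + c * qi for o, qi in zip(out, q)]
    return out

A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
print(filtered_matvec(A, [1.0, 1.0, 1.0], 1))
```

With k equal to the dimension of the Krylov space the projection reproduces Ab exactly; smaller k keeps mostly the components along the dominant singular directions.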
Describing MANETS: Principal component analysis of sparse mobility traces
 in Proc. of PEWASUN’06 (to appear
, 2006
Data collected in realistic mobility traces for mobile ad hoc networks (MANETS) is intrinsically high dimensional. Principal Component Analysis (PCA) is a good tool for reducing the data dimension by extracting important features of the data. We propose a method for computing principal components using iterative regression for high-dimensional matrices with missing values, with an application to node degree time series. We expand this method to handle an additional dimension of information for a defined neighborhood ancestry of node degree, exposing patterns when they exist. We test our methodology on node degree data from a simulated university campus model (Pedsims) and real campus data. Results indicate that in both cases, the student’s major field of study along with class schedule are strong factors to differentiate mobile node degree time series. The ability to detect differences is a powerful tool for application-specific network management, allowing for optimal placement of routers, design of specialized protocols for various user populations, and insight into the energy/bandwidth needs of mobile devices.
Divide and Conquer Strategies for Effective Information Retrieval
The standard application of Latent Semantic Indexing (LSI), a well-known technique for information retrieval, requires the computation of a partial Singular Value Decomposition (SVD) of the term-document matrix. This computation is infeasible for large document collections, since it is very demanding both in terms of arithmetic operations and in memory requirements. This paper discusses two divide and conquer strategies applied to LSI, with the goal of alleviating these difficulties. These strategies process a data set by dividing it into subsets and conquering the LSI results on each subset. Since each subproblem resulting from the divide and conquer strategy has a smaller size, the processing of large scale document collections requires far fewer resources. In addition, the computation is highly parallel and can be easily adapted to a parallel computing environment. To reduce the computational cost of the LSI analysis of the subsets, we employ an approximation technique that is based on the Lanczos algorithm. This technique is far more efficient than the truncated SVD, while its accuracy is comparable. Experimental results confirm that the proposed divide and conquer strategies are effective for information retrieval problems.
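A common way to read "iterative regression" for PCA with missing values is as alternating least squares: fix the loadings and regress each row's score on its observed entries, then fix the scores and regress each column's loading. The sketch below fits a single component that way. It is an assumption about the general technique, not the authors' algorithm; the name `rank1_fit`, the use of `None` for missing entries, and the omission of mean-centering are choices made for illustration.

```python
def rank1_fit(X, iters=200):
    """Fit X[i][j] ~ u[i] * v[j] using only observed entries (None = missing).

    Alternates two one-variable least-squares regressions:
    scores u given loadings v, then loadings v given scores u.
    """
    m, n = len(X), len(X[0])
    u = [0.0] * m
    v = [1.0] * n  # assumed starting loadings
    for _ in range(iters):
        for i in range(m):
            obs = [j for j in range(n) if X[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            u[i] = sum(X[i][j] * v[j] for j in obs) / den if den else 0.0
        for j in range(n):
            obs = [i for i in range(m) if X[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            v[j] = sum(X[i][j] * u[i] for i in obs) / den if den else 0.0
    return u, v

# Rank-one data with two entries hidden.
X = [[1.0, 2.0, None],
     [2.0, 4.0, 8.0],
     [None, 6.0, 12.0]]
u, v = rank1_fit(X)
print(round(u[0] * v[2], 6), round(u[2] * v[0], 6))
```

Because each regression uses only the observed entries of a row or column, no imputation step is needed before fitting, and the fitted product u[i] * v[j] supplies an estimate for the hidden entries.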
Semantic Based Clustering of Web Documents
Abstract. A new methodology that structures the semantics of a collection of documents into the geometry of a simplicial complex is developed. A simplicial complex is topologically equivalent to a polyhedron in Euclidean space. The semantics of documents are structured by the geometry: a primitive concept is represented by a simplex, and a concept is represented by a connected component. Based on these structures, documents can be clustered into some meaningful classes. Experiments with three different data sets from web pages and medical literature have shown that our approach performs significantly better than traditional clustering algorithms, such as k-means, AutoClass and Hierarchical Clustering (HAC).
Incremental Matrix Factorization for Collaborative Filtering
An incremental and iterative Matrix Factorization method for very sparse matrices, based on Singular Value Decomposition, is presented. Such matrices arise in Collaborative Filtering (CF) systems, like the Netflix system. This paper shows how such an incremental Matrix Factorization can be used to predict ratings in a CF system and therefore how to fill the empty fields of a rating matrix of a CF system. The method presented here is also easy to implement and, if implemented well, offers good and reliable performance.
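The general idea of filling a sparse rating matrix by factorization can be sketched with plain stochastic gradient descent over the observed ratings. This is a generic illustration, not the specific incremental scheme of the paper; the function names (`train_mf`, `predict`, `rmse`), the hyperparameters and the toy ratings are all assumptions.

```python
import random

def predict(P, Q, u, i):
    # Predicted rating: inner product of user and item factor vectors.
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
             epochs=200, seed=0):
    """Factorize observed (user, item, rating) triples by SGD (illustrative)."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Toy data: 3 users, 3 items, 6 of the 9 ratings observed.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, 3, 3)
print(rmse(ratings, P, Q))   # training error on the observed ratings
print(predict(P, Q, 0, 2))   # estimate for an unobserved (user, item) pair
```

Only observed entries enter the updates, which is what makes this style of factorization practical for very sparse rating matrices.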