Results 1 - 10 of 136
Matrix completion from a few entries
Cited by 197 (8 self)
Let M be a random nα × n matrix of rank r ≪ n, and assume that a uniformly random subset E of its entries is observed. We describe an efficient algorithm that reconstructs M from |E| = O(rn) observed entries with relative root mean square error RMSE ≤ C(α) ...
Fast Computation of Low Rank Matrix Approximations
, 2001
Cited by 163 (4 self)
In many practical applications, given an m × n matrix A, it is of interest to find an approximation to A that has low rank. We introduce a technique that exploits spectral structure in A to accelerate Orthogonal Iteration and Lanczos Iteration, the two most common methods for computing such approximations. Our technique amounts to independently sampling and/or quantizing the entries of the input matrix A, thus speeding up computation by reducing the number of nonzero entries and/or the length of their representation. Our analysis is based on observing that both sampling and quantization can be viewed as adding a random matrix E to A, where the entries of E are independent, zero-mean random variables of bounded variance. Such random matrices possess no significant linear structure, and we can thus prove that the effect of sampling and quantization nearly vanishes when a low-rank approximation to A is computed. In fact, the more prominent the linear structure in A is, the more data we can afford to discard and, ultimately, the faster we can discover it. We give bounds on the quality of our approximation both in the L2 and in the Frobenius norm.
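The claim that sampling and quantization act like adding a zero-mean random matrix E can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes entries are kept with a fixed probability `p` and rescaled by `1/p`, and that quantization rounds each entry to `+b` or `-b` with a probability chosen so the expectation is preserved. The function names and parameters are hypothetical.

```python
import random

def sample_entries(A, p, rng):
    """Keep each entry with probability p, rescaled by 1/p; set it to 0 otherwise."""
    return [[a / p if rng.random() < p else 0.0 for a in row] for row in A]

def quantize_entries(A, b, rng):
    """Round each entry a (with |a| <= b) to +b or -b so that its expectation stays a."""
    return [[b if rng.random() < 0.5 + a / (2.0 * b) else -b for a in row] for row in A]

def entrywise_mean(transform, A, trials):
    """Average transform(A) entry by entry over many independent trials."""
    rows, cols = len(A), len(A[0])
    total = [[0.0] * cols for _ in range(rows)]
    for _ in range(trials):
        B = transform(A)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += B[i][j]
    return [[t / trials for t in row] for row in total]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, purely for reproducibility of the demo
    A = [[1.0, -2.0], [0.5, 3.0]]
    # Both averages should be close to A, i.e. the perturbation E has mean zero.
    print(entrywise_mean(lambda M: sample_entries(M, 0.5, rng), A, 20000))
    print(entrywise_mean(lambda M: quantize_entries(M, 3.0, rng), A, 20000))
```

Each perturbed matrix is sparser or cheaper to store than A, while the averaged result stays close to A. This is the zero-mean property the abstract's analysis relies on.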
Sampling from large matrices: an approach through geometric functional analysis
 Journal of the ACM
, 2006
Cited by 129 (5 self)
We study random submatrices of a large matrix A. We show how to approximately compute A from its random submatrix of the smallest possible size O(r log r) with a small error in the spectral norm, where r = ‖A‖_F^2 / ‖A‖_2^2 is the numerical rank of A. The numerical rank is always bounded by, and is a stable relaxation of, the rank of A. This yields an asymptotically optimal guarantee in an algorithm for computing low-rank approximations of A. We also prove asymptotically optimal estimates on the spectral norm and the cut-norm of random submatrices of A. The result for the cut-norm yields a slight improvement on the best known sample complexity for an approximation algorithm for MAX-2-CSP problems. We use methods of Probability in Banach spaces, in particular the law of large numbers for operator-valued random variables.
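The numerical rank defined above, the squared Frobenius norm divided by the squared spectral norm, is easy to evaluate on small examples. The sketch below is only an illustration: it estimates the spectral norm with plain power iteration on AᵀA from a seeded random start (an assumption made here for simplicity, not a method taken from the paper), and the function names are hypothetical.

```python
import math
import random

def frobenius_norm_sq(A):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(a * a for row in A for a in row)

def spectral_norm_sq(A, iters=500, seed=0):
    """Estimate the squared spectral norm (top eigenvalue of A^T A) by power iteration."""
    rng = random.Random(seed)
    n = len(A[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    lam = 0.0
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]           # A v
        lam = sum(y * y for y in Av)                                      # Rayleigh quotient of A^T A
        w = [sum(row[j] * y for row, y in zip(A, Av)) for j in range(n)]  # A^T (A v)
        scale = math.sqrt(sum(x * x for x in w))
        if scale == 0.0:
            break
        v = [x / scale for x in w]
    return lam

def numerical_rank(A):
    """Ratio of squared Frobenius norm to squared spectral norm."""
    top = spectral_norm_sq(A)
    if top == 0.0:
        raise ValueError("numerical rank is undefined for the zero matrix")
    return frobenius_norm_sq(A) / top

if __name__ == "__main__":
    print(numerical_rank([[1.0, 2.0], [2.0, 4.0]]))  # rank-one matrix
    print(numerical_rank([[3.0, 0.0], [0.0, 4.0]]))  # full rank, unequal singular values
```

For a rank-one matrix the ratio is 1, and for the identity it equals the dimension. When the singular values are unequal the value falls strictly below the algebraic rank, which is the sense in which the numerical rank is "bounded by" the rank.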
Alignment-free sequence comparison - a review
 Bioinformatics
, 2003
Cited by 100 (8 self)
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as preselection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available
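The first category of methods, distances between word-frequency vectors, can be sketched in a few lines. This is a minimal illustration, not code from the review: it assumes a DNA alphabet, overlapping words of fixed length `k`, and the plain Euclidean distance. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def word_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all overlapping length-k words over the alphabet."""
    words = ["".join(w) for w in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)  # windows with other symbols are ignored
    if total == 0:
        raise ValueError("sequence contains no words of length k over the alphabet")
    return [counts[w] / total for w in words]

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def word_distance(s, t, k):
    """Alignment-free dissimilarity: distance between the k-word frequency vectors."""
    return euclidean_distance(word_frequencies(s, k), word_frequencies(t, k))

if __name__ == "__main__":
    a = "ACGTACGTTTGGCCAA"
    b = "TTGGCCAAACGTACGT"  # the two halves of `a` swapped, i.e. shuffled segments
    print(word_distance(a, b, 2))
    print(word_distance(a, "AAAAAAAAAAAAAAAA", 2))
```

Swapping the two halves changes only the words that straddle a segment boundary or the sequence ends, so the first distance stays small even though the rearrangement breaks contiguity. The unrelated sequence in the second call lies much farther away.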
A survey of eigenvector methods of web information retrieval
 SIAM Rev
Cited by 93 (6 self)
Web information retrieval is significantly more challenging than traditional well-controlled, small document collection information retrieval. One main difference between traditional information retrieval and Web information retrieval is the Web’s hyperlink structure. This structure has been exploited by several of today’s leading Web search engines, particularly Google and Teoma. In this survey paper, we focus on Web information retrieval methods that use eigenvector computations, presenting the three popular methods of HITS, PageRank, and SALSA.
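Of the three eigenvector methods named above, PageRank is the simplest to sketch: the ranking vector is the dominant eigenvector of a damped link matrix, and power iteration finds it. The code below is a minimal illustration, not the survey's formulation. The damping value 0.85, the uniform treatment of pages without out-links, and the function name `pagerank` are assumptions made for this example.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a dict mapping page -> list of linked pages."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly (an assumption).
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank

if __name__ == "__main__":
    web = {"a": ["b"], "b": ["a"], "c": ["a"]}  # hypothetical three-page web
    for page, score in sorted(pagerank(web).items()):
        print(page, round(score, 4))
```

In this toy graph, page `a` receives links from both other pages and ends up ranked highest, while `c`, which nobody links to, keeps only the teleportation share.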
pSearch: Information Retrieval in Structured Overlays
, 2002
Cited by 89 (6 self)
We describe an efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches. pSearch avoids the scalability problem of existing systems that employ centralized indexing, or index/query flooding. It also avoids the nondeterminism that is exhibited by heuristic-based approaches. In pSearch, documents in the network are organized around their vector representations (based on modern document ranking algorithms) such that the search space for a given query is organized around related documents, achieving both efficiency and accuracy.
A Two-Dimensional Data Distribution Method For Parallel Sparse Matrix-Vector Multiplication
 SIAM REVIEW
Cited by 85 (9 self)
A new method is presented for distributing data in sparse matrix-vector multiplication. The method is two-dimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipartitioning of the sparse matrix, each time splitting a rectangular matrix into two parts with a nearly equal number of nonzeros. The communication volume caused by the split is minimised. After the matrix partitioning, the input and output vectors are partitioned with the objective of minimising the maximum communication volume per processor. Experimental results of our implementation, Mondriaan, for a set of sparse test matrices show a reduction in communication compared to one-dimensional methods, and in general a good balance in the communication work.
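The two quantities the method trades off, nonzero balance and communication volume, can be evaluated for any candidate split. The sketch below does not implement the partitioner itself. It only scores a given assignment of nonzeros to parts, under the assumed cost model that a row or column whose nonzeros are spread over λ parts contributes λ - 1 to the communication volume. Function names and the example matrix are hypothetical.

```python
from collections import defaultdict

def communication_volume(assignment):
    """Sum over all rows and columns of (number of parts owning a nonzero there) - 1."""
    rows, cols = defaultdict(set), defaultdict(set)
    for (i, j), part in assignment.items():
        rows[i].add(part)
        cols[j].add(part)
    return (sum(len(p) - 1 for p in rows.values())
            + sum(len(p) - 1 for p in cols.values()))

def nonzeros_per_part(assignment):
    """Number of nonzeros assigned to each part (the computational load)."""
    counts = defaultdict(int)
    for part in assignment.values():
        counts[part] += 1
    return dict(counts)

# Nonzero pattern of a 4 x 4 block-diagonal matrix with two dense 2 x 2 blocks.
nonzeros = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]

# Candidate split 1: one diagonal block per part.
by_block = {(i, j): 0 if i < 2 else 1 for (i, j) in nonzeros}
# Candidate split 2: even rows to part 0, odd rows to part 1.
interleaved = {(i, j): i % 2 for (i, j) in nonzeros}

if __name__ == "__main__":
    for name, split in (("by_block", by_block), ("interleaved", interleaved)):
        print(name, nonzeros_per_part(split), communication_volume(split))
```

Both splits balance the nonzeros equally, but the interleaved one forces every column to be shared between the two parts. A bipartitioner that minimises volume subject to balance would therefore prefer the block split.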
Gene clustering by latent semantic indexing of Medline abstracts
 Bioinformatics
Cited by 52 (8 self)
Motivation: A major challenge in the interpretation of high-throughput genomic data is understanding the functional associations between genes. Previously, several approaches have been described to extract gene relationships from various biological databases using term-matching methods. However, more flexible automated methods are needed to identify functional relationships (both explicit and implicit) between genes from the biomedical literature. In this study, we explored the utility of Latent Semantic Indexing (LSI), a vector space model for information retrieval, to automatically identify conceptual gene relationships from titles and abstracts in MEDLINE citations. Results: We found that LSI identified gene-to-gene and keyword-to-gene relationships with high average precision. In addition, LSI identified implicit gene relationships based on word usage patterns in the gene abstract documents. Finally, we demonstrate here that pairwise distances derived from the vector angles of gene abstract documents can be effectively used to functionally group genes by hierarchical clustering. Our results provide proof-of-principle that LSI is a robust automated method to elucidate both known (explicit) and unknown (implicit) gene relationships from the biomedical literature. These features make LSI particularly useful for the analysis of novel associations discovered in genomic experiments. Availability: The 50-gene document collection used in this study can be interactively queried at
Semantic Small World: An overlay network for peer-to-peer search
, 2004
Cited by 44 (6 self)
For a peer-to-peer (P2P) system holding a massive amount of data, efficient semantic-based search for resources (such as data or services) is a key determinant of its scalability. This paper presents the design of an overlay network, namely semantic small world (SSW), that facilitates efficient semantic-based search in P2P systems. SSW is based on three innovative ideas: 1) small world network; 2) semantic clustering; 3) dimension reduction. Peers in SSW are clustered according to the semantics of their local data and self-organized as a small world overlay network. To address the maintenance issue of high-dimensional overlay networks, a dynamic dimension reduction method, called adaptive space linearization, is used to construct a one-dimensional SSW that supports operations in the high-dimensional semantic space. SSW achieves a very competitive trade-off between the search latencies/traffic and maintenance overheads. Through extensive simulations, we show that SSW is much more scalable to very large network sizes and very large numbers of data objects compared to pSearch, the state-of-the-art semantic-based search technique for P2P systems. In addition, SSW is adaptive to distribution of data and locality of interest; is very resilient to failures; and has a good load-balancing property.
PeerSearch: Efficient Information Retrieval in Peer-to-Peer Networks
 In Proceedings of HotNets-I, ACM SIGCOMM
, 2002
Cited by 40 (0 self)
In this paper, we propose an efficient peer-to-peer information retrieval system, PeerSearch, that supports state-of-the-art content and semantic searches. PeerSearch avoids the scalability problem of existing systems that employ centralized indexing, index flooding, or query flooding. It also avoids the nondeterminism that is exhibited by heuristic-based approaches. PeerSearch