SVDPACKC (Version 1.0) User's Guide
, 1993
Cited by 53 (4 self)
SVDPACKC comprises four numerical (iterative) methods for computing the singular value decomposition (SVD) of large sparse matrices using ANSI C. This software package implements Lanczos and subspace iteration-based methods for determining several of the largest singular triplets (singular values and corresponding left- and right-singular vectors) for large sparse matrices. The package has been ported to a variety of machines ranging from supercomputers to workstations: CRAY Y-MP, IBM RS/6000-550, DEC 5000-100, HP 9000-750, SPARCstation 2, and Macintosh II/fx. This document (i) explains each algorithm in some detail, (ii) explains the input parameters for each program, (iii) explains how to compile/execute each program, and (iv) illustrates the performance of each method when we compute lower rank approximations to sparse term-document matrices from information retrieval applications. A user-friendly software interface to the package for UNIX-based systems and the Macintosh II/fx is als...
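The idea of iterating toward the largest singular triplets can be sketched with plain power iteration on A^T A, which is the single-vector special case of subspace iteration. This is a minimal illustration and not SVDPACKC code: the coordinate-list storage and the function names (`matvec`, `rmatvec`, `dominant_triplet`) are assumptions made for the example, and a practical solver would use Lanczos or block methods with a convergence test rather than a fixed iteration count.

```python
import math

def matvec(entries, n_rows, x):
    """Compute y = A x for a sparse A stored as (row, col, value) triples."""
    y = [0.0] * n_rows
    for i, j, a in entries:
        y[i] += a * x[j]
    return y

def rmatvec(entries, n_cols, y):
    """Compute x = A^T y for the same sparse storage."""
    x = [0.0] * n_cols
    for i, j, a in entries:
        x[j] += a * y[i]
    return x

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))

def dominant_triplet(entries, n_rows, n_cols, iters=200):
    """Approximate the largest singular triplet (sigma, u, v) of A.

    Repeatedly applies A^T A to a start vector and normalizes it. The start
    vector must not be orthogonal to the dominant right-singular vector.
    """
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        w = rmatvec(entries, n_cols, matvec(entries, n_rows, v))
        scale = norm(w)
        v = [c / scale for c in w]
    av = matvec(entries, n_rows, v)
    sigma = norm(av)          # sigma = ||A v||
    u = [c / sigma for c in av]  # u = A v / sigma
    return sigma, u, v

# Hypothetical 3x2 sparse matrix whose singular values are 3 and 1.
entries = [(0, 0, 3.0), (1, 1, 1.0)]
sigma, u, v = dominant_triplet(entries, n_rows=3, n_cols=2)
print(round(sigma, 6))
```

Only matrix-vector products with A and A^T are needed, so the matrix is never densified; that property is what makes iterative methods of this family attractive for large sparse inputs.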
Information Management Tools for Updating an SVD-Encoded Indexing Scheme
, 1994
Cited by 26 (0 self)
Lexical-matching methods for information retrieval can be inaccurate when they are used to match a user's queries. Typically, information is retrieved by literally matching terms in documents with those of the query. The problem is that users want to retrieve on the basis of conceptual topic or meaning of a document. There are usually many ways to express a given concept (synonymy), so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. The implicit high-order structure of associating terms with documents can be exploited by the singular value decomposition (SVD). Latent Semantic Indexing (LSI) is a conceptual indexing technique which uses the SVD to estimate the underlying latent semantic structure of the word to document association. By computing a lower-rank approximation to the original term-document matrix, LSI dampen...
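The lower-rank approximation described in this abstract can be sketched with NumPy's dense SVD. This is a toy illustration under stated assumptions, not the authors' implementation: the five-term, four-document matrix, the choice k = 2, and the helper names `lsi_fit` and `lsi_query` are all invented for the example, and a real collection would need a sparse, iterative solver.

```python
import numpy as np

def lsi_fit(A, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of a term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsi_query(q, U_k, s_k, Vt_k):
    """Fold a term-space query into the latent space and score documents by cosine."""
    q_hat = (U_k.T @ q) / s_k          # query coordinates in the k-dim space
    docs = Vt_k.T                      # one row of latent coordinates per document
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return docs @ q_hat / denom

# Toy data (assumed): rows are the terms car, auto, engine, flower, petal;
# columns are four short documents.
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

U_k, s_k, Vt_k = lsi_fit(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k        # rank-2 approximation of A

q = np.array([1, 0, 0, 0, 0], dtype=float)  # query containing only "car"
scores = lsi_query(q, U_k, s_k, Vt_k)
print(np.round(scores, 3))
```

In this toy matrix the second document contains "auto" and "engine" but not "car", yet it scores about as high as the first document for the query "car", because both documents collapse onto the same latent direction. That is the synonymy effect the abstract refers to; the two flower documents score near zero.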
Link-based similarity search to fight web spam
- In AIRWEB
, 2006
Cited by 15 (2 self)
We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong internal connectedness, however, gave rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classifiers by spreading information along hyperlinks from white and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms. (www.ilab.sztaki.hu/websearch)
Toward Large-Scale Information Retrieval Using Latent Semantic Indexing
- Department of Computer Science, University of Tennessee
, 1996
Cited by 5 (1 self)
As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming In...
Semantic Density Analysis: Comparing word meaning across time and phonetic space
Cited by 5 (0 self)
This paper presents a new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis. By comparing the density of semantic vector clusters this method allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning. Possible applications of this method are then illustrated in tracing the semantic change of ‘dog’, ‘do’, and ‘deer’ in early English and examining and comparing phonaesthemes.
Latent semantic analysis and Fiedler retrieval
- EARLIER VERSION IN PROC. SIAM WORKSHOP ON TEXT MINING’06
, 2006
Cited by 4 (1 self)
Latent semantic analysis (LSA) is a method for information retrieval and processing which is based upon the singular value decomposition. It has a geometric interpretation in which objects (e.g. documents and keywords) are placed in a low-dimensional geometric space. In this paper, we derive an alternative algebraic/geometric method for placing objects in space to facilitate information analysis. We show that our method is closely related to LSA, and essentially equivalent for particular choices of scaling parameters. We then show that our approach supports a number of generalizations and extensions that existing LSA approaches cannot handle.
Spectral clustering in telephone call graphs
- In WebKDD/SNAKDD Workshop 2007 in conjunction with KDD
, 2007
Cited by 3 (1 self)
We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, not described in the literature. Divide-and-Merge, a recently described postfiltering procedure, may be used to eliminate bad quality branches in a binary tree hierarchy. We propose an alternate solution that enables k-way cuts in each step by immediately filtering unbalanced or low quality clusters before splitting them further. Our experiments are performed on graphs built from call detail records with various weightings and normalizations. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although divide-and-merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster; furthermore, we give geographically much more homogeneous clusters, with the size distribution of our clusters resembling that of the settlement structure.
SVDPACKC (Version 1.0) User's Guide
Cited by 2 (0 self)
SVDPACKC comprises four numerical (iterative) methods for computing the singular value decomposition (SVD) of large sparse matrices using ANSI C. This software package implements Lanczos and subspace iteration-based methods for determining several of the largest singular triplets (singular values and corresponding left- and right-singular vectors) for large sparse matrices. The package has been ported to a variety of machines ranging from supercomputers to workstations: CRAY Y-MP, IBM RS/6000-550, DEC 5000-100, HP 9000-750, SPARCstation 2, and Macintosh II/fx. This document (i) explains each algorithm in some detail, (ii) explains the input parameters for each program, (iii) explains how to compile/execute each program, and (iv) illustrates the performance of each method when we compute lower rank approximations to sparse term-document matrices from information retrieval applications. A user-friendly software interface to the package for UNIX-based systems and the Macintosh II/fx is als...
Large-Scale Principal Component Analysis on LiveJournal Friends Network
Cited by 1 (0 self)
Principal Component Analysis (PCA) is a general means of unsupervised exploration that can be used to find basic motives and organizational themes, the guidance in friends network formation. The applications of PCA include Kleinberg’s ranking algorithm as well as spectral graph partitioning. We extend the applicability of PCA to very large scale social networks by handling the abundance of small size communities that hide the higher level structure. Strongest communities, that are still small themselves, take over the first principal axes and the analysis leaves a giant mass in the all-zeroes coordinate. In a combination of heuristics that involve the removal of community cores as well as the contraction of tentacles we are able to find meaningful high level components that characterize countries, regions, age or interest in polarized topics. Our experiments are run on a 3.5M user snapshot of the LiveJournal
A Comparative Analysis of Latent Variable Models for Web Page Classification
A main challenge for Web content classification is how to model the input data. This paper discusses the application of two text modeling approaches, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), in the Web page classification task. We report results on a comparison of these two approaches using different vocabularies consisting of links and text. Both models are evaluated using different numbers of latent topics. Finally, we evaluate a hybrid latent variable model that combines the latent topics resulting from both LSA and LDA. This new approach turns out to be superior to the basic LSA and LDA models. In our experiments with categories and pages obtained from the ODP web directory the hybrid model achieves an averaged F-measure value of 0.852 and an averaged ROC value of 0.96.

