Results 1 - 10
of
29
Using Linear Algebra for Intelligent Information Retrieval
- SIAM Review
, 1995
"... . Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical me ..."
Abstract
-
Cited by 450 (14 self)
- Add to MetaCart
. Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users...
Locality Preserving Projections
, 2002
"... Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data s ..."
Abstract
-
Cited by 142 (15 self)
- Add to MetaCart
Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) -- a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
A survey of information retrieval and filtering methods
, 1995
"... We survey the major techniques for information retrieval. In the rst part, weprovide an overview of the traditional ones (full text scanning, inversion, signature les and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic ..."
Abstract
-
Cited by 82 (0 self)
- Add to MetaCart
We survey the major techniques for information retrieval. In the rst part, weprovide an overview of the traditional ones (full text scanning, inversion, signature les and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).
Computational Methods for Intelligent Information Access
, 1995
"... Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical ..."
Abstract
-
Cited by 59 (0 self)
- Add to MetaCart
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to...
Latent Semantic Space: Iterative Scaling Improves Precision of Inter-document Similarity Measurement
- In Proceedings of the SIGIR
, 2000
"... We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% higher than SVD on average. Our algorithm creates basis vectors for a reduced space by iteratively "scaling" vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.
Locality Preserving Indexing for Document Representation
- In Proc. of the 27rd ACM SIGIR
, 2004
"... Document representation and indexing is a key problem for document analysis and processing, such as clustering, classification and retrieval. Conventionally, Latent Semantic Indexing (LSI) is considered effective in deriving such an indexing. LSI essentially detects the most representative features ..."
Abstract
-
Cited by 24 (12 self)
- Add to MetaCart
Document representation and indexing is a key problem for document analysis and processing, such as clustering, classification and retrieval. Conventionally, Latent Semantic Indexing (LSI) is considered effective in deriving such an indexing. LSI essentially detects the most representative features for document representation rather than the most discriminative features. Therefore, LSI might not be optimal in discriminating documents with different semantics. In this paper, a novel algorithm called Locality Preserving Indexing (LPI) is proposed for document indexing. Each document is represented by a vector with low dimensionality. In contrast to LSI which discovers the global structure of the document space, LPI discovers the local structure and obtains a compact document representation subspace that best detects the essential semantic structure. We compare the proposed LPI approach with LSI on two standard databases. Experimental results show that LPI provides better representation in the sense of semantic structure.
Applying the multiple cause mixture model to text categorization
- Proceedings of the AAAI Symposium on Machine Learning in Information Access
, 1996
"... This paper introduces the use of the Multiple ..."
Iterative Residual Rescaling: An Analysis and Generalization of LSI
- IN PROC. OF THE 24 TH INTERNATIONAL ACM SIGIR
, 2001
"... We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI), showing a precise relationship between its performance and the uniformity of the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando’s (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.
Expertise modeling for matching papers with reviewers
- In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2007
"... An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen do ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more “personas,” which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts and disambiguates author mentions from documents gathered from the web. We evaluate the models using a reviewer matching task based on human relevance judgments determining how well the expertise of proposed reviewers matches a submission. We find that the APT topic model outperforms the other models. 1.
Technical Paper Recommendation: A Study in Combining Multiple Information Sources
- Journal of Artificial Intelligence Research
, 2001
"... The growing need to manage and exploit the proliferation of online data sources is opening ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
The growing need to manage and exploit the proliferation of online data sources is opening

