Results 1 - 10 of 55
Using Linear Algebra for Intelligent Information Retrieval
- SIAM Review, 1995
Cited by 450 (14 self)
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users...
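As a rough illustration of the idea in this abstract, the sketch below builds a tiny term-by-document matrix, keeps its two largest singular triplets, and matches a query in the reduced space. Everything in it is an assumption made for illustration: the five-term vocabulary, the four toy documents, the helper names (`truncated_svd`, `project_query`, `doc_coords`), and the use of plain power iteration in place of the sparse SVD solver a real system would need. The query projection (query vector times the term vectors, divided by the singular values) is one common LSI convention, not necessarily the exact one used in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def truncated_svd(A, k, iters=300):
    """Top-k singular triplets (sigma, u, v) of a terms x docs matrix A.

    Uses power iteration on A^T A with deflation; fine for a toy matrix,
    not a substitute for a sparse SVD library on real collections.
    """
    At = [list(col) for col in zip(*A)]  # docs x terms
    triplets = []
    for _rank in range(k):
        v = [1.0 + 0.1 * i for i in range(len(At))]  # fixed, arbitrary start
        for _step in range(iters):
            # Stay orthogonal to the document-space vectors already found.
            for _sigma, _u, prev in triplets:
                c = dot(v, prev)
                v = [x - c * p for x, p in zip(v, prev)]
            w = matvec(At, matvec(A, v))
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        Av = matvec(A, v)
        sigma = math.sqrt(dot(Av, Av))
        triplets.append((sigma, [x / sigma for x in Av], v))
    return triplets

def project_query(q, triplets):
    """Map a term-count query vector into the k-dimensional space."""
    return [dot(q, u) / sigma for sigma, u, _v in triplets]

def doc_coords(j, triplets):
    """Coordinates of document j in the k-dimensional space."""
    return [v[j] for _sigma, _u, v in triplets]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vocabulary (rows) and documents (columns).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
]

triplets = truncated_svd(A, k=2)
query = [1, 0, 0, 0, 0]  # the single word "car"
q_hat = project_query(query, triplets)
scores = [cosine(q_hat, doc_coords(j, triplets)) for j in range(4)]
```

In this toy collection the query "car" scores close to 1 against document 1, which contains only "automobile" and "engine", and close to 0 against the two flower documents. A purely lexical match would miss document 1 because it shares no word with the query.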
Resolving Ambiguity for Cross-language Retrieval
- In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998
Cited by 143 (3 self)
One of the main hurdles to improved CLIR effectiveness is resolving ambiguity associated with translation. Availability of resources is also a problem. First we present a technique based on co-occurrence statistics from unlinked corpora which can be used to reduce the ambiguity associated with phrasal and term translation. We then combine this method with other techniques for reducing ambiguity and achieve more than 90% monolingual effectiveness. Finally, we compare the co-occurrence method with parallel corpus and machine translation techniques and show that good retrieval effectiveness can be achieved without complex resources.
Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval
- In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997
Cited by 143 (3 self)
Dictionary methods for cross-language information retrieval give performance below that for mono-lingual retrieval. Failure to translate multi-term phrases has been shown to be one of the factors responsible for the errors associated with dictionary methods. First, we study the importance of phrasal translation for this approach. Second, we explore the role of phrases in query expansion via local context analysis and local feedback and show how they can be used to significantly reduce the error associated with automatic dictionary translation.
1 Introduction
The development of IR systems for languages other than English has focused on building mono-lingual systems. Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in cross-language information retrieval (CLIR) - the development of systems to perform retrieval across languages. There have been three main approaches to CLIR: translation via machine t...
The Web as a Parallel Corpus
- Computational Linguistics, 2003
Cited by 101 (3 self)
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale.
Computational Methods for Intelligent Information Access
- 1995
Cited by 59 (0 self)
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to...
A survey of multilingual text retrieval
- 1996
Cited by 58 (7 self)
This report reviews the present state of the art in selection of texts in one language based on queries in another, a problem we refer to as "multilingual" text retrieval. Present applications of multilingual text retrieval systems are limited by the cost and complexity of developing and using the multilingual thesauri on which they are based and by the level of user training that is required to achieve satisfactory search effectiveness. A general model for multilingual text retrieval is used to review the development of the field and to describe modern production and experimental systems. The report concludes with some observations on the present state of the art and an extensive bibliography of the technical literature on multilingual text retrieval.
Dictionary Methods for Cross-Lingual Information Retrieval
- In Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, 1996
Cited by 57 (5 self)
Multi-lingual information retrieval (IR) has largely been limited to the development of systems for use with a specific foreign language. The explosion in the availability of electronic media in languages other than English makes the development of IR systems that can cross language boundaries increasingly important. In this paper, we present experiments that analyze the factors that affect dictionary based methods for cross-lingual retrieval and present methods that dramatically reduce the errors such an approach usually makes.
Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing
- 1997
Cited by 52 (2 self)
this document as a bag of freely intermingled French and English words. A set of training documents like this is analyzed using LSI, and the result is a reduced dimension semantic space in which related terms are near each other. Because the documents contained both French and English terms, the LSI space will contain terms from both languages; this is what makes it possible for the CL-LSI method to avoid query translation. Words that are consistently paired in translation (e.g., Libya and Libye) will be given identical representations in the LSI space, whereas words that are frequently associated with one another (e.g., not and pas) will be given similar representations. The next step in the CL-LSI method is to add (or "fold in") documents in just French or English. As described above, this is done by locating a new document at the weighted vector sum of its constituent terms. The result of this process is that each document in the database has a language-independent representation in terms of numerical vectors. Users can now pose queries in either French or English and get back the most similar documents regardless of language.
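The fold-in step described in this excerpt can be sketched as a weighted sum of term vectors. The code below is a minimal illustration only: the two-dimensional term vectors are made-up values standing in for a trained CL-LSI space, the function name `fold_in` and its `weights` parameter are hypothetical, and raw term counts are assumed as the default weighting because the excerpt does not say which weights are used.

```python
import math

def fold_in(doc_terms, term_vectors, weights=None):
    """Place a document at the weighted vector sum of its known terms.

    doc_terms maps term -> count; term_vectors maps term -> vector.
    weights optionally maps term -> extra per-term weight (assumed 1.0 if absent).
    """
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for term, count in doc_terms.items():
        vec = term_vectors.get(term)
        if vec is None:
            continue  # terms unseen in training contribute nothing
        w = count * (weights.get(term, 1.0) if weights else 1.0)
        total = [t + w * x for t, x in zip(total, vec)]
    return total

# Hypothetical term vectors: translation pairs share one representation.
term_vectors = {
    "libya": [0.9, 0.1], "libye": [0.9, 0.1],
    "oil": [0.8, 0.3], "petrole": [0.8, 0.3],
    "garden": [0.1, 0.9], "jardin": [0.1, 0.9],
}

english_doc = fold_in({"libya": 2, "oil": 1}, term_vectors)
french_doc = fold_in({"libye": 2, "petrole": 1}, term_vectors)
other_doc = fold_in({"jardin": 3}, term_vectors)

# Distance between folded-in documents, independent of their language.
same_topic_gap = math.dist(english_doc, french_doc)
other_topic_gap = math.dist(english_doc, other_doc)
```

With these toy vectors the English and French documents about the same topic land on the same point, while the French gardening document lands far away, which is the language-independent behaviour the excerpt describes.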
Alternative approaches for cross-language text retrieval
- In AAAI Symposium on cross-language text and speech retrieval. American Association for Artificial Intelligence, 1997
Cited by 42 (5 self)
The explosive growth of the Internet and other sources of networked information have made automatic mediation of access to networked information sources an increasingly important problem. Much of this information
Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report
- IN, 1998
Cited by 38 (1 self)
Singular value decomposition (SVD) can be viewed as a method for unsupervised training of a network that associates two classes of events reciprocally by linear connections through a single hidden layer. SVD was used to learn and represent relations among very large numbers of words (20k-60k) and very large numbers of natural text passages (1k-70k) in which they occurred. The result was 100-350 dimensional "semantic spaces" in which any trained or newly added word or passage could be represented as a vector, and similarities were measured by the cosine of the contained angle between vectors. Good accuracy in simulating human judgments and behaviors has been demonstrated by performance on multiple-choice vocabulary and domain knowledge tests, emulation of expert essay evaluations, and in several other ways. Examples are also given of how the kind of knowledge extracted by this method can be applied.
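To make the cosine measure and the multiple-choice vocabulary test mentioned in this abstract concrete, here is a small sketch that answers a vocabulary item by picking the alternative whose vector has the highest cosine with the stem word. The three-dimensional vectors are invented for illustration (the abstract reports 100-350 dimensions), and the function name `answer_multiple_choice` is hypothetical rather than taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def answer_multiple_choice(stem, choices, vectors):
    """Return the choice whose vector is closest in angle to the stem's."""
    return max(choices, key=lambda c: cosine(vectors[stem], vectors[c]))

# Hypothetical word vectors standing in for a trained semantic space.
vectors = {
    "enormous": [0.9, 0.1, 0.2],
    "huge": [0.8, 0.2, 0.1],
    "tiny": [-0.7, 0.1, 0.3],
    "rapid": [0.1, 0.9, 0.0],
    "green": [0.0, 0.2, 0.9],
}

picked = answer_multiple_choice("enormous", ["tiny", "rapid", "huge", "green"], vectors)
```

Here `picked` is "huge", since its vector points in nearly the same direction as the one for "enormous".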

