Results 1 - 10 of 29
Cross-Language Information Retrieval (CLIR) Track Overview
- In Proceedings of the Sixth Text Retrieval Conference (TREC-6), 1997
Cited by 23 (2 self)
Introduction: Cross-Language Information Retrieval (CLIR) was a new task in the TREC-6 evaluation. In contrast to the multilingual track included in previous TREC evaluations, which was concerned with information retrieval in Spanish or Chinese, the cross-language retrieval track focuses on the retrieval situation where the documents are written in a language different from the one used to specify the queries. The TREC-6 track used documents in English, French and German and queries in English, French, German, Spanish and Dutch. There are many applications or scenarios in which a user of a retrieval system may be interested in finding information written in a language other than the user's native or preferred language. In some applications, a user may want to discover all possible relevant information in a multilingual textbase, irrespective of the language of the relevant information. This may be the case when searc
Automatic 3-Language Cross-Language Information Retrieval with Latent Semantic Indexing
- In The Sixth Text Retrieval Conference Notebook Papers (TREC6), 103--110. National Institute of Standards and Technology Special Publication, 1998
Cited by 15 (2 self)
This paper describes cross-language information retrieval experiments carried out for TREC-6. Our retrieval method, cross-language latent semantic indexing (CL-LSI), is completely automatic and we were able to use it to create a 3-way English-French-German IR system. This study extends our previous work in terms of the large size of training and testing corpora, the use of low-quality training data, the evaluation using relevance judgments, and the number of languages analyzed. Introduction: Cross-language LSI (CL-LSI) is a fully automatic method for cross-language document retrieval in which no query translation is required. Queries in one language can retrieve documents in other languages (as well as the original language). This is accomplished by a method that automatically constructs a multi-lingual semantic space using latent semantic indexing (LSI); this semantic space is exploited in the form of a vector lexicon, which assigns each word in each language to a point in the high-dim...
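The vector-lexicon idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny English-French training corpus, the choice of k = 2 dimensions, and the rule of embedding a text as the sum of its term vectors are all assumptions made for illustration. Each training document concatenates a text with its translation, a truncated SVD of the term-by-document matrix gives every word in both languages a point in one shared space, and queries are compared with documents by cosine similarity without any query translation.

```python
import numpy as np

# Hypothetical aligned training corpus: each entry is an English text
# concatenated with its French translation.
training_docs = [
    "dog barks loudly chien aboie fort",
    "cat sleeps quietly chat dort tranquillement",
    "dog chases cat chien poursuit chat",
    "bank holds money banque garde argent",
    "money buys bread argent paie pain",
]

vocab = sorted({w for doc in training_docs for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix built from the dual-language documents.
A = np.zeros((len(vocab), len(training_docs)))
for j, doc in enumerate(training_docs):
    for w in doc.split():
        A[index[w], j] += 1.0

# Truncated SVD; k is an illustrative choice, not a value from the paper.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # the "vector lexicon": one row per term


def embed(text):
    """Embed a text as the sum of its known term vectors (an assumption)."""
    rows = [term_vectors[index[w]] for w in text.split() if w in index]
    return np.sum(rows, axis=0) if rows else np.zeros(k)


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank(query, documents):
    """Return document indices ordered by decreasing cosine similarity."""
    q = embed(query)
    scores = [cosine(q, embed(d)) for d in documents]
    return sorted(range(len(documents)), key=lambda i: -scores[i])


# An English query ranked against French-only documents.
french_docs = ["chien aboie", "banque argent"]
order = rank("dog barks", french_docs)
```

Because "dog" and "chien" occur in exactly the same training documents in this toy corpus, they receive identical vectors; real aligned corpora would only bring translations close together, not make them coincide.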
Data-Driven Approaches To Information Access
- Cognitive Science, 2003
Cited by 12 (0 self)
This paper summarizes three lines of research that are motivated by the practical problem of helping users find information from external data sources, most notably computers. The application areas include information retrieval, text categorization, and question answering. A common theme in these applications is that practical information access problems can be solved by analyzing the statistical properties of words in large volumes of real-world texts. The same statistical properties constrain human performance, so we believe that solutions to practical information access problems can shed light on human knowledge representation and reasoning.
Informed Projections
- In Advances in Neural Information Processing Systems 15, 2002
Cited by 7 (0 self)
Low-rank approximation techniques are widespread in pattern recognition research: they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected.
Building Bilingual Dictionaries From Parallel Web
- In Proc. of ECIR, 2002
Cited by 6 (0 self)
In this paper we describe a system for automatically constructing a bilingual dictionary for cross-language information retrieval applications. We describe how we automatically target candidate parallel documents, filter the candidate documents and process them to create parallel sentences. The parallel sentences are then automatically translated using an adaptation of the EMIM technique and a dictionary of translation terms is created. We evaluate our dictionary using human experts. The evaluation showed that the system performs well. In addition, the results obtained from automatically created corpora are comparable to those obtained from manually created corpora of parallel documents. Compared to other available techniques, our approach has the advantage of being simple, uniform, and easy to implement while providing encouraging results.
Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries
- Proc. of 25th ACM SIGIR Conf, 2002
Cited by 6 (0 self)
This paper presents a method of information harvesting and consolidation to support the multilingual information requirements for cross-language information retrieval within digital library systems. We describe a way to create both customized bilingual dictionaries and multilingual query mappings from a source language to many target languages. We will describe a multilingual conceptual mapping resource with broad coverage (over 100 written languages can be supported) that is truly multilingual as opposed to bilingual pairings usually derived from machine translation. This resource is derived from the 10+ million title online library catalog of the University of California. It is created statistically via maximum likelihood associations from words and phrases in book titles of many languages to human-assigned subject headings in English. The 150,000 subject headings can form interlingua mappings between pairs of languages or from one language to several languages. While our current demonstration prototype maps between ten languages (English, Arabic, Chinese, French, German, Italian, Japanese, Portuguese, Russian, Spanish), extensions to additional languages are straightforward. We also describe how this resource is being expanded for languages where linguistic coverage is limited in our initial database, by automatically harvesting new information from international online library catalogs using the Z39.50 networked library search protocol.
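The interlingua idea in this abstract, where English subject headings connect title words across languages, can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's statistic: the catalog records are invented, and the association is approximated by the relative-frequency (maximum likelihood) estimate of P(heading | title word). A source word is mapped to target-language words by scoring the headings they share.

```python
from collections import Counter, defaultdict

# Hypothetical catalog records: (title language, title words, English heading).
RECORDS = [
    ("en", ["history", "of", "france"], "France -- History"),
    ("fr", ["histoire", "de", "france"], "France -- History"),
    ("de", ["geschichte", "frankreichs"], "France -- History"),
    ("en", ["organic", "chemistry"], "Chemistry, Organic"),
    ("fr", ["chimie", "organique"], "Chemistry, Organic"),
]


def heading_given_word(records):
    """Relative-frequency estimate of P(heading | (language, word))."""
    counts = defaultdict(Counter)
    for lang, words, heading in records:
        for w in words:
            counts[(lang, w)][heading] += 1
    return {
        key: {h: c / sum(cnt.values()) for h, c in cnt.items()}
        for key, cnt in counts.items()
    }


def map_word(word, source_lang, target_lang, records):
    """Map a word to target-language words through shared subject headings."""
    probs = heading_given_word(records)
    source_dist = probs.get((source_lang, word), {})
    scores = Counter()
    for (lang, w), dist in probs.items():
        if lang != target_lang:
            continue
        # Score = sum over headings of P(h | source word) * P(h | target word).
        scores[w] += sum(p * dist.get(h, 0.0) for h, p in source_dist.items())
    return [w for w, s in scores.most_common() if s > 0]


french_for_chemistry = map_word("chemistry", "en", "fr", RECORDS)
german_for_history = map_word("history", "en", "de", RECORDS)
```

With so few records every word under a heading scores equally; a realistic catalog would need frequency thresholds or a stronger association measure to separate true translations from co-occurring function words.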
Applying machine translation to two-stage cross-language information retrieval
- In Proceedings of the 4th Conference of the Association for Machine Translation in the Americas, 2000
Cited by 5 (3 self)
Cross-language information retrieval (CLIR), where queries and documents are in different languages, needs a translation of queries and/or documents, so as to standardize both of them into a common representation. For this purpose, the use of machine translation is an effective approach. However, the computational cost is prohibitive in translating large-scale document collections. To resolve this problem, we propose a two-stage CLIR method. First, we translate a given query into the document language, and retrieve a limited number of foreign documents. Second, we machine translate only those documents into the user language, and re-rank them based on the translation result. We also show the effectiveness of our method by way of experiments using Japanese queries and English technical documents.
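The two stages described in this abstract can be sketched as a small pipeline. Everything concrete here is hypothetical: the word-for-word lexicons stand in for a real machine translation system, the French-English language pair replaces the paper's Japanese-English setting, and the term-frequency score is a placeholder for a real ranking function. The point is only the control flow: translate the query, retrieve a short candidate list, translate just those candidates back, and re-rank.

```python
from collections import Counter

# Hypothetical toy lexicons standing in for a machine translation system.
QUERY_TO_DOC = {"chien": "dog", "chat": "cat", "maison": "house"}
DOC_TO_QUERY = {"dog": "chien", "hound": "chien", "cat": "chat",
                "house": "maison", "big": "grand"}


def translate(text, lexicon):
    """Word-for-word substitution; unknown words pass through unchanged."""
    return " ".join(lexicon.get(w, w) for w in text.lower().split())


def tf_score(query, document):
    """Placeholder relevance score: total frequency of query terms."""
    counts = Counter(document.split())
    return sum(counts[w] for w in set(query.split()))


def two_stage_clir(query, documents, first_stage_size=2):
    # Stage 1: translate the query into the document language and keep
    # only a limited number of top-scoring documents.
    translated_query = translate(query, QUERY_TO_DOC)
    ranked = sorted(
        range(len(documents)),
        key=lambda i: tf_score(translated_query, documents[i]),
        reverse=True,
    )
    candidates = ranked[:first_stage_size]

    # Stage 2: translate only the candidates into the user language and
    # re-rank them against the original query.
    rescored = [
        (tf_score(query, translate(documents[i], DOC_TO_QUERY)), i)
        for i in candidates
    ]
    rescored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in rescored]


docs = ["the dog sleeps", "a hound and a dog", "the house is big"]
result = two_stage_clir("chien", docs)
```

In this toy run the second stage promotes the document containing "hound", which the translated query alone could not match; only the two candidate documents are ever translated, which is the cost saving the abstract describes.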
Cross-Language Text Retrieval With Three Languages
- Duke University, 1997
Cited by 5 (3 self)
In cross-language text retrieval, query text objects in one language are matched against a collection of text objects in another. Previous work showed that a two-language cross-language text-retrieval system can be created completely automatically by a method called CL-LSI trained on an aligned corpus of text objects. In this paper, we look at the challenge of creating a three-language cross-language text-retrieval system. We find that CL-LSI extends easily to the three-language case when each training text object is available in all three languages (3-way aligned corpora). However, when each training text object is supplied in only two of the three languages (two-of-three pairwise aligned corpora), the natural extension of CL-LSI behaves very badly. We illustrate these observations on a large English-French-Spanish collection and introduce analysis tools that may be useful for future explorations. 1 INTRODUCTION With the explosion of large-scale heterogeneous collections of data li...
Learning a Language-Independent Representation for Terms from a Partially Aligned Corpus
- In Machine Learning: Proceedings of the Fifteenth International Conference (ICML'98), 1998
Cited by 5 (1 self)
Cross-language latent semantic indexing is a method that learns useful language-independent vector representations of terms through a statistical analysis of a document-aligned text. This is accomplished by taking a collection of, say, English paragraphs and their translations in Spanish and processing them by singular value decomposition to yield a high-dimensional vector representation for each term in the collection. These term vectors have the property that semantically similar terms have vectors with high cosine measure, regardless of their source language. In the present work, we extend this approach to the case in which English-Spanish translations are not available, but instead, translations for documents in both languages are available in a third "bridge" language, say, French. Thus, although no aligned English-Spanish documents are used, our method creates a representation in which English and Spanish terms can be compared. The resulting vector representation of terms can be use...
Exploiting comparable corpora and bilingual dictionaries for cross-language text categorization
- In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006
Cited by 4 (0 self)
Cross-language Text Categorization is the task of assigning semantic classes to documents written in a target language (e.g. English) while the system is trained using labeled documents in a source language (e.g. Italian). In this work we present many solutions according to the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resources are accessible. The core technique relies on the automatic acquisition of Multilingual Domain Models from comparable corpora. Experiments show the effectiveness of our approach, providing a low-cost solution for the Cross-language Text Categorization task. In particular, when bilingual dictionaries are available the performance of the categorization gets close to that of monolingual text categorization.

