Results 1 - 10 of 37
Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval
- In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997
Cited by 143 (3 self)
Dictionary methods for cross-language information retrieval give performance below that for mono-lingual retrieval. Failure to translate multi-term phrases has been shown to be one of the factors responsible for the errors associated with dictionary methods. First, we study the importance of phrasal translation for this approach. Second, we explore the role of phrases in query expansion via local context analysis and local feedback and show how they can be used to significantly reduce the error associated with automatic dictionary translation.
1 Introduction
The development of IR systems for languages other than English has focused on building mono-lingual systems. Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in cross-language information retrieval (CLIR), the development of systems to perform retrieval across languages. There have been three main approaches to CLIR: translation via machine t...
The Web as a Parallel Corpus
- Computational Linguistics, 2003
Cited by 101 (3 self)
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale.
Translingual information retrieval: A comparative evaluation
- In Proceedings of the 15th International Joint Conference on Artificial Intelligence, 1997
Cited by 59 (7 self)
Translingual information retrieval (TIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TIR methods and reports on comparative TIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation based, and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms other methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) perform relatively well, as does translingual pseudo relevance feedback (PRF). All showed relatively small performance loss between monolingual and translingual versions. Query translation based on a general machine-readable bilingual dictionary, heretofore the most popular method, did not match the performance of other, more sophisticated methods. Also, the previous very high LSI results in the literature were disconfirmed by more realistic relevance-based evaluations.
A survey of multilingual text retrieval
- 1996
Cited by 58 (7 self)
This report reviews the present state of the art in selection of texts in one language based on queries in another, a problem we refer to as "multilingual" text retrieval. Present applications of multilingual text retrieval systems are limited by the cost and complexity of developing and using the multilingual thesauri on which they are based and by the level of user training that is required to achieve satisfactory search effectiveness. A general model for multilingual text retrieval is used to review the development of the field and to describe modern production and experimental systems. The report concludes with some observations on the present state of the art and an extensive bibliography of the technical literature on multilingual text retrieval.
Dictionary Methods for Cross-Lingual Information Retrieval
- In Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, 1996
Cited by 57 (5 self)
Multi-lingual information retrieval (IR) has largely been limited to the development of systems for use with a specific foreign language. The explosion in the availability of electronic media in languages other than English makes the development of IR systems that can cross language boundaries increasingly important. In this paper, we present experiments that analyze the factors that affect dictionary-based methods for cross-lingual retrieval and present methods that dramatically reduce the errors such an approach usually makes.
Improving machine translation performance by exploiting non-parallel corpora
- Computational Linguistics, 2005
Cited by 56 (2 self)
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
Translingual Information Retrieval: Learning from Bilingual Corpora
- Artificial Intelligence, 1997
Cited by 34 (5 self)
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms dictionary-based methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) also perform well, as do translingual pseudo relevance feedback (PRF) and Example-Based Term-in-context Translation (EBT). All showed relatively small performance loss between monolingual and translingual versions, ranging from 87% to 101% of monolingual IR performance. Query translation based on a general...
Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text
- In Third Conference of the Association for Machine Translation in the Americas, 1998
Cited by 29 (2 self)
Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world's languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention.
1 Introduction
In recent years large parallel corpora have taken on an important role as resources in machine translation and multilingual natural la...
Using Structured Queries for Disambiguation in Cross-Language Information Retrieval
- 1997
Cited by 27 (1 self)
Bilingual transfer dictionaries are an important resource for query translation in cross-language text retrieval. However, term translation is not an isomorphic process, so dictionary-based systems must address the problem of ambiguity in language translation. In this paper, we claim that Boolean conjunction (the AND operator) provides simple and automatic disambiguation in the target language. We derive a new weighted Boolean model based on a probabilistic formulation and apply it to the cross-language text retrieval problem. The results suggest that the weighted Boolean model is highly effective for general text retrieval, but more experimental evidence is needed to conclude that it is particularly advantageous for cross-language application. Nonetheless, the preliminary results are quite promising.
1 Introduction
With the ongoing development of multilingual information retrieval systems, researchers are becoming increasingly interested in the problem of cross-language information retrie...
Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings
- Information Retrieval, 2001
Cited by 20 (3 self)
This paper reviews literature on dictionary-based cross-language information retrieval (CLIR) and presents CLIR research done at the University of Tampere (UTA). The main problems associated with dictionary-based CLIR, as well as appropriate methods to deal with the problems, are discussed. We will present the structured query model by Pirkola and report findings for four different language pairs concerning the effectiveness of query structuring. The architecture of our automatic query translation and construction system is presented.

