On the Use of Words and N-grams for Chinese Information Retrieval
- In Fifth International Workshop on Information Retrieval with Asian Languages, IRAL2000, Hong Kong, 2000
Cited by 13 (2 self)
In the processing of Chinese documents and queries in information retrieval (IR), one has to
English-Chinese Cross-Language IR using Bilingual Dictionaries
- 2001
Cited by 7 (1 self)
This report describes the English-Chinese cross-language retrieval experiments at Berkeley for the TREC-9 Cross-Language Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.
The Johns Hopkins University 2003 Chinese-English machine translation system
- In Proceedings of the MT Summit IX, 2003
Cited by 5 (1 self)
We describe a Chinese to English Machine Translation system developed at the Johns Hopkins University for the NIST 2003 MT evaluation. The system is based on a Weighted Finite State Transducer implementation of the alignment template translation model for statistical machine translation. The baseline MT system was trained using 100,000 sentence pairs selected from a static bitext training collection. Information retrieval techniques were then used to create specific training collections for each document to be translated. This document-specific training set included bitext and named entities that were then added to the baseline system by augmenting the library of alignment templates. We report translation performance of baseline and IR-based systems on two NIST MT evaluation test sets.
TREC-9 CLIR Experiments at MSRCN
- In Proceedings of the Ninth Text REtrieval Conference (TREC-9), NIST Special Publication, 2000
Cited by 4 (0 self)
In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective means of translation between English and Chinese. On Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudo-relevance feedback, and length normalization, and examined their impact on Chinese IR. On English-Chinese CLIR, we focused on finding effective methods for query translation. Our method incorporates three improvements over simple lexicon-based translation: (1) word/term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model. This method is shown to be as effective as a good MT system.
Phrasal Translation for English-Chinese Cross Language Information Retrieval
- In Workshop on English-Chinese Cross Language Information Retrieval at the 2000 International Conference on Chinese Language Computing, 2000
Cited by 4 (1 self)
This paper introduces a simple and effective nonoverlapping unigram and bigram segmentation method for both monolingual Chinese and English-Chinese cross language retrieval. It also describes English-Chinese cross language retrieval experiments involving 54 topics and some 164,000 documents. The translation of English queries to Chinese is done using a Chinese-English dictionary of about 120,000 entries. A technique for extracting noun phrases is presented and applied prior to query translation. Phrasal translation outperformed word translation by 23.6%, even though most of the noun phrases extracted from the queries were not translated as phrases because of the limited coverage of the bilingual dictionary. The cross language retrieval achieved about 53% of the effectiveness of the monolingual retrieval, which suggests that there is a lot of room for improvement. The two main limiting factors in English-Chinese retrieval performance are the limited coverage of the bilingual dictionary and the existence of multiple Chinese translation equivalents for many English words.
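The abstract above names nonoverlapping unigram and bigram segmentation without giving the procedure. As a rough sketch of one plausible reading, and not necessarily the authors' method, the Python below cuts a run of Chinese characters into consecutive two-character tokens and emits a leftover final character as a unigram. The function names `segment_nonoverlapping` and `overlapping_bigrams` are hypothetical and used only for illustration; the overlapping variant is included for contrast.

```python
def segment_nonoverlapping(run):
    """Split a run of characters into consecutive, nonoverlapping bigrams.

    Assumption (not stated in the abstract): pairs are taken left to right,
    and an odd-length run ends with a single-character (unigram) token.
    """
    return [run[i:i + 2] for i in range(0, len(run), 2)]


def overlapping_bigrams(run):
    """Classic overlapping character bigrams, shown only for comparison."""
    return [run[i:i + 2] for i in range(len(run) - 1)]


if __name__ == "__main__":
    text = "中文信息检索"
    print(segment_nonoverlapping(text))
    print(overlapping_bigrams(text))
```

Under this reading, a six-character run produces three index terms instead of the five produced by overlapping bigrams, which keeps the index smaller at the cost of depending on where the pairing starts.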
Using Self-Supervised Word Segmentation in Chinese Information Retrieval
Cited by 3 (3 self)
We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.
Applying Machine Learning to Text Segmentation for Information Retrieval
- 2002
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give comparable (in some cases even better) retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach.
The HAIRCUT System at TREC-9
- In Proceedings of the Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, 2001
ream. The text was lowercased, punctuation was removed, and diacritical marks were retained. Tokens containing digits were preserved; however only the first two of a sequence of digits were retained (e.g., 1920 became 19##). The result is a stream of blank-separated words. When using n-grams we construct indexing terms from the same sequence of words. These n-grams may span word boundaries; an attempt is made to discover sentence boundaries so that n-grams spanning sentence boundaries are not recorded. Thus n-grams with leading, central, or trailing spaces are formed at word boundaries. Queries were parsed in the same fashion as were documents with two exceptions. On some of our title only runs we attempted to correct the spelling of words that did not occur in our dictionary. Also, we tried to remove stop structure from the description and narrative sections of the queries using a list of about 1000 phrases constructed from past TREC topic statements. words 1,588,374 3,019,547 2.96
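The normalization and n-gram steps described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the HAIRCUT implementation: the function names are hypothetical, punctuation is taken to mean any character that is not a letter, digit, or whitespace, each dropped digit is assumed to become one `#` (consistent with the 1920 example), and sentence-boundary detection is omitted, so the input is treated as a single sentence.

```python
import re
import unicodedata


def normalize(text):
    """Return a stream of blank-separated, lowercased words."""
    # Compose accented letters so diacritics survive punctuation removal.
    text = unicodedata.normalize("NFC", text.lower())
    # Assumption: "punctuation" = anything that is not a word character
    # or whitespace; underscores are treated as punctuation too.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Keep the first two digits of a digit run; assumption: every
    # remaining digit is replaced by '#', so 1920 becomes 19##.
    text = re.sub(
        r"[0-9]{3,}",
        lambda m: m.group()[:2] + "#" * (len(m.group()) - 2),
        text,
    )
    return " ".join(text.split())


def char_ngrams(stream, n):
    """Character n-grams over the word stream, spanning word boundaries.

    The stream is padded with one blank on each side so that n-grams with
    leading or trailing spaces are formed at the outer word boundaries.
    """
    padded = " " + stream + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    stream = normalize("In 1920, the Café opened.")
    print(stream)
    print(char_ngrams(stream, 4))
```

In a fuller version, the text would first be split at detected sentence boundaries and `char_ngrams` applied to each sentence separately, so that no n-gram crosses a sentence boundary.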

