Results 21 - 30 of 51
CONTROL: CLEF-2003 with Open, Transparent Resources Off-Line. Experiments with merging strategies
- In C. Peters (Ed.), Results of
, 2003
Cited by 3 (0 self)
Corpus-based approaches to CLIR have been studied for many years. However, using commercial MT systems for CLEF has been considered easier and better performing. Our goal is to be one of the CLEF participants who show that the hypothetical performance drop is not large enough to justify the loss of control and transparency, especially for research systems. We participated in two bilingual runs and the small multilingual run using software and data that are free to obtain, transparent and modifiable.
Annotating predicate-argument structure for a parallel treebank
- Proceedings of the LREC Workshop “Building Lexical Resources from Semantically Annotated Corpora”
, 2004
Cited by 2 (1 self)
We report on a recently initiated project which aims at building a multi-layered parallel treebank of English and German. Particular attention is devoted to a dedicated predicate-argument layer which is used for aligning translationally equivalent sentences of the two languages. We describe both our conceptual decisions and aspects of their technical realisation. We discuss some selected problems and conclude with a few remarks on how this project relates to similar projects in the field.
Resource Selection for Domain-Specific Cross-Lingual IR
- In Proc. of the 27th Annual Int’l ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR
, 2004
Cited by 2 (0 self)
An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods - with different training corpora - on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN), versus a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We start exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. By using cosine similarity between training and target corpora as resource weights we obtained an average of 5.6% improvement over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
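The resource-weighting idea in this abstract can be illustrated with a minimal sketch. It assumes simple whitespace tokenisation and raw term frequencies, and it normalises the weights to sum to 1; the abstract specifies none of these details, so the function names and toy corpora below are hypothetical.

```python
import math
from collections import Counter


def term_vector(docs):
    """Build a bag-of-words term-frequency vector for a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def resource_weights(resources, target_docs):
    """Weight each training resource by its similarity to the target corpus.

    Normalising the weights to sum to 1 is an assumption of this sketch.
    """
    target = term_vector(target_docs)
    sims = {name: cosine(term_vector(docs), target)
            for name, docs in resources.items()}
    total = sum(sims.values())
    if total == 0:
        # No overlap with any resource: fall back to uniform weights.
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}


# Toy data, invented for illustration only.
resources = {
    "medical": ["patient dose therapy", "clinical trial of therapy"],
    "news": ["election results today", "market prices fall"],
}
target = ["therapy for the patient", "clinical dose"]
weights = resource_weights(resources, target)
```

In this toy setup the medical resource receives a higher weight than the news resource because it shares vocabulary with the target collection.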
String kernels and similarity measures for information retrieval
, 2006
Cited by 2 (0 self)
Measuring the similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Lately, kernel-based methods have been proposed for this task, both for text and biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretical approaches have also been the subject of recent research. The goal is to classify finite sequences without explicit knowledge of their statistical nature: sequences are considered similar if they are likely to be generated by the same source. There is experimental evidence that relative entropy (albeit not being a true metric) yields high accuracy in several classification tasks. Compression-based techniques, such as variations of the Ziv-Lempel algorithm for text, or GenCompress for biological sequences, have been used to estimate the relative entropy. Algorithmic concepts based on the Kolmogorov complexity provide the theoretical background for these approaches. This paper describes some string kernels and information-theoretic methods. It evaluates the performance of both kinds of methods in text classification tasks, namely in the problems of authorship attribution, language detection, and cross-language document matching.
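To give a feel for compression-based similarity, the sketch below computes the normalized compression distance (NCD) with Python's standard `zlib` module, which implements a Ziv-Lempel variant. NCD is a related technique substituted here for illustration; it is not necessarily the relative-entropy estimator evaluated in the paper, and the sample strings are invented.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means more similar."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented sample texts; repetition makes the strings long enough to compress.
english = "the quick brown fox jumps over the lazy dog " * 20
german = "der schnelle braune fuchs springt über den faulen hund " * 20

same = ncd(english, english)
different = ncd(english, german)
```

A text concatenated with itself compresses almost as well as the text alone, so `same` comes out smaller than `different`. On very short inputs the compressor's fixed overhead dominates and the distance becomes unreliable.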
Bloom filter and lossy dictionary based language models
, 2007
Cited by 2 (0 self)
Language models are probability distributions over a set of unilingual natural language text used in many natural language processing tasks such as statistical machine translation, information retrieval, and speech processing. Since more well-formed training data means a better model, and given the increased availability of text via the Internet, the size of language modelling n-gram data sets has grown exponentially over the past few years. The latest data sets available can no longer fit on a single computer. A recent investigation reported the first known use of a probabilistic data structure to create a randomised language model capable of storing probability information for massive n-gram sets in a fraction of the space normally needed. We report and compare the properties of lossy language models using two probabilistic data structures: the Bloom filter and lossy dictionary. The Bloom filter has exceptional space requirements and only one-sided, false positive error returns, but it is computationally slow at scale, which is a potential drawback for a structure being queried millions of times per sentence. Lossy dictionaries have low space requirements and are very fast but with two-sided error that returns
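The one-sided error of a Bloom filter can be seen in a minimal sketch. The version below stores only n-gram membership, not the probability information a language model would need, and it derives its bit positions from salted SHA-256 digests; the class name, sizes, and hashing scheme are assumptions made for illustration, not the design used in the work above.

```python
import hashlib


class BloomFilter:
    """Set-membership filter with false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Every position must be set; any clear bit proves absence.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical trigram inventory.
ngrams = ["the cat sat", "cat sat on", "sat on the"]
bf = BloomFilter(num_bits=1024, num_hashes=3)
for ngram in ngrams:
    bf.add(ngram)
```

Every inserted n-gram is reported as present, while an unseen n-gram is usually, but not always, reported as absent. Each query costs `num_hashes` hash computations, which hints at why lookup speed matters when a structure is queried millions of times per sentence.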
Oard: CLEF-2005 CL-SR at Maryland: Document and Query Expansion using Side Collections and Thesauri
Cited by 1 (0 self)
This paper reports results for the University of Maryland’s participation in the CLEF-2005 Cross-Language Speech Retrieval track. Techniques that were tried include: (1) document expansion with manually created metadata (thesaurus keywords and segment summaries) from a large side collection, (2) query refinement with pseudo-relevance feedback, (3) keyword expansion with thesaurus synonyms, and (4) cross-language speech retrieval using translation knowledge obtained from the statistics of a large parallel corpus. The results show that document expansion and query expansion using blind relevance feedback were effective, although optimal parameter choices differed somewhat between the training and evaluation sets. Document expansion in which manually assigned keywords were augmented with thesaurus synonyms yielded marginal gains on the training set, but no improvement on the evaluation set. Cross-language retrieval with French queries yielded 79% of monolingual mean average precision when searching manually assigned metadata despite a substantial domain mismatch between the parallel corpus and the retrieval task. Detailed failure analysis indicates that speech recognition errors for named entities were an important factor that substantially degraded retrieval effectiveness.
Thomson Legal and Regulatory Experiments for CLEF 2002
- In Working Notes for the CLEF 2002 Workshop, Italy 2002
, 2002
Cited by 1 (0 self)
Thomson Legal and Regulatory participated in the monolingual, the bilingual and the multilingual tracks. Our monolingual runs added Swedish to the languages we had submitted in previous participations. Our bilingual and multilingual efforts used English as the query language. We experimented with dictionaries and similarity thesauri for the bilingual task, while we used machine translations in our multilingual runs. Our various merging strategies had limited success compared to a simple round robin.
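For readers unfamiliar with the round-robin baseline mentioned here, the following sketch interleaves per-language ranked lists by rank position. The document identifiers are invented, and skipping duplicates is an assumption of this sketch rather than a detail taken from the paper.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists, limit=None):
    """Interleave ranked result lists, taking one document from each in turn.

    Assumes document IDs are never None, since None marks exhausted lists.
    """
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for doc_id in tier:
            if doc_id is None or doc_id in seen:
                continue
            seen.add(doc_id)
            merged.append(doc_id)
    return merged if limit is None else merged[:limit]


# Hypothetical per-language runs, best document first.
runs = [["en1", "en2", "en3"], ["de1", "de2"], ["fr1"]]
merged = round_robin_merge(runs)
# merged == ["en1", "de1", "fr1", "en2", "de2", "en3"]
```

Round robin ignores retrieval scores entirely, which makes it a natural baseline for score-based merging strategies.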
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
- In 4th International Conference, EsTAL 2004, J.L. Vicedo et al. (Eds.), LNAI 3230
, 2004
Cited by 1 (1 self)
Word fragments or n-grams have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] or, even, genetic classification of languages [5]. All these techniques share some common aspects such as: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, although not as computationally expensive. This new approach can be simultaneously used to perform document categorization and clustering, information retrieval, and text summarization. In this paper we will describe the foundations of such a technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
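The classical baseline that this abstract contrasts with blindLight can be sketched as follows: character n-grams weighted by relative frequency, compared by cosine similarity, and applied to language identification. This shows only the classical scheme; blindLight's n-gram significances and alignment-inspired metric are not reproduced here, and the training sentences and function names are invented.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Map text to character n-grams weighted by relative frequency."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def identify_language(text, profiles):
    """Return the language whose profile is closest to the text."""
    query = ngram_profile(text)
    return max(profiles,
               key=lambda lang: cosine_similarity(query, profiles[lang]))


# Tiny invented training samples; real profiles need far more text.
profiles = {
    "english": ngram_profile(
        "the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": ngram_profile(
        "el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
guess_en = identify_language("the dog and the fox", profiles)
guess_es = identify_language("el perro y el gato", profiles)
```

Because the profiles are built from characters rather than words, the same code works without language-specific tokenisation or stop-word lists.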
Application of variable length n-gram vectors to monolingual and bilingual information retrieval
- In CLEF
, 2004
Cited by 1 (0 self)
Our group in the Department of Informatics at the University of Oviedo has participated, for the first time, in two tasks at CLEF: monolingual (Russian) and bilingual (Spanish-to-English) information retrieval. Our main goal was to test the application to IR of a modified version of the n-gram vector space model (codenamed blindLight). This new approach has been successfully applied to other NLP tasks such as language identification or text summarization and the results achieved at CLEF 2004, although not exceptional, are encouraging. There are two major differences between the blindLight approach and classical techniques: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, though not as computationally expensive. In order to perform cross-language IR we have developed a naive n-gram pseudo-translator similar to those described by McNamee and Mayfield or Pirkola et al.
Mixed-mode multilinguality in TTS: The case of Canadian French
- In: Proc. Multiling2006
, 2006
Cited by 1 (0 self)
The coexistence of English and French in Canada presents a number of interesting problems for text-to-speech (TTS) synthesis. The pronunciation of Canadian French is fairly well documented and can be captured by recording a speaker of the appropriate dialect for the voice database. The desired behavior of the system in speaking the many English words, names, and expressions that can populate French text, however, is not well understood, varying from user to user and from context to context. In this paper we present an analysis of English in Canadian French TTS, examining the intelligibility and preferability of English and French pronunciations. Our results suggest that it is best to consider different modes of synthesis, ranging from near-English pronunciation of English terms to near-French, and that different tasks require different approaches to the problem.

