Knowledge-Lite Extraction of Multi-Word Units with Language Filters and Entropy Thresholds
- In Proceedings of RIAO'2000, Collège de
, 2000
Cited by 16 (3 self)
In this paper two approaches to knowledge-lite terminology extraction are compared, both involving language filters which are used to remove ill-formed multi-word units (MWUs). A knowledge-lite approach entails swift portability to new languages and to new domains, which is difficult to achieve if knowledge-intensive resources such as grammars, parsers, taggers and lexicons are used. The two approaches described in this paper have been applied in monolingual term extraction for translation purposes as well as in a pre-processing stage for bilingual word and MWU alignment. The implemented software has been tested for Swedish, English, German and French.
Introduction
Identifying terminology in a corpus of texts is related to the problem of identifying collocations and phrases. Producing compilations of such multi-word units is not a trivial problem. Statistical methods based on frequency or measuring mutual information scores for strings of words (cf. Choueka, 1988; Smadja, 1993; Nagao...
A Domain Specific Lexicon Acquisition Tool for Cross-Language Information Retrieval
- Proceedings of RIAO'97
, 1997
Cited by 8 (4 self)
With the recent enormous increase of information dissemination via the web as an incentive, there is a growing interest in supporting tools for cross-language retrieval. In this paper we describe a disclosure and retrieval approach that fulfills the needs of both information providers and users by offering fast and cheap access to large amounts of documents from various language domains. Relevant information can be retrieved irrespective of the language used for the specification of a query. To realize this type of multilingual functionality, several translation tools are needed, both of a generic and a domain-specific nature. Domain-specific tools are often not available, or available only at high cost. In this paper we will therefore focus on a way to reduce these costs, namely the automatic derivation of multilingual resources from so-called parallel text corpora. The benefits of this approach will be illustrated for an example system, i.e. the demonstrator deve...
Multilingual functionality in the TwentyOne project
- In AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence
, 1997
Cited by 7 (2 self)
TwentyOne is an EU-funded project which aims at developing advanced indexing and retrieval techniques for multimedia document bases. The document base consists of documents in four languages: Dutch, English, French and German. This paper focusses on the multilingual aspects of the project: cross-language retrieval, partial document translation techniques and automatic hyperlinking between source text and translations.
Introduction
TwentyOne is a project funded by the EU Telematics Application Programme. Project partners include academic partners like the Universities of Twente and Tübingen, companies like Getronics and Xerox, contract research organisations like TNO and DFKI and non-profit environmental organisations like Friends of the Earth. The project can be characterised by the following keywords:
Document conversion
The TwentyOne system aims at the disclosure of documents of different media types and / or data formats e.g. paper documents, WEB documents, word processor d...
A Nonparametric Method for Extraction of Candidate Phrasal Terms
- Proceedings of ACL’2005
, 2005
Cited by 4 (0 self)
This paper introduces a new method for identifying candidate phrasal terms (also known as multiword units) which applies a nonparametric, rank-based heuristic measure. Evaluation of this measure, the mutual rank ratio metric, shows that it produces better results than standard statistical measures when applied to this task.
A domain specific lexicon acquisition tool for cross-language information retrieval
- In Proceedings of RIAO'97 Conference on Computer-Assisted Searching on the Internet
, 1997
Cited by 3 (3 self)
With the recent enormous increase of information dissemination via the web as an incentive, there is a growing interest in supporting tools for cross-language retrieval. In this paper we describe a disclosure and retrieval approach that fulfills the needs of both information providers and users by offering fast and cheap access to large amounts of documents from various language domains. Relevant information can be retrieved irrespective of the language used for the specification of a query. To realize this type of multilingual functionality, several translation tools are needed, both of a generic and a domain-specific nature. Domain-specific tools are often not available, or available only at high cost. In this paper we will therefore focus on a way to reduce these costs, namely the automatic derivation of multilingual resources from so-called parallel text corpora. The benefits of this approach will be illustrated for an example system, i.e. the demonstrator developed within the project Twenty-One, which is tuned to information from the area of sustainable development.
Bigram Statistics Revisited: A Comparative Examination of Some Statistical Measures in Morphological Analysis of Japanese Kanji Sequences
, 1996
Cited by 2 (2 self)
this paper, i.e. χ² (Hoel, 1971; Fienberg, 1977; Reynolds, 1977), the likelihood ratio test (Hoel, 1971; Fienberg, 1977; Reynolds, 1977; Dunning, 1993), Yule's coefficient of colligation Y (Yule, 1944; Reynolds, 1977; Delcourt, 1992; 1994), and mutual information (Fano, 1961; Church, Gale, Hanks and Hindle, 1990; Church and Hanks, 1990)
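For readers unfamiliar with the four measures named in this abstract, the following is a minimal sketch of their textbook forms, computed from a 2x2 contingency table of bigram counts. It is an illustration under stated assumptions, not the paper's own formulation: the function name `association_scores` and the cell labels `a`, `b`, `c`, `d` are hypothetical, the likelihood ratio is given in its common G² form, and mutual information is given in its pointwise form in bits.

```python
import math


def association_scores(a, b, c, d):
    """Textbook association measures for a bigram (x, y).

    Hypothetical cell layout (an assumption of this sketch):
      a = count of x followed by y
      b = count of x followed by something other than y
      c = count of something other than x followed by y
      d = count of all remaining bigrams
    Assumes a > 0 and that no row or column total is zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d

    # Pearson chi-square statistic for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)); empty cells contribute 0.
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)

    # Yule's coefficient of colligation Y, in the range [-1, 1].
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    yule_y = (root_ad - root_bc) / (root_ad + root_bc)

    # Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y))).
    pmi = math.log2(a * n / (row1 * col1))

    return {"chi2": chi2, "g2": g2, "yule_y": yule_y, "pmi": pmi}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(association_scores(20, 5, 5, 70))
```

All four scores are zero when the two elements co-occur exactly as often as independence predicts, and grow as the pair becomes more strongly associated.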
The EU project 'Twenty-One' and cross-language IR
, 1997
Cited by 1 (0 self)
TwentyOne is an EU-funded project which aims at developing advanced indexing and retrieval techniques for multimedia document bases. The document base consists of documents in four languages: Dutch, English, French and German. This paper focusses on the multilingual aspects of the project: cross-language retrieval, partial document translation techniques and automatic hyperlinking between source text and translations.
1 Introduction
TwentyOne is a project funded by the EU Telematics programme (IE-2108). Project partners include academic partners like the Universities of Twente and Tübingen, companies like Getronics and Xerox, contract research organisations like TNO and DFKI and non-profit environmental organisations like Friends of the Earth. The project can be characterised by the following keywords:
Document conversion
The TwentyOne system aims at the disclosure of documents of different media types and / or data formats e.g. paper documents, WEB documents, word processor docu...

