Stylistic Experiments For Information Retrieval
, 2000
Cited by 47 (8 self)
Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is the difference between two ways of saying the same thing -- and systematic stylistic variation can be used to characterize the genre of documents. These experiments investigate if stylistic information is distinguishable using simple language engineering methods, and if in that case this type of information can be used to improve information retrieval systems.
Morphological typology of languages for IR
- Journal of Documentation
, 2001
Cited by 20 (2 self)
This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of every language in the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross-language retrieval research and CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research. It studies how the indexes of synthesis and fusion could be used as practical tools in mono- and cross-lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies made in different languages on the effects of morphology and stemming in IR.
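As a rough illustration of the first variable, the index of synthesis is commonly defined as the average number of morphemes per word. The sketch below assumes the words have already been segmented into morphemes by hand; the function name and the toy samples are hypothetical and are not taken from the paper. The index of fusion is not covered here, since it needs a judgment about how cleanly morpheme boundaries can be separated.

```python
def index_of_synthesis(segmented_words):
    """Return the average number of morphemes per word.

    `segmented_words` is a list of words, each given as a list of
    morphemes (the segmentation itself is assumed to be done elsewhere).
    """
    if not segmented_words:
        raise ValueError("need at least one segmented word")
    total_morphemes = sum(len(morphemes) for morphemes in segmented_words)
    return total_morphemes / len(segmented_words)


# Toy, hand-segmented samples (illustrative only, not corpus data).
english_sample = [["the"], ["dog", "s"], ["bark", "ed"]]
finnish_sample = [["talo", "i", "ssa", "ni"], ["koira", "t"]]

print(index_of_synthesis(english_sample))
print(index_of_synthesis(finnish_sample))
```

A higher value indicates a more synthetic sample; with real data the index would be computed over a running text sample rather than a handful of words.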
Biomedical Text Retrieval in Languages with a Complex Morphology
- Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain
, 2002
Cited by 5 (0 self)
Document retrieval in languages with a rich and complex morphology -- particularly in terms of derivation and (single-word) composition -- suffers from serious performance degradation with the stemming-only query-term-to-text-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, named entities, acronyms), and subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a large biomedical document collection.
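The idea of indexing subwords instead of whole word forms can be sketched as follows. This is a minimal illustration, not the authors' system: the greedy longest-match segmenter, the `LEXICON` of subwords, and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def segment(word, lexicon):
    """Greedy longest-match segmentation into known subwords.

    Returns a list of subwords, or None if the word cannot be fully
    covered. No backtracking is attempted, so some segmentable words
    may be missed (a simplification for this sketch).
    """
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no known subword starts at this position
    return parts


def build_index(docs, lexicon):
    """Build an inverted index whose keys are subwords, not word forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            # Fall back to the whole token when segmentation fails.
            units = segment(token, lexicon) or [token.lower()]
            for unit in units:
                index[unit].add(doc_id)
    return index


# Hypothetical subword lexicon and documents (German medical compounds).
LEXICON = {"magen", "schleim", "haut", "entzündung"}
DOCS = {1: "Magenschleimhautentzündung", 2: "Hautentzündung"}

index = build_index(DOCS, LEXICON)
print(sorted(index["haut"]))
```

With subwords as index units, a query for "haut" reaches both documents even though neither contains that string as a separate word.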
Bilingual tests with Swedish, Finnish and German queries: dealing with morphology, compound words and query structure
- In Carol Peters (Ed.) Cross-Language Information Retrieval and Evaluation: Proceedings of the CLEF 2000 Workshop, Lecture Notes in Computer Science 2069
, 2001
Cited by 5 (5 self)
We used a dictionary-based approach, and performed tests in the bilingual track with three language pairs, i.e., Swedish – English (Swe-Eng), Finnish – English (Fin-Eng), and German – English (Ger-Eng). All the source languages are compound languages, i.e., languages rich in compound words. A compound word refers to a multi-word expression where the component words are written together. Our main efforts were to develop techniques for the processing of compounds, to study different types of compound languages, and to study the effects of query structuring in different languages. We designed and implemented a method for automated query construction in FIN | SWE | GER -> ENG. The goal of this process is to automatically extract topical information from sentences written in one of the source languages (FIN, SWE, GER) and to create a target language (ENG) query. The resulting query may be either structured or unstructured.
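The difference between a structured and an unstructured target query can be sketched roughly as below. This is not the authors' implementation: the two-part compound splitter, the `TOY_DICT` entries, and the convention that a structured query groups the translation alternatives of each source component are assumptions made for illustration.

```python
def split_compound(word, lexicon):
    """Split a word into two known components, or return it unsplit.

    Only two-part compounds without linking elements are handled
    (a simplification; real compounds often need more).
    """
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]


def build_query(source_keys, dictionary, structured=True):
    """Translate source keys via a bilingual dictionary.

    structured=True  -> list of groups, one per translated source part
    structured=False -> flat list of all translation alternatives
    """
    groups = []
    for key in source_keys:
        parts = [key] if key in dictionary else split_compound(key, dictionary)
        for part in parts:
            translations = dictionary.get(part, [])
            if translations:  # untranslatable parts are dropped here
                groups.append(list(translations))
    if structured:
        return groups
    return [word for group in groups for word in group]


# Hypothetical German-English dictionary entries.
TOY_DICT = {
    "stadt": ["city", "town"],
    "bibliothek": ["library"],
    "auto": ["car", "automobile"],
}

keys = ["stadtbibliothek", "auto"]
print(build_query(keys, TOY_DICT, structured=True))
print(build_query(keys, TOY_DICT, structured=False))
```

In the structured form, a retrieval system can treat each group as a set of alternatives for one source concept, whereas the unstructured form gives every translation equal, independent weight.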
Cross-Lingual Information Retrieval Problems: Methods and findings for three language pairs
In this paper we will discuss dictionary-based cross-language information retrieval (CLIR) methods, and report recent findings and problems. We will consider three language pairs for CLIR: Finnish to English, English to Finnish, Swedish to English. We show that Finnish and Swedish have special features, e.g., the frequency of homographs and a high frequency of compound words that affect retrieval effectiveness. Especially correct word form normalization and compound splitting are essential. We report findings concerning the effectiveness of various query translation methods, query structures and linguistic tools used for CLIR. We also point out some problems and deficiencies in such tools.
1. Introduction
There is an increasing amount of full text material in various languages available through the Internet and other information suppliers. Therefore cross-language information retrieval (CLIR) has become an important new research area (Oard & Dorr, 1996; Pirkola, 1999). It is a process of selecting and ranking documents in a language different from the query language. One of the main approaches to CLIR is based on bilingual translation dictionaries. For an overview of the approaches, see (Hull & Grefenstette, 1996; Oard & Dorr, 1996; Pirkola, 1999). The main problems associated with dictionary-based CLIR are (1) phrase identification and translation, (2) source language ambiguity, (3) translation ambiguity, (4) the coverage of dictionaries, (5) the processing of inflected words, and (6) untranslatable keys, in particular proper names spelled differently in different languages. Translation ambiguity refers to the proportional increase of bad keys due to translation. Research has developed many effective methods to...

