Textual Representations for Corpus-Based Bilingual Retrieval, 2008
Abstract
Cited by 1 (0 self)
The traditional approach to information retrieval is based on using words as the indexing and search terms for documents. However, word-based representations have difficulty addressing morphological processes that confound retrieval, such as inflection, derivation, and compounding. One part of this research investigates alternative methods for representing text, including a method based on overlapping sequences of characters called n-gram tokenization. N-grams are studied in depth, and one notable finding is that they achieve a 20% improvement in retrieval effectiveness over words in certain situations. The other focus of this research is improving retrieval performance when foreign-language documents must be searched and translation is required. In this scenario bilingual dictionaries are often used to translate user queries; however, even among the most commonly spoken languages, for which large bilingual lexicons exist, dictionary-based translation suffers from several significant problems. These include difficulty handling proper names, which are often missing; issues related to morphological variation, since entries or query terms may not be lemmatized; and an inability to robustly handle multiword phrases, especially non-compositional expressions. These problems can be addressed when
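As a rough illustration of the n-gram tokenization mentioned in the abstract above, the following minimal sketch splits text into overlapping character sequences. The function name `char_ngrams`, the default length of 4, and the lowercasing and whitespace normalization are assumptions made for this example; the abstract does not specify them.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`, in order.

    Hypothetical helper: the normalization (lowercasing, collapsing
    whitespace) and the default n=4 are assumptions, not taken from
    the cited work.
    """
    # Lowercase and collapse runs of whitespace to a single space.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n yields itself as the only token (if non-empty).
        return [normalized] if normalized else []
    # Slide a window of width n across the text, one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("retrieval", n=4))
```

Because inflected forms such as "retrieve" and "retrieval" share several of these character sequences, an index built on them can match variants that a word-based index would treat as unrelated terms.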
Paraphrase Recognition using Neural Network Classification
Abstract
Paraphrasing refers to conveying the same content in several ways. The successful recognition of paraphrases is crucial to various natural language processing tasks such as Information Extraction, Document Summarization, and Question Answering. Several techniques have been employed for paraphrase recognition using lexical, syntactic and semantic features. Many of these systems have been tested on the Microsoft Research Paraphrase Corpus, but their performance leaves room for improvement. Since neural network architectures model the human brain structure, which excels at natural language processing tasks, this paper presents a neural network classifier for recognizing paraphrases. A combination of lexical, syntactic and semantic features has been used to train a backpropagation network. The system can be utilized for detecting similar sentences in applications such as Question Answering and detection of plagiarized content.
filtering
Abstract
Wikipedia is an online encyclopedia that anyone can access and edit. It has become one of the most important sources of knowledge online, and many third-party projects rely on it for a wide range of purposes. The open model of Wikipedia allows pranksters, lobbyists and spammers to attack the integrity of the encyclopedia, and this endangers it as a public resource. This is known in the community as vandalism. A plethora of methods have been developed within the Wikipedia and scientific communities to tackle this problem. We have participated in this effort and developed one of the leading approaches. Our research aims to create a fully working antivandalism system and deploy it in the real world.
JOHANNES LEVELING
Abstract
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: (1) how to create a simple, language-independent corpus-based stemmer, (2) how to identify sub-words and which types of sub-words are suitable as indexing units, and (3) how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance
Skip N-grams and Ranking Functions for Predicting Script Events
Abstract
In this paper, we extend current state-of-the-art research on unsupervised acquisition of scripts, that is, stereotypical and frequently observed sequences of events. We design, evaluate and compare different methods for constructing models for script event prediction: given a partial chain of events in a script, predict other events that are likely to belong to the script. Our work aims to answer key questions about how best to (1) identify representative event chains from a source text, (2) gather statistics from the event chains, and (3) choose ranking functions for predicting new script events. We make several contributions, introducing skip-grams for collecting event statistics, designing improved methods for ranking event predictions, defining a more reliable evaluation metric for measuring predictiveness, and providing a systematic analysis of the various event prediction models.
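To make the idea of collecting event statistics with skip-grams more concrete, here is a minimal sketch that counts ordered event pairs separated by at most a fixed number of intervening events. The function name `skip_bigrams`, the `max_skip` parameter, and the restaurant-style event labels are hypothetical; the paper's exact definition of skip-grams may differ from this one.

```python
from collections import Counter


def skip_bigrams(chain, max_skip=1):
    """Return ordered event pairs with at most `max_skip` events between them.

    Hypothetical helper: max_skip=0 gives ordinary adjacent bigrams.
    """
    pairs = []
    for i, first in enumerate(chain):
        # The partner may be up to max_skip + 1 positions to the right.
        last = min(i + 2 + max_skip, len(chain))
        for j in range(i + 1, last):
            pairs.append((first, chain[j]))
    return pairs


def count_pairs(chains, max_skip=1):
    """Accumulate skip-bigram counts over a collection of event chains."""
    counts = Counter()
    for chain in chains:
        counts.update(skip_bigrams(chain, max_skip))
    return counts


if __name__ == "__main__":
    # Illustrative event chain; the labels are invented for this example.
    chain = ["enter", "order", "eat", "pay"]
    print(skip_bigrams(chain, max_skip=1))
```

Allowing skips yields more pair observations per chain than adjacent bigrams alone, which may help when event chains are short or sparse.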

