Augmenting Naive Bayes Classifiers with Statistical Language Models
, 2003
Cited by 38 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
Combining Naive Bayes and n-Gram Language Models for Text Classification
- In 25th European Conference on Information Retrieval Research (ECIR
, 2003
Cited by 26 (2 self)
We augment the naive Bayes model with an n-gram language model to address two shortcomings of naive Bayes text classifiers.
Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information
- In PAKDD 2003, LNCS
, 2003
Cited by 6 (0 self)
Most document classification systems consider only the distribution of content words of the documents, ignoring the syntactic information underlying the documents though it is also an important factor.
Non-contiguous word sequences for information retrieval
- In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Workshop on Multiword Expressions: Integrating Processing
, 2004
Cited by 5 (2 self)
The growing amount of textual information available electronically has increased the need for high-performance retrieval. The use of phrases was long seen as a natural way to improve retrieval performance over the common document models that ignore the sequential aspect of word occurrences in documents, considering them as “bags of words”. However, both statistical and syntactical phrases showed disappointing results for large document collections. In this paper we present a recent type of multi-word expression in the form of Maximal Frequent Sequences (Ahonen-Myka, 1999). These are mined phrases rather than statistical or syntactical phrases; their main strengths are that they form a very compact index and account for the sequentiality and adjacency of meaningful word co-occurrences by allowing for a gap between words. We introduce a method for using these phrases in information retrieval and present our experiments. They show a clear improvement over the well-known technique of extracting frequent word pairs.
Identifying variable-length meaningful phrases with correlation functions
- IEEE International Conference on Tools with Artificial Intelligence, IEEE
Cited by 2 (2 self)
Finding meaningful phrases in a document has been studied in various information retrieval systems in order to improve performance. Many previous statistical phrase-finding methods had a different aim, such as document classification. Some are hybrids of statistical and syntactic grammatical methods; others use correlation heuristics between words. We propose a new phrase-finding algorithm that adds correlated words one by one to the phrases found in the previous stage, maintaining high correlation within a phrase. Our results indicate that our algorithm finds more meaningful phrases than an existing algorithm. Furthermore, the previous algorithm could be improved by applying different correlation functions.
Let’s phrase it: INEX topics need keyphrases
- In Proceedings of the SIGIR 2008 Workshop on Focused Retrieval
, 2008
Cited by 2 (1 self)
In this paper, we study and discuss the usage of phrases in the INEX evaluation of XML retrieval as well as in related research. We find that the INEX framework could easily become a unique testbed for researchers interested in the exploitation of complex terms in IR, while triggering interest from others. Unfortunately, our analysis of the use of keyphrases in INEX topics shows a downward trend over the years that affects the attention of participants. While NEXI, the official query format of INEX, does indeed support keyphrases, its full potential does not materialize, as topic contents show a lack of consistency in their markup. In 2007, 87% of the INEX queries contained keyphrases, but only 11% of those were marked up. We present simple and low-cost solutions to let the INEX collections deliver their full potential in keyphrase retrieval.
Choosing the Right Bigrams for Information Retrieval
- In Proceedings of the Meeting of the International Federation of Classification Societies
, 2004
Cited by 2 (0 self)
After more than 30 years of research in information retrieval, the dominant paradigm remains the “bag-of-words”, in which query terms are considered independent of their co-occurrences with each other. Although there has been some work on incorporating phrases or other syntactic information into IR, such attempts have given modest and inconsistent improvements, at best. This paper is a first step toward investigating more deeply the question of using bigrams for information retrieval. Our results indicate that only certain kinds of bigrams are likely to aid retrieval. We used linear regression methods on data from TREC 6, 7, and 8 to identify which bigrams are able to help retrieval at all. Our characterization was then tested through retrieval experiments using our information retrieval engine, AIRE, which implements many standard ranking functions and retrieval utilities.
Advanced Document Description, a Sequential Approach
, 2005
Cited by 1 (1 self)
To be able to perform efficient document processing, information systems need to use simple models of documents that can be treated in a smaller number of operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
IIT at TREC-10
, 2001
Cited by 1 (0 self)
For TREC-10, we participated in the adhoc and manual web tracks and in both the site-finding and cross-lingual tracks. For the adhoc track, we did extensive calibrations and learned that combining similarity measures yields little improvement. This year, we focused on a single high-performance similarity measure. For site finding, we implemented several algorithms that did well on the data provided for calibration, but poorly on the real dataset. For the cross-lingual track, we calibrated on the monolingual collection, and developed new Arabic stemming algorithms as well as a novel dictionary-based means of cross-lingual retrieval. Our results in this track were quite promising, with seventeen of our queries performing at or above the median.

