Results 1 - 10 of 57
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, 1998
Cited by 268 (1 self)
The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
OntoSeek: Content-Based Access to the Web, 1999
Cited by 178 (0 self)
In this article, we discuss the special characteristics of online yellow pages and product catalogs, examine linguistic ontologies' role in content matching, and present OntoSeek's architecture.
Understanding yellow pages and product catalogs
Online yellow pages locate suppliers based on a generic natural-language (NL) description of their products and services; product catalogs let users select a specific product or service offered by a certain supplier. These repositories' peculiarities, with respect to generic Web documents, can be roughly characterized by four parameters (see Table 1 for their estimated values):
- vocabulary size: number of concepts necessary to formalize all descriptions in the repository;
- description complexity: average number of concepts for one description;
- description heterogeneity: average number of semantic relations in a description with respect to the t
Feature engineering for text classification
Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999
Cited by 73 (0 self)
Most research in text classification has used the “bag of words” representation of text. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms and hypernyms). We describe the new representations and try to justify our suspicions that they could have improved the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations are not found to produce a significant performance improvement. Finally, we try combining classifiers based on different representations using a majority voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated Natural Language Processing techniques need to be developed before better text representations can be produced. We conclude that for now, research into new learning algorithms and methods for combining existing learners holds the most promise.
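The majority voting step mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote` and the example votes are hypothetical, and the sketch assumes each classifier (one per text representation) outputs a single label for a document.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    `predictions` holds one label per classifier for a single document.
    If several labels tie, the one that appears first in the list wins
    (an assumption of this sketch, not something the abstract specifies).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top_label, _ = counts.most_common(1)[0]
    return top_label


# Hypothetical votes from three classifiers, each trained on a
# different representation (e.g. words, phrases, hypernyms).
votes = [True, False, True]
decision = majority_vote(votes)
```

With an odd number of binary classifiers, as in the usage above, ties cannot occur.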
Noun-Phrase Analysis in Unrestricted Text for Information Retrieval, 1996
Cited by 64 (10 self)
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.
Text categorization of low quality images
In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995
Cited by 52 (2 self)
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be used to categorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The use of a vector space classifier and training method robust to large feature sets, combined with discarding of low frequency OCR output strings are the key to our approach.
The Application of Classical Information Retrieval Techniques to Spoken Documents, 1995
Cited by 32 (1 self)
Table 4.1: Message classes in classification experiments of Rose et al.: Object Description, General Discussion, Map Reading, Photographic Interpretation, Cartoon Description.
Now, an estimate of $I(C_i; w_k)$ can be calculated by a four-way partition of the set of test messages, depending on (a) whether or not a message belongs to topic class $C_i$ and (b) whether or not it contains word $w_k$. If $N$ is the number of messages in the test collection, $R_i$ is the number belonging to topic class $C_i$, $n_k$ is the number of messages containing word $w_k$ and $r_{ik}$ is the number of messages in class $C_i$ containing word $w_k$, then, estimating the probabilities by frequency counts,
$$I(C_i; w_k) = \log \frac{r_{ik} / R_i}{n_k / N}.$$
This is actually identical to a form of retrospective term relevance weight, initially proposed in the IR literature by both Barkla [66] and Miller [67], and reviewed by Robertson and Sparck Jones in their classic paper on the subject [42]. Moreover, Rose proposed, but did no...
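The frequency-count estimate of I(C_i; w_k) quoted in this excerpt can be worked through with a small sketch. It assumes the fraction is read as (r_ik / R_i) divided by (n_k / N), uses the natural logarithm (the excerpt does not state a base), and uses hypothetical counts; the function name `term_class_association` is likewise an illustration, not something from the source.

```python
import math


def term_class_association(N, R_i, n_k, r_ik):
    """Estimate I(C_i; w_k) = log((r_ik / R_i) / (n_k / N)) from counts.

    N    : number of messages in the collection
    R_i  : number of messages in topic class C_i
    n_k  : number of messages containing word w_k
    r_ik : number of messages in class C_i containing word w_k
    """
    if min(N, R_i, n_k) <= 0:
        raise ValueError("N, R_i and n_k must be positive")
    if r_ik == 0:
        # The word never occurs in the class, so the log diverges.
        return float("-inf")
    return math.log((r_ik / R_i) / (n_k / N))


# Hypothetical counts: the word occurs in 20% of the class's messages
# but only 5% of all messages, so the ratio inside the log is 4.
score = term_class_association(N=1000, R_i=100, n_k=50, r_ik=20)

# When the word is exactly as frequent inside the class as overall,
# the ratio is 1 and the estimate is 0.
neutral = term_class_association(N=1000, R_i=100, n_k=50, r_ik=5)
```

A positive value indicates that the word is over-represented in the class relative to the whole collection, a negative value that it is under-represented.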
Fast Statistical Parsing of Noun Phrases for Document Indexing, 1997
Cited by 31 (7 self)
Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experiment's results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
Takagi T: Automatic Construction of Knowledge Base from Biological Papers
Proc Int Conf Intell Syst Mol Biol, 1997
Cited by 29 (4 self)
We designed a system that acquires domain-specific knowledge from human-written biological papers, and we call this system IFBP (Information Finding from Biological Papers). IFBP is divided into three phases, Information Retrieval
TopCat: Data Mining for Topic Identification in a Text Corpus
In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases, 2002
Cited by 27 (4 self)
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
Methods of Automatic Term Recognition - A Review, 1996
Cited by 24 (1 self)
Following the growing interest in "corpus-based" approaches to computational linguistics, a number of studies have recently appeared on the topic of automatic term recognition or extraction. Because a successful term recognition method has to be based on proper insights into the nature of terms, studies of automatic term recognition not only contribute to the applications of computational linguistics but also to the theoretical foundation of terminology. Many studies on automatic term recognition treat interesting aspects of terms, but most of them are not well founded and described. This paper tries to give an overview of the principles and methods of automatic term recognition. For that purpose, two major trends are examined, i.e. studies in automatic recognition of significant elements for indexing mainly carried out in information retrieval circles, and current research in automatic term recognition in the field of computational linguistics. Keywords Automatic term recognition, au...

