An extensive empirical study of feature selection metrics for text classification
- Journal of Machine Learning Research, 2003
- Cited by 180 (11 self)
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric we call ‘Bi-Normal Separation’ (BNS) outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
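The abstract names Bi-Normal Separation but does not define it. As a rough illustration, the sketch below assumes the definition commonly given for BNS: the absolute difference between the inverse standard normal CDF of a feature's true positive rate and that of its false positive rate. The function name `bns_score`, its parameters, and the clipping constant `eps` are assumptions made for this example, not details taken from the listing.

```python
from statistics import NormalDist


def bns_score(tp, fp, pos, neg, eps=0.0005):
    """Sketch of a Bi-Normal Separation score for one feature.

    tp  : positive documents containing the feature
    fp  : negative documents containing the feature
    pos : total positive documents (must be > 0)
    neg : total negative documents (must be > 0)
    eps : assumed clipping bound that keeps the rates away from 0 and 1,
          where the inverse normal CDF is undefined
    """
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


# A feature present in 90% of positives and 10% of negatives separates the
# classes far better than one present in half of each.
strong = bns_score(tp=90, fp=10, pos=100, neg=100)
weak = bns_score(tp=50, fp=50, pos=100, neg=100)
```

Because the score takes an absolute value, a feature that is strongly associated with the negative class scores as highly as one associated with the positive class.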
Computing semantic relatedness using Wikipedia-based explicit semantic analysis
- In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007
- Cited by 172 (7 self)
Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
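The comparison step mentioned in the abstract (cosine between weighted concept vectors) can be sketched as follows. This is only an illustration: the sparse dictionary representation, the function name `cosine`, and the concept names and weights are hypothetical, and how ESA derives the weights from Wikipedia is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical concept vectors for two short texts (weights are made up).
text_a = {"Jaguar (animal)": 0.8, "Big cat": 0.5}
text_b = {"Big cat": 0.7, "Leopard": 0.4}
relatedness = cosine(text_a, text_b)
```

Storing only non-zero weights keeps the comparison cheap even when the concept space has a very large number of dimensions.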
Search advertising using web relevance feedback
- In Proc. 17th Intl. Conf. on Information and Knowledge Management, 2008
- Cited by 25 (10 self)
The business of Web search, a $10 billion industry, relies heavily on sponsored search, in which a few carefully-selected paid advertisements are displayed alongside algorithmic search results. A key technical challenge in sponsored search is to select ads that are relevant for the user’s query. Identifying relevant ads is challenging because queries are usually very short, and because users, consciously or not, choose terms intended to lead to optimal Web search results and not to optimal ads. Furthermore, the ads themselves are short and usually formulated to capture the reader’s attention rather than to facilitate query matching. Traditionally, matching of ads to queries employed standard information retrieval techniques using the bag of words approach. Here we propose to go beyond the bag of words, and augment both queries and ads with additional knowledge-rich features. We use Web search results initially returned for the query to create a pool of relevant documents. Classifying these documents with respect to an external taxonomy and identifying salient named entities give rise to two new feature types. Empirical evaluation based on over 9,000 query-ad pairwise judgments confirms that using augmented queries produces highly relevant ads. Our methodology also relaxes the requirement for each ad to explicitly specify the exhaustive list of queries (“bid phrases”) that can trigger it.
A pitfall and solution in multi-class feature selection for text classification
- In Proceedings of the 21st International Conference on Machine Learning (ICML’04), 2004
- Cited by 21 (1 self)
Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
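The abstract does not spell out the round-robin solutions, so the following is only a guess at the general idea: each class keeps its own ranking of features, and the classes take turns contributing their next-best unused feature, so that no class with many strong features can crowd out the others. The function name `round_robin_select` and the example rankings are hypothetical.

```python
def round_robin_select(ranked_by_class, k):
    """Pick up to k features by letting classes take turns.

    ranked_by_class maps each class label to its features, ordered from
    best to worst by some per-class score (for example, one-vs-rest
    Information Gain). Features already chosen by another class are skipped.
    """
    selected = []
    seen = set()
    cursors = {label: 0 for label in ranked_by_class}
    while len(selected) < k:
        progressed = False
        for label, ranking in ranked_by_class.items():
            i = cursors[label]
            # Skip features that an earlier turn already selected.
            while i < len(ranking) and ranking[i] in seen:
                i += 1
            if i < len(ranking):
                seen.add(ranking[i])
                selected.append(ranking[i])
                progressed = True
                i += 1
            cursors[label] = i
            if len(selected) == k:
                break
        if not progressed:
            break  # every ranking is exhausted
    return selected


# Hypothetical per-class rankings: "hard" still gets its own features even
# though "easy" has plenty of strong ones.
rankings = {"easy": ["a", "b", "c"], "hard": ["x", "a", "y"]}
chosen = round_robin_select(rankings, 4)
```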
Collection Synthesis
- 2002
- Cited by 19 (2 self)
The invention of the hyperlink and the HTTP transmission protocol caused an amazing new structure to appear on the Internet -- the World Wide Web. With the Web, there came spiders, robots, and Web crawlers, which go from one link to the next checking Web health, ferreting out information and resources, and imposing organization on the huge collection of information (and dross) residing on the net. This paper reports on the use of one such crawler to synthesize document collections on various topics in science, mathematics, engineering and technology. Such collections could be part of a digital library.
Focused Crawls, Tunneling, and Digital Libraries
- In Proceedings of the European Conference on Digital Libraries (ECDL), 2002
- Cited by 16 (1 self)
Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990s after crawler technology was developed for the benefit of search engines. Now, Web crawling is being seriously considered as an important strategy for building large-scale digital libraries. This paper considers some of the crawl technologies that might be exploited for collection building. For example, to make such collection-building crawls more effective, focused crawling was developed, in which the goal was to make a "best-first" crawl of the Web. We are using powerful crawler software to implement a focused crawl but use tunneling to overcome some of the limitations of a pure best-first approach. Tunneling has been described by others as not only prioritizing links from pages according to the page's relevance score, but also estimating the value of each link and prioritizing on that as well. We add to this mix by devising a tunneling focused crawling strategy which evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
Tackling Concept Drift by Temporal Inductive Transfer
- In Proc. of the ACM SIGIR Conference, 2006
- Cited by 14 (1 self)
The success of machine learning classification pales for real-world, time-varying streams of data. We define three subtypes of concept drift, and confirm that recurrent themes appear in the benchmark dataset Reuters2000. To encourage research in this difficult area, we define a ‘daily classification task’ (DCT) problem formulation, in which a few random iid training samples are provided each day. Ideally, past training data could be leveraged to improve the current day’s classifier. Empirical results for Reuters2000 show that two likely methods are not successful: (1) the popular idea of a sliding window incorporating recent past training data, and (2) inductive transfer of the previously learned classifiers to provide additional predictive features for the current learning task. The former provides a method of characterizing the degree of concept drift. The latter excels if all past labels are given: ‘hindsight DCT.’
An exploration of entity models, collective classification and relation description
- In Proceedings of KDD Workshop on Link Analysis and Group Detection, 2004
- Cited by 13 (1 self)
Traditional information retrieval typically represents data using a bag of words; data mining typically uses a highly structured database representation. This paper explores the middle ground using a representation which we term entity models, in which questions about structured data may be posed and answered, but the complexities and task-specific restrictions of ontologies are avoided. An entity model is a language model or word distribution associated with an entity, such as a person, place or organization. Using these per-entity language models, entities may be clustered, links may be detected or described with a short summary, entities may be collectively classified, and question answering may be performed. On a corpus of entities extracted from newswire and the Web, we group entities by profession with 90% accuracy, improve accuracy further on the task of classifying politicians as liberal or conservative using collective classification and conditional random fields, and answer questions about “who a person is” with mean reciprocal rank (MRR) of 0.52.
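For readers unfamiliar with the MRR figure quoted above, the sketch below shows the standard way mean reciprocal rank is computed: for each question, take the reciprocal of the rank at which the first correct answer appears (0 if none appears), then average over questions. The function name and the example data are hypothetical and are not drawn from the paper's evaluation.

```python
def mean_reciprocal_rank(ranked_answers, correct_answers):
    """Average of 1/rank of the first correct answer for each question.

    ranked_answers  : list of ranked answer lists, one per question
    correct_answers : list of sets of acceptable answers, one per question
    """
    if not ranked_answers:
        return 0.0
    total = 0.0
    for ranking, correct in zip(ranked_answers, correct_answers):
        for rank, answer in enumerate(ranking, start=1):
            if answer in correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)


# Hypothetical example: correct answer at rank 2 for the first question
# and at rank 1 for the second.
rankings = [["senator", "physicist"], ["novelist", "painter"]]
gold = [{"physicist"}, {"novelist"}]
mrr = mean_reciprocal_rank(rankings, gold)
```

An MRR near 0.5 is what results when the first correct answer sits at rank 2 on average, although many different mixes of ranks can produce the same mean.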
Using Support Vector Machines for Classifying Large Sets of Multi-Represented Objects
- In Proc. 4th SIAM Int. Conf. on Data Mining, 2004
- Cited by 9 (3 self)
Databases are a key technology for molecular biology, which is a very data-intensive discipline. Since molecular biological databases are rather heterogeneous, unification and data integration are mandatory to make use of the huge amount of available information. Currently, the most promising approach for integration is the use of ontologies. Since mapping biological entities into ontologies is usually achieved manually or semi-automatically, a system for automatic classification of biological entities into ontologies saves time and effort. Therefore, we present a support vector machine based approach that automatically classifies biological entities into a given ontology. To solve this difficult task, our method copes with the following aspects. Biological entities might belong to more than one class or may be placed in classes on varying abstraction levels. An object may be described by several representations. Thus, the classifier has to be enabled to draw information from all of them, but must consider the possibility that some objects are described incompletely. Therefore, our method introduces the technique of object-adjusted weighting, which regulates the impact of each representation dynamically for each object. To significantly improve the time performance of the classifier we exploit the inheritance relations of the given ontology. Our experimental evaluation on protein data and several parts of an established molecular biological ontology shows that our prototype offers impressive accuracy and is efficient enough to cope with the large number of classes encountered in real world problems. ∗ Supported by the German Ministry for Education, Science,
Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
- Cited by 9 (2 self)
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.

