On Integrating Catalogs
2001. Cited by 55 (0 self)
We address the problem of integrating documents from different sources into a master catalog. This problem is pervasive in web marketplaces and portals. Current technology for automating this process consists of building a classifier that uses the categorization of documents in the master catalog to construct a model for predicting the category of unknown documents. Our key insight is that many of the data sources have their own categorization, and classification accuracy can be improved by factoring in the implicit information in these source categorizations. We show how Naive Bayes classification can be enhanced to incorporate the similarity information present in source catalogs. Our analysis and empirical evaluation show substantial improvement in the accuracy of catalog integration.
Keywords: Classification, Categorization, Data Mining, Catalog Integration, Web Portals, Web Marketplaces
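The idea of letting a source categorization inform a Naive Bayes classifier can be illustrated with a small sketch. The abstract gives no formula, so everything here is an assumption made for illustration: the function names, the toy data, and the rule that reweights each master category's prior by how many documents of the same source category plain Naive Bayes already places there, raised to a power `weight`.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-category document and word counts from (tokens, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(tokens, model, prior=None):
    """Return the best master category; `prior` optionally replaces the class prior."""
    doc_counts, word_counts, vocab = model
    if prior is None:
        prior = doc_counts  # plain Naive Bayes: prior proportional to category size
    best, best_score = None, float("-inf")
    for category in doc_counts:
        total = sum(word_counts[category].values())
        score = math.log(prior[category])
        for token in tokens:
            # Laplace-smoothed word likelihood
            score += math.log((word_counts[category][token] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


def source_aware_prior(source_docs, model, weight=2.0):
    """Assumed rule: category size times (smoothed vote count) ** weight."""
    doc_counts = model[0]
    votes = Counter(classify(tokens, model) for tokens in source_docs)
    return {c: doc_counts[c] * (votes[c] + 1) ** weight for c in doc_counts}


# Toy master catalog (hypothetical data).
master = [
    (["lens", "zoom", "camera"], "cameras"),
    (["lens", "tripod"], "cameras"),
    (["phone", "battery", "screen"], "phones"),
    (["phone", "sim"], "phones"),
]
model = train(master)

# Documents that share one category in a source catalog.
source_category = [["lens", "zoom"], ["tripod", "camera"], ["battery"]]
prior = source_aware_prior(source_category, model)

plain = classify(["battery"], model)            # text alone
adjusted = classify(["battery"], model, prior)  # text plus source-category evidence
```

In this toy case the ambiguous document is pulled toward the category where most of its source-category siblings land; with `weight=0` the sketch reduces to plain Naive Bayes.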
Building text classifiers using positive and unlabeled examples
In: Intl. Conf. on Data Mining, 2003. Cited by 46 (8 self)
This paper studies the problem of building text classifiers using positive and unlabeled examples. The key feature of this problem is that there is no negative example for learning. Recently, a few techniques for solving this problem were proposed in the literature. These techniques are based on the same idea, which builds a classifier in two steps. Each existing technique uses a different method for each step. In this paper, we first introduce some new methods for the two steps, and perform a comprehensive evaluation of all possible combinations of methods of the two steps. We then propose a more principled approach to solving the problem based on a biased formulation of SVM, and show experimentally that it is more accurate than the existing techniques.
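The abstract does not spell out the biased formulation. One common way to write such a formulation, given here as an assumption rather than as the paper's exact statement, is a soft-margin SVM in which unlabeled examples are treated as negatives but their errors are penalized differently from errors on positives. With positives indexed $1,\dots,k-1$ (labels $y_i = +1$) and unlabeled examples indexed $k,\dots,n$ (labels $y_i = -1$):

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\, w^{\top} w
  \;+\; C_{+} \sum_{i=1}^{k-1} \xi_i
  \;+\; C_{-} \sum_{i=k}^{n} \xi_i \\
\text{subject to}\quad & y_i \left( w^{\top} x_i + b \right) \;\ge\; 1 - \xi_i,
  \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

Choosing $C_{+}$ larger than $C_{-}$ expresses that a misclassified positive is a real error, while a misclassified unlabeled example may simply be a hidden positive.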
Mining Newsgroups Using Networks Arising From Social Behavior
2003. Cited by 40 (0 self)
Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy information than text. We investigate the feasibility of applying link-based methods in new application domains. The specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. This social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. An interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text.
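If replies mostly signal disagreement, splitting authors into two camps resembles finding a cut that most reply links cross. The sketch below is one simple spectral heuristic for that, not necessarily the algorithm used in the paper: it takes the sign pattern of the eigenvector belonging to the smallest eigenvalue of the reply graph's adjacency matrix, found by power iteration on a shifted matrix. The function name, the starting vector, and the toy reply list are assumptions.

```python
import math


def split_into_camps(num_authors, replies, iterations=200):
    """Assign each author 0..num_authors-1 to camp 0 or 1 from (a, b) reply pairs."""
    n = num_authors
    # Symmetric adjacency matrix; entry = number of replies exchanged by a pair.
    adj = [[0.0] * n for _ in range(n)]
    for a, b in replies:
        if a != b:
            adj[a][b] += 1.0
            adj[b][a] += 1.0

    # The maximum row sum bounds the largest eigenvalue, so (shift*I - adj) has
    # non-negative eigenvalues and its dominant eigenvector is the eigenvector
    # of adj's smallest (most negative) eigenvalue.
    shift = max(sum(row) for row in adj)

    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [
            shift * vec[i] - sum(adj[i][j] * vec[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nxt)) or 1.0
        vec = [x / norm for x in nxt]

    # Authors whose entries share a sign are placed in the same camp.
    return [0 if x >= 0 else 1 for x in vec]


# Hypothetical thread: authors 0-2 argue with authors 3-4, and one reply
# (0, 1) happens inside a camp.
replies = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (0, 1)]
camps = split_into_camps(5, replies)
```

Which camp receives label 0 is arbitrary; only the grouping is meaningful. Note that no message text is used at all, which matches the abstract's point that the two sides tend to share a vocabulary.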
Fast and accurate text classification via multiple linear discriminant projections
In VLDB, 2002. Cited by 20 (0 self)
Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher’s linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats the running time of a popular SVM implementation by orders of magnitude. While describing SIMPL, we make a detailed experimental comparison of SVM-generated discriminants with Fisher’s discriminants, and we also report on an analysis of the cache performance of a popular SVM implementation. Our analysis shows that SIMPL has the potential to be the method of choice for practitioners who want the accuracy of SVMs and the simplicity and speed of naive Bayes classifiers.
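For readers unfamiliar with Fisher's linear discriminant, the sketch below computes the classical two-class direction w = S_w^{-1}(m_a - m_b) for 2-D points and projects the points onto it. It illustrates only that textbook building block; it does not reproduce SIMPL itself, which works on sparse high-dimensional text, uses multiple projections, and induces a decision tree afterwards. The function names and sample points are hypothetical.

```python
def fisher_direction(class_a, class_b):
    """Return w = Sw^{-1} (mean_a - mean_b) for two lists of 2-D points."""

    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        # Within-class scatter entries (sxx, sxy, syy) around the class mean m.
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    mean_a, mean_b = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, mean_a)
    bxx, bxy, byy = scatter(class_b, mean_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled within-class scatter

    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")

    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # Multiply the inverse of [[sxx, sxy], [sxy, syy]] by (dx, dy).
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def project(point, w):
    """Scalar coordinate of a point along the discriminant direction."""
    return point[0] * w[0] + point[1] * w[1]


class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]
w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
```

After projection each instance is a single number, so a threshold (or, as in the abstract, a decision tree over a few such projections) can separate the classes using only sorts and sequential scans.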
Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression
Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003. Cited by 20 (6 self)
The problem of learning with positive and unlabeled examples arises frequently in retrieval applications.
Web Taxonomy Integration Using Support Vector Machines
In Proceedings of the World-Wide Web Conference (WWW-2004). ACM, 2004. Cited by 17 (1 self)
We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. In this paper we attempt to use a powerful classification method, Support Vector Machine (SVM), to attack this problem. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario; it would therefore be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose a method, Cluster Shrinkage (CS), to further enhance the classification by exploiting such implicit knowledge. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.
The user-subjective approach to personal information management systems
Journal of the American Society for Information Science and Technology, 2003. Cited by 14 (4 self)
Personal Information Management (PIM) is an activity in which an individual stores his/her personal information items in order to retrieve them later on. In a former article, we suggested the user-subjective approach, a theoretical approach proposing design principles with which PIM systems can systematically use subjective attributes of information items. In this consecutive paper, we report on a study that tested the approach by exploring the use of subjective attributes (project, importance and context) in current PIM systems, and its dependence on design characteristics. Participants were 84 personal computer users. Tools included a questionnaire (N=84), a semi-structured interview that was transcribed and analyzed (N=20), and screen captures taken from this sub-sample. Results indicate that participants tended to use subjective attributes when the design encouraged them to; however, when the design discouraged such use, they either found their own alternative ways to use them or refrained from using them altogether. This constitutes evidence in support of the user-subjective approach as it
Learning to integrate web taxonomies
Journal of Web Semantics, 2004. Cited by 9 (0 self)
We investigate machine learning methods for automatically integrating objects from different taxonomies into a master taxonomy. This problem is not only currently pervasive on the Web, but is also important to the emerging Semantic Web. A straightforward approach to automating this process would be to build classifiers through machine learning and then use these classifiers to classify objects from the source taxonomies into categories of the master taxonomy. However, conventional machine learning algorithms totally ignore the availability of the source taxonomies. In fact, source and master taxonomies often have common categories under different names or other more complex semantic overlaps. We introduce two techniques that exploit the semantic overlap between the source and master taxonomies to build better classifiers for the master taxonomy. The first technique, Cluster Shrinkage, biases the learning algorithm against splitting source categories by making objects in the same category appear more similar to each other. The second technique, Co-Bootstrapping, tries to facilitate the exploitation of inter-taxonomy relationships by providing category indicator functions as additional features for the objects. Our experiments with real-world Web data show that these proposed add-on techniques can enhance various machine learning algorithms to achieve substantial improvements in performance for taxonomy integration.
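The Cluster Shrinkage idea, making objects from the same source category look more alike before a classifier sees them, can be sketched as pulling each feature vector part of the way toward its source-category centroid. The linear interpolation, the `strength` parameter, and the names below are assumptions for illustration; the abstract states only the intended effect.

```python
def cluster_shrinkage(vectors, source_category, strength=0.5):
    """Move each object toward the centroid of its source category.

    vectors: dict mapping object id -> list of floats (feature vector)
    source_category: dict mapping object id -> source category name
    strength: 0 leaves vectors unchanged, 1 collapses each category to its centroid
    """
    # Group object ids by source category.
    members = {}
    for obj, category in source_category.items():
        members.setdefault(category, []).append(obj)

    # Centroid of each source category.
    centroids = {}
    for category, objs in members.items():
        dim = len(vectors[objs[0]])
        centroids[category] = [
            sum(vectors[o][d] for o in objs) / len(objs) for d in range(dim)
        ]

    # Assumed update: x' = strength * centroid + (1 - strength) * x
    shrunk = {}
    for obj, vec in vectors.items():
        center = centroids[source_category[obj]]
        shrunk[obj] = [
            strength * c + (1.0 - strength) * x for c, x in zip(center, vec)
        ]
    return shrunk


# Hypothetical objects: "a" and "b" share a source category, "c" is alone.
vectors = {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [10.0, 10.0]}
source_category = {"a": "s1", "b": "s1", "c": "s2"}
shrunk = cluster_shrinkage(vectors, source_category, strength=0.5)
```

Because the transformed vectors can be fed to any learner, a step of this kind fits the abstract's description of an add-on technique that can enhance various machine learning algorithms.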
Language Models for Hierarchical Summarization
2003. Cited by 8 (2 self)
Hierarchies have long been used for organization, summarization, and access to information. In this dissertation we define summarization in terms of a probabilistic language model and use this definition to explore a new technique for automatically generating topic hierarchies. We use the language model to characterize the documents that will be summarized and then apply a graph-theoretic algorithm to determine the best topic words for the hierarchical summary. This work is very different from previous attempts to generate topic hierarchies because it relies on statistical analysis and language modeling to identify descriptive words for a document and organize the words in a hierarchical structure. We compare
Hierarchical Text Categorization and Its Application to Bioinformatics
2005. Cited by 7 (0 self)
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification that makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework of converting a conventional “flat” learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicate that the proposed approach significantly outperforms the corresponding “flat” as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number
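The abstract does not define "consistent" classification. The sketch below assumes one natural reading: a label set is consistent when it contains every ancestor of each assigned category, including in a DAG where a category may have several parents. The function names and the small hierarchy are hypothetical.

```python
def ancestors(category, parents):
    """Return all ancestors of a category; `parents` maps category -> list of parents."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def make_consistent(labels, parents):
    """Close a label set under the ancestor relation (assumed meaning of 'consistent')."""
    closed = set(labels)
    for label in labels:
        closed |= ancestors(label, parents)
    return closed


def is_consistent(labels, parents):
    """True when every ancestor of every assigned label is also assigned."""
    return make_consistent(labels, parents) == set(labels)


# Hypothetical DAG: "kinase" has two parents, both under "protein".
parents = {
    "enzyme": ["protein"],
    "signaling": ["protein"],
    "kinase": ["enzyme", "signaling"],
}
closed = make_consistent({"kinase"}, parents)
```

Under this reading, an assignment to an internal category such as "enzyme" alone is still consistent as long as "protein" is assigned too, which is compatible with the internal class assignments the abstract mentions.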

