Editorial: Special Issue on Learning from Imbalanced Data Sets
SIGKDD Explorations, 2004
Cited by 60 (1 self)
The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research.
Learning Classifiers from Only Positive and Unlabeled Data
2008
Cited by 15 (2 self)
The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature. Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
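The constant-factor result stated in this abstract can be illustrated with a small simulation. The sketch below is not the authors' code: the synthetic one-feature data, the histogram-based score function `g`, and all variable names are assumptions made for illustration. It assumes labeled examples are drawn at random from the positives with unknown frequency c = p(s=1 | y=1), so that p(s=1 | x) = c · p(y=1 | x), and it uses one simple estimator of c (the mean score on held-out labeled positives), which is reasonable only when the classes are well separated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-separated data (an assumption, so p(y=1|x) is close to 0 or 1).
n = 20000
y = rng.random(n) < 0.5                      # true labels, hidden from the learner
x = rng.normal(np.where(y, 3.0, -3.0), 1.0)  # a single real-valued feature
c_true = 0.3                                 # p(s=1 | y=1); unknown in practice
s = y & (rng.random(n) < c_true)             # s=True: labeled positive, s=False: unlabeled

train = np.arange(n) < n // 2
hold = ~train

# "Nontraditional" classifier g(x) ~ p(s=1 | x): a histogram estimate,
# chosen here only to keep the sketch dependency-free.
edges = np.linspace(-7.0, 7.0, 29)
n_bins = len(edges) - 1

def bin_index(values):
    """Map each value to a histogram bin, clamping out-of-range values."""
    return np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

counts = np.bincount(bin_index(x[train]), minlength=n_bins)
labeled = np.bincount(bin_index(x[train]),
                      weights=s[train].astype(float), minlength=n_bins)
g_table = np.divide(labeled, counts, out=np.zeros_like(labeled), where=counts > 0)

def g(values):
    """Estimated probability that an example is labeled, p(s=1 | x)."""
    return g_table[bin_index(values)]

# Estimate c as the average score over held-out labeled positives.
c_hat = g(x[hold & s]).mean()

# Dividing by c_hat rescales p(s=1 | x) into an estimate of p(y=1 | x).
p_pos = np.clip(g(x[hold]) / c_hat, 0.0, 1.0)
accuracy = np.mean((p_pos > 0.5) == y[hold])

print(f"estimated c = {c_hat:.3f}, held-out accuracy = {accuracy:.3f}")
```

The true labels `y` are used only to generate the data and to score the result; the learner itself sees just `x` and `s`. With overlapping classes this estimator of c would be biased downward, so the rescaling step should be treated as approximate.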
A Method for Inferring Label Sampling Mechanisms In Semi-Supervised Learning
2005
Cited by 10 (0 self)
We consider the situation in semi-supervised learning, where the "label sampling" mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: (a) as an input to a supervised learning procedure, which can be used to "de-bias" its results using labeled data only, and (b) as a potentially interesting learning task in itself.
Spying out real user preferences for metasearch engine adaptation
In Proc. of WebKDD, 2004
Cited by 5 (3 self)
Most current metasearch engines provide uniform service to users but do not cater for the specific needs of individual users. To address this problem, research has been done on personalizing a metasearch engine. An interesting and practical approach is to optimize its ranking function using clickthrough data. However, it is still challenging to infer accurate user preferences from the clickthrough data. In this paper, we propose a novel learning technique called "Spy Naïve Bayes" (SpyNB) to identify the user preference pairs generated from clickthrough data. We then employ ranking SVM to build a metasearch engine optimizer. To evaluate the effectiveness of SpyNB on ranking quality, we develop a metasearch engine prototype that comprises three underlying search engines (MSNSearch, WiseNut and Overture) and use it to conduct an experimental evaluation. The empirical results show that, compared with the original ranking, SpyNB can significantly improve the average ranks of users' clicks by 20%, while the performance of the existing methods is not satisfactory. Key Words: Spy naïve Bayes, search engine personalization, clickthrough, user preferences.
Learning from positive and unlabeled examples with different data distributions
Proceedings of ECML 2005, 2005
Cited by 5 (2 self)
We study the problem of learning from positive and unlabeled examples. Although several techniques exist for dealing with this problem, they all assume that the positive examples in the positive set P and the positive examples in the unlabeled set U are generated from the same distribution. This assumption may be violated in practice. For example, suppose one wants to collect all printer pages from the Web. One can use the printer pages from one site as the set P of positive pages and use product pages from another site as U. One wants to classify the pages in U into printer pages and non-printer pages. Although printer pages from the two sites have many similarities, they can also be quite different because different sites often present similar products in different styles and have different focuses. In such cases, existing methods perform poorly. This paper proposes a novel technique, A-EM, to deal with the problem. Experimental results with product page classification demonstrate the effectiveness of the proposed technique.
A Simple Probabilistic Approach to Learning from Positive and Unlabeled Examples
In Proc. of the 5th Annual UK Workshop on Computational Intelligence, 2005
Cited by 5 (2 self)
We propose a simple probabilistic approach to learning from positive and unlabeled examples, and show experimentally that it can approximate or outperform other state-of-the-art approaches to this problem in spite of its simplicity. By employing a linear-time learning algorithm such as PrTFIDF, our approach can be highly efficient and scalable.
Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision
Cited by 4 (0 self)
Table of contents: List of tables; List of figures; Abstract
Positive Unlabeled Learning for Data Stream Classification
SDM, SIAM, 2009
Cited by 4 (3 self)
Learning from positive and unlabeled examples (PU learning) has been investigated in recent years as an alternative learning model for dealing with situations where negative training examples are not available. It has many real world applications, but it has yet to be applied in the data stream environment where it is highly possible that only a small set of positive data and no negative data is available. An important challenge is to address the issue of concept drift in the data stream environment, which is not easily handled by the traditional PU learning techniques. This paper studies how to devise PU learning techniques for the data stream environment. Unlike existing data stream classification methods that assume both positive and negative training data are available for learning, we propose a novel PU learning technique
Classifying Documents without Labels
In Proceedings of the SIAM International Conference on Data Mining, 2003
Cited by 3 (2 self)
Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. Consider, for instance, a set of documents that is returned as a result of a query. If we want to separate the documents that are truly relevant to the query from those that are not, it is unlikely that we will have at hand labelled documents to train classification models to perform this task. In this paper we focus on the classification of an unlabelled set of documents into two classes, relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones (i.e., a high "purity"). This sample can be used to train models to classify the entire set of documents. We try several methods of classification to separate the documents, including two-class SVM, for which we develop a heuristic to identify a small sample of negative examples. We prove, via experimentation, that our method is capable of accurately classifying a set of documents into relevant and irrelevant classes.
Leveraging one-class SVM and semantic analysis to detect anomalous content
In Proceedings of IEEE International Conference on Intelligence and Security Informatics (ISI 2005), 2005
Cited by 3 (0 self)
Experiments were conducted to test several hypotheses on methods for improving document classification for the malicious insider threat problem within the Intelligence Community. Bag-of-words (BOW) representations of documents were compared to Natural Language Processing (NLP) based representations in both the typical and one-class classification problems using the Support Vector Machine algorithm. Results show that the NLP features significantly improved classifier performance over the BOW approach both in terms of precision and recall, while using many fewer features. The one-class algorithm using NLP features demonstrated robustness when tested on new domains.

