Results 1 - 10 of 30
Clustering with instance-level constraints
- In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
Cited by 116 (6 self)
One goal of research in artificial intelligence is to automate tasks that currently require human expertise; this automation is important because it saves time and brings problems that were previously too large to be solved into the feasible domain. Data analysis, or the ability to identify meaningful patterns and trends in large volumes of data, is an important task that falls into this category. Clustering algorithms are a particularly useful group of data analysis tools. These methods are used, for example, to analyze satellite images of the Earth to identify and categorize different land and foliage types or to analyze telescopic observations to determine what distinct types of astronomical bodies exist and to categorize each observation. However, most existing clustering methods apply general similarity techniques rather than making use of problem-specific information. This dissertation first presents a novel method for converting existing clustering algorithms into constrained clustering algorithms. The resulting methods are able to accept domain-specific information in the form of constraints on the output clusters. At the most general level, each constraint is an instance-level statement
Mining with Rarity: A Unifying Framework
Cited by 57 (6 self)
Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.
Learning when data sets are imbalanced and when costs are unequal and unknown
- ICML-2003 Workshop on Learning from Imbalanced Data Sets II, 2003
Cited by 36 (0 self)
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal but unknown costs of error. For one domain, we also compare these results with those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
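The observation that moving the decision threshold yields classifiers lying on one ROC curve can be illustrated with a minimal sketch. The scores and labels below are made-up toy values, and `roc_points` is a hypothetical helper written for this illustration, not code from the paper.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold.

    scores: classifier outputs; higher is assumed to mean "more likely positive".
    labels: 1 for the positive (rare) class, 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep the threshold from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical skewed toy data: 2 positives among 8 examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0]
curve = roc_points(scores, labels)
```

Each threshold corresponds to one classifier built from the same scorer, so every point in `curve` lies on that scorer's ROC curve; choosing among them amounts to choosing an operating point for a given cost trade-off.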
Combining Sample Selection and Error-Driven Pruning for Machine Learning of Coreference Rules
, 2002
Cited by 20 (4 self)
Most machine learning solutions to noun phrase coreference resolution recast the problem as a classification task. We examine three potential problems with this reformulation, namely, skewed class distributions, the inclusion of "hard" training instances, and the loss of transitivity inherent in the original coreference relation.
Improved Rooftop Detection in Aerial Images with Machine Learning
- Machine Learning, 2002
Cited by 15 (2 self)
In this paper, we examine the use of machine learning to improve a rooftop detection process, one step in a vision system that recognizes buildings in overhead imagery. We review the problem of analyzing aerial images and describe an existing system that detects buildings in such images. We briefly detail four algorithms that we selected to improve rooftop detection. The data sets were highly skewed and the cost of mistakes differed between the classes, so we used ROC analysis to evaluate the methods under varying error costs. We report three experiments designed to illuminate facets of applying machine learning to the image analysis task. One investigated learning with all available images to determine the best performing method. Another focused on within-image learning, in which we derived training and testing data from the same image. A final experiment addressed between-image learning, in which training and testing sets came from different images. Results suggest that useful generalization occurred when training and testing on data derived from images differing in location and in aspect. They demonstrate that under most conditions, naive Bayes exceeded the accuracy of other methods and a handcrafted classifier, the solution currently used in the building detection system.
Severe class imbalance: Why better algorithms aren’t the answer
- Proceedings of the 16th European Conference of Machine Learning, 2005
Cited by 10 (1 self)
This paper argues that severe class imbalance is not just an interesting technical challenge that improved learning algorithms will address; it is much more serious. To be useful, a classifier must appreciably outperform a trivial solution, such as choosing the majority class. Any application that is inherently noisy limits the error rate, and cost, that is achievable. When data are normally distributed, even a Bayes optimal classifier has a vanishingly small reduction in the majority classifier’s error rate, and cost, as imbalance increases. For fat-tailed distributions, and when practical classifiers are used, often no reduction is achieved.
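The claim about normally distributed data can be checked numerically under a simple assumed setup that is not taken from the paper: one feature, majority class distributed as N(0, 1), minority class as N(separation, 1). The function names and the choice of unit variance are illustrative assumptions. The sketch computes the fraction of the majority classifier's error that the Bayes-optimal rule removes.

```python
import math


def upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def relative_error_reduction(minority_prior, separation=1.0):
    """Fraction of the majority-class error removed by the Bayes-optimal rule.

    Assumed setup (not from the paper): majority ~ N(0, 1),
    minority ~ N(separation, 1), minority prior = minority_prior.
    """
    p = minority_prior
    # Bayes rule: predict the minority class when x exceeds this threshold.
    t = separation / 2.0 + math.log((1.0 - p) / p) / separation
    # The majority classifier errs on every minority example, so its error is p.
    # Bayes error = p * P(x <= t | minority) + (1 - p) * P(x > t | majority).
    # Computing 1 - bayes_error / p and simplifying gives:
    return upper_tail(t - separation) - (1.0 - p) / p * upper_tail(t)


reductions = {p: relative_error_reduction(p) for p in (0.5, 0.1, 0.01)}
```

Under these assumptions the reduction is roughly 38% for balanced classes and shrinks rapidly as the minority prior falls, which matches the direction of the abstract's argument; the exact numbers depend entirely on the assumed separation.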
Reasoning with Textual Cases
- In Proceedings of the Sixth International Conference on Case-Based Reasoning (ICCBR 2005); LNCS 3620, H. Muñoz-Avila and ..., 2005
Cited by 9 (2 self)
indexing concepts in textual cases and demonstrates how these cases can be used in an interpretive CBR system to carry out case-based argumentation and prediction from text cases. We implemented and evaluated these methods in SMILE+IBP, which predicts the outcome of legal cases given a textual summary. Our approach uses classification-based methods for assigning indices. In our experiments, we compare different methods for representing text cases, and also consider multiple learning algorithms. The evaluation shows that a text representation combining some background knowledge and NLP, paired with a nearest neighbor algorithm, leads to the best performance for our TCBR task.
A Minimum Risk Metric for Nearest Neighbor Classification
- In Proc. 16th International Conf. on Machine Learning, 1999
Cited by 9 (1 self)
Nearest Neighbor is a well-known algorithm extensively studied by the Pattern Recognition and Machine Learning communities and widely exploited in Case Based Reasoning applications. The notion of metric is central to how Nearest Neighbor works, and different feature weighting metrics have been proposed in order to increase its performance. In this work we present an original Probability Based Metric, i.e. a metric for classification tasks that relies on estimates of the posterior probabilities, called Minimum Risk Metric (MRM). MRM is optimal in the sense that it directly optimizes the finite misclassification risk, whereas the Short and Fukunaga Metric minimizes the difference between finite risk and asymptotic risk. An experimental comparison of MRM with the Short and Fukunaga Metric, the Value Difference Metric, and Euclidean-Hamming metrics on benchmark datasets shows that MRM outperforms the other metrics and performs comparably to the Bayes Classifier based on the same probability...
Corpus-Based Linguistic Indicators for Aspectual Classification
, 1999
Cited by 8 (1 self)
Fourteen indicators that measure the frequency of lexico-syntactic phenomena linguistically related to aspectual class are applied to aspectual classification. This group of indicators is shown to improve classification performance for two aspectual distinctions, stativity and completedness (i.e., telicity), over unrestricted sets of verbs from two corpora. Several of these indicators have not previously been discovered to correlate with aspect.
Probability Based Metrics for Nearest Neighbor Classification and Case-Based Reasoning
- Proceedings of the 3rd International Conference on Case-Based Reasoning Research and Development (ICCBR-99), volume 1650 of LNAI, 1999
Cited by 8 (1 self)
This paper is focused on a class of metrics for the Nearest Neighbor classifier, whose definition is based on statistics computed on the case base. We show that these metrics basically rely on a probability estimation phase. In particular, we reconsider a metric proposed in the 1980s by Short and Fukunaga, extend its definition to an input space that includes categorical features, and evaluate its performance empirically. Moreover, we present an original probability based metric, called Minimum Risk Metric (MRM), i.e. a metric for classification tasks that exploits estimates of the posterior probabilities. MRM is optimal, in the sense that it optimizes the finite misclassification risk, whereas the Short and Fukunaga Metric minimizes the difference between finite risk and asymptotic risk. An experimental comparison of MRM with the Short and Fukunaga Metric, the Value Difference Metric, and Euclidean-Hamming metrics on benchmark datasets shows that MRM outperforms the other metri...
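One way to read "optimizes the finite misclassification risk" is sketched below. It scores a candidate neighbor y for a query x by the probability that a label drawn at y disagrees with the true class of x, using estimated posteriors. The formula, the function names, and the dictionary-based case representation are assumptions made for illustration; the abstract does not spell out the metric's definition.

```python
def finite_risk(posterior_x, posterior_y):
    """Probability that x's class differs from a label drawn at neighbor y.

    posterior_x, posterior_y: dicts mapping class -> estimated P(class | point).
    Assumed form: sum over classes c of P(c | x) * (1 - P(c | y)).
    """
    return sum(p_c * (1.0 - posterior_y.get(c, 0.0))
               for c, p_c in posterior_x.items())


def nearest_by_risk(query_posterior, case_base):
    """Return the stored case that minimizes the risk when used as the neighbor."""
    return min(case_base,
               key=lambda case: finite_risk(query_posterior, case["posterior"]))


# Hypothetical posteriors for a query and two stored cases.
query = {"a": 0.9, "b": 0.1}
cases = [
    {"name": "case_1", "posterior": {"a": 0.8, "b": 0.2}},
    {"name": "case_2", "posterior": {"a": 0.3, "b": 0.7}},
]
best = nearest_by_risk(query, cases)
```

In this toy example the first case is preferred because its posterior agrees more closely with the query's. The quality of such a metric rests on the posterior estimates, which is consistent with the abstract's remark that these metrics rely on a probability estimation phase.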

