Results 1  10
of
80
Adaptive Duplicate Detection Using Learnable String Similarity Measures
 In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2003
, 2003
"... The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we p ..."
Abstract

Cited by 237 (12 self)
 Add to MetaCart
The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vectorspace based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
ROC Graphs: Notes and Practical Considerations for Researchers
, 2004
"... Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communitie ..."
Abstract

Cited by 227 (1 self)
 Add to MetaCart
Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.
ROC graphs: Notes and practical considerations for data mining researchers
, 2003
"... Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communitie ..."
Abstract

Cited by 157 (0 self)
 Add to MetaCart
Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research. Keywords: 1
Tree Induction for Probabilitybased Ranking
, 2002
"... Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., c ..."
Abstract

Cited by 130 (4 self)
 Add to MetaCart
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probabilitybased rankings, and by how much. In this paper we first discuss why the decisiontree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decisiontree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reducederror pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straghtforward methods for improving probabilitybased rankings. We show that using a simple, common smoothing methodthe Laplace correctionuniformly improves probabilitybased rankings. In addition, bagging substantioJly improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on classmembership probability are required.
An Empirical Comparison of Supervised Learning Algorithms
 In Proc. 23 rd Intl. Conf. Machine learning (ICML’06
, 2006
"... A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a largescale empirical comparison between ten supervised learning methods: SVMs, n ..."
Abstract

Cited by 95 (6 self)
 Add to MetaCart
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a largescale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memorybased learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods. 1.
Classification with Hybrid Generative/Discriminative Models
 In Advances in Neural Information Processing Systems 16
, 2003
"... Although discriminatively trained classifiers are usually more accurate when labeled training data is abundant, previous work has shown that when training data is limited, generative classifiers can outperform them. This paper describes a hybrid model in which a highdimensional subset of the p ..."
Abstract

Cited by 63 (4 self)
 Add to MetaCart
Although discriminatively trained classifiers are usually more accurate when labeled training data is abundant, previous work has shown that when training data is limited, generative classifiers can outperform them. This paper describes a hybrid model in which a highdimensional subset of the parameters are trained to maximize generative likelihood, and another, small, subset of parameters are discriminatively trained to maximize conditional likelihood. We give a sample complexity bound showing that in order to fit the discriminative parameters well, the number of training examples required depends only on the logarithm of the number of feature occurrences and feature set size. Experimental results show that hybrid models can provide lower test error and can produce better accuracy/coverage curves than either their purely generative or purely discriminative counterparts. We also discuss several advantages of hybrid models, and advocate further work in this area.
Predicting Good Probabilities with Supervised Learning
 In Proc. Int. Conf. on Machine Learning (ICML
, 2005
"... We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion i ..."
Abstract

Cited by 57 (7 self)
 Add to MetaCart
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
Methods for costsensitive learning
 In IJCAI
, 2001
"... For many classification tasks a large number of instances available for training are unlabeled and the cost associated with the labeling process varies over the input space. Meanwhile, virtually all these problems require classifiers that minimize a nonuniform loss function associated with the class ..."
Abstract

Cited by 28 (0 self)
 Add to MetaCart
For many classification tasks a large number of instances available for training are unlabeled and the cost associated with the labeling process varies over the input space. Meanwhile, virtually all these problems require classifiers that minimize a nonuniform loss function associated with the classification decisions (rather than the accuracy or number of errors). For example, to train pattern classification models for a network intrusion detection task, experts need to analyze network events and assign them labels. This can be a very costly procedure if the instances to be labeled are selected at random. In the meantime, the loss associated with mislabeling an intrusion is much higher than the loss associated with the opposite error (i.e., labeling a legal event as being an intrusion). As a result, to address these types of tasks, practitioners need tools that minimize the total cost computed as a sum of the cost of labeling and the loss associated with the decisions. This paper describes an approach for addressing this problem. 1
Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000
 In Knowledge Discovery and Data Mining
, 2001
"... CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used b ..."
Abstract

Cited by 26 (0 self)
 Add to MetaCart
CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very timeconsuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
Properties and Benefits of Calibrated Classifiers
 in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD
, 2004
"... A calibrated classifier provides reliable estimates of the true probability that each test sample is a member of the class of interest. ..."
Abstract

Cited by 26 (4 self)
 Add to MetaCart
A calibrated classifier provides reliable estimates of the true probability that each test sample is a member of the class of interest.