Results 1–10 of 397
Logistic Model Trees, 2006
Abstract

Cited by 163 (3 self)
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into ‘model trees’, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.
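The stagewise fitting idea in this abstract (building a logistic model by adding attributes incrementally) can be illustrated with a toy forward-selection loop. This is a hypothetical sketch, not the paper's actual LogitBoost-based procedure: each round fits a small gradient-descent logistic regression with one extra attribute and keeps the attribute that most reduces the training negative log-likelihood.

```python
import math

def fit_logistic(X, y, cols, steps=2000, lr=0.5):
    """Gradient-descent logistic regression restricted to attributes in `cols`."""
    w = [0.0] * (len(cols) + 1)  # intercept + one weight per selected column
    for _ in range(steps):
        g = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xi[c] for wj, c in zip(w[1:], cols))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            g[0] += err
            for j, c in enumerate(cols):
                g[j + 1] += err * xi[c]
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, g)]
    return w

def nll(X, y, cols, w):
    """Negative log-likelihood of the restricted model."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xi[c] for wj, c in zip(w[1:], cols))
        p = 1.0 / (1.0 + math.exp(-z))
        total -= yi * math.log(p + 1e-12) + (1 - yi) * math.log(1 - p + 1e-12)
    return total

def stagewise_select(X, y, max_attrs=2):
    """Greedily add the attribute that most reduces the training NLL."""
    chosen = []
    best = nll(X, y, [], fit_logistic(X, y, []))
    for _ in range(max_attrs):
        scored = []
        for c in range(len(X[0])):
            if c in chosen:
                continue
            w = fit_logistic(X, y, chosen + [c])
            scored.append((nll(X, y, chosen + [c], w), c))
        score, c = min(scored)
        if score >= best:
            break
        best, chosen = score, chosen + [c]
    return chosen
```

On data where only the first attribute carries signal, the loop picks it first, mimicking the "natural" attribute selection the abstract mentions.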
Combining Instance-Based and Model-Based Learning, 1993
Abstract

Cited by 143 (0 self)
This paper concerns learning tasks that require the prediction of a continuous value rather than a discrete class. A general method is presented that allows predictions to use both instance-based and model-based learning. Results with three approaches to constructing models and with eight datasets demonstrate improvements due to the composite method.
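One way to combine the two styles of learning, sketched here under the assumption of a single numeric attribute and a linear model (a minimal illustration, not necessarily the paper's exact composite method): fit a model, then correct its prediction at a query point by the average residual of the k nearest training instances.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def composite_predict(xs, ys, xq, k=2):
    """Model-based prediction plus an instance-based residual correction."""
    a, b = fit_line(xs, ys)
    model = lambda x: a + b * x
    resid = [y - model(x) for x, y in zip(xs, ys)]
    # Average the model's residuals at the k training points nearest the query.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - xq))
    return model(xq) + sum(resid[i] for i in order[:k]) / k
```

When the data are exactly linear the residual correction is zero, and the composite prediction coincides with the model's.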
Logic Regression, Journal of Computational and Graphical Statistics, 2003
Abstract

Cited by 76 (13 self)
The odyssey cohort study consists of 8,394 participants who donated blood samples in 1974 and 1989 in Washington County, Maryland. The cohort has been followed until 2001, and environmental factors such as smoking and dietary intake are available. The goals of the study include finding associations between polymorphisms in candidate genes and disease (including cancer and cardiovascular disease). In particular, gene-environment and gene-gene interactions associated with disease are of interest. Currently, SNP data from 51 sites are available for some 1600 subjects.
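Logic regression searches for Boolean combinations of binary predictors (such as SNP indicators) that predict an outcome. As a hypothetical miniature of that idea, the sketch below brute-forces all AND/OR pairs of binary features and scores them by accuracy; the real method searches a much larger space of logic trees with simulated annealing.

```python
from itertools import combinations

def best_logic_pair(X, y):
    """Exhaustively score every (feature_i OP feature_j) Boolean rule."""
    best = (-1.0, None)  # (accuracy, (i, j, op_name))
    for i, j in combinations(range(len(X[0])), 2):
        for name, op in (("and", lambda a, b: a and b),
                         ("or", lambda a, b: a or b)):
            preds = [int(op(row[i], row[j])) for row in X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            best = max(best, (acc, (i, j, name)))
    return best
```

On data generated by a single conjunction, the search recovers that conjunction exactly.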
StatLog: Comparison of Classification Algorithms on Large Real-World Problems, 1995
Abstract

Cited by 69 (0 self)
This paper describes work in the StatLog project comparing classification algorithms on large real-world problems. The algorithms compared were from: symbolic learning (CART, C4.5, NewID, AC2, ITrule, Cal5, CN2), statistics (Naive Bayes, k-nearest neighbor, kernel density, linear discriminant, quadratic discriminant, logistic regression, projection pursuit, Bayesian networks), and neural networks (backpropagation, radial basis functions). Twelve datasets were used: five from image analysis, three from medicine, and two each from engineering and finance. We found that which algorithm performed best depended critically on the dataset investigated. We therefore developed a set of dataset descriptors to help decide which algorithms are suited to particular datasets. For example, datasets with extreme distributions (skew > 1 and kurtosis > 7) and with many binary/categorical attributes (> 38%) tend to favor symbolic learning algorithms. We suggest how classification algorith...
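The dataset descriptors the abstract names are ordinary summary statistics, so the rule of thumb it states can be computed directly. A minimal sketch, applying only the thresholds quoted in the abstract (the full StatLog descriptor set is much richer):

```python
import statistics

def moments(xs):
    """Population skewness and kurtosis of one numeric column."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    skew = sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4)
    return skew, kurt

def favors_symbolic(numeric_cols, n_binary, n_attrs):
    """StatLog-style rule of thumb: extreme distributions (skew > 1,
    kurtosis > 7) plus many binary/categorical attributes (> 38%)."""
    extreme = all(s > 1 and k > 7 for s, k in map(moments, numeric_cols))
    return extreme and n_binary / n_attrs > 0.38
```

A heavy-tailed column (mostly zeros with one large outlier) trips the "extreme distribution" test, while a uniform-looking column does not.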
Structural Regression Trees, 1996
Abstract

Cited by 67 (10 self)
In many real-world domains the task of machine learning algorithms is to learn a theory predicting numerical values. In particular, several standard test domains used in Inductive Logic Programming (ILP) are concerned with predicting numerical values from examples and relational and mostly nondeterminate background knowledge. However, so far no ILP algorithm except one can predict numbers and cope with nondeterminate background knowledge. (The only exception is a covering algorithm called FORS.) In this paper we present Structural Regression Trees (SRT), a new algorithm which can be applied to the above class of problems by integrating the statistical method of regression trees into ILP. SRT constructs a tree containing a literal (an atomic formula or its negation) or a conjunction of literals in each node, and assigns a numerical value to each leaf. SRT provides more comprehensible results than purely statistical methods, and can be applied to a class of problems most other ILP syste...
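The statistical half of SRT, a regression tree with numeric leaf values, rests on a split criterion that can be shown in a few lines. This sketch finds one propositional split minimizing the summed squared error of the two leaves (SRT itself tests relational literals, which this toy attribute/threshold version does not attempt):

```python
def sse(vals):
    """Sum of squared errors around the mean (the leaf prediction)."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(X, y):
    """Exhaustively pick the (feature, threshold) test that minimizes the
    summed squared error of the two resulting leaves."""
    best_err, best = sse(y), None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            err = sse(left) + sse(right)
            if err < best_err:
                best_err = err
                best = (f, t, sum(left) / len(left), sum(right) / len(right))
    return best  # (feature, threshold, left leaf value, right leaf value)
```

Applying the split recursively to each side yields a full regression tree.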
Using Model Trees for Classification, 1997
Abstract

Cited by 67 (5 self)
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.
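The "standard method" of turning classification into function approximation is one-of-n encoding: fit one regression model per class on a 0/1 indicator target and predict the class whose model scores highest. A minimal sketch with plain least-squares lines standing in for model trees (an illustration of the transformation, not of M5' itself):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def classify_via_regression(xs, labels, xq):
    """One regressor per class on its 0/1 indicator; predict the argmax."""
    scores = {}
    for c in sorted(set(labels)):
        ind = [1.0 if l == c else 0.0 for l in labels]
        a, b = fit_line(xs, ind)
        scores[c] = a + b * xq
    return max(scores, key=scores.get)
```

Replacing `fit_line` with a model tree learner gives the scheme the abstract evaluates.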
Selecting Best Practices for Effort Estimation, IEEE Transactions on Software Engineering, 2006
Abstract

Cited by 64 (26 self)
Effort estimation often requires generalizing from a small number of historical projects. Generalization from such limited experience is an inherently underconstrained problem. Hence, the learned effort models can exhibit large deviations that prevent standard statistical methods (e.g., t-tests) from distinguishing the performance of alternative effort-estimation methods. The COSEEKMO effort-modeling workbench applies a set of heuristic rejection rules to comparatively assess results from alternative models. Using these rules, and despite the presence of large deviations, COSEEKMO can rank alternative methods for generating effort models. Based on our experiments with COSEEKMO, we advise a new view on supposed "best practices" in model-based effort estimation: 1) each such practice should be viewed as a candidate technique which may or may not be useful in a particular domain, and 2) tools like COSEEKMO should be used to help analysts explore and select the best method for a particular domain. Index Terms: model-based effort estimation, COCOMO, deviation, data mining.
Characterization of Classification Algorithms, 1995
Abstract

Cited by 60 (9 self)
This paper is concerned with the problem of characterization of classification algorithms. The aim is to determine under what circumstances a particular classification algorithm is applicable. The method used involves generation of different kinds of models. These include regression and rule models, piecewise linear models (model trees) and instance-based models. These are generated automatically on the basis of dataset characteristics and given test results. The lack of data is compensated for by various types of preprocessing. The models obtained are characterized by quantifying their predictive capability, and the best models are identified.
EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis
Abstract

Cited by 54 (3 self)
The domain name service (DNS) plays an important role in the operation of the Internet, providing a two-way mapping between domain names and their numerical identifiers. Given its fundamental role, it is not surprising that a wide variety of malicious activities involve the domain name service in one way or another. For example, bots resolve DNS names to locate their command and control servers, and spam emails contain URLs that link to domains that resolve to scam servers. Thus, it seems beneficial to monitor the use of the DNS system for signs that indicate that a certain name is used as part of a malicious operation. In this paper, we introduce EXPOSURE, a system that employs large-scale, passive DNS analysis techniques to detect domains that are involved in malicious activity. We use 15 features that we extract from the DNS traffic that allow us to characterize different properties of DNS names and the ways that they are queried. Our experiments with a large, real-world data set consisting of 100 billion DNS requests, and a real-life deployment for two weeks in an ISP show that our approach is scalable and that we are able to automatically identify unknown malicious domains that are misused in a variety of malicious activity (such as for botnet command and control, spamming, and phishing).
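Feature extraction of the kind the abstract describes starts from simple lexical properties of queried names. The sketch below computes a hypothetical subset (name length, digit ratio, label count); EXPOSURE's actual 15 features also cover time-based and DNS-answer-based properties that need full traffic traces:

```python
def dns_name_features(domain):
    """Toy lexical features of a queried DNS name (illustrative subset only).
    Algorithmically generated botnet names often have high digit ratios."""
    labels = domain.rstrip(".").split(".")
    name = labels[0]  # leftmost label, e.g. "x9k2q7" in "x9k2q7.example.com"
    digits = sum(ch.isdigit() for ch in name)
    return {
        "length": len(name),
        "digit_ratio": digits / len(name),
        "num_labels": len(labels),
    }
```

Feature vectors like these would then feed a trained classifier that flags likely-malicious domains.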
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-Intensive Analytics
Abstract

Cited by 54 (7 self)
Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (non-expert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she needs to pay only for the resources used and duration of use. Second, cloud platforms enable users to bypass the traditional middleman, the system administrator, in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user is now faced regularly with complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload. In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries using an automated technique that uses a mix of job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
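The shape of a cluster sizing query can be illustrated with a deliberately crude what-if model: assume a profiled job splits into a serial part and a perfectly parallel part (an Amdahl-style runtime model), then pick the cheapest node count that meets a deadline. This is a hypothetical stand-in for the Elastisizer's profile-driven black-box/white-box estimators:

```python
def cheapest_cluster(serial_s, parallel_s, price_per_node_hour,
                     deadline_s, max_nodes=64):
    """Return (nodes, cost) minimizing cost among sizes meeting the deadline,
    under the toy runtime model: runtime = serial + parallel / nodes."""
    best = None
    for n in range(1, max_nodes + 1):
        runtime = serial_s + parallel_s / n
        if runtime > deadline_s:
            continue  # this size misses the deadline
        cost = n * price_per_node_hour * runtime / 3600
        if best is None or cost < best[1]:
            best = (n, cost)
    return best
```

Under this model, adding nodes always raises cost once the deadline is met, so the smallest feasible cluster wins; real workloads need the per-phase profiles and simulation the paper describes.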