## Segmented regression estimators for massive data sets (2002)

Venue: Second SIAM International Conference on Data Mining

Citations: 12 (6 self)

### BibTeX

@INPROCEEDINGS{Natarajan02segmentedregression,

author = {Ramesh Natarajan and Edwin Pednault},

title = {Segmented regression estimators for massive data sets},

booktitle = {Second SIAM International Conference on Data Mining},

year = {2002}

}

### Abstract

We describe two methodologies for obtaining segmented regression estimators from massive training data sets. The first methodology, called Linear Regression Tree (LRT), is used for continuous response variables, and the second and complementary methodology, called Naive Bayes Tree (NBT), is used for categorical response variables. These are implemented in the IBM ProbE™ (Probabilistic Estimation) data mining engine, which is an object-oriented framework for building classes of segmented predictive models from massive training data sets. Based on this methodology, an application called ATM-SE™ for direct-mail targeted marketing has been developed jointly with Fingerhut Business Intelligence [1].

### Citations

3909 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...the accuracy of the resulting models were comparable or better than their own benchmark models (see also Section (6)). The basic segmentation approach is based on decision tree algorithms (e.g., CART [6], C4.5 [21]), especially the developments in [17], [11], [12] where more elaborate probability models are used in the tree leaf nodes. A detailed qualitative comparison with this previous work is give... |

2868 | UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996
Citation Context ...s are given in Section (6). The primary difference is that our computational emphasis is not on the typically small, homogenous benchmark data sets in the statistical and machine learning literature ([4], [5]), but on massive (out-of-core), heterogenous training data sets (treated without sub-sampling) that are appearing in many emerging data mining applications. 2 Background Let y denote the response... |

2307 | Estimating the dimension of a model
- Schwarz
- 1978
Citation Context ... each data point being evaluated and summed. In [19], we described a Monte Carlo heuristic to approximate the troublesome term B in (26), which in conjunction with a BIC penalized likelihood approach [22] can be used to obtain the best feature set for a given Naive Bayes model using just two training data scans. Furthermore, along with some further heuristics, the binary merging step in ProbE can also... |

1203 | Categorical Data Analysis
- Agresti
- 1990
Citation Context ...o evaluate the empirical log-likelihood function for unseen data (the smoothing parameters $\lambda_k$ can be regarded as the parameters of a Bayesian Dirichlet conjugate prior for the multinomial probability [2], with the special case $\lambda_k = 1$ being the well-known Laplace smoothing criterion). With this model for $\pi_k(t)$, we then have $L_{TR}(t) = -\sum_{k=1}^{K} \frac{N_k(t)}{N(t)} \log(\pi_k(t))$ (4). The recursive partitioning procedu... |

777 | C4.5: Programs for
- Quinlan
- 1993
Citation Context ...cy of the resulting models were comparable or better than their own benchmark models (see also Section (6)). The basic segmentation approach is based on decision tree algorithms (e.g., CART [6], C4.5 [21]), especially the developments in [17], [11], [12] where more elaborate probability models are used in the tree leaf nodes. A detailed qualitative comparison with this previous work is given in sectio... |

601 | Numerical methods for least squares problems
- Bjorck
- 1996
Citation Context ...e use of the Gaussian probability model $\theta_y(X,t) \sim N\left(a_0 + \sum_{j=1}^{J} a_j(t) X_j,\ \sigma(t)^2\right)$ (8) in each segment $t$, with the regression parameters computed by the well-known normal equations method [7] (see p. 224), [3] (see p. 49). This method satisfies the two requirements specified at the end of Section (3), since the sufficient statistics (in this case the data means and covariances) can be obtained from a singl... |

312 | Estimating continuous distributions in Bayesian classifiers
- John, Langley
- 1995
Citation Context ...ining data scan. Continuous-valued covariates can be used in this model (24) by fitting parametric or non-parametric models to the relevant univariate conditional distributions from the training data [9], or as we have done, by discretizing and binning, to obtain a corresponding derived categorical variable [13]. The simplest uniform discretization can be very effective for Naive Bayes modeling as sh... |

222 | Learning with continuous classes
- Quinlan
- 1992
Citation Context ...ults similar to earlier programs such as CART [6] and C4.5 [21]. 4.3 Comparison with Previous Work Decision tree algorithms with linear regression node models have been considered before ([17], [11], [20]). The Tree-Structured Regression (TSR) code in [17] does not use feature selection in the segment models, and the computationally-expensive regression algorithms used there are only practical with sm... |

211 | Induction of selective Bayesian classifiers
- Langley, Sage
- 1994
Citation Context ...with just three training data scans, as shown below. 5.2 Feature selection and Computational heuristics The forward-selection algorithm for introducing features in the Naive Bayes model is similar to [16] but with a different feature selection criterion based on the maximum induced decrease in the empirical negative log-likelihood of the training data (this is consistent with the scoring function used... |

177 | Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid
- Kohavi
- 1996
Citation Context ...er than their own benchmark models (see also Section (6)). The basic segmentation approach is based on decision tree algorithms (e.g., CART [6], C4.5 [21]), especially the developments in [17], [11], [12] where more elaborate probability models are used in the tree leaf nodes. A detailed qualitative comparison with this previous work is given in sections (4.3) and (5.3), and experimental results are g... |

157 | An Exploratory Technique for Investigating Large Quantities of Categorical Data
- Kass
- 1980
Citation Context ... for each multiway split of an existing rule have been trained, the best possible binary rule split on a given feature is found by a bottom-up binary merging procedure analogous to that used in CHAID [10]. Specifically, for each feature, two segments of this multi-way split are merged so that for the resulting segment this leads to the minimum increase in the scoring function. This binary merging proc... |

152 | Estimating the dimension of a model. Annals of Statistics
- Schwarz
- 1978
Citation Context ... each data point being evaluated and summed. In [19], we described a Monte Carlo heuristic to approximate the troublesome term B in (26), which in conjunction with a BIC penalized likelihood approach [22] can be used to obtain the best feature set for a given Naive Bayes model using just two training data scans. Furthermore, along with some further heuristics, the binary merging step in ProbE can also... |

133 | Introduction to linear regression analysis - Montgomery, Peck, et al. - 2012 |

110 | Semi-naive Bayesian classifier - Kononenko - 1991 |

103 | Error-based and entropy-based discretization of continuous features
- Kohavi, Sahami
- 1996
Citation Context ...ametric models to the relevant univariate conditional distributions from the training data [9], or as we have done, by discretizing and binning, to obtain a corresponding derived categorical variable [13]. The simplest uniform discretization can be very effective for Naive Bayes modeling as shown by [8]. Therefore, if the covariate $X_j$ takes on $M_j$ values denoted by $1, 2, \ldots, M_j$ respectively, the estimate... |

16 | Why discretization works for naive bayesian classifiers
- Hsu, Huang, et al.
- 2000
Citation Context ... we have done, by discretizing and binning, to obtain a corresponding derived categorical variable [13]. The simplest uniform discretization can be very effective for Naive Bayes modeling as shown by [8]. Therefore, if the covariate $X_j$ takes on $M_j$ values denoted by $1, 2, \ldots, M_j$ respectively, the estimate $P_{jmk}(t)$ for $P(X_j = m \mid y = k, t)$ in (24) is given by $P_{jmk}(t) = \frac{N_{jmk}(t) + \lambda_{jmk}}{N_k(t) + \sum_{m'=1}^{M_j} \lambda_{jm'k}}$, (... |

13 | BAYDA: Software for Bayesian classification and feature selection - Kontkanen, Myllymaki, et al. - 1998 |

5 | Using Simulated Pseudo Data To Speed Up Statistical Predictive Modeling, to appear
- Natarajan, Pednault
- 2001
Citation Context ...ed model object. In all cases, the evaluation of the term denoted by B in (26) requires a separate pass over the training data, with the contribution of each data point being evaluated and summed. In [19], we described a Monte Carlo heuristic to approximate the troublesome term B in (26), which in conjunction with a BIC penalized likelihood approach [22] can be used to obtain the best feature set for ... |

2 | Statlog project datasets, http://www.nccp.up.pt/liacc/ML/statlog
- Brazdil, Gama
Citation Context ... given in Section (6). The primary difference is that our computational emphasis is not on the typically small, homogenous benchmark data sets in the statistical and machine learning literature ([4], [5]), but on massive (out-of-core), heterogenous training data sets (treated without sub-sampling) that are appearing in many emerging data mining applications. 2 Background Let y denote the response vari... |

1 | Segmentation-Based Modeling For
- Apte, Bibelnieks, et al.
- 2001
Citation Context ...e models from massive training data sets. Based on this methodology, an application called ATM-SE™ for direct-mail targeted marketing has been developed jointly with Fingerhut Business Intelligence [1]. Keywords Segmentation-based models, decision trees, linear regression, logistic regression, feature selection, Naive Bayes. 1 Introduction Fingerhut Companies Inc. is a leading database marketing c... |

1 | Linear regression in regression tree leaves
- Karalic
- 1992
Citation Context ...r better than their own benchmark models (see also Section (6)). The basic segmentation approach is based on decision tree algorithms (e.g., CART [6], C4.5 [21]), especially the developments in [17], [11], [12] where more elaborate probability models are used in the tree leaf nodes. A detailed qualitative comparison with this previous work is given in sections (4.3) and (5.3), and experimental results... |

1 | Tree-Structured Regression (TSR) User Guide, available from http://www.stat.wisc.edu/p/stat/ftp/pub/loh/treeprogs/tsr/tsrman.zip
- Loh, Yan
- 1995
Citation Context ...able or better than their own benchmark models (see also Section (6)). The basic segmentation approach is based on decision tree algorithms (e.g., CART [6], C4.5 [21]), especially the developments in [17], [11], [12] where more elaborate probability models are used in the tree leaf nodes. A detailed qualitative comparison with this previous work is given in sections (4.3) and (5.3), and experimental r... |
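
Two ideas recur in the citation contexts above: segment regression parameters computed by the normal equations from sufficient statistics gathered in a single data scan, and smoothed count-based probability estimates for the Naive Bayes model. The following is a minimal Python sketch of both, not the ProbE implementation. The function names (`accumulate_moments`, `solve_normal_equations`, `smoothed_estimate`) are hypothetical, raw cross-product sums stand in for the means and covariances mentioned in the paper, and a single uniform smoothing constant `lam` is assumed in place of per-cell smoothing parameters.

```python
from typing import Iterable, List, Sequence, Tuple


def accumulate_moments(
    rows: Iterable[Tuple[Sequence[float], float]], n_features: int
) -> Tuple[List[List[float]], List[float]]:
    """One pass over (x, y) pairs: accumulate Z^T Z and Z^T y with Z = [1, x]."""
    d = n_features + 1
    ztz = [[0.0] * d for _ in range(d)]
    zty = [0.0] * d
    for x, y in rows:
        z = [1.0, *x]  # the leading 1 carries the intercept term
        for i in range(d):
            zty[i] += z[i] * y
            for j in range(d):
                ztz[i][j] += z[i] * z[j]
    return ztz, zty


def solve_normal_equations(ztz: List[List[float]], zty: List[float]) -> List[float]:
    """Solve (Z^T Z) a = Z^T y by Gaussian elimination with partial pivoting."""
    d = len(zty)
    aug = [row[:] + [rhs] for row, rhs in zip(ztz, zty)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular moment matrix; features are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * d
    for i in reversed(range(d)):  # back substitution
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, d))
        coeffs[i] = (aug[i][d] - tail) / aug[i][i]
    return coeffs


def smoothed_estimate(count_jmk: float, count_k: float, n_values: int, lam: float = 1.0) -> float:
    """(N_jmk + lam) / (N_k + n_values * lam); lam = 1 is Laplace smoothing."""
    return (count_jmk + lam) / (count_k + n_values * lam)
```

Because the accumulated sums are additive, the statistics for two segments can be combined by adding their matrices element-wise, which is one reason this style of estimator suits data that is scanned once and never held in memory. Forming the normal equations explicitly can be numerically fragile for ill-conditioned features, so this sketch is meant for illustration only.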