Privacy-Preserving Data Mining, 2000
Cited by 483 (3 self)
A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
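The idea of estimating a distribution from perturbed records can be illustrated with a minimal sketch. The abstract does not spell out the reconstruction procedure, so everything here is an assumption made for illustration: additive uniform noise of known width, a fixed set of bin centers, and an iterative Bayesian update. The names `noise_density` and `reconstruct_distribution` are hypothetical.

```python
import random


def noise_density(delta, half_width):
    """Density of the assumed uniform noise on [-half_width, half_width]."""
    return 1.0 / (2 * half_width) if abs(delta) <= half_width else 0.0


def reconstruct_distribution(perturbed, bin_centers, half_width, iterations=50):
    """Estimate the probability of each bin from perturbed values only.

    Assumption: each perturbed value is original + noise, with the noise
    distribution known to the miner. Starts from a uniform guess and
    repeatedly applies a Bayes update averaged over all records.
    """
    k = len(bin_centers)
    estimate = [1.0 / k] * k
    for _ in range(iterations):
        updated = [0.0] * k
        for w in perturbed:
            # Posterior weight of each bin for this single perturbed value.
            weights = [noise_density(w - c, half_width) * p
                       for c, p in zip(bin_centers, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue  # value not explainable by any bin; skip it
            for j in range(k):
                updated[j] += weights[j] / total
        norm = sum(updated)
        estimate = [u / norm for u in updated]
    return estimate


# Usage with synthetic data: two groups of original values, then perturbed.
rng = random.Random(0)
original = ([rng.uniform(20, 30) for _ in range(100)]
            + [rng.uniform(60, 70) for _ in range(100)])
perturbed = [x + rng.uniform(-20, 20) for x in original]
centers = [5 + 10 * i for i in range(10)]
estimated = reconstruct_distribution(perturbed, centers, half_width=20.0)
```

Only `perturbed` and the noise width are passed to the estimator; the original values are never consulted, which is the point of the setting described in the abstract.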
Automatic Subspace Clustering of High Dimensional Data - Data Mining and Knowledge Discovery, 2005
Cited by 461 (11 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
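Finding dense regions in subspaces can be sketched with a grid and a bottom-up search. This is only an illustration under stated assumptions, not the algorithm as published: the abstract does not give the grid size, the density threshold, or the search order, so `bins`, `threshold`, and the function name `find_dense_units` are hypothetical, and the DNF description step is omitted.

```python
from collections import Counter
from itertools import combinations


def find_dense_units(points, bins, threshold, max_dim=2):
    """Return {k: set of dense k-dimensional grid units}.

    A unit is a sorted tuple of (dimension, interval_index) pairs.
    Points are assumed to lie in [0, 1) on every dimension.
    """
    n = len(points)
    d = len(points[0])
    cells = [tuple(min(int(x * bins), bins - 1) for x in p) for p in points]

    # One-dimensional units: count points per interval on each dimension.
    counts = Counter()
    for cell in cells:
        for dim in range(d):
            counts[((dim, cell[dim]),)] += 1
    dense = {u for u, c in counts.items() if c / n >= threshold}
    result = {1: dense}

    k = 1
    while dense and k < max_dim:
        k += 1
        # Candidates: join dense (k-1)-units that span k distinct dimensions
        # and whose (k-1)-dimensional projections are all dense.
        candidates = set()
        for a, b in combinations(dense, 2):
            merged = tuple(sorted(set(a) | set(b)))
            dims = {dim for dim, _ in merged}
            if len(merged) == k and len(dims) == k:
                if all(sub in dense for sub in combinations(merged, k - 1)):
                    candidates.add(merged)
        counts = Counter()
        for cell in cells:
            for unit in candidates:
                if all(cell[dim] == idx for dim, idx in unit):
                    counts[unit] += 1
        dense = {u for u, c in counts.items() if c / n >= threshold}
        if dense:
            result[k] = dense
    return result
```

Because the counts depend only on which cell each point falls into, the output does not change when the input records are reordered, which matches the order-insensitivity property named in the abstract.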
SPRINT: A scalable parallel classifier for data mining, 1996
Cited by 228 (7 self)
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portion of the entire dataset remain permanently in memory. This limits their suitability for mining over large databases. We present a new decision-tree-based classification algorithm, called SPRINT, that removes all of the memory restrictions and is fast and scalable. The algorithm has also been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. This parallelization, also presented here, exhibits excellent scalability as well. The combination of these characteristics makes the proposed algorithm an ideal tool for data mining.
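One way a decision-tree builder can avoid holding the whole dataset in memory is to scan a presorted list of (value, class) pairs once per attribute while keeping only class counts. The sketch below illustrates that idea; the gini impurity measure, the list layout, and the function names are assumptions for illustration and are not taken from the abstract.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a partition described by its class counts."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(attribute_list):
    """Find the best binary split of one numeric attribute.

    attribute_list: (value, class_label) pairs already sorted by value.
    Returns (weighted_gini, threshold); threshold is None if no split exists.
    Only two class histograms are kept while the list is scanned once.
    """
    n = len(attribute_list)
    below = Counter()
    above = Counter(label for _, label in attribute_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = attribute_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue  # cannot split between equal values
        n_below = i + 1
        score = ((n_below / n) * gini(below, n_below)
                 + ((n - n_below) / n) * gini(above, n - n_below))
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best
```

In this layout each attribute has its own list, so the scans for different attributes are independent of one another, which is one natural place to introduce parallelism.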
DMQL: A Data Mining Query Language for Relational Databases, 1996
Cited by 109 (6 self)
The emerging data mining tools and systems lead naturally to the demand of a powerful data mining query language, on top of which many interactive and flexible graphical user interfaces can be developed. This motivates us to design a data mining query language, DMQL, for mining different kinds of knowledge in relational databases. Portions of the proposed DMQL language have been implemented in our DBMiner system for interactive mining of multiple-level knowledge in relational databases. 1 Introduction Data mining is a promising field with flourishing R
Machine-Learning Research -- Four Current Directions
Cited by 102 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
The Quest Data Mining System - In Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1996
Cited by 72 (2 self)
This paper is a capsule summary of the current functionality and architecture of the Quest data mining System. Our overall approach has been to identify basic data mining operations that cut across applications and develop fast, scalable algorithms for their execution (Agrawal, Imielinski, & Swami 1993a). We wanted our algorithms to:
Partial classification using association rules - Proc. 3rd Int. Conf. on KDD, 1997
Cited by 57 (2 self)
Many real-life problems require a partial classification of the data. We use the term "partial classification" to describe the discovery of models that show characteristics of the data classes, but may not cover all classes and all examples of any given class. Complete classification may be infeasible or undesirable when there are a very large number of class attributes, most attribute values are missing, or the class distribution is highly skewed and the user is interested in understanding the low-frequency class. We show how association rules can be used for partial classification in such domains, and present two case studies: reducing telecommunications order failures and detecting redundant medical tests.
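Using association rules to characterize a single class of interest can be sketched as follows. This is a brute-force illustration only, assuming the usual support and confidence measures; the record format, the thresholds, and the function name `rules_for_class` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def rules_for_class(records, target_class, min_support, min_confidence, max_len=2):
    """Enumerate rules 'antecedent -> target_class'.

    records: list of (set_of_items, class_label) pairs.
    support    = fraction of all records that contain the antecedent
                 and belong to target_class
    confidence = fraction of records containing the antecedent that
                 belong to target_class
    Returns (antecedent, support, confidence) triples. Records of other
    classes are never modelled, so the result is a partial description.
    """
    n = len(records)
    items = sorted({item for record_items, _ in records for item in record_items})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            wanted = set(antecedent)
            covered = [label for record_items, label in records
                       if wanted <= record_items]
            if not covered:
                continue
            hits = sum(1 for label in covered if label == target_class)
            support = hits / n
            confidence = hits / len(covered)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return rules


# Usage with made-up order records; the item names are illustrative only.
orders = [
    ({"rush", "manual"}, "fail"),
    ({"rush"}, "fail"),
    ({"manual"}, "ok"),
    ({"standard"}, "ok"),
    ({"rush", "standard"}, "ok"),
]
failure_rules = rules_for_class(orders, "fail", min_support=0.2, min_confidence=0.6)
```

A rule is kept only when it says something reliable about the target class, so low-frequency classes can be studied without building a model that covers every example.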
DBMiner: A System for Mining Knowledge in Large Relational Databases - In Proc. 1996 Int'l Conf. on Data Mining and Knowledge Discovery (KDD'96), 1996
Cited by 49 (12 self)
A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level knowledge, and meta-rule guided mining, the system provides a user-friendly, interactive data mining environment with good performance.
Logic regression - Journal of Computational and Graphical Statistics, 2003
Cited by 27 (6 self)
Logic regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates. In many regression problems a model is developed that relates the main effects (the predictors or transformations thereof) to the response, while interactions are usually kept simple (two- to three-way interactions at most). Often, especially when all predictors are binary, the interaction between many predictors may be what causes the differences in response. This issue arises, for example, in the analysis of SNP microarray data or in some data mining problems. In the proposed methodology, given a set of binary predictors we create new predictors such as “X1, X2, X3, and X4 are true,” or “X5 or X6 but not X7 are true.” In more specific terms: we try to fit regression models of the form $g(E[Y]) = b_0 + b_1 L_1 + \cdots + b_n L_n$, where $L_j$ is any Boolean expression of the predictors. The $L_j$ and $b_j$ are estimated simultaneously using a simulated annealing algorithm. This article discusses how to fit logic regression models, how to carry out model selection for these models, and gives some examples.
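The model form above can be made concrete with a small sketch that evaluates the right-hand side for one observation. It only evaluates a given model; the simulated annealing search that estimates the Boolean terms and coefficients is not shown. The coefficient values, the function name `linear_predictor`, and the reading of "X5 or X6 but not X7" as (X5 or X6) and not X7 are assumptions for illustration.

```python
def linear_predictor(x, intercept, terms):
    """Compute b0 + b1*L1(x) + ... + bn*Ln(x).

    x: dict mapping predictor names to booleans.
    terms: list of (coefficient, boolean_function) pairs, where each
    function plays the role of one Boolean expression Lj.
    """
    return intercept + sum(b * int(bool(L(x))) for b, L in terms)


def all_of_x1_to_x4(x):
    """The expression 'X1, X2, X3, and X4 are true'."""
    return x["X1"] and x["X2"] and x["X3"] and x["X4"]


def x5_or_x6_not_x7(x):
    """One reading of 'X5 or X6 but not X7 are true'."""
    return (x["X5"] or x["X6"]) and not x["X7"]


# Hypothetical coefficients, chosen only to show the arithmetic.
model_terms = [(2.0, all_of_x1_to_x4), (-1.0, x5_or_x6_not_x7)]
observation = {"X1": True, "X2": True, "X3": True, "X4": True,
               "X5": False, "X6": True, "X7": False}
eta = linear_predictor(observation, intercept=0.5, terms=model_terms)
```

With an identity link the value of `eta` is the fitted mean; with another link function g, the fitted mean would be obtained by applying the inverse of g to `eta`.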
Parallel Classification for Data Mining on Shared-Memory Multiprocessors, 1998
Cited by 25 (2 self)
We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. Our performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup.

