Privacy-Preserving Data Mining
, 2000
Cited by 626 (3 self)
A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
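The abstract above does not spell out the reconstruction procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: noise is drawn uniformly from a known interval, the original values are assumed to fall on a small set of bin centers, and the distribution estimate is refined by repeated application of Bayes' rule. The names `perturb`, `reconstruct_distribution`, `noise_width`, and `bin_centers` are hypothetical and do not come from the paper.

```python
import random


def perturb(values, noise_width, rng):
    """Add independent uniform noise in [-noise_width, +noise_width] to each value."""
    return [v + rng.uniform(-noise_width, noise_width) for v in values]


def reconstruct_distribution(perturbed, bin_centers, noise_width, iterations=50):
    """Estimate the probability mass of the ORIGINAL values on bin_centers.

    Assumption: the noise density is known (uniform here), and only the
    perturbed values are observed.
    """
    def noise_density(delta):
        return 1.0 / (2 * noise_width) if abs(delta) <= noise_width else 0.0

    k = len(bin_centers)
    estimate = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * k
        for w in perturbed:
            # Posterior weight of each bin for this observation (Bayes' rule).
            weights = [noise_density(w - a) * p
                       for a, p in zip(bin_centers, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue  # no bin can explain this observation; skip it
            for idx, weight in enumerate(weights):
                updated[idx] += weight / total
        norm = sum(updated)
        if norm == 0.0:
            break  # nothing could be explained; keep the previous estimate
        estimate = [u / norm for u in updated]
    return estimate


# Usage: half the original values are 10, half are 50.
rng = random.Random(0)
original = [10.0] * 50 + [50.0] * 50
noisy = perturb(original, noise_width=5.0, rng=rng)
estimate = reconstruct_distribution(noisy, [10.0, 30.0, 50.0], noise_width=5.0)
```

No individual value is recovered here; only the share of records near each bin center is estimated, which is the quantity a decision-tree builder needs when choosing split points.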
Automatic Subspace Clustering of High Dimensional Data
 Data Mining and Knowledge Discovery
, 2005
Cited by 568 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
SPRINT: A scalable parallel classifier for data mining
, 1996
Cited by 258 (7 self)
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portion of the entire dataset remain permanently in memory. This limits their suitability for mining over large databases. We present a new decision-tree-based classification algorithm, called SPRINT, that removes all of the memory restrictions, and is fast and scalable. The algorithm has also been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. This parallelization, also presented here, exhibits excellent scalability as well. The combination of these characteristics makes the proposed algorithm an ideal tool for data mining.
Machine-Learning Research: Four Current Directions
Cited by 115 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
Range Queries in OLAP Data Cubes
 In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data
, 1997
Cited by 57 (1 self)
A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the subcube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/b^d of the size of the d-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary ...
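The prefix-sum idea for range-sum queries can be illustrated with a small two-dimensional sketch. This is not the paper's implementation: the function names `build_prefix_sums` and `range_sum` are hypothetical, the cube is a plain list of lists, and for simplicity the auxiliary array carries one extra row and column of zeros, so it is slightly larger than the cube.

```python
def build_prefix_sums(cube):
    """Return P where P[i][j] is the sum of cube[0..i-1][0..j-1].

    The extra zero row and column avoid boundary checks at query time.
    """
    rows, cols = len(cube), len(cube[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (cube[i][j]
                                    + prefix[i][j + 1]
                                    + prefix[i + 1][j]
                                    - prefix[i][j])
    return prefix


def range_sum(prefix, r_lo, r_hi, c_lo, c_hi):
    """Sum of cube[r_lo..r_hi][c_lo..c_hi] (inclusive) using four lookups."""
    return (prefix[r_hi + 1][c_hi + 1]
            - prefix[r_lo][c_hi + 1]
            - prefix[r_hi + 1][c_lo]
            + prefix[r_lo][c_lo])


# Usage: a 3 x 3 cube; the query covers rows 1..2 and columns 0..1.
cube = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
prefix = build_prefix_sums(cube)
total = range_sum(prefix, 1, 2, 0, 1)  # 4 + 5 + 7 + 8
```

Each query touches four cells of the auxiliary array regardless of how many cube cells the range covers, which is where the constant-time behaviour comes from. In d dimensions the same inclusion-exclusion uses 2^d lookups.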
Athena: Mining-based interactive management of text databases
 International Conference on Extending Database Technology
, 2000
Cited by 34 (2 self)
We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, underweighting long documents, and overweighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
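To make the difference between Laplace's and Lidstone's laws of succession concrete, here is a minimal sketch of the smoothed word-probability estimate used in a Naive Bayes text classifier. The function name `lidstone_probabilities` and the value of `lam` are illustrative assumptions; the abstract does not state which smoothing constant Athena uses. Laplace's law is the special case `lam = 1`.

```python
from collections import Counter


def lidstone_probabilities(tokens, vocabulary, lam=0.1):
    """Estimate P(word | class) as (count + lam) / (total + lam * |V|).

    tokens:     all word occurrences seen in one class's training documents
    vocabulary: the words to assign probabilities to
    lam:        smoothing constant (assumed value; lam = 1 gives Laplace's law)
    Tokens outside the vocabulary are ignored.
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)
    denom = total + lam * len(vocabulary)
    return {w: (counts[w] + lam) / denom for w in vocabulary}


# Usage: "text" never occurs in this class's training tokens.
vocabulary = ["data", "mining", "text"]
tokens = ["data", "data", "mining"]
laplace = lidstone_probabilities(tokens, vocabulary, lam=1.0)
lidstone = lidstone_probabilities(tokens, vocabulary, lam=0.5)
```

With a smaller `lam`, unseen words receive less probability mass, so the estimate stays closer to the observed frequencies while still avoiding zero probabilities.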
Parallel Classification for Data Mining on Shared-Memory Multiprocessors
, 1998
Cited by 26 (2 self)
We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. Our performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup.
Scalable Mining for Classification Rules in Relational Databases
 in Proceedings of the International Database Engineering & Application Symposium
, 1998
Cited by 10 (1 self)
Classification is a key function of many "business intelligence" toolkits and a fundamental building block in data mining. Immense data may be needed to train a classifier for good accuracy. The state-of-the-art classifiers [21, 25] need an in-memory data structure of size O(N), where N is the size of the training data, to achieve efficiency. For large data sets, such a data structure will not fit in the internal memory. The best previously known classifier does a quadratic number of I/Os for large N. In this paper, we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We built a prototype of MIND in the...
Parallel Inductive Logic in Data Mining
, 2000
Cited by 5 (1 self)
Data mining is the process of automatic extraction of novel, useful and understandable patterns from very large databases. High-performance, scalable, and parallel computing algorithms are crucial in data mining as datasets grow inexorably in size and complexity. Inductive logic is a research area in the intersection of machine learning and logic programming, which has been recently applied to data mining. Inductive logic studies learning from examples, within the framework provided by clausal logic. It provides a uniform and very expressive means of representation: all examples, background knowledge, and the induced theory are expressed in first-order logic. However, such an expressive representation is often computationally expensive. This report first presents the background for parallel data mining, the BSP model, and inductive logic programming. Based on the study, this report gives an approach to parallel inductive logic in data mining that solves the potential performance problem. Both a parallel algorithm and a cost analysis are provided. This approach is applied to a number of problems and it shows a superlinear speedup. To justify this analysis, I implemented a parallel version of a core ILP system, Progol, in C with the support of the BSP parallel model. Three test cases are provided and a double speedup
DROLAP - A Dense-Region Based Approach to Online Analytical Processing
, 1999
Cited by 5 (0 self)
ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP) are two opposing techniques for building Online Analytical Processing (OLAP) systems. MOLAP has good query performance but suffers when the data distribution in the multidimensional data cube is sparse. ROLAP can be built on mature RDBMS technology but its performance is not as competitive. Many data warehouses contain sparse but clustered multidimensional data. We propose a dense-region-based OLAP (DROLAP) system which surpasses both ROLAP and MOLAP in space efficiency and query performance. DROLAP applies the MOLAP approach on the dense regions discovered in the data, and handles the remaining small percentage of sparse points with the ROLAP approach. The core of building a DROLAP system lies in the mining of dense regions in a data cube. We have defined the dense region mining problem as an optimization problem. We show that conventional clustering techniques are not suitable for this problem, and have developed an efficient...