Results 1-10 of 20
Privacy-Preserving Data Mining
2000
Cited by 819 (3 self)
A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
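The idea in the abstract above, recovering a distribution from perturbed values without recovering any individual value, can be illustrated with a small sketch. The abstract does not spell out the reconstruction procedure, so the code below is an assumption: it applies a standard iterative Bayes-rule update over a few discrete bins, with uniform noise of known width. The names `reconstruct`, `uniform_noise_pdf`, `midpoints`, and the toy data are all hypothetical.

```python
import random

def uniform_noise_pdf(delta, half_width=25.0):
    """Density of the (assumed, publicly known) uniform perturbation."""
    return 1.0 / (2 * half_width) if abs(delta) <= half_width else 0.0

def reconstruct(observed, midpoints, noise_pdf, iterations=50):
    """Estimate the distribution of original values over `midpoints`
    from perturbed observations, using repeated Bayes-rule updates."""
    k = len(midpoints)
    est = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        new = [0.0] * k
        for w in observed:
            # Posterior weight of each bin for this one perturbed value.
            weights = [noise_pdf(w - m) * p for m, p in zip(midpoints, est)]
            z = sum(weights)
            if z == 0.0:
                continue  # observation incompatible with every bin
            for b in range(k):
                new[b] += weights[b] / z
        total = sum(new)
        est = [v / total for v in new]
    return est

# Toy data: 80% of original values are 20, 20% are 60 (an assumption).
random.seed(0)
originals = [20.0 if random.random() < 0.8 else 60.0 for _ in range(2000)]
perturbed = [x + random.uniform(-25.0, 25.0) for x in originals]

estimate = reconstruct(perturbed, [20.0, 40.0, 60.0], uniform_noise_pdf)
print([round(p, 2) for p in estimate])
```

Only `perturbed` and the noise density are given to `reconstruct`; the original values are never consulted, which is the property the abstract relies on.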
Automatic Subspace Clustering of High Dimensional Data
Data Mining and Knowledge Discovery, 2005
Cited by 726 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
SPRINT: A scalable parallel classifier for data mining
1996
Cited by 310 (8 self)
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portion of the entire dataset remain permanently in memory. This limits their suitability for mining over large databases. We present a new decision-tree-based classification algorithm, called SPRINT, that removes all of the memory restrictions, and is fast and scalable. The algorithm has also been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. This parallelization, also presented here, exhibits excellent scalability as well. The combination of these characteristics makes the proposed algorithm an ideal tool for data mining.
Machine-Learning Research: Four Current Directions
Cited by 144 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
Exact and Approximation Algorithms for Clustering
1997
Cited by 80 (6 self)
In this paper we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the $k$-center problem in $\mathbb{R}^d$, under $L_\infty$ and $L_2$ metrics. The algorithm extends to other metrics, and can be used to solve the discrete $k$-center problem, as well. We also describe a simple $(1+\epsilon)$-approximation algorithm for the $k$-center problem, with running time $O(n \log k) + (k/\epsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$-time algorithm for solving the $L$-capacitated $k$-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the $L$-capacitated $k$-center problem.
Range Queries in OLAP Data Cubes
In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997
Cited by 64 (1 self)
A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at runtime. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the subcube circumscribed by a query. Alternatively, one can keep auxiliary information which is $1/b^d$ of the size of the $d$-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary ...
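The prefix-sum idea described in the abstract above can be sketched for a two-dimensional cube. This is a minimal illustration under assumptions, not the paper's implementation: the function names `build_prefix_sums` and `range_sum` and the sample cube are hypothetical, and only the full-size auxiliary array (not the blocked $1/b^d$ variant) is shown.

```python
from typing import List

def build_prefix_sums(cube: List[List[int]]) -> List[List[int]]:
    """Return P where P[i][j] is the sum of cube[0..i-1][0..j-1].
    P carries an extra zero row and column to avoid boundary checks."""
    rows, cols = len(cube), len(cube[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = cube[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P: List[List[int]], r1: int, c1: int, r2: int, c2: int) -> int:
    """Sum of cube[r1..r2][c1..c2] (inclusive) using four lookups,
    independent of how many cells the query range covers."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

cube = [
    [3, 5, 1],
    [7, 3, 2],
    [2, 4, 2],
]
P = build_prefix_sums(cube)
print(range_sum(P, 1, 1, 2, 2))  # 3 + 2 + 4 + 2 = 11
```

The inclusion-exclusion in `range_sum` generalizes to $d$ dimensions with $2^d$ lookups, which is why the query cost does not depend on the size of the queried subcube.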
Athena: Mining-based interactive management of text databases
International Conference on Extending Database Technology, 2000
Cited by 42 (3 self)
We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, underweighting long documents, and overweighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
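The difference between Laplace's and Lidstone's laws of succession mentioned in the abstract above comes down to the pseudo-count added to each word count. The sketch below shows the general formula only; the function name `smoothed_prob`, the value `lam=0.1`, and the toy counts are assumptions, and the abstract does not state which value Athena uses.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, lam):
    """Estimate P(word | class) as (count + lam) / (total + lam * vocab_size).
    lam = 1 gives Laplace's law; 0 < lam < 1 gives Lidstone's law."""
    total = sum(counts.values())
    return (counts[word] + lam) / (total + lam * vocab_size)

# Hypothetical word counts for one class, over a four-word vocabulary.
vocab = ["data", "mining", "text", "cube"]
counts = Counter({"data": 3, "mining": 1})

laplace_unseen = smoothed_prob("text", counts, len(vocab), lam=1.0)
lidstone_unseen = smoothed_prob("text", counts, len(vocab), lam=0.1)
print(laplace_unseen, lidstone_unseen)
```

With a smaller pseudo-count, unseen words take less probability mass away from words that were actually observed, which matters when the vocabulary is large relative to the training text.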
Parallel Classification for Data Mining on Shared-Memory Multiprocessors
1998
Cited by 33 (2 self)
We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. Our performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup.
Scalable Mining for Classification Rules in Relational Databases
In Proceedings of the International Database Engineering & Application Symposium, 1998
Cited by 9 (1 self)
Classification is a key function of many "business intelligence" toolkits and a fundamental building block in data mining. Immense data may be needed to train a classifier for good accuracy. The state-of-the-art classifiers [21, 25] need an in-memory data structure of size O(N), where N is the size of the training data, to achieve efficiency. For large data sets, such a data structure will not fit in the internal memory. The best previously known classifier does a quadratic number of I/Os for large N. In this paper, we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We built a prototype of MIND in the...
Data Mining: A Database Perspective.
In Proc. Int. Conf. Data Mining, 1998
Cited by 6 (0 self)
Data mining on large databases has been a major concern in the research community, due to the difficulty of analyzing huge volumes of data using only traditional OLAP tools. This sort of process requires a lot of computational power, memory and disk I/O, which can only be provided by parallel computers. We present a discussion of how database technology can be integrated with data mining techniques. Finally, we also point out several advantages of addressing data-consuming activities through a tight integration of a parallel database server and data mining techniques.
1 Introduction
Data mining techniques have increasingly been studied [7, 9, 21], especially in their application to real-world databases. One typical problem is that databases tend to be very large, and these techniques often repeatedly scan the entire set. Sampling has been used for a long time, but subtle differences among sets of objects become less evident. This work means to provide an overview of some important data mining...