Scaling Clustering Algorithms to Large Databases
- Microsoft Research Report
, 1998
Cited by 197 (5 self)
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.
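The abstract above does not say how a compressible region is represented once its points leave the memory buffer. As a minimal sketch, assuming a region is summarized by its point count, per-dimension sum, and per-dimension sum of squares (the `Summary` class and its method names are hypothetical, not taken from the report), a K-Means centroid can still be updated without the raw points:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    """Hypothetical compressed stand-in for a group of points.

    Assumption: count, linear sum, and sum of squares are enough
    to recover the mean and variance that a K-Means update needs.
    """
    n: int
    linear_sum: List[float]
    square_sum: List[float]

    @classmethod
    def empty(cls, dim: int) -> "Summary":
        return cls(0, [0.0] * dim, [0.0] * dim)

    def add(self, point: Sequence[float]) -> None:
        # Fold one point into the summary; the point can then be dropped.
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x

    def merge(self, other: "Summary") -> None:
        # Combine two summaries, e.g. when a new buffer load is compressed.
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.square_sum[i] += other.square_sum[i]

    def mean(self) -> List[float]:
        # Centroid of all points folded in so far.
        return [s / self.n for s in self.linear_sum]

    def variance(self) -> List[float]:
        # Per-dimension population variance: E[x^2] - (E[x])^2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]
```

Because the summary is additive, memory use depends on the number of summaries rather than the number of records, which is one plausible way to stay within a fixed buffer during a single scan.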
Entropy-based Subspace Clustering for Mining Numerical Data
, 1999
Cited by 89 (1 self)
Mining numerical data is a relatively difficult problem in data mining. Clustering is one of the techniques. We consider a database with numerical attributes, in which each transaction is viewed as a multi-dimensional vector. By studying the clusters formed by these vectors, we can discover certain behaviors hidden in the data. Traditional clustering algorithms find clusters in the full space of the data sets. This results in high dimensional clusters, which are poorly comprehensible to humans. One important task in this setting is the ability to discover clusters embedded in the subspaces of a high-dimensional data set. This problem is known as subspace clustering. We follow the basic assumptions of the previous work CLIQUE. It is found that the number of subspaces with clustering is very large, and a criterion called the coverage is proposed in CLIQUE for the pruning. In addition to coverage, we identify new useful criteria for this problem and propose an entropy-based algorithm called ENC...
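The abstract is cut off before it describes the algorithm, so the following is only an illustrative sketch of what an entropy criterion for a subspace could look like. It assumes an equal-width grid over the chosen dimensions and Shannon entropy of the cell occupancy; the function name, the grid scheme, and the `low`/`high` bounds are assumptions, not the authors' definitions.

```python
import math
from collections import Counter
from typing import Sequence


def subspace_entropy(points: Sequence[Sequence[float]],
                     dims: Sequence[int],
                     bins: int,
                     low: float = 0.0,
                     high: float = 1.0) -> float:
    """Shannon entropy (in bits) of grid-cell occupancy in a subspace.

    Assumption: every coordinate lies in [low, high]; each selected
    dimension is split into `bins` equal-width intervals.
    """
    width = (high - low) / bins
    counts = Counter()
    for p in points:
        # Project the point onto the chosen dimensions and find its cell;
        # min(...) keeps a coordinate equal to `high` in the last interval.
        cell = tuple(min(int((p[d] - low) / width), bins - 1) for d in dims)
        counts[cell] += 1
    total = len(points)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, points concentrated in a few cells give low entropy and points spread evenly over the grid give high entropy, so a low value is a hint that the subspace may contain clusters.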
Constructing Knowledge From Multivariate Spatiotemporal Data: Integrating Geographic Visualization (GVis) with Knowledge Discovery in Database (KDD) Methods
- International Journal of Geographical Information Science
, 1999
Cited by 49 (15 self)
In this paper, we develop an approach to the process of constructing knowledge through structured exploration of large spatiotemporal data sets. We begin by introducing our problem context and defining both Geographic Visualization (GVis) and Knowledge Discovery in Databases (KDD), the source domains for methods being integrated. Next, we review and compare recent GVis and KDD developments and consider the potential for their integration, emphasizing that an iterative process with user interaction is a central focus for uncovering interesting and meaningful patterns through each. We then introduce an approach to design of an integrated GVis-KDD environment directed to exploration and discovery in the context of spatiotemporal environmental data. The approach emphasizes a matching of GVis and KDD meta-operations. Following description of the GVis and KDD methods that are linked in our prototype system, we present a demonstration of the prototype applied to a typical spatiotemporal datas...
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 35 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
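To make the contrast with distance-based assignment concrete, here is a small sketch of the expectation step for a one-dimensional Gaussian mixture. The mixture form, the function names, and the parameters are assumptions chosen for illustration; the sketch shows only the standard soft-membership computation and none of the scalable decomposition the abstract refers to.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    # Density of a 1-D normal distribution with the given mean and variance.
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x: float,
                     weights: Sequence[float],
                     means: Sequence[float],
                     variances: Sequence[float]) -> List[float]:
    """E-step for one record: posterior probability of each component.

    K-Means would assign x to exactly one cluster; here every
    component receives a fractional share that sums to 1.
    """
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]
```

A record halfway between two equally weighted components with equal variance is shared evenly between them, while a record closer to one component receives a larger share from it.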
GeoVISTA Studio: A Codeless Visual Programming Environment For Geoscientific Data Analysis and Visualization
- Computational Geoscience
, 2002
Cited by 32 (3 self)
The fundamental goal of the GeoVISTA Studio project is to improve geoscientific analysis by providing an environment that operationally integrates a wide range of analysis activities, including those both computationally and visually based. We argue here that improving the infrastructure used in analysis has far-reaching potential to better integrate human-based and computationally-based expertise, and so ultimately improve scientific outcomes. But to address these challenges, some difficult system design and software engineering problems must be overcome. This paper illustrates the design of a component-oriented system, GeoVISTA Studio, as a means to overcome such difficulties by using state-of-the-art component-based software engineering techniques. Advantages described include: ease of program construction (visual programming), an open (non-proprietary) architecture, simple component-based integration and advanced deployment methods. This versatility has the potential to change the nature of systems development for the geosciences, providing better mechanisms to coordinate complex functionality, and as a consequence, to improve analysis by closer integration of software tools and better engagement of the human expert. Two example applications are presented to illustrate the potential of the Studio environment for exploring and better understanding large, complex geographical datasets and for supporting complex visual and computational analysis. Keywords: visual programming, exploratory data analysis (EDA), knowledge construction, Java, component-oriented programming (COP).
On Objective Measures of Rule Surprisingness.
- Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98)
, 1998
Cited by 26 (4 self)
Most of the literature argues that surprisingness is an inherently subjective aspect of the discovered knowledge, which cannot be measured in objective terms. This paper departs from this view, and it has a twofold goal: (1) showing that it is indeed possible to define objective (rather than subjective) measures of discovered rule surprisingness; (2) proposing new ideas and methods for defining objective rule surprisingness measures.
1 Introduction
A crucial aspect of data mining is that the discovered knowledge (usually expressed in the form of "if-then" rules) should be somehow interesting, where the term interestingness is arguably related to the properties of surprisingness (unexpectedness), usefulness and novelty of the rule [Fayyad et al. 96]. In this paper we are interested in quantitative, objective measures of one of the above three properties, namely rule surprisingness. In general, the evaluation of the interestingness of discovered rules has both an objective (data-driv...
Principles of Human Computer Collaboration for Knowledge Discovery in Science
, 1999
Cited by 23 (4 self)
An important problem in computational scientific discovery is to identify, among the diversity of discovery programs written in various sciences, a commonality that will take a next step beyond the acknowledged general (but weak) framework of heuristic search. We characterize discovery in science as the generation of novel, interesting, plausible, and intelligible knowledge about the objects of study. We then analyze four current machine discovery programs in chemistry, medicine, mathematics, and linguistics according to how their design, or the circumstances of their application, heighten the chances of finding knowledge that has all four properties. Some general patterns emerge, although some strategies seem idiosyncratic. Our candidate for a commonality, which focuses on human factors, can be used pragmatically to evaluate and compare the designs of discovery programs that are intended to be used as collaborators by scientists.
1 Introduction
Early work on machine scienti...
On rule interestingness measures
- Knowledge-Based Systems
, 1999
Cited by 23 (4 self)
This paper discusses several factors influencing the evaluation of the degree of interestingness of rules discovered by a data mining algorithm. The main goals of this paper are: (1) drawing attention to several factors related to rule interestingness that have been somewhat neglected in the literature; (2) showing some ways of modifying rule interestingness measures to take these factors into account; (3) introducing a new criterion to measure attribute surprisingness, as a factor influencing the interestingness of discovered rules.
Data Mining At The Interface Of Computer Science And Statistics
, 2001
Cited by 6 (0 self)
This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention both in the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications. Keywords: Data mining, statistics, pattern recognition, transaction data, correlation.
Computational and Visual Support for Geographical Knowledge Construction: Filling in the gaps between exploration and explanation
- Proceedings of the 10th International Symposium on Spatial Data Handling
, 2002
Cited by 6 (3 self)
Although many different types of data mining tools have been developed for geographic analysis, the broader perspective of geographic knowledge discovery (the stages required and their computational support) has been largely overlooked. This paper describes the process of knowledge construction as a number of inter-related activities and the support of these activities in an integrated visual and computational environment, GeoVISTA Studio. Results are presented showing examples of each stage in the knowledge construction process and a summary of the inter-relationships between visualization, computation, representation and reasoning is provided.

