Robust Data Clustering
, 2003
Cited by 122 (6 self)
We address the problem of robust clustering by combining data partitions (forming a clustering ensemble) produced by multiple clusterings. We formulate robust clustering under an information-theoretical framework; mutual information is the underlying concept used in the definition of quantitative measures of agreement or consistency between data partitions. Robustness is assessed by the variance of the cluster membership, based on bootstrapping. We propose and analyze a voting mechanism on pairwise associations of patterns for combining data partitions. We show that the proposed technique attempts to optimize the mutual-information-based criteria, although optimality is not ensured in all situations. This evidence accumulation method is demonstrated by using the well-known K-means algorithm to produce clustering ensembles. Experimental results show the ability of the technique to identify clusters with arbitrary shapes and sizes.
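As a rough illustration of the agreement measure this abstract refers to, the sketch below computes the mutual information between two partitions of the same items, plus one common normalization (twice the mutual information divided by the sum of the entropies). The function names and the choice of normalization are assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same number of items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def normalized_mutual_information(labels_a, labels_b):
    """Assumed normalization: 2 * I(A; B) / (H(A) + H(B))."""
    total = entropy(labels_a) + entropy(labels_b)
    if total == 0.0:
        # Both partitions put every item in one cluster, so they agree.
        return 1.0
    return 2.0 * mutual_information(labels_a, labels_b) / total
```

Under this normalization, two partitions that differ only in how the clusters are named score 1, and two statistically independent partitions score 0.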
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data
- in Proceedings of Second SIAM International Conference on Data Mining
, 2003
Combining multiple clusterings using evidence accumulation
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Cited by 23 (3 self)
We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble, a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: (1) applying different clustering algorithms, and (2) applying the same clustering algorithm with different values of parameters or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization; the individual data partitions are combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm on this matrix. We have developed a theoretical framework for the analysis of the proposed clustering combination strategy and its evaluation, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split and merge strategy based on the K-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.
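The voting step described in this abstract can be sketched as follows: each partition casts one vote for every pair of patterns it places in the same cluster, and the vote fractions form the n × n similarity matrix. The final step here is single-link agglomerative clustering cut at a fixed similarity threshold, which is equivalent to taking connected components of the thresholded matrix. The function names, the single-link choice, the 0.5 threshold and the toy ensemble are all assumptions for illustration, not details taken from the paper.

```python
def co_association(partitions):
    """Fraction of partitions that place each pair of items in the same cluster."""
    n = len(partitions[0])
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    m = len(partitions)
    return [[v / m for v in row] for row in votes]


def single_link_cut(similarity, threshold):
    """Single-link clustering cut at `threshold`.

    Items end up together when a chain of pairs with
    similarity >= threshold connects them.
    """
    n = len(similarity)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy ensemble (made up): three noisy partitions of six items.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 2],
]
matrix = co_association(ensemble)
consensus = single_link_cut(matrix, threshold=0.5)
```

In this toy case each partition disagrees with the others on at most one item, and the majority vote recovers the two groups of three.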
Discovery of Climate Indices Using Clustering
- In Proc. of the 9th ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining
, 2003
Cited by 14 (5 self)
To analyze the effect of the oceans and atmosphere on land climate, Earth scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. However, eigenvalue techniques are only useful for finding a few of the strongest signals. Furthermore, they impose a condition that all discovered signals must be orthogonal to each other, making it difficult to attach a physical interpretation to them. This paper presents an alternative clustering-based methodology for the discovery of climate indices that overcomes these limitations and is based on clusters that represent regions with relatively homogeneous behavior. The centroids of these clusters are time series that summarize the behavior of the ocean or atmosphere in those regions. Some of these centroids correspond to known climate indices and provide a validation of our methodology; other centroids are variants of known indices that may provide better predictive power for some land areas; and still other indices may represent potentially new Earth science phenomena. Finally, we show that cluster-based indices generally outperform SVD-derived indices, both in terms of area-weighted correlation and direct correlation with the known indices.
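To make the idea of a centroid acting as a candidate index concrete, the sketch below averages the time series of the grid points in one cluster and measures the direct (Pearson) correlation of that centroid with a reference index. The function names are hypothetical, and the area-weighted correlation mentioned in the abstract is not attempted here.

```python
from math import sqrt


def centroid(series_list):
    """Point-wise mean of equally long time series (one per grid point)."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

A centroid whose correlation with a known index is close to 1 or -1 would, in the abstract's terms, correspond to that index.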
Privacy Preserving Nearest Neighbor Search
, 2006
Cited by 5 (1 self)
Data mining is frequently obstructed by privacy concerns. In many cases data is distributed, and bringing the data together in one place for analysis is not possible due to privacy laws (e.g. HIPAA) or policies. Privacy preserving data mining techniques have been developed to address this issue by providing mechanisms to mine the data while giving certain privacy guarantees. In this work we address the issue of privacy preserving nearest neighbor search, which forms the kernel of many data mining applications. To this end, we present a novel algorithm based on secure multiparty computation primitives to compute the nearest neighbors of records in horizontally distributed data. We show how this algorithm can be used in three important data mining algorithms, namely LOF outlier detection, SNN clustering, and kNN classification.
K-means clustering versus validation measures: a data distribution perspective
- In KDD
, 2006
Cited by 4 (0 self)
K-means is a widely used partitional clustering method. While there are considerable research efforts to characterize the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect the performance of K-means clustering. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the “true” cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), is in a specific range, approximately from 0.3 to 1.0.
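The Coefficient of Variation of cluster sizes mentioned in this abstract is the standard deviation of the sizes divided by their mean. A minimal sketch follows; it assumes the sample standard deviation (n - 1 in the denominator), which the abstract does not state, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of Variation (stdev / mean) of the cluster sizes.

    Uses the sample standard deviation; this is an assumption.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to compute the CV")
    return stdev(sizes) / mean(sizes)
```

For example, a clustering with sizes 2, 4 and 6 has mean 4 and sample standard deviation 2, giving a CV of 0.5, which falls inside the range the abstract reports for K-means results.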
Predicting Land Temperature Using Ocean Data
Cited by 1 (0 self)
To analyze the effect of the oceans and atmosphere on land climate, Earth Scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. Recently, an alternative clustering-based methodology has been developed for identifying climate indices. This paper presents preliminary work evaluating the effectiveness of Sea Surface Temperature (SST) and Sea Level Pressure (SLP) cluster-based indices in predicting land temperature and their relative performance with respect to known climate indices. As part of
A Dissimilarity Measure for Clustering High- and Infinite Dimensional Data that Satisfies the Triangle Inequality
, 2002
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair are established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
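One plausible reading of the similarity-dissimilarity pair described here: for two vectors at angle θ, the cosine similarity is cos θ (the length of the projection of one unit vector onto the other) and the complementary dissimilarity is the length of the orthogonal component, sqrt(1 - cos² θ) = |sin θ|. The sketch below follows that reading only; the paper's exact definitions may differ, and the function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def orthogonal_dissimilarity(x, y):
    """Length of the component of unit(x) orthogonal to y: sqrt(1 - cos^2).

    Assumed definition, inferred from the abstract's wording.
    """
    c = cosine_similarity(x, y)
    # Clamp at zero to absorb rounding error when |c| is marginally above 1.
    return sqrt(max(0.0, 1.0 - c * c))
```

With x = (1, 0), y = (1, 1) and z = (0, 1), the dissimilarities are about 0.707 for (x, y) and (y, z) and 1 for (x, z), so the triangle inequality holds for this triple.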
Predicting Land Temperature Using Ocean Data
To analyze the effect of the oceans and atmosphere on land climate, Earth scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth's oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. Recently, an alternative clustering-based methodology has been developed for identifying climate indices. This paper presents preliminary work evaluating the effectiveness of Sea Surface Temperature (SST) and Sea Level Pressure (SLP) cluster-based indices in predicting land temperature and their relative performance with respect to known climate indices. As part of our effort, we studied the North Atlantic Oscillation (NAO) index, which is known to impact land temperature in the US, and its cluster-based counterpart, which is derived using daily SLP data from the Atlantic Ocean for a 25-year period (1979-2003). We also studied the predictive power of 28 SST clusters that were identified as the most promising clusters derived from monthly SST data for a 41-year period (1958-1998) [14]. These clusters were shown to be similar to well-known climate indices in terms of area-weighted correlation to global land temperature, and were considered prime candidates for further evaluation. Our preliminary results are very encouraging. They show that many of the cluster-based indices can outperform known climate indices in predicting anomalies in land temperature for certain parts of the world.
Mathematical Modeling in Industry XI Associating Earth-Orbiting Objects Detected by Astronomical Telescopes
, 2007
We address the problem of identifying streaks of an Earth-orbiting object detected by a telescope. The problem is reformulated as a clustering problem. A theoretical study is performed to show that the hierarchical algorithm fits the problem better than the k-means algorithm. The theory is tested through a series of experiments using Matlab routines for hierarchical clustering. The experiments lead to the conclusion that a theory is needed for choosing the cut-off parameter of the algorithm. Finally a section method is introduced for a future development of computationally efficient algorithms for large cardinality of the problem. The work completed gives a direction into what parts of the hierarchical algorithm need to be

