CURLER: Finding and visualizing nonlinear correlated clusters
In Proc. SIGMOD, 2005
Cited by 10 (0 self)
While much work has been done in finding linear correlation among subsets of features in high-dimensional data, work on detecting nonlinear correlation has been left largely untouched. In this paper, we present an algorithm for finding and visualizing nonlinear correlation clusters in the subspace of high-dimensional databases. Unlike the detection of linear correlation, in which clusters are of unique orientations, finding nonlinear correlation clusters of varying orientations requires merging clusters of possibly very different orientations. Combined with the fact that spatial proximity must be judged based on a subset of features that are not originally known, deciding which clusters to merge during the clustering process becomes a challenge. To avoid this problem, we propose a novel concept called co-sharing level, which captures both spatial proximity and cluster orientation when judging similarity between clusters. Based on this concept, we develop an algorithm which not only detects nonlinear correlation clusters but also provides a way to visualize them. Experiments on both synthetic and real-life datasets are done to show the effectiveness of our method.
Deriving Quantitative Models for Correlation Clusters
In Proc. 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 2006
Cited by 10 (4 self)
Correlation clustering aims at grouping the data set into correlation clusters such that the objects in the same cluster exhibit a certain density and are all associated to a common arbitrarily oriented hyperplane of arbitrary dimensionality. Several algorithms for this task have been proposed recently. However, all algorithms only compute the partitioning of the data into clusters. This is only a first step in the pipeline of advanced data analysis and system modelling. The second (post-clustering) step of deriving a quantitative model for each correlation cluster has not been addressed so far. In this paper, we describe an original approach to handle this second step. We introduce a general method that can extract quantitative information on the linear dependencies within a correlation clustering. Our concepts are independent of the clustering model and can thus be applied as a post-processing step to any correlation clustering algorithm. Furthermore, we show how these quantitative models can be used to predict the probability distribution that an object is created by these models. Our broad experimental evaluation demonstrates the beneficial impact of our method on several applications of significant practical importance.
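To make the idea of a quantitative model for a correlation cluster concrete, here is a minimal sketch for the simplest case: a 2-D cluster scattered around a line. It is not taken from the paper. It assumes that the linear dependency can be read off the covariance matrix, with the direction of least variance acting as the normal vector of the line. The function name `fit_line_model` and the sample data are hypothetical.

```python
import math


def fit_line_model(points):
    """Return (a, b, c) such that a*x + b*y = c describes the best-fit line.

    Illustrative only: a closed-form 2-D principal axis computation,
    not the method of the paper listed above.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the strong principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    # The weak axis is perpendicular to it and serves as the line's normal.
    a, b = -math.sin(theta), math.cos(theta)
    # The line passes through the cluster mean.
    c = a * mx + b * my
    return a, b, c


# Hypothetical cluster lying exactly on y = 2x + 1.
cluster = [(x, 2 * x + 1) for x in range(5)]
a, b, c = fit_line_model(cluster)
```

For this sample cluster the recovered equation is a scaled form of y = 2x + 1, so the ratio -a/b gives back the slope. In higher dimensions the same role would be played by the eigenvectors with the smallest eigenvalues, which this sketch does not cover.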
Comparing subspace clusterings
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 10 (1 self)
We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of important properties for any measure for comparing subspace clusterings and give a systematic comparison of our proposed measures in terms of these properties. We validate the usefulness of our subspace clustering distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices. Index Terms: Subspace clustering, projected clustering, distance, feature selection, cluster validation.
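As a rough illustration of what a distance between two subspace clusterings can look like, the sketch below treats each subspace cluster as a set of (object, dimension) cells and compares the two clusterings with a Jaccard-style distance. This is an assumption made for illustration, not one of the measures proposed in the paper. The names `cells` and `cell_jaccard_distance` and the sample clusterings are hypothetical.

```python
def cells(clustering):
    """Expand a clustering into the set of (object, dimension) cells it covers.

    A clustering is a list of (objects, dimensions) pairs, one per cluster.
    """
    return {
        (obj, dim)
        for objects, dims in clustering
        for obj in objects
        for dim in dims
    }


def cell_jaccard_distance(first, second):
    """Jaccard-style distance over covered cells (illustrative measure only)."""
    a, b = cells(first), cells(second)
    union = a | b
    if not union:
        # Two empty clusterings are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical clusterings: objects 1..4, dimensions 0..2.
first = [({1, 2, 3}, {0, 1})]
second = [({2, 3, 4}, {1, 2})]
distance = cell_jaccard_distance(first, second)
```

In this example each clustering covers 6 cells, 2 of which are shared, so 2 of the 10 cells in the union agree and the distance is 0.8. Because the measure only looks at covered cells, it also applies to partial and overlapping clusterings, although it ignores which cluster a cell belongs to.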
On Exploring Complex Relationships of Correlation Clusters
2007
Cited by 6 (6 self)
In high dimensional data, clusters often only exist in arbitrarily oriented subspaces of the feature space. In addition, these so-called correlation clusters may have complex relationships between each other. For example, a correlation cluster in a 1-D subspace (forming a line) may be enclosed within one or even several correlation clusters in 2-D superspaces (forming planes). In general, such relationships can be seen as a complex hierarchy that allows multiple inclusions, i.e. clusters may be embedded in several super-clusters rather than only in one. Obviously, uncovering the hierarchical relationships between the detected correlation clusters is an important information gain. Since existing approaches cannot detect such complex hierarchical relationships among correlation clusters, we propose the algorithm ERiC to tackle this problem and to visualize the result by means of a graph-based representation. In our experimental evaluation, we show that ERiC finds more information than state-of-the-art correlation clustering methods and outperforms existing competitors in terms of efficiency.
Mining hierarchies of correlation clusters
In Proc. SSDBM, 2006
Cited by 6 (5 self)
The detection of correlations between different features in high dimensional data sets is a very important data mining task. These correlations can be arbitrarily complex: One or more features might be correlated with several other features, and both noise features as well as the actual dependencies may be different for different clusters. Therefore, each cluster contains points that are located on a common hyperplane of arbitrary dimensionality in the data space and thus generates a separate, arbitrarily oriented subspace of the original data space. The few recently proposed algorithms designed to uncover these correlation clusters have several disadvantages. In particular, these methods cannot detect correlation clusters of different dimensionality which are nested into each other. The complete hierarchical structure of correlation clusters of varying dimensionality can only be detected by a hierarchical clustering approach. Therefore, we propose the algorithm HiCO (Hierarchical Correlation Ordering), the first hierarchical approach to correlation clustering. The algorithm determines the cluster hierarchy, and visualizes it using correlation diagrams. Several comparative experiments using synthetic and real data sets show the performance and the effectivity of HiCO.
Robust Clustering in Arbitrarily Oriented Subspaces
Cited by 6 (5 self)
In this paper, we propose an efficient and effective method to find arbitrarily oriented subspace clusters by mapping the data space to a parameter space defining the set of possible arbitrarily oriented subspaces. The objective of a clustering algorithm based on this principle is to find, among all the possible subspaces, those that accommodate many database objects. In contrast to existing approaches, our method can find subspace clusters of different dimensionality even if they are sparse or are intersected by other clusters within a noisy environment. A broad experimental evaluation demonstrates the robustness, efficiency and effectivity of our method.
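The principle of mapping data points to a parameter space of candidate subspaces can be illustrated with the classic 2-D Hough transform for lines, sketched below. This is a textbook stand-in chosen for illustration, not the paper's algorithm, and it does not handle subspaces of arbitrary dimensionality. The function name `hough_lines`, the bin sizes, and the sample points are assumptions.

```python
import math
from collections import Counter


def hough_lines(points, theta_step_deg=15, rho_step=0.5):
    """Vote for lines in normal form rho = x*cos(theta) + y*sin(theta).

    Each point votes once per candidate angle; cells with many votes
    correspond to lines that accommodate many points.
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Discretize rho so that nearby values share one accumulator cell.
            votes[(theta_deg, round(rho / rho_step))] += 1
    return votes


# Hypothetical data: ten points on the line y = x plus three noise points.
line_points = [(i, i) for i in range(10)]
noise = [(3, 7), (8, 1), (0, 5)]
votes = hough_lines(line_points + noise)
best_cell, best_votes = votes.most_common(1)[0]
```

With these settings the strongest cell is the one for a normal angle of 135 degrees and offset 0, which is the line y = x, and it collects the ten votes of the points on that line while the noise points land elsewhere. The coarse 15-degree grid is a deliberate simplification that keeps the example small.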
A general framework for increasing the robustness of PCA-based correlation clustering algorithms
In Proc. SSDBM, 2008
Cited by 6 (4 self)
Most correlation clustering algorithms rely on principal component analysis (PCA) as a correlation analysis tool. The correlation of each cluster is learned by applying PCA to a set of sample points. Since PCA is rather sensitive to outliers, if a small fraction of these points does not correspond to the correct correlation of the cluster, the algorithms are usually misled or even fail to detect the correct results. In this paper, we evaluate the influence of outliers on PCA and propose a general framework for increasing the robustness of PCA in order to determine the correct correlation of each cluster. We further show how our framework can be applied to PCA-based correlation clustering algorithms. A thorough experimental evaluation shows the benefit of our framework on several synthetic and real-world data sets.
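The sensitivity of PCA to outliers can be seen in a small 2-D sketch. It compares the principal direction computed from all points with one computed after trimming the points farthest from a median-based center. Trimming is a simple substitute chosen here for illustration and is not the framework proposed in the paper. The function names, the 10% trimming fraction, and the sample data are assumptions.

```python
import math
from statistics import median


def principal_angle_deg(points):
    """Angle (degrees) of the strongest principal axis of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return math.degrees(0.5 * math.atan2(2.0 * cxy, cxx - cyy))


def trimmed_principal_angle_deg(points, trim=0.1):
    """Same angle after dropping the `trim` fraction farthest from the
    coordinate-wise median (a simple robustification, assumed here)."""
    cx = median(x for x, _ in points)
    cy = median(y for _, y in points)
    ranked = sorted(points, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
    keep = len(points) - int(len(points) * trim)
    return principal_angle_deg(ranked[:keep])


# Hypothetical data: eleven points on y = x and a single outlier.
inliers = [(i, i) for i in range(-5, 6)]
data = inliers + [(0, 30)]
plain = principal_angle_deg(data)
robust = trimmed_principal_angle_deg(data)
```

On this data the single outlier pulls the plain estimate far away from the true 45-degree direction, while the trimmed estimate recovers it. A fixed trimming fraction only works when the share of outliers is roughly known, which is one reason more careful weighting schemes exist.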
Detection and visualization of subspace cluster hierarchies
In Proc. DASFAA, 2007
Cited by 6 (5 self)
Subspace clustering (also called projected clustering) addresses the problem that different sets of attributes may be relevant for different clusters in high dimensional feature spaces. In this paper, we propose the algorithm DiSH (Detecting Subspace cluster Hierarchies) that improves in the following points over existing approaches: First, DiSH can detect clusters in subspaces of significantly different dimensionality. Second, DiSH uncovers complex hierarchies of nested subspace clusters, i.e. clusters in lower-dimensional subspaces that are embedded within higher-dimensional subspace clusters. These hierarchies do not only consist of single inclusions, but may also exhibit multiple inclusions and thus, can only be modeled using graphs rather than trees. Third, DiSH is able to detect clusters of different size, shape, and density. Furthermore, we propose to visualize the complex hierarchies by means of an appropriate visualization model, the so-called subspace clustering graph, such that the relationships between the subspace clusters can be explored at a glance. Several comparative experiments show the performance and the effectivity of DiSH.
ELKI: A Software System for Evaluation of Subspace Clustering Algorithms
In Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, 2008
Cited by 4 (3 self)
In order to establish consolidated standards in novel data mining areas, newly proposed algorithms need to be evaluated thoroughly. Many publications compare a new proposition – if at all – with one or two competitors or even with a so-called “naïve” ad hoc solution. For the prolific field of subspace clustering, we propose a software framework implementing many prominent algorithms and, thus, allowing for a fair and thorough evaluation. Furthermore, we describe how new algorithms for new applications can be incorporated in the framework easily.
Detecting Clusters in Moderate-to-High Dimensional Data: Subspace Clustering, Pattern-based Clustering, and Correlation Clustering
Cited by 1 (1 self)
As a prolific research area in data mining, subspace clustering and related problems induced a vast amount of proposed solutions. However, many publications compare a new proposition – if at all – with one or two competitors or even with a so-called “naïve” ad hoc solution but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this tutorial, we try to clarify (i) the different problem definitions related to subspace clustering in general, (ii) the specific difficulties encountered in this field of research, (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches, and (iv) how several prominent solutions essentially tackle different problems.

