Results 1-10 of 12
Automatic Subspace Clustering of High Dimensional Data
Data Mining and Knowledge Discovery, 2005
Cited by 724 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
Approximation Algorithms for Projective Clustering
Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, 2000
Cited by 302 (21 self)
We consider the following two instances of the projective clustering problem: Given a set $S$ of $n$ points in $\mathbb{R}^d$ and an integer $k > 0$, cover $S$ by $k$ hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp., the maximum diameter of a hypercylinder) is minimized. Let $w$ be the smallest value so that $S$ can be covered by $k$ hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most $w$. In the plane, the two problems are equivalent. It is NP-Hard to compute $k$ planar strips of width even at most $Cw$, for any constant $C > 0$ [50]. This paper contains four main results related to projective clustering: (i) For $d = 2$, we present a randomized algorithm that computes $O(k \log k)$ strips of width at most $6w$ that cover $S$. Its expected running time is $O(nk^2 \log^4 n)$ if $k^2 \log k \le n$; it also works for larger values of $k$, but then the expected running time is $O(n^{2/3} k^{8/3} \log^4 n)$. We also propose another algorithm that computes a c...
Finding localized associations in market basket data
Knowledge and Data Engineering, 2002
Cited by 39 (1 self)
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very different from localized segments. In such cases, it is desirable to design algorithms which are effective in discovering localized associations, because they expose a customer pattern which is more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.
Hypergraph Models and Algorithms for Data-Pattern Based Clustering
Data Mining and Knowledge Discovery, 2004
Cited by 15 (2 self)
In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns through modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. For this, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
ClusterTree: Integration of Cluster Representation and Nearest Neighbor Search for Large Datasets with High Dimensionality
IEEE International Conference on Multimedia and Expo, 2000
Cited by 14 (1 self)
In this paper, we introduce the ClusterTree, a new indexing approach to representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of the subclusters. The subclusters can be further decomposed, which isolates the most related groups within the clusters. A ClusterTree is a hierarchy of clusters and subclusters which incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that most existing indexing techniques degrade rapidly as dimensionality increases. The ClusterTree can support the retrieval of the nearest neighbors effectively without having to linearly scan the high-dimensional dataset. We also discuss an approach to dynamically reconstruct the ClusterTree when new data are added. We present a detailed analysis of this approach and justify it extensively by experiments. Keywords: indexing, cluster representation, nearest neighbor search, high-dimensional datasets
A Framework for Finding Projected Clusters in High Dimensional Spaces
1999
Cited by 11 (2 self)
Clustering problems are well known in the database literature for their use in numerous applications such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore introduce a generalization of the clustering problem, referred to as the ...
A finite time stochastic clustering algorithm
We present a finite-time local search (1 + δ)-approximation method that finds the optimal solution with probability almost one with respect to a general measure of within-group dissimilarity. The algorithm is based on a finite-time Markov model of simulated annealing. A dynamic cooling schedule allows the control of the convergence. As its measure of within-group dissimilarity, the algorithm uses a new generalized Ward index based on a set of well-scattered representative points, which deals with the major weaknesses of partitioning algorithms regarding hyperspherical-shaped clusters and noise. We compare it with other clustering algorithms, such as CLIQUE and DBSCAN.
Market Segmentation for Mobile TV Content on Public Transportation by Integrating Innovation Adoption Model and Lifestyle Theory
An integrated approach based on the innovation adoption model and lifestyle theory is proposed for customer segmentation of mobile TV content on public transportation using multivariate statistical analysis. Because of its high number of daily trips and different train types, Taiwan Railway Administration is chosen as the case study. First, the content of mobile TV on the train is identified as the segmentation variable, and key factor facets for mobile TV content are renamed by using factor analysis. Then, cluster analysis is used to classify customer groups, which are named by analysis of variance (ANOVA), and market segments are described with demographic, lifestyle and train patronage variables by using cross analysis and Chi-squared independence tests. Finally, this paper discusses empirical results to provide valuable implications for better mobile TV content marketing strategies in the future.
Cross-Layer Peer-to-Peer Traffic Identification and Optimization Based on Active Networking
P2P applications appear to emerge as ultimate killer applications due to their ability to construct highly dynamic overlay topologies with rapidly varying and unpredictable traffic dynamics, which can constitute a serious challenge even for significantly over-provisioned IP networks. As a result, ISPs are facing new, severe network management problems that are not guaranteed to be addressed by statically deployed network engineering mechanisms. As a first step to a more complete solution to these problems, this paper proposes a P2P measurement, identification and optimisation architecture, designed to cope with the dynamicity and unpredictability of existing, well-known and future, unknown P2P systems. The purpose of this architecture is to provide ISPs with an effective and scalable approach to control and optimise the traffic produced by P2P applications in their networks. This can be achieved through a combination of different application-level and network-level programmable techniques, leading to a cross-layer identification and optimisation process. These techniques can be applied using Active Networking platforms, which are able to quickly and easily deploy architectural components on demand. This flexibility of the optimisation architecture is essential to address the rapid development of new P2P protocols and the variation of known protocols.
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm322
Annotation-based distance measures for patient subgroup discovery in clinical microarray studies