Detecting the change of clustering structure in categorical data streams
SIAM Data Mining Conference, 2006
Cited by 4 (2 self)
Analyzing clustering structures in data streams can provide critical information for making decisions in real time. Most research has focused on clustering algorithms for data streams. We argue that, more importantly, we need to monitor the change of clustering structure online. In this paper, we present a framework for detecting the change of critical clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework extends the work on determining the best K for static datasets (the BkPlot method) to categorical data streams with the help of a Hierarchical Entropy Tree structure (HE-Tree). The HE-Tree can efficiently capture the entropy property of categorical data streams and allows us to draw precise clustering information from the data stream for high-quality BkPlots. The experiments show that, by combining the HE-Tree with the BkPlot method, we are able to efficiently and precisely detect the change of critical clustering structure in categorical data streams.
A unified view on clustering binary data
Machine Learning
Cited by 4 (0 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data have been occupying a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
Non-redundant clustering
2005
Cited by 2 (0 self)
Data mining and knowledge discovery attempt to reveal concepts, patterns, relationships, and structures of interest in data. Typically, data may have many such structures. Most existing data mining techniques allow the user little say in which structure will be returned from the search. Those techniques which do allow the user control over the search typically require supervised information in the form of knowledge about a target solution. In the spirit of exploratory data mining, we consider the setting where the user does not have information about a target solution. Instead we suppose the user can provide information about solutions which are not desired. These undesired solutions may be previously obtained from data mining algorithms, or they may be known to the user a priori. The goal is then to discover novel structure in the dataset which is not redundant with respect to the known structure. Techniques should guide the search away from this known structure and towards novel, interesting structures. We describe and formally define the task of non-redundant clustering. Three different algorithmic approaches are derived for non-redundant clustering. Their performance is experimentally evaluated on data sets containing multiple clusterings. We explore how these techniques may be extended to systematically enumerate clusterings in a data set. Finally, we also investigate whether non-redundant approaches may be incorporated to enhance state-of-the-art supervised techniques.
On clustering binary data
Proceedings of the 2005 SIAM International Conference on Data Mining (SDM’05), 2005
Cited by 2 (0 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain a “bag of words”. The contribution of the paper is two-fold. First, a new clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.
The “best k” for entropy-based categorical data clustering
In Inter. Conf. on Scien. and Stat. Database Management, 2005
Cited by 1 (0 self)
With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed the important problem for categorical clustering – how can we determine the best number of clusters K for a categorical dataset? Since categorical data does not have an inherent distance function as the similarity measure, traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose a BkPlot method for determining a set of candidate “best Ks”. This method is implemented with a hierarchical clustering algorithm, ACE. The experimental results show that our approach can effectively identify the significant clustering structures.
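As a rough illustration of what an entropy criterion for categorical clustering can look like, the sketch below computes the size-weighted entropy of a partition of categorical records. It is not the BkPlot method or the ACE algorithm, whose details the abstract does not give; the function names `cluster_entropy` and `expected_entropy` and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def cluster_entropy(records):
    """Sum of per-attribute Shannon entropies (in bits) for one cluster.

    `records` is a non-empty list of equal-length tuples of categorical values.
    """
    n = len(records)
    total = 0.0
    for column in zip(*records):  # iterate over attributes
        counts = Counter(column)
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies of a partition."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Hypothetical toy data: two attributes, two obvious groups.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
one_cluster = expected_entropy([data])
two_clusters = expected_entropy([data[:2], data[2:]])
print(one_cluster, two_clusters)
```

On this toy data the entropy drops from 2 bits with a single cluster to 0 bits with two clusters. A sharp drop of this kind, followed by little further gain as K grows, is the sort of signal an entropy-based choice of K could look for.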
SCALE: A Scalable Framework for Efficiently Clustering Transactional Data
2009
Cited by 1 (0 self)
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive, and it allows the weight of each item in a cluster to be changed dynamically according to the occurrences of items. Second, we develop the weighted coverage density measure-based clustering algorithm, a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capture the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithms: (1) the candidates of the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner. Key words: transactional data clustering, cluster assessment, cluster validation, frequent itemset mining, weighted coverage density
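The abstract does not state the formula for coverage density or its weighted variant. The sketch below shows one plausible reading, given here as an assumption: coverage density is the filled fraction of a cluster's transaction-by-item grid, and the weighted variant weights each item by its share of all item occurrences in the cluster. The function names and the toy clusters are hypothetical.

```python
from collections import Counter


def item_occurrences(cluster):
    """Count, for each item, the number of transactions that contain it."""
    return Counter(item for transaction in cluster for item in set(transaction))


def coverage_density(cluster):
    """Filled cells divided by (transactions x distinct items). Assumed definition."""
    occ = item_occurrences(cluster)
    filled = sum(occ.values())
    return filled / (len(cluster) * len(occ))


def weighted_coverage_density(cluster):
    """Each item's coverage occ/N is weighted by occ/filled. Assumed definition."""
    occ = item_occurrences(cluster)
    filled = sum(occ.values())
    return sum(c * c for c in occ.values()) / (filled * len(cluster))


# Hypothetical toy clusters of non-empty transactions.
tight = [{"a", "b"}, {"a", "b"}]
loose = [{"a", "b"}, {"a", "c"}]
print(coverage_density(tight), weighted_coverage_density(tight))
print(coverage_density(loose), weighted_coverage_density(loose))
```

Under these assumed definitions, a cluster whose transactions share all their items scores 1.0 on both measures, and the weighted variant penalizes rarely occurring items less than the unweighted one because frequent items carry more weight.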
HE-Tree: A Framework for Detecting Changes in Clustering Structure for Categorical Data Streams
Cited by 1 (0 self)
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few works have proposed monitoring the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HE-Tree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K by combining it with our previously developed BKPlot method. The HE-Tree can efficiently summarize the entropy property of a categorical data stream and allows us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop the time-decaying HE-Tree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental results show that, by combining the HE-Tree with the BKPlot method, we are able to promptly and precisely detect the change of clustering structure in categorical data streams.
Online Entropy-based Model of Lexical Category Acquisition
Cited by 1 (1 self)
Children learn a robust representation of lexical categories at a young age. We propose an incremental model of this process which efficiently groups words into lexical categories based on their local context using an information-theoretic criterion. We train our model on a corpus of child-directed speech from CHILDES and show that the model learns a fine-grained set of intuitive word categories. Furthermore, we propose a novel evaluation approach by comparing the efficiency of our induced categories against other category sets (including traditional part of speech tags) in a variety of language tasks. We show the categories induced by our model typically
Determining the Best K for Clustering Transactional Datasets: A Coverage Density-based Approach
2008
The problem of determining the optimal number of clusters is important but mysterious in cluster analysis. In this paper, we propose a novel method to find a set of candidate optimal numbers of clusters (Ks) in transactional datasets. Concretely, we propose Transactional-cluster-modes Dissimilarity, based on the concept of coverage density, as an intuitive transactional inter-cluster dissimilarity measure. Based on this measure, an agglomerative hierarchical clustering algorithm is developed, and the Merge Dissimilarity Indexes, which are generated in the hierarchical cluster merging process, are used to find the candidate optimal numbers of clusters of transactional data. Our experimental results on both synthetic and real data show that the new method often effectively estimates the number of clusters of transactional data.
A Spectral Based Clustering Algorithm for Categorical Data with Maximum Modularity
In this paper we propose a spectral based clustering algorithm to maximize an extended Modularity measure for categorical data. First, we establish the connection with the Relational Analysis criterion. Second, the maximization of the extended Modularity is shown to be a trace maximization problem. A spectral based algorithm is then presented to search for the partitions maximizing the extended Modularity criterion. Experimental results indicate that the new algorithm is efficient and effective at finding a good clustering across a variety of real-world data sets.

