A comparison of document clustering techniques
In KDD Workshop on Text Mining, 2000
Abstract

Cited by 602 (29 self)
This paper presents the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means. (We used both a “standard” K-means algorithm and a “bisecting” K-means algorithm.) Our results indicate that the bisecting K-means technique is better than the standard K-means approach and (somewhat surprisingly) as good as or better than the hierarchical approaches that we tested.
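The bisecting procedure this entry describes, repeatedly splitting the largest cluster with a 2-means step, can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation; the function names, seeding, and "split the largest cluster" policy are illustrative assumptions.

```python
# Minimal sketch of bisecting K-means: grow from one cluster to k clusters
# by repeatedly running standard 2-means on the current largest cluster.
import random

def kmeans2(points, iters=20, seed=0):
    """Split a list of numeric tuples into two clusters with plain 2-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = kmeans2(largest)
        clusters.extend([left, right])
    return clusters
```

Other split policies (e.g., bisect the cluster with the highest within-cluster scatter) are common variants of the same scheme.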
Enhancing Data Analysis with Noise Removal
Abstract

Cited by 23 (5 self)
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of low-level data errors arising from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object.
K-means clustering versus validation measures: a data distribution perspective
In KDD, 2006
Abstract

Cited by 15 (2 self)
K-means is a widely used partitional clustering method. While there have been considerable research efforts to characterize the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions can impact the performance of K-means clustering. Indeed, in this paper, we revisit the K-means clustering problem by answering three questions. First, how can the “true” cluster sizes impact the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), is in a specific range, approximately from 0.3 to 1.0.
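The Coefficient of Variation used in this entry is simply the standard deviation of the cluster sizes divided by their mean. A quick illustration (the size vectors below are made up, not from the paper):

```python
# CV of cluster sizes: population standard deviation / mean.
# The paper reports K-means results tend to fall roughly in 0.3-1.0.
from statistics import pstdev, mean

def cv(cluster_sizes):
    return pstdev(cluster_sizes) / mean(cluster_sizes)

balanced = [100, 110, 90, 100]   # near-uniform sizes -> low CV
skewed   = [500, 20, 10, 5]      # highly skewed sizes -> high CV
```

A near-uniform partition like `balanced` gives a CV well under 0.3, while `skewed` gives a CV above 1, outside the range the paper associates with K-means output.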
A generalization of proximity functions for K-means
In 2007 Seventh IEEE International Conference on Data Mining (ICDM)
Abstract

Cited by 4 (2 self)
K-means is a widely used partitional clustering method. A large amount of effort has been made on finding better proximity (distance) functions for K-means. However, the common characteristics of proximity functions remain unknown. To this end, in this paper, we show that all proximity functions that fit K-means clustering can be generalized as a K-means distance, which can be derived from a differentiable convex function. A general proof of the sufficient and necessary conditions for K-means distance functions is also provided. In addition, we reveal that K-means has a general uniformization effect; that is, K-means tends to produce clusters with relatively balanced cluster sizes. This uniformization effect of K-means exists regardless of the proximity function. Finally, we have conducted extensive experiments on various real-world data sets, and the results show evidence of the uniformization effect. Also, we observed that external clustering validation measures, such as Entropy and Variation of Information (VI), have difficulty measuring clustering quality if the data have skewed distributions of class sizes.
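A distance "derived from a differentiable convex function" matches the familiar form of a Bregman divergence, D_phi(x, y) = phi(x) - phi(y) - &lt;grad phi(y), x - y&gt;; whether the paper's K-means distance is exactly this family is an assumption here, but it illustrates the construction. Choosing phi(x) = ||x||^2 recovers squared Euclidean distance:

```python
# Bregman divergence from a differentiable convex function phi:
#   D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
# With phi(x) = ||x||^2 this reduces to squared Euclidean distance.

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - sum(
        g * (a - b) for g, a, b in zip(grad_phi(y), x, y))

phi_sq  = lambda v: sum(a * a for a in v)   # phi(x) = ||x||^2
grad_sq = lambda v: [2 * a for a in v]      # grad phi(x) = 2x

x, y = (1.0, 2.0), (4.0, 6.0)
d = bregman(phi_sq, grad_sq, x, y)          # equals ||x - y||^2 = 25
```

Other convex choices of phi (e.g., the negative entropy, which yields KL divergence) give other members of the same family while keeping the centroid update of K-means well defined.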
Privacy Leakage in Multi-relational Databases: A Semi-supervised Learning Perspective
, 2006
Abstract

Cited by 1 (0 self)
In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive information. However, recent developments in data mining present a new challenge for database security even when traditional database security techniques, such as database access control, are employed. This paper presents a data mining framework using semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases. Many different types of semi-supervised learning techniques, such as the K-nearest neighbor (KNN) method, can be used to demonstrate privacy leakage. However, we also introduce a new approach to semi-supervised learning, hyperclique-pattern-based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects. Our experimental results show that both the KNN and HPSL methods have the ability to compromise database security, although HPSL is better at this privacy violation (i.e., achieves higher prediction accuracy) than the KNN method. Finally, we provide a principle for avoiding privacy leakage in multi-relational databases via semi-supervised learning and illustrate this principle with a simple preventive technique whose effectiveness is demonstrated by experiments.
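The KNN side of the attack can be pictured with a toy example: an adversary who knows the sensitive label for a few records in a released view infers it for the rest from attribute similarity. The records, attributes, and labels below are entirely made up for illustration; they are not from the paper.

```python
# Toy KNN inference over a released "view": predict a hidden sensitive
# label for an unlabeled record by majority vote of its nearest neighbors.
from collections import Counter

def knn_predict(labeled, query, k=3):
    """labeled: list of (feature_tuple, label); query: feature tuple."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(labeled, key=lambda rec: dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Non-sensitive attributes (age, income) with a few known sensitive labels.
labeled = [((30, 50_000), "low_risk"), ((32, 52_000), "low_risk"),
           ((60, 9_000), "high_risk"), ((58, 11_000), "high_risk")]
guess = knn_predict(labeled, (31, 51_000))   # unlabeled record in the view
```

The point of the paper's HPSL variant is that voting over hyperclique groups of objects, rather than individual neighbor pairs as above, yields even higher inference accuracy.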
Characterizing Pattern Preserving Clustering
Under consideration for publication in Knowledge and Information Systems, 2007
Abstract
This paper describes a new approach for clustering—pattern preserving clustering—which produces more easily interpretable and usable clusters. This approach is motivated by the following observation: while there are usually strong patterns in the data—patterns that may be key for the analysis and description of the data—these patterns are often split among different clusters by current clustering approaches. This is, perhaps, not surprising, since clustering algorithms have no built-in knowledge of these patterns and may often have goals that are in conflict with preserving patterns, e.g., minimizing the distance of points to their nearest cluster centroids. In this paper, our focus is to characterize (1) the benefits of pattern preserving clustering and (2) the most effective way of performing pattern preserving clustering. To that end, we propose and evaluate two clustering algorithms, HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting K-means Clustering with pAttern Preservation (KCAP). Experimental results on document data show that HICAP can produce overlapping clusters that preserve useful patterns, but has relatively worse clustering performance than bisecting K-means with respect to the clustering evaluation criterion of entropy. By contrast, in terms of entropy, KCAP can perform substantially better than the bisecting K-means algorithm when data sets contain clusters of widely different sizes—a common situation in the real world. Most importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
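One simple way to guarantee a pattern is never split, sketched below under the assumption of non-overlapping patterns, is to collapse each pattern's supporting objects into a single atomic unit before clustering, so any subsequent algorithm assigns the whole unit at once. This is only a preprocessing-style illustration of the preservation constraint, not the HICAP or KCAP algorithm itself.

```python
# Collapse each pattern (a set of object ids) into one atomic unit; objects
# not covered by any pattern become singleton units. Clustering the units
# instead of the objects means no pattern can be split across clusters.

def collapse_patterns(object_ids, patterns):
    """patterns: list of disjoint sets of object ids -> list of frozensets."""
    units, covered = [], set()
    for pat in patterns:
        units.append(frozenset(pat))
        covered |= pat
    units += [frozenset([o]) for o in object_ids if o not in covered]
    return units
```

Overlapping patterns, which HICAP handles by producing overlapping clusters, would require merging or duplicating units and are deliberately out of scope for this sketch.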
Enhancing Data Analysis with Noise Removal
Abstract
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the product of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.
Index Terms: Data cleaning, very noisy data, hyperclique pattern discovery, local outlier factor (LOF), noise removal.
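The distance-based cleaner among the three baselines can be sketched as follows: score each object by its distance to its k-th nearest neighbor and discard the requested fraction with the largest scores. The parameter values and function names are illustrative, not the paper's.

```python
# Distance-based noise removal: rank objects by k-th nearest-neighbor
# distance and drop the top `frac` fraction as noise. O(n^2), for clarity.

def remove_noise(points, k=2, frac=0.25):
    def kth_nn_dist(p):
        dists = sorted(sum((a - b) ** 2 for a, b in zip(p, q))
                       for q in points if q is not p)
        return dists[k - 1]
    ranked = sorted(points, key=kth_nn_dist)     # most isolated points last
    keep = len(points) - int(frac * len(points))
    return ranked[:keep]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (50, 50)]   # one obvious outlier
cleaned = remove_noise(data)
```

Note that unlike classical outlier removal, the paper's setting deliberately allows `frac` to be large, since weakly relevant objects, not just collection errors, are treated as noise.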
Privacy Leakage in Multi-relational Databases: A Semi-supervised Learning Perspective
The VLDB Journal (special issue paper), DOI 10.1007/s00778-006-0011-4, 2005
Abstract
In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive information. However, recent developments in data mining present a new challenge for database security even when traditional database security techniques, such as database access control, are employed. This paper presents a data mining framework using semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases. Many different types of semi-supervised learning techniques, such as the K-nearest neighbor (KNN) method, can be used to demonstrate privacy leakage. However, we also introduce a new approach to semi-supervised learning, hyperclique-pattern-based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects. A preliminary version of this work was published as a two-page short paper in ACM CIKM 2005 (Proceedings of the ACM Conference on Information and Knowledge Management).
Co-clustering Bipartite with Pattern Preservation for Topic Extraction
, 2007
Abstract
The duality between document and word clustering naturally leads to the consideration of storing the document data set in a bipartite graph. With documents and words modeled as vertices on the two sides respectively, partitioning such a graph yields a co-clustering of words and documents. The topic of each cluster can then be represented by the top words and documents that have the highest within-cluster degrees. However, such claims may fail if top words and documents are selected simply because they are very general and frequent. In addition, for those words and documents that span several topics, it may not be proper to assign them to a single cluster. In other words, to precisely capture the cluster topic, we need to identify those micro-sets of words/documents that are similar among themselves and, as a whole, representative of their respective topics. Along this line, in this paper, we use hyperclique patterns, strongly affiliated words/documents, to define such micro-sets. We introduce a new bipartite formulation that incorporates both word hypercliques and document hypercliques as super vertices. By co-preserving hyperclique patterns during the clustering process, our experiments on real-world data sets show that better clustering results can be obtained in terms of various external clustering validation measures.
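The super-vertex construction in this formulation can be illustrated on a toy word-document graph: words belonging to one hyperclique are merged into a single vertex whose document edges are the union of its members' edges. The graph, words, and document ids below are invented for illustration; the merging rule is an assumption consistent with the entry's description.

```python
# Merge a word hyperclique into one super vertex of a bipartite
# word -> documents adjacency map, taking the union of the members' edges.

def merge_hyperclique(word_to_docs, hyperclique):
    """word_to_docs: dict word -> set of doc ids; hyperclique: set of words."""
    merged = {w: docs for w, docs in word_to_docs.items()
              if w not in hyperclique}
    merged[frozenset(hyperclique)] = set().union(
        *(word_to_docs[w] for w in hyperclique))
    return merged

graph = {"nasa": {1, 2}, "shuttle": {1, 3}, "soccer": {4}}
g2 = merge_hyperclique(graph, {"nasa", "shuttle"})
```

Partitioning the merged graph then keeps the strongly affiliated words together by construction, which is the preservation property the paper exploits; document hypercliques would be merged symmetrically on the other side.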