Results 1-10 of 180
Survey of clustering data mining techniques
, 2002
Cited by 286 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
On the equivalence of nonnegative matrix factorization and spectral clustering
 in SIAM International Conference on Data Mining
, 2005
Cited by 90 (11 self)
Current nonnegative matrix factorization (NMF) deals with the X = FG^T type. We provide a systematic analysis and extensions of NMF to the symmetric W = HH^T and the weighted W = HSH^T. We show that (1) W = HH^T is equivalent to kernel K-means clustering and Laplacian-based spectral clustering; (2) X = FG^T is equivalent to simultaneous clustering of rows and columns of a bipartite graph. Algorithms are given for computing these symmetric NMFs.
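The first equivalence can be sketched as follows, under assumptions the abstract does not spell out: W is a symmetric n-by-n similarity (kernel) matrix, and H is an n-by-k nonnegative matrix with orthonormal columns, as a normalized cluster indicator would be.

```latex
\begin{aligned}
\|W - HH^{T}\|_F^{2}
  &= \|W\|_F^{2} - 2\,\operatorname{Tr}(H^{T} W H) + \|H^{T} H\|_F^{2} \\
  &= \|W\|_F^{2} - 2\,\operatorname{Tr}(H^{T} W H) + k
  \qquad \text{when } H^{T} H = I_k .
\end{aligned}
```

Since the first and last terms do not depend on H, minimizing the left-hand side over such H is the same as maximizing Tr(H^T W H), the trace form usually given for the kernel K-means objective. Symmetric NMF can then be read as keeping nonnegativity while relaxing the orthogonality constraint.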
Extracting social networks and contact information from email and the web
 In Proceedings of CEAS-1
, 2004
Cited by 88 (3 self)
We present an end-to-end system that extracts a user’s social network and its members’ contact information given the user’s email inbox. The system identifies unique people in email, finds their Web presence, and automatically fills the fields of a contact address book using conditional random fields, a type of probabilistic model well-suited for such information extraction tasks. By recursively calling itself on new people discovered on the Web, the system builds a social network with multiple degrees of separation from the user. Additionally, a set of expertise-describing keywords are extracted and associated with each person. We outline the collection of statistical and learning components that enable this system, and present experimental results on the real email of two users; we also present results with a simple method of learning transfer, and discuss the capabilities of the system for address-book population, expert-finding, and social network analysis.
SCAN: A Structural Clustering Algorithm for Networks
 In Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2004
Cited by 59 (3 self)
Network clustering (or graph partitioning) is an important task for the discovery of underlying structures in networks. Many algorithms find clusters by maximizing the number of intra-cluster edges. While such algorithms find useful and interesting structures, they tend to fail to identify and isolate two kinds of vertices that play special roles: vertices that bridge clusters (hubs) and vertices that are marginally connected to clusters (outliers). Identifying hubs is useful for applications such as viral marketing and epidemiology since hubs are responsible for spreading ideas or disease. In contrast, outliers have little or no influence, and may be isolated as noise in the data. In this paper, we proposed a novel algorithm called SCAN (Structural Clustering Algorithm for Networks), which detects clusters, hubs and outliers in networks. It clusters vertices based on a structural similarity measure. The algorithm is fast and efficient, visiting each vertex only once. An empirical evaluation of the method using both synthetic and real datasets demonstrates superior performance over other methods such as the modularity-based algorithms.
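The abstract does not define the structural similarity measure. A minimal sketch, assuming the form commonly attributed to SCAN (the number of shared members of two closed neighborhoods, normalized by the geometric mean of the neighborhood sizes), could look like the following. The function names and the toy edge list are hypothetical, and only the similarity is shown, not the clustering procedure itself.

```python
import math


def closed_neighborhoods(edges):
    """Map each vertex to the set containing itself and its neighbors."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, {u}).add(v)
        gamma.setdefault(v, {v}).add(u)
    return gamma


def structural_similarity(gamma, u, v):
    """Shared closed-neighborhood size over the geometric mean of the sizes.

    Assumed form of the measure; the abstract itself gives no formula.
    """
    shared = len(gamma[u] & gamma[v])
    return shared / math.sqrt(len(gamma[u]) * len(gamma[v]))


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3 on 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
gamma = closed_neighborhoods(edges)

for u, v in edges:
    print(u, v, round(structural_similarity(gamma, u, v), 3))
```

Under this assumed measure, vertices inside the triangle score higher with one another than the pendant vertex does with its single neighbor, which is the kind of contrast a threshold on the similarity could use to separate cluster members from marginal vertices.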
Ranking-based clustering of heterogeneous information networks with star network schema
 In: Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2009)
, 2009
Cited by 56 (27 self)
A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied over decades, clustering on heterogeneous networks has not been addressed until recently. A recent study proposed a new algorithm, RankClus, for clustering on bi-typed heterogeneous networks. However, a real-world network may consist of more than two types, and the interactions among multi-typed objects play a key role at disclosing the rich semantics that a network carries. In this paper, we study clustering of multi-typed heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters. An iterative enhancement method is developed that leads to effective ranking-based clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each net-cluster.
Spectral clustering for multi-type relational data
 In ICML
, 2006
Cited by 44 (4 self)
Clustering on multi-type relational data has attracted more and more attention in recent years due to its high impact on various important applications, such as Web mining, e-commerce and bioinformatics. However, the research on general multi-type relational data clustering is still limited and preliminary. The contribution of the paper is threefold. First, we propose a general model, the collective factorization on related matrices, for multi-type relational data clustering. The model is applicable to relational data with various structures. Second, under this model, we derive a novel algorithm, the spectral relational clustering, to cluster multi-type interrelated data objects simultaneously. The algorithm iteratively embeds each type of data objects into low dimensional spaces and benefits from the interactions among the hidden structures of different types of data objects. Extensive experiments demonstrate the promise and effectiveness of the proposed algorithm. Third, we show that the existing spectral clustering algorithms can be considered as the special cases of the proposed model and algorithm. This demonstrates the good theoretic generality of the proposed model and algorithm.
Document clustering using locality preserving indexing
 IEEE Transactions on Knowledge and Data Engineering
, 2005
Cited by 43 (16 self)
We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using Locality Preserving Indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on Latent Semantic Indexing (LSI) or Nonnegative Matrix Factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised Linear Discriminant Analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets. Index Terms: document clustering, locality preserving indexing, dimensionality reduction, semantics.
Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering
 KDD
, 2005
Cited by 34 (1 self)
Heterogeneous data co-clustering has attracted more and more attention in recent years due to its high impact on various applications. While the co-clustering algorithms for two types of heterogeneous data (denoted by pairwise co-clustering), such as documents and terms, have been well studied in the literature, the work on more types of heterogeneous data (denoted by high-order co-clustering) is still very limited. As an attempt in this direction, in this paper, we worked on a specific case of high-order co-clustering in which there is a central type of objects that connects the other types so as to form a star structure of the interrelationships. Actually, this case could be a very good abstract for many real-world applications, such as the co-clustering of categories, documents and terms in text mining. In our philosophy, we treated such kind of problems as the fusion of multiple pairwise co-clustering sub-problems with the constraint of the star structure. Accordingly, we proposed the concept of consistent bipartite graph co-partitioning, and developed an algorithm based on semi-definite programming (SDP) for efficient computation of the clustering results. Experiments on toy problems and real data both verified the effectiveness of our proposed method.
Co-clustering by block value decomposition
 In KDD’05
, 2005
Cited by 33 (6 self)
Dyadic data matrices, such as co-occurrence matrix, rating matrix, and proximity matrix, arise frequently in various important applications. A fundamental problem in dyadic data analysis is to find the hidden block structure of the data matrix. In this paper, we present a new co-clustering framework, block value decomposition (BVD), for dyadic data, which factorizes the dyadic data matrix into three components, the row-coefficient matrix R, the block value matrix B, and the column-coefficient matrix C. Under this framework, we focus on a special yet very popular case, nonnegative dyadic data, and propose a specific novel co-clustering algorithm that iteratively computes the three decomposition matrices based on the multiplicative updating rules. Extensive experimental evaluations also demonstrate the effectiveness and potential of this framework as well as the specific algorithms for co-clustering, and in particular, for discovering the hidden block structure in the dyadic data.
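The three-factor decomposition described above can be illustrated with a short sketch. The abstract names the factors R, B, and C and mentions multiplicative updating rules but does not state them, so the rules below are an assumption: the standard Lee-Seung-style updates for minimizing the Frobenius error of X ≈ RBC. The function names, the small constant `eps`, and the toy matrix are likewise hypothetical.

```python
import numpy as np


def bvd(X, k, l, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix X (n x m) as R (n x k) @ B (k x l) @ C (l x m).

    Uses assumed Frobenius-norm multiplicative updates, not necessarily
    the exact rules of the paper.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random start so no entry is stuck at zero.
    R = rng.random((n, k)) + 0.1
    B = rng.random((k, l)) + 0.1
    C = rng.random((l, m)) + 0.1
    for _ in range(n_iter):
        BC = B @ C
        R *= (X @ BC.T) / (R @ BC @ BC.T + eps)                # row coefficients
        B *= (R.T @ X @ C.T) / (R.T @ R @ B @ C @ C.T + eps)   # block values
        RB = R @ B
        C *= (RB.T @ X) / (RB.T @ RB @ C + eps)                # column coefficients
    return R, B, C


def reconstruction_error(X, R, B, C):
    """Frobenius norm of the residual X - R B C."""
    return np.linalg.norm(X - R @ B @ C)


# Toy dyadic matrix (hypothetical) with a 2 x 2 block structure.
X = np.kron(np.array([[5.0, 1.0], [1.0, 4.0]]), np.ones((3, 4)))
R, B, C = bvd(X, k=2, l=2)
print(reconstruction_error(X, R, B, C))
```

In this sketch the largest entry in each row of R suggests a row cluster, and the largest entry in each column of C suggests a column cluster, while B summarizes the values of the blocks they form.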
Unsupervised Learning on K-partite Graphs
, 2006
Cited by 33 (4 self)
Various data mining applications involve data objects of multiple types that are related to each other, which can be naturally formulated as a k-partite graph. However, the research on mining the hidden structures from a k-partite graph is still limited and preliminary. In this paper, we propose a general model, the relation summary network, to find the hidden structures (the local cluster structures and the global community structures) from a k-partite graph. The model provides a principal framework for unsupervised learning on k-partite graphs of various structures. Under this model, we derive a novel algorithm to identify the hidden structures of a k-partite graph by constructing a relation summary network to approximate the original k-partite graph under a broad range of distortion measures. Experiments on both synthetic and real data sets demonstrate the promise and effectiveness of the proposed model and algorithm. We also establish the connections between existing clustering approaches and the proposed model to provide a unified view to the clustering approaches.