Results 1 - 10 of 346
Biclustering algorithms for biological data analysis: a survey.
- IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004
"... Abstract A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of the application of standard clustering methods to genes are limited. These limited results are imposed by the existence of a num ..."
Abstract
-
Cited by 481 (15 self)
- Add to MetaCart
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of applying standard clustering methods to genes are limited, because there often exist experimental conditions under which the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually designated biclustering, seeks to find sub-matrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities across every condition. Algorithms of this type have also been proposed and used in other fields, such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering and classify them according to the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, and the target applications.
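Concretely, a bicluster is just a submatrix picked out by a subset of rows (genes) and a subset of columns (conditions). As a hedged illustration only (not taken from the survey), the sketch below selects such a submatrix with NumPy and scores how coherent its genes are via mean pairwise correlation; the function name bicluster_row_coherence and the toy expression matrix are hypothetical.

```python
import numpy as np

def bicluster_row_coherence(data, rows, cols):
    """Mean pairwise Pearson correlation of the selected genes (rows)
    restricted to the selected conditions (cols). Values near 1 indicate
    a coherent bicluster; this is only one of many quality measures the
    survey classifies."""
    sub = data[np.ix_(rows, cols)]            # the bicluster: a submatrix
    corr = np.corrcoef(sub)                   # gene-by-gene correlation matrix
    off_diag = corr[~np.eye(len(rows), dtype=bool)]
    return off_diag.mean()

# Toy expression matrix: 6 genes x 5 conditions (hypothetical numbers).
expr = np.random.default_rng(0).normal(size=(6, 5))
expr[1] = expr[0] * 2.0 + 0.1                 # make genes 0 and 1 correlated
print(bicluster_row_coherence(expr, rows=[0, 1], cols=[0, 1, 2, 3, 4]))
```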
Fast random walk with restart and its applications
- In ICDM ’06: Proceedings of the 6th IEEE International Conference on Data Mining, 2006
"... How closely related are two nodes in a graph? How to compute this score quickly, on huge, disk-resident, real graphs? Random walk with restart (RWR) provides a good relevance score between two nodes in a weighted graph, and it has been successfully used in numerous settings, like automatic captionin ..."
Abstract
-
Cited by 179 (19 self)
- Add to MetaCart
(Show Context)
How closely related are two nodes in a graph? How can this score be computed quickly on huge, disk-resident, real graphs? Random walk with restart (RWR) provides a good relevance score between two nodes in a weighted graph, and it has been successfully used in numerous settings, such as automatic captioning of images, generalizations to the “connection subgraphs”, personalized PageRank, and many more. However, straightforward implementations of RWR do not scale to large graphs, requiring either quadratic space and cubic pre-computation time, or slow response time on queries. We propose fast solutions to this problem. The heart of our approach is to exploit two important properties shared by many real graphs: (a) linear correlations and (b) block-wise, community-like structure. We exploit the linearity by using low-rank matrix approximation, and the community structure by graph partitioning, followed by the Sherman-Morrison lemma for matrix inversion. Experimental results on the Corel image and DBLP datasets demonstrate that our proposed methods achieve significant savings over the straightforward implementations: they save several orders of magnitude in pre-computation and storage cost, and they achieve up to 150x speedup with 90%+ quality preservation.
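For orientation, the baseline that the paper speeds up can be stated in a few lines: column-normalize the adjacency matrix W and iterate r ← c·W·r + (1 − c)·e until convergence. The sketch below is that naive power-iteration baseline, not the paper's fast solutions (which exploit low-rank approximation, graph partitioning and the Sherman-Morrison lemma); the function name rwr_scores and the toy graph are hypothetical.

```python
import numpy as np

def rwr_scores(A, query, c=0.9, tol=1e-9, max_iter=1000):
    """Relevance of every node w.r.t. `query` via random walk with restart.
    A is a (possibly weighted) adjacency matrix; c is the probability of
    continuing the walk, so 1 - c is the restart probability."""
    n = A.shape[0]
    W = A / A.sum(axis=0, keepdims=True)      # column-normalize: W[i, j] = P(j -> i)
    e = np.zeros(n)
    e[query] = 1.0                            # restart distribution
    r = e.copy()
    for _ in range(max_iter):
        r_new = c * W @ r + (1 - c) * e
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Tiny weighted graph (hypothetical): 4 nodes on a path.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr_scores(A, query=0))
```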
GraphScope: parameter-free mining of large time-evolving graphs
- In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007
"... How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity time-points in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both prob-lems, using inf ..."
Abstract
-
Cited by 155 (14 self)
- Add to MetaCart
(Show Context)
How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity time-points in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, which addresses both problems using information-theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs in a streaming fashion. We demonstrate the efficiency and effectiveness of GraphScope on real datasets from several diverse domains. In all cases it produces meaningful time-evolving patterns that agree with human intuition.
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
- In KDD, 2004
"... Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an informationtheoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clust ..."
Abstract
-
Cited by 135 (29 self)
- Add to MetaCart
(Show Context)
Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an information-theoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clustering of more general matrices is desired. In this paper, we present a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation-based constraints can be considered based on the statistics that need to be preserved. Analysis of the co-clustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle and yields an elegant meta-algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and co-clustering algorithms based on alternate minimization.
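As one concrete special case of such a framework, choosing the squared Euclidean distance as the Bregman divergence and preserving co-cluster means leads to "block average" co-clustering, solvable by alternate minimization. The sketch below illustrates only that special case under those assumptions, not the paper's general meta-algorithm; the function name coclust_block_means and the random matrix are hypothetical.

```python
import numpy as np

def coclust_block_means(Z, k, l, n_iter=30, seed=0):
    """Alternate minimization for co-clustering with squared Euclidean loss:
    approximate Z by the mean of each (row-cluster, column-cluster) block,
    then reassign rows and columns to the blocks that reduce squared error."""
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    rho = rng.integers(0, k, size=m)          # row-cluster labels
    gamma = rng.integers(0, l, size=n)        # column-cluster labels
    for _ in range(n_iter):
        # Co-cluster means (the statistic preserved by this scheme).
        M = np.zeros((k, l))
        for g in range(k):
            for h in range(l):
                block = Z[np.ix_(rho == g, gamma == h)]
                M[g, h] = block.mean() if block.size else Z.mean()
        # Reassign each row to the row cluster with the smallest squared error.
        row_err = np.stack([((Z - M[g][gamma]) ** 2).sum(axis=1) for g in range(k)])
        rho = row_err.argmin(axis=0)
        # Reassign each column likewise.
        col_err = np.stack([((Z - M[:, h][rho][:, None]) ** 2).sum(axis=0) for h in range(l)])
        gamma = col_err.argmin(axis=0)
    return rho, gamma

rho, gamma = coclust_block_means(np.random.default_rng(1).normal(size=(20, 12)), k=3, l=2)
```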
Minimum sum-squared residue co-clustering of gene expression data
- In SDM, 2004
"... Microarray experiments have been extensively used for simultaneously measuring DNA expression levels of thousands of genes in genome research. A key step in the analysis of gene expression data is the clustering of genes into groups that show similar expression values over a range of conditions. Sin ..."
Abstract
-
Cited by 116 (6 self)
- Add to MetaCart
(Show Context)
Microarray experiments have been extensively used for simultaneously measuring DNA expression levels of thousands of genes in genome research. A key step in the analysis of gene expression data is the clustering of genes into groups that show similar expression values over a range of conditions. Since only a small subset of the genes participate in any cellular process of interest, focusing on subsets of genes and conditions lowers the noise induced by other genes and conditions; a co-cluster characterizes such a subset of interest. Cheng and Church [3] introduced an effective measure of co-cluster quality based on mean squared residue. In this paper, we use two similar squared residue measures and propose two fast k-means-like co-clustering algorithms corresponding to the two residue measures. Our algorithms discover k row clusters and l column clusters simultaneously while monotonically decreasing the respective squared residues. Our co-clustering algorithms inherit the simplicity, efficiency and wide applicability of the k-means algorithm. Minimizing the residues may also be formulated as trace optimization problems that allow us to obtain a spectral relaxation, which we use for a principled initialization of our iterative algorithms. We further enhance our algorithms with an incremental local search strategy that helps avoid empty clusters and escape poor local minima. We illustrate co-clustering results on a yeast cell cycle dataset and a human B-cell lymphoma dataset. Our experiments show that our co-clustering algorithms are efficient and are able to discover coherent co-clusters. Keywords: gene expression, co-clustering, biclustering, residue, spectral relaxation
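The Cheng-Church measure the paper builds on has a compact form: for a co-cluster with row set I and column set J, the residue of entry a_ij is a_ij − a_iJ − a_Ij + a_IJ (subtracting the row, column and overall means over the co-cluster), and the quality score is the mean of its square. A minimal NumPy illustration of that measure (hypothetical data; not the paper's co-clustering algorithms themselves):

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Cheng-Church mean squared residue of the co-cluster A[rows][:, cols].
    Low values mean the selected genes vary coherently across the
    selected conditions."""
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
    overall = sub.mean()                          # a_IJ
    residue = sub - row_means - col_means + overall
    return (residue ** 2).mean()

A = np.random.default_rng(2).normal(size=(8, 6))
print(mean_squared_residue(A, rows=[0, 2, 5], cols=[1, 3, 4]))
```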
Beyond streams and graphs: Dynamic tensor analysis
- In KDD, 2006
"... How do we find patterns in author-keyword associations, evolving over time? Or in DataCubes, with product-branchcustomer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule ..."
Abstract
-
Cited by 113 (16 self)
- Add to MetaCart
(Show Context)
How do we find patterns in author-keyword associations evolving over time? Or in DataCubes with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and its variants, are invaluable tools for mining, dimensionality reduction, feature selection and rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they handle only two orders, like author and keyword in the above example. We propose to envision such higher-order data as tensors, and to tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic, with no need for user-defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA. We implemented all our methods and applied them in two real settings, namely anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate, and find interesting patterns and outliers on the real datasets.
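The key bookkeeping idea in the abstract is that incoming tensors update a small per-mode summary instead of being stored. One plausible way to realize this, sketched below purely as an illustration and not as the authors' exact update rules, is to keep a forgetting-factor-weighted covariance matrix per mode and project onto its leading eigenvectors; the function names and the toy tensor stream are hypothetical.

```python
import numpy as np

def unfold(X, mode):
    """Matricize tensor X along `mode` (mode-n unfolding)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def dta_like_update(covs, X, rank=2, forget=0.98):
    """Update one covariance summary per mode with the new tensor X and
    return the top-`rank` eigenvectors of each as projection matrices."""
    projections = []
    for mode in range(X.ndim):
        Xm = unfold(X, mode)
        covs[mode] = forget * covs[mode] + Xm @ Xm.T
        vals, vecs = np.linalg.eigh(covs[mode])
        projections.append(vecs[:, np.argsort(vals)[::-1][:rank]])
    return covs, projections

# Stream of hypothetical 5x4x3 tensors (e.g., author x keyword x time-slice counts).
rng = np.random.default_rng(3)
covs = [np.zeros((d, d)) for d in (5, 4, 3)]
for _ in range(10):
    covs, proj = dta_like_update(covs, rng.normal(size=(5, 4, 3)))
```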
Solving Cluster Ensemble Problems by Bipartite Graph Partitioning
- In Proceedings of the International Conference on Machine Learning, 2004
"... A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constr ..."
Abstract
-
Cited by 103 (3 self)
- Add to MetaCart
A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final, superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constructs a bipartite graph from a given cluster ensemble. The resulting graph models both instances and clusters of the ensemble simultaneously as vertices. Our approach retains all of the information provided by a given ensemble, allowing the similarity among instances and the similarity among clusters to be considered collectively in forming the final clustering. Further, the resulting graph partitioning problem can be solved efficiently. We empirically evaluate the proposed approach against two commonly used graph formulations and show that it is more robust and performs comparably to or better than its competitors.
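The reduction is easy to picture: every instance and every cluster from every base clustering becomes a vertex, and an edge links an instance to each cluster containing it. The sketch below builds that instance-by-cluster incidence matrix and then performs a toy 2-way spectral split as a stand-in for the dedicated graph partitioner the paper relies on; the function names and the two base clusterings are hypothetical.

```python
import numpy as np

def ensemble_bipartite_graph(clusterings):
    """Stack the base clusterings into a binary instance-by-cluster incidence
    matrix B; the bipartite graph's adjacency is [[0, B], [B^T, 0]]."""
    blocks = []
    for labels in clusterings:
        labels = np.asarray(labels)
        k = labels.max() + 1
        blocks.append(np.eye(k)[labels])      # one-hot membership, shape (n, k)
    return np.hstack(blocks)                  # B, shape (n, total #clusters)

def two_way_split(B):
    """Toy spectral bipartition: the sign of the 2nd left singular vector of
    the degree-normalized incidence matrix splits the instances in two."""
    Dr = np.sqrt(B.sum(axis=1, keepdims=True))
    Dc = np.sqrt(B.sum(axis=0, keepdims=True))
    U, _, _ = np.linalg.svd(B / Dr / Dc, full_matrices=False)
    return (U[:, 1] >= 0).astype(int)

# Two hypothetical base clusterings of 6 instances.
B = ensemble_bipartite_graph([[0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]])
print(two_way_split(B))
```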
Center-piece subgraphs: Problem definition and fast solutions
- In KDD, 2006
"... Given Q nodes in a social network (say, authorship network), how can we find the node/author that is the centerpiece, and has direct or indirect connections to all, or most of them? For example, this node could be the common advisor, or someone who started the research area that the Q nodes belong t ..."
Abstract
-
Cited by 77 (23 self)
- Add to MetaCart
(Show Context)
Given Q nodes in a social network (say, an authorship network), how can we find the node/author that is the center-piece and has direct or indirect connections to all, or most, of them? For example, this node could be the common advisor, or someone who started the research area that the Q nodes belong to. Isomorphic scenarios appear in law enforcement (find the mastermind criminal connected to all current suspects), gene regulatory networks (find the protein that participates in pathways with all or most of the given Q proteins), viral marketing and many more. Connection subgraphs are an important first step, handling the case of Q = 2 query nodes: the connection subgraph algorithm finds the b intermediate nodes that provide a good connection between the two original query nodes. Here we generalize the challenge in multiple dimensions. First, we allow more than two query nodes. Second, we allow a whole family of queries, ranging from ‘OR’ to ‘AND’, with ‘softAND’ in between. Finally, we design and compare a fast approximation, and study the quality/speed trade-off. We also present experiments on the DBLP dataset. The experiments confirm that our proposed method naturally deals with multi-source queries and that the resulting subgraphs agree with our intuition. Wall-clock timing results on the DBLP dataset show that our proposed approximation achieves good accuracy at about a 6:1 speedup.
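Operationally, the building block is a per-query relevance vector (for example from random walk with restart, as in the "Fast random walk with restart" entry above), and the Q vectors are then combined node-wise. The sketch below is only a plausible illustration, not the paper's exact scoring or its ‘softAND’ operator: it treats ‘AND’ as a product of per-query scores and ‘OR’ as their sum, then keeps the top-scoring nodes; the function name and the score matrix are hypothetical.

```python
import numpy as np

def centerpiece_candidates(R, mode="AND", budget=3):
    """R[i, q] = relevance of node i for query node q (e.g., an RWR score).
    Combine the Q columns node-wise and return the `budget` highest-scoring
    nodes as candidate center-pieces."""
    if mode == "AND":
        combined = R.prod(axis=1)     # high only if relevant to *all* queries
    else:                             # "OR"
        combined = R.sum(axis=1)      # high if relevant to *any* query
    return np.argsort(combined)[::-1][:budget]

# Hypothetical relevance scores for 6 nodes and Q = 3 query nodes.
R = np.random.default_rng(4).random((6, 3))
print(centerpiece_candidates(R, mode="AND"))
```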
Neighborhood formation and anomaly detection in bipartite graphs
- In ICDM, 2005
"... Many real applications can be modeled using bipartite graphs, such as users vs. files in a P2P system, traders vs. stocks in a financial trading system, conferences vs. authors in a scientific publication network, and so on. We introduce two operations on bipartite graphs: 1) identifying similar nod ..."
Abstract
-
Cited by 70 (17 self)
- Add to MetaCart
(Show Context)
Many real applications can be modeled using bipartite graphs, such as users vs. files in a P2P system, traders vs. stocks in a financial trading system, conferences vs. authors in a scientific publication network, and so on. We introduce two operations on bipartite graphs: 1) identifying similar nodes (neighborhood formation), and 2) finding abnormal nodes (anomaly detection). We propose algorithms to compute the neighborhood of each node using random walk with restarts and graph partitioning, and algorithms to identify abnormal nodes using that neighborhood information. We evaluate the quality of neighborhoods based on the semantics of the datasets, and we measure the performance of the anomaly detection algorithm with manually injected anomalies. Both the effectiveness and efficiency of the methods are confirmed by experiments on several real datasets.
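In this setting the pieces plug together naturally: fold the n1 x n2 bipartite matrix into one square adjacency matrix, run a relevance routine (such as the rwr_scores sketch shown earlier) from the query node, and read the neighborhood off the highest-scoring nodes on the same side. The snippet below illustrates only the folding step, with a hypothetical user-by-file matrix; it is not the paper's partition-based approximation or its anomaly score.

```python
import numpy as np

def fold_bipartite(B):
    """Embed an n1 x n2 bipartite matrix B (e.g., users x files) into one
    square adjacency matrix over n1 + n2 nodes: users occupy indices
    0..n1-1 and files occupy n1..n1+n2-1."""
    n1, n2 = B.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = B
    A[n1:, :n1] = B.T
    return A

# Hypothetical 3-user x 4-file interaction matrix.
B = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
A = fold_bipartite(B)   # feed A to an RWR routine; same-side top scores = neighborhood
```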
A scalable collaborative filtering framework based on co-clustering
- Fifth IEEE International Conference on Data Mining, 2005
"... Collaborative filtering-based recommender systems, which automatically predict preferred products of a user using known preferences of other users, have become extremely popular in recent years due to the increase in web-based activities such as e-commerce and online content distribution. Current co ..."
Abstract
-
Cited by 69 (1 self)
- Add to MetaCart
(Show Context)
Collaborative filtering-based recommender systems, which automatically predict the preferred products of a user using the known preferences of other users, have become extremely popular in recent years due to the increase in web-based activities such as e-commerce and online content distribution. Current collaborative filtering techniques, such as correlation- and SVD-based methods, provide good accuracy but are computationally very expensive and can only be deployed in static off-line settings where the known preference information does not change with time. However, a number of practical scenarios require dynamic real-time collaborative filtering that can allow new users, items and ratings to enter the system at a rapid rate. In this paper, we consider a novel collaborative filtering approach based on a recently proposed weighted co-clustering algorithm [3] that involves simultaneous clustering of users and items. We design incremental and parallel versions of the co-clustering algorithm and use it to build an efficient real-time collaborative filtering framework. Empirical evaluation of our approach on large movie and book rating datasets demonstrates that it is possible to obtain accuracy comparable to that of the correlation and matrix factorization based approaches at a much lower computational cost.
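The co-clustering view suggests a simple prediction rule: start from the average rating of the (user cluster, item cluster) co-cluster and shift it by how far the user and the item each deviate from their own cluster's average. The sketch below writes that rule out as a hedged reconstruction in the spirit of the approach, not a quotation of the paper's exact formula; all averages are assumed precomputed from hypothetical training data.

```python
def predict_rating(cocluster_avg, user_avg, user_cluster_avg, item_avg, item_cluster_avg):
    """Predicted rating = co-cluster average, shifted by the user's and the
    item's individual biases relative to their respective clusters."""
    user_bias = user_avg - user_cluster_avg
    item_bias = item_avg - item_cluster_avg
    return cocluster_avg + user_bias + item_bias

# Hypothetical precomputed averages for one (user, item) pair.
print(predict_rating(cocluster_avg=3.6, user_avg=4.1, user_cluster_avg=3.8,
                     item_avg=3.2, item_cluster_avg=3.5))
```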