MatrixExplorer: a Dual-Representation System to Explore Social Networks
 IEEE Transactions on Visualization and Computer Graphics
, 2006
Cited by 56 (11 self)
Abstract: MatrixExplorer is a network visualization system that uses two representations: node-link diagrams and matrices. Its design comes from a list of requirements formalized after several interviews and a participatory design session conducted with social science researchers. Although matrices are commonly used in social networks analysis, very few systems support matrix-based representations to visualize and analyze networks. MatrixExplorer provides several novel features to support the exploration of social networks with a matrix-based representation, in addition to the standard interactive filtering and clustering functions. It provides tools to reorder (layout) matrices, to annotate and compare findings across different layouts, and to find consensus among several clusterings. MatrixExplorer also supports node-link diagram views, which are familiar to most users and remain a convenient way to publish or communicate exploration results. Matrix and node-link representations are kept synchronized at all stages of the exploration process. Index Terms: social networks visualization, node-link diagrams, matrix-based representations, exploratory process, matrix ordering, interactive clustering, consensus. Fig. 1. MatrixExplorer showing two synchronized representations of the same network: matrix on the left and node-link on the right.
Modeling dyadic data with binary latent factors
 Neural Information Processing Systems
, 2008
Cited by 52 (13 self)
We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a nonparametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike biclustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.
1 Distributed representations for dyadic data
One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices
Clustering methods for the analysis of DNA microarray data
, 1999
Cited by 40 (0 self)
It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.
Co-clustering by block value decomposition
 In KDD’05
, 2005
Cited by 32 (7 self)
Dyadic data matrices, such as co-occurrence matrices, rating matrices, and proximity matrices, arise frequently in various important applications. A fundamental problem in dyadic data analysis is to find the hidden block structure of the data matrix. In this paper, we present a new co-clustering framework, block value decomposition (BVD), for dyadic data, which factorizes the dyadic data matrix into three components: the row-coefficient matrix R, the block value matrix B, and the column-coefficient matrix C. Under this framework, we focus on a special yet very popular case, nonnegative dyadic data, and propose a specific novel co-clustering algorithm that iteratively computes the three decomposition matrices based on multiplicative updating rules. Extensive experimental evaluations also demonstrate the effectiveness and potential of this framework as well as the specific algorithms for co-clustering, and in particular, for discovering the hidden block structure in the dyadic data.
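To make the three-factor decomposition concrete, here is a minimal sketch in Python with NumPy. It uses the standard multiplicative updates for a nonnegative tri-factorization X ≈ R B C under squared error. The abstract does not spell out its update rules, so the exact rules, the function name `bvd`, and the parameters `k`, `l`, `n_iter`, and `eps` are assumptions made for illustration and may differ from the paper's algorithm.

```python
import numpy as np

def bvd(X, k, l, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (n x m) as R @ B @ C.

    R (n x k) holds row coefficients, B (k x l) holds block values,
    C (l x m) holds column coefficients. Hypothetical sketch only.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random starting points keep the updates well defined.
    R = rng.random((n, k)) + eps
    B = rng.random((k, l)) + eps
    C = rng.random((l, m)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled elementwise by a ratio of nonnegative
        # terms, so nonnegativity is preserved at every step.
        BC = B @ C
        R *= (X @ BC.T) / (R @ BC @ BC.T + eps)
        RB = R @ B
        C *= (RB.T @ X) / (RB.T @ RB @ C + eps)
        B *= (R.T @ X @ C.T) / (R.T @ R @ B @ C @ C.T + eps)
    return R, B, C
```

Under this reading, a hard co-clustering could be obtained by assigning each row to the largest entry in its row of R and each column to the largest entry in its column of C; that post-processing step is likewise an assumption, not something stated in the abstract.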
Statistical debugging: simultaneous identification of multiple bugs
 In ICML ’06: Proceedings of the 23rd international conference on Machine learning
, 2006
Cited by 30 (4 self)
We describe a statistical approach to software debugging in the presence of multiple bugs. Due to sparse sampling issues and complex interaction between program predicates, many generic off-the-shelf algorithms fail to select useful bug predictors. Taking inspiration from biclustering algorithms, we propose an iterative collective voting scheme for the program runs and predicates. We demonstrate successful debugging results on several real-world programs and a large debugging benchmark suite.
DisCo: Distributed co-clustering with MapReduce
 ICDM
, 2008
Cited by 30 (1 self)
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from preprocessing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, MapReduce has been widely embraced by both academia and industry. In database terms, MapReduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying MapReduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data preprocessing and co-clustering. We develop DisCo using Hadoop, an open source MapReduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
The discrete basis problem
, 2005
Cited by 26 (9 self)
We consider the Discrete Basis Problem, which can be described as follows: given a collection of Boolean vectors, find a collection of k Boolean basis vectors such that the original vectors can be represented using disjunctions of these basis vectors. We show that the decision version of this problem is NP-complete and that the optimization version cannot be approximated within any finite ratio. We also study two variations of this problem, where the Boolean basis vectors must be mutually orthogonal. We show that the other variation is closely related with the well-known Metric k-median Problem in Boolean space. To solve these problems, two algorithms will be presented. One is designed for the variations mentioned above, and it is solely based on solving the k-median problem, while another is a heuristic intended to solve the general Discrete Basis Problem. We will also study the results of extensive experiments made with these two algorithms with both synthetic and real-world data. The results are twofold: with the synthetic data, the algorithms did rather well, but with the real-world data the results were not as good.
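The phrase "represented using disjunctions of these basis vectors" can be illustrated with a small Boolean matrix product. The sketch below only evaluates a given candidate solution; it is not either of the paper's algorithms. The names `boolean_product`, `reconstruction_error`, the usage matrix `S`, and the toy data are hypothetical and introduced here for illustration.

```python
import numpy as np

def boolean_product(S, B):
    """Row i of the result is the OR of the basis rows of B selected by S[i]."""
    return (S.astype(int) @ B.astype(int)) > 0

def reconstruction_error(D, S, B):
    """Number of entries where the data D and the reconstruction disagree."""
    return int(np.sum(D.astype(bool) != boolean_product(S, B)))

# Two basis vectors with disjoint supports (hence mutually orthogonal).
B = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=bool)
# Each row of S says which basis vectors are OR-ed together.
S = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=bool)
# Toy data that this basis represents exactly.
D = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=bool)

exact_error = reconstruction_error(D, S, B)

# Flipping one entry of the data leaves one entry the basis cannot explain.
D_noisy = D.copy()
D_noisy[0, 3] = True
noisy_error = reconstruction_error(D_noisy, S, B)
```

Here the search problem is to choose `B` (and `S`) so that this error count is as small as possible for a fixed k, which is the part the abstract reports to be hard.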
Bayesian co-clustering
 In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM)
, 2008
Cited by 20 (3 self)
In recent years, co-clustering has emerged as a powerful data mining tool that can analyze dyadic data connecting two entities. However, almost all existing co-clustering techniques are partitional, and allow individual rows and columns of a data matrix to belong to only one cluster. Several current applications, such as recommendation systems and market basket analysis, can substantially benefit from a mixed membership of rows and columns. In this paper, we present Bayesian co-clustering (BCC) models that allow a mixed membership in row and column clusters. BCC maintains separate Dirichlet priors for rows and columns over the mixed membership and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. We propose a fast variational algorithm for inference and parameter estimation. The model is designed to naturally handle sparse matrices as the inference is done only based on the non-missing entries. In addition to finding a co-cluster structure in observations, the model outputs a low-dimensional co-embedding, and accurately predicts missing values in the original matrix. We demonstrate the efficacy of the model through experiments on both simulated and real data.
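The generative story described above (Dirichlet-distributed mixed memberships for rows and columns, then one observation per cell drawn from a distribution indexed by the sampled row and column clusters) can be sketched as forward sampling. This is an illustration only: the Gaussian likelihood is one assumed member of the exponential family, and the function name `sample_bcc` and all parameter names are hypothetical. It does not implement the variational inference the abstract mentions.

```python
import numpy as np

def sample_bcc(n_rows, n_cols, alpha1, alpha2, means, sigma=0.1, seed=0):
    """Draw a data matrix from a mixed-membership co-clustering model.

    alpha1, alpha2: Dirichlet parameters for row and column memberships.
    means[z1, z2]:  mean of the (assumed Gaussian) distribution for the
                    co-cluster formed by row cluster z1 and column cluster z2.
    """
    rng = np.random.default_rng(seed)
    k1, k2 = means.shape
    # One mixed-membership vector per row and per column.
    pi1 = rng.dirichlet(alpha1, size=n_rows)   # shape (n_rows, k1)
    pi2 = rng.dirichlet(alpha2, size=n_cols)   # shape (n_cols, k2)
    X = np.empty((n_rows, n_cols))
    for u in range(n_rows):
        for v in range(n_cols):
            # Each cell picks its own row cluster and column cluster.
            z1 = rng.choice(k1, p=pi1[u])
            z2 = rng.choice(k2, p=pi2[v])
            X[u, v] = rng.normal(means[z1, z2], sigma)
    return X, pi1, pi2

means = np.array([[0.0, 5.0],
                  [5.0, 0.0],
                  [2.5, 2.5]])
X, pi1, pi2 = sample_bcc(6, 4, alpha1=np.ones(3), alpha2=np.ones(2), means=means)
```

Because the cluster indicators are drawn per cell rather than per row or column, a single row can contribute to several co-clusters, which is the mixed-membership property that distinguishes this setting from partitional co-clustering.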
Text Mining Infrastructure in R
 Journal of Statistical Software
, 2008
Cited by 18 (7 self)
During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.
Shifting And Scaling Patterns From Gene Expression Data
, 2005
Cited by 15 (2 self)
Motivation: In recent years, discovering biclusters in data has become more and more popular. Biclustering aims at extracting a set of clusters, each of which might use a different subset of attributes. Therefore, the usefulness of biclustering techniques goes beyond that of traditional clustering techniques, especially when datasets present high or very high dimensionality. Also, biclustering considers overlapping, which is an interesting aspect both algorithmically and from the point of view of interpreting the results. Since Cheng and Church's work, the mean squared residue has become one of the most popular measures to search for biclusters, which ideally should discover shifting and scaling patterns.
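As a small worked illustration of the measure named above, the sketch below computes the mean squared residue of a submatrix in the usual Cheng and Church form: each entry minus its row mean, minus its column mean, plus the overall mean, squared and averaged. The function name and the toy shifting and scaling patterns are assumptions made here for illustration, not data from the paper.

```python
import numpy as np

def mean_squared_residue(A):
    """Mean squared residue of a bicluster given as a 2-D array."""
    A = np.asarray(A, dtype=float)
    row_means = A.mean(axis=1, keepdims=True)
    col_means = A.mean(axis=0, keepdims=True)
    residue = A - row_means - col_means + A.mean()
    return float(np.mean(residue ** 2))

base = np.array([1.0, 4.0, 2.0, 7.0])

# Shifting pattern: every row is the base profile plus a row-specific offset.
shifted = base + np.array([[0.0], [3.0], [-1.0]])

# Scaling pattern: every row is the base profile times a row-specific factor.
scaled = base * np.array([[1.0], [2.0], [0.5]])

msr_shifted = mean_squared_residue(shifted)
msr_scaled = mean_squared_residue(scaled)
```

On these toy inputs the shifted submatrix has a residue of zero up to floating-point error, while the scaled one does not. For a pure scaling pattern the residue of each entry factors into the row factor's deviation from its mean times the base value's deviation from its mean, so the measure is nonzero whenever both vary.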