A General Probabilistic Framework for Clustering Individuals
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 59 (2 self)
This paper presents a unifying probabilistic framework for clustering individuals or systems into groups when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient one has different numbers of time-series observations, each time-series of different lengths. We propose a general model-based framework for clustering heterogeneous data types of this form. We discuss a general Expectation-Maximization (EM) procedure for clustering within this framework and outline how it can be applied to clustering of sequences, time-series, histograms, trajectories, and other non-vector data. We show that a number of earlier algorithms can be viewed as special cases within this unifying framework. The paper concludes with several illustrations of the method, including clustering of two-dimensional histograms of red blood cell data in a medical diagnosis context, clustering of proteins fr...
Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood
Statistics and Computing, 1998
Cited by 46 (3 self)
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques.
1 Introduction
Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...
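The idea of judging mixture models by out-of-sample likelihood can be illustrated with a minimal sketch. This is not the paper's code: it assumes one-dimensional Gaussian components, plain v-fold splits (which may differ from the splitting scheme used in the paper), a quantile-based EM initialisation, and synthetic two-cluster data. All function names (`fit_gmm_1d`, `cv_loglik`, and so on) are hypothetical.

```python
import math
import random

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def fit_gmm_1d(xs, k, n_iter=50, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM (assumed setup)."""
    s, n = sorted(xs), len(xs)
    # Assumption: initialise the means at evenly spaced quantiles.
    means = [s[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mu = sum(s) / n
    var0 = max(sum((x - mu) ** 2 for x in s) / n, var_floor)
    variances, weights = [var0] * k, [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            logs = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            z = logsumexp(logs)
            resp.append([math.exp(l - z) for l in logs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances

def heldout_loglik(xs, params):
    """Total log-likelihood of unseen points under a fitted mixture."""
    weights, means, variances = params
    return sum(
        logsumexp([math.log(w) + log_gauss(x, m, v)
                   for w, m, v in zip(weights, means, variances)])
        for x in xs)

def cv_loglik(xs, k, n_folds=5):
    """Average held-out log-likelihood per point over n_folds splits."""
    total = 0.0
    for f in range(n_folds):
        train = [x for i, x in enumerate(xs) if i % n_folds != f]
        test = [x for i, x in enumerate(xs) if i % n_folds == f]
        total += heldout_loglik(test, fit_gmm_1d(train, k))
    return total / len(xs)

# Synthetic data (an assumption): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(40)]
data += [rng.gauss(10.0, 1.0) for _ in range(40)]
rng.shuffle(data)

scores = {k: cv_loglik(data, k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because every candidate is scored on data it was not fitted to, adding components only helps when they improve prediction, which is what makes the score usable for choosing the number of clusters.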
Model-based clustering and visualization of navigation patterns on a web site
Data Mining and Knowledge Discovery, 2003
Cited by 36 (0 self)
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
Keywords: Model-based clustering, sequence clustering, data visualization, Internet, web
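A mixture of first-order Markov models fitted by EM, as described in the abstract above, can be sketched roughly as follows. This is an illustrative toy, not the WebCANVAS implementation: the function name `em_markov_mixture`, the additive smoothing constant `alpha`, the toy sequences, and the deterministic nudge used to break symmetry are all assumptions. A practical implementation would more likely start from random soft assignments with several restarts.

```python
import math

def em_markov_mixture(seqs, n_states, resp, n_iter=30, alpha=1.0):
    """EM for a mixture of first-order Markov chains.

    seqs: list of state sequences (ints in range(n_states)).
    resp: initial soft assignments, one row of cluster weights per sequence.
    alpha: additive smoothing so that no probability is exactly zero.
    """
    n_clusters = len(resp[0])
    for _ in range(n_iter):
        # M-step: weighted counts of start states and transitions.
        weights, inits, trans = [], [], []
        for c in range(n_clusters):
            init = [alpha] * n_states
            tr = [[alpha] * n_states for _ in range(n_states)]
            for r, s in zip(resp, seqs):
                init[s[0]] += r[c]
                for a, b in zip(s, s[1:]):
                    tr[a][b] += r[c]
            mass = sum(r[c] for r in resp)
            weights.append((mass + alpha) / (len(seqs) + alpha * n_clusters))
            inits.append([v / sum(init) for v in init])
            trans.append([[v / sum(row) for v in row] for row in tr])
        # E-step: posterior cluster membership of each sequence.
        resp = []
        for s in seqs:
            logs = []
            for c in range(n_clusters):
                ll = math.log(weights[c]) + math.log(inits[c][s[0]])
                ll += sum(math.log(trans[c][a][b]) for a, b in zip(s, s[1:]))
                logs.append(ll)
            m = max(logs)
            probs = [math.exp(l - m) for l in logs]
            z = sum(probs)
            resp.append([p / z for p in probs])
    return weights, inits, trans, resp

# Toy data (an assumption): five "alternating" and five "sticky" sessions.
alternating = [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1] for _ in range(5)]
sticky = [[0] * 10 for _ in range(5)]
seqs = alternating + sticky

# Start from uniform memberships, nudging only the first sequence so that
# the two clusters are not exactly symmetric.
init_resp = [[0.5, 0.5] for _ in seqs]
init_resp[0] = [0.6, 0.4]

weights, inits, trans, resp = em_markov_mixture(seqs, 2, init_resp)
labels = [max(range(2), key=r.__getitem__) for r in resp]
print(labels)
```

Each EM iteration touches every transition of every sequence once per cluster, which matches the abstract's remark that runtime grows linearly with the number of clusters and the size of the data.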
Probabilistic Model-Based Clustering of Multivariate and Sequential Data
In Proceedings of Artificial Intelligence and Statistics, 1999
Cited by 20 (1 self)
Probabilistic model-based clustering, based on finite mixtures of multivariate models, is a useful framework for clustering data in a statistical context. This general framework can be directly extended to clustering of sequential data, based on finite mixtures of sequential models. In this paper we consider the problem of fitting mixture models where both multivariate and sequential observations are present. A general EM algorithm is discussed and experimental results demonstrated on simulated data. The problem is motivated by the practical problem of clustering individuals into groups based on both their static characteristics and their dynamic behavior.
1 Introduction and Motivation
Consider the following problem. We have a set of individuals (a random sample from a larger population) whom we would like to cluster into groups based on observational data. For each individual we can measure characteristics which are relatively static (e.g., their height, weight, income, age, sex, etc)...
Probabilistic Clustering using Hierarchical Models
1999
Cited by 10 (5 self)
This paper addresses the problem of clustering data when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient there are time series, image, text, and multivariate data. We propose a general probabilistic clustering framework for clustering heterogeneous data types of this form. We focus on two-level probabilistic hierarchical models, consisting of a high-level mixture model on parameters and a low-level model for observations. This general framework permits probabilistic clustering of "objects" (sequences, histograms, images, etc) using an extension of the expectation-maximization (EM) algorithm which we derive. We further show that earlier (intuitive) clustering algorithms can be viewed as special cases (approximations) of the framework proposed here. The paper includes several illustrations of the method, including an application to a problem in clustering two-dime...
Discovering Functional Communities in Dynamical Networks
2006
Cited by 2 (0 self)
Many networks are important because they are substrates for dynamical systems, and their pattern of functional connectivity can itself be dynamic: they can functionally reorganize, even if their underlying anatomical structure remains fixed. However, the recent rapid progress in discovering the community structure of networks has overwhelmingly focused on that constant anatomical connectivity. In this paper, we lay out the problem of discovering functional communities, and describe an approach to doing so. This method combines recent work on measuring information sharing across stochastic networks with an existing and successful community-discovery algorithm for weighted networks. We illustrate it with an application to a large biophysical model of the transition from beta to gamma rhythms in the hippocampus.

