Clustering Sequences with Hidden Markov Models
- Advances in Neural Information Processing Systems
, 1997
Cited by 113 (0 self)
This paper discusses a probabilistic model-based approach to clustering sequences, using hidden Markov models (HMMs). The problem can be framed as a generalization of the standard mixture model approach to clustering in feature space. Two primary issues are addressed. First, a novel parameter initialization procedure is proposed, and second, the more difficult problem of determining the number of clusters K, from the data, is investigated. Experimental results indicate that the proposed techniques are useful for revealing hidden cluster structure in data sets of sequences. 1 Introduction Consider a data set D consisting of N sequences, $D = \{S_1, \ldots, S_N\}$. $S_i = (x^i_1, \ldots, x^i_{L_i})$ is a sequence of length $L_i$ composed of potentially multivariate feature vectors x. The problem addressed in this paper is the discovery from data of a natural grouping of the sequences into K clusters. This is analogous to clustering in multivariate feature space which is normally handled by m...
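As a rough sketch of what "a generalization of the standard mixture model approach" could look like for sequences, one common way to write a K-component mixture over a sequence S is given below. The symbols $p_j$ (mixing weights), $\theta_j$ (parameters of the j-th component, e.g. an HMM) and $f_j$ (component likelihood) are notation assumed here for illustration and are not taken from the excerpt.

```latex
% Mixture density over a whole sequence S (assumed notation)
f_K(S) = \sum_{j=1}^{K} p_j \, f_j(S \mid \theta_j),
\qquad \sum_{j=1}^{K} p_j = 1, \quad p_j \ge 0

% Posterior probability that S belongs to cluster j (Bayes' rule)
P(j \mid S) = \frac{p_j \, f_j(S \mid \theta_j)}{\sum_{l=1}^{K} p_l \, f_l(S \mid \theta_l)}
```

Under this reading, clustering amounts to assigning each $S_i$ to the component with the largest posterior probability.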
A General Probabilistic Framework for Clustering Individuals
- In Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2000
Cited by 59 (2 self)
This paper presents a unifying probabilistic framework for clustering individuals or systems into groups when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient one has different numbers of time-series observations, each time-series of different lengths. We propose a general model-based framework for clustering heterogeneous data types of this form. We discuss a general Expectation-Maximization (EM) procedure for clustering within this framework and outline how it can be applied to clustering of sequences, time-series, histograms, trajectories, and other non-vector data. We show that a number of earlier algorithms can be viewed as special cases within this unifying framework. The paper concludes with several illustrations of the method, including clustering of two-dimensional histograms of red blood cell data in a medical diagnosis context, clustering of proteins fr...
Model-based clustering and visualization of navigation patterns on a web site
- Data Mining and Knowledge Discovery
, 2003
Cited by 36 (0 self)
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com. Keywords: Model-based clustering, sequence clustering, data visualization, Internet, web
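To make "learning a mixture of first-order Markov models using the Expectation-Maximization algorithm" concrete, here is a minimal pure-Python sketch. It is not the WebCANVAS implementation: the function names, the random initialization of responsibilities, and the additive smoothing constant `alpha` are all assumptions made for illustration. Sequences are assumed to be non-empty lists of integer category ids in `range(M)`.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def log_lik(seq, pi, A):
    """Log-probability of one sequence under a first-order Markov chain."""
    lp = math.log(pi[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(A[a][b])
    return lp


def m_step(seqs, resp, K, M, alpha):
    """Re-estimate weights, initial distributions and transition matrices."""
    N = len(seqs)
    weights, pis, As = [], [], []
    for k in range(K):
        init = [alpha] * M                       # smoothed initial-state counts
        trans = [[alpha] * M for _ in range(M)]  # smoothed transition counts
        for seq, r in zip(seqs, resp):
            init[seq[0]] += r[k]
            for a, b in zip(seq, seq[1:]):
                trans[a][b] += r[k]
        # Smoothed mixing weight (assumption: keeps every weight positive).
        weights.append((sum(r[k] for r in resp) + alpha) / (N + K * alpha))
        pis.append(normalize(init))
        As.append([normalize(row) for row in trans])
    return weights, pis, As


def e_step(seqs, weights, pis, As):
    """Posterior cluster probabilities per sequence, computed in log space."""
    resp = []
    for seq in seqs:
        logs = [math.log(w) + log_lik(seq, pi, A)
                for w, pi, A in zip(weights, pis, As)]
        top = max(logs)
        resp.append(normalize([math.exp(l - top) for l in logs]))
    return resp


def fit_markov_mixture(seqs, K, M, iters=50, alpha=0.1, seed=0):
    """Run EM from random soft assignments (an assumed initialization)."""
    rng = random.Random(seed)
    resp = [normalize([rng.random() + 1e-3 for _ in range(K)]) for _ in seqs]
    for _ in range(iters):
        weights, pis, As = m_step(seqs, resp, K, M, alpha)
        resp = e_step(seqs, weights, pis, As)
    return weights, pis, As, resp
```

Each iteration visits every transition of every sequence once per cluster, which is consistent with the linear scaling in the number of clusters and the data size that the abstract reports. A hard clustering can be read off by taking, for each sequence, the cluster with the largest responsibility.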
Probabilistic Model-Based Clustering of Multivariate and Sequential Data
- In Proceedings of Artificial Intelligence and Statistics
, 1999
Cited by 20 (1 self)
Probabilistic model-based clustering, based on finite mixtures of multivariate models, is a useful framework for clustering data in a statistical context. This general framework can be directly extended to clustering of sequential data, based on finite mixtures of sequential models. In this paper we consider the problem of fitting mixture models where both multivariate and sequential observations are present. A general EM algorithm is discussed and experimental results demonstrated on simulated data. The problem is motivated by the practical problem of clustering individuals into groups based on both their static characteristics and their dynamic behavior. 1 Introduction and Motivation Consider the following problem. We have a set of individuals (a random sample from a larger population) whom we would like to cluster into groups based on observational data. For each individual we can measure characteristics which are relatively static (e.g., their height, weight, income, age, sex, etc.)...
A Hidden Markov Model-based approach to sequential data clustering
- Structural, Syntactic and Statistical Pattern Recognition. LNCS 2396, Springer (2002) 734–742
, 2002
Cited by 19 (9 self)
Clustering of sequential or temporal data is more challenging than traditional clustering as dynamic observations should be processed rather than static measures. This paper proposes a Hidden Markov Model (HMM)-based technique suitable for clustering of data sequences. The main aspect of the work is the use of a probabilistic model-based approach using HMMs to derive new proximity distances, in the likelihood sense, between sequences. Moreover, a novel partitional clustering algorithm is designed which alleviates the computational burden that characterizes traditional hierarchical agglomerative approaches. Experimental results show that this approach provides an accurate clustering partition and the devised distance measures achieve good performance rates. The method is demonstrated on real-world data sequences, namely EEG signals, chosen for their temporal complexity and the growing interest in the emerging field of Brain Computer Interfaces.
Mixtures of ARMA Models for Model-Based Time Series Clustering
- In Proceedings of the IEEE International Conference on Data Mining
, 2002
Cited by 16 (1 self)
Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-dimensional representations of data patterns. In this paper, we study the clustering of data patterns that are represented as sequences or time series possibly of different lengths. We propose a model-based approach to this problem using mixtures of autoregressive moving average (ARMA) models. We derive an expectation-maximization (EM) algorithm for learning the mixing coefficients as well as the parameters of the component models. The algorithm can determine the number of clusters in the data automatically. Experiments were conducted on a number of simulated and real datasets. Results from the experiments show that our method compares favorably with another method recently proposed by others for similar time series clustering problems.
Temporal pattern generation using hidden markov model based unsupervised classification
- In Proc. of IDA-99
, 1999
Cited by 15 (0 self)
This paper describes a clustering methodology for temporal data using hidden Markov model (HMM) representation. The proposed method improves upon existing HMM-based clustering methods in two ways: (i) it enables HMMs to dynamically change their model structure to obtain a better fit model for data during the clustering process, and (ii) it provides an objective criterion function to automatically select the clustering partition. The algorithm is presented in terms of four nested levels of searches: (i) the search for the number of clusters in a partition, (ii) the search for the structure for a fixed-size partition, (iii) the search for the HMM structure for each cluster, and (iv) the search for the parameter values for each HMM. Preliminary experiments with artificially generated data demonstrate the effectiveness of the proposed methodology.
Phone Clustering Using The Bhattacharyya Distance
- In Proceedings of the International Conference on Spoken Language Processing
, 1996
Cited by 14 (3 self)
In this paper we study using the classification-based Bhattacharyya distance measure to guide biphone clustering. The Bhattacharyya distance is a theoretical distance measure between two Gaussian distributions which is equivalent to an upper bound on the optimal Bayesian classification error probability. It also has the desirable properties of being computationally simple and extensible to more Gaussian mixtures. Using the Bhattacharyya distance measure in a data-driven approach together with a novel 2-Level Agglomerative Hierarchical Biphone Clustering algorithm, generalized left/right biphones (BGBs) are derived. A neural-net based phone recognizer trained on the BGBs is found to have better frame-level phone recognition than one trained on generalized biphones (BCGBs) derived from a set of commonly used broad categories. We further evaluate the new BGBs on an isolated-word recognition task of perplexity 40 and obtain a 16.2% error reduction over the broad-category generalized biphones (BCGBs) and a 41.8% error reduction over the monophones.
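The distance this abstract refers to has a closed form for Gaussians. The sketch below computes it for two Gaussians with diagonal covariance, together with the corresponding upper bound on the two-class Bayes error. Restricting to diagonal covariances and the function names are assumptions made to keep the example dependency-free; the paper's own setup is not shown in the excerpt.

```python
import math


def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Each argument is a list with one entry per dimension; variances must
    be strictly positive.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v_avg = 0.5 * (v1 + v2)
        # Term driven by the separation of the means.
        dist += 0.125 * (m1 - m2) ** 2 / v_avg
        # Term driven by the mismatch of the variances.
        dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))
    return dist


def bayes_error_bound(dist, prior1=0.5):
    """Bhattacharyya upper bound on the two-class Bayes error."""
    return math.sqrt(prior1 * (1.0 - prior1)) * math.exp(-dist)
```

For example, two unit-variance one-dimensional Gaussians with means 0 and 2 are at distance 0.5, and identical distributions are at distance 0. A larger distance gives a smaller error bound, which is why the measure can guide merging decisions in agglomerative clustering.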
M.A.T.: Similarity-based clustering of sequences using hidden markov models
Cited by 9 (5 self)
Hidden Markov models constitute a widely employed tool for sequential data modelling; nevertheless, their use in the clustering context has been poorly investigated. In this paper a novel scheme for HMM-based sequential data clustering is proposed, inspired by the similarity-based paradigm recently introduced in the supervised learning context. With this approach, a new representation space is built, in which each object is described by the vector of its similarities with respect to a predetermined set of other objects. These similarities are determined using hidden Markov models. Clustering is then performed in such a space. In this way, the difficult problem of clustering sequences is transposed to a more manageable format, the clustering of points (vectors of features). Experimental evaluation on synthetic and real data shows that the proposed approach largely outperforms standard HMM clustering schemes.
Clustering sequence data using hidden Markov model representation
- IN: PROC. OF SPIE’99 CONF. ON DATA MINING AND KNOWLEDGE DISCOVERY: THEORY
, 1999
Cited by 8 (2 self)
This paper proposes a clustering methodology for sequence data using hidden Markov model (HMM) representation. The proposed methodology improves upon existing HMM-based clustering methods in two ways: (i) it enables HMMs to dynamically change their model structure to obtain a better fit model for data during the clustering process, and (ii) it provides an objective criterion function to select the optimal clustering partition. The algorithm is presented in terms of four nested levels of searches: (i) the search for the optimal number of clusters in a partition, (ii) the search for the optimal structure for a given partition, (iii) the search for the optimal HMM structure for each cluster, and (iv) the search for the optimal HMM parameters for each HMM. Preliminary results are given to support the proposed methodology.

