Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood
- Statistics and Computing, 1998
Cited by 46 (3 self)
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques.
1 Introduction
Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...
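As a rough illustration of the idea in the abstract above, the following sketch scores candidate numbers of mixture components by their average held-out log-likelihood. It is not the paper's procedure: the 1-D Gaussian mixture, the quantile-based initialisation, the plain 5-fold split, the synthetic data, and every function name (`fit_gmm_1d`, `cv_loglik`, and so on) are assumptions made for this example.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def fit_gmm_1d(data, k, iters=50, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative only)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialisation: means at evenly spaced quantiles of the data.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, var_floor)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = logsumexp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: re-estimate weights, means and (floored) variances.
        counts = [sum(r[j] for r in resp) + 1e-12 for j in range(k)]
        for j in range(k):
            weights[j] = counts[j] / sum(counts)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / counts[j]
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / counts[j]
            variances[j] = max(var, var_floor)
    return weights, means, variances

def heldout_loglik(params, data):
    """Total log-likelihood of unseen data under a fitted mixture."""
    weights, means, variances = params
    return sum(
        logsumexp([math.log(w) + log_normal_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)])
        for x in data
    )

def cv_loglik(data, k, folds=5):
    """Average held-out log-likelihood per point over the folds."""
    total = 0.0
    for f in range(folds):
        test = data[f::folds]
        train = [x for i, x in enumerate(data) if i % folds != f]
        total += heldout_loglik(fit_gmm_1d(train, k), test)
    return total / len(data)

# Synthetic, clearly bimodal data (an assumption for the demo).
random.seed(0)
data = [random.gauss(-5, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
random.shuffle(data)

scores = {k: cv_loglik(data, k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
```

The candidate with the highest held-out score is taken as the preferred number of components; because the score is computed on data the model never saw, adding components is only rewarded when it improves out-of-sample prediction.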
Model-based clustering and visualization of navigation patterns on a web site
- Data Mining and Knowledge Discovery, 2003
Cited by 36 (0 self)
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com. Keywords: Model-based clustering, sequence clustering, data visualization, Internet, web
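To make the "mixture of first-order Markov models" in the abstract above more concrete, here is a minimal sketch of the E-step only: given fixed component parameters, it computes the posterior probability that a session belongs to each cluster. The category names, the probability tables, and the function names are hypothetical; they are not taken from the paper or from WebCANVAS.

```python
import math

def sequence_loglik(seq, initial, transition):
    """Log-likelihood of a category sequence under one first-order Markov chain."""
    ll = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(transition[prev][cur])
    return ll

def responsibilities(seq, weights, components):
    """Posterior cluster membership probabilities for one sequence (E-step)."""
    logs = [math.log(w) + sequence_loglik(seq, init, trans)
            for w, (init, trans) in zip(weights, components)]
    m = max(logs)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(v - m) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two hypothetical clusters over two hypothetical URL categories.
news_readers = (
    {"news": 0.9, "sports": 0.1},
    {"news": {"news": 0.8, "sports": 0.2}, "sports": {"news": 0.6, "sports": 0.4}},
)
sports_fans = (
    {"news": 0.1, "sports": 0.9},
    {"news": {"news": 0.3, "sports": 0.7}, "sports": {"news": 0.2, "sports": 0.8}},
)

session = ["news", "news", "news", "sports"]
resp = responsibilities(session, [0.5, 0.5], [news_readers, sports_fans])
```

In a full EM loop these responsibilities would weight each session's contribution when the initial and transition probabilities of every cluster are re-estimated in the M-step. Each E-step visits every session once per cluster, which is consistent with the linear scaling the abstract mentions.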
Discovery of Climate Indices Using Clustering
- In Proc. of the 9th ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining, 2003
Cited by 14 (5 self)
To analyze the effect of the oceans and atmosphere on land climate, Earth scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. However, eigenvalue techniques are only useful for finding a few of the strongest signals. Furthermore, they impose a condition that all discovered signals must be orthogonal to each other, making it difficult to attach a physical interpretation to them. This paper presents an alternative clustering-based methodology for the discovery of climate indices that overcomes these limitations and is based on clusters that represent regions with relatively homogeneous behavior. The centroids of these clusters are time series that summarize the behavior of the ocean or atmosphere in those regions. Some of these centroids correspond to known climate indices and provide a validation of our methodology; other centroids are variants of known indices that may provide better predictive power for some land areas; and still other indices may represent potentially new Earth science phenomena. Finally, we show that cluster-based indices generally outperform SVD-derived indices, both in terms of area-weighted correlation and direct correlation with the known indices.
Weather Regimes and Preferred Transition Paths In a Three-Level . . .
- 2003
Cited by 13 (12 self)
Multiple flow regimes are reexamined in a global, three-level, quasi-geostrophic model with realistic topography in spherical geometry. This QG3 model, using a T21 triangular truncation in the horizontal, has a fairly realistic climatology for Northern Hemisphere winter, and exhibits multiple regimes that resemble those found in atmospheric observations. Four regimes are robust to changes in the classification method, k-means vs. mixture modeling, and its parameters. These regimes correspond roughly to opposite phases of the Arctic Oscillation (AO) and the North Atlantic Oscillation (NAO), respectively. The Markov
Multilevel Regression Modeling of Nonlinear Processes: Derivation and Applications to Climatic Variability
- Journal of Climate, 2004
Cited by 13 (9 self)
Predictive models are constructed to best describe an observed field’s statistics within a given class of nonlinear dynamics driven by a spatially coherent noise that is white in time. For linear dynamics, such inverse stochastic models are obtained by multiple linear regression (MLR). Nonlinear dynamics, when more appropriate, is accommodated by applying multiple polynomial regression (MPR) instead; the resulting model uses polynomial predictors, but the dependence on the regression parameters is linear in both MPR and MLR. The basic concepts are illustrated using the Lorenz convection model, the classical double-well problem, and a three-well problem in two space dimensions. Given a data sample that is long enough, MPR successfully reconstructs the model coefficients in the former two cases, while the resulting inverse model captures the three-regime structure of the system’s probability density function (PDF) in the latter case. A novel multilevel generalization of the classic regression procedure is introduced next. In this generalization, the residual stochastic forcing at a given level is subsequently modeled as a function of variables at this level and all the preceding ones. The number of levels is determined so that the lag-0 covariance of the residual forcing converges to a constant matrix, while its lag-1 covariance vanishes. This method has been applied to the output of a three-layer, quasigeostrophic model and to the analysis of Northern Hemisphere wintertime geopotential height anomalies. In both cases, the inverse model simulations reproduce well the multiregime structure of the PDF constructed in the subspace spanned by the dataset’s leading empirical orthogonal functions, as well as the detailed spectrum of the dataset’s temporal evolution. These encouraging results are interpreted in terms of the modeled low-frequency flow’s feedback on the statistics of the subgrid-scale processes.
Multiple regimes and low-frequency oscillations in the Northern Hemisphere’s zonal-mean flow
- J. Atmos. Sci., 2006
Cited by 10 (8 self)
This paper studies multiple regimes and low-frequency oscillations in the Northern Hemisphere zonal-mean zonal flow in winter, using 55 years of daily observational data. The probability density function estimated in the phase space spanned by the two leading empirical orthogonal functions exhibits two distinct, statistically significant maxima. The two regimes associated with these maxima describe persistent zonal-flow states that are characterized by meridional displacements of the midlatitude jet, poleward and equatorward of its time-mean position. The geopotential height anomalies of either regime have a pronounced zonally symmetric component, but largest-amplitude anomalies are located over the Atlantic and Pacific oceans. High-frequency synoptic transients participate in the maintenance of and transitions between these regimes. Significant oscillatory components with periods of 147 and 72 days are identified by spectral analysis of the zonal-flow time series. These oscillations are described by singular spectrum analysis and the multitaper method. The 147-day oscillation involves zonal-flow anomalies that propagate poleward, while the 72-day oscillation only manifests northward propagation in the Atlantic sector. Both modes mainly describe changes in the midlatitude-jet position and intensity. In the horizontal plane though, the two modes exhibit synchronous centers of action located over the Atlantic and Pacific oceans. The two persistent flow regimes are associated with slow phases of either oscillation.
Probabilistic Clustering using Hierarchical Models
- 1999
Cited by 10 (5 self)
This paper addresses the problem of clustering data when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient there are time series, image, text, and multivariate data. We propose a general probabilistic clustering framework for clustering heterogeneous data types of this form. We focus on two-level probabilistic hierarchical models, consisting of a high-level mixture model on parameters and a low-level model for observations. This general framework permits probabilistic clustering of "objects" (sequences, histograms, images, etc.) using an extension of the expectation-maximization (EM) algorithm, which we derive. We further show that earlier (intuitive) clustering algorithms can be viewed as special cases (approximations) of the framework proposed here. The paper includes several illustrations of the method, including an application to a problem in clustering two-dime...
Empirical Mode Reduction in a Model of Extratropical Low-Frequency Variability
- 2006
Cited by 9 (8 self)
This paper constructs and analyzes a reduced nonlinear stochastic model of extratropical low-frequency variability. To do so, it applies multilevel quadratic regression to the output of a long simulation of a global baroclinic, quasigeostrophic, three-level (QG3) model with topography; the model’s phase space has a dimension of O(10^4). The reduced model has 45 variables and captures well the non-Gaussian features of the QG3 model’s probability density function (PDF). In particular, the reduced model’s PDF shares with the QG3 model its four anomalously persistent flow patterns, which correspond to opposite phases of the Arctic Oscillation and the North Atlantic Oscillation, as well as the Markov chain of transitions between these regimes. In addition, multichannel singular spectrum analysis identifies intraseasonal oscillations with a period of 35–37 days and of 20 days in the data generated by both the QG3 model and its low-dimensional analog. An analytical and numerical study of the reduced model starts with the fixed points and oscillatory eigenmodes of the model’s deterministic part and uses systematically an increasing noise parameter to connect these with the behavior of the full, stochastically forced model version. The results of this study point to the origin of the QG3 model’s multiple regimes and intraseasonal oscillations and identify the connections between the two types of behavior.
Probabilistic Clustering of Extratropical Cyclones Using Regression Mixture Models
- Climate Dynamics, 2006
Cited by 9 (2 self)
A probabilistic clustering technique is developed for classification of wintertime extratropical cyclone (ETC) tracks over the North Atlantic. We use a regression mixture model to describe the longitude–time and latitude–time propagation of the ETCs. A simple tracking algorithm is applied to 6-hourly mean sea-level pressure fields to obtain the tracks from either a general circulation model (GCM) or a reanalysis data set. Quadratic curves are found to provide the best description of the data. We select a three-cluster classification for both data sets, based on a mix of objective and subjective criteria. The track orientations in each of the clusters are broadly similar for the GCM and reanalyzed data; they are characterized by predominantly south-to-north (S–N), west-to-east (W–E), and southwest-to-northeast (SW–NE) tracking cyclones, respectively. The reanalysis cyclone tracks, however, are found to be much more tightly clustered geographically than those of the GCM. For the reanalysis data, a link is found between the occurrence of cyclones belonging to different clusters of trajectory-shape, and the phase of the North Atlantic Oscillation (NAO). The positive
Clustering Earth Science Data: Goals, Issues and Results
- 2001
Cited by 7 (6 self)
This paper reports on recent work applying data mining to the task of finding interesting patterns in earth science data derived from global observing satellites, terrestrial observations, and ecosystem models. Patterns are "interesting" if ecosystem scientists can use them to better understand and predict changes in the global carbon cycle and climate system. The initial goal of the work reported here (which is only part of the overall project) is to use clustering to divide the land and ocean areas of the earth into disjoint regions in an automatic, but meaningful, way that enables the direct or indirect discovery of interesting patterns. Finding "meaningful" clusters requires an approach that is aware of various issues related to the spatial and temporal nature of earth science data: the "proper" measure of similarity between time series, removing seasonality from the data to allow detection of non-seasonal patterns, and the presence of spatial and temporal autocorrelation (i.e., measured values that are close in time and space tend to be highly correlated, or similar). While we have techniques to handle some of these spatiotemporal issues (e.g., removing seasonality) and some issues are not a problem (e.g., spatial autocorrelation actually helps our clustering), other issues require more study (e.g., temporal autocorrelation and its effect on time series similarity). Nonetheless, by using K-means as our clustering algorithm and taking linear correlation as our measure of similarity between time series, we have been able to find some interesting ecosystem patterns, including some that are well known to earth scientists and some that require further investigation.
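The combination described at the end of the abstract above, K-means with linear correlation as the similarity between time series, can be sketched as follows. This is an assumed, simplified variant written for illustration: it initialises the centroids with the first k series, uses the plain mean of the members as the centroid, and runs on toy sinusoidal data; none of these choices or names come from the paper.

```python
import math

def pearson(a, b):
    """Pearson (linear) correlation between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (norm_a * norm_b)

def kmeans_correlation(series, k, iters=20):
    """K-means-style clustering that assigns by highest correlation."""
    # Assumed initialisation: the first k series act as starting centroids.
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iters):
        # Assignment step: pick the centroid each series correlates with most.
        labels = [max(range(k), key=lambda j: pearson(s, centroids[j]))
                  for s in series]
        # Update step: a centroid is the pointwise mean of its members.
        for j in range(k):
            members = [s for s, label in zip(series, labels) if label == j]
            if members:
                centroids[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centroids

# Toy data: one base signal and rescaled or sign-flipped copies of it.
base = [math.sin(2 * math.pi * t / 12) for t in range(24)]
series = [
    base,
    [-x for x in base],
    [2 * x + 1 for x in base],
    [-3 * x + 5 for x in base],
    [0.5 * x - 2 for x in base],
]

labels, centroids = kmeans_correlation(series, k=2)
```

Because correlation ignores offset and positive scaling, the rescaled copies of the base signal fall in one cluster and the sign-flipped copies in the other. The resulting centroids are themselves time series that summarize each group.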

