Results 1-10 of 43
Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood
Statistics and Computing, 1998
Cited by 65 (4 self)
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques.
1 Introduction
Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...
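The idea of scoring each candidate number of components by its out-of-sample log-likelihood can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's procedure: the data are synthetic one-dimensional samples from two Gaussians, the mixture is fitted with a plain EM loop, and simple v-fold splitting stands in for whatever partitioning scheme the paper actually uses. The function names (`fit_gmm_1d`, `cv_loglik`) are hypothetical.

```python
import math
import random


def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def log_mixture(x, weights, means, variances):
    """Log density of the mixture at x, using a stable log-sum-exp."""
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def fit_gmm_1d(data, k, n_iter=50, var_floor=1e-3, eps=1e-10):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative only)."""
    n = len(data)
    srt = sorted(data)
    # Initialise the means at evenly spaced quantiles of the training data.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    grand = sum(data) / n
    spread = max(sum((x - grand) ** 2 for x in data) / n, var_floor)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            unnorm = [math.exp(l - top) for l in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate parameters; eps and var_floor guard against
        # empty or collapsed components.
        for j in range(k):
            nj = sum(r[j] for r in resp) + eps
            weights[j] = nj / (n + k * eps)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                var_floor)
    return weights, means, variances


def cv_loglik(data, k, n_folds=5):
    """Average held-out log-likelihood per point under v-fold splitting."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    total = 0.0
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        params = fit_gmm_1d(train, k)
        total += sum(log_mixture(x, *params) for x in test)
    return total / len(data)


# Synthetic data (an assumption): two well-separated Gaussian clusters.
random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(150)]
        + [random.gauss(3.0, 1.0) for _ in range(150)])
random.shuffle(data)

scores = {k: cv_loglik(data, k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The candidate with the highest held-out score is the one the data support best under this criterion. Unlike the training likelihood, the held-out likelihood does not necessarily increase as components are added.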
Model-based clustering and visualization of navigation patterns on a web site
Data Mining and Knowledge Discovery, 2003
Cited by 53 (0 self)
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
Keywords: Model-based clustering, sequence clustering, data visualization, Internet, web
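A mixture of first-order Markov models fitted by EM, as described in the abstract above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: states are small integers standing in for URL categories, the sequences are synthetic, a small additive constant `alpha` smooths the counts, and the function names (`em_markov_mixture`, `sample_chain`) are hypothetical.

```python
import math
import random


def normalize(row):
    """Scale a list of positive numbers so that it sums to one."""
    total = sum(row)
    return [v / total for v in row]


def seq_loglik(seq, init, trans):
    """Log-likelihood of one sequence under a first-order Markov chain."""
    ll = math.log(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        ll += math.log(trans[a][b])
    return ll


def em_markov_mixture(seqs, n_states, k, n_iter=30, alpha=0.1, seed=0):
    """EM for a k-component mixture of first-order Markov chains."""
    rng = random.Random(seed)
    # Random soft assignments break the symmetry between clusters.
    resp = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in seqs]
    for _ in range(n_iter):
        # M-step: responsibility-weighted counts, smoothed by alpha.
        weights = normalize([sum(r[c] for r in resp) + alpha for c in range(k)])
        inits, transs = [], []
        for c in range(k):
            init = [alpha] * n_states
            trans = [[alpha] * n_states for _ in range(n_states)]
            for r, s in zip(resp, seqs):
                init[s[0]] += r[c]
                for a, b in zip(s, s[1:]):
                    trans[a][b] += r[c]
            inits.append(normalize(init))
            transs.append([normalize(row) for row in trans])
        # E-step: posterior cluster membership of each whole sequence.
        resp = []
        for s in seqs:
            logs = [math.log(weights[c]) + seq_loglik(s, inits[c], transs[c])
                    for c in range(k)]
            top = max(logs)
            resp.append(normalize([math.exp(l - top) for l in logs]))
    return weights, inits, transs, resp


def sample_chain(rng, trans, length):
    """Draw one synthetic sequence from a transition matrix."""
    n = len(trans)
    state = rng.randrange(n)
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(range(n), weights=trans[state])[0]
        seq.append(state)
    return seq


# Synthetic sessions (an assumption): "sticky" users versus "cycling" users.
sticky = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
cycling = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
rng = random.Random(1)
seqs = ([sample_chain(rng, sticky, 20) for _ in range(40)]
        + [sample_chain(rng, cycling, 20) for _ in range(40)])

weights, inits, transs, resp = em_markov_mixture(seqs, n_states=3, k=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(weights, labels)
```

Each EM iteration in this sketch visits every transition of every sequence once per cluster, so its cost grows linearly with the number of clusters and with the total length of the data, which is the scaling behaviour the abstract reports.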
Discovery of Climate Indices Using Clustering
In Proc. of the 9th ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining, 2003
Cited by 26 (8 self)
To analyze the effect of the oceans and atmosphere on land climate, Earth scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. However, eigenvalue techniques are only useful for finding a few of the strongest signals. Furthermore, they impose a condition that all discovered signals must be orthogonal to each other, making it difficult to attach a physical interpretation to them. This paper presents an alternative clustering-based methodology for the discovery of climate indices that overcomes these limitations and is based on clusters that represent regions with relatively homogeneous behavior. The centroids of these clusters are time series that summarize the behavior of the ocean or atmosphere in those regions. Some of these centroids correspond to known climate indices and provide a validation of our methodology; other centroids are variants of known indices that may provide better predictive power for some land areas; and still other indices may represent potentially new Earth science phenomena. Finally, we show that cluster-based indices generally outperform SVD-derived indices, both in terms of area-weighted correlation and direct correlation with the known indices.
Weather Regimes and Preferred Transition Paths In a Three-Level . . .
2003
Cited by 19 (14 self)
Multiple flow regimes are reexamined in a global, three-level, quasi-geostrophic model with realistic topography in spherical geometry. This QG3 model, using a T21 triangular truncation in the horizontal, has a fairly realistic climatology for Northern Hemisphere winter, and exhibits multiple regimes that resemble those found in atmospheric observations. Four regimes are robust to changes in the classification method, k-means vs. mixture modeling, and its parameters. These regimes correspond roughly to opposite phases of the Arctic Oscillation (AO) and the North Atlantic Oscillation (NAO), respectively. The Markov
Multilevel Regression Modeling of Nonlinear Processes: Derivation and Applications to Climatic Variability
Journal of Climate, 2004
Cited by 17 (9 self)
Predictive models are constructed to best describe an observed field’s statistics within a given class of nonlinear dynamics driven by a spatially coherent noise that is white in time. For linear dynamics, such inverse stochastic models are obtained by multiple linear regression (MLR). Nonlinear dynamics, when more appropriate, is accommodated by applying multiple polynomial regression (MPR) instead; the resulting model uses polynomial predictors, but the dependence on the regression parameters is linear in both MPR and MLR. The basic concepts are illustrated using the Lorenz convection model, the classical double-well problem, and a three-well problem in two space dimensions. Given a data sample that is long enough, MPR successfully reconstructs the model coefficients in the former two cases, while the resulting inverse model captures the three-regime structure of the system’s probability density function (PDF) in the latter case. A novel multilevel generalization of the classic regression procedure is introduced next. In this generalization, the residual stochastic forcing at a given level is subsequently modeled as a function of variables at this level and all the preceding ones. The number of levels is determined so that the lag-0 covariance of the residual forcing converges to a constant matrix, while its lag-1 covariance vanishes. This method has been applied to the output of a three-layer, quasi-geostrophic model and to the analysis of Northern Hemisphere wintertime geopotential height anomalies. In both cases, the inverse model simulations reproduce well the multi-regime structure of the PDF constructed in the subspace spanned by the dataset’s leading empirical orthogonal functions, as well as the detailed spectrum of the dataset’s temporal evolution. These encouraging results are interpreted in terms of the modeled low-frequency flow’s feedback on the statistics of the subgrid-scale processes.
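The remark that polynomial regression remains linear in its parameters can be made concrete with a small sketch. It is a single-level illustration only, not the multilevel procedure of the paper, and it rests on stated assumptions: the double-well system is taken to be dx = (x - x^3) dt + sigma dW, the data are generated by an Euler-Maruyama scheme with hypothetical values of `dt` and `sigma`, and the finite-difference tendency is regressed on the predictors 1, x, x^2, x^3 by ordinary least squares.

```python
import math
import random


def simulate_double_well(n_steps, dt=0.01, sigma=0.7, seed=0):
    """Euler-Maruyama path of dx = (x - x**3) dt + sigma dW (assumed form)."""
    rng = random.Random(seed)
    x = 1.0
    path = [x]
    for _ in range(n_steps):
        x += (x - x ** 3) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial_tendency(path, dt, degree=3):
    """Least-squares fit of (x[n+1] - x[n]) / dt on powers of x[n].

    The predictors are nonlinear in x, but the model is linear in its
    coefficients, so the normal equations give the solution directly.
    """
    p = degree + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x_now, x_next in zip(path, path[1:]):
        feats = [x_now ** d for d in range(p)]
        y = (x_next - x_now) / dt
        for i in range(p):
            xty[i] += feats[i] * y
            for j in range(p):
                xtx[i][j] += feats[i] * feats[j]
    return solve(xtx, xty)


path = simulate_double_well(100_000)
coeffs = fit_polynomial_tendency(path, dt=0.01)
print([round(c, 3) for c in coeffs])
```

With a sufficiently long sample, the fitted coefficients should lie near the drift used to generate the data (0, 1, 0, -1 for the constant, linear, quadratic and cubic terms), which mirrors the abstract's statement that the regression reconstructs the model coefficients given enough data.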
Probabilistic Clustering of Extratropical Cyclones Using Regression Mixture Models
Climate Dynamics, 2006
Cited by 14 (5 self)
A probabilistic clustering technique is developed for classification of wintertime extratropical cyclone (ETC) tracks over the North Atlantic. We use a regression mixture model to describe the longitude–time and latitude–time propagation of the ETCs. A simple tracking algorithm is applied to 6-hourly mean sea-level pressure fields to obtain the tracks from either a general circulation model (GCM) or a reanalysis data set. Quadratic curves are found to provide the best description of the data. We select a three-cluster classification for both data sets, based on a mix of objective and subjective criteria. The track orientations in each of the clusters are broadly similar for the GCM and reanalyzed data; they are characterized by predominantly south-to-north (S–N), west-to-east (W–E), and southwest-to-northeast (SW–NE) tracking cyclones, respectively. The reanalysis cyclone tracks, however, are found to be much more tightly clustered geographically than those of the GCM. For the reanalysis data, a link is found between the occurrence of cyclones belonging to different clusters of trajectory shape, and the phase of the North Atlantic Oscillation (NAO). The positive
Subseasonal-to-interdecadal variability of the Australian monsoon over north Queensland
Quart. J. Royal Meteor. Soc., 2006
Cited by 13 (10 self)
Daily rainfall occurrence and amount at 11 stations over North Queensland are examined during summer 1958–1997, using a Hidden Markov Model (HMM). Daily rainfall variability is described in terms of the occurrence of five discrete “weather states,” identified by the HMM. Three states are characterized respectively by very wet, moderately wet, and dry conditions at most stations; two states have enhanced rainfall along the coast and dry conditions inland. Each HMM rainfall state is associated with a distinct atmospheric circulation regime. The two wet states are accompanied by monsoonal circulation patterns, with large-scale ascent, low-level inflow from the northwest, and a phase reversal with height; the dry state is characterized by circulation anomalies of the opposite sense. Two of the states show significant associations with midlatitude synoptic waves. Variability of the monsoon on time scales from subseasonal to interdecadal is interpreted in terms of changes in the frequency of occurrence of the five HMM rainfall states. Large subseasonal variability is identified in terms of active and break phases, and a highly variable monsoon onset date. The occurrence of the very wet and dry states is somewhat modulated by the Madden–Julian oscillation. On interannual timescales, there are clear relationships with the El Niño–Southern Oscillation and Indian Ocean sea surface temperatures. Interdecadal monsoonal variability is characterized by stronger monsoons during the 1970s, and weaker monsoons plus an increased prevalence of drier states since then.
Multiple regimes and low-frequency oscillations in the Northern Hemisphere’s zonal-mean flow
J. Atmos. Sci., 2006
Cited by 12 (10 self)
This paper studies multiple regimes and low-frequency oscillations in the Northern Hemisphere zonal-mean zonal flow in winter, using 55 years of daily observational data. The probability density function estimated in the phase space spanned by the two leading empirical orthogonal functions exhibits two distinct, statistically significant maxima. The two regimes associated with these maxima describe persistent zonal-flow states that are characterized by meridional displacements of the midlatitude jet, poleward and equatorward of its time-mean position. The geopotential height anomalies of either regime have a pronounced zonally symmetric component, but largest-amplitude anomalies are located over the Atlantic and Pacific oceans. High-frequency synoptic transients participate in the maintenance of and transitions between these regimes. Significant oscillatory components with periods of 147 and 72 days are identified by spectral analysis of the zonal-flow time series. These oscillations are described by singular spectrum analysis and the multitaper method. The 147-day oscillation involves zonal-flow anomalies that propagate poleward, while the 72-day oscillation only manifests northward propagation in the Atlantic sector. Both modes mainly describe changes in the midlatitude-jet position and intensity. In the horizontal plane though, the two modes exhibit synchronous centers of action located over the Atlantic and Pacific oceans. The two persistent flow regimes are associated with slow phases of either oscillation.
Probabilistic Clustering using Hierarchical Models
1999
Cited by 12 (5 self)
This paper addresses the problem of clustering data when the available data measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where for each patient there are time series, image, text, and multivariate data. We propose a general probabilistic clustering framework for clustering heterogeneous data types of this form. We focus on two-level probabilistic hierarchical models, consisting of a high-level mixture model on parameters and a low-level model for observations. This general framework permits probabilistic clustering of "objects" (sequences, histograms, images, etc.) using an extension of the expectation-maximization (EM) algorithm which we derive. We further show that earlier (intuitive) clustering algorithms can be viewed as special cases (approximations) of the framework proposed here. The paper includes several illustrations of the method, including an application to a problem in clustering two-dime...
Empirical Mode Reduction in a Model of Extratropical Low-Frequency Variability
2006
Cited by 11 (8 self)
This paper constructs and analyzes a reduced nonlinear stochastic model of extratropical low-frequency variability. To do so, it applies multilevel quadratic regression to the output of a long simulation of a global baroclinic, quasi-geostrophic, three-level (QG3) model with topography; the model’s phase space has a dimension of O(10^4). The reduced model has 45 variables and captures well the non-Gaussian features of the QG3 model’s probability density function (PDF). In particular, the reduced model’s PDF shares with the QG3 model its four anomalously persistent flow patterns, which correspond to opposite phases of the Arctic Oscillation and the North Atlantic Oscillation, as well as the Markov chain of transitions between these regimes. In addition, multichannel singular spectrum analysis identifies intraseasonal oscillations with a period of 35–37 days and of 20 days in the data generated by both the QG3 model and its low-dimensional analog. An analytical and numerical study of the reduced model starts with the fixed points and oscillatory eigenmodes of the model’s deterministic part and uses systematically an increasing noise parameter to connect these with the behavior of the full, stochastically forced model version. The results of this study point to the origin of the QG3 model’s multiple regimes and intraseasonal oscillations and identify the connections between the two types of behavior.