Estimating the Number of Segments in Time Series Data Using Permutation Tests
- IEEE International Conference on Data Mining, 2002
Cited by 13 (1 self)
Segmentation is a popular technique for discovering structure in time series data. We address the largely open problem of estimating the number of segments that can be reliably discovered. We introduce a novel method for the problem, called Pete. Pete is based on permutation testing.
Translation-Invariant Mixture Models for Curve Clustering
- In Proc. Ninth ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, Washington D.C., August 24–27, 2003
Cited by 10 (2 self)
In this paper we present a family of algorithms that can simultaneously align and cluster sets of multidimensional curves defined on a discrete time grid. Our approach assumes that the data are being generated from a finite mixture of curve models. Each mixture component uses (a) a mean curve based on a flexible non-parametric representation, (b) additive measurement noise, (c) randomly selected discrete-valued shifts of each curve with respect to the independent variable (i.e., typically along the time axis), and (d) random real-valued offsets of each curve with respect to the observed variable. We show that the Expectation-Maximization (EM) algorithm can be used to simultaneously recover both the curve models for each cluster, and the most likely shifts, offsets, and cluster memberships for each curve. We demonstrate how Bayesian estimation methods can improve the results for small sample sizes by enforcing smoothness in the cluster mean curves. We evaluate the methodology on two real-world data sets, time-course gene expression data and storm trajectory data. Experimental results show that models that incorporate curve alignment systematically provide improvements in predictive power on test data sets. The proposed approach provides a non-parametric, computationally efficient, and robust methodology for clustering broad classes of curve data.
Mixture Model Analysis of DNA Microarray Images
- IEEE Trans Med Imaging, 2005
Cited by 9 (0 self)
In this paper we propose a new methodology for analysis of microarray images. First a new gridding algorithm is proposed for determining the individual spots and their borders.
Probabilistic Clustering of Extratropical Cyclones Using Regression Mixture Models
- Climate Dynamics, 2006
Cited by 9 (2 self)
A probabilistic clustering technique is developed for classification of wintertime extratropical cyclone (ETC) tracks over the North Atlantic. We use a regression mixture model to describe the longitude-time and latitude-time propagation of the ETCs. A simple tracking algorithm is applied to 6-hourly mean sea-level pressure fields to obtain the tracks from either a general circulation model (GCM) or a reanalysis data set. Quadratic curves are found to provide the best description of the data. We select a three-cluster classification for both data sets, based on a mix of objective and subjective criteria. The track orientations in each of the clusters are broadly similar for the GCM and reanalyzed data; they are characterized by predominantly south-to-north (S–N), west-to-east (W–E), and southwest-to-northeast (SW–NE) tracking cyclones, respectively. The reanalysis cyclone tracks, however, are found to be much more tightly clustered geographically than those of the GCM. For the reanalysis data, a link is found between the occurrence of cyclones belonging to different clusters of trajectory-shape, and the phase of the North Atlantic Oscillation (NAO). The positive
Asymptotic optimality of likelihood-based cross-validation
- Statistical Applications in Genetics and Molecular Biology, 2003
Cited by 8 (2 self)
Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (with respect to the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences.
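To make the bandwidth-selection use case in this abstract concrete, here is a minimal sketch of V-fold likelihood-based cross-validation for a one-dimensional Gaussian kernel density estimator. It is not the authors' code: the function names `gaussian_kde_logpdf` and `cv_select_bandwidth`, the Gaussian kernel, and the default of five folds are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_logpdf(train, points, bandwidth):
    """Log-density at `points` of a Gaussian kernel estimate fitted on `train` (hypothetical helper)."""
    diffs = (points[:, None] - train[None, :]) / bandwidth
    log_kernels = -0.5 * diffs**2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # Numerically stable log of the average kernel value over training points.
    max_log = log_kernels.max(axis=1, keepdims=True)
    return max_log[:, 0] + np.log(np.exp(log_kernels - max_log).mean(axis=1))

def cv_select_bandwidth(data, bandwidths, n_folds=5, seed=0):
    """Return the candidate bandwidth with the highest mean held-out log-likelihood, plus all scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), n_folds)
    scores = []
    for h in bandwidths:
        total = 0.0
        for k in range(n_folds):
            valid_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Fit on the training folds, score the validation fold by log-likelihood.
            total += gaussian_kde_logpdf(data[train_idx], data[valid_idx], h).sum()
        scores.append(total / len(data))
    scores = np.array(scores)
    return bandwidths[int(np.argmax(scores))], scores
```

Each validation fold here holds roughly 1/V of the sample, so its size grows with n, in line with the abstract's condition that the validation sample must grow without bound (which leave-one-out does not satisfy).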
Probabilistic Curve-Aligned Clustering and Prediction with Regression Mixture Models
- Ph.D. Dissertation, Laboratoire MAS, 2004
Supervised detection of regulatory motifs in DNA sequences
- Statistical Applications in Genetics and Molecular Biology, 2003
Cited by 4 (1 self)
Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood-based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs motif sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model to a constrained multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihood-based cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE and biologically verified sites were identified.
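As a rough illustration of the PWM model this abstract describes (independent four-cell multinomials per position), the sketch below scores a candidate site under a PWM and computes a per-position information content profile. It is not COMODE: the function names, the A/C/G/T column order, the uniform background, and the requirement that all PWM entries be strictly positive are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"  # assumed column order of the PWM

def pwm_log_likelihood(pwm, site):
    """Log-probability of `site` when each position is an independent multinomial over A, C, G, T."""
    idx = [BASES.index(b) for b in site]
    return float(np.sum(np.log(pwm[np.arange(len(site)), idx])))

def information_content(pwm, background=None):
    """Per-position information content in bits, relative to a background distribution.

    Assumes every PWM entry is strictly positive so the logarithm is defined.
    """
    if background is None:
        background = np.full(4, 0.25)  # uniform background, an assumption
    return np.sum(pwm * np.log2(pwm / background), axis=1)
```

A uniform PWM row carries zero information under a uniform background, while a row concentrated on one base carries more. A profile of this kind is what the constrained model in the abstract places restrictions on.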
A site- and time-heterogeneous model of amino-acid replacement
- 2007
Cited by 4 (1 self)
We combined the CAT mixture model (Lartillot and Philippe 2004) and the non-stationary BP model (Blanquart and Lartillot 2006) into a new model, CAT-BP, accounting for variations of the evolutionary process both along the sequence and across lineages. As in CAT, the model implements a mixture of distinct Markovian processes of substitution distributed among sites, thus accommodating site-specific selective constraints induced by protein structure and function. Furthermore, as in BP, these processes are non-stationary, and their equilibrium frequencies are allowed to change along lineages in a correlated way, through discrete shifts in global amino acid composition distributed along the phylogenetic tree. We implemented the CAT-BP model in a Bayesian Markov Chain Monte Carlo framework, and compared its predictions with those of three simpler models, BP, CAT, and the site- and time-homogeneous GTR model, on a concatenation of four mitochondrial proteins of 20 arthropod species. In contrast to GTR, BP and CAT, which all display a phylogenetic reconstruction artefact positioning the bees Apis m. and Melipona b. among chelicerates, the CAT-BP model
Massive Datasets In Astronomy
Cited by 2 (0 self)
Astronomy has a long history of acquiring, systematizing, and interpreting large quantities of data. Starting from the earliest sky atlases through the first major photographic sky surveys of the 20th century, this tradition is continuing today, and at an ever increasing rate.
Visualizing Class Probability Estimators
- In Lecture Notes in Artificial Intelligence 2838, 2003
Cited by 2 (0 self)
Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate "black boxes"; that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a two-dimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).

