Results 11–20 of 42
A bit-level representation for time series data mining with shape-based similarity
, 2006
Abstract

Cited by 6 (0 self)
Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than with other popular techniques. The usefulness of the representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable run length encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform …
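The clipping transformation described in this abstract is simple enough to sketch directly. The following is a minimal illustration (not the authors' implementation); the function name `clip_series` is our own:

```python
import numpy as np

def clip_series(x):
    """Clip a real-valued series to bits: 1 where a point is above
    the series mean, 0 where it is at or below it."""
    x = np.asarray(x, dtype=float)
    return (x > x.mean()).astype(np.uint8)

series = np.array([1.0, 3.0, 2.0, 5.0, 0.5])   # mean = 2.3
bits = clip_series(series)                      # -> [0, 1, 0, 1, 0]
packed = np.packbits(bits)                      # 8 bits per byte for compact storage
```

Packing the bits gives the kind of compact, efficiently manipulable representation the abstract refers to; a run-length encoding of `bits` is equally straightforward to compute from the same array.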
The NVI Clustering Evaluation Measure
Abstract

Cited by 6 (1 self)
Clustering is crucial for many NLP tasks and applications. However, evaluating the results of a clustering algorithm is hard. In this paper we focus on the evaluation setting in which a gold standard solution is available. We discuss two existing information-theoretic measures, V and VI, and show that they are both hard to use when comparing the performance of different algorithms and different datasets. The V measure favors solutions having a large number of clusters, while the range of scores given by VI depends on the size of the dataset. We present a new measure, NVI, which normalizes VI to address the latter problem. We demonstrate the superiority of NVI in a large experiment involving an important NLP application, grammar induction, using real corpus data in English, German and Chinese.
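For concreteness, VI and its normalization can be sketched as follows. This is a minimal illustration assuming the standard definitions VI(C, K) = H(C|K) + H(K|C) and NVI = VI / H(C), with natural-log entropies; the helper names are our own:

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    """H(A | B) = H(A, B) - H(B)."""
    return entropy(list(zip(a, b))) - entropy(b)

def vi(gold, pred):
    """Variation of Information between two clusterings of the same items."""
    return cond_entropy(gold, pred) + cond_entropy(pred, gold)

def nvi(gold, pred):
    """VI normalized by the entropy of the gold-standard clustering."""
    h = entropy(gold)
    return vi(gold, pred) / h if h > 0 else vi(gold, pred)
```

VI is zero for clusterings that are identical up to relabelling, and dividing by H(C) rescales the score against the gold standard, addressing the dataset-size dependence the abstract criticizes.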
Comparison of Unsupervised Classifiers
, 1996
Abstract

Cited by 4 (0 self)
The activity of sorting like objects into classes without any help from an omniscient supervisor is known as unsupervised classification. In AI, both the symbolic and connectionist camps study classification. Statistical classifiers such as Autoclass and Snob search for the theory that can best explain the distribution of the given data, whereas neural network classifiers such as Kohonen's networks and ART2 use the vector quantization principle for classifying data. Previously, many studies have compared supervised classification algorithms, but the more challenging problem of comparing unsupervised classifiers has largely been ignored. We performed an empirical comparison of ART2, Autoclass and Snob. We highlight the strengths and weaknesses of the various classifiers. Overall, the statistical classifiers, especially Snob, perform better than their neural network counterpart ART2. Keywords: Unsupervised classification. Area of Interest: Concept formation and classification.
Type Level Clustering Evaluation: New Measures and a POS Induction Case Study
Abstract

Cited by 3 (0 self)
Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the token statistics of a particular reference corpus. Type level evaluation casts light on the merits of algorithms, and for some applications is a more natural measure of the algorithm’s quality. We propose new type level evaluation measures that, contrary to existing measures, are applicable when items are polysemous, the common case in NLP. We demonstrate the benefits of our measures using a detailed case study, POS induction. We experiment with seven leading algorithms, obtaining useful insights and showing that token and type level measures can weakly or even negatively correlate, which underscores the fact that these two approaches reveal different aspects of clustering quality.
Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics 2005
Clustering based on Dirichlet mixtures of attribute ensembles
, 2004
Abstract

Cited by 2 (2 self)
We discuss a model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population may depend on the cluster being considered. The method is based on a Pólya urn cluster model for multivariate means and variances, resulting in a multivariate Dirichlet process mixture model. This particular model-based approach accommodates outliers and allows for the incorporation of application-specific data features into the clustering scheme. For example, in an analysis of genetic CGH array data we are able to design a clustering method that accounts for spatial dependence of chromosomal abnormalities. Key words: nonparametric Bayes, unsupervised learning, subspace clustering, variable
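The Pólya urn scheme underlying such a Dirichlet process mixture can be illustrated via its Chinese-restaurant-process form. This is a generic sketch of the clustering prior only, not the authors' full model with cluster-specific means, variances, and spatial dependence; the function name is our own:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a random partition of n items from the Chinese restaurant
    process: item i joins existing cluster k with probability
    counts[k] / (i + alpha), or opens a new cluster with
    probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    counts, labels = [], []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:          # join existing cluster k
                counts[k] += 1
                labels.append(k)
                break
        else:                    # open a new cluster
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels
```

Larger `alpha` tends to produce more clusters; the "rich get richer" urn dynamics are what let the number of clusters be inferred rather than fixed in advance.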
Measurement
Abstract

Cited by 2 (0 self)
Classification procedures are common and useful in behavioral, educational, social, and managerial research. Supervised classification techniques such as discriminant function analysis assume training data are perfectly classified when estimating parameters or classifying. In contrast, unsupervised classification techniques such as finite mixture models (FMM) do not require, or even use if available, knowledge of group status to estimate parameters or to classify. This study investigates the impact of two types of misclassification errors on the classification accuracy of discriminant function analysis (both linear [LDA] and quadratic [QDA]) and FMM for two groups with a single predictor. Analytic and Monte Carlo results are provided for a variety of misclassification scenarios to investigate the performance of the two methods. Discriminant function techniques recovered the highest overall percentages of correctly classified data, whereas FMM captured higher percentages of the smaller group when group sizes were unequal. LDA marginally outperformed QDA under misclassified conditions. Keywords: classification, misclassification, linear discriminant function analysis, quadratic discriminant function analysis, mixture model, training data
A Quantification of Cluster Novelty with an Application to Martian Topography
Abstract

Cited by 1 (0 self)
Automated tools for knowledge discovery are frequently invoked in databases where objects already group into some known classification scheme. In the context of unsupervised learning or clustering, such tools delve inside large databases looking for alternative classification schemes that are both meaningful and novel. A quantification of cluster novelty can be looked upon as the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a Gaussian distribution and estimates the degree of overlap between the two distributions by measuring their intersecting area. Unlike other metrics, our method quantifies the novelty of each cluster individually, and enables us to rank classes according to their similarity to each new cluster. We test our algorithm on Martian landscapes using a set of known classes called geological units; experimental results show a new interpretation for the characterization of Martian landscapes.
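The intersecting-area idea can be illustrated in one dimension with a simple numerical sketch (a simplified illustration, not the paper's method for fitting the Gaussians; the function names are our own):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_area(mu1, s1, mu2, s2, steps=20000):
    """Numerically integrate min(pdf1, pdf2) over a range covering both
    distributions: close to 1.0 for identical Gaussians, close to 0.0
    for well-separated ones."""
    lo = min(mu1 - 6 * s1, mu2 - 6 * s2)
    hi = max(mu1 + 6 * s1, mu2 + 6 * s2)
    dx = (hi - lo) / steps
    return sum(min(gauss_pdf(lo + i * dx, mu1, s1),
                   gauss_pdf(lo + i * dx, mu2, s2))
               for i in range(steps)) * dx
```

Low overlap between a new cluster and its most similar known class corresponds to high novelty in the sense of the abstract; ranking classes by this overlap per cluster gives the individual novelty scores it describes.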
The Effects of Initially Misclassified Data on the Effectiveness of Discriminant Function Analysis and Finite Mixture Modeling
Abstract

Cited by 1 (1 self)
Authors' Note: The authors would like to thank Michael P. Trelinski for all his computer programming insight. Classification procedures are common and useful in behavioral, educational, and social research. Supervised classification techniques like discriminant function analysis (DFA) assume training data are perfectly classified when estimating parameters and classifying. In contrast, unsupervised classification techniques like finite mixture models (FMM) do not require knowledge of group status in order to estimate parameters or classify. The purpose of this study is to investigate the impact of misclassification errors (both randomly distributed errors and errors weighted toward the distribution overlap) on the classification accuracy of DFA and FMM. Analytic and Monte Carlo results are provided for a variety of misclassification scenarios to investigate the performance of the two methods. DFA recovered the highest overall percentages of correctly classified data, whereas FMM captured higher percentages of the smaller group when group sizes were unequal.