Results 11–20 of 26
A bit level representation for time series data mining with shape based similarity, 2006
Cited by 4 (0 self)
Abstract:
Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate that time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power of clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this allows us to form a tight lower-bounding metric for Euclidean and Dynamic Time Warping distance and hence to query by content efficiently. Clipped data can be used in conjunction with a host of algorithms and statistical tests that follow naturally from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than other popular techniques. The usefulness of the representation is demonstrated by the fact that results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transform at the same compression ratio, for both clustering and query by content. The flexibility of the representation is shown by the fact that we can exploit a variable run-length encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform ...
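The core transform is simple to sketch. The snippet below (a minimal illustration, not the authors' implementation; function names are ours) clips a series at its mean and applies a run-length encoding of the resulting bits, the kind of variable-length compression the abstract mentions:

```python
import numpy as np

def clip_series(x):
    """Clip a real-valued series to bits: 1 where a point is above the mean, else 0."""
    x = np.asarray(x, dtype=float)
    return (x > x.mean()).astype(np.uint8)

def run_length_encode(bits):
    """Variable run-length encoding of a clipped (binary) series: (value, run) pairs."""
    bits = np.asarray(bits)
    change = np.flatnonzero(np.diff(bits)) + 1   # positions where the bit flips
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(bits)]))
    return [(int(bits[s]), int(e - s)) for s, e in zip(starts, ends)]

series = [1.0, 2.0, 3.0, 0.0, 0.5, 4.0]   # mean = 1.75
bits = clip_series(series)                # -> [0, 1, 1, 0, 0, 1]
runs = run_length_encode(bits)            # -> [(0, 1), (1, 2), (0, 2), (1, 1)]
```

Because the bits record each point's position relative to the series' own mean, a clipped query can still be compared against raw candidate series, which is what makes the lower-bounding argument in the abstract possible.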
Comparison of Unsupervised Classifiers, 1996
Cited by 3 (0 self)
Abstract:
The activity of sorting like objects into classes without any help from an omniscient supervisor is known as unsupervised classification. In AI, both the symbolic and the connectionist camps study classification. Statistical classifiers such as Autoclass and Snob search for the theory that best explains the distribution of the given data, whereas neural network classifiers such as Kohonen's networks and ART2 use the vector quantization principle to classify data. Many previous studies have compared supervised classification algorithms, but the more challenging problem of comparing unsupervised classifiers has largely been ignored. We performed an empirical comparison of ART2, Autoclass, and Snob, and highlight the strengths and weaknesses of the various classifiers. Overall, the statistical classifiers, especially Snob, perform better than their neural network counterpart ART2.
Keywords: unsupervised classification. Area of interest: concept formation and classification.
Type Level Clustering Evaluation: New Measures and a POS Induction Case Study
Cited by 3 (0 self)
Abstract:
Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the token statistics of a particular reference corpus. Type level evaluation casts light on the merits of algorithms, and for some applications is a more natural measure of the algorithm’s quality. We propose new type level evaluation measures that, contrary to existing measures, are applicable when items are polysemous, the common case in NLP. We demonstrate the benefits of our measures using a detailed case study, POS induction. We experiment with seven leading algorithms, obtaining useful insights and showing that token and type level measures can weakly or even negatively correlate, which underscores the fact that these two approaches reveal different aspects of clustering quality.
Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics, 2005
Measurement
Cited by 1 (0 self)
Abstract:
Classification procedures are common and useful in behavioral, educational, social, and managerial research. Supervised classification techniques such as discriminant function analysis assume the training data are perfectly classified when estimating parameters or classifying. In contrast, unsupervised classification techniques such as finite mixture models (FMM) do not require, or even use if available, knowledge of group status to estimate parameters or to classify. This study investigates the impact of two types of misclassification errors on the classification accuracy of discriminant function analysis (both linear [LDA] and quadratic [QDA]) and FMM for two groups with a single predictor. Analytic and Monte Carlo results are provided for a variety of misclassification scenarios to investigate the performance of the two methods. Discriminant function techniques recovered the highest overall percentages of correctly classified data, whereas FMM captured higher percentages of the smaller group when group sizes are unequal. LDA marginally outperformed QDA under misclassified conditions.
Keywords: classification, misclassification, linear discriminant function analysis, quadratic discriminant function analysis, mixture model, training data
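As a rough illustration of the single-predictor, two-group setting the abstract describes, the toy below fits the textbook two-group LDA rule (equal-variance midpoint threshold) on training labels that have been randomly flipped at a given rate. The group means, sample sizes, and flip rates are our assumptions for illustration, not the study's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups on a single predictor: N(0, 1) vs N(2, 1), equal sizes (assumed values).
n = 500
x = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n)])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

def flip_labels(labels, rate, rng):
    """Randomly misclassify a fraction `rate` of the training labels."""
    flipped = labels.copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    flipped[idx] = 1 - flipped[idx]
    return flipped

def lda_threshold(x, labels):
    """Two-group, equal-variance LDA on one predictor: threshold at the midpoint of the group means."""
    m0, m1 = x[labels == 0].mean(), x[labels == 1].mean()
    return (m0 + m1) / 2.0

accuracy = {}
for rate in (0.0, 0.1, 0.3):
    noisy = flip_labels(y, rate, rng)
    thr = lda_threshold(x, noisy)
    pred = (x > thr).astype(int)   # group 1 lies above the threshold in this setup
    accuracy[rate] = float((pred == y).mean())
```

One thing this toy makes visible: in a symmetric setup with equal group sizes, purely random flips pull both estimated group means toward each other but barely move their midpoint, so the LDA rule degrades gracefully; the study's other scenarios (unequal group sizes, errors weighted toward the distribution overlap) are where the methods can behave differently.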
The Effects of Initially Misclassified Data on the Effectiveness of Discriminant Function Analysis and Finite Mixture Modeling
Cited by 1 (1 self)
Abstract:
Authors' Note: The authors would like to thank Michael P. Trelinski for all his computer programming insight.
Classification procedures are common and useful in behavioral, educational, and social research. Supervised classification techniques like discriminant function analysis (DFA) assume the training data are perfectly classified when estimating parameters and classifying. In contrast, unsupervised classification techniques like finite mixture models (FMM) do not require knowledge of group status in order to estimate parameters or classify. The purpose of this study is to investigate the impact of misclassification errors (both randomly distributed errors and errors weighted toward the distribution overlap) on the classification accuracy of DFA and FMM. Analytic and Monte Carlo results are provided for a variety of misclassification scenarios to investigate the performance of the two methods. DFA recovered the highest overall percentages of correctly classified data, whereas FMM captured higher percentages of the smaller group when group sizes are unequal.
A Quantification of Cluster Novelty with an Application to Martian Topography
Cited by 1 (0 self)
Abstract:
Automated tools for knowledge discovery are frequently invoked in databases where objects already group into some known classification scheme. In the context of unsupervised learning or clustering, such tools delve inside large databases looking for alternative classification schemes that are both meaningful and novel. A quantification of cluster novelty can be viewed as the degree of separation between each new cluster and its most similar class. Our approach models each cluster and class as a Gaussian distribution and estimates the degree of overlap between the two distributions by measuring their intersecting area. Unlike other metrics, our method quantifies the novelty of each cluster individually and enables us to rank classes according to their similarity to each new cluster. We test our algorithm on Martian landscapes using a set of known classes called geological units; experimental results show a new interpretation for the characterization of Martian landscapes.
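The intersecting-area idea is easy to see in one dimension: model the new cluster and the known class as Gaussians and integrate the pointwise minimum of the two densities. The sketch below is an illustrative 1-D reduction with a simple numerical integral, not the paper's multivariate procedure:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def overlap_area(mu1, s1, mu2, s2, grid=200_000):
    """Intersecting area of two Gaussians: integral of min(f1, f2) over a wide grid."""
    lo = min(mu1 - 6.0 * s1, mu2 - 6.0 * s2)
    hi = max(mu1 + 6.0 * s1, mu2 + 6.0 * s2)
    xs = np.linspace(lo, hi, grid)
    f = np.minimum(gaussian_pdf(xs, mu1, s1), gaussian_pdf(xs, mu2, s2))
    return float(np.sum(f) * (xs[1] - xs[0]))   # rectangle-rule integral

# Identical distributions overlap completely, so novelty = 1 - overlap is ~0;
# well-separated distributions barely overlap, so novelty approaches 1.
same = overlap_area(0.0, 1.0, 0.0, 1.0)             # ~= 1.0
novelty = 1.0 - overlap_area(0.0, 1.0, 8.0, 1.0)    # close to 1
```

Ranking the known classes by this overlap against each candidate cluster yields the per-cluster novelty ordering the abstract describes; in higher dimensions the same integral is taken over the multivariate densities.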
V-Measure: A conditional entropy-based external cluster evaluation measure
Abstract:
We present V-measure, an external entropy-based cluster evaluation measure. V-measure provides an elegant solution to many problems that affect previously defined cluster evaluation measures, including (1) dependence on the clustering algorithm or data set, (2) the “problem of matching”, where the clustering of only a portion of the data points is evaluated, and (3) accurate evaluation and combination of two desirable aspects of clustering, homogeneity and completeness. We compare V-measure to a number of popular cluster evaluation measures and demonstrate, using simulated clustering results, that it satisfies several desirable properties of clustering solutions. Finally, we use V-measure to evaluate two clustering tasks: document clustering and pitch accent type clustering.
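For concreteness, here is a minimal sketch of an entropy-based homogeneity/completeness measure along these lines. The formulas follow the published V-measure definition (conditional entropies over the class–cluster contingency table, combined by a weighted harmonic mean); the code itself is our illustration, not quoted from this abstract:

```python
import numpy as np
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of an unnormalized count vector."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

def v_measure(labels_true, labels_pred, beta=1.0):
    """Homogeneity and completeness from the class-cluster contingency table, combined harmonically."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    n = len(labels_true)
    cont = Counter(zip(labels_true, labels_pred))
    A = np.array([[cont.get((c, k), 0) for k in clusters] for c in classes], dtype=float)
    h_c, h_k = entropy(A.sum(axis=1)), entropy(A.sum(axis=0))
    # H(C|K): entropy of the class distribution inside each cluster, weighted by cluster size.
    h_c_given_k = sum(entropy(A[:, j]) * A[:, j].sum() / n for j in range(len(clusters)))
    h_k_given_c = sum(entropy(A[i, :]) * A[i, :].sum() / n for i in range(len(classes)))
    hom = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    comp = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if hom + comp == 0:
        return 0.0
    return (1.0 + beta) * hom * comp / (beta * hom + comp)
```

A clustering that matches the reference classes up to a relabeling scores 1.0 regardless of which cluster IDs it uses, which is exactly the property that sidesteps the "problem of matching".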
A Separability Index for Distance-based Clustering and Classification Algorithms
Abstract:
We propose a separability index that quantifies the degree of difficulty in a hard clustering problem under assumptions of a multivariate Gaussian distribution for each cluster. A preliminary index is first defined and several of its properties are explored both theoretically and numerically. Adjustments are then made to this index so that the final refinement is also interpretable in terms of the Adjusted Rand Index between a true grouping and its hypothetical idealized clustering, taken as a surrogate of clustering complexity. Our derived index is used to develop a data-simulation algorithm that generates samples according to a prescribed value of the index. This algorithm is particularly useful for systematically generating datasets with varying degrees of clustering difficulty, which can be used to evaluate the performance of different clustering algorithms. The index is also shown to be useful in providing a summary of the distinctiveness of classes in grouped datasets.