Results 1-10 of 14
An Information-Theoretic External Cluster-Validity Measure
Research Report RJ 10219, IBM, 2001
Cited by 62 (3 self)
In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is the desire to use the measure to characterize the quality or accuracy of clustering algorithms by somehow comparing the clusters they produce with "ground truth" consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. When all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that w...
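The equal-cluster-count case described above, where the measure reduces to the mutual information between cluster labels and class labels, can be sketched as follows. This is a minimal illustration only, not the report's full measure: the function name `mutual_information_bits` is a hypothetical helper, the estimate uses plain empirical frequencies, and the adjustment the abstract mentions for differing numbers of clusters is not covered.

```python
import math
from collections import Counter


def mutual_information_bits(cluster_labels, class_labels):
    """Empirical mutual information (in bits) between two labelings.

    Hypothetical helper for illustration; it covers only the case in
    which the measure coincides with mutual information.
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("both labelings must cover the same objects")
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    cluster_counts = Counter(cluster_labels)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (k, c), n_kc in joint.items():
        # p(k, c) * log2( p(k, c) / (p(k) * p(c)) )
        mi += (n_kc / n) * math.log2(
            n_kc * n / (cluster_counts[k] * class_counts[c])
        )
    return mi


# Clusters that reproduce two balanced classes exactly carry one bit.
perfect = mutual_information_bits([0, 0, 1, 1], ["a", "a", "b", "b"])
# Clusters that are independent of the classes carry no information.
useless = mutual_information_bits([0, 1, 0, 1], ["a", "a", "b", "b"])
```

Higher values mean the cluster labels are better predictors of the class labels, which is the sense of "useful" in the abstract.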
A Survey of Binary Similarity and Distance Measures
Journal of Systemics, Cybernetics and Informatics, 2010
Cited by 9 (0 self)
The binary feature vector is one of the most common representations of patterns, and similarity and distance measures play a critical role in many problems such as clustering and classification. Ever since Jaccard proposed a similarity measure to classify ecological species in 1901, numerous binary similarity and distance measures have been proposed in various fields. Applying appropriate measures results in more accurate data analysis. Notwithstanding, few comprehensive surveys on binary measures have been conducted. Hence we collected 76 binary similarity and distance measures used over the last century and revealed their correlations through the hierarchical clustering technique.
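As a concrete instance of a binary similarity measure, the Jaccard coefficient mentioned above can be sketched as below. The function names are hypothetical, and the value returned when both vectors are all zeros is a convention assumed here, since the ratio is undefined in that case.

```python
def jaccard_similarity(x, y):
    """Jaccard coefficient a / (a + b + c) for two equal-length 0/1 vectors.

    a = positions where both are 1, b = only x is 1, c = only y is 1.
    Positions where both are 0 are ignored.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)
    if a + b + c == 0:
        # Assumed convention: two all-zero vectors get similarity 0.0.
        return 0.0
    return a / (a + b + c)


def jaccard_distance(x, y):
    """Distance counterpart, 1 minus the similarity."""
    return 1.0 - jaccard_similarity(x, y)


# a = 2 (positions 0 and 4), b = 1 (position 1), c = 1 (position 3)
example = jaccard_similarity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # 2 / 4
```

Other measures of this family differ mainly in how they weight the four counts of matching and mismatching positions, including whether joint absences are counted at all.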
On the Index of Dissimilarity for Lack of Fit in Loglinear Models
Cited by 4 (0 self)
The index of dissimilarity, often denoted by Delta, is commonly used, especially in social science and with large datasets, to describe the lack of fit of models for categorical data. In this paper the definition and sampling properties of the index are investigated for general loglinear models. It is argued that in some applications a standardized version of the index is appropriate for interpretation. A simple, approximate variance formula is derived for the index, whether standardized or not. A simple bias reduction formula is also given. The accuracy of these formulae and of confidence intervals based upon them is investigated in a simulation study based on large-scale social mobility data. Key words: bias reduction; dissimilarity index; extended hypergeometric; folded normal; iterative proportional fitting; iterative scaling; stratified sampling.
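For orientation, the usual unstandardized form of the index is half the sum of absolute differences between observed and fitted cell proportions. The sketch below assumes that standard definition; the function name is hypothetical, and the standardized version, variance formula, and bias reduction discussed in the abstract are not reproduced.

```python
def dissimilarity_index(observed, fitted):
    """Index of dissimilarity between observed and fitted cell counts.

    Delta = 0.5 * sum_i |p_i - pi_i|, where p_i and pi_i are the observed
    and fitted cell proportions. It can be read as the share of cases
    that would have to change cells for the model to fit exactly.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same cells")
    n_obs = sum(observed)
    n_fit = sum(fitted)
    if n_obs <= 0 or n_fit <= 0:
        raise ValueError("counts must sum to a positive total")
    return 0.5 * sum(
        abs(o / n_obs - f / n_fit) for o, f in zip(observed, fitted)
    )


# Made-up counts for illustration: 5% of cases are misallocated.
delta = dissimilarity_index([30, 20, 50], [25, 25, 50])
```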
Analysis of Local or Asymmetric Dependencies in Contingency Tables using the Imprecise Dirichlet Model
Zaffalon (Eds.), Proc. 3rd Int. Symp. on Imprecise Probabilities and Their Applications (ISIPTA ’03), Proceedings in Informatics, Vol. 18, Carleton Scientific, 2003
Cited by 1 (0 self)
We consider the statistical problem of analyzing the association between two categorical variables from cross-classified data. The focus is put on measures which enable one to study the dependencies at a local level and to assess whether the data support some more or less strong association model.
ON CHARACTERIZING DEPENDENCE IN JOINT DISTRIBUTIONS
1967
Cited by 1 (0 self)
Ways of characterizing the dependence of one random variable on another (or several others) are investigated. In particular, an index of dependence of X on Y is introduced which (i) always exists, (ii) lies between zero and unity inclusive, (iii) is zero if and only if X and Y are independent, (iv) is unity if X is a function of Y (and only if whenever X has finite variance), (v) may assume every value between zero and unity inclusive by varying the joint distribution but holding the marginal distributions fixed (assuming Y continuously distributed), (vi) is invariant under linear transformation of X and one-to-one transformation of Y, and (vii) equals k/m whenever X and Y are sums of (nondegenerate) independent and identically distributed random variables Z_1, Z_2, ..., X being the sum of the first m Z_i's and Y the sum of the first k Z_i's (m > k). When the correlation ratio exists, its square cannot exceed the dependence index, and when (X,Y) is either bivariate normal or trinomial in distribution then the index equals the square of the correlation coefficient. The index is derived by first introducing and investigating a dependence characteristic, defined as the correlation ratio of exp(itX) on Y as a function of t. A correlation characteristic and index are also introduced. A brief survey of correlation and regression theory for complex-valued random variables is included. (No statistical aspects of dependence are considered).
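Property (vii) can be checked numerically in the one case where the abstract ties the index to a familiar quantity. With Gaussian Z_i, the pair (X, Y) is bivariate normal, so the index equals the squared correlation coefficient, and a short calculation gives Cov(X, Y) = k Var(Z), Var(X) = m Var(Z), Var(Y) = k Var(Z), hence a squared correlation of k/m. The simulation below is only a sketch under that Gaussian assumption: it estimates the squared Pearson correlation, not the dependence index itself, and all function names are hypothetical.

```python
import math
import random


def pearson_correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def squared_correlation_of_partial_sums(m, k, n_samples=20000, seed=0):
    """Estimate corr(X, Y)^2 where X sums m iid normals and Y the first k."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    xs, ys = [], []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        xs.append(sum(z))
        ys.append(sum(z[:k]))
    r = pearson_correlation(xs, ys)
    return r * r


# With m = 4 and k = 1 the estimate should be close to k/m = 0.25.
estimate = squared_correlation_of_partial_sums(m=4, k=1)
```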
AN ANALYSIS FOR COMPOUNDED LOGARITHMIC-EXPONENTIAL-LINEAR FUNCTIONS OF CATEGORICAL DATA
Institute of Statistics Mimeo Series No. 801, February 1972
One area of application which has become increasingly important to statisticians and other researchers is the analysis of categorical data. Often the principal
BYUNG SOO KIM. Studies of Multinomial Mixture Models
1984
(Under the direction of Barry H. Margolin) We investigate certain inferential aspects of mixtures of multinomial distributions, both in nonparametric and parametric contexts. As a nonparametric mixture model we propose a k-population finite mixture of binomial distributions, which can be applied to the analysis of non-iid data generated from a series of toxicological experiments. A necessary and sufficient identifiability condition for the k-population finite mixture of binomials is obtained. The maximum likelihood estimates (MLEs) of the k-population finite mixture of binomials are computed via the EM algorithm (Dempster, Laird and Rubin, 1977), and the asymptotic properties of the MLEs are discussed. The identifiability condition is equivalent to the positive definiteness of the information matrix for the parameters. The MLEs and their sampling distributions, together with the data mentioned above, provide an empirical check of the statistical procedures proposed by Margolin, Kaplan and Zeiger (1981).
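A generic EM iteration for a finite mixture of binomials with a common number of trials can be sketched as below. This is a textbook-style illustration rather than the thesis's procedure: the function names, starting values, fixed iteration count, and the made-up counts are all assumptions, and the identifiability condition and asymptotic results discussed in the abstract are not addressed.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def em_binomial_mixture(counts, n, init_weights, init_probs, iterations=200):
    """Fit mixing weights and success probabilities by EM.

    counts: observed successes per unit, each out of n trials.
    Returns (weights, probs) after a fixed number of iterations.
    """
    weights = list(init_weights)
    probs = list(init_probs)
    k = len(weights)
    for _ in range(iterations):
        # E-step: posterior probability that each unit came from component j.
        resp = []
        for x in counts:
            joint = [weights[j] * binomial_pmf(x, n, probs[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: responsibility-weighted updates of weights and probabilities.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            weights[j] = mass / len(counts)
            if mass > 0.0:
                probs[j] = sum(r[j] * x for r, x in zip(resp, counts)) / (n * mass)
    return weights, probs


# Made-up counts out of n = 10 trials, drawn from two visibly different groups.
data = [1, 2, 1, 0, 2, 1, 8, 9, 7, 8, 9, 10]
weights, probs = em_binomial_mixture(data, 10, [0.5, 0.5], [0.3, 0.7])
```

In practice one would monitor the log-likelihood for convergence and try several starting values, since EM only finds a local maximum.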
AUTOMATIC CLASSIFICATION
In this chapter I shall attempt to present a coherent account of classification in such a way that the principles involved will be sufficiently understood for anyone wishing to use classification techniques in IR to do so without too much difficulty. The emphasis will be
Suggested Citation
Kasprzyk for their helpful comments on earlier drafts of this paper. Clerical assistance was