Results 1 - 10 of 28
A general coefficient of similarity and some of its properties
Biometrics, 1971
Cited by 195 (0 self)
Biometrics is currently published by International Biometric Society. Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at
Similarity Measures for Categorical Data: A Comparative Evaluation
2008
Cited by 43 (3 self)
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.
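To make the setting concrete, here is a minimal Python sketch of one simple categorical similarity (the fraction of matching attributes, often called overlap) used for a nearest-neighbour style outlier score. The function names, the toy records, and the scoring rule (one minus the mean similarity to the k most similar other records) are assumptions chosen for illustration, not the evaluation protocol of the paper.

```python
from typing import List, Sequence


def overlap_similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(a == b for a, b in zip(x, y)) / len(x)


def knn_outlier_scores(records: List[Sequence[str]], k: int = 2) -> List[float]:
    """Hypothetical score: 1 - mean similarity to the k most similar others."""
    if len(records) < 2:
        raise ValueError("need at least two records")
    scores = []
    for i, x in enumerate(records):
        # Similarities to every other record, most similar first.
        sims = sorted(
            (overlap_similarity(x, y) for j, y in enumerate(records) if j != i),
            reverse=True,
        )
        top = sims[:k]
        scores.append(1.0 - sum(top) / len(top))
    return scores


# Toy data (made up): the last record shares no value with the others.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "huge", "square"),
]
scores = knn_outlier_scores(records, k=2)
```

Overlap treats every match as equally informative; data-driven measures differ mainly in how they weight matches and mismatches using value frequencies.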
Iterate: A conceptual clustering algorithm for data mining
IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS, 1998
Cited by 20 (0 self)
The data exploration task can be divided into three interrelated subtasks: (i) feature selection, (ii) discovery, and (iii) interpretation. This paper describes an unsupervised discovery method with biases geared toward partitioning objects into clusters that improve interpretability. The algorithm, ITERATE, employs: (i) a data ordering scheme and (ii) an iterative redistribution operator to produce maximally cohesive and distinct clusters. Cohesion or intraclass similarity is measured in terms of the match between individual objects and their assigned cluster prototype. Distinctness or interclass dissimilarity is measured by an average of the variance of the distribution match between clusters. We demonstrate that interpretability, from a problem solving viewpoint, is addressed by the intra- and interclass measures. Empirical results demonstrate the properties of the discovery algorithm, and its applications to problem solving.
Probabilistic Models for Bacterial Taxonomy
INTERNATIONAL STATISTICAL REVIEW, 2000
Cited by 9 (3 self)
We give a survey of different probabilistic partitioning methods that have been applied to bacterial taxonomy. We introduce a theoretical framework, which makes it possible to treat the various models in a unified way. The key concepts of our approach are prediction and storing of microbiological information in a Bayesian forecasting setting. We show that there is a close connection between classification and probabilistic identification and that, in fact, our approach ties these two concepts together in a coherent way.
Towards fuzzy query-relaxation for RDF
In ESWC, 2012
Cited by 5 (2 self)
In this paper, we argue that query relaxation over RDF data is an important but largely overlooked research topic: the Semantic Web standards allow for answering crisp queries over crisp RDF data, but what of use-cases that require approximate answers for fuzzy queries over crisp data? We introduce a use-case from an EADS project that aims to aggregate intelligence information for police post-incident analysis. Query relaxation is needed to match incomplete descriptions of entities involved in crimes to structured descriptions thereof. We first discuss the use-case, formalise the problem, and survey current literature for possible approaches. We then present a proof-of-concept framework for enabling relaxation of structured entity-lookup queries, evaluating different distance measures for performing relaxation. We argue that beyond our specific scenario, query relaxation is important to many potential use-cases for Semantic Web technologies, and worthy of more attention.
Conceptual Clustering with Numeric-and-Nominal Mixed Data - A New Similarity Based System
in IEEE Transcript on KCE, 1998
Cited by 5 (1 self)
This paper presents a new Similarity Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [13], that gives greater weight to uncommon feature-value matches in similarity computations and makes no assumptions of the underlying distributions of the feature-values, is adopted to define the similarity measure between pairs of objects. An agglomerative algorithm is employed to construct a concept tree, and a simple distinctness heuristic is used to extract a partition of the data. The performance of SBAC has been studied on artificially generated data sets. Results demonstrate the effectiveness of this algorithm in unsupervised discovery tasks. Comparisons with other schemes illustrate the superior performance of the algorithm.
1 Introduction
The widespread use of computers and information technology has made extensive data collection in businesses, manufacturing, an...
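The idea of giving greater weight to uncommon feature-value matches can be sketched for nominal features as follows. This is a simplified, Goodall-style weighting (a match on value v contributes 1 - p(v)^2, where p(v) is the observed relative frequency of v), written as an assumption for illustration; it is not the exact measure used in SBAC, and the function names and toy records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def value_frequencies(records: List[Sequence[str]]) -> List[Dict[str, float]]:
    """Relative frequency of each value, computed separately per attribute."""
    n = len(records)
    n_attrs = len(records[0])
    return [
        {v: c / n for v, c in Counter(r[k] for r in records).items()}
        for k in range(n_attrs)
    ]


def rarity_weighted_similarity(
    x: Sequence[str], y: Sequence[str], freqs: List[Dict[str, float]]
) -> float:
    """Average per-attribute similarity; rare matching values score higher.

    Assumes every value in x and y was seen when freqs was computed.
    """
    total = 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        if a == b:
            total += 1.0 - freqs[k][a] ** 2  # mismatches contribute 0
    return total / len(x)


# Toy data (made up): "common" occurs 3 times, "rare" twice, in attribute 0.
records = [
    ("common", "p"),
    ("common", "q"),
    ("common", "r"),
    ("rare", "s"),
    ("rare", "t"),
]
freqs = value_frequencies(records)
sim_common = rarity_weighted_similarity(records[0], records[1], freqs)
sim_rare = rarity_weighted_similarity(records[3], records[4], freqs)
```

Both pairs agree on exactly one attribute, but the pair sharing the less frequent value receives the larger similarity, which is the behaviour the abstract describes.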
Distance Functions for Categorical and Mixed Variables
Cited by 3 (0 self)
In this paper, we compare three different measures for computing Mahalanobis-type distances between random variables consisting of several categorical dimensions or mixed categorical and numeric dimensions: regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods on two application domains: classification and principal components analysis. We find that the tensor product space distance is impractical with most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages including time and space efficiency, applicability to different contexts, and theoretical neatness.
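As a rough illustration of the regular simplex idea, the sketch below places each level of a categorical variable at a vertex of a regular simplex, so that any two distinct levels are exactly one unit apart. It uses scaled one-hot coordinates and a plain Euclidean distance; the covariance-based Mahalanobis step is omitted, and the function names and example levels are assumptions rather than the paper's implementation.

```python
import math
from typing import List, Sequence


def simplex_encode(value: str, levels: Sequence[str]) -> List[float]:
    """Scaled one-hot code: distinct levels end up at Euclidean distance 1."""
    scale = 1.0 / math.sqrt(2.0)
    return [scale if value == level else 0.0 for level in levels]


def encode_record(
    record: Sequence[str], levels_per_attr: Sequence[Sequence[str]]
) -> List[float]:
    """Concatenate the simplex codes of all categorical attributes."""
    coords: List[float] = []
    for value, levels in zip(record, levels_per_attr):
        coords.extend(simplex_encode(value, levels))
    return coords


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical levels for two categorical attributes.
levels = [["red", "green", "blue"], ["small", "large"]]
a = encode_record(("red", "small"), levels)
b = encode_record(("green", "small"), levels)
c = encode_record(("blue", "large"), levels)
d_one_diff = euclidean(a, b)   # records differ in one attribute
d_two_diff = euclidean(a, c)   # records differ in both attributes
```

Once records are numeric vectors, numeric attributes could be appended as extra coordinates and a covariance matrix estimated over the combined space, which is where a Mahalanobis-type distance would come in.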
Extending Iterate Conceptual Clustering Scheme In Dealing With Numeric Data
1995
Cited by 3 (0 self)
[Figure 1: The Key Steps in Conceptual Clustering Systems]
... grouping the data objects into clusters or groups based on the similarity of properties among the objects. The goal is to derive more general concepts that describe the problem solving task. The task of interpretation involves determining whether the induced concepts are useful for the problem solving tasks that the user is interested in. This task involves the examination of the intentional description of a class in the context of background knowledge about the domain.
Overview of the Clustering Methods
Traditional approaches to cluster analysis (numerical taxonomy) represent the objects to be clustered as points in a multidimensional metric space and adopt distance metrics, such as Euclidean and Mahalanobis measures, to define dissimilarity between objects. Cluster analysis methods take on one of two different forms: 1. parametric methods: they assume t...
Similarity Measures for Categorical Data - A Comparative Study
2007
Cited by 1 (1 self)
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.