The NVI Clustering Evaluation Measure
Cited by 4 (0 self)
Clustering is crucial for many NLP tasks and applications. However, evaluating the results of a clustering algorithm is hard. In this paper we focus on the evaluation setting in which a gold standard solution is available. We discuss two existing information theory based measures, V and VI, and show that they are both hard to use when comparing the performance of different algorithms and different datasets. The V measure favors solutions having a large number of clusters, while the range of scores given by VI depends on the size of the dataset. We present a new measure, NVI, which normalizes VI to address the latter problem. We demonstrate the superiority of NVI in a large experiment involving an important NLP application, grammar induction, using real corpus data in English, German and Chinese.
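As a rough illustration of the normalization idea, the sketch below computes VI as H(C|K) + H(K|C) and a normalized variant that divides by the entropy of the gold classes, H(C). The abstract only says that NVI normalizes VI; dividing by H(C) (and falling back to H(K) when H(C) is zero) is an assumption made here, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(x, y):
    """H(X | Y), computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def vi(gold, pred):
    """Variation of Information: H(C | K) + H(K | C)."""
    return conditional_entropy(gold, pred) + conditional_entropy(pred, gold)


def nvi(gold, pred):
    """VI divided by the gold-class entropy H(C).

    Assumption: this normalizer, and the fallback to H(K) when H(C) == 0,
    are not stated in the abstract above.
    """
    h_gold = entropy(gold)
    if h_gold == 0:
        return entropy(pred)
    return vi(gold, pred) / h_gold


# Worst-case style example: every item is its own gold class,
# and the induced clustering puts everything into one cluster.
for n in (4, 16):
    gold = list(range(n))
    pred = [0] * n
    print(n, round(vi(gold, pred), 4), round(nvi(gold, pred), 4))
```

In this toy setting VI grows with the number of items (log 4 versus log 16), while the normalized score stays at 1 for both sizes, which is the kind of dataset-size dependence the abstract describes.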
Type Level Clustering Evaluation: New Measures and a POS Induction Case Study
Cited by 3 (0 self)
Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the token statistics of a particular reference corpus. Type level evaluation casts light on the merits of algorithms, and for some applications is a more natural measure of the algorithm’s quality. We propose new type level evaluation measures that, contrary to existing measures, are applicable when items are polysemous, the common case in NLP. We demonstrate the benefits of our measures using a detailed case study, POS induction. We experiment with seven leading algorithms, obtaining useful insights and showing that token and type level measures can weakly or even negatively correlate, which underscores the fact that these two approaches reveal different aspects of clustering quality.
On Using Class-Labels in Evaluation of Clusterings
Cited by 1 (1 self)
Although clustering has been studied for several decades, the fundamental problem of a valid evaluation has not yet been solved. The sound evaluation of clustering results, in particular on real data, is inherently difficult. In the literature, new clustering algorithms and their results are often externally evaluated with respect to an existing class labeling. These class-labels, however, may not be adequate for the structure of the data or the evaluated cluster model. Here, we survey the literature of different related research areas that have observed this problem. We discuss common “defects” that clustering algorithms exhibit w.r.t. this evaluation, and show them on several real-world data sets of different domains, along with a discussion of why the detected clusters do not indicate a bad performance of the algorithm but are valid and useful results. A useful alternative evaluation method requires more extensive data labeling than the commonly used class labels, or it needs a combination of information measures to take subgroups, supergroups, and overlapping sets of traditional classes into account. Finally, we discuss an evaluation scenario that regards the possible existence of several complementary sets of labels and hope to stimulate the discussion among different sub-communities (like ensemble-clustering, subspace-clustering, multi-label classification, hierarchical classification or hierarchical clustering, and multiview-clustering or alternative clustering) regarding requirements on enhanced evaluation methods.
Rough diamonds in natural language learning
Invited keynote (10pp), Proc. Conference on Rough Sets and Knowledge Technology, Springer Lecture Notes in Computer Science, 2009
Cited by 1 (0 self)
Machine Learning of Natural Language provides a rich environment for exploring supervised and unsupervised learning techniques including soft clustering and rough sets. This keynote presentation will trace the course of our Natural Language Learning as well as some quite intriguing spin-off applications. The focus of the paper will be learning, by both human and computer, reinterpreting our work of the last 30 years [1-12,20-24] in terms of recent developments in Rough Sets.
Clustering by a Genetic Algorithm with Biased Mutation Operator
In this paper we propose a genetic algorithm that partitions data into a given number of clusters. The algorithm can use any cluster validity function as the fitness function. Cluster validity is used as a criterion for cross-over operations. The cluster assignment for each point is accompanied by a temperature, and points with low confidence are preferentially mutated. We present results applying this genetic algorithm to several UCI machine learning data sets and using several objective cluster validity functions for optimization. It is shown that, given an appropriate criterion function, the algorithm is able to converge on good cluster partitions within a few generations. Our main contributions are: 1. to present a genetic algorithm that is fast and able to converge on meaningful clusters for real-world data sets, 2. to define and compare several cluster validity criteria.
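The biased mutation idea in this abstract could look roughly like the sketch below. It is not the authors' operator: the abstract does not say how temperature maps to mutation probability, so the sketch assumes temperatures in [0, 1], treats a high temperature as low confidence, and uses a hypothetical `rate` parameter to scale the per-point mutation probability.

```python
import random


def biased_mutation(assignments, temperatures, n_clusters, rate=0.1, rng=None):
    """Return a mutated copy of a cluster assignment vector.

    Assumptions (not from the abstract): temperatures lie in [0, 1],
    a higher temperature means lower confidence, and each point is
    mutated with probability rate * temperature.
    """
    rng = rng or random.Random()
    mutated = list(assignments)
    for i, temp in enumerate(temperatures):
        if rng.random() < rate * temp:
            # Move the point to a different cluster chosen uniformly at random.
            choices = [c for c in range(n_clusters) if c != mutated[i]]
            if choices:
                mutated[i] = rng.choice(choices)
    return mutated


# Points 0 and 1 are "confident" (temperature 0), point 2 is not.
print(biased_mutation([0, 0, 0], [0.0, 0.0, 1.0], n_clusters=2, rate=1.0))
```

With a temperature of 0 a point is never mutated under these assumptions, so mutation effort concentrates on the points whose assignment is least certain.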
The Problem with Kappa
It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation, by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison
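The sensitivity to skew discussed in this abstract can be illustrated with a small sketch for the binary case. It assumes that "Powers Kappa" corresponds to Informedness (recall plus inverse recall minus 1), which the abstract itself does not state, and it models a change of skew simply as a change of prevalence with fixed per-class recall. The function names and example numbers are hypothetical, and the sketch does not attempt to reproduce the paper's specific results.

```python
def contingency(tpr, tnr, prevalence, n=1000.0):
    """Build expected 2x2 counts (tp, fp, fn, tn) from per-class recall and prevalence."""
    pos = prevalence * n
    neg = n - pos
    tp = tpr * pos
    fn = pos - tp
    tn = tnr * neg
    fp = neg - tn
    return tp, fp, fn, tn


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: (accuracy - chance agreement) / (1 - chance agreement)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Chance agreement from the product of predicted and actual marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (accuracy - expected) / (1 - expected)


def informedness(tp, fp, fn, tn):
    """Recall + inverse recall - 1 (assumed here to be the 'Powers Kappa')."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


# Same classifier behaviour (tpr, tnr), two different prevalences.
for prevalence in (0.2, 0.8):
    table = contingency(tpr=0.9, tnr=0.7, prevalence=prevalence)
    print(prevalence, round(cohen_kappa(*table), 3), round(informedness(*table), 3))
```

Under these assumptions Informedness is the same at both prevalences because it depends only on the per-class recalls, whereas Cohen's kappa shifts with the class distribution.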

