Results 1–10 of 20
Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction
Cited by 19 (8 self)
Web search result clustering aims to facilitate information search on the Web. Rather than presenting the results of a query as a flat list, the results are grouped on the basis of their similarity and subsequently shown to the user as a list of (possibly labeled) clusters. Each cluster is supposed to represent a different meaning of the input query, thereby accounting for language ambiguity, i.e., polysemy. However, Web clustering methods typically rely on some shallow notion of textual similarity between search result snippets. As a result, snippets with no words in common tend to be clustered separately even if they share the same meaning, whereas snippets with words in common may be grouped together even if they refer to different meanings of the input query. In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). The key to our approach is to first acquire the senses (i.e., meanings) of an ambiguous query and then cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach outperforms both Web clustering and search engines.
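The assignment step described above (cluster snippets by similarity to induced senses rather than to each other) can be sketched in a few lines. This is a minimal illustration, not the authors' system: the hand-built sense inventories in `senses` and the bag-of-words cosine similarity are assumptions standing in for the induced senses and the paper's semantic similarity measure.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_by_sense(snippets, senses):
    """Assign each snippet to the induced sense it is most similar to.
    `senses` maps a sense label to a bag of words for that meaning."""
    clusters = {label: [] for label in senses}
    for text in snippets:
        vec = Counter(text.lower().split())
        best = max(senses, key=lambda s: cosine(vec, Counter(senses[s])))
        clusters[best].append(text)
    return clusters

# Toy sense inventory for the ambiguous query "jaguar" (illustrative only).
senses = {
    "animal": ["jaguar", "cat", "wildlife", "habitat", "species"],
    "car": ["jaguar", "car", "engine", "luxury", "vehicle"],
}
snippets = [
    "the jaguar is a large cat native to the americas",
    "jaguar unveiled a new luxury car with a powerful engine",
]
print(cluster_by_sense(snippets, senses))
```

Note how the two snippets share the word "jaguar" yet land in different clusters, which is exactly the failure mode of shallow snippet-to-snippet similarity that sense-based assignment avoids.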
Classifying clustering schemes
 Foundations of Computational Mathematics
Document Clustering Evaluation: Divergence from a Random Baseline
Cited by 4 (3 self)
Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are doing useful work by preventing ineffective clusterings, which provide no useful result, from receiving high scores. These concepts are defined and analysed using both intrinsic and extrinsic approaches to the evaluation of document cluster quality, including the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence-from-a-random-baseline approach is able to differentiate the ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline reveals a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation; this paper describes its use in the context of document clustering evaluation.
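The idea generalises to any quality measure: score the clustering, then subtract the mean score of random clusterings that keep the same cluster sizes. The sketch below applies it to purity; the choice of purity and the number of random trials are assumptions for illustration, not the paper's exact setup.

```python
import random

def purity(labels, clusters):
    """Fraction of points in the majority category of their cluster.
    `clusters` is a list of lists of point indices into `labels`."""
    correct = 0
    for members in clusters:
        counts = {}
        for i in members:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        correct += max(counts.values())
    return correct / len(labels)

def divergence_from_random(labels, clusters, trials=200, seed=0):
    """Score minus the mean score of random clusterings with the same
    cluster sizes, so a clustering doing no useful work scores ~0."""
    rng = random.Random(seed)
    idx = [i for members in clusters for i in members]
    sizes = [len(m) for m in clusters]
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(idx)
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(idx[start:start + s])
            start += s
        baseline += purity(labels, shuffled)
    return purity(labels, clusters) - baseline / trials
```

The all-singletons clustering illustrates the point: its raw purity is a perfect 1.0, but so is the purity of any random clustering with singleton sizes, so its divergence from the random baseline is exactly zero.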
Model-based clustering for multivariate functional data
, 2012
Cited by 4 (0 self)
This paper proposes the first model-based clustering algorithm for multivariate functional data. After introducing multivariate functional principal components analysis (MFPCA), a parametric mixture model, based on the assumption of normality of the principal components, is defined and estimated by an EM-like algorithm. The main advantage of the proposed model is its ability to take into account the dependence among curves. Results on simulated and real datasets show the efficiency of the proposed method.
An Empirical Study of Cluster Evaluation Metrics using Flow Cytometry Data
Cited by 3 (1 self)
A wide range of abstract characteristics of partitions have been proposed for cluster evaluation. We empirically evaluated the performance of these metrics for flow cytometry data and found that the set-matching metrics perform closest to human.
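The set-matching family the abstract refers to can be illustrated with a toy instance: greedily match predicted clusters to true classes one-to-one by overlap, then score the fraction of points covered by the matched overlaps. This is a minimal sketch of the general idea, not one of the specific metrics evaluated in the paper.

```python
def set_matching_accuracy(true_sets, pred_sets):
    """Greedy one-to-one matching between true classes and predicted
    clusters; the score is the fraction of points in matched overlaps."""
    n = sum(len(s) for s in true_sets)
    # All (overlap, class, cluster) pairs, largest overlaps first.
    pairs = sorted(
        ((len(t & p), i, j) for i, t in enumerate(true_sets)
                            for j, p in enumerate(pred_sets)),
        reverse=True)
    used_t, used_p, matched = set(), set(), 0
    for overlap, i, j in pairs:
        if i not in used_t and j not in used_p:
            used_t.add(i)
            used_p.add(j)
            matched += overlap
    return matched / n
```

Unlike pair-counting or information-theoretic measures, the score is read directly off a one-to-one alignment of clusters to classes, which is arguably why such measures track human judgement of a partition closely.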
Supervised Clustering
Cited by 3 (1 self)
Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher's response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
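The interaction loop in such teacher models can be sketched as follows: the learner proposes a clustering, the teacher only says whether each proposed cluster is acceptable, and the learner refines until the teacher accepts everything. This is a deliberately simplified sketch (the teacher oracle `is_pure` and the split-in-half refinement are assumptions for illustration), not the paper's improved generic algorithm.

```python
def interactive_cluster(points, is_pure):
    """Query-based clustering sketch: propose clusters, query a teacher
    oracle that only answers pure / not pure per cluster, and split any
    rejected cluster in half until every cluster is accepted.
    Returns the final clustering and the number of teacher queries."""
    proposal, queries = [list(points)], 0
    while True:
        for k, cluster in enumerate(proposal):
            queries += 1
            if not is_pure(cluster) and len(cluster) > 1:
                mid = len(cluster) // 2
                proposal[k:k + 1] = [cluster[:mid], cluster[mid:]]
                break  # re-scan the updated proposal
        else:
            return proposal, queries  # teacher accepted every cluster

# Simulated teacher that knows a hidden target labelling.
target = {i: (0 if i < 3 else 1) for i in range(6)}
clusters, q = interactive_cluster(
    list(range(6)), lambda c: len({target[p] for p in c}) == 1)
print(clusters, q)
```

Query-efficiency in the paper's sense means bounding `q`; the naive split-in-half rule above gives no such guarantee and is only meant to show the shape of the interaction.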
A Comparative Study of Various Clustering Algorithms
 in Data Mining", International Journal of Engineering Research and Applications (IJERA)
, 2012
Cited by 3 (0 self)
Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. This paper reviews six types of clustering techniques: k-Means clustering, hierarchical clustering, DBSCAN clustering, density-based clustering, OPTICS, and the EM algorithm. These clustering techniques are implemented and analysed using the clustering tool WEKA, and the performance of the six techniques is presented and compared. Index Terms: data clustering, k-Means clustering,
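Of the six techniques reviewed, k-Means is the simplest to sketch. Below is a minimal Lloyd's-algorithm implementation in plain Python, not the WEKA implementation the paper analyses; the 2-D point format and the random initial-centroid sampling are assumptions made for this sketch.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points: assign each point to its
    nearest centroid (E-step), then recompute centroids (M-step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centroids[j][0]) ** 2
                                      + (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean (keep the old
        # centroid if a cluster happens to be empty).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(pts, 2)
print(sorted(cents))
```

On well-separated data like the two blobs above, the centroids converge to the blob means regardless of initialisation; the survey's other techniques (density-based, hierarchical, EM) differ mainly in how they replace this assign-and-average loop.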
Maximum volume clustering
 In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)
, 2011
Cited by 2 (1 self)
The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a clustering model based on the large volume principle called maximum volume clustering (MVC), and propose two algorithms to solve it approximately: a soft-label and a hard-label MVC algorithm based on sequential quadratic programming and semidefinite programming, respectively. Our MVC model includes spectral clustering and maximum margin clustering as special cases, and is substantially more general. We also establish finite-sample stability and an error bound for the soft-label MVC method. Experiments show that the proposed MVC approach compares favorably with state-of-the-art clustering algorithms.
Maximum volume clustering: A new discriminative clustering approach
 Journal of Machine Learning Research
, 2013
Cited by 2 (2 self)
The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: a soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semidefinite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. Firstly, the optimization involved in hard-label MVC is convex and, under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos ...
Which Distance Metric is Right: An Evolutionary K-Means View
Cited by 1 (0 self)
It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as the optimization of a criterion function defined over one distance metric. While many distance metrics have been developed, it is not clear how these distance metrics impact the clustering/optimization process. To that end, in this paper we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing a common order-preserving property, we first show that K-means produces exactly the same cluster assignment for these metrics during the E-step. Next, by both theoretical and empirical studies, we prove that the cluster centroid is a good approximation of the respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables inspection of arbitrary distance metrics, but can also be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of distance metrics on the optimization process of K-means.
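The order-preserving property the abstract describes can be demonstrated in a few lines: two distances that are monotone transforms of cosine similarity rank the centers identically, so the E-step assignment is the same under both. The particular pair chosen here (cosine distance and angular distance) is an illustrative assumption, not necessarily the paper's exact metric set.

```python
from math import sqrt, acos

def cosine_sim(x, y):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

# Two cosine-based distances that are monotone transforms of each other.
d_cos = lambda x, y: 1 - cosine_sim(x, y)
d_angle = lambda x, y: acos(max(-1.0, min(1.0, cosine_sim(x, y))))

def e_step(points, centers, dist):
    """K-means E-step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

points = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]
centers = [(1, 0), (0, 1)]
# Same assignment under both metrics: K-means cannot tell them apart.
assert e_step(points, centers, d_cos) == e_step(points, centers, d_angle)
```

This is exactly why the authors move to an evolutionary framework: since the E-step (and, approximately, the M-step) is invariant across such metrics, differentiating them requires optimizing outside the standard K-means loop.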