Using the Triangle Inequality to Accelerate k-Means
, 2003
Cited by 98 (1 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k >= 20 it is many times faster than the best previously known accelerated k-means method.
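To illustrate how a triangle-inequality bound can skip distance calculations without changing the result, here is a minimal sketch of a single assignment step. It relies on one standard consequence of the triangle inequality: if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so center c' cannot be strictly closer to x than c. The function names and the operation counter are hypothetical, and the sketch does not reproduce the paper's full algorithm, which also maintains lower and upper bounds across iterations.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality proves cannot be strictly closer.

    Returns (labels, computed), where computed counts the point-to-center
    distances that were actually evaluated.
    """
    k = len(centers)
    # Center-to-center distances: k*k values, computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels = []
    computed = 0
    for x in points:
        best = 0
        best_d = dist(x, centers[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best),
            # so the distance to c_j does not need to be computed.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = dist(x, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def assign_brute_force(points, centers):
    """Reference assignment that evaluates every point-to-center distance."""
    return [min(range(len(centers)), key=lambda j: dist(x, centers[j]))
            for x in points]
```

On well-separated data the pruned pass returns the same labels as the brute-force pass while evaluating fewer than len(points) * len(centers) distances. The saving depends on how far apart the centers are relative to the points' distances to their own centers.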
Kboost: A Scalable Algorithm for High Quality Clustering of Microarray Gene Expression Data TR IIT2007015, Istituto di Informatica e Telematica del CNR
, 2007
Cited by 7 (3 self)
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data make the poorly scalable, traditional clustering methods unsuitable.
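For orientation, the plain furthest-point-first heuristic for k-center clustering can be sketched as follows. This is a minimal version that assumes Euclidean points and k no larger than the number of points; it omits the filtering scheme the abstract describes, and the function names and the `first` parameter are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def furthest_point_first(points, k, first=0):
    """Pick k centers greedily: each new center is the point furthest from
    the centers chosen so far.

    Returns (centers, owner), where centers holds indices into points and
    owner[i] is the position in centers of the center closest to point i.
    Assumes 1 <= k <= len(points).
    """
    centers = [first]
    # nearest[i] is the distance from point i to its closest chosen center.
    nearest = [dist(p, points[first]) for p in points]
    owner = [0] * len(points)
    while len(centers) < k:
        # The next center is the point currently worst served.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        c = len(centers) - 1
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < nearest[i]:
                nearest[i] = d
                owner[i] = c
    return centers, owner
```

Each round recomputes the distance from every point to the new center, so the sketch costs on the order of n * k distance evaluations. That inner loop is the natural place for a triangle-inequality filter to skip work.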
Centroidal Voronoi tessellation algorithms for image compression and segmentation
, 2006
Cited by 6 (5 self)
Centroidal Voronoi tessellations (CVT’s) are special Voronoi tessellations for which the generators of the tessellation are also the centers of mass (or means) of the Voronoi cells or clusters. CVT’s have been found to be useful in many disparate and diverse settings. In this paper, CVT-based algorithms are developed for image compression, image segmentation, and multichannel image restoration applications. In the image processing context and in its simplest form, the CVT-based methodology reduces to the well-known k-means clustering technique. However, by viewing the latter within the CVT context, very useful generalizations and improvements can be easily made. Several such generalizations are exploited in this paper including the incorporation of cluster-dependent weights, the incorporation of averaging techniques to treat noisy images, extensions to treat multichannel data, and combinations of the aforementioned. In each case, examples are provided to illustrate the efficiency, flexibility, and effectiveness of CVT-based image processing methodologies.
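The defining property, that each generator coincides with the mean of its own Voronoi cell, can be illustrated with a plain Lloyd-style fixed-point iteration on a finite point set, which is the k-means special case the abstract mentions. This is a minimal sketch with hypothetical function names; it includes none of the paper's generalizations (weights, averaging, multichannel data).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_step(points, generators):
    """One iteration: build the discrete Voronoi cells, then move each
    generator to the mean of its cell (empty cells keep their generator)."""
    cells = [[] for _ in generators]
    for p in points:
        j = min(range(len(generators)), key=lambda j: sq_dist(p, generators[j]))
        cells[j].append(p)
    updated = []
    for g, cell in zip(generators, cells):
        if cell:
            dim = len(g)
            updated.append(tuple(sum(p[d] for p in cell) / len(cell)
                                 for d in range(dim)))
        else:
            updated.append(tuple(g))
    return updated


def centroidal_generators(points, generators, tol=1e-12, max_iter=100):
    """Iterate until no generator moves by more than tol (squared distance),
    or until max_iter steps have been taken."""
    gens = [tuple(g) for g in generators]
    for _ in range(max_iter):
        updated = lloyd_step(points, gens)
        shift = max(sq_dist(a, b) for a, b in zip(gens, updated))
        gens = updated
        if shift <= tol:
            break
    return gens
```

At the returned generators, another `lloyd_step` moves no generator by more than the tolerance, which is the discrete analogue of the centroidal condition. In an image setting the points could be, for example, pixel gray levels given as 1-tuples.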
VISTO: VIsual STOryboard for Web Video Browsing
Cited by 1 (0 self)
Web video browsing is rapidly becoming a very popular activity on the Web, making the production of a concise video content representation a real need. Currently, static video summary techniques can be used for this purpose. Unfortunately, they require long processing times, and hence all the summaries are produced in advance without any user customization. With an increasing number of videos and a highly heterogeneous user population, this is a burden. In this paper we propose VISTO, a summarization technique that produces customized on-the-fly video storyboards. The mechanism uses a fast clustering algorithm that selects the most representative frames using their HSV color distribution and allows users to select the storyboard length and the processing time. An objective and subjective evaluation shows that the storyboards are produced with good quality and in a time that allows on-the-fly usage.
A Scalable Algorithm for High-Quality Clustering of Web Snippets
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data make the poorly scalable, traditional clustering methods unsuitable.
as a Data Mining Tool for Environmental Applications
The authors have applied multivariate cluster analysis to a variety of environmental science domains, including ecological regionalization; environmental monitoring network design; analysis of satellite, airborne, and ground-based remote sensing; and climate model-model and model-measurement intercomparison. The clustering methodology employs a k-means statistical clustering algorithm that has been implemented in a highly scalable, parallel high performance computing (HPC) application. Because of its efficiency and use of HPC platforms, the clustering code may be applied as a data mining tool to analyze and compare very large data sets of high dimensionality, such as very long or high frequency/resolution time series measurements or model output. The method was originally applied across geographic space and called Multivariate Geographic Clustering (MGC). Now applied across space and through time, the environmental data mining method is called Multivariate Spatio-Temporal Clustering (MSTC). Described here are the clustering algorithm, recent code improvements that significantly reduce the time-to-solution, and a new parallel principal components analysis (PCA) tool that can analyze very large data sets. Finally, a sampling of the authors’ applications of MGC and MSTC to problems in the environmental
Using the Triangle Inequality to Accelerate k-Means
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k >= 20 it is many times faster than the best previously known accelerated k-means method.
Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats
We investigate methods for geospatiotemporal data mining of multiyear land surface phenology data (250 m2 Normalized Difference Vegetation Index (NDVI) values derived from the Moderate Resolution Imaging Spectrometer (MODIS) in this study) for the conterminous United States (CONUS) as part of an early warning system for detecting threats to forest ecosystems. The approaches explored here are based on k-means cluster analysis of this massive data set, which provides a basis for defining the bounds of the expected or “normal” phenological patterns that indicate healthy vegetation at a given geographic location. We briefly describe the computational approaches we have used to make cluster analysis of such massive data sets feasible, describe approaches we have explored for distinguishing between normal and abnormal phenology, and present some examples in which we have applied these approaches to identify various forest disturbances in the CONUS. Keywords: phenology, MODIS, NDVI, remote sensing, k-means clustering, data mining, anomaly detection, high performance computing. 1. The Forest Incidence Recognition and State Tracking System (FIRST). Early identification of forested areas under attack from insects, disease, or other agents can enable timely response to protect forest ecosystems from long-term or irreversible damage. Unfortunately, given the sheer size of the United States and limited resources of agencies such as the USDA Forest Service to conduct aerial surveys and ground-based inspections, many threats go unnoticed until a great deal of damage has already been done. To improve threat detection,
Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets
Identification of geographic ecoregions has long been of interest to environmental scientists and ecologists for identifying regions of similar ecological and environmental conditions. Such classifications are important for predicting suitable species ranges, for stratification of ecological samples, and to help prioritize habitat preservation and remediation efforts. Hargrove and Hoffman [1, 2] have developed geographical spatio-temporal clustering algorithms and codes and have successfully applied them to a variety of environmental science domains, including ecological regionalization; environmental monitoring network design; analysis of satellite, airborne, and ground-based remote sensing; and climate model-model and model-measurement intercomparison. With the advances in state-of-the-art satellite remote sensing and climate models, observations and model outputs are available at increasingly high spatial and temporal resolutions. Long time series of these high resolution datasets are extremely large in size and growing. Analysis and knowledge extraction from these large datasets are not just algorithmic and ecological problems, but also pose a complex computational problem. This paper focuses on the development of a massively parallel multivariate geographical spatio-temporal clustering code for analysis of very large datasets using tens of thousands of processors on one of the fastest supercomputers in the world. Keywords: ecoregionalization, k-means clustering, data mining, high performance computing
DOI: 10.1007/s10851-005-3620-4 Centroidal Voronoi Tessellation Algorithms for Image Compression, Segmentation, and Multichannel Restoration
, 2006
Centroidal Voronoi tessellations (CVT’s) are special Voronoi tessellations for which the generators of the tessellation are also the centers of mass (or means) of the Voronoi cells or clusters. CVT’s have been found to be useful in many disparate and diverse settings. In this paper, CVT-based algorithms are developed for image compression, image segmentation, and multichannel image restoration applications. In the image processing context and in its simplest form, the CVT-based methodology reduces to the well-known k-means clustering technique. However, by viewing the latter within the CVT context, very useful generalizations and improvements can be easily made. Several such generalizations are exploited in this paper including the incorporation of cluster-dependent weights, the incorporation of averaging techniques to treat noisy images, extensions to treat multichannel data, and combinations of the aforementioned. In each case, examples are provided to illustrate the efficiency, flexibility, and