Results 1–10 of 18
Data Clustering: 50 Years Beyond K-Means
2008
Abstract

Cited by 274 (6 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
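Since the survey centers on it, the K-means loop is worth seeing in miniature. The sketch below is a toy pure-Python version on 1-D data, assuming initial centroids are given; random restarts, tolerance checks, and empty-cluster handling are omitted, and it is not any particular published implementation.

```python
# Minimal k-means sketch: alternate the assignment step and the
# centroid-update step until the centroids stop changing.

def kmeans(points, centroids, max_iter=100):
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p - c) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated 1-D groups recover their group means.
pts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans(pts, [0.0, 8.0])
print(sorted(centers))  # -> [2.0, 11.0]
```

With well-separated data the loop converges in two iterations; on harder data the result depends on the initial centroids, which is one reason the ill-posedness discussed above matters in practice.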
Semi-supervised distance metric learning for collaborative image retrieval
Proc. IEEE Conf. on CVPR, 2008
Abstract

Cited by 35 (13 self)
Typical content-based image retrieval (CBIR) solutions with a regular Euclidean metric usually cannot achieve satisfactory performance due to the semantic gap challenge. Hence, relevance feedback has been adopted as a promising approach to improve search performance. In this paper, we propose a novel idea of learning with historical relevance feedback log data, and adopt a new paradigm called “Collaborative Image Retrieval” (CIR). To effectively explore the log data, we propose a novel semi-supervised distance metric learning technique, called “Laplacian Regularized Metric Learning” (LRML), for learning robust distance metrics for CIR. Different from previous methods, the proposed LRML method integrates both log data and unlabeled data information through an effective graph regularization framework. We show that reliable metrics can be learned from real log data even when they may be noisy and limited at the beginning stage of a CIR system. We conducted extensive evaluation to compare the proposed method with a large number of competing methods, including 2 standard metrics, 3 unsupervised metrics, and 4 supervised metrics with side information.
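Metric learning methods of this family replace the plain Euclidean distance with a learned Mahalanobis form d_M(x, y) = sqrt((x − y)ᵀ M (x − y)). The sketch below only illustrates that form; the matrix M here is hand-picked for the example, standing in for whatever matrix a method such as LRML would actually learn.

```python
# Mahalanobis distance: d_M(x, y) = sqrt((x - y)^T M (x - y)).
# M below is a hypothetical hand-chosen matrix, not a learned metric.
import math

def mahalanobis(x, y, M):
    d = [xi - yi for xi, yi in zip(x, y)]
    # Quadratic form (x - y)^T M (x - y) for a full matrix M.
    return math.sqrt(sum(d[i] * M[i][j] * d[j]
                         for i in range(len(d)) for j in range(len(d))))

x, y = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to Euclidean distance
weighted = [[4.0, 0.0], [0.0, 1.0]]   # first feature counts more
print(mahalanobis(x, y, identity))    # sqrt(2)
print(mahalanobis(x, y, weighted))    # sqrt(5)
```

Learning amounts to choosing M (positive semi-definite) so that semantically similar images end up close under d_M, which is exactly the gap a fixed Euclidean metric cannot bridge.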
Learning Bregman distance functions for semi-supervised clustering
IEEE Transactions on Knowledge and Data Engineering, 2012
Abstract

Cited by 10 (0 self)
Abstract—Learning distance functions with side information plays a key role in many data mining applications. Conventional distance metric learning approaches often assume that the target distance function is represented in some form of Mahalanobis distance. These approaches usually work well when the data are of low dimensionality, but often become computationally expensive or even infeasible when handling high-dimensional data. In this paper, we propose a novel scheme for learning nonlinear distance functions with side information. It aims to learn a Bregman distance function using a nonparametric approach that is similar to Support Vector Machines. We emphasize that the proposed scheme is more general than the conventional approach for distance metric learning, and is able to handle high-dimensional data efficiently. We verify the efficacy of the proposed distance learning method with extensive experiments on semi-supervised clustering. The comparison with state-of-the-art approaches for learning distance functions with side information reveals clear advantages of the proposed technique. Index Terms—Bregman distance, distance functions, metric learning, convex functions.
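The Bregman distance underlying this line of work is generated by a convex function φ as D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩. The sketch below only evaluates that definition for two textbook choices of φ; it does not reproduce the paper's learning procedure.

```python
# Bregman divergence generated by a convex function phi:
#   D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>
# phi = ||v||^2 recovers squared Euclidean distance; phi = sum v_i log v_i
# gives the KL divergence on probability vectors.
import math

def bregman(x, y, phi, grad_phi):
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

sq = lambda v: sum(vi * vi for vi in v)            # phi(v) = ||v||^2
sq_grad = lambda v: [2 * vi for vi in v]

negent = lambda v: sum(vi * math.log(vi) for vi in v)   # negative entropy
negent_grad = lambda v: [math.log(vi) + 1 for vi in v]

x, y = [0.5, 0.5], [0.9, 0.1]
print(bregman(x, y, sq, sq_grad))          # equals ||x - y||^2 = 0.32
print(bregman(x, y, negent, negent_grad))  # equals KL(x || y)
```

Different convex generators thus yield very different geometries from the same formula, which is what makes the family strictly more general than a fixed Mahalanobis metric.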
Learning from Noisy Side Information by Generalized Maximum Entropy Model
Abstract

Cited by 8 (4 self)
We consider the problem of learning from noisy side information in the form of pairwise constraints. Although many algorithms have been developed to learn from side information, most of them assume perfect pairwise constraints. Given that pairwise constraints are often extracted from data sources such as paper citations, they tend to be noisy and inaccurate. In this paper, we introduce a generalization of the maximum entropy model and propose a framework for learning from noisy side information based on the generalized maximum entropy model. The theoretical analysis shows that, under certain assumptions, the classification model trained from the noisy side information can be very close to the one trained from the perfect side information. Extensive empirical studies verify the effectiveness of the proposed framework.
Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering
Abstract

Cited by 8 (1 self)
Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high-dimensional data because the size of the metric grows quadratically with the dimensionality; (ii) they assume a fixed metric for the entire input space and therefore are unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a nonparametric approach that is similar to support vector machines. The proposed scheme avoids the assumption of a fixed metric by implicitly deriving a local distance from the Hessian matrix of the convex function that is used to generate the Bregman distance function. We also present an efficient learning algorithm for the proposed scheme for distance function learning. Extensive experiments with semi-supervised clustering show that the proposed technique (i) outperforms the state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high-dimensional data.
Constraint Projections for Ensemble Learning
Abstract

Cited by 5 (1 self)
It is well-known that diversity among base classifiers is crucial for constructing a strong ensemble. Most existing ensemble methods obtain diverse individual learners through resampling the instances or features. In this paper, we propose an alternative way of ensemble construction by resampling pairwise constraints that specify whether a pair of instances belongs to the same class or not. Using pairwise constraints for ensemble construction is challenging because it remains unknown how to influence the base classifiers with the sampled pairwise constraints. We solve this problem with a two-step process. First, we transform the original instances into a new data representation using projections learnt from pairwise constraints. Then, we build the base classifiers with the new data representation. We propose two methods for resampling pairwise constraints following the standard Bagging and Boosting algorithms, respectively. Extensive experiments validate the effectiveness of our method.
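The Bagging-style variant of this idea starts from bootstrap samples of the constraint set rather than of the instances. The sketch below illustrates only that resampling step; the projection learning and base-classifier training described in the abstract are deliberately not reproduced, and the `(i, j, same_class)` triple encoding is an assumption for the example.

```python
# Bagging-style resampling of pairwise constraints: each base learner
# receives its own bootstrap sample (drawn with replacement) of the
# must-link / cannot-link pairs. Sketch of the resampling step only.
import random

def bootstrap_constraints(constraints, n_learners, seed=0):
    rng = random.Random(seed)
    return [[rng.choice(constraints) for _ in constraints]
            for _ in range(n_learners)]

# (i, j, same_class) triples over instance indices:
# must-link if True, cannot-link if False.
constraints = [(0, 1, True), (0, 2, False), (1, 3, False), (2, 3, True)]
samples = bootstrap_constraints(constraints, n_learners=3)
for sample in samples:
    print(sample)  # each base learner sees its own resampled constraints
```

Because the samples differ, the projections learnt from them differ, and so do the base classifiers built on the projected data, which is where the ensemble's diversity comes from.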
Using Knowledge Driven Matrix Factorization to Reconstruct Modular Gene Regulatory Network
Abstract

Cited by 1 (1 self)
Reconstructing gene networks from microarray data can provide information on the mechanisms that govern cellular processes. Numerous studies have been devoted to addressing this problem. A popular method is to view the gene network as a Bayesian inference network, and to apply structure learning methods to determine its topology. There are, however, several shortcomings of the Bayesian structure learning approach for reconstructing gene networks. They include the high computational cost associated with analyzing a large number of genes and inefficiency in exploiting prior knowledge of co-regulation that could be derived from Gene Ontology (GO) information. In this paper, we present a knowledge-driven matrix factorization (KMF) framework for reconstructing modular gene networks that addresses these shortcomings. In KMF, gene expression data is initially used to estimate the correlation matrix. The gene modules and the interactions among the modules are derived by factorizing the correlation matrix. The prior knowledge in GO is integrated into the matrix factorization to help identify the gene modules. An alternating optimization algorithm is presented to find the solution efficiently. Experiments show that our algorithm performs significantly better in identifying gene modules than several state-of-the-art algorithms, and the interactions among the modules uncovered by our algorithm prove to be biologically meaningful.
COMPLEX SCENE ANALYSIS IN URBAN AREAS BASED ON AN ENSEMBLE CLUSTERING METHOD APPLIED ON LIDAR DATA
Abstract
3D object extraction is one of the main interests and has many applications in photogrammetry and computer vision. In recent years, airborne laser scanning has been accepted as an effective 3D data collection technique for extracting spatial object models such as digital terrain models (DTM) and building models. Data clustering, also known as unsupervised learning, is one of the key techniques in object extraction and is used to understand the structure of unlabeled data. Classical clustering methods such as k-means attempt to subdivide a data set into subsets or clusters. A large number of recent studies have attempted to improve the performance of clustering. In this paper, the boost-clustering algorithm, a novel clustering methodology that exploits the general principles of boosting, is implemented and evaluated on features extracted from LiDAR data. This is a multi-clustering technique in which, at each iteration, a new training set is created using weighted random sampling from the original dataset, and a simple clustering algorithm such as k-means is applied to provide a new data partitioning. The final clustering solution is produced by aggregating the weighted multiple clustering results. This clustering methodology is used for the analysis of complex scenes in urban areas by extracting three different object classes (buildings, trees, and ground) from LiDAR datasets. Experimental results indicate that boost clustering using k-means as its underlying training method provides improved performance and accuracy compared to the simple k-means algorithm.
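The weighted random sampling that drives each boost-clustering iteration can be sketched with the standard library alone. This is a toy illustration of the sampling step only; the weight-update rule and the aggregation of partitions from the paper are not reproduced, and the weights shown are invented for the example.

```python
# Weighted bootstrap sampling, the core of each boost-clustering round:
# points with larger weight are more likely to enter the next training
# set, so later rounds concentrate on points the current partitioning
# handles poorly.
import random

def weighted_sample(points, weights, k, seed=0):
    rng = random.Random(seed)
    return rng.choices(points, weights=weights, k=k)

points = ["p0", "p1", "p2", "p3"]
weights = [0.7, 0.1, 0.1, 0.1]   # p0 is currently the hardest to cluster
sample = weighted_sample(points, weights, k=8)
print(sample)
```

Running a base clusterer such as k-means on each such sample and aggregating the weighted partitions yields the final solution described in the abstract.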
Prediction and Change Detection in Sequential Data for Interactive Applications
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008
Abstract
We consider the problems of sequential prediction and change detection that arise often in interactive applications: a semi-automatic predictor is applied to a time-series and is expected to make proper predictions and request new human input when change points are detected. Motivated by the Transductive Support Vector Machines (Vapnik 1998), we propose an online framework that naturally addresses these problems in a unified manner. Our empirical study with a synthetic dataset and a road tracking dataset demonstrates the efficacy of the proposed approach.
Knowledge-based Cluster Ensemble for 3D Head Model Classification
Abstract
Recently, researchers have been paying more attention to 3D model classification due to its useful applications in multimedia, computer graphics, and so on. Although there exist a number of approaches to classify 3D models, few of them consider prior knowledge during the process of 3D model classification. In this paper, we propose a new framework called the knowledge-based cluster ensemble, which incorporates prior knowledge of the dataset into the cluster ensemble framework to classify 3D models. Experiments show that the knowledge-based cluster ensemble framework works well on a 3D human head model database.