Clustering with instance-level constraints
In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
Cited by 116 (6 self)
One goal of research in artificial intelligence is to automate tasks that currently require human expertise; this automation is important because it saves time and brings problems that were previously too large to be solved into the feasible domain. Data analysis, or the ability to identify meaningful patterns and trends in large volumes of data, is an important task that falls into this category. Clustering algorithms are a particularly useful group of data analysis tools. These methods are used, for example, to analyze satellite images of the Earth to identify and categorize different land and foliage types or to analyze telescopic observations to determine what distinct types of astronomical bodies exist and to categorize each observation. However, most existing clustering methods apply general similarity techniques rather than making use of problem-specific information. This dissertation first presents a novel method for converting existing clustering algorithms into constrained clustering algorithms. The resulting methods are able to accept domain-specific information in the form of constraints on the output clusters. At the most general level, each constraint is an instance-level statement
Feature Selection in Unsupervised Learning via Evolutionary Search
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 48 (3 self)
Feature subset selection is an important problem in knowledge discovery, not only for the insight gained from determining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. In this paper we consider the problem of feature selection for unsupervised learning. A number of heuristic criteria can be used to estimate the quality of clusters built from a given feature subset. Rather than combining such criteria, we use ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multi-dimensional objective space. Each evolved solution represents a feature subset and a number of clusters; a standard K-means algorithm is applied to form the given number of clusters based on the selected features. Preliminary results on both real and synthetic data show promise in finding Pareto-optimal solutions through which we can identify the significant features and the correct number of clusters.
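As a rough illustration of the Pareto-front idea mentioned in this abstract, the sketch below filters a list of candidate score vectors down to the non-dominated ones. It is not ELSA and does not reproduce the paper's criteria; the function names, the example scores, and the assumption that every objective is maximized are all hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b on every objective
    and strictly better on at least one (assumes maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Tuple[float, ...]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical two-objective scores for four candidate feature subsets.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(candidates)
```

Here the third candidate is dropped because the second is better on both objectives; the other three represent different trade-offs and all remain on the front.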
Differential evolution and particle swarm optimisation in partitional clustering
Comput. Stat. Data Anal., 2006
Cited by 16 (0 self)
In recent years, many partitional clustering algorithms based on genetic algorithms (GA) have been proposed to tackle the problem of finding the optimal partition of a data set. Surprisingly, very few studies considered alternative stochastic search heuristics other than GAs or simulated annealing. Two promising algorithms for numerical optimization, which are hardly known outside the heuristic search field, are particle swarm optimisation (PSO) and differential evolution (DE). In this study, we compared the performance of GAs with PSO and DE for a medoid evolution approach to clustering, which Paterlini and Minerva (2003) introduced in a previous paper. Moreover, we compared these results with the nominal classification, k-means and random search (RS) as a lower bound. Our results show that DE is clearly and consistently superior to GAs and PSO for hard clustering problems, with respect to both precision and robustness (reproducibility) of the results. Only for simple data sets can the GA and PSO obtain the same quality of results, in contrast to k-means and RS, and, as expected, for trivial problems all algorithms can obtain comparable results. Apart from superior performance, DE is very easy to implement and requires hardly any parameter tuning compared to substantial tuning for GAs and PSOs. Our study shows that DE rather than GAs should receive primary attention in partitional cluster algorithms.
Key-words: Cluster analysis, partitional clustering, differential evolution, particle swarm optimization, genetic algorithms.
Evolutionary Model Selection in Unsupervised Learning
2002
Cited by 10 (0 self)
Feature subset selection is important not only for the insight gained from determining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. Feature selection has traditionally been studied in supervised learning situations, with some estimate of accuracy used to evaluate candidate subsets. However, we often cannot apply supervised learning for lack of a training signal. For these cases, we propose a new feature selection approach based on clustering. A number of heuristic criteria can be used to estimate the quality of clusters built from a given feature subset. Rather than combining such criteria, we use ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multi-dimensional objective space. Each evolved solution represents a feature subset and a number of clusters; two representative clustering algorithms, K-means and EM, are applied to form the given number of clusters based on the selected features. Experimental results on both real and synthetic data show that the method can consistently find approximate Pareto-optimal solutions through which we can identify the significant features and an appropriate number of clusters. This results in models with better and clearer semantic relevance.
A Genetic Rule-Based Data Clustering Toolkit
In Proceedings of the 2002 Congress on Evolutionary Computation CEC2002, 2002
Cited by 9 (0 self)
Clustering is a hard combinatorial problem and is defined as the unsupervised classification of patterns. The formation of clusters is based on the principle of maximizing the similarity between objects of the same cluster while simultaneously minimizing the similarity between objects belonging to distinct clusters. This paper presents a tool for database clustering using a rule-based genetic algorithm (RBCGA). RBCGA evolves individuals consisting of a fixed set of clustering rules, where each rule includes d non-binary intervals, one for each feature. The investigations attempt to alleviate certain drawbacks related to the classical minimization of the square-error criterion by suggesting a flexible fitness function which takes into consideration cluster asymmetry, density, coverage and homogeny.
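The rule representation described here (d intervals, one per feature) can be pictured with a small sketch. This is only an illustration of interval rules, not RBCGA or its fitness function; the closed-interval convention, the first-match assignment, the use of -1 for uncovered points, and all names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# One closed interval [low, high] per feature (an assumed convention).
Interval = Tuple[float, float]
Rule = Sequence[Interval]


def rule_covers(rule: Rule, point: Sequence[float]) -> bool:
    """Return True if every feature value lies inside the rule's interval for that feature."""
    return len(rule) == len(point) and all(
        low <= x <= high for (low, high), x in zip(rule, point)
    )


def assign(rules: Sequence[Rule], points: Sequence[Sequence[float]]) -> List[int]:
    """Label each point with the index of the first rule covering it, or -1 if none does."""
    labels = []
    for p in points:
        label = -1
        for i, rule in enumerate(rules):
            if rule_covers(rule, p):
                label = i
                break
        labels.append(label)
    return labels


# Two hypothetical rules over d = 2 features, and three sample points.
rules = [[(0.0, 1.0), (0.0, 1.0)], [(2.0, 3.0), (2.0, 3.0)]]
points = [(0.5, 0.5), (2.5, 2.1), (5.0, 5.0)]
labels = assign(rules, points)
```

An individual in such a scheme would be a fixed-size list of rules like `rules` above; how overlapping rules and uncovered points are handled is a design choice the abstract does not specify.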
A Maximum Variance Cluster Algorithm
IEEE Trans. Pattern Anal. Mach. Intell., 2002
Cited by 8 (1 self)
We present a partitional cluster algorithm that minimizes the sum-of-squared-error criterion while imposing a hard constraint on the cluster variance. Conceptually, hypothesized clusters act in parallel and cooperate with their neighboring clusters in order to minimize the criterion and to satisfy the variance constraint. In order to enable the demarcation of the cluster neighborhood without crucial parameters, we introduce the notion of foreign cluster samples. Finally, we demonstrate a new method for cluster tendency assessment based on varying the variance constraint parameter.
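To illustrate what a hard constraint on cluster variance might look like, the following sketch checks whether every cluster in a partition stays under a variance limit. It is not the algorithm from the paper: the variance definition (mean squared Euclidean distance to the centroid), the parameter name `max_variance`, and the helper names are assumptions for the example.

```python
from typing import List, Sequence

Point = Sequence[float]


def cluster_variance(cluster: List[Point]) -> float:
    """Mean squared Euclidean distance of the points to their centroid (assumed definition)."""
    n = len(cluster)
    d = len(cluster[0])
    centroid = [sum(p[k] for p in cluster) / n for k in range(d)]
    return sum(sum((p[k] - centroid[k]) ** 2 for k in range(d)) for p in cluster) / n


def satisfies_variance_constraint(clusters: List[List[Point]], max_variance: float) -> bool:
    """Return True if no cluster exceeds the variance limit."""
    return all(cluster_variance(c) <= max_variance for c in clusters)


# Hypothetical clusters: a tight one and a more spread-out one.
tight = [(0.0, 0.0), (2.0, 0.0)]   # variance 1.0
loose = [(0.0, 0.0), (0.0, 4.0)]   # variance 4.0
ok = satisfies_variance_constraint([tight], max_variance=1.0)
violated = not satisfies_variance_constraint([tight, loose], max_variance=1.0)
```

Sweeping `max_variance` over a range and recording how the resulting partitions change is the kind of experiment the abstract's cluster tendency assessment suggests, although the details are not given in this listing.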
A genetic algorithm using hyperquadtrees for low-dimensional k-means clustering
IEEE Trans. on Pattern Analysis and Machine Intelligence, 2006
Cited by 6 (1 self)
The k-means algorithm is widely used for clustering because of its computational efficiency. Given n points in d-dimensional space and the number of desired clusters k, k-means seeks a set of k cluster centers so as to minimize the sum of the squared Euclidean distance between each point and its nearest cluster center. However, the algorithm is very sensitive to the initial selection of centers and is likely to converge to partitions that are significantly inferior to the global optimum. We present a genetic algorithm (GA) for evolving centers in the k-means algorithm that simultaneously identifies good partitions for a range of values around a specified k. The set of centers is represented using a hyper-quadtree constructed on the data. This representation is exploited in our GA to generate an initial population of good centers and to support a novel crossover operation that selectively passes good subsets of neighboring centers from parents to offspring by swapping subtrees. Experimental results indicate that our GA finds the global optimum for data sets with known optima and finds good solutions for large simulated data sets.
Index Terms: k-means algorithm, clustering, genetic algorithms, quadtrees, optimal partition, center selection.
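The objective stated in this abstract, the sum of squared Euclidean distances from each point to its nearest center, can be written down directly. The sketch below only evaluates that objective for a given set of centers; it is not the hyper-quadtree GA, and the function name and sample data are made up for illustration.

```python
from typing import Sequence

Point = Sequence[float]


def kmeans_objective(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        for p in points
    )


# Two well-separated groups of points (hypothetical data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = kmeans_objective(points, [(0.0, 1.0), (10.0, 1.0)])  # one center per group
poor = kmeans_objective(points, [(0.0, 0.0), (0.0, 2.0)])   # both centers in one group
```

The gap between `good` and `poor` shows why the choice of centers matters: a search procedure, genetic or otherwise, would compare candidate center sets by exactly this kind of score.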
An evolutionary clustering algorithm for gene expression microarray data analysis
IEEE Transactions on Evolutionary Computation, 2006
Cited by 5 (0 self)
Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on this encoding scheme, it makes use of a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts is able to differentiate how relevant each feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.
Index Terms: Bioinformatics, clustering, DNA sequence analysis, evolutionary algorithms (EAs), gene expression microarray data analysis.

