Results 1–10 of 319
An introduction to variable and feature selection
Journal of Machine Learning Research, 2003
Abstract

Cited by 1269 (17 self)
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available.
A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns
Genome Informatics, 2002
Abstract

Cited by 104 (6 self)
Feature selection plays an important role in classification. We present a comparative study of six feature selection heuristics by applying them to two sets of data. The first data set consists of gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second consists of proteomic patterns from ovarian cancer patients. Based on the features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.
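The abstract does not name the six heuristics, but a minimal sketch of one widely used ranking heuristic for two-class expression data — scoring each feature by a two-sample t-statistic and keeping the top-scoring ones — might look like this (the data and function names here are illustrative, not from the paper):

```python
import math

def t_score(class_a, class_b):
    """Two-sample t-statistic for one feature (a common ranking heuristic)."""
    na, nb = len(class_a), len(class_b)
    ma = sum(class_a) / na
    mb = sum(class_b) / nb
    va = sum((x - ma) ** 2 for x in class_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in class_b) / (nb - 1)
    return abs(ma - mb) / math.sqrt(va / na + vb / nb)

def rank_features(samples_a, samples_b, top_k):
    """Rank features by t-score; samples are lists of feature vectors."""
    n_features = len(samples_a[0])
    scores = []
    for j in range(n_features):
        col_a = [s[j] for s in samples_a]
        col_b = [s[j] for s in samples_b]
        scores.append((t_score(col_a, col_b), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]

# Feature 0 separates the two classes; feature 1 is noise.
a = [[5.0, 1.0], [5.5, 0.4], [4.8, 1.1]]
b = [[1.0, 0.9], [1.2, 1.0], [0.8, 0.6]]
print(rank_features(a, b, 1))  # → [0]
```

A classifier is then trained only on the selected columns, which is how the error rates in a study like this would be compared across heuristics.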
Cluster Graph Modification Problems
Discrete Applied Mathematics, 2002
Abstract

Cited by 69 (5 self)
In a clustering problem one has to partition a set of elements into homogeneous and well-separated subsets. From a graph-theoretic point of view, a cluster graph is a vertex-disjoint union of cliques. The clustering problem is the task of making the fewest changes to the edge set of an input graph so that it becomes a cluster graph. We study the complexity of three variants of the problem. In the Cluster Completion variant, edges can only be added. In Cluster Deletion, edges can only be deleted. In Cluster Editing, both edge additions and edge deletions are allowed. We also study these variants when the desired solution must contain a prespecified number of clusters.
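A brute-force sketch of the Cluster Editing variant (not the paper's algorithm, which concerns complexity) makes the definition concrete: enumerate all partitions of a tiny vertex set, and for each partition count the edge additions and deletions needed to turn the graph into a cluster graph respecting that partition:

```python
from itertools import combinations

def partitions(items):
    """Yield all set partitions of a list (Bell-number many; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield [[first]] + part

def editing_cost(nodes, edges, clusters):
    """Edge edits needed so each cluster is a clique and clusters are disconnected."""
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(nodes, 2):
        same = any(u in c and v in c for c in clusters)
        present = frozenset((u, v)) in edge_set
        if same and not present:
            cost += 1  # missing intra-cluster edge must be added
        if not same and present:
            cost += 1  # inter-cluster edge must be deleted
    return cost

def cluster_editing(nodes, edges):
    """Minimum number of edge additions/deletions to reach a cluster graph."""
    return min(editing_cost(nodes, edges, p) for p in partitions(list(nodes)))

# A path 1-2-3: one deletion (or one addition) suffices.
print(cluster_editing([1, 2, 3], [(1, 2), (2, 3)]))  # → 1
```

Restricting `editing_cost` to additions only, or deletions only, gives the Cluster Completion and Cluster Deletion variants respectively.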
Adaptive Dimension Reduction for Clustering High Dimensional Data
2002
Abstract

Cited by 68 (2 self)
It is well-known that for high-dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods have been proposed to tackle this problem, but with only limited success. In this paper we propose a new approach that resolves this problem by repeated dimension reductions, such that K-means or EM is performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced-dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
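A toy sketch of the alternation the abstract describes, under simplifying assumptions that are mine rather than the paper's (two clusters, a 1-D projection, plain Lloyd's algorithm): cluster in a low-dimensional projection, then use the cluster memberships to re-derive the projection direction from the centroid difference:

```python
import math

def kmeans_1d(values, iters=20):
    """Two-center Lloyd's algorithm on scalars; returns cluster labels."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: abs(v - centers[k])) for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

def centroid(points):
    return [sum(p[j] for p in points) / len(points) for j in range(len(points[0]))]

def adaptive_reduction_cluster(points, rounds=3):
    """Alternate between clustering in a 1-D projection and re-deriving the
    projection direction from the cluster centroids; cluster membership is
    the bridge between the reduced subspace and the original space."""
    direction = [1.0] * len(points[0])  # arbitrary starting direction
    labels = [0] * len(points)
    for _ in range(rounds):
        norm = math.sqrt(sum(x * x for x in direction))
        unit = [x / norm for x in direction]
        proj = [sum(p[j] * unit[j] for j in range(len(p))) for p in points]
        labels = kmeans_1d(proj)
        c0 = centroid([p for p, lab in zip(points, labels) if lab == 0])
        c1 = centroid([p for p, lab in zip(points, labels) if lab == 1])
        direction = [x - y for x, y in zip(c1, c0)]  # direction separating the clusters
    return labels

# Two blobs separated along the first coordinate.
data = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0],
        [5.0, 0.1], [5.2, -0.2], [4.9, 0.0]]
print(adaptive_reduction_cluster(data))  # → [0, 0, 0, 1, 1, 1]
```

The paper's actual method works with more clusters and a proper low-dimensional subspace; the point of the sketch is only the membership-driven feedback loop between the two spaces.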
Ensemble learning for independent component analysis
In Advances in Independent Component Analysis, 2000
Abstract

Cited by 59 (3 self)
This thesis is concerned with the problem of Blind Source Separation. Specifically, we consider the Independent Component Analysis (ICA) model, in which a set of observations is modelled by x_t = A s_t, (1) where A is an unknown mixing matrix and s_t is a vector of hidden source components at time t. The ICA problem is to find the sources given only a set of observations. In chapter 1, the blind source separation problem is introduced. In chapter 2 the method of Ensemble Learning is explained. Chapter 3 applies Ensemble Learning to the ICA model, and chapter 4 assesses the use of Ensemble Learning for model selection. Chapters 5–7 apply the Ensemble Learning ICA algorithm to data sets from physics (a medical imaging data set consisting of images of a tooth), biology (data sets from cDNA microarrays) and astrophysics (Planck image separation and galaxy spectra separation).
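A minimal illustration of model (1), x_t = A s_t, with a 2×2 mixing matrix. Real ICA — including the Ensemble Learning approach of the thesis — must estimate the unmixing matrix from the observations alone; here A is known and simply inverted, so only the generative model is being demonstrated:

```python
def mat_vec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 3.0]]                     # mixing matrix (known only for illustration)
sources = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]  # s_t for t = 0, 1, 2
observations = [mat_vec(A, s) for s in sources]  # x_t = A s_t

W = inv2(A)  # ICA must estimate this unmixing matrix blindly; here we cheat
recovered = [mat_vec(W, x) for x in observations]
print(recovered)  # → the original sources, up to floating-point error
```

In a blind setting W can only be recovered up to permutation and scaling of the sources, which is why ICA algorithms report components rather than an exact inverse.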
Stability-based model selection
In Advances in Neural Information Processing Systems, 2002
Abstract

Cited by 42 (7 self)
Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is cross-validation, which is not applicable to semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound on cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.
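A rough sketch of a stability score for model order selection in the spirit of the abstract (the details here — 1-D k-means, quantile initialization, label matching by permutation — are illustrative assumptions, not the paper's method): cluster two halves of the data, transfer one half's solution to the other, and measure label agreement up to permutation:

```python
import itertools
import random

def kmeans_1d(values, k, iters=25):
    """Lloyd's algorithm on scalars with deterministic quantile initialization."""
    s = sorted(values)
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers

def assign(values, centers):
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values]

def stability(data, k, splits=10, seed=0):
    """Average agreement (maximized over label permutations) between one
    half's clustering and the solution transferred from the other half;
    a higher score means the model order k is more stable."""
    rng = random.Random(seed)
    scores = []
    for _ in range(splits):
        d = data[:]
        rng.shuffle(d)
        half = len(d) // 2
        a, b = d[:half], d[half:]
        la = assign(b, kmeans_1d(a, k))  # transfer half a's solution onto b
        lb = assign(b, kmeans_1d(b, k))  # b's own solution
        best = max(sum(perm[x] == y for x, y in zip(la, lb))
                   for perm in itertools.permutations(range(k)))
        scores.append(best / len(b))
    return sum(scores) / len(scores)

# Two well-separated 1-D clusters: k = 2 should be maximally stable.
data = [i * 0.1 for i in range(10)] + [10 + i * 0.1 for i in range(10)]
print(stability(data, 2), stability(data, 3))
```

With two clearly separated clusters, k = 2 reproduces the same split on every subsample, while k = 3 must cut one cluster at an arbitrary point that shifts between subsamples, which is exactly the instability the score is meant to detect.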
DHC: A Density-Based Hierarchical Clustering Method for Time Series Gene Expression Data
Proc. Third IEEE Symp. BioInformatics and BioEngineering (BIBE ’03), 2003
Abstract

Cited by 37 (4 self)
Clustering time series gene expression data is an important task in bioinformatics research and biomedical applications. Recently, some clustering methods have been adapted or proposed. However, some concerns remain, such as the robustness of the mining methods, as well as the quality and interpretability of the mining results. In this paper, we tackle the problem of effectively clustering time series gene expression data by proposing DHC, a density-based, hierarchical clustering method. We use a density-based approach to identify the clusters such that the clustering results are of high quality and robustness. Moreover, the mining result takes the form of a density tree, which uncovers the embedded clusters in a data set. The inner structures, borders and outliers of the clusters can be further investigated using the attraction tree, which is an intermediate result of the mining. Through these two trees, the internal structure of the data set can be visualized effectively. Our empirical evaluation using real-world data sets shows that the method is effective, robust and scalable, and that it matches the ground truth provided by bioinformatics experts very well on the sample data sets.
Model order selection for biomolecular data clustering
BMC Bioinformatics, 2007
Abstract

Cited by 31 (5 self)
Background: Cluster analysis has been widely applied for investigating structure in biomolecular data. A drawback of most clustering algorithms is that they cannot automatically detect the “natural” number of clusters underlying the data, and in many cases we do not have enough “a priori” biological knowledge to evaluate both the number of clusters and their validity. Recently several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters, but despite their successful application to the analysis of complex biomolecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional biomolecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize biomolecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant, and if possible multi-level, structures simultaneously present in the data (e.g. hierarchical structures).
A generic framework for efficient subspace clustering of high-dimensional data
In ICDM, 2005
Abstract

Cited by 30 (4 self)
Subspace clustering has been investigated extensively since traditional clustering algorithms often fail to detect meaningful clusters in high-dimensional data spaces. Many recently proposed subspace clustering methods suffer from two severe problems: first, the algorithms typically scale exponentially with the data dimensionality and/or the subspace dimensionality of the clusters; second, for performance reasons, many algorithms use a global density threshold for clustering, which is quite questionable since clusters in subspaces of significantly different dimensionality will most likely exhibit significantly varying densities. In this paper, we propose a generic framework to overcome these limitations. Our framework is based on an efficient filter-refinement architecture that scales at most quadratically w.r.t. the data dimensionality and the dimensionality of the subspace clusters. It can be applied to any clustering notion, including notions based on a local density threshold. A broad experimental evaluation on synthetic and real-world data empirically shows that our method achieves a significant gain in runtime and quality in comparison to state-of-the-art subspace clustering algorithms.
Analysis of Gene Expression Profiles: Class Discovery and Leaf Ordering
In Proc. 6th Int'l Conf. Research in Comp. Mol. Bio. (RECOMB 2002), 2002
Abstract

Cited by 22 (6 self)
We approach the class discovery and leaf ordering problems using spectral graph partitioning methodologies. For class discovery, or clustering, we present a min-max cut hierarchical clustering method and show that it produces subtypes quite close to human expert labeling on the lymphoma dataset with 6 classes. For optimal leaf ordering when displaying gene expression data, we present a sequential ordering method that can be computed in O(n²) time and also preserves the cluster structure. We also show that well-known statistical methods such as the F-statistic test and principal component analysis are very useful in gene expression analysis.