Results 1–10 of 13
On the Epistemological Crisis in Genomics
Cited by 13 (12 self)
Abstract: There is an epistemological crisis in genomics. At issue is what constitutes scientific knowledge in genomic science, or in systems biology in general. Does this crisis require a new perspective on knowledge heretofore absent from science, or is it merely a matter of interpreting new scientific developments within an existing epistemological framework? This paper discusses the manner in which the experimental method, as developed and understood over recent centuries, leads naturally to a scientific epistemology grounded in an experimental-mathematical duality. It places genomics into this epistemological framework and examines the current situation in genomics. Meaning and the constitution of scientific knowledge are key concerns for genomics, and the nature of the epistemological crisis in genomics depends on how these are understood.
Validation of Computational Methods in Genomics
Cited by 6 (6 self)
Abstract: High-throughput technologies for genomics provide tens of thousands of genetic measurements, for instance, gene-expression measurements on microarrays, and the availability of these measurements has motivated the use of machine learning (inference) methods for classification, clustering, and gene networks. Generally, a design method will yield a model that satisfies some model constraints and fits the data in some manner. On the other hand, a scientific theory consists of two parts: (1) a mathematical model to characterize relations between variables, and (2) a set of relations between model variables and observables that are used to validate the model via predictive experiments. Although machine learning algorithms are constructed in the hope of producing valid scientific models, they do not ipso facto do so. In some cases, such as classifier estimation, there is a well-developed error theory that relates to model validity according to various statistical theorems, but in others, such as clustering, there is a lack of understanding of the relationship between the learning algorithms and validation. The issue of validation is especially problematic in situations where the sample size is small in comparison with the dimensionality (number of variables), which is commonplace in genomics, because the convergence theory of learning algorithms is typically asymptotic and the algorithms often perform in counterintuitive ways when used with samples that are small in relation to the number of variables. For translational genomics, validation is perhaps the most critical issue, because it is imperative that we understand the performance of a diagnostic or therapeutic procedure to be used in the clinic, and this performance relates directly to the validity of the model behind the procedure.
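The small-sample issue this abstract raises can be made concrete. The sketch below is not from the paper; the nearest-centroid classifier, data, and sizes are invented for illustration. It contrasts the optimistically biased resubstitution error estimate with leave-one-out cross-validation when the sample size (20) is far below the dimensionality (1,000):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_fit(X, y):
    """One centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# Small-sample, high-dimensional setting: n = 20 samples, p = 1000 features,
# with a weak class signal confined to the first 10 features.
n, p = 20, 1000
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 0.5

# Resubstitution: train and test on the same data (optimistically biased).
resub = np.mean(nearest_centroid_predict(nearest_centroid_fit(X, y), X) != y)

# Leave-one-out cross-validation: less biased, but high-variance at this n.
loo_errors = []
for i in range(n):
    mask = np.arange(n) != i
    m = nearest_centroid_fit(X[mask], y[mask])
    loo_errors.append(nearest_centroid_predict(m, X[i:i + 1])[0] != y[i])
loo = np.mean(loo_errors)

print(f"resubstitution error: {resub:.2f}, LOO error: {loo:.2f}")
```

The gap between the two estimates, and the instability of both at n = 20, is exactly the kind of counterintuitive small-sample behavior the abstract describes.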
Clustering algorithms: on learning, validation, performance, and applications to genomics. Curr Genomics. 2009; 10: 430–45. doi: 10.2174/138920209789177601 PMID
Cited by 4 (0 self)
Abstract: The development of microarray technology has enabled scientists to measure the expression of thousands of genes simultaneously, resulting in a surge of interest in several disciplines throughout biology and medicine. While data clustering has been used for decades in image processing and pattern recognition, in recent years it has joined this wave of activity as a popular technique to analyze microarrays. To illustrate its application to genomics, clustering applied to genes from a set of microarray data groups together those genes whose expression levels exhibit similar behavior throughout the samples, and when applied to samples it offers the potential to discriminate pathologies based on their differential patterns of gene expression. Although clustering has now been used for many years in the context of gene-expression microarrays, it has remained highly problematic. The choice of a clustering algorithm and validation index is not a trivial one, even more so when applying them to high-throughput biological or medical data. Factors to consider when choosing an algorithm include the nature of the application, the characteristics of the objects to be analyzed, the expected number and shape of the clusters, and the complexity of the problem versus the computational power available. In some cases a very simple algorithm may be appropriate to tackle a problem, but many situations may require a more complex and powerful algorithm better suited for the job at hand. In this paper, we cover the theoretical aspects of clustering, including error and learning, followed by an overview of popular clustering algorithms and classical validation indices. We also discuss the relative performance of these algorithms and indices and conclude with examples of the application of clustering to genomics.
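As a toy illustration of pairing a clustering algorithm with a validation index (not tied to any specific algorithm surveyed in the paper; the choice of k-means with a silhouette-style index, and the synthetic data, are assumptions for the example), the index can be used to compare candidate numbers of clusters:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with a deterministic spread-out initialization."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def silhouette(X, labels):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    idx = np.arange(len(X))
    scores = []
    for i in idx:
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)
            continue
        a = D[i, same & (idx != i)].mean()          # mean intra-cluster distance
        b = min(D[i, labels == c].mean()            # nearest other cluster
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated groups of synthetic "expression profiles".
X = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
scores = {k: silhouette(X, kmeans(X, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

On this data the index favors k = 2, matching the generating structure; on real microarray data the abstract's caveats apply, and the index choice itself shapes the answer.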
Volumetric Texture Segmentation by Discriminant Feature Selection and Multiresolution Classification
Cited by 2 (0 self)
In this paper a Multiresolution Volumetric Texture Segmentation (MVTS) algorithm is presented. The method extracts textural measurements from the Fourier domain of the data via subband filtering using an Orientation Pyramid [1]. A novel Bhattacharyya space, based on the Bhattacharyya distance, is proposed for selecting the most discriminant measurements and producing a compact feature space. An octree is built on the multivariate feature space, and a chosen level at a lower spatial resolution is classified first. The classified voxel labels are then projected to lower levels of the tree, where a boundary refinement procedure is performed with a 3D equivalent of butterfly filters. The algorithm was tested in 3D on artificial data and on three Magnetic Resonance Imaging sets of human knees, with encouraging results. The regions segmented from the knees correspond to anatomical structures that can be used as a starting point for other measurements, such as cartilage extraction.
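The coarse-to-fine idea can be sketched in a simplified 2-D form: classify at low resolution, project the labels up, then re-classify only near boundaries. The thresholding "classifier" and the gradient-based boundary test below are crude stand-ins for the paper's subband features and butterfly filters, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-texture test image: left half darker, right half brighter, plus noise.
img = np.where(np.arange(64)[None, :] < 32, 0.3, 0.7) + rng.normal(0, 0.1, (64, 64))

def downsample(a):
    """2x2 block averaging (one pyramid level)."""
    return a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2).mean(axis=(1, 3))

def classify(a, thresh=0.5):
    """Stand-in 'classifier': threshold the (averaged) intensity."""
    return (a > thresh).astype(int)

# 1. Classify at a coarse level, where block averaging has suppressed noise.
coarse = classify(downsample(downsample(img)))
# 2. Project the coarse labels back to full resolution.
labels = np.repeat(np.repeat(coarse, 4, axis=0), 4, axis=1)
# 3. Refine only near label boundaries by re-classifying those pixels at
#    full resolution (a crude stand-in for the butterfly-filter refinement).
mixed = np.abs(np.gradient(labels.astype(float))[1]) > 0
labels[mixed] = classify(img)[mixed]

truth = np.repeat((np.arange(64) >= 32).astype(int)[None, :], 64, axis=0)
err = float(np.mean(labels != truth))
print("pixel error:", err)
```

The coarse pass is cheap and noise-robust; only the thin boundary band pays the cost of full-resolution classification, which is the economy the MVTS pipeline exploits in 3D.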
Simcluster: clustering enumeration gene expression data on the simplex space
, 2008
Cited by 1 (0 self)
Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene-expression measurement. As with other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly online tool. Both versions are available at:
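The simplex-space point can be illustrated with a centered log-ratio (CLR) transform from Aitchison-style compositional data analysis. Whether Simcluster uses exactly this transform is not stated here; the counts and pseudo-count below are invented for the example:

```python
import numpy as np

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform: maps compositions from the simplex
    into Euclidean space, where ordinary distances make sense."""
    x = counts + pseudo                      # pseudo-counts avoid log(0)
    x = x / x.sum(axis=1, keepdims=True)     # closure: rows sum to 1
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Two libraries with the same relative expression at very different
# sequencing depths, and one with a genuinely different composition.
a = np.array([[100, 200, 700]])
b = np.array([[10, 20, 70]])     # same composition as `a`, 10x shallower
c = np.array([[700, 200, 100]])
Z = clr(np.vstack([a, b, c]))

d_ab = float(np.linalg.norm(Z[0] - Z[1]))   # small: depth difference vanishes
d_ac = float(np.linalg.norm(Z[0] - Z[2]))   # large: real compositional change
print(d_ab, d_ac)
```

A naive Euclidean distance on raw counts would separate `a` from `b` purely by sequencing depth; after the transform, only the compositional difference to `c` remains, which is the property the abstract argues microarray-style methods ignore.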
The Bhattacharyya Space for Feature Selection and its application to Texture Segmentation
A feature selection methodology based on a novel Bhattacharyya Space is presented and illustrated with a texture segmentation problem. The Bhattacharyya Space is constructed from the Bhattacharyya distances of different measurements extracted with subband filters from training samples. The marginal distributions of the Bhattacharyya Space present a sequence of the most discriminant subbands that can be used as a path for a wrapper algorithm. When this feature selection is used with a multiresolution classification algorithm on a standard set of texture mosaics, it produces the lowest misclassification errors reported.
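A minimal version of ranking measurements by Bhattacharyya distance can be sketched under a univariate Gaussian assumption per class (the closed form below is the standard Gaussian Bhattacharyya distance; the three synthetic "subband" features are invented for the example):

```python
import numpy as np

def bhattacharyya_gauss(x1, x2):
    """Bhattacharyya distance between two samples, each modeled as a
    univariate Gaussian (standard closed form)."""
    m1, m2 = x1.mean(), x2.mean()
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2))))

rng = np.random.default_rng(2)
n = 500
# Three synthetic "subband" measurements for two textures: subband 0 is
# strongly discriminant, subband 1 weakly, subband 2 not at all.
class_a = rng.normal([0.0, 0.0, 0.0], 1.0, (n, 3))
class_b = rng.normal([3.0, 0.5, 0.0], 1.0, (n, 3))

dists = [float(bhattacharyya_gauss(class_a[:, j], class_b[:, j])) for j in range(3)]
ranking = np.argsort(dists)[::-1]   # most discriminant subband first
print(dists, ranking)
```

The resulting ranking is the kind of sequence of most discriminant subbands that, per the abstract, can serve as a search path for a wrapper feature-selection algorithm.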
A fully Bayesian model to cluster gene-expression profiles
Bioinformatics, Vol. 21, Suppl. 2, 2005, pages ii130–ii136. doi:10.1093/bioinformatics/bti1122
Similarity and Pattern Recognition
Abstract: This paper formally defines similarities as tolerance relations, which are reflexive and symmetric binary relations. An abstract set with a similarity is called a tolerance space. The training data set in a learning task is a given database of independent identically distributed random pairs (Xi, Yi), where each Xi is a record and Yi is its label: Yi ∈ {0, 1}. The goal of the learning is to design a classifier whose error probability is close to the theoretical limit, the Bayes error. The learning process consists of finding a similarity of feature vectors ψ(Xi), and the learning result is a representative data clustering on the tolerance space of feature vectors. The information about a record X derived from the representative clustering is the set of representatives similar to the feature vector ψ(X). The percentage of the records of class 1 in the intersection of these representative clusters is used to estimate the conditional probability of Y = 1. This paper defines a θ-classifier, which assigns the record to class 1 if the conditional probability is larger than the threshold θ. If the clustering is a partition, the threshold θ = 1/2 minimizes the error probability in the training data set. In general, an optimal θ-classifier has a different threshold. The experiments show the trade-off between the number of clusters and the error probabilities of the optimal θ-classifiers.
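The θ-classifier on a partition can be sketched as follows. The one-dimensional records and the sign-based two-cluster partition are invented for the example; with a partition, θ = 1/2 is used, as the abstract states:

```python
import numpy as np

def theta_classifier(clusters_p1, theta):
    """Assign class 1 to a cluster iff its estimated P(Y=1) exceeds theta."""
    return {c: int(p > theta) for c, p in clusters_p1.items()}

rng = np.random.default_rng(3)
# Toy 1-D records; the partition into two clusters is by sign, and labels
# are correlated with the cluster: P(Y=1) = 0.8 on one side, 0.2 on the other.
X = rng.normal(0, 1, 1000)
Y = (rng.random(1000) < np.where(X > 0, 0.8, 0.2)).astype(int)
clusters = np.where(X > 0, "pos", "neg")

# Estimated conditional probability P(Y = 1) within each cluster.
p1 = {c: Y[clusters == c].mean() for c in ("pos", "neg")}

# For a partition, theta = 1/2 minimizes the empirical error.
rule = theta_classifier(p1, theta=0.5)
pred = np.array([rule[c] for c in clusters])
err = float(np.mean(pred != Y))
print(p1, rule, "error:", err)
```

The training error lands near 0.2, the best achievable here, since 20% of labels disagree with their cluster's majority; with overlapping (non-partition) clusters, the optimal θ generally shifts away from 1/2, as the abstract notes.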
Trimmed ML Estimation of Contaminated Mixtures
© 2009, Indian Statistical Institute
We establish a mixture model with "spurious" outliers and derive its maximum likelihood estimator, the maximum trimmed likelihood estimator (MTLE). It may be computed with a trimmed version of the EM algorithm, which we call the EMT algorithm. We analyze its properties and compute various breakdown values of the estimator for normal mixtures, thereby proving the robustness of the method. AMS (2000) subject classification: Primary 62H12; Secondary 62F35.
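A trimmed EM iteration of the general kind the abstract describes can be sketched for a 1-D two-component normal mixture. The quantile initialization and the trimming fraction below are choices made for the example, not the paper's EMT specifics:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def emt(x, k=2, trim=0.1, iters=100):
    """Trimmed EM for a 1-D k-component normal mixture: each iteration,
    the `trim` fraction of points with the lowest mixture likelihood is
    discarded before the E/M-steps, guarding against gross outliers."""
    n = len(x)
    keep_n = int(round((1 - trim) * n))
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))   # robust deterministic init
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = np.stack([w[j] * normal_pdf(x, mu[j], var[j]) for j in range(k)])
        like = dens.sum(axis=0)
        keep = np.argsort(like)[-keep_n:]     # trim lowest-likelihood points
        r = dens[:, keep] / like[keep]        # E-step responsibilities
        nk = r.sum(axis=1)
        mu = (r * x[keep]).sum(axis=1) / nk   # M-step on retained points only
        var = (r * (x[keep] - mu[:, None]) ** 2).sum(axis=1) / nk
        var = np.maximum(var, 1e-6)
        w = nk / keep_n
    return mu, var, w

rng = np.random.default_rng(4)
# Two clean components plus three gross outliers.
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200), [50.0, 60.0, 70.0]])
mu, var, w = emt(x, k=2, trim=0.05)
print("component means:", np.sort(mu))
```

Because the outliers receive the lowest mixture likelihood, they fall inside the trimmed set at every iteration and never contaminate the M-step, so the component means stay near the clean values; ordinary untrimmed EM on the same data would drag a component (or its variance) toward the outliers.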