Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation
, 2002
Cited by 194 (3 self)
There are many sources of systematic variation in cDNA microarray experiments which affect the measured gene expression levels (e.g. differences in labeling efficiency between the two fluorescent dyes). The term normalization refers to the process of removing such variation. A constant adjustment is often used to force the distribution of the intensity log ratios to have a median of zero for each slide. However, such global normalization approaches are not adequate in situations where dye biases can depend on spot overall intensity and/or spatial location within the array. This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. The selection of appropriate controls for normalization is discussed and a novel set of controls (microarray sample pool, MSP) is introduced to aid in intensity-dependent normalization. Lastly, to allow for comparisons of expression levels across slides, a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides.
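The global adjustment mentioned in this abstract (shifting each slide's log ratios so that their median is zero) can be sketched as follows. This is only an illustration of the constant adjustment, not the robust local regression method the article proposes; the function name `median_normalize` and the sample intensities are assumptions made for the example.

```python
import math
from statistics import median


def median_normalize(red, green):
    """Return log2(R/G) ratios for one slide, shifted to have median zero.

    `red` and `green` are hypothetical lists of positive, background-corrected
    spot intensities for the two dye channels of a single slide.
    """
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)  # the constant adjustment for this slide
    return [m - shift for m in log_ratios]


# Illustrative intensities only (not data from the article).
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
normalized = median_normalize(red, green)
```

Because the shift is a single constant per slide, it cannot correct dye biases that vary with spot intensity or spatial location, which is the limitation the abstract points out.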
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 48 (3 self)
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field. 
Index Terms—Microarray technology, gene expression data, clustering.
Statistical Issues in cDNA Microarray Data Analysis
, 2003
Cited by 39 (2 self)
This article summarizes some of the issues involved and provides a brief review of the analysis tools which are available to researchers to deal with them. Any microarray experiment involves a number of distinct stages. Firstly there is the design of the experiment. The researchers must decide which genes are to be printed on the arrays, which sources of RNA are to be hybridized to the arrays and on how many arrays the hybridizations will be replicated. Secondly, after hybridization, there follows a number of data-cleaning steps or 'low-level analysis' of the microarray data. The microarray images must be processed to acquire red and green foreground and background intensities for each spot. The acquired red/green ratios must be normalized to adjust for dye-bias and for any systematic variation other than that due to the differences between the RNA samples being studied. Thirdly, the normalized ratios are analyzed by various graphical and numerical means to select differentially expressed genes or to find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups. The sections of this article correspond roughly to the various analysis steps. The following notation will be used throughout the article. The foreground red and green intensities will be written $R_f$ and $G_f$ for each spot. The background intensities will be $R_b$ and $G_b$. The background-corrected intensities will be $R$ and $G$ where usually $R = R_f - R_b$ and $G = G_f - G_b$. The log-differential expression ratio will be $M = \log_2(R/G)$ for each spot. Finally, the log-intensity of the spot will be $A = \frac{1}{2}\log_2(RG)$, a measure of the overall brightness of the spot. (The letter $M$ is a mnemonic for minus as $M = \log R - \log G$ while $A$ is a mnemonic for add as $A = (\log R + \log G)/2$ ...
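The per-spot quantities described in this abstract, the log-differential expression ratio and the log-intensity, can be computed as in the minimal sketch below. It assumes the conventional names M and A for these two quantities, base-2 logarithms, and simple background subtraction; the function name `ma_values` and the sample intensities are hypothetical.

```python
import math


def ma_values(red_fg, red_bg, green_fg, green_bg):
    """Return (M, A) for one spot from foreground and background intensities.

    M = log2(R / G) is the log-differential expression ratio and
    A = 0.5 * log2(R * G) is the overall log-intensity, where R and G are
    the background-corrected red and green intensities.
    """
    r = red_fg - red_bg      # background-corrected red intensity
    g = green_fg - green_bg  # background-corrected green intensity
    if r <= 0 or g <= 0:
        # Logarithms are undefined here; real pipelines handle such spots
        # with flagging or offsets, which this sketch does not attempt.
        raise ValueError("background-corrected intensities must be positive")
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a


# Illustrative intensities only: R = 1024 and G = 256 after correction.
m, a = ma_values(1124.0, 100.0, 356.0, 100.0)
```

With these inputs the red channel is four times brighter than the green, so M is 2, and A is the mean of the two log-intensities, 10 and 8.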
Optimal Sample Size for Multiple Testing: the Case of Gene Expression Microarrays
- Journal of the American Statistical Association
, 2004
Cited by 30 (1 self)
We consider the choice of an optimal sample size for multiple comparison problems. The motivating application is the choice of the number of microarray experiments to be carried out when learning about differential gene expression. However, the approach is valid in any application that involves multiple comparison in a large number of hypothesis tests.
Statistical challenges with high dimensionality: Feature selection in knowledge discovery
- Proceedings of the International Congress of Mathematicians
, 2006
Cited by 25 (7 self)
Abstract. Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of high-dimensionality arise in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. We first give a comprehensive overview of statistical challenges with high dimensionality in these diverse disciplines. We then approach the problem of variable selection and feature extraction using a unified framework: penalized likelihood methods. Issues relevant to the choice of penalty functions are addressed. We demonstrate that for a host of statistical problems, as long as the dimensionality is not excessively large, we can estimate the model parameters as well as if the best model is known in advance. The persistence property in risk minimization is also addressed. The applicability of such a theory and method to diverse statistical problems is demonstrated. Other related problems with high-dimensionality are also discussed.
A Statistical Framework for Expression-Based Molecular Classification in Cancer
, 2002
Cited by 17 (2 self)
In this paper, our aim is to provide a framework to support this three-faceted enterprise. We propose a probabilistic definition of differential expression in the context of unsupervised classification, and we use it to define molecular profiles, and to assess quantities of potential use in classification, such as the probability that a tumour belongs to a given profile and the probability that two tumours have the same profile. Our long-term goals are (a) to provide tools that will facilitate the use of prior knowledge about gene function in the screening process, in an interactive way, to improve the interpretation and clinical validation of the classification that will ultimately emerge from the analysis, and (b) to capture the potentially categorical nature of differential gene expression, by using latent categorical data that can be interpreted as a gene being turned 'on' or 'off' compared with normal expression
Microarray standard data set and figures of merit for comparing data processing methods and experiment designs
, 2003
ExpressYourself: A modular platform for processing and visualizing microarray data
- Nucleic Acids Res
, 2003
Cited by 9 (5 self)
DNA microarrays are widely used in biological research; by analyzing differential hybridization on a single microarray slide, one can detect changes in mRNA expression levels, increases in DNA copy numbers and the location of transcription factor binding sites on a genomic scale. Having performed the experiments, the major challenge is to process large, noisy datasets in order to identify the specific array elements that are significantly differentially hybridized. This normally requires aggregating different, often incompatible programs into a multistep pipeline. Here we present ExpressYourself, a fully integrated platform for processing microarray data. In completely automated fashion, it will correct the background array signal, normalize the Cy5 and Cy3 signals, score levels of differential hybridization, combine the results of replicate experiments, filter problematic regions of the array and assess the quality of individual and replicate experiments. ExpressYourself is designed with a highly modular architecture so various types of microarray analysis algorithms can readily be incorporated as they are developed; for example, the system currently implements several normalization methods, including those that simultaneously consider signal intensity and slide location. The processed data are presented using a web-based graphical interface to facilitate comparison with the original images of the array slides. In particular, ExpressYourself is able to regenerate images of the original microarray after applying various steps of processing, which greatly facilitates identification of position-specific artifacts. The program is freely available for use at
A two-way semilinear model for normalization and analysis of cDNA microarray data
- J. Amer. Statist. Assoc
, 2005
Cited by 7 (4 self)
ABSTRACT A basic question in analyzing cDNA microarray data is normalization, the purpose of which is to remove systematic bias in the observed expression values by establishing a normalization curve across the whole dynamic range. A proper normalization procedure ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. We propose a two-way semi-linear model (TW-SLM) for normalization and analysis of microarray data. This method does not make the usual assumptions underlying some of the existing methods. For example, it does not assume that: (i) the percentage of differentially expressed genes is small; or (ii) there is symmetry in the expression levels of up- and down-regulated genes, as required in the lowess normalization method. The TW-SLM also naturally incorporates uncertainty due to normalization into significance analysis of microarrays. We use a semiparametric approach based on polynomial splines in the TW-SLM to estimate the normalization curves and the normalized expression values. We study the theoretical properties of the proposed estimator in the TW-SLM, including the finite sample distributional properties of the estimated gene effects and the rate of convergence of the estimated normalization curves when the number of genes under study is large. We also conduct simulation studies to evaluate the TW-SLM method and illustrate the proposed
Oligonucleotide microarray for the study of functional gene diversity in the nitrogen cycle in the environment
- Appl. Environ. Microbiol
, 2003
Cited by 5 (0 self)
The analysis of functional diversity and its dynamics in the environment is essential for understanding the microbial ecology and biogeochemistry of aquatic systems. Here we describe the development and optimization of a DNA microarray method for the detection and quantification of functional genes in the environment and report on their preliminary application to the study of the denitrification gene nirS in the Choptank River-Chesapeake Bay system. Intergenic and intragenic resolution constraints were determined by an oligonucleotide (70-mer) microarray approach. Complete signal separation was achieved when comparing unrelated genes within the nitrogen cycle (amoA, nifH, nirK, and nirS) and detecting different variants of the same gene, nirK, corresponding to organisms with two different physiological modes, ammonia oxidizers and denitrifying halobenzoate degraders. The limits of intragenic resolution were investigated with a microarray containing 64 nirS sequences comprising 14 cultured organisms and 50 clones obtained from the Choptank River in Maryland. The nirS oligonucleotides covered a range of sequence identities from approximately 40 to 100%. The threshold values for specificity were determined to be 87% sequence identity and a target-to-probe perfect match-to-mismatch binding free-energy ratio of 0.56. The lower detection limit was 10 pg of DNA (equivalent to approximately $10^7$ copies) per target per microarray. Hybridization patterns on the microarray differed between sediment samples from two stations in the Choptank River, implying important differences in the

