DMCA
BeadArray expression analysis using bioconductor
Cached
Download Links
Venue: | PLoS Comput. Biol |
Citations: | 1 - 0 self |
BibTeX
@ARTICLE{Dunning_beadarrayexpression,
author = {Mark Dunning and Wei Shi and Andy Lynch and Mike Smith and Matt Ritchie},
title = {BeadArray expression analysis using bioconductor},
journal = {PLoS Comput. Biol},
year = {},
pages = {10--1371}
}
OpenURL
Abstract
Introduction to this vignette This vignette describes how to process Illumina BeadArray gene expression data in its various formats; raw files produced by the scanning software as well as summarized BeadStudio/GenomeStudio output, and data deposited in a public microarray database. For each file format we introduce a series of 'use cases' in the order in which they might be encountered by an analyst, and demonstrate how to perform specific tasks on the example datasets using R code from Bioconductor or base packages. The R code is explained immediately afterwards and where appropriate we given an interpretation of the resultant output (indicated by the symbol), and pointers (Z) to how the R code might be adapted to other datasets or use cases. Where attention must be paid to avoid alarming results, we give warnings indicated by the symbol. It is assumed that the reader has familiarity with basic R data structures and operations such as plotting and reading text files, therefore explanations of these functions will not be given. The version of R used to compile this vignette is 2.15.0. The Bioconductor packages required can be installed by typing the following commands at the R command prompt: 1 > source ( " http : / / www . bioconductor . org / biocLite . R " ) 2 > biocLite ( c ( " beadarray " , " limma " , " GEOquery " , " illuminaHumanv1 . db " , 3 + " illuminaHumanv2 . db " , " illuminaHumanv3 . db " , " BeadArrayUseCases " )) There are also a few packages that are used to illustrate particular use case, but are not crucial to the vignette. They can be installed in advance, or as required. 1 > biocLite ( c ( " GOstats " , " GenomicRanges " , " Biostrings " )) Introduction to the BeadArray technology Illumina Inc. (San Diego, CA) are manufacturers of the BeadArray microarray technology, which can be used in genomic studies to profile transcript expression or methylation status, or genotype SNPs. The BeadArray platform makes use of 50mer oligonucleotide probes attached to beads that are randomly assembled and replicated many times on each array. Different probes are designed to target specific positions (transcripts/SNPs) in the genome. A series of decoding hybridizations performed in-house by Illumina determines the identity of each bead 1 View of a BeadArray WG-6 Expression BeadChip randomly assembled, replicate beads on each array. A BeadChip comprises of a series of rectangular BeadArray sections on a slide, with each section (sometimes referred to as a strip) containing many thousands of different probes (see 2 Experimental data We make use of three datasets in this vignette (summarized in The first example consists of bead-level data from a series of Human HT-12 version 3 BeadChips hybridized at the Cancer Research UK, Cambridge Research Institute Genomics Core facility. 'Bead-level' refers to the availability of intensity and location information for each bead on each BeadArray in an experiment. In this dataset, BeadArrays were hybridized with either Universal Human Reference RNA (UHRR, Stratagene) or Brain Reference RNA (Ambion) as used in the MAQC project The second dataset is the AsuragenMAQC_BeadStudioOutput.zip file from Illumina's website http://www.switchtoi.com/datasets.ilmn This experiment uses a Human WG-6 version 2 BeadChip, and consists of 3 replicates of each of the UHRR and Brain Reference samples. These data are available at the summary level as generated by Illumina's BeadStudio software. The Bioconductor packages lumi, limma and beadarray can all handle data in this format. For this dataset, we focus on tools available in the limma package. The final dataset, which is also at the summary level, is publicly available from the GEO database (accession GSE5350) and was deposited by the MAQC project The goal of each analysis is to find differentially expressed probes between the two distinct RNA samples included in each experiment (UHRR and Brain Reference). For the differential expression analysis, we make use of the linear modelling approach available in the limma package. 3 Analysis of bead-level and raw data using beadarray Raw and bead-level data types Reading raw or bead-level data into R using the beadarray package requires several files produced by Illumina's scanning software. We briefly describe these files below. • text files (required -unless bab files present) -a text file (.txt or .csv) for each section which stores the position, identity and intensity of each bead. These files are usually named chipID_section.txt for arrays from BeadChips (e.g. 4613710017_B.txt) and are required because of the random arrangement of probes on the array surface that is unique for each BeadArray. • locs files (optional) -1 (single channel) or 2 (two-color) for each section on a BeadChip. These are usually named using the convention chipID_section_channel.locs. The locs file stores the locations of all beads on the array, including all those that could not be decoded (beads present on the array, but not in the text files). The locs files are particularly useful for investigating spatial phenomena on the arrays. • bab files (optional) -one for each section of a BeadChip. These files contain all of the information from the text and locs files, stored in a substantially more efficient manner [2]. • tiff files (optional) -1 (single channel) or 2 (two-color) for each section on a BeadChip. These are usually named using the convention chipID_section_channel.tif. For example, 4613710017_B_Grn.tif is the Cy3 (green) image for the sample in position B on BeadChip 4613710017. The Cy5 (red) files end in _Red.tif. We refer to these as the raw data, as access to these images allows the user to carry out their own image processing, and provides access to the background intensities (the values stored in the text files have already been background corrected). • sdf file (optional) -Illumina's sample descriptor file (one per BeadChip, e.g. 4613710017.sdf). This file is used to determine the physical properties of a section and which sections to combine for each sample. • targets file (recommended) -contains sample information for each array. See the file targetsHT12.txt for an example. 4 • metrics file (optional) -one for each BeadChip (usually named Metrics.txt) which contains summary information about intensity, the amount of saturation, focus and registration on the image(s) from each section. To obtain the tiff and text files from Illumina's BeadScan software version 3.1 you will need to modify the settings.xml file used by the software. For further details see the Scanning Settings section of http://www.compbio.group.cam.ac.uk/Resources/illumina/. For iScan, the steps are similar, although there is a separate settings file for each platform (expression, Infinium genotyping, etc). Quality assessment using scanner metrics The first view of array quality can be assessed using the metrics calculated by the scanner. These include the 95th (P95) and 5th (P05) quantiles of all pixel intensities on the image. A signal-to-noise ratio (SNR) can be calculated as the ratio of these two quantities. These metrics can be viewed in real-time as the arrays themselves are being scanned. By tracking these metrics over time, one can potentially halt problematic experiments before they even reach the analysis stage. Use Case: Plot the P95:P05 signal-to-noise ratio for the HT-12 arrays in this experiment and assess whether any samples appear to be outliers that may need to be removed or down-weighted in further analyses. 1 > ht12metrics <-read . table ( system . file ( " extdata / Chips / Metrics . txt " , 2 + package = " BeadArrayUseCases " ) , sep = " \ t " , header = TRUE , 3 + as . is = TRUE ) 4 > ht12snr <-ht12metrics $ P95Grn / ht12metrics $ P05Grn 5 > labs <-paste ( ht12metrics [ , 2] , ht12metrics [ , 3] , sep = " _ " ) 6 > par ( mai = c (1.5 , 0.8 , 0.3 , 0.1)) 7 > plot (1:12 , ht12snr , pch = 19 , ylab = " P95 / P05 " , xlab = " " , 8 + main = " Signal -to -noise ratio for HT12 data " , axes = FALSE , 9 + frame . plot = TRUE ) 10 > axis (2) 11 > axis (1 , 1:12 , labs , las = 2) The above code uses standard R functions to obtain the P95 and P05 values from the metrics file stored in the package. The system.file function is a base function that will locate the directory where the BeadarrayUseCases package is installed. The plotting commands are just a suggestion of how the data could be presented and could be adapted to individual circumstances. The SNR for these arrays seems to be over 15 in most cases, although there is one exception that is below 2. Illumina recommend that this ratio be above 10 and in our experience a value below 2 would be grounds for discarding a sample from further analysis. The P95 value for this sample is typical of what we would expect from a blank array with no RNA hybridized! Z Scanner metrics information is equally as useful for sample quality assessment of summarized data. 5 Z The P95 and P05 values will fluctuate over time and are dependant upon the scanner setup. Including SNR values for arrays other than those currently being analysed will give a better indication of whether any outlier arrays exist. Data import and storage The next step in our analysis is to read the data into R using the readIllumina function. The bead-level data you will need for this example are available in the file beadlevelbabfiles.zip (133 MB) which is located in the extdata/BeadLevelBabFiles directory of the BeadArrayUseCases package. These files were produced using the BeadDataPackR package in Bioconductor which provides a more compact representation of bead-level data. Use Case: Read the sample information and bead-level data (stored in compressed bab format) into R. 1 > library ( beadarray ) 2 > chipPath <-system . file ( " extdata / Chips " , package = " BeadArrayUseCases " ) 3 > list . files ( chipPath ) 4 > sampleSheetFile <-paste ( chipPath , " / sampleSheet . csv " , 5 + sep = " " ) 6 > readLines ( sampleSheetFile ) 7 > data <-readIllumina ( dir = chipPath , sampleSheet = sampleSheetFile , 8 + useImages = FALSE , illuminaAnnotation = " Humanv3 " ) Usually the directory will be in a known location, but for convenience we use the system.file function this directory and the sample sheet. The section names to be read are then constructed using the contents of the sample sheet. The final line executes the reading of the data. By default readIllumina will look in the current working directory and try to find all BeadArray-associated files. The function will read any text (or .bab) and TIFF files (if instructed to do so) and save the names of any locs and sdf files for future reference. Here we have used the dir and sectionNames arguments to specify what directory to look in and the names of the sections that should be imported. By setting useImages to FALSE, we use the intensity values stored in the text files, rather than recomputing them from the images. The argument illuminaAnnotation is a character string to identify the organism and annotation revision number of the chip being analysed, but not the number of samples on the chip. Hence the value Humanv3 can be used for both Human WG-6 and HumanHT12 v3 data. Setting this value correctly will allow beadarray to identify what bead types are to be used for control purposes and convert the numeric ArrayAddressIDs to a more commonly-used format. If you are unsure of the correct annotation to use, this argument can be left blank at this stage. Z If the data to be imported are not in .bab format, specifying the same arguments to readIllumina will search for files of type .txt instead. Use of the sectionNames argument is recommended when the directory containing beadlevel data contains files other than those produced by the Illumina scanner. Extraneous files in such directories may confuse the readIllumina function. 6 Storing and reading bead-level data requires a considerable amount of disk space and RAM. For this example, around 1.1 GB of RAM is required to read in and store the data. One could process bead-level data in smaller batches if memory is limited and then combine at the summary-level. Selecting the annotation for a dataset If you are unsure of the correct annotation to use and thus left the illuminaAnnotation argument to readIllumina as the default, suggestAnnotation can be employed to identify the platform, based on the ArrayAddressIDs that are present in the data. The setAnnotation function can then be used to assign this annotation to the dataset. Use Case: Verify the version number of the dataset that has been read in and set the annotation of the bead-level data object accordingly. 1 > suggestAnnotation ( data , verbose = TRUE ) 2 > annotation ( data ) <-" Humanv3 " You should see that the percentage of overlapping IDs is greatest for the Humanv3 platform. Z The result of suggestAnnotation only gives guidance about which annotation to use. Hence, the results may be unpredictable on custom arrays, or arrays that are not listed in the suggestAnnotation output. The beadLevelData class Once imported, bead-level data are stored in an object of class beadLevelData. This class can handle raw data from both single-channel and two-color BeadArray platforms. > slotNames ( data ) The command above gives us an overview of the structure of the beadLevelData class, which is composed of several slots. The experimentData, sectionData and beadData slots can be viewed as a hierarchy, with each containing data at a different level. Each can be accessed using the @ operator. The experimentData slot holds information that is consistent across the entire dataset. Quantities with one value per array-section are stored in the sectionData slot. For instance, any metrics information read by readIllumina, along with section names and the total number of beads, will be stored there. This is also a convenient place to store any QC information derived during the preprocessing of the data. The data extracted from the individual text files are stored in the beadData slot. Accessing data in a beadLevelData object Use Case: Output the data stored in the sectionData slot, and determine the section names and number of bead intensities available from each section. Access the intensities, x-coordinates 7 and probe IDs for the first 5 beads on the first array section. Using the @ operator to access the data in particular slots is not particularly convenient or intuitive. The functions sectionNames, numBeads and getBeadData provide more convenient interfaces to the beadLevelData object to retrieve specific information. The first line above uses the @ operator to access all the data in the sectionData slot, which can be quite large and unwieldy. The commands below it (lines 2-3) are accessor functions for retrieving a specific subset of data from the same slot. Line 4 shows that if a beadLevelData object is accessed in the same fashion as a list, a data frame containing the bead-level data for the specified array is returned. To access a specific entry in this data frame, we can use a further subset, or the data can be accessed using the getBeadData function. In addition to the beadLevelData object, you need to specify the section (array=...) of interest and the column heading you want. Extracting transformed data In this example, the data stored in the beadLevelData object by readIllumina are extracted directly from the Illumina text files. The values in the Grn vector are intensity values inferred from a known location in the scanned image and there are a number of steps involved before these can be translated into quantities that relate to the expression levels. The scanner generally produces values in the range 0 to 2 16 −1, although the image manipulation and background subtraction steps can lead to values outside this range. This is not a convenient scale for visualization and analysis and it is common to convert intensities onto the approximate range 0 to 16 using a log 2 transformation (possibly after an additional step to adjust non-positive intensities). Although this is simple to do in isolation using R's built in functions, it becomes more complicated within a function, leading to a large number of arguments being required in order to specify whether the function should process the green or red channel, use foreground or background intensities, convert to the log 2 scale etc. Even in this situation the user is restricted to the options that are provided by the arguments. A more flexible way to obtain transformed per-bead data from a beadLevelData object is to define a transformation function that takes as arguments the beadLevelData object and an array index. The function then manipulates the data in the desired manner and returns a vector the same length as the number of beads on the array. Many of the plotting and quality assessment functions within beadarray take such a function as one of their arguments. By using such a system, beadarray provides a great deal of flexibility over exactly how the data is analysed. 8 Use Case: Extract the green intensities on the log 2 scale for the first 10 probes from the first array section. The above example shows three different ways of obtaining the log green channel intensity data. Lines 1 and 2 use R's log2 function on data extracted using the methods we've already seen. The third entry uses one of beadarray's built in transformation functions. To view an example of how a transformation function is defined you can enter the name of one of beadarray's pre-defined functions without any parentheses or arguments. 1 > l o g G r e enCh annel Transf orm function ( BLData , array ) { x = getBeadData ( BLData , array = array , what = " Grn " ) return ( log2 . na ( x )) } < environment : namespace : beadarray > Z In addition to the logGreenChannelTransform function shown above, beadarray provides predefined functions for extracting the green intensities on the unlogged scale (greenChannelTransform), analogous functions for two-channel data (logRedChannelTransform, redChannelTransform), and functions for computing the logratio between channels (logRatioTransform). Analysis of raw data when the images are available The bead-level expression intensity values that Illumina's software provides (i.e. those stored in the .txt or .bab files) are the result of a certain amount of preprocessing and so are not strictly the raw data. In most situations, these values are sufficient for our use, but we may on occasion wish to begin from the image file, either to reassure ourselves that there are no concerns or to address a problem that has become manifest. It is important then that we understand what preprocessing steps Illumina apply by default. These are well-documented elsewhere, but to summarize: a local background value is calculated by taking the mean of the five lowest intensity pixels from a square around the bead. A filter is then applied to the image (to concentrate the intensities in the centre of beads) and foreground values are calculated as a weighted sum of the intensities in a 4 × 4 square around the bead centre. It is worth noting that the filter applied to the image, a sharpening filter, contrasts the value of a pixel with the pixels surrounding it, and as such itself could be viewed as a background correction step. The final intensity for a bead is then calculated as the foreground value minus the local background value. By starting from the image we can adjust any of these steps (e.g. for the background we can adjust the size of the area around the bead or the function applied to it, for the foreground we can change the filtering step or adjust the pixel weighting scheme, or finally we can use a more sophisticated measure of intensity than 'foreground minus local background' to avoid negative 9 values. beadarray includes three functions that closely mirror the processing performed by Illumina: illuminaBackground, illuminaSharpen and illuminaForeground. We will illustrate a change to the Illumina process that we recommend if beginning from the image. Use Case: Identify abnormally low intensity pixels and then plot a section of the image that illustrates the benefit of adjusting the standard analysis. There are three pixels of value zero in the image that must be an imaging artefact rather than a true measure of intensity for that location (note that we add an offset of 1 to avoid taking the log of zero). These pixels will bias the estimate of background for all nearby beads and so we will try a more robust estimate of the background. Rather than using the mean of the five lowest pixel values, we will use the median (equivalently, the third lowest pixel value). This is done using the medianBackground function. Use Case: Calculate a robust measure of background for this array and store it in a new slot "GrnRB". 1 > Brob <-medianBackground ( TIFF , cbind ( xcoords , ycoords )) 2 > data <-insertBeadData ( data , array = 2 , what = " GrnRB " , 3 + Brob ) The medianBackground function takes two arguments, the first of which is the image itself and the second is a two-column data frame specifying the bead-centre coordinates. We then use insertBeadData to add the new values to the existing beadLevelData object. Because the presence of extremely low intensity beads is a known issue, the authors have provided the medianBackground function within beadarray to perform an alternative local 10 background correction. However the same methodology can be used to perform the image processing in any way the user sees fit.