mirWIP: microRNA target prediction based on microRNA-containing ribonucleoprotein-enriched transcripts. (2008)
Venue: | Nat. Methods. |
Citations: | 45 - 5 self |
BibTeX
@ARTICLE{Hammell08mirwip:microrna,
author = {Molly Hammell and Dang Long and Liang Zhang and Andrew Lee and Steven Carmack and Min Han and Ye Ding and Victor Ambros},
title = {mirWIP: microRNA target prediction based on microRNA-containing ribonucleoprotein-enriched transcripts.},
journal = {Nat. Methods.},
year = {2008},
pages = {10--1038}
}
OpenURL
Abstract
Target prediction for animal microRNAs (miRNAs) has been hindered by the small number of verified targets available to evaluate the accuracy of predicted miRNA-target interactions. Recently, a dataset of 3,404 miRNA-associated mRNA transcripts was identified by immunoprecipitation of the RNA-induced silencing complex components AIN-1 and AIN-2. Our analysis of this AIN-IP dataset revealed enrichment for defining characteristics of functional miRNA-target interactions, including structural accessibility of target sequences, total free energy of miRNA-target hybridization and topology of base-pairing to the 5¢ seed region of the miRNA. We used these enriched characteristics as the basis for a quantitative miRNA target prediction method, miRNA targets by weighting immunoprecipitation-enriched parameters (mirWIP), which optimizes sensitivity to verified miRNA-target interactions and specificity to the AIN-IP dataset. MirWIP can be used to capture all known conserved miRNA-mRNA target relationships in Caenorhabditis elegans at a lower falsepositive rate than can the current standard methods. The discovery of miRNAs 1 and their roles in post-transcriptional gene regulation has added a new dimension to the study of animal development and disease 2 . miRNAs, bound to their mRNA targets, can repress gene expression through translational inhibition or by mRNA destabilization 3 . Under some conditions, miRNAs may also promote protein production from a target mRNA 4 . Animal miRNAs have a role in regulating many developmental processes and have been implicated in human disease pathways 5 . For these reasons, it is crucial to efficiently identify the functionally important mRNA targets of miRNAs. Target prediction for miRNAs in plants is straightforward, as plant miRNAs bind with near-perfect complementarity to their target mRNAs. In animals, miRNAs interact with their targets predominantly by partial base-pairing, and the rules that govern the formation and functional efficacy of miRNA-mRNA interactions are not fully understood. Depending on the computational algorithm applied, the number of predicted targets for a given miRNA can range from dozens to hundreds and even thousands of genes 6,7 . The thorough experimental testing of such vast numbers of predicted targets using labor-intensive transgenic reporter assays is impractical. There remains the need both for more accurate computational methods to identify functional miRNA-target interactions and for more efficient methods to experimentally validate miRNA-target interactions in vivo. Many computational methods have been developed to predict miRNA targets (reviewed in ref. 7). The criteria for target prediction vary widely, but often include (i) strong Watson-Crick basepairing of the 5¢ seed of the miRNA (nucleotide positions 2-8 of the miRNA) to a complementary site in the 3¢ untranslated region (UTR) of the mRNA, (ii) conservation of the miRNA binding site, (iii) favorable minimum free energy (MFE) for the local miRNAmRNA interaction, and/or (iv) structural accessibility of the surrounding mRNA sequence. Experimental support exists for each of these binding-site features, but the relative importance of each feature and how they interact to contribute to function remains uncertain. Moreover, it is likely that other important parameters for functional miRNA-target interactions remain to be identified. The principle of 5¢ seed primacy in miRNA-target binding is well supported by experimental data. Many genetically validated miRNA-target interactions involve uninterrupted Watson-Crick base-pairing in the 5¢ region of the miRNA. Experiments show that G-U wobble pairs and bulges within this seed region can significantly disrupt repression of reporter constructs 8 and that perfectly matched seed regions are significantly enriched in the 3¢ UTRs of transcripts whose levels decrease in response to miRNA overexpression 9 . However, other experimental data suggest that perfectly matched miRNA seeds are neither necessary nor sufficient for all functional miRNA-target interactions. For instance, three of the genetically verified let-7 targets in C. elegans-lin-41, pha-4 and let-60-contain only imperfect binding sites, with G-U wobble pairs or bulges in the seed region [10][11][12] . Two recent studies using immunoprecipitation of miRNA-containing ribonucleoprotein complexes indicate that only 30-45% of miRNAs associated with these complexes contain perfectly matched, conserved seed elements in their 3¢ UTRs 13,14 . There is thus a need for target prediction algorithms that accurately incorporate modified 5¢ seed rules. The conservation of sequences among multiple genomes has been invaluable in identifying functional regulatory elements in genomes. Most computational methods for predicting miRNA targets include an evolutionary conservation filter, often requiring strict alignment of seed-complementary sequences across multiple genomes 7 . However, many miRNA binding sites that do not fit this strict definition could still be functionally important. For example, 40% of the verified miRNA targets in C. elegans reside within 3¢ UTRs that align poorly between C. elegans and Caenorhabditis briggsae (for example, the let-7 target sites in die-1, lss-4 and pha-4 (ref. 11); let-60 (ref. 12); and nhr-23 and nhr-25 (ref. 15)). If the requirement for strict alignments is ignored in these cases, conserved sites for let-7 can be found in the orthologous 3¢ UTRs, indicating evolutionary selection for a functional miRNA-target interaction. Indeed, in the case of the regulatory relationship between let-7 and let-60 (ref. 12), the presence of let-7 sites is conserved between worms and humans, although the sequence context of the sites is too divergent for strict alignment. Many miRNA target prediction methods have incorporated MFE calculations to identify energetically stable base-pairing between a miRNA and its target sequence [16][17][18][19][20][21] . Some methods also include estimates of the structural accessibility of miRNA binding sites in the mRNA targets [18][19][20][21] , and more recent methods join the two features into a single calculation 20,21 . Notably, the incorporation of target structure into calculations of the free energy of miRNAtarget interactions can distinguish between a set of targets that tested positive for miRNA-mediated repression and a set that were refractory to miRNA-mediated repression 20 . However, current prediction methods vary widely in how energy and accessibility estimates are incorporated into their calculations. Two studies 18,19 consider accessibility of the binding sites but differ in the amount of mRNA sequence used to calculate that parameter. Two more recent studies 20,21 combine energy and accessibility calculations into a single prediction parameter but vary in the length of sequence and the method used to calculate accessibility. Further algorithm development is required to determine the optimal involvement of accessibility and binding energy in miRNA-target interactions. Optimizing algorithms based on sequence features alone has been complicated by the lack of a large dataset of verified miRNAtarget relationships. The number of targets that have been tested by rigorous genetic or reporter assays in various organisms has increased, but the assays vary in terms of how closely they model the endogenous characteristics of the interaction being tested 7 . Genome-scale datasets linking specific miRNAs to specific mRNA targets have emerged from microarray hybridization experiments that assay mRNA transcript levels after introduction of a particular miRNA by transfection 9, We found several contextual features of miRNA binding sites that were enriched in sites in the AIN-IP set of transcripts: structural accessibility of target sequences, total free energy of miRNA-target hybridization and topology of base-pairing to the 5¢ seed region of the miRNA. We used these features to develop a miRNA target prediction algorithm, mirWIP, that scores miRNA target sites by weighting site characteristics in proportion to their enrichment in the experimental AIN-IP set. MirWIP has improved overall Binding site scores were then combined into total miRNA family scores for each target, estimating the likelihood that a given transcript is regulated by a particular miRNA family. Finally, the miRNA family scores were combined into a total target score for each transcript, estimating the likelihood that a given transcript is regulated by a miRNA. S, 5¢ seed matching; A, upstream structural accessibility; E, total energy; I, initial; F, final. 814 RESULTS Initial target prediction We used RNAhybrid 17 with modifications (Supplementary Methods online) to generate a list of all miRNA-target matches in C. elegans and C. briggsae. We filtered this initial set of raw miRNA target matches on the basis of minimal free energy, phylogenetic conservation and seed pairing configuration (Supplementary Methods and 5¢ seed match features enriched in AIN-IP transcripts Extensive 5¢ seed pairing shows the best enrichment for AIN-IP targets over all other transcripts assayed AIN-IP binding sites are structurally accessible We used the Sfold method 24 to fold whole 3¢ UTR sequences plus 300 nucleotides of adjacent coding sequence for all predicted C. elegans transcripts. The Sfold output returns the probability that each nucleotide in the 3¢ UTR is predicted to be singlestranded (that is, accessible). We used this output to calculate the average accessibility over 25-nucleotide windows around and including each potential miRNA binding site. The average structural accessibility in upstream sequence windows shows the best enrichment for AIN-IP transcripts AIN-IP binding sites are more energetically favorable Hybridization between a miRNA and a structured mRNA target involves two major components: DG hybrid , the stability (hybrid free energy) of the miRNA-target duplex, and DG disruption , the cost of altering the local structure of the mRNA target 20 . For a successful hybridization, the net energy of the process, DG total ¼ DG hybrid -DG disruption , must be thermodynamically favorable (that is, negatively valued). The binding sites in AIN-IP structures were strongly enriched for highly favorable values of DG total Individual binding site scores were assigned in a two-step process ( After scoring all sites using the postfilter enrichments, we sought to again filter out the relatively low-scoring individual binding sites while calculating scores for 3¢ UTR targets The target scores varied from 2 to 400, with the highest scores going to lin-14 and hbl-1, two of the first identified miRNA targets in C. elegans. Comparison of mirWIP performance to other methods We compared our algorithm to the three most commonly used target prediction methods for C. elegans: PicTAR 16 , TargetScanS 22 and miRanda 26 . We also included rna22 (ref. 6) in our comparisons as this method does not use any of the typical prediction criteria (seed matching, conservation, energy or structure). Finally, we included a recent method, PITA 21 , that is similar to our technique in that PITA also uses seed, structure and energy calculations to predict target transcripts, but without sequence conservation (PITA online release 5, with the suggested threshold DDG o -10 kcal/ mol). We selected these methods to show the improvements gained by using our AIN-IP-derived weights and our particular combination of contextual features. We used two metrics to compare the performance of mirWIP to that of the other algorithms. First, we considered the ability of the algorithms to return the experimentally verified C. elegans miRNAtarget matches listed in Supplementary Table 1. (Although this dataset is small, it represents the strictest test of the sensitivity of miRNA prediction methods and is a true experimental validation set for mirWIP, as these sites were not included in our enrichment analysis.) We compared this to the percentage of predicted targets that are not in either the AIN-IP or verified target list-an estimated maximum false-positive rate. A receiver-operator characteristic plot In a second estimate of algorithm specificity Overlap among miRNA prediction methods We next compared the overlap in predicted miRNA-target interactions for mirWIP and each of the five methods described above Overall, there was only modest overlap among the six methods in the sets of miRNA-target interactions predicted. Approximately 25% of the specific miRNA-target interactions predicted by mirWIP were shared with at least one of the five other methods. However, there was better agreement among these methods in terms of the mRNAs predicted to be targeted by miRNAs in general. That is, 96% of the genes in the mirWIP catalog were also predicted to be targets of miRNAs by at least one of the other methods. In other words, these prediction methods agree about many of the genes targeted by miRNAs but disagree about which miRNA is regulating that gene. Notably, 27% of the verified miRNA-target interactions were in that set of predictions unique to mirWIP. Analysis of falsely rejected AIN-IP targets The mirWIP algorithm identified 79% of the AIN-IP transcripts on the basis of conserved binding sites in the 3¢ UTR Finally, an additional 10% of the AIN-IP genes do not have an ortholog in which to look for conserved binding sites 28 . There may be many nonconserved binding sites for known miRNAs in this group, as well as conserved binding sites for unknown miRNAs. Relaxing the already lenient orthology filter, however, would lead to an unacceptable false-positive rate as conservation is one of the strongest filters in the algorithm. DISCUSSION The AIN-IP set of miRISC-associated mRNA transcripts represents the largest currently available set of true miRNA targets identified from their endogenous context. This target list is not biased by selection from a particular target prediction method, allowing a fair comparison across methods. The large number of targets in the AIN-IP list allowed for a statistical analysis of both sequence and structural features associated with regulation by the miRISC complex. We found that AIN-IP transcripts are enriched for miRNA complementary sites and that certain features of the miRNA binding sites are strongly enriched. These features include a range of 5¢ seed base-pairing configurations, structural accessibility of the binding site and an upstream region, and favorable total interaction energy of the miRNA-mRNA hybridization. These findings are consistent with previous reports on the importance of both canonical and noncanonical seed matches [8][9][10][11] Among the mirWIP-predicted targets, 40% were identified by the AIN-IP method; 60% of the mirWIP-predicted transcripts were not stably associated with AIN proteins in the miRISC. Many of these non-AIN-IP transcripts could represent false-positive predictions by mirWIP, which would imply a lower bound of 40% for our true-positive predicted fraction. However, for several reasons, we believe that a substantial portion of these non-AIN-IP transcripts represent bona fide miRNA targets. First, the strict cutoff implemented in defining the AIN-IP list 23 may have removed many true targets. Second, we expect the sensitivity of the AIN-IP method to be poor for interactions that involve a small fraction of the total population of the target mRNA. For example, some interactions may occur only transiently and/or in a limited number of cells in the animal, as is the case for lsy-6 and cog-1 (ref. 29). Third, the AIN-IP method is likely to be most effective at recovering stable miRNA-mRNA complexes and is expected to recover unstable mRNAs much less efficiently. Some miRNAs regulate their targets on the level of mRNA stability 30 , and such miRNA-mRNA complexes would be relatively short-lived and poorly detected by microarray hybridization. Finally, 4 of the 14 genetically validated miRNA targets were not in the AIN-IP list (29%). This suggests that as many as 29% of the mirWIP predictions are true miRNA targets that were not identified by AIN-IP. By this estimate, an upper bound on our positive prediction rate could be as high as 70%. Analysis of additional experimental datasets should improve the sensitivity and specificity of mirWIP target predictions. For example, analysis of miRISC-associated RNAs from populations of developmentally staged worms or specific cell types should help reduce the noise associated with averaging regulatory interactions over all stages and tissues. Moreover, mirWIP in its current form is supported by immunoprecipitation experiments that identify transcripts by their probable association with miRNAs, but these experiments do not directly provide information about what particular miRNA or set of miRNAs is responsible for miRISC association. The immunoprecipitation of miRISC proteins from animals lacking a specific miRNA would allow us to match individual miRNAs to the targets they regulate. One such experiment 13 was conducted with a tagged version of Argonaute in Drosophila, significantly enriching for a small number of targets for dme-miR-1. Similar experiments can be applied to C. elegans, where a comprehensive set of miRNA mutants is available. Finally, because the miRISC immunoprecipitation approach may be biased toward the identification of stable miRNA-target complexes, miRNA-induced target destabilization can be screened using complementary datasets, such as microarray assays to identify mRNA transcripts that change in response to miRNA activity. METHODS miRNA target identification. We used the RNAhybrid algorithm 17 to identify the raw list of possible miRNA matches in the set of orthologous 3¢ UTRs of C. elegans and C. briggsae, with a few modifications (see Supplementary Methods). Subsequent filtering and scoring of miRNA sites, and the derivation of methods for combining site scores to produce target (3¢ UTR) scores, are described in Supplementary Methods and shown in Supplementary Structural accessibility calculations. We use the Sfold method 24 to fold 3¢ UTR sequences for all C. elegans transcripts, plus 300 nucleotides of coding sequence adjacent to the stop codon. Details of accessibility calculations and lengths of sequences examined are given in Supplementary Methods. Total interaction energy calculations. The calculations for DG total were separate from the average accessibility calculations described above, but we also used the predicted accessibilities as follows. We used the predicted structures for each binding site, calculating the energy necessary to disrupt any bound nucleotides in that region (DG disruption ). We then added this disruption energy to the minimal free energy, DG hybrid , to obtain the total interaction energy, DG total . Statistical analysis. We estimated the significance of the prefilter enrichments for seed, structural accessibility measures and total free energy