Literature DB >> 19390574

Features of mammalian microRNA promoters emerge from polymerase II chromatin immunoprecipitation data.

David L Corcoran1, Kusum V Pandit, Ben Gordon, Arindam Bhattacharjee, Naftali Kaminski, Panayiotis V Benos.   

Abstract

BACKGROUND: MicroRNAs (miRNAs) are short, non-coding RNA regulators of protein coding genes. miRNAs play a very important role in diverse biological processes and various diseases. Many algorithms are able to predict miRNA genes and their targets, but their transcription regulation is still under investigation. It is generally believed that intragenic miRNAs (located in introns or exons of protein coding genes) are co-transcribed with their host genes and most intergenic miRNAs transcribed from their own RNA polymerase II (Pol II) promoter. However, the length of the primary transcripts and promoter organization is currently unknown.
METHODOLOGY: We performed Pol II chromatin immunoprecipitation (ChIP)-chip using a custom array surrounding regions of known miRNA genes. To identify the true core transcription start sites of the miRNA genes we developed a new tool (CPPP). We showed that miRNA genes can be transcribed from promoters located several kilobases away and that their promoters share the same general features as those of protein coding genes. Finally, we found evidence that as many as 26% of the intragenic miRNAs may be transcribed from their own unique promoters.
CONCLUSION: miRNA promoters have similar features to those of protein coding genes, but miRNA transcript organization is more complex.

Entities:  

Mesh:

Substances:

Year:  2009        PMID: 19390574      PMCID: PMC2668758          DOI: 10.1371/journal.pone.0005279

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

MicroRNAs (miRNAs) are short, ∼22 nt, single-stranded RNAs that act as regulators of genes' expression. By virtue of base complementarity, they bind to their target gene mRNAs and can block translation or accelerate their degradation [1]. miRNAs have been implicated in a variety of human diseases [2], [3] and more recent studies showed their association with particular cellular pathways [4]. Although miRNA genes play an important role in many biological processes, little is known about their transcriptional regulation. Currently, it is believed that most miRNA genes are transcribed by RNA polymerase II (Pol II) [5], [6], although some exceptions exist [7]. A first step toward understanding miRNA regulation is to identify their transcription start sites (TSSs). Currently, only a small number of human miRNA genes have confirmed TSSs [5], [8], [9], which is insufficient for studying the promoter sequence features and for comparison with protein coding genes. Due to this lack of information, all studies attempting to analyze the miRNA core promoters have focused on the area immediately upstream of the computational prediction of the pri-miRNA [10], [11], [12]. While these regions exhibit similar conservation patterns to the promoters of protein coding genes [13] their potential to act as promoters is still unknown. Identifying the active core miRNA promoters will thus allow us to study particular pri-miRNA characteristics such as transcript length and core promoter features. Recently, two studies that utilized high-throughput genomic techniques offered a first glimpse into the likely location and sequence characteristics of human miRNA TSSs [14], [15]. In addition, two other studies involving high-throughput data from mouse and C. elegans offered insights on miRNA gene transcription in these species [14], [16]. Existing algorithms for modeling Pol II core promoters vary both in methodology and in performance. Previous algorithms have used transcription factor binding site frequencies [17], the size and location of CpG islands [18], and the physical properties of the DNA. Algorithmically, neural networks [19], relevance vector machines [20], and additive logistic regression with boosting have been applied [21]. To better understand the transcriptional regulation of miRNAs we performed chromatin immunoprecipitation (ChIP)-chip for the Pol II complex using a custom designed miRNA location array. After comparing different DNA features, we developed an efficient Support Vector Machine (SVM) based method for Pol II core promoter classification (Core Promoter Prediction Program, CPPP). We applied these tools to identify miRNA TSSs, better understand how intergenic and intragenic (i.e., intronic or exonic) miRNA genes are transcribed and to compare the features of their promoters with those of the protein coding genes.

Results and Discussion

Identification of regions containing pri-miRNA TSSs from Pol II ChIP-chip data

To identify the TSS for pri-miRNAs, ChIP-chip was performed on A549 lung epithelial cells with a Pol II-specific antibody, as described in . Statistical analysis [22] was used to identify windows of 1 Kb in length that exhibit significant Pol II signals (immunoprecipitated DNA vs. background). The nearest statistically significant window to the 5′ end of each of the 531 known pre-miRNAs was further analyzed with our algorithm to predict whether it contained the miRNA TSS. The custom-made tiling array we used included 50 Kb upstream of each known miRNA gene (see ). This distance threshold is consistent with previous studies that showed high correlation of expression between miRNA genes located up to 50 Kb apart [23]. Our method resulted in 34 intergenic pre-miRNAs or polycistronic pri-miRNAs having a statistically significant Pol II signal associated with them ( ). Regions with a significant Pol II signal that also overlapped the 5′ end of a known gene (as identified by the UCSC table browser [24], [25]) were excluded from subsequent analysis. This was necessary because the ChIP-chip data cannot distinguish shared core promoter regions. An example of the distribution of the Pol II binding signals and the identified TSS of the pri-miR-10a is presented in .
Table 1

Identification of promoters of intergenic miRNA genes.

miRNAChromosomal locationChIP-chip regionCPPP ModelPredicted TSS (CPPP)Distance
miR-200b∼miR-200a∼miR-429Chr1: 1092346 (+)[1082033, 1083782]CpG+10833338763
miR-34aChr1: 9134423 (−)[9162283, 9166532]CpG+916373329310
miR-101-1Chr1: 65296779 (−)[65304283, 65307532]CpG+653058339054
miR-181a-1∼miR-181b-1Chr1: 197094905 (−)[197125783, 197127032] CpG− none predicted NA
miR-202Chr10: 134911115 (−)[134919994, 134925743]CpG+1349248448879
miR-210Chr11: 558198 (−)[559355, 560354]CpG+ none predicted NA
miR-194-2∼miR-192Chr11: 64415487 (−)[64416605, 64418104]CpG−644169301193
miR-200c∼miR-141Chr12: 6943122 (+)[6940546, 6942545]CpG+69411461976
let-7iChr12: 61283732 (+)[61279796, 61291045]CpG+61283796506
miR-379∼miR411∼…∼miR-410∼miR-656Chr14: 100558155 (+)[100524119, 100525868] CpG− none predicted NA
miR-193bChr16: 14305324 (+)[14302031, 14310280]CpG+14304581743
miR-138-2Chr16: 55449930 (+)[55439531, 55441030]CpG−554398569824
miR-497∼miR-195Chr17: 6862065 (−)[6863309, 6865058]CpG−68647592444
miR-10aChr17: 44012308 (−)[44017059, 44018808]CpG+440177095401
miR-196a-1Chr17: 44064920 (−)[44078809, 44080558]CpG+4407950914589
miR-21 Chr17: 55273408 (+) [55267309, 55276558] CpG− 55271984 1174
miR-122Chr18: 54269285 (+)[54235566, 54236565]CpG−5423589133144
miR-23a∼miR-27a∼miR-24-2 Chr19: 13808473 () [13807348, 13809097] CpG− 13808448 0
miR-181c∼miR-181dChr19: 13846512 (+)[13832848, 13834847] CpG− none predicted NA
miR-99b∼let-7e∼miR125aChr19: 56887676 (+)[56882098, 56886347]CpG+ none predicted NA
miR-216a∼miR-217Chr2: 56069698 (−)[56072783, 56074282]CpG−560739333985
miR-301b∼miR-130bChr22: 20337269 (+)[20335283, 20337282]CpG+20336583686
let-7a-3∼let-7bChr22: 44887292 (+)[44879283, 44883032]CpG+448819335109
miR-206∼miR-133bChr6: 52117105 (+)[52096878, 52098877]CpG−5209845318402
miR-30aChr6: 72170045 (−)[72164628, 72176377]CpG−721742033908
miR-129-1Chr7: 127635160 (+)[127593752, 127595501]CpG+12759409241068
miR-183∼miR-96∼miR-182Chr7: 129202090 (−)[129206752, 129207751]CpG+1292072025112
miR-29b-1∼miR-29aChr7: 130212838 (−)[130219002, 130223501]CpG−1302230279939
miR-30d∼miR-30bChr8: 135886370 (−)[135913283, 135915782]CpG+13591413327763
let-7a-1∼let-7f-1∼let-7dChr9: 95978059 (+)[95966631, 95971380]CpG+959691319928
miR-181a-2∼miR-181b-2Chr9: 126494541 (+)[126459631, 126464380]CpG−12646083133460
miR-222∼miR-221ChrX: 45491474 (−)[45504862, 45507861]CpG−4550678215308
miR-542∼miR-450a-2∼miR-450a-1∼miR-450bChrX: 133503133 (−)[133502362, 133506611]CpG+1335057622629
miR-505ChrX: 138834056 (−)[138842362, 138844111]CpG+1388431229066

miRNA: miRNA gene symbol, multiple symbols designate cluster of co-expressed miRNAs; Chromosomal location: the chromosomal position and orientation of the miRNA gene; ChIP-chip region: the nearest region with a statistically significant peak; CPPP model: the CpG (CpG+) or non-CpG (CpG−) model used for the TSS prediction; Predicted TSS: TSS predicted by CPPP; Distance: the distance of the predicted TSS from the most 5′ pre-miRNA transcript. Bold letters designate previously verified TSSs.

Figure 1

Pol II ChIP-chip results for miR-10a.

The blue arrow represents the location and transcriptional direction of hsa-miR-10a. The red dashes represent the location and value of the ChIP-chip probes. TSS – transcription start site of this miRNA.

Pol II ChIP-chip results for miR-10a.

The blue arrow represents the location and transcriptional direction of hsa-miR-10a. The red dashes represent the location and value of the ChIP-chip probes. TSS – transcription start site of this miRNA. miRNA: miRNA gene symbol, multiple symbols designate cluster of co-expressed miRNAs; Chromosomal location: the chromosomal position and orientation of the miRNA gene; ChIP-chip region: the nearest region with a statistically significant peak; CPPP model: the CpG (CpG+) or non-CpG (CpG−) model used for the TSS prediction; Predicted TSS: TSS predicted by CPPP; Distance: the distance of the predicted TSS from the most 5′ pre-miRNA transcript. Bold letters designate previously verified TSSs. The miR-23amiR-27amiR24-2 cluster is probably the best-studied human intergenic pri-miRNA transcript. Lee et al. [5] have shown that the TSS for this cluster is located 124 nucleotides upstream of miR-23a, which our ChIP-chip data analysis confirmed. The ChIP-chip data was also able to confirm the previously reported pri-miRNA TSS listed by Fujita and Iba [26] for miR-21 ( ). The distance between the Pol II peaks and the location of the pre-miRNA varies substantially between genes, from a minimum of zero to a maximum of 40 Kb. The average and median values are 10.8 Kb and 8.7 Kb, respectively. The analysis of the Pol II ChIP-chip data also provided insights into how intragenic (intronic, exonic) miRNAs are transcribed. Currently, it is believed that they are transcribed along with their host gene [6]. Indeed, for many intragenic miRNAs the nearest significant upstream Pol II ChIP-chip peak region overlapped the 5′ region of their host gene ( ). These cases include a few miRNAs that were previously shown to be co-transcribed with their host gene, such as miR-146a [27] and the miR-17miR-18amiR-19amiR-20a∼miR-19b-1∼miR-92a-1 cluster [9]. Interestingly, the analysis found that some of the intragenic miRNA genes may be transcribed by their own promoter ( ), which was also observed in the recent analysis by Ozsolak et al. [15]. We note that in contrast with the promoters of intergenic miRNA genes, CpG islands were identified only in 3 of the 11 intragenic promoters. The distance between the Pol II peak and the beginning of the (intragenic) pre-miRNA gene also varies between zero and 41 Kb, but with a higher number of TSSs observed at longer distances (average and median = 19 Kb).
Table 2

Intragenic miRNAs who's nearest ChIP-chip peak overlaps the host gene's TSS.

miRNAHost GeneChromosomal locationChIP-chip region
miR-30e∼miR30c-1NFYCChr1: 40992613 (+)[40946783, 40950532]
miR-186ZRANB2Chr1: 71305987 (−)[71316783, 71320532]
miR-130aAK096335Chr11: 57165246 (+)[57161605, 57163604]
miR-148bCOPZ1Chr12 53017266 (+)[53004046, 53006295]
miR-26a-2CTDSP2Chr12: 56504742 (−)[56524546, 56528295]
miR-15a∼miR-16-1DLEU2Chr13: 49521338 (−)[49551648, 49555397]
miR-17∼miR-18a∼miR-19a∼miR-20a∼miR-19b-1∼miR-92a-1 C13orf25 v_1 Chr13: 90800859 (+) [90798648, 90800647]
miR-423CCDC55Chr17: 25468222 (+)[25467059, 25470058]
miR-301a∼miR-454FAM33AChr17: 54583364 (−)[54583809, 54589308]
miR-330EML2Chr19: 50834185 (−)[50833598, 50834597]
miR-26bCTDSP1Chr2: 218975612 (+)[218968033, 218974282]
miR-103-2PANK2Chr20: 3846140 (+)[3816001, 3820000]
miR-185C22orf25Chr22: 18400661 (+)[18387533, 18389782]
miR-191∼miR-425DALRD3Chr3: 49033146 (−)[49026104, 49038353]
miR-15b∼miR-16-2SMC4Chr3: 161605069 (+)[161598354, 161603353]
miR-378PPARGC1BChr5: 149092580 (+)[149089935, 149091684]
miR-103-1PANK3Chr5: 167920556 (−)[167938685, 167940184]
miR-335MESTChr7: 129923187 (+)[129912502, 129914001]
miR-31LOC554202Chr9: 21502184 (−)[21539381, 21557130]
miR-421AK125301ChrX: 73355021 (−)[73377862, 73379611]
miR-374b∼miR-374a∼miR-545AK057701ChrX: 73355178 (−)[73421362, 73431611]
miR-361CHMChrX: 85045368 (−)[85188362, 85189861]
miR-503MGC16121ChrX: 133508094 (−)[133506612, 133515611]
miR-452∼miR-224GABREChrX: 150878840 (−)[150889112, 150894611]
miR-22MGC14376Chr17: 1564031 (−)[1563059, 1569558]
miR-636SFRS2Chr17: 72244225 (−)[72244059, 72246308]
miR-146a DQ658414 Chr5: 159844936 (+) [159826435, 159828934]

Host gene: the gene whose intron the miRNA was found in. Other column names as in Table 1. Bold letters designate genes that are known to be co-transcribed with their host genes.

Table 3

Identification of promoters for intragenic miRNA genes.

miRNAHost GeneChromosomal locationChIP-chip regionCPPP ModelPredicted TSS (CPPP)Distance
miR-107PANK1Chr10: 91342564 (−)[91382494, 91383493]CpG−9138284440030
let-7a-2∼miR-100AK091713Chr11: 121522511 (−)[121521855, 121523854] NA none predicted NA
miR-190TLN2Chr15: 60903208 (+)[60860703, 60861952]CpG−6086142841530
miR-99a∼let-7cC21orf34Chr21: 16833279 (+)[16826951, 16832700]CpG−168278265203
miR-125b-2C21orf34Chr21: 16884427 (+)[16880451, 16883950]CpG−168809513226
miR-26a-1CTDSPLChr3: 37985898 (+)[37961854, 37963353]CpG−3796252923119
miR-196bHOXA9Chr7: 27175707 (−)[27178752, 27180251]CpG+271788023095
miR-489∼miR-653CALCRChr7: 92951267 (−)[92953002, 92954251] NA none predicted NA
miR-101-2RCL1Chr9: 4840296 (+)[4827381, 4828630]CpG−482828111765
miR-491KIAA1797Chr9: 20706103 (+)[20673131, 20677880]CpG+2067718128922
miR-204TRPM3Chr9: 72614820 (−)[72633881, 72634880] NA none predicted NA
miR-7-1HNRPKChr9: 85774592 (−)[85774131, 85775630]CpG−85775081239
mir-23b∼miR-27b∼miR-24-1C9orf3Chr9: 96887310 (+)[96846381, 96860880]CpG+9685588131429
miR-32 C9orf5 Chr9: 1108483999 (−) [110866881, 110868380] CpG− 110867881 19232
miR-448HTR2CChrX: 113964272 (+)[113955612, 113956861] NA none predicted NA

Column names as in Table 1 and 2. Bold letters designate genes whose expression was found to be anti-correlated with their host genes.

Host gene: the gene whose intron the miRNA was found in. Other column names as in Table 1. Bold letters designate genes that are known to be co-transcribed with their host genes. Column names as in Table 1 and 2. Bold letters designate genes whose expression was found to be anti-correlated with their host genes.

Modeling Pol II core promoter features with n-mers and weight matrices

In the following section, we describe the development of CPPP, a novel SVM-based method for prediction of Pol II TSSs. CPPP was used for the identification of the miRNA TSSs from the ChIP-chip data and for comparing the features of the miRNA promoters to those of protein coding gene promoters. It is known that the genomic regions immediately upstream of the TSS of protein coding genes exhibit high levels of sequence conservation [13], [28], [29], [30], [31], which is probably related to the high concentration of cis-regulatory sites in this region [32]. All of the existing algorithms for modeling Pol II core promoters have used this property to different extents. Generally one can model DNA target sites using either n-mer frequencies or weight matrices, commonly known as position-specific scoring matrices (PSSM) [33]. The first class of methods (also termed enumerating or dictionary-based methods; e.g., [34], [35], [36], [37]) is better suited for representation of the binding preferences of those transcription factors that have a restricted set of DNA targets. n-mer frequencies have been used in the past to model Pol II core promoters either alone [12] or in conjunction with some promoter entropy measure [38]. However, the DNA targets of most transcription factors are not highly conserved, which is the reason why PSSM models are widely used for representing DNA motifs. Regardless, using PSSMs for Pol II core promoter recognition has also its limitations. First, the currently known DNA motifs are redundant, not only because the available databases contain multiple matrices for the same factor, but also because structurally similar transcription factors are known to recognize similar “core” motifs [29], [39]. Second, the binding preferences are known only for a small percent of the transcription factor proteins and protein complexes. For example, TRANSFAC database [40] currently has annotated 2,113 mammalian transcription factors, but it only contains 601 binding models. Third, even if the binding preference of a given transcription factor is known, the task of determining whether it binds to a given promoter is not trivial, mainly due to the high false positive prediction rate [41], [42]. Despite the above limitations, PSSM models have been used extensively in the past for Pol II core promoter identification [10], [21], [43]. The problems of PSSM model redundancy and the relatively small number of transcription factors with known binding preferences can be diminished if one uses familial binding profiles (FBPs) [39]. FBPs represent an “average” of the binding preferences of related transcription factors. They are based on the fact that transcription factors of the same structural group typically bind to similar sets of sequences. This method not only reduces the PSSM model redundancy, but also offers models for the transcription factors for which the binding preference is currently unknown (since the transcription factors with unknown preferences are likely to belong in one of the existing families). Sandelin and Wasserman initially built a set of 11 FBPs using a semi-manual method [44]. In that study, the zinc finger proteins were excluded from the FBP construction due to their high degree of target promiscuity, which in turn makes them difficult to cluster correctly. More recently, Mahony et al. [29] used an automatic method to construct 17 FBPs. This set of FBPs includes all but the C2H2 the zinc finger (sub)family. Using the same clustering method developed in Mahony et al. [29] we built 31 new FBPs from the C2H2 zinc finger proteins. These 31 new FBPs together with the 17 FBPs from the Mahony et al. study were used in the subsequent analysis.

Evaluating Core Promoter Features Using Support Vector Machines

In order to better understand how various features contribute to the characterization of the Pol II core promoters we compared them using an SVM [45], [46]. The SVM methodology was chosen because it can combine multiple types of evidence (features) under the same general framework. In this study, we used (a) the n-mer frequencies (n = 3,4) and (b) matches to the set of 48 generalized DNA binding profiles (FBPs) as features of the SVM, and (c) the GC content. The reason for using GC content as an additional feature in the SVM training is that it seems to be a prominent feature in a subset of eukaryotic promoters [47]. Overall, we constructed and compared five SVM models: (1) FBPs only (48 features), (2) n-mers only (n = 3,4) (320 features), (3) FBPs+GC content, (4) n-mers+GC content, and (5) FBPs+n-mers+GC content. All models were trained on the same set of 3,015 verified core promoters of protein coding genes (positive examples; see ) and 3,015 randomly chosen intergenic sequences (negative examples; see ). Performance was measured by a 20× cross-validation in which 75% of the examples in each dataset were used for training and the remaining 25% for testing. The results are presented in , and indicate that the n-mer-based models perform generally better than the FBP-based models, both in terms of sensitivity (percent of correctly predicted positive examples) and specificity (percent of true positive examples among all predictions). For example, the “n-mer only” SVM model (n = 3, 4) exhibited S and S compared to S and S of the “FBP only” model. It should be noted, however, that none of these differences is statistically significant. Based on these results, one may choose to use FBPs for this type of modeling, especially in species where the number of training examples is limited.
Figure 2

Performance of the n-mers and FBPs (alone and in combination) in predicting Pol II core promoter regions.

Sn – sensitivity, Sp – specificity.

Performance of the n-mers and FBPs (alone and in combination) in predicting Pol II core promoter regions.

Sn – sensitivity, Sp – specificity. The SVM results reported above were based on the dot plot kernel function (linear discriminator). Tests with polynomial (3rd order) and radial kernels gave the same or slightly worse results (data not shown). Also, all SVM models were constructed using random intergenic regions as background (see ) instead of the intronic regions previous studies have used [38]. Evaluation with intronic sequences as background was found to be slightly worse (data not shown). We note that other studies have occasionally reported better performance (e.g., [12], [38]). We believe this is due to the smaller size of the datasets they used and the type of promoters these datasets contained. For example, Gangal and Sharma [38] reported S>87% and S, but the 800 promoter sequences in their dataset were all taken from EPD [48]. EPD is highly enriched in promoters containing CpG islands (about 83% of the total.) A very powerful separation hyperplane can be created using these GC-rich promoters as positive set and the, generally AT-rich, intronic sequences as negative set. However this model is expected to perform poorly on non-CpG island promoters, as we will show. In our case, only half of the promoters in the training/testing dataset contained CpG islands. When the EPD dataset is used for training/testing in this study, our results are similar (intronic background) or slightly better (intergenic background) to those reported in Gangal and Sharma [38]. Nevertheless, we found that partitioning the promoters to those containing CpG islands and those lacking CpG islands improves the results substantially (see below).

The effect of the presence or absence of CpG islands in the prediction efficiency of Pol II core promoters

In general, the frequency of CG dinucleotides in vertebrate genomes is lower than expected by chance [49]. This is due to the frequent conversion of methylated-CG into TG [50]. However, often the promoters of vertebrate genes contain stretches with high frequency of CG dinucleotides (CpG islands) [51]. These regions are often defined as 200 nt or more with GC content greater than 50% [18]. Ioshikhes and Zhang [18] have previously used this feature to predict the CpG island containing promoters with high efficiency. For this reason, we tested the prediction efficiency of the “n-mer only” and “FBP only” SVM models in mammalian core promoters in the presence or absence of CpG islands. Focus was placed on these two models because they are simpler than the composite model and their performance in the general dataset is the same or slightly better than the other models ( ). The positive training set was partitioned into CpG containing promoters (CpG+) and non-CpG promoters (CpG−), for each of which a n-mer-based and a FBP-based SVM model were calculated. The negative dataset contained equal number of randomly selected sequences from the intergenic parts of the genome (see ). The results demonstrate that if SVMs are trained in this way, then the prediction efficiency differs significantly between the two types of promoters. In particular, the “n-mer model” trained on CpG+ promoters exhibits S = 94.8% (SD = 1.1%) and S = 97.6% (SD = 1%) in the cross-validation tests. By contrast, when trained on CpG− promoters the “n-mer model” performs significantly worse (S = 73.4%, SD = 2.6% and S = 73.2%, SD = 2.9%) ( ). The results with the “FBP model” are similar for both the CpG+ and CpG− trained models ( ). Also, the results show that in general n-mer models perform slightly better than the corresponding FBP models regardless of the training (CpG+ or CpG− datasets) ( ). Furthermore, the results show that n-mer-based models trained on CpG+ promoters tend to predict extremely well the CpG promoters (S, S), which agrees with previous reports [18]. We have discovered that this better performance can be attributed to the GC content of these promoters (compared to the background), and this could be misleading. When intergenic sequences with similar GC content were used as negative dataset during training, the efficiency of the n-mer-based SVM on CpG+ promoters was reduced to values similar to the prediction of the CpG− promoters with the CpG− model (S = 75.3% with SD = 2.4% and S = 80.0% with SD = 2.0%). Since our main aim in this report is to discover important promoter features, not simply the features of the CpG islands, in the following analysis we use the seemingly less efficient models (i.e., n-mer SVMs trained on CpG+ vs. GC-normalized intergenic background and CpG− vs. random intergenic background).
Figure 3

Performance of the SVM models in predicting CpG+ and CpG− promoters.

Two SVM models were evaluated in the prediction of the CpG+ promoters: one with random intergenic background (CpG+/Rnd_bg) and one with intergenic background with similar GC content (CpG+/GC_bg). Sn – sensitivity, Sp – specificity.

Performance of the SVM models in predicting CpG+ and CpG− promoters.

Two SVM models were evaluated in the prediction of the CpG+ promoters: one with random intergenic background (CpG+/Rnd_bg) and one with intergenic background with similar GC content (CpG+/GC_bg). Sn – sensitivity, Sp – specificity. The program ‘gist-fselect’ from the Gist package [45] was used to evaluate the significance of each of the features of core promoter regions (t-test metric p-value was used to determine significance) in CpG and non-CpG promoters. The top 20 features (ranked by the Fisher score of the package) are presented in . Interestingly, despite correcting for GC content, the most significant features for the CpG+ promoters were CG containing n-mers. Also of interest is the n-mer ‘CTG,’ which is present in the top 20 most significant features of both models.
Table 4

The top 20 most significant n-mers for each of the two models and the Fisher score as well as the −log10 of their p-value from Gist package (t-test metric).

non-CpGCpG
Feature−log10(p-value)Fisher ScoreFeature−log10(p-value)Fisher Score
CCCT 29.79250.152704 GCG 26.84580.136008
AGGG 26.85740.136658 GGCG 23.38170.11732
GCCC 23.99960.12122 CTG 21.5833 0.107693
CCC 23.66380.119395 CGC 21.34340.106413
TGTA 23.70210.119389 CCTG 18.02130.0887804
CCCC 23.62480.119181 TCCG 17.97560.088539
AAT 23.41040.117827 GCGG 17.80460.0876364
GAAG 22.29790.111908 TCGC 15.9980.0781331
AGC 21.14280.105734 CGA 14.45890.0700828
TAC 20.87540.104254 CTGG 14.4480.0700258
ATT 19.53440.0971108 CAGG 14.01660.0677778
TAAT 19.15610.0950535 CTA 13.95020.067432
ATTA 19.10210.0947959 CGGA 13.76320.0664587
TACA 19.00210.0942557 GCGC 13.73310.0663022
GTA 18.68680.0925992 CGCC 13.23850.0637315
AATA 18.60510.092075 CAC 12.94860.0622268
GGG 18.00890.0890836 CAG 12.64670.0606617
CTGC 18.00340.0890473 CGG 12.33750.0590611
CAGC 16.87980.0831072 CGCG 12.31670.0589534
CTG 16.3289 0.0801748 TCG 12.24700.0585926

Bold letters indicate the n-mer that appears to be a significant feature in both the CpG+ and CpG− models.

Bold letters indicate the n-mer that appears to be a significant feature in both the CpG+ and CpG− models.

Comparison of core promoters for protein coding and miRNA genes with SVM models

The ChIP-chip data showed that 34 of the intergenic miRNA genes had significant Pol II signals less than 50 Kb away. The 3 Kb regions surrounding the windows with the most significant Pol II peak were collected and the presence or absence of CpG islands was determined using the same method as in Zhao et al. [21]. CpG islands were identified in about 55% of these promoters. Subsequently, the corresponding SVM model (trained on CpG+ or CpG− promoters of protein coding genes) was used across the significant ChIP-chip region to identify the top scoring 500 bp window that contains the predicted TSS. The CPPP algorithm identified a TSS in the upstream regions of 29 out of the 34 intergenic miRNA genes ( ). Each of the five intergenic miRNAs for which CPPP was unable to identify a core promoter contained a 500 bp region that scored just below the threshold cutoff for identifying a core promoter from a background sequence (data not shown). The number of Pol II associated intergenic miRNA genes is not large enough to retrain the SVM models and calculate significant sequence features. However, we can test whether the most significant features in the promoters of the protein coding genes ( ) are also overrepresented in the miRNA promoters. Comparison of all n-mer frequencies of the CpG promoters of protein coding genes with those of the miRNA genes resulted in a statistically significant difference of 5 n-mers (‘CAC’, ‘GCAC’, ‘CGGT, ‘GTAC’, and ‘CTTA’; Wilcoxon signed-rank test; p-value<0.05 after Bonferroni correction). However, the only 4-mer in the list of the top 50 most important features for the model was ‘CAC’. For the non-CpG miRNA promoters, we found no features with a statistically different frequency when compared to that of the protein coding genes.

Computational analysis of potential promoters of intragenic miRNA genes

Intragenic miRNA genes are generally believed to be co-transcribed with their host genes. Overall, we found significant Pol II peaks associated with 43 intragenic miRNA genes or gene clusters. In 27 cases, the Pol II peak overlapped the promoter of the host gene ( ), but in 15 cases the Pol II peak was located within the host gene ( ). We scanned the ChIP identified regions with internal Pol II peaks with the corresponding SVM model (CpG+ or CpG−) and we found that 11 of these 15 intragenic genes contained a highly likely TSS region ( .) This result indicates that 26% or more of the intragenic genes may be transcribed from their own promoter. In agreement with this finding, the miR-32 gene was previously shown to have a negative correlation with its host gene, C9orf5 [23]. This is an important and interesting finding about the transcriptional regulation of intragenic miRNAs, although further biochemical validation is required.

Comparison with ChIP-seq data

Marson and colleagues [14] recently performed ChIP-seq experiments with four general transcription factors in human and mouse cells. They then combined their data with those from previous studies on epigenetic markers. Using a variety of features such as evolutionary conservation and distance of the peak from the known miRNA genes, they assigned an ad hoc score to each putative TSS. A positive score indicated that the TSS is likely to belong to the miRNA and a negative score indicated that the TSS likely belonged to another gene. Although their study is very different than ours in the methodology followed (ChIP-chip vs. ChIP-seq) and the proteins targeted by the ChIP, comparing the two can be useful. Overall, we found the two datasets are comparable with respect to miRNA promoter characteristics. Furthermore, we ran the CPPP algorithm on their data and verified most of their predicted promoters. In particular, there are 44 TSSs associated with intergenic miRNAs with positive Marson score (after removing some inconsistent or mis-annotated TSSs, see ); and CPPP scored all but three of them as likely to contain a core promoter (see Supplementary Data). The two datasets (ours and Marson's) include one miRNA gene and one miRNA cluster with well studied TSSs: cluster miR-23amiR-27amiR-24-2 [5], and gene miR-21 [26]. CPPP correctly identified the location of both known TSSs, while the Marson dataset only found the correct TSS for the miR-23a cluster (see Supplementary Data.)

Conclusions

The prediction of miRNA TSSs and the understanding of the processes that affect their transcription is an essential step towards deciphering their role in regulatory networks. In this study, high-throughput Pol II ChIP-chip data were collected and used to infer miRNA TSSs. Analysis of these data showed that intergenic and some intragenic miRNAs are transcribed by Pol II at a distance that can be as large as 40 Kb from the pre-miRNA genes, indicating that pri-miRNA transcripts might be much longer than originally thought [10], [11], [12]. We compared the most commonly used promoter features (n-mer frequencies and PSSM models) and found the n-mer frequencies to be generally better than the generalized PSSM models, at the cost of additional parameters. Also, in agreement with other studies [18], we found that CpG+ promoters are much easier to predict than CpG− and that core promoter prediction is more efficient when both models are used independently. However, we also found that this generally superior performance of the SVM models on CpG promoters vs. non-CpG promoters is due to the GC bias of the former. When a random background with similar GC content as the CpG+ promoters was used, the efficiency of the SVM model became similar to that of the CpG− model. This is a previously overlooked feature of the SVM training for core promoter recognition. Using the best performing SVM on our ChIP-chip data, we found that miRNA Pol II promoters contain most of the same features as the protein coding gene promoters. Our results highlight the complexity and independence of the miRNA gene expression regulation and thus encourage more detailed studies in various cells, organs and physiological conditions. Our study gives a biochemical verification to previous statistical analyses that indicated that pri-miRNA transcripts can be tens of thousands of bases long [23]. Finally, the finding that 26% or more of the intragenic miRNA genes may be transcribed by their own promoter also encourages much more detailed studies into their transcriptional regulation. Elucidating the transcriptional networks that determine expression of miRNAs is critically important considering their important regulatory roles. miRNA location arrays may be useful tools in elucidating these networks

Materials and Methods

Chromatin Immunoprecipitation (ChIP-chip)

Approximately 108 A549 cells (American Type Culture Collection, Manassas, VA) were grown in F12K medium (Invitrogen, Carlsbad, CA) with 2 mM L-glutamine and 10% fetal bovine serum. Cells were incubated at 37°C in a humidified chamber supplemented with 5% CO2. Once 80% confluent, cells were serum starved overnight. Proteins were cross-linked to the DNA using fresh formaldehyde solution (50 mM Hepes-KOH pH 7.5, 100 mM NaCl, 1 mM EDTA pH 8.0, 0.5 mM EGTA pH 8.0, 11% Formaldehyde) for 10 min at room temperature. The formaldehyde was quenched with 2.5 M glycine for 5 min at room temperature. Cells were washed twice in PBS and harvested using a silicone scraper. Cells were centrifuged at 1,350×g for 5 minutes at 4°C and the pellet washed twice with PBS. The pellet was resuspended in 5 ml of lysis buffer 1 (50 mM Hepes-KOH pH 7.5, 140 mM NaCl, 1 mM EDTA, 10% glycerol, 0.5% NP-40, 0.25% Triton X-100) and rocked at 4°C for 10 min. The cells were centrifuged at 1,350×g for 5 minutes at 4°C and the pellet resuspended in 5 ml of lysis buffer 2 (10 mM Tris-HCl, pH 8.0, 200 mM NaCl, 1 mM EDTA, 0.5 mM EGTA), rocked at room temperature for 10 min. The nuclei were pelleted by centrifuging at 1,350×g for 5 minutes at 4°C. The pellet was resuspended in 5 ml of lysis buffer 3 (10 mM Tris-HCl, pH 8.0, 100 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, 0.1% Na-deoxycholate, 0.5% N-lauroylsarcosine). The cells were sonicated for 7 cycles of 30 seconds ON and 60 seconds OFF at a power 7 using a sonic dismembrator Model 100 (Fisher Scientific, Waltham, MA). The cells were centrifuged at 20,000×g for 10 minutes at 4°C and 50 µl of the supernatant was set aside as the whole cell extract (WCE). The rest of the supernatant was incubated overnight at 4°C with 100 µl of Dynal Protein G magnetic beads that had been pre-incubated with either 10 µg RNA polymerase II antibody (Abcam, Cambridge, MA) or 10 µg E2F-4 antibody (Santa Cruz Biotechnology, Santa Cruz, CA). The beads were washed 7 times in RIPA buffer (50 mM Hepes-KOH pH 7.6, 500 mM LiCl, 1 mM EDTA pH 8.0, 1% NP-40, 0.7% Na-deoxycholate) and once in Tris-EDTA containing 50 mM NaCl. Elution was done in elution buffer (50 mM Tris-HCl pH 8.0, 10 mM EDTA pH 8.0, 1% SDS) for 15 min at 65°C. Reversal of crosslinks of the immunoprecipitate (IP) and the WCE was done at 65°C overnight. Cellular RNA was digested with 0.2 mg/ml RNaseA (Invitrogen) at 37°C for 2 h followed by protein digestion with 0.2 mg/ml proteinase K (Invitrogen) at 55°C for 30 min. The DNA was purified by phenolchloroformisoamyl alcohol extraction and ethanol precipitation. Purified DNA was blunted using T4 DNA polymerase (New England Biolabs, Ipswich, MA) and ligated to 2 µM linkers using T4 DNA ligase (New England Biolabs). The IP and the WCE was amplified in two stages of PCR and purified by phenolchloroformisoamyl alcohol extraction and ethanol precipitation. 2 µg each of IP and WCE was labeled with Cy5-dUTP and Cy3-dUTP (Perkin Elmer, Waltham, MA) respectively. Labeling was carried out by random-primed Klenow-based extension using the CGH Labeling kit (Invitrogen). The samples were cleaned up using Invitrogen's CGH columns included in the kit. 5 µg each of IP and WCE were combined with cot-1 DNA and the 10× blocking agent and 2× hybridization buffer supplied in the Agilent Oligo aCGH/ChIP-on-chip Hybridization Kit (Agilent, Santa Clara, CA). Hybridization was carried out in Agilent's SureHyb chambers at 65°C for 40 h in the DNA Microarray Hybridization Oven (Agilent). The slides were washed using Oligo aCGH/ChIP-on-chip wash buffer 1 and 2 (Agilent) and scanned in the DNA microarray scanner (Agilent). The scanned images were processed using Agilent's Feature Extraction software version 9.5.3

ChIP-seq data

Marson et al. [14] recently published a study where they combined different ChIP-seq datasets for multiple DNA binding proteins (or modifications of them) to unravel the transcriptional machinery of the miRNA genes in mouse and human cells. We analyzed their human dataset with our CPPP and compared their results with ours. Their original dataset contained TSSs associated with 101 intergenic miRNA genes or gene clusters. We excluded 19 of them from this analysis, because we found them to either overlap with promoters of protein coding genes (10 TSSs) or were located downstream of the corresponding miRNA gene (4 TSSs) or there were genes found between the miRNA gene and the TSS prediction (5 genes; see Supplementary Data.) We also converted the coordinates to the current version of the human genome (hg18) using the liftOver utility of the UCSC Genome Browser [52]. This caused the miRNA-TSS distances to change significantly (more than 50 bp) for 7 of the 101 genes.

miRNA Location Array Design

The miRNA location array was custom-made by Agilent with AMADID (Agilent Microarray Design Identifier) 014119. The array is available on the 44 K design. The probes tile 100 Kb regions (∼200 bp spacing) surrounding each miRNA and only in non-repeat masked regions. The probes are 45–60-mers, Tm balanced and map to the Hg17 database. There are 41585 probes from ∼164 miRNA intervals. Control probes such as GD (gene desert), intensity controls (LACC) and some negative controls were also included.

Analysis of ChIP-chip Data

Median normalization of the log2 values of the ratio of signal to mock (precipitated DNA without antibody) was performed across the three-ChIP-chip arrays followed by a mean centralization to 0. Regions of Pol II binding were identified by the ChIPOTle sliding window method [22]; a window size of 1 Kb was used with a step size of 50 bp. The window was reported as significant if the p-value was below 0.05 after adjustment by the conservative Bonferroni correction method for multiple testing. Overlapping significant windows were combined and the region with the lowest p-value was reported.

Gene Coordinate and Sequence Collection

Pol II core promoters were extracted from two databases: Eukaryotic Promoter Database [48] and DBTSS [53]. Between the two databases there were 3,015 unique human TSSs (1,744 from Eukaryotic Promoter Database and 1,271 from DBTSS as originally identified by Zhou et al [12]). The core promoter regions were partitioned into 1,445 that contained CpG islands and 1,570 that did not according to the method and threshold used in Zhao et al [21]. For the training and testing of the various SVM models the area [−450, +50] surrounding the TSS was used as the positive dataset. An equal number of 500 bp genomic sequences, randomly selected from the intergenic regions of all chromosomes were used as the negative dataset for the CpG− model. A second set of sequences was collected as described by Zhao et al. [21] This dataset had GC content similar to the CpG island promoter dataset and was used as negative dataset for the CpG− model. Special care was given so that the randomly selected regions were not located within 3 Kb from the 5′ end of any annotated gene. Genomic coordinates for all mRNA TSSs, mRNA introns and miRNA were collected from the UCSC table browser [24], [25]. Intragenic miRNAs were identified as those found within an intron, exon or UTR of a mRNA and transcribed in the same orientation. All other miRNAs were labeled as intergenic.
  53 in total

Review 1.  DNA binding sites: representation and discovery.

Authors:  G D Stormo
Journal:  Bioinformatics       Date:  2000-01       Impact factor: 6.937

2.  MicroRNA genes are transcribed by RNA polymerase II.

Authors:  Yoontae Lee; Minju Kim; Jinju Han; Kyu-Hyun Yeom; Sanghyuk Lee; Sung Hee Baek; V Narry Kim
Journal:  EMBO J       Date:  2004-09-16       Impact factor: 11.598

3.  Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.

Authors:  Albin Sandelin; Wyeth W Wasserman
Journal:  J Mol Biol       Date:  2004-04-23       Impact factor: 5.469

4.  MicroRNA promoter element discovery in Arabidopsis.

Authors:  Molly Megraw; Vesselin Baev; Ventsislav Rusinov; Shane T Jensen; Kriton Kalantidis; Artemis G Hatzigeorgiou
Journal:  RNA       Date:  2006-08-03       Impact factor: 4.942

5.  An intragenic MEF2-dependent enhancer directs muscle-specific expression of microRNAs 1 and 133.

Authors:  Ning Liu; Andrew H Williams; Yuri Kim; John McAnally; Svetlana Bezprozvannaya; Lillian B Sutherland; James A Richardson; Rhonda Bassel-Duby; Eric N Olson
Journal:  Proc Natl Acad Sci U S A       Date:  2007-12-19       Impact factor: 11.205

6.  Footer: a quantitative comparative genomics method for efficient recognition of cis-regulatory elements.

Authors:  David L Corcoran; Eleanor Feingold; Jessica Dominick; Marietta Wright; Jo Harnaha; Massimo Trucco; Nick Giannoukakis; Panayiotis V Benos
Journal:  Genome Res       Date:  2005-06       Impact factor: 9.043

7.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.

Authors:  J van Helden; B André; J Collado-Vides
Journal:  J Mol Biol       Date:  1998-09-04       Impact factor: 5.469

8.  DNA methylation and the frequency of CpG in animal DNA.

Authors:  A P Bird
Journal:  Nucleic Acids Res       Date:  1980-04-11       Impact factor: 16.971

9.  The transcriptional landscape of the mammalian genome.

Authors:  P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal:  Science       Date:  2005-09-02       Impact factor: 47.728

10.  EPD in its twentieth year: towards complete promoter coverage of selected model organisms.

Authors:  Christoph D Schmid; Rouaïda Perier; Viviane Praz; Philipp Bucher
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

View more
  130 in total

Review 1.  Aberrant epigenetic grooming of miRNAs in pancreatic cancer: a systems biology perspective.

Authors:  Asfar S Azmi; Frances W J Beck; Bin Bao; Ramzi M Mohammad; Fazlul H Sarkar
Journal:  Epigenomics       Date:  2011-12       Impact factor: 4.778

2.  Identification of an miRNA candidate reflects the possible significance of transcribed microsatellites in the hairpin precursors of black pepper.

Authors:  Nisha Joy; Eppurathu Vasudevan Soniya
Journal:  Funct Integr Genomics       Date:  2012-02-25       Impact factor: 3.410

Review 3.  General principals of miRNA biogenesis and regulation in the brain.

Authors:  Dónal O'Carroll; Anne Schaefer
Journal:  Neuropsychopharmacology       Date:  2012-06-06       Impact factor: 7.853

Review 4.  Mechanisms of control of microRNA biogenesis.

Authors:  Brandi N Davis-Dusenbery; Akiko Hata
Journal:  J Biochem       Date:  2010-09-09       Impact factor: 3.387

Review 5.  The widespread regulation of microRNA biogenesis, function and decay.

Authors:  Jacek Krol; Inga Loedige; Witold Filipowicz
Journal:  Nat Rev Genet       Date:  2010-07-27       Impact factor: 53.242

6.  Gain of miR-151 on chromosome 8q24.3 facilitates tumour cell migration and spreading through downregulating RhoGDIA.

Authors:  Jie Ding; Shenglin Huang; Shunquan Wu; Yingjun Zhao; Linhui Liang; Mingxia Yan; Chao Ge; Jian Yao; Taoyang Chen; Dafang Wan; Hongyang Wang; Jianren Gu; Ming Yao; Jinjun Li; Hong Tu; Xianghuo He
Journal:  Nat Cell Biol       Date:  2010-03-21       Impact factor: 28.824

7.  Identification of the transcriptional promoters in the proximal regions of human microRNA genes.

Authors:  Yue-Sheng Long; Guang-Fei Deng; Xun-Sha Sun; Yong-Hong Yi; Tao Su; Qi-Hua Zhao; Wei-Ping Liao
Journal:  Mol Biol Rep       Date:  2010-11-24       Impact factor: 2.316

Review 8.  Smad-mediated regulation of microRNA biosynthesis.

Authors:  Matthew T Blahna; Akiko Hata
Journal:  FEBS Lett       Date:  2012-01-28       Impact factor: 4.124

9.  Discordant expression of miR-103/7 and pantothenate kinase host genes in mouse.

Authors:  Brenda J Polster; Shawn K Westaway; Thuy M Nguyen; Moon Y Yoon; Susan J Hayflick
Journal:  Mol Genet Metab       Date:  2010-08-04       Impact factor: 4.797

Review 10.  Diversifying microRNA sequence and function.

Authors:  Stefan L Ameres; Phillip D Zamore
Journal:  Nat Rev Mol Cell Biol       Date:  2013-06-26       Impact factor: 94.444

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.