Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy.

Hamutal Arbel^1,2, Sumanta Basu^3,2,4, William W Fisher⁵, Ann S Hammonds⁵, Kenneth H Wan⁵, Soo Park⁵, Richard Weiszmann⁵, Benjamin W Booth⁵, Soile V Keranen⁵, Clara Henriquez⁵, Omid Shams Solari², Peter J Bickel⁶, Mark D Biggin⁵, Susan E Celniker^1,5, James B Brown^1,2,7.

Abstract

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation-driving enhancers (SDEs). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed that >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.

Keywords: Drosophila; embryo development; enhancers; machine learning; random forests

Year: 2018 PMID: 30598455 PMCID: PMC6338827 DOI: 10.1073/pnas.1808833115

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

Enhancers are ∼100- to 1,000-bp cis-regulatory elements that direct spatial and temporal pattern transcription in metazoans. Definitive epigenetic signatures of enhancer elements have been challenging to identify. A number of computational tools have been developed to predict enhancer elements from chromatin state and transcription factor in vivo DNA binding information (1–12). Tools that attempt to measure predictive accuracy using only indirect evidence of enhancer activity [e.g., enrichment in H3K27 acetylation (H3K27ac) or histone acetyltransferase p300 (EP300)] often display excellent accuracy by these limited criteria (1, 3, 13, 14). When algorithms are benchmarked on held-out in vivo tests of functional enhancer activity, however, positive predictive power on genome-wide scans in metazoan systems has been lower than expected. In most cases, precision does not exceed 40% (13–15). However, by targeting transcription factors (TF) that function in a specific biological process, a higher precision of 56% was achieved in a randomly selected validation sample through transient transfection (16). Higher precision has also been reported when tests were confined to the top of the prediction rank list (17), but such numbers are unlikely to represent the precision of the prediction set as a whole. There are several possible explanations for the relatively low accuracy of current enhancer prediction algorithms. The transient in vivo enhancer assays often employed to test predictions may suffer a high false-negative rate due to the loss of local chromatin context. Alternatively, the data provided to the prediction algorithms might be insufficient. Features such as H3K27ac and EP300 can partially distinguish active enhancers (18, 19), but it remains unclear whether any chromatin mark or combination of chromatin marks and EP300 uniquely identifies enhancers among all sequences in a genome (16, 20). 
Indeed, enhancers that lack H3K27ac yet have patterns of DNA hypermethylation are essential during early vertebrate development (21). Hence, there may be more than a single class of genomic element that drives patterned expression or, more precisely, the term “enhancer” may encapsulate a mechanistically diverse class of functional elements. TF occupancy is a better predictor of enhancer activity than canonical chromatin marks (including H3K27ac, H3K4me1, and H3K4me3) in mouse and humans (16). Thus, mechanistic subtypes of functional enhancer elements may emerge from distinct patterns of TF occupancy and chromatin context. To test the possibility that heterogeneity among enhancers is a major reason for the difficulty in predicting enhancers, we have exploited the pregastrula Drosophila embryo network. A cohort of ∼30 spatially patterned TFs drives body patterning in concert with another 30 or so ubiquitously expressed sequence-specific TFs (22–32). Embryonic patterning is established along the anterior–posterior (A-P) axis and dorsal–ventral (D-V) axis by two separate sets of maternally deposited TFs. Over a 90-min period corresponding to developmental stages 4 and 5, these proteins act in concert with zygotically expressed A-P and D-V TFs to refine initially broad patterns of transcription into narrower striped patterns that define the basic segmental body plan of the fruit fly (33). The pregastrula fly network is thus a particularly well-defined model system for studying the relationship between TF DNA binding and spatially patterned enhancer activity. We have tested the utility of a wide range of data for predicting enhancers, including in vivo DNA binding patterns for 22 pregastrula TFs, a variety of chromatin marks, evolutionary conservation, whole-embryo mRNA sequencing (mRNA-seq), and RNA polymerase II (Pol II) location. 
Using a test set of nearly 8,000 genomic regions whose enhancer activity had been determined in transgenic assays in whole embryos (34, 35), we applied supervised machine learning to identify enhancer sequences active in pregastrula embryos. Verified enhancers were separated into two approximately equally sized groups based on the reproducibility with which they were correctly predicted in multiple runs of a random forest (RF). A model trained using the set of enhancers that were reproducibly classified correctly has >98% predictive accuracy when tested on a balanced set of known enhancer positive and negative genomic regions. In contrast, the other set of training enhancers generated models that predicted no better than random. Subsequent analyses revealed that the well-predicted enhancers lie near genes that show a strong tendency to be involved in controlling segmentation and other developmental processes, and to be expressed in many cells of the embryo. The poorly predicted enhancers are without obvious ties to the control of segmentation and tend to be expressed in less than 15% of cells. By focusing on the well-predicted class of enhancers, which we term segmentation-driving enhancers (SDEs), we find that TF DNA binding is highly predictive, whereas histone modifications and the remaining features tested have little or no additional predictive power. In a de novo, genome-wide prediction, we predict ∼1,640 SDEs in the early embryo that cover 1.6% of the euchromatic genome. As validated by an in vivo transgenic reporter gene assay, this set is predicted with 98% estimated recall and 86.7% precision. Unlike most previous studies, we concentrated validation away from the top of our rank list to increase the likelihood of identifying false-positives and to improve our power to compute accurate error rates. 
Importantly, we show that our model performance is driven by the need to treat SDEs separately from other enhancer elements, rather than the properties of a specific computational method: naïve Bayes and logistic regression perform as well as more complex models after conditioning on the SDE set. This demonstrates the prediction of a specific class of enhancers with sufficient precision to enable their identification genome-wide.

Results

Data, Feature, and Feature Selection.

Transgenic reporter data for enhancer activity in Drosophila embryos were combined from two sources. Kvon et al. (34) conducted a semiautomated screen of the reporter gene-expression patterns driven by 7,705 genomic regions (enhancers.starklab.org/) at multiple stages throughout embryogenesis. While this high-throughput assay allowed an unprecedented number of genomic areas to be tested, the small number of embryos per collection plate led to increased misclassifications in the data. The activity of an additional 282 genomic segments was determined by the Berkeley Drosophila Transcription Network Project (BDTNP) (35). Altogether, 7,987 genomic regions were examined and 731 were experimentally found to drive reporter gene expression in Drosophila embryonic stages 4–6 (36) (Dataset S1). By manually comparing the activity of overlapping genomic regions in the BDTNP database with the larger data from Kvon et al. (34), we estimate a 10% false-negative rate in the latter. Features used in the initial model included ChIP-chip data for 20 of the TFs that pattern transcription along the A-P and D-V axes of the embryo (37–39), chromatin immunoprecipitation-sequencing (ChIP-seq) data for the ubiquitous TFs Zelda (ZLD) and Zeste (Z), 45 chromatin proteins and histone modifications (40), DNase accessibility data (41–43), and evolutionary conservation scores (44–46). Also considered were the presence of: bidirectional RNA transcripts, exon and intron coverage, distance to RNA Polymerase II ChIP-chip binding peaks, and distance to transcription start sites. A summarized list of features is presented in Table 1. For a full list and description, see and , respectively.

Table 1.

Summary of features used for prediction

Category	Features included
Histone and histone modifications	H3, H3K18ac, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H4K5ac, H4K8ac
A-P regulatory transcription factors	BCD, CAD, GT, HB, KNI, KR, HKB, TLL, D, FTZ, PRD, RUN, SLP
D-V regulatory transcription factors	DA, DL, MAD, MED, SHN, SNA, TWI
Ubiquitous transcription factors	Z, ZLD, sum of all transcription factor scores
DNA data	Conservation, DNA accessibility, distance to Pol II, distance to TSS, bidirectional-RNA transcription
Exon/intron data	Exons, coding exons and introns coverage/presence

With these data we trained and tested RF, a supervised machine-learning approach based on an ensemble of decision trees (47–49). To reduce parameter number and prevent overfitting, we culled input features. We found that TF and histone modification data were sufficient to minimize the error rate. We note that DNase accessibility did not contribute to RF predictive power in the presence of TF binding data, nor did it significantly improve performance in the presence of histone data, and it added only modest predictive power when used as the sole feature for prediction. Conservation scores did not contribute to the predictive power in any fitted model, and the error rate utilizing solely conservation scores was ∼50%, suggesting that conservation is not a distinguishing feature of enhancers in the Drosophila embryo in the absence of other genomic context.

Heterogeneity Among Enhancer Elements.

With our optimal feature set, our error rate in a single forest as defined by misclassification was nearly 30%. The performance of the forest voting probabilities as indicated by the area under the receiver operating characteristic (ROC) curve, AUC = 0.82 (Fig. 1), is very similar to that in previously published work (16, 17), implying a similarly modest success rate. However, while this overall predictive power falls short of that required for predicting enhancers genome-wide, we noticed that some enhancers were consistently correctly classified while others were consistently misclassified. Hypothesizing that the model’s poor performance may be due to heterogeneity in the enhancer set, we separated the enhancers into two classes. Class I contained the 358 enhancer segments that were correctly classified at least 75% of the time and class II contained the 373 that were not. When class II enhancers were excluded from the test sample, the single forest error rate dropped to ∼3%, and the area under the ROC curve was ∼0.99 (Fig. 1). When class I enhancers were excluded from the test sample instead, the error rate of a single forest was ∼40%, and the ROC curve indicated performance only marginally better than random guessing (Fig. 1). To establish that enhancer heterogeneity is data-driven and not an artifact of our choice of method, logistic regression and naïve Bayes models of the data were also constructed. In both cases the removal of the class II enhancer set significantly improves the model’s predictive power (Fig. 1). Interestingly, retaining or removing class I and class II enhancers has an almost identical effect on recall regardless of the method, and indeed the ROC curves are nearly overlapping (Fig. 1). This is particularly noteworthy because the underlying assumptions of both models (primarily feature additivity and independence) are unlikely to hold in the data, yet both perform as well as RFs, which do not require such assumptions. 
Precision-recall (PR) curves also show that logistic regression performance closely matches that of RFs, although naïve Bayes precision is poor. In all cases accounting for heterogeneity increases precision significantly. When the nonenhancer set is purged of later-stage enhancers, the PR curve for RF has an AUC > 0.95, demonstrating extremely high sensitivity in the data.
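The class split described above can be illustrated with a short sketch. It assumes that, for each verified enhancer, the outcome (correct or incorrect) of every RF run has already been recorded. The function name `split_by_reproducibility`, the input layout, and the toy identifiers are hypothetical; only the 75% cutoff comes from the text.

```python
from typing import Dict, List, Tuple


def split_by_reproducibility(
    correct_by_run: Dict[str, List[bool]],
    threshold: float = 0.75,
) -> Tuple[List[str], List[str]]:
    """Split enhancers by how often repeated classifier runs got them right.

    correct_by_run maps a (hypothetical) enhancer ID to one boolean per run:
    True if that run classified the enhancer correctly.
    Returns (class_i, class_ii): IDs at or above the threshold, and the rest.
    """
    class_i: List[str] = []
    class_ii: List[str] = []
    for enhancer_id, outcomes in correct_by_run.items():
        if not outcomes:
            raise ValueError(f"no runs recorded for {enhancer_id}")
        fraction_correct = sum(outcomes) / len(outcomes)
        if fraction_correct >= threshold:
            class_i.append(enhancer_id)
        else:
            class_ii.append(enhancer_id)
    return class_i, class_ii


# Toy input: three enhancers, four runs each (illustrative values only).
toy_runs = {
    "enh_a": [True, True, True, True],     # always correct
    "enh_b": [True, True, True, False],    # correct in exactly 75% of runs
    "enh_c": [True, False, False, False],  # mostly misclassified
}
toy_class_i, toy_class_ii = split_by_reproducibility(toy_runs)
```

In this toy input, `enh_a` and `enh_b` fall into class I and `enh_c` into class II. How the per-run outcomes are produced (bootstrap resampling, out-of-bag votes, or repeated forests) is not specified here and would need to follow the actual analysis.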

Fig. 1.

(A) RF ROC curves for the complete dataset of 7,987 previously validated genomic regions (blue) show mediocre performance, with an AUC of 0.83. When only class I enhancers and nonenhancers are used for training, the predictive power rises sharply, AUC of 0.99 (yellow). When only class II enhancers and nonenhancers are used, the result is close to a random guess (gray). When predicting the class I enhancer set the ROC curves for RFs, logistic regression, and a naïve Bayes classifier are nearly overlapping. (B) This can be explained by the colocalization of class II enhancers and nonenhancers in a PCA projection. (C) The separation is mainly driven by TFs as exemplified by the normalized ChIP strength across features of 200 randomly selected class I and class II enhancers.

This separation by the model can be understood by principal component analysis (PCA) (Fig. 1): class II enhancers are colocated with nonenhancers while class I enhancers are separated from both. Examination of feature space statistics of the three groups shows that class II enhancers are indistinguishable from nonenhancers along our entire feature space (TF DNA binding, histone marks, conservation, and DNase accessibility), while class I enhancers segregate from both by multiple features. The separation is most notable in TF DNA binding and DNase accessibility profiles (Fig. 1), where class I enhancers consistently have higher ChIP scores and are more accessible in whole-embryo average data. This indicates a possible reason and mechanism for the separation of the two classes and shows that RFs can be readily used to separate heterogeneous enhancer sets. Excluding class II enhancers from the sampled training set gives us unprecedented prediction accuracy. On a balanced held-out test set, built from genome regions of which prior studies suggested half were enhancers and half were nonfunctional, more than 98% of class I enhancers are discovered by our algorithm with better than 95% precision. 
This model would have much lower accuracy if used to predict enhancers genome-wide, however. As one moves away from a balanced test set by adding a more realistic number of inactive genomic regions, the false-positive rate in the test set will increase. To demonstrate this point, RFs were trained on a balanced set and then tested on a series of increasingly imbalanced test sets at various degrees of stringency (Fig. 2). The false-positive rate for test sets increases sharply as either the fraction of nonenhancers in the test set increases or the accuracy of the model (defined during training) decreases. This can also be seen in 2D plots of the same analysis (Fig. 2): unless the sample is very close to a 50%/50% balance, the rise in the false-discovery rate (FDR) in the test set is extremely sharp. Consequently, in genomic scans where nonenhancer regions are at least 100-fold more prevalent, a precision considerably better than 95% during training (measured out of bag) is needed to achieve a 75% FDR in the test.
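The dependence of the FDR on class imbalance follows from simple counting, which the sketch below makes explicit. The recall and false-positive rate used here are illustrative assumptions chosen to give roughly 95% precision on a balanced set; they are not fitted values from this study, and `expected_fdr` is a hypothetical helper name.

```python
def expected_fdr(recall: float, false_positive_rate: float,
                 negatives_per_positive: float) -> float:
    """Expected false-discovery rate at a given class imbalance.

    For every true enhancer there are `negatives_per_positive` nonenhancers.
    Per true enhancer, the classifier yields `recall` true calls and
    `false_positive_rate * negatives_per_positive` false calls.
    """
    true_calls = recall
    false_calls = false_positive_rate * negatives_per_positive
    if true_calls + false_calls == 0:
        raise ValueError("classifier makes no positive calls")
    return false_calls / (true_calls + false_calls)


# Assumed operating point: 98% recall, 5% false-positive rate.
ASSUMED_RECALL = 0.98
ASSUMED_FPR = 0.05

fdr_balanced = expected_fdr(ASSUMED_RECALL, ASSUMED_FPR, 1)    # 1:1 test set
fdr_20_to_1 = expected_fdr(ASSUMED_RECALL, ASSUMED_FPR, 20)    # 20:1 imbalance
fdr_100_to_1 = expected_fdr(ASSUMED_RECALL, ASSUMED_FPR, 100)  # genome-like scan
```

Under these assumed rates the balanced-set FDR is just under 5%, but it rises to roughly 50% at a 20:1 ratio and to above 80% at 100:1, which is why a model that looks excellent on balanced data can still perform poorly in a genome-wide scan.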

Fig. 2.

False-positive rate is a function of method accuracy and imbalance in the test data. (A) A 3D surface plot shows a sharp increase in the test-set false-positive rate as either the training-set false-positive rate or the fraction of nonenhancer regions in the test set increases. This shows that in genomic settings, where the imbalance cannot be controlled, a very high degree of accuracy is required. (B and C) Two-dimensional plots of the marginals of the 3D image in A, demonstrating the sharp rise in test inaccuracy with either a higher false-positive rate in the training set or dilution of the enhancer class in the test set.

In the dataset of Kvon et al. (34) there are 20 times more annotated nonenhancers than enhancer elements. In randomly drawn test sets with only 5% true enhancers, we find that our fitted model recovers 90% of enhancers with 60% precision. However, our prediction accuracy is likely considerably higher than this analysis implies due to an abundance of false-negatives in the high-throughput Kvon et al. (34) annotations. Manual reexamination of their reporter gene-expression image data for the 100 genomic regions that our method most highly predicted to be enhancers, but which were reported as nonenhancers, revealed that only 15 were true nonenhancers, 47 were clearly enhancers, and the remainder could not be classified due to insufficient data, specifically the lack of embryos of the appropriate stage in the high-throughput images (Fig. 3).

Fig. 3.

Examples of reporter gene-expression patterns driven by (A) class I enhancers, (B) class II enhancers, and (C) genome regions misclassified by Kvon et al. (34) as nonenhancers in stages 4–6. Magnification is 20× and the embryos are 0.5 mm in length on average.

Genome-Wide Analysis to Identify Active Enhancers in the Early Embryo.

Given the high accuracy of the model on our training and held-out datasets, a genome-wide search for class I enhancers was feasible. RF was therefore used to predict enhancer probability on a computationally segmented genome. More than 82% of all segments had less than 0.01 probability of being an enhancer, and more than 93% had less than 0.1 probability (Fig. 4). Although this is difficult to see at first because the histogram is dominated by a peak between probabilities 0 and 0.01, the histogram is in fact bimodal (Fig. 4, Inset), with a secondary peak around P = 0.95. To call enhancers, a threshold of P > 0.75 was established that covers ∼1.6% of the genome and rediscovers de novo 98% of the training set. Of the 1,640 class I enhancers predicted, 1,174 do not overlap with training data; 364 overlap known cis-regulatory modules in REDfly, a database of enhancers discovered in other studies (50–52); and 822 are completely novel. The prediction list can be viewed at genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hubUrl=https://sina.lbl.gov/seqdata/ucsc/EnhancerPrediction/hub.txt (53) and in Dataset S2.

Fig. 4.

(A) Histogram of RF predicted enhancer probabilities for the entire genome. While >82% of the genome has P < 0.01, a secondary peak can be seen at P ∼ 0.95 (Inset). (B–F) As validation, predicted enhancers were inserted into the Drosophila genome and were found to drive spatial expression. (G and H) Two enhancers, CEP01219 and CEP01220, are predicted proximal to the comm2 gene. Each of their patterns is a component of the comm2 expression pattern (I). (J) The genomic region of the two predicted enhancers is shown, along with the raw prediction track showing the predicted probability of enhancer activity with 100-bp resolution and the sum of TF binding ChIP scores at the same resolution. Magnification is 20×, and the embryos are 0.5 mm in length on average.

Validation of Predicted Enhancer Set in New Experiments.

To validate the prediction precision, an in vivo reporter expression-driving test was conducted. Five, 11, and 17 genomic regions were selected with probability scores corresponding to estimated (cross-validation) FDRs of 4%, 25%, and 50%, respectively. We tested down the rank list to enable estimation of the overall FDR of the entire set of predicted enhancers, not just the top predictions. Test regions were cloned into the pBPGUW expression vector, then injected into flies using the attP integration system (54). All but three of the tested regions, including all but two of those predicted to be in the 50% FDR region, were found to be enhancers (Fig. 4). We thus needed to adjust our FDR estimation; assuming a Poisson distribution we obtained a maximum likelihood estimate (MLE) of 12.5% FDR at our previous cross-validation–based 50% FDR threshold. For our entire collection of 1,640 predicted class I enhancers, we estimate an overall FDR of 13.58%. An interesting example and validation for the use of transcription factors to separate proximal enhancers can be seen in two predicted segments (CEP01219 and CEP01220) proximal to the comm2 gene (Fig. 4). comm2 encodes an important protein required for proper axon guidance across the embryonic midline (55, 56). The combined expression patterns of the two predicted enhancers (Fig. 4) match the more complicated expression pattern of comm2 (Fig. 4).

Segmentation Driving Enhancers.

We next sought to understand if the separation of the enhancers into two classes in our feature space is related to their biology. In a detailed quantification of images of embryonic reporter gene-expression patterns for 85 randomly selected class I and 82 randomly selected class II enhancers, class II enhancers tended to be expressed in a smaller percentage of nuclei. Of class II enhancers, 74% are expressed in ≤15% of cells versus only 33% of class I. While separation by this criterion is not complete, it is unlikely that these differences in expression are due to chance (P < 10^-7). In addition, we find that class I enhancers are more likely to remain active throughout embryogenesis and show a significant enrichment for expression in A-P stripes, posterior, or gap gene-like patterns (P < 10^-4). Gene ontology (GO)-term analysis of the genes proximal to class I enhancers also showed a highly significant enrichment of terms related to segmentation (Fig. 5), while those of class II enhancers showed much lower enrichment for any GO terms and no significant enrichment for any particular pathway (Fig. 5). We therefore hypothesize that class I enhancers are likely to drive expression patterns needed for establishing the segmented body plan. We thus term class I and class II enhancers SDEs and non-SDEs, respectively. We note that while the differences between these two classes are significant, there is not a clear separation in function because a minority of non-SDEs direct patterns of expression that resemble those of SDEs (Fig. 3).

Fig. 5.

The significance (measured as the negative log of the P value) of GO-term enrichment in genes proximal to class I enhancers is very high in terms associated with development and segmentation (SDEs, yellow). For class II enhancers, no significant GO-term enrichment (P value below 10^-5) is found (non-SDEs, blue).

Feature Importance Is Dominated by TFs.

The RF importance measures “mean decrease accuracy” and “mean decrease Gini” (49) varied between bootstrap repetitions, but in all cases a small set of TFs was found near the top of the importance ranking list. This can be seen by the spread of the bootstrap confidence interval of these two importance measures calculated in 50,000 trees (Fig. 6). ChIP binding scores for several TFs (KR, MED, TWI, DL, D) were among the most important predictors by both measures. The sum of all TF binding scores, or TFsum, was also an important predictor. Other TFs, such as BCD and FTZ (33), were uninformative despite their importance for embryo segmentation. This can be at least partially explained by low coverage in the ChIP-chip data, as there is a clear correlation (r = 0.7) between coverage and importance measure. The only histone mark to have an importance above random noise was H3K4 monomethylation (H3K4me1). The other histones and histone modifications we measured, including H3K27ac, which has been widely regarded as a key indicator of enhancer regions (18, 57), were found uninformative by the model in the presence of the TF data.

Fig. 6.

(A) Feature importance is dominated by transcription factors, with H3K4me1 the only histone mark in the top 25. (B–F) “Local importance” measurements of randomly selected segments indicating how important each feature was in the segment classification when the forest was trained on (B) SDEs vs. nonenhancers, (C) SDEs vs. non-SDEs, (D) non-SDEs vs. nonenhancers, (E) SDEs and non-SDEs vs. nonenhancers, and (F) SDEs vs. non-SDEs and nonenhancers. Feature order (x axis) can be found in .

Localized Feature-Importance Measures Support Two-Class Structure.

RF’s “local importance” (49) is a third measure that provides a detailed determination of the importance of each feature in classifying each instance, allowing a more direct understanding of the RF decision-making process (Fig. 6). This measure shows that the same small set of features is used to distinguish SDEs and nonenhancers (Fig. 6) as is used to distinguish SDEs from non-SDEs (Fig. 6), while an attempt to separate non-SDEs and nonenhancers (Fig. 6) shows that no variable can consistently be used and that many more parameters are employed. The increase in features used and the blurring of decision criteria is also seen when non-SDEs are presented to RF as enhancers (Fig. 6) rather than as nonenhancers (Fig. 6). Spectral clustering is a technique that relies on the eigenvectors of the similarity matrix or the Laplacian thereof, usually followed by k-nearest neighbors or k-means clustering (58). It is an efficient means of dimension reduction, and the number of clusters in the data can often be inferred from the eigenvalues. Applying spectral clustering to an affinity matrix computed from the local importance values (seventh-nearest neighbor of a Euclidean distance matrix calculated with a Gaussian kernel) yields a good separation of the data, with a sharp jump after the second eigenvalue, consistent with the presence of a two-class structure.
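For readers unfamiliar with the eigenvalue criterion, one common formulation is sketched below. It is an assumption that the seventh-nearest-neighbor description corresponds to a locally scaled Gaussian kernel and that the normalized graph Laplacian was used; the source does not give the formulas, so the symbols here (d_ij, sigma_i, A, D, L, lambda_k) are illustrative rather than the authors' notation.

```latex
% d_{ij}: Euclidean distance between the local-importance vectors of segments i and j
% x_i^{(7)}: the seventh-nearest neighbor of x_i (assumed local scale)
\begin{align}
A_{ij} &= \exp\!\left(-\frac{d_{ij}^{2}}{\sigma_i\,\sigma_j}\right),
  \qquad \sigma_i = d\bigl(x_i, x_i^{(7)}\bigr), \\
L &= I - D^{-1/2} A\, D^{-1/2},
  \qquad D_{ii} = \sum_{j} A_{ij}, \\
\hat{k} &= \arg\max_{k}\,\bigl(\lambda_{k+1} - \lambda_{k}\bigr),
  \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n .
\end{align}
```

Under this reading, the eigenvalues of L are sorted in ascending order and the largest gap marks the estimated number of clusters, so a sharp jump after the second eigenvalue corresponds to an estimate of two.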

Discussion

The identification of enhancer elements from genomics data has remained a challenging problem, in part due to the relative scarcity of enhancers in genome sequences versus nonenhancer sequences. As illustrated in Fig. 2, even an incisive enhancer prediction algorithm fitted on balanced training data (i.e., a training set with nearly equal numbers of positive and negative elements) is likely to generate high FDRs when tested on a genome-wide scan. Hence, to accurately discover enhancer elements using in silico techniques, extremely high-fidelity models are needed. Although high-precision predictions were reported previously, the validation methods and measures used in the literature varied greatly. Many papers defined success as the colocation of data for epigenetic marks, such as EP300 and H3K27ac, but it is yet to be established that these marks are exclusive to enhancers or that all enhancers possess them. Indeed, we report here a class of H3K27ac-free enhancers. Other reports tested for functional enhancer activity of genomic regions from the top of a rank list (17), which gives a biased estimate of the overall prediction accuracy. We suggest that precision must be measured by testing throughout the prediction rank list to establish a uniform, unbiased measure of success for entire prediction sets. We found that the prediction of enhancer elements en masse was vexed by heterogeneity among enhancer elements. For about half of previously validated enhancer elements, strong TF in vivo DNA binding signals for multiple factors are indicative of enhancer activity. The remaining half of validated elements is typically bound more weakly by fewer TFs (Fig. 1). For this latter set, the residual TF binding signal is only weakly associated with enhancer activity. That is, a prediction engine that works extremely well on one class tends to fail on the other. 
We posit that this challenge, heterogeneity in element classes, is widespread and foundational in genomics. For example, the emerging literature on “chromatin priming elements” (59) demonstrates the existence of “enhancer-like” functional elements that, while they share chromatin structure and similar patterns of TF occupancies with enhancers, do not themselves drive patterned expression; rather, they establish chromatin context that subsequently gives rise to enhancer activity for proximal elements. It may be that the class of elements we presently denote “enhancers” is in fact diverse, admitting elements that exert regulatory effects through a variety of underlying molecular mechanisms. Indeed, it remains unclear what fraction of enhancers require eRNAs for their activity (59), or whether priming elements are transcribed like many enhancers. It may also be that the non-SDEs or class II enhancers we studied here are simply regulated by cohorts of TFs we have yet to survey. These enhancers, however, are often expressed in a smaller fraction of the embryo than SDEs and have lower whole-embryo average DNase I-seq and TF ChIP binding scores, consistent with them being accessible in only a small subset of cells. This would thus make non-SDEs less amenable to interrogation through whole-animal ChIP-seq. The differences between the two enhancer classes are statistical rather than categorical. While there is a significant enrichment in segmentation GO-terms in SDEs compared with non-SDEs, some non-SDEs also display segmentally repeating expression patterns similar to those of SDEs, and many (20%) show activity above the 15% expression area threshold. Finally, 8.5% of non-SDEs were active in as much or more of the embryo as the median for SDEs. Our validation assays revealed that cross-validation had led to significant overestimation of the FDR for SDEs. 
We attribute this to an abundance of false-negatives in our training set, as our analysis indicated that ∼10% of negatives are erroneously labeled, which would double the number of positives and explain the significant decrease in validated FDRs we observed. An alternative explanation for the discrepancy is selection bias in our training set, as the genomic regions tested in the previous studies were not selected at random. Thus, it is possible that there is a stronger separation of features when the complete set of genomic enhancers is considered. Overall, we recover 98% of the training-set SDEs with an estimated FDR of less than 15%, indicating that our genome-wide predicted catalog of these elements may be close to comprehensive. Further experiments, particularly concentrated at high FDRs, are needed to better assess the boundary between functional and nonfunctional elements. It will also be important to assess the impact of the minimal promoter elements selected for these screens to study promoter–enhancer interactions, as was recently done in Arnold et al. (60), who found that different minimal promoters respond differently to the same enhancer elements. It may be that many of our remaining false-positives are in fact false-negatives in the enhancer screens due to mismatch between the putative enhancer element and our minimal promoter. At this time, it appears that at least 1,600 elements, composing more than 1.6% of the Drosophila genome, are involved in establishing early body patterning in the blastoderm.

Materials and Methods

Data Acquisition and Processing: TF Binding Data and Pol II Data.

Transcription factor ChIP-chip data at a 25% FDR threshold were taken from the Berkeley Drosophila Transcription Network Project (BDTNP; available at genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr2L%3A826001-851000&hgsid=699690721_K8EWMPlLw7903qNMAaAqfOfBnHcn), containing data for 22 transcription factors: BCD, CAD, D, DA, DL, FTZ, GT, H, HB, HKB, KNI, KR, MAD, MED, PRD, RUN, SHN, SLP, SNA, TLL, TWI, Z, some with biological replicates, to give 34 tracks. ChIP-chip data for Pol II binding at a 1% FDR threshold were also taken from BDTNP (37–39).

Data Acquisition and Processing: Histone Data.

Histone modification and ZLD ChIP-seq data were retrieved from the University of California, Santa Cruz (UCSC) genome browser track provided by Li et al. (40). Histone modification data collected in the Zld mutant strain were not used; all other tracks were included in the analysis.

DNase Accessibility, Conservation Scores, and Gene Annotations.

DNase accessibility data and 12-fly phastCons conservation scores were obtained from the UCSC genome browser (61, 62), and conservation scores were computed as the maximum of the mean conservation score over 200-nt windows across the element. Note that the study was repeated varying the window size from 1 nt (simple maximum over the enhancer) to 300 nt (in 50-nt steps), but our analysis was insensitive to how the conservation data were processed. Conservation scores did not constitute an important predictor under any scoring protocol that we attempted. FlyBase gene data for exon, coding exon, and intron locations (63) were also downloaded from UCSC. Bidirectional RNA transcript data were obtained from Nechaev et al. (64) and analyzed as described in Andersson et al. (65). The transcription start site (TSS) was taken as the start of the first exon in FlyBase’s mRNA data, described above.
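The conservation score described above (the maximum, over the element, of the mean phastCons score in a fixed-width window) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name `max_windowed_mean` is hypothetical, the per-base scores are assumed to be available as a plain list, and the fallback to the whole-element mean for elements shorter than the window is our assumption.

```python
def max_windowed_mean(scores, window=200):
    """Largest mean over any contiguous run of `window` per-base scores.

    Assumption: if the element is shorter than `window`, the mean over the
    whole element is returned (the source does not specify this case).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("empty score track")
    w = min(window, n)
    running = sum(scores[:w])      # sum over the first window
    best = running
    for i in range(w, n):          # slide the window one base at a time
        running += scores[i] - scores[i - w]
        if running > best:
            best = running
    return best / w
```

Setting `window=1` reduces the score to the simple maximum over the element, which is the smallest window size mentioned in the text.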

Defining Predictors.

Although 80% of the DNA segments in the training set were between 2 and 2.5 kb long, segment sizes varied from 100 bp to 4.5 kb, and the fraction of each segment occupied by the enhancer is unknown, making averages a biased estimator. Thus, the maximum of each ChIP track was calculated over every segment in the training set and in the segmented genome, using bedtools and the UCSC genome browser utilities, for TF data, histones, conservation scores, and DNase accessibility. In addition, the sum of TF biological replicates and the sum of all TF tracks were calculated and included as features in the model. For the higher-resolution ZLD ChIP-seq data and for the phastCons conservation scores, in addition to the maximum score we also calculated the average over the segment, the maximal score over sliding windows of 200, 500, and 1,000 bp, and the longest continuous stretch of scores above the 0.85 quantile. For the gene data, bedtools coverage was used to calculate the percentage of each segment covered by exons, coding exons, or introns, and three binary tracks indicated the presence or absence of introns and exons. Bedtools was also used to calculate the distance to the closest TSS and to the closest Pol II binding peak.
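Several of the per-segment summaries described above (maximum, average, and longest continuous stretch above a quantile threshold) can be illustrated with a short sketch. The function names are hypothetical, and the text does not state over which distribution the 0.85 quantile was taken, so the threshold is passed in explicitly here; the `quantile` helper uses linear interpolation, the same rule as R's default.

```python
def quantile(values, q):
    """Linear-interpolation quantile (same rule as R's default, type 7)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def longest_run_above(scores, threshold):
    """Length of the longest continuous stretch of scores above threshold."""
    best = run = 0
    for s in scores:
        run = run + 1 if s > threshold else 0
        best = max(best, run)
    return best


def segment_features(scores, threshold):
    """Summaries of one score track over one segment (illustrative only)."""
    return {
        "max": max(scores),
        "mean": sum(scores) / len(scores),
        "longest_run": longest_run_above(scores, threshold),
    }
```

In use, the threshold might be computed once from a reference set of scores with `quantile(reference, 0.85)` and then applied to every segment.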

Modeling.

RFs were modeled in R (66) using the randomForest package (48). Initial feature-set culling was based on the average error rate of 1,000 forests of 500 trees when excluding or adding one feature at a time. Our training data are highly unbalanced, with only 10% of segments being enhancers. To improve RF performance, balanced samples were used as training sets. To improve the stability of the predictions, and to counteract the sampling variability introduced by balancing the training set, we relied on forest voting. One thousand forests of 50 trees each were trained on randomly selected sets of 300 enhancers and 300 nonenhancers, with 10% of the data held out of the samples and used as test sets. We then estimated the probability that a given segment will function as an enhancer in our assay as the fraction of trees predicting that the segment is an enhancer. This was repeated until such a score was computed for each segment in the set. The same sampling and testing scheme was employed for logistic regression and naïve Bayes (67). Importance measures varied from sample to sample, and averages required 10,000 forests of 50 trees to converge. To increase the stability of the importance measures, the mean decrease in accuracy and the mean decrease in Gini index were averaged over 50,000 RFs and used to derive confidence intervals for RF importance. For local importance calculations, we used a single forest of 50,000 trees produced using all enhancers and a balanced nonenhancer subsample.
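The balanced sampling and voting scheme can be sketched independently of the learner. The code below is an illustration under stated assumptions, not the authors' R code: `balanced_vote_scores` and `fit_midpoint` are hypothetical names, the learner is a deliberately trivial one-feature threshold standing in for a random forest, and each fitted model casts one vote per held-out segment (in the source, votes are counted per tree).

```python
import random


def fit_midpoint(train_x, train_y):
    """Toy stand-in learner: threshold one feature at the midpoint of the
    two class means. A real analysis would fit a random forest here."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def balanced_vote_scores(x, y, fit, rounds=200, per_class=300,
                         holdout=0.10, seed=0):
    """Score each item by the fraction of models that call it an enhancer,
    counting only rounds in which the item was held out of training."""
    rng = random.Random(seed)
    n = len(x)
    votes, seen = [0] * n, [0] * n
    for _ in range(rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        n_test = max(1, int(round(holdout * n)))
        test, rest = idx[:n_test], idx[n_test:]
        pos = [i for i in rest if y[i] == 1]
        neg = [i for i in rest if y[i] == 0]
        k = min(per_class, len(pos), len(neg))   # equal class sizes
        if k == 0:
            continue
        train = rng.sample(pos, k) + rng.sample(neg, k)
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            votes[i] += predict(x[i])
            seen[i] += 1
    # None marks items that were never held out in any round
    return [votes[i] / seen[i] if seen[i] else None for i in range(n)]
```

Because every score comes only from models that did not see the segment during training, the resulting fractions behave like held-out predictions even though each segment is used for training in most rounds.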

Analyses.

ROC curve areas were calculated with the R package PRROC (68). PCA was done using prcomp (66). GO term analysis used bedtools (69) to find FlyBase genes located inside training enhancer regions, or to identify the closest genes if none were overlapping. The DAVID bioinformatics resource (70, 71) was used to find GO terms and quantify GO-term enrichment, with the full set of ∼8,000 genomic regions as the genomic background. To find the affinity matrix of the data, we converted Euclidean distances into a similarity matrix and calculated the seven nearest neighbors for each segment. Spectral clustering and eigenvalue extraction were done using kknn (72) with default settings. We used a masked strategy to assess expression size and pattern on an unannotated, randomly ordered set of both enhancer classes.
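One way to build a seven-nearest-neighbor affinity matrix from Euclidean distances is sketched below. The text does not specify how distances were converted to similarities, so the Gaussian kernel and the median-distance bandwidth used here are assumptions, as is the name `knn_affinity`; this is not a description of the internals of kknn.

```python
import math


def knn_affinity(points, k=7):
    """Symmetric k-nearest-neighbor affinity matrix.

    Assumptions: similarity = exp(-d^2 / (2 * sigma^2)), with sigma set to
    the median pairwise distance; an edge is kept if either endpoint lists
    the other among its k nearest neighbors.
    """
    n = len(points)
    aff = [[0.0] * n for _ in range(n)]
    if n < 2:
        return aff
    dist = [[math.dist(p, q) for q in points] for p in points]
    pairs = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    sigma = pairs[len(pairs) // 2] or 1.0
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        for j in order[:k]:
            w = math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
            aff[i][j] = max(aff[i][j], w)
            aff[j][i] = max(aff[j][i], w)   # keep the matrix symmetric
    return aff
```

The resulting sparse, symmetric matrix is the usual input to spectral clustering, where the eigenvalues of the associated graph Laplacian suggest the number of clusters.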

Genome-Wide Prediction.

A sliding window of 1,000 bp with a 100-bp step was used to create overlapping bins across the entire Drosophila genome. As above, we used an ensemble of RFs (1,000 forests each composed of 500 trees) trained on SDEs and nonenhancers only. As above, training sets included a random sample of 300 SDEs and 300 nonenhancers. We then generated genome-wide predictions as follows: for each 1,000-bp segment in the genome, we computed the percentage of trees (across all forests) identifying the segment as an SDE, a number that we interpret as an estimate of the probability that the given segment is an SDE. Note that each 100-bp segment in the genome occurs in ten 1,000-bp windows. Hence, for each 100-bp segment in the genome, we have ten predicted probabilities, one for each of the 1,000-bp windows in which it is included. We define our estimate of the probability that a given 100-bp segment is part of an SDE as the mean of the estimated SDE probabilities of the overlapping 1,000-bp windows. We defined a threshold of 0.75 for 100-bp segments; all segments with predicted probabilities greater than 0.75 were labeled as part of enhancers, and all other segments were labeled as nonenhancer regions. Adjacent 100-bp segments above this threshold were merged into larger enhancer elements. For predicted SDEs longer than 1.5 kb, we attempted to refine our resolution by leveraging the TF binding data. Specifically, we looked for distinct peaks in the TFsum predictor as follows: the mean of the TFsum track was calculated for each 100-bp window; we then computed the numerical second derivative along the SDE to find extremum points and thus call peaks in the data. Peaks below the noise threshold were removed, and peaks closer than 200 bp were merged. If more than one peak remained, the minimum between adjacent peaks was used to separate the longer predicted enhancers. We call this set of elements our “preliminary predicted SDE” (PPSDE) set. 
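The step from overlapping-window scores to merged enhancer calls can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the window scores are given in genomic order, one per 100-bp step, and it counts how many windows cover each bin directly rather than assuming a fixed number, so bins at the ends of a chromosome are handled correctly.

```python
def bin_probabilities(window_probs, bins_per_window=10):
    """Average overlapping-window scores down to single bins.

    window_probs[i] is the score of the window that starts at bin i
    (windows advance one bin at a time). A 1,000-bp window with a 100-bp
    step corresponds to bins_per_window = 10.
    """
    n_bins = len(window_probs) + bins_per_window - 1
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for start, p in enumerate(window_probs):
        for b in range(start, start + bins_per_window):
            totals[b] += p
            counts[b] += 1
    return [t / c for t, c in zip(totals, counts)]


def merge_calls(bin_probs, threshold=0.75, bin_size=100):
    """Merge adjacent bins above threshold into (start_bp, end_bp) calls."""
    calls, start = [], None
    for i, p in enumerate(bin_probs):
        if p > threshold:
            if start is None:
                start = i              # open a new call
        elif start is not None:
            calls.append((start * bin_size, i * bin_size))
            start = None               # close the current call
    if start is not None:
        calls.append((start * bin_size, len(bin_probs) * bin_size))
    return calls
```

Coordinates are half-open and relative to the first bin; converting them to chromosome coordinates only requires adding the offset of the first window.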
Finally, we reran the ensemble of RFs across all PPSDEs and computed estimates of the probability that each PPSDE is an SDE. To obtain a new maximum-likelihood estimate (MLE) of the FDR and its confidence interval, we considered the probability of being an enhancer calculated on our training set. By considering the FDR in each of several short probability threshold regions in the training set, and assuming a Poisson distribution for the false discoveries, we calculated the MLE of the FDR in those regions. The center of each probability threshold region was assigned that region's FDR, and we further included 100% FDR at 0 probability as an additional data point. A second-order polynomial was fitted to these data points, so that an FDR can be calculated at each probability level. The fitted polynomial was evaluated at the predicted probability of each of the 1,640 predicted enhancers, and the averages were taken as the predicted FDR and its confidence interval.
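The FDR calibration can be sketched as a small least-squares problem. The code below is our reading of the procedure, not the authors' implementation: all function names are hypothetical, the per-region FDR is taken as false discoveries divided by calls (the MLE of a Poisson rate with the number of calls as exposure), fitted values are clipped to the interval from 0 to 1 as an added safeguard, and the confidence interval calculation is omitted.

```python
def binned_fdr_points(regions):
    """regions: (low, high, n_calls, n_false) per probability range."""
    xs, ys = [0.0], [1.0]              # anchor: 100% FDR at probability 0
    for low, high, n_calls, n_false in regions:
        xs.append((low + high) / 2)    # center of the probability range
        ys.append(n_false / n_calls)   # Poisson-rate MLE: count / exposure
    return xs, ys


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):               # Gauss-Jordan with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def mean_predicted_fdr(coeffs, probabilities):
    """Average fitted FDR over the predicted elements, clipped to [0, 1]."""
    a, b, c = coeffs
    fitted = [min(1.0, max(0.0, a + b * p + c * p * p))
              for p in probabilities]
    return sum(fitted) / len(fitted)
```

A typical call chain would be `xs, ys = binned_fdr_points(regions)`, then `coeffs = fit_quadratic(xs, ys)`, then `mean_predicted_fdr(coeffs, predicted_probabilities)`.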

PCR of Fragments from Genomic DNA and Cloning into the Gateway Vector.

PfuUltra High-Fidelity DNA Polymerase (Stratagene) or EasyA DNA Polymerase (Agilent) was used to amplify selected fragments (see above), using isogenic genomic DNA from y; cn bw sp (73) as a template. The PCR products were confirmed by agarose gel analysis and purified using the QIAquick PCR Purification Kit (Qiagen). For cloning, 3′ A-overhangs were added to the PCR products produced with PfuUltra High-Fidelity DNA Polymerase (A-overhangs were not added to the products produced with EasyA DNA Polymerase) by incubation with dATP and Taq polymerase for 10 min at 72 °C before Qiagen purification. The products (9.5 μL of each) were used in a TA TOPO cloning reaction with pCR8/GW/TOPO, as described by the manufacturer (Invitrogen). Cloning reactions were allowed to proceed for 30 min at room temperature, and then 2 μL of each reaction was used to transform Mach1 cells (Invitrogen). For each cloning reaction, two isolates were picked, purified, and confirmed by sequence verification.

Sequence Verification of Clones.

Two Gateway clones were picked for each enhancer fragment, for a total of 78 processed clones. Sequencing primers M13 forward -20 (5′ GTAAAACGACGGCCA 3′; Invitrogen) and M13 reverse (5′ GGAAACAGCTATGACCATG 3′; Invitrogen) were used to generate sequences to verify targets. One clone was identified and selected for future studies.

Transfer of Gateway Clones into Integration Vectors.

Thirty-seven nanograms of the destination vector, pBPGUw, were combined with 37.5 ng of DNA carrying a PCR fragment cloned in the Gateway vector in an LR reaction (Invitrogen) and incubated overnight at room temperature. TAM1 cells (Invitrogen) were transformed with 2.5 μL of the LR reaction and plated. A single isolate from each reaction was picked into a 96-well Beckman Deepwell block and allowed to grow overnight at 37 °C, and DNA was prepared using the PerfectPrep kit (5 PRIME). The constructs were verified by analysis of restriction enzyme digests. A second isolate was picked in cases where there was a discrepancy between the observed and expected results. DNA for injection was prepared from 7 mL of overnight culture for production of transgenic flies.

Drosophila Genetics.

DNA constructs (100–200 ng/μL) were microinjected into embryos derived from parents homozygous for both the attP2 integration site (74) and a fusion gene encoding the PhiC31 integrase under the control of the nanos promoter (nos–integ), which provides a maternal source of integrase (75). Single males derived from these embryos were crossed to y w;Sco/CyO females, and males carrying the inserted construct (identified by their w+ eye color) were selected; integrase is removed in this step. These males were crossed to w[1118]/Dp(1;Y)y[+]; TM2/TM6C, Sb[1] (Bl stock #5906) females to establish balanced, homozygous stocks. We obtained 48 transgenic lines containing predicted enhancer elements, called CEPs.

Verification of Insertion Site and Fragment Identity by Genomic PCR.

To verify the identity of transformant flies and to confirm that all integration events occurred at the attP2 site, we performed genomic PCR on DNA isolated from homozygous transformant flies. Twenty flies were homogenized and genomic DNA was isolated using the ZR Genomic DNA II Kit (Zymo Research). To assay proper integration in the attP2 landing site, PCR was performed using a primer from the y gene marker in the attP2 genomic docking site (TCATGACTTTGTTGCCTTAGA) and a reverse primer from the w gene (CGAAAGAGACGGCGATATT) carried in the constructs. Only proper integration events yield a product of 1,839 bp, because y and w lie more than 2 Mb apart in the Drosophila genome. Fragment identity was confirmed using a vector-specific primer (ACAAGTTTGTACAAAAAAGCAGGCT) and a reverse primer specific to the cloned fragment being tested for enhancer activity; the position of the fragment-specific primer was chosen so as to yield a PCR product of 350–400 bp.

Embryo Whole-Mount in Situ Hybridizations.

Embryos were collected directly from the homozygous stock. Embryonic whole-mount in situ RNA hybridizations were performed as described previously (76). A summary is shown in .

67 in total

1. Activation of transcription in Drosophila embryos is a gradual process mediated by the nucleocytoplasmic ratio.

Authors: D K Pritchard; G Schubiger
Journal: Genes Dev Date: 1996-05-01 Impact factor: 11.361

2. Genome-scale functional characterization of Drosophila developmental enhancers in vivo.

Authors: Evgeny Z Kvon; Tomas Kazmar; Gerald Stampfel; J Omar Yáñez-Cuna; Michaela Pagani; Katharina Schernhuber; Barry J Dickson; Alexander Stark
Journal: Nature Date: 2014-06-01 Impact factor: 49.962

3. Maternal-Zygotic Gene Interactions during Formation of the Dorsoventral Pattern in Drosophila Embryos.

Authors: P Simpson
Journal: Genetics Date: 1983-11 Impact factor: 4.562

4. A Hidden Markov Model approach to variation among sites in rate of evolution.

Authors: J Felsenstein; G A Churchill
Journal: Mol Biol Evol Date: 1996-01 Impact factor: 16.240

5. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors: Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal: Genome Res Date: 2005-07-15 Impact factor: 9.043

6. Construction of transgenic Drosophila by using the site-specific integrase from phage phiC31.

Authors: Amy C Groth; Matthew Fish; Roel Nusse; Michele P Calos
Journal: Genetics Date: 2004-04 Impact factor: 4.562

7. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone.

Authors: Bite Yang; Feng Liu; Chao Ren; Zhangyi Ouyang; Ziwei Xie; Xiaochen Bo; Wenjie Shu
Journal: Bioinformatics Date: 2017-07-01 Impact factor: 6.937

8. High-throughput functional testing of ENCODE segmentation predictions.

Authors: Jamie C Kwasnieski; Christopher Fiore; Hemangi G Chaudhari; Barak A Cohen
Journal: Genome Res Date: 2014-07-17 Impact factor: 9.043

9. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm.

Authors: Xiao-yong Li; Stewart MacArthur; Richard Bourgon; David Nix; Daniel A Pollard; Venky N Iyer; Aaron Hechmer; Lisa Simirenko; Mark Stapleton; Cris L Luengo Hendriks; Hou Cheng Chu; Nobuo Ogawa; William Inwood; Victor Sementchenko; Amy Beaton; Richard Weiszmann; Susan E Celniker; David W Knowles; Tom Gingeras; Terence P Speed; Michael B Eisen; Mark D Biggin
Journal: PLoS Biol Date: 2008-02 Impact factor: 8.029

10. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser.

Authors: Brian J Raney; Timothy R Dreszer; Galt P Barber; Hiram Clawson; Pauline A Fujita; Ting Wang; Ngan Nguyen; Benedict Paten; Ann S Zweig; Donna Karolchik; W James Kent
Journal: Bioinformatics Date: 2013-11-13 Impact factor: 6.937
