Hamutal Arbel, Sumanta Basu, William W Fisher, Ann S Hammonds, Kenneth H Wan, Soo Park, Richard Weiszmann, Benjamin W Booth, Soile V Keranen, Clara Henriquez, Omid Shams Solari, Peter J Bickel, Mark D Biggin, Susan E Celniker, James B Brown.
Abstract
Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed that >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.
Keywords: Drosophila; embryo development; enhancers; machine learning; random forests
Year: 2018 PMID: 30598455 PMCID: PMC6338827 DOI: 10.1073/pnas.1808833115
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Summary of features used for prediction
| Category | Features included |
| --- | --- |
| Histone and histone modifications | H3, H3K18ac, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H4K5ac, H4K8ac |
| AP regulatory transcription factors | BCD, CAD, GT, HB, KNI, KR, HKB, TLL, D, FTZ, PRD, RUN, SLP |
| DV regulatory transcription factors | DA, DL, MAD, MED, SHN, SNA, TWI |
| Ubiquitous transcription factors | Z, ZLD, sum of all transcription factor scores |
| DNA data | Conservation, DNA accessibility, distance to Pol. II, distance to TSS, bidirectional-RNA transcription |
| Exon/intron data | Exons, coding exons and introns coverage/presence |
Fig. 1. (A) RF ROC curves for the complete dataset of 7,987 previously validated genomic regions (blue) show mediocre performance, with an AUC of 0.83. When only class I enhancers and nonenhancers are used for training, the predictive power rises sharply, to an AUC of 0.99 (yellow). When only class II enhancers and nonenhancers are used, the result is close to a random guess (gray). When predicting the class I enhancer set, the ROC curves for RFs, logistic regression, and a naïve Bayes classifier nearly overlap. (B) This can be explained by the colocalization of class II enhancers and nonenhancers in a PCA projection. (C) The separation is mainly driven by TFs, as exemplified by the normalized ChIP strength across features of 200 randomly selected class I and class II enhancers.
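The AUC values quoted in this caption can be read as the probability that a randomly chosen enhancer receives a higher score than a randomly chosen nonenhancer. The following is a minimal sketch of that rank-based definition, not the authors' pipeline; the function name `roc_auc` and the toy scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This is the pairwise (Mann-Whitney) definition of the ROC AUC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = enhancer, 0 = nonenhancer.
labels = [1, 1, 1, 0, 0, 0]
separable_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # classes fully separated
mixed_scores = [0.9, 0.2, 0.5, 0.8, 0.5, 0.1]      # classes overlap

auc_separable = roc_auc(labels, separable_scores)
auc_mixed = roc_auc(labels, mixed_scores)
```

A fully separated score distribution gives an AUC of 1, while overlapping distributions pull the value toward 0.5, the level of a random guess.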
Fig. 2. False-positive rate is a function of method accuracy and imbalance in the test data. (A) A 3D surface plot shows a sharp increase in the test-set false-positive rate as either the training-set false-positive rate or the fraction of nonenhancer regions in the test set increases. This shows that in genomic settings, where the imbalance cannot be controlled, a very high degree of accuracy is required. (B and C) Two-dimensional plots of the marginals of the 3D image in A, demonstrating the sharp rise in test inaccuracy with either a higher false-positive rate in the training set or dilution of the enhancer class in the test set.
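The dependence described in this caption can be illustrated with a short calculation. The sketch below assumes that the "test-set false-positive rate" means the fraction of predicted enhancers that are actually nonenhancers, and that the classifier's per-class error rates carry over unchanged from a balanced set to the diluted one. The function name and all numeric values are hypothetical, not taken from the paper.

```python
def false_fraction_among_predictions(fpr, tpr, nonenhancer_fraction):
    """Fraction of positive calls that are wrong, given class imbalance.

    fpr: probability that a nonenhancer is called an enhancer
    tpr: probability that a true enhancer is called an enhancer
    nonenhancer_fraction: share of nonenhancer regions in the test set
    """
    false_calls = fpr * nonenhancer_fraction
    true_calls = tpr * (1.0 - nonenhancer_fraction)
    if false_calls + true_calls == 0:
        return 0.0  # no positive calls at all
    return false_calls / (false_calls + true_calls)


# Hypothetical classifier: 1% of nonenhancers and 98% of enhancers are called.
balanced = false_fraction_among_predictions(0.01, 0.98, 0.50)
diluted = false_fraction_among_predictions(0.01, 0.98, 0.99)
```

Under these assumed numbers, the same classifier that is wrong about roughly 1% of its positive calls on balanced data is wrong about roughly half of them when 99% of the regions are nonenhancers, which is why genome-wide scans demand a very low per-region false-positive rate.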
Fig. 3. Examples of reporter gene-expression patterns driven by (A) class I enhancers, (B) class II enhancers, and (C) genome regions misclassified by Kvon et al. (34) as nonenhancers in stages 4–6. Magnification is 20×, and the embryos are 0.5 mm in length on average.
Fig. 4. (A) Histogram of RF predicted enhancer probabilities for the entire genome. While >82% of the genome has P < 0.01, a secondary peak can be seen at P ∼ 0.95 (Inset). (B–F) As validation, predicted enhancers were inserted into the Drosophila genome and were found to drive spatial expression. (G and H) Two enhancers, CEP01219 and CEP01220, are predicted proximal to the comm2 gene. Each of their patterns is a component of the comm2 expression pattern (I). (J) The genomic region of the two predicted enhancers is shown, along with the raw prediction track showing the predicted probability of enhancer activity with 100-bp resolution and the sum of TF binding ChIP scores at the same resolution. Magnification is 20×, and the embryos are 0.5 mm in length on average.
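One simple way to turn a per-bin probability track like the one in panel J into discrete candidate regions is to merge consecutive bins that exceed a cutoff. The sketch below is an illustration under stated assumptions only: the 100-bp bin size comes from the caption, but the threshold, the merging rule, and the function name `call_regions` are hypothetical and are not described in this record as the authors' procedure.

```python
def call_regions(probabilities, threshold=0.5, bin_size=100):
    """Merge consecutive bins with probability >= threshold into regions.

    Returns half-open (start_bp, end_bp) intervals relative to the track start.
    """
    regions = []
    start = None  # index of the first bin of the region being built
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:  # region runs to the end of the track
        regions.append((start * bin_size, len(probabilities) * bin_size))
    return regions


# Hypothetical track of seven 100-bp bins.
track = [0.01, 0.02, 0.96, 0.97, 0.40, 0.95, 0.01]
regions = call_regions(track, threshold=0.9)
```

With this toy track, the two adjacent high-probability bins form one 200-bp region and the isolated bin forms a separate 100-bp region.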
Fig. 5. The significance (measured as the negative log of the P value) of GO-term enrichment in genes proximal to class I enhancers is very high in terms associated with development and segmentation (SDEs, yellow). For class II enhancers, no significant GO-term enrichment (P value below 10⁻⁵) is found (non-SDEs, blue).
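The significance measure in this caption can be illustrated with a one-sided hypergeometric test, a common choice for GO-term enrichment. This record does not state which test the authors used, so the test itself is an assumption, as are the base-10 logarithm, the function name, and the toy gene counts.

```python
from math import comb, log10


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement.

    population: total number of genes considered
    annotated:  genes in the population carrying the GO term
    sample:     genes proximal to the enhancer set
    hits:       genes in the sample carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes, 5 annotated, 5 sampled, 3 annotated hits.
p_value = hypergeom_enrichment_p(20, 5, 5, 3)
significance = -log10(p_value)  # larger values mean stronger enrichment
```

On this scale, the caption's cutoff of a P value below 10⁻⁵ corresponds to a significance above 5.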
Fig. 6. (A) Feature importance is dominated by transcription factors, with H3K4me1 the only histone mark in the top 25. (B–F) “Local importance” measurements of randomly selected segments indicating how important each feature was in the segment classification when the forest was trained on (B) SDEs vs. nonenhancers, (C) SDEs vs. non-SDEs, (D) non-SDEs vs. nonenhancers, (E) SDEs and non-SDEs vs. nonenhancers, and (F) SDEs vs. non-SDEs and nonenhancers. Feature order (x axis) can be found in .