Xu Min1,2, Wanwen Zeng1,3, Shengquan Chen1,3, Ning Chen1,2, Ting Chen1,2,4, Rui Jiang5,6.
Abstract
BACKGROUND: With the rapid development of deep sequencing techniques in recent years, enhancers have been systematically identified in projects such as FANTOM and ENCODE, forming genome-wide landscapes in a series of human cell lines. Nevertheless, experimental approaches remain costly and time-consuming for large-scale identification of enhancers across a variety of tissues under different disease statuses, making computational identification of enhancers indispensable.
Year: 2017 PMID: 29219068 PMCID: PMC5773911 DOI: 10.1186/s12859-017-1878-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of DeepEnhancer. A raw DNA sequence is first encoded into a binary matrix. Kernels of the first convolutional layer scan the input matrix for motifs by convolution. The subsequent max-pooling and batch normalization layers reduce dimensionality and accelerate convergence. Additional convolutional layers model the interactions between motifs detected by previous layers and extract high-level features. Fully connected layers with dropout perform nonlinear transformations, and a softmax layer finally predicts the response variable
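The encoding step in the caption can be sketched as a one-hot mapping from bases to rows of a 4 x L binary matrix. This is a minimal illustration, not the authors' code: the row order A, C, G, T and the treatment of unknown bases (an all-zero column) are assumptions, since the source only says "binary matrix".

```python
# Row order is an assumption; the source does not state it.
BASES = "ACGT"

def one_hot_encode(sequence):
    """Return a 4 x L binary matrix (list of rows), one row per base in BASES."""
    sequence = sequence.upper()
    matrix = [[0] * len(sequence) for _ in BASES]
    for position, base in enumerate(sequence):
        row = BASES.find(base)
        if row >= 0:  # assumption: unknown bases such as N stay as an all-zero column
            matrix[row][position] = 1
    return matrix

encoded = one_hot_encode("ACGTN")
for base, row in zip(BASES, encoded):
    print(base, row)
```

For a 300 bp input this yields the 4 x 300 matrix that the first convolutional layer scans.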
Different network architectures of DeepEnhancer
| Layer ID | Layer Type | Size | Output shape |
|---|---|---|---|
| 0 | Input | – | 4x1x300 |
| 1 | Conv | 128x4x1x8 | 128x1x293 |
| 2 | Batchnorm | – | 128x1x293 |
| 3 | Conv | 128x128x1x8 | 128x1x286 |
| 4 | Batchnorm | – | 128x1x286 |
| 5 | Maxpooling | 1 × 2 | 128x1x143 |
| 6 | Conv | 64x128x1x3 | 64x1x141 |
| 7 | Batchnorm | – | 64x1x141 |
| 8 | Conv | 64x64x1x3 | 64x1x139 |
| 9 | Batchnorm | – | 64x1x139 |
| 10 | Maxpooling | 1 × 2 | 64x1x69 |
| 11 | Dense | 256 | 256 |
| 12 | Dropout | – | 256 |
| 13 | Dense | 128 | 128 |
| 14 | Softmax | 2 | 2 |
The Size column gives the convolutional kernel size, the max-pooling window size, or the fully connected layer size. The Output shape column shows how the shape of the data changes as it flows through the network
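The sequence lengths in the Output shape column can be reproduced with a short calculation. The sketch below assumes unpadded, stride-1 convolutions and non-overlapping max-pooling that drops any leftover position; the source does not state these settings, but they are consistent with every row of the table. Batch normalization layers are omitted because they leave the shape unchanged.

```python
def conv_output_length(length, kernel):
    """Length after an unpadded, stride-1 convolution (assumed settings)."""
    return length - kernel + 1

def pool_output_length(length, window):
    """Length after non-overlapping max-pooling, dropping any remainder (assumed)."""
    return length // window

# (layer type, kernel or window width) in the order listed in the table
layers = [("Conv", 8), ("Conv", 8), ("Maxpooling", 2),
          ("Conv", 3), ("Conv", 3), ("Maxpooling", 2)]

length = 300  # input sequence length from layer 0
lengths = []
for kind, size in layers:
    if kind == "Conv":
        length = conv_output_length(length, size)
    else:
        length = pool_output_length(length, size)
    lengths.append(length)
    print(kind, size, length)

# Assuming the last pooling output (64 channels x 69 positions) is flattened
# before the first dense layer, that layer receives 64 * 69 features.
flattened_features = 64 * lengths[-1]
print(flattened_features)
```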
Classification performance for different network architectures
| Model | AUROC | AUPRC | Epoch Time |
|---|---|---|---|
| gkmSVM | 0.887 (0.004) | 0.899 (0.004) | 6 h (total) |
| 4conv2pool | 0.910 (0.004) | 0.915 (0.004) | 272 s |
| 4conv2pool4norm | 0.916 (0.004) | 0.917 (0.003) | 376 s |
| 4conv | 0.896 (0.005) | 0.897 (0.005) | 325 s |
| 6conv3pool | 0.898 (0.005) | 0.898 (0.006) | 251 s |
| 6conv3pool6norm | 0.911 (0.006) | 0.909 (0.005) | 415 s |
The conventional gkmSVM is used as the baseline for comparison. For each model, we carried out 10-fold cross-validation experiments. The table reports the mean AUROC and AUPRC, with the standard error in parentheses
Fig. 2 AUROCs of different methods on the permissive enhancer dataset. a: boxplot of AUROC scores. b: boxplot of AUPRC scores. Each box spans the interquartile range, the horizontal line inside each box marks the median, and the whiskers extend to the most extreme non-outlier data points
Pairwise Wilcoxon tests on AUROCs of different methods
| | gkmSVM | 4conv2pool | 4conv2pool4norm | 4conv | 6conv3pool | 6conv3pool6norm |
|---|---|---|---|---|---|---|
| gkmSVM | – | 5.1e-3 | 5.1e-3 | 5.1e-3 | 5.1e-3 | 5.1e-3 |
| 4conv2pool | – | – | 4.6e-2 | 5.1e-3 | 5.1e-3 | 9.6e-1 |
| 4conv2pool4norm | – | – | – | 5.1e-3 | 5.1e-3 | 2.8e-2 |
| 4conv | – | – | – | – | 2.4e-1 | 5.1e-3 |
| 6conv3pool | – | – | – | – | – | 6.9e-3 |
| 6conv3pool6norm | – | – | – | – | – | – |
We perform pairwise Wilcoxon tests on AUROCs of the six methods. Tests are conducted with the alternative hypothesis that the AUROCs of two methods are different in their medians. Small p-values indicate that two methods have different performance
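A paired comparison of this kind can be sketched as follows. The source does not say which Wilcoxon variant was used, so this example assumes the paired signed-rank test with a normal approximation and no continuity correction; the per-fold scores are hypothetical and are not taken from the paper.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation,
    no continuity or tie correction; zero differences are dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w, p_value

# Hypothetical per-fold AUROCs for two methods over 10 folds (illustrative only).
method_a = [0.912, 0.918, 0.921, 0.909, 0.915, 0.920, 0.911, 0.917, 0.914, 0.923]
method_b = [0.885, 0.890, 0.893, 0.880, 0.889, 0.891, 0.882, 0.888, 0.884, 0.895]

w, p_value = wilcoxon_signed_rank(method_a, method_b)
print(w, p_value)
```

Under these assumptions, when one method wins in all 10 folds the statistic is 0 and the p-value is about 5.1e-3, which is the smallest value such a test can return for 10 pairs. That is consistent with 5.1e-3 recurring throughout the table, although the source does not confirm the exact procedure.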
Pairwise Wilcoxon tests on AUPRCs of different methods
| | gkmSVM | 4conv2pool | 4conv2pool4norm | 4conv | 6conv3pool | 6conv3pool6norm |
|---|---|---|---|---|---|---|
| gkmSVM | – | 5.1e-3 | 5.1e-3 | 6.5e-1 | 5.8e-1 | 5.1e-3 |
| 4conv2pool | – | – | 2.8e-1 | 5.1e-3 | 5.1e-3 | 5.1e-3 |
| 4conv2pool4norm | – | – | – | 5.1e-3 | 5.1e-3 | 5.1e-2 |
| 4conv | – | – | – | – | 4.4e-1 | 5.1e-3 |
| 6conv3pool | – | – | – | – | – | 9.3e-3 |
| 6conv3pool6norm | – | – | – | – | – | – |
We perform pairwise Wilcoxon tests on AUPRCs of the six methods. Tests are conducted with the alternative hypothesis that the AUPRCs of two methods are different in their medians. Small p-values indicate that two methods have different performance
Classification performance for different cell lines
| Cell Type | AUROC (DeepEnhancer) | AUROC (gkmSVM) | AUPRC (DeepEnhancer) | AUPRC (gkmSVM) |
|---|---|---|---|---|
| GM12878 | 0.874 | 0.784 | 0.875 | 0.819 |
| H1-hESC | 0.923 | 0.869 | 0.919 | 0.861 |
| HepG2 | 0.882 | 0.800 | 0.883 | 0.827 |
| HMEC | 0.903 | 0.848 | 0.907 | 0.892 |
| HSMM | 0.904 | 0.830 | 0.910 | 0.856 |
| HUVEC | 0.898 | 0.824 | 0.905 | 0.870 |
| K562 | 0.883 | 0.794 | 0.886 | 0.799 |
| NHEK | 0.888 | 0.809 | 0.893 | 0.840 |
| NHLF | 0.909 | 0.848 | 0.910 | 0.869 |
| p-value (DeepEnhancer vs. gkmSVM) | 1.9e-3 (AUROC) | | 1.9e-3 (AUPRC) | |
We compare the performance of our DeepEnhancer model and gkmSVM on 9 cell types using two measures: area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). The last row shows the p-value of the binomial exact test, which supports the alternative hypothesis that DeepEnhancer has a larger AUC score than gkmSVM
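The p-value in the last row can be approximated with a one-sided sign test. The sketch below assumes the binomial exact test counts the cell types on which DeepEnhancer scores higher and compares that count against a fair-coin null; the source does not spell out this procedure. The AUROC values are copied from the table.

```python
from math import comb

def binomial_exact_greater(successes, trials, p=0.5):
    """One-sided exact binomial p-value: P(X >= successes) under Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# AUROC per cell type, in table order (GM12878 ... NHLF).
deepenhancer = [0.874, 0.923, 0.882, 0.903, 0.904, 0.898, 0.883, 0.888, 0.909]
gkmsvm = [0.784, 0.869, 0.800, 0.848, 0.830, 0.824, 0.794, 0.809, 0.848]

wins = sum(d > g for d, g in zip(deepenhancer, gkmsvm))
p_value = binomial_exact_greater(wins, len(deepenhancer))
print(wins, p_value)
```

DeepEnhancer scores higher on all 9 cell types, so the p-value is 0.5^9, roughly 0.00195, which matches the 1.9e-3 in the table up to rounding. The same count holds for the AUPRC columns.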
Fig. 3 ROC curves for enhancers specific to different cell lines. The first nine subplots depict the receiver operating characteristic (ROC) curves for the 9 cell types, and the last subplot is a bar plot of the AUROC
Fig. 4 PR curves for enhancers specific to different cell lines. The first nine subplots depict the precision-recall (PR) curves for the 9 cell types, and the last subplot is a bar plot of the AUPRC
Fig. 5 Visualization of learned motifs. For each cell line, we show a pattern learned by our model that can be matched to a known motif in the JASPAR database
Fig. 6 Loss of the model 4conv2pool4norm during training. The training loss decreases rapidly. A held-out validation set is used for early stopping, which is triggered after 8 epochs without improvement in validation loss
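The stopping rule in this caption can be sketched as a patience counter over validation losses. This is an illustrative version, not the authors' training loop: the function name, the strict "lower than the best so far" criterion, and the loss values are all assumptions.

```python
def early_stopping_epoch(validation_losses, patience=8):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:  # assumption: any strict decrease counts as improvement
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement for three epochs, then a plateau.
losses = [0.60, 0.50, 0.45] + [0.46] * 10
stop_epoch = early_stopping_epoch(losses, patience=8)
print(stop_epoch)
```

With these hypothetical losses the best epoch is index 2, and training stops at index 10, the eighth consecutive epoch without improvement.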
Fig. 7 Diagram of data augmentation. Suppose the model accepts sequences of length W bp as input. (a) If an enhancer is shorter than W, we slide a window of size W along the genome with stride s (default 2) around the input sequence and take every window that overlaps the original sequence as an augmented sequence. (b) If an enhancer is longer than W, we slide a window of size W along the input sequence with stride s (default 2) to obtain a number of sequences, each of length W
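Both augmentation cases can be sketched with simple string slicing. The function names, the half-open coordinates, and the reading of "overlapping" as "sharing at least one base with the enhancer" are assumptions made for illustration; the caption does not specify them, and the toy genome is hypothetical.

```python
def augment_long(sequence, window, stride=2):
    """Case (b): slide a window along a sequence longer than the window."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, stride)]

def augment_short(genome, start, end, window, stride=2):
    """Case (a): the enhancer genome[start:end] is shorter than the window.
    Collect genomic windows that share at least one base with it (assumed reading)."""
    first = max(0, start - window + 1)         # leftmost start that still overlaps
    last = min(end - 1, len(genome) - window)  # rightmost start inside the genome
    return [genome[i:i + window] for i in range(first, last + 1, stride)]

toy_genome = "ACGTACGTACGTACGTACGT"  # hypothetical 20 bp genome

long_windows = augment_long("ACGTACGTAC", window=6, stride=2)
short_windows = augment_short(toy_genome, start=8, end=11, window=6, stride=2)
print(long_windows)
print(short_windows)
```

Every augmented sequence has length W, so the network input size stays fixed regardless of the original enhancer length.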