Domenico Amato1, Giosue' Lo Bosco2,3, Riccardo Rizzo4.
Abstract
BACKGROUND: Nucleosomes wrap the DNA inside the nucleus of the eukaryotic cell and regulate its transcription phase. Several studies indicate that nucleosome positions are determined by the combined effects of several factors, including DNA sequence organization. Interestingly, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using DNA sequence as input data.
Keywords: Deep learning networks; Epigenetic; Nucleosome classification; Recurrent neural networks
Year: 2020 PMID: 32938377 PMCID: PMC7493859 DOI: 10.1186/s12859-020-03627-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The one-hot representation. A simple visualization of the build process for a one-hot representation
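As a minimal sketch of the one-hot representation, the snippet below maps each nucleotide to a 4-element indicator vector. The alphabet order A, C, G, T and the function name `one_hot` are assumptions made for illustration; the caption does not specify them.

```python
# Assumed nucleotide order; the column order used in the paper is not stated here.
ALPHABET = "ACGT"

def one_hot(sequence):
    """Return a list of 4-element 0/1 rows, one row per nucleotide."""
    rows = []
    for base in sequence.upper():
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol: {base!r}")
        # Exactly one position in each row is set to 1.
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows

encoded = one_hot("ACGT")
```

Under this encoding a 147-nucleotide sequence would become a 147x4 matrix, which is consistent with the 147-position output of the first Conv1D layer in the CORENup structure table if that layer preserves sequence length.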
Fig. 2 The ConvNet network. The convolutional network used for the classification in [27]
Fig. 3 The LSTM network. The architecture with an LSTM network used in [26]
Fig. 4 The CORENup architecture. A representation of the CORENup architecture presented in this paper. The details of the architecture are reported in Table 1
CORENup structure
| Path | Layer | Kernel Dim | # Hidden Units | Stride Dim | Output Dim | # Params |
|---|---|---|---|---|---|---|
| Features Extraction Path | Conv1D | 5 | 50 | 1 | 147x50 | 1050 |
| | MaxPool1D | - | - | 2 | 73x50 | 0 |
| | Dropout 50% | - | - | - | 73x50 | 0 |
| LSTM Path | LSTM | - | 50 | - | 73x50 | 20200 |
| | Dropout 50% | - | - | - | 73x50 | 0 |
| | Flatten | - | - | - | 3650x1 | 0 |
| Convolutional Path | Conv1D | 10 | 50 | 1 | 73x50 | 25050 |
| | MaxPool1D | - | - | 2 | 36x50 | 0 |
| | Dropout 50% | - | - | - | 36x50 | 0 |
| | Flatten | - | - | - | 1800x1 | 0 |
| Dense Path | Concatenate | - | - | - | 5450x1 | 0 |
| | Dense | - | 370 | - | 370x1 | 2016870 |
| | Dropout 50% | - | - | - | 370x1 | 0 |
| | Dense | - | 1 | - | 1x1 | 371 |

| Path | # Params |
|---|---|
| Features Extraction Path | 1050 |
| LSTM Path | 20200 |
| Convolutional Path | 25050 |
| Dense Path | 2017241 |
| Total | 2063541 |
Notice that the dense layer contains the majority of the network parameters
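The parameter counts in the table can be reproduced with the standard formulas for Conv1D, LSTM and dense layers. The sketch below assumes a 4-channel one-hot input and a bias term in every layer, neither of which is stated in the table; the helper names are hypothetical.

```python
def conv1d_params(kernel, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter.
    return kernel * in_channels * filters + filters

def lstm_params(in_features, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (in_features * units + units * units + units)

def dense_params(in_features, units):
    # Fully connected weights plus one bias per unit.
    return in_features * units + units

feature_path = conv1d_params(5, 4, 50)   # assumed 4 input channels (one-hot)
lstm_path = lstm_params(50, 50)
conv_path = conv1d_params(10, 50, 50)
concat_width = 73 * 50 + 36 * 50         # flattened LSTM and convolutional outputs
dense_path = dense_params(concat_width, 370) + dense_params(370, 1)
total = feature_path + lstm_path + conv_path + dense_path
dense_share = dense_path / total
```

Under these assumptions the per-path counts and the total agree with the table, and the dense path accounts for about 98% of all parameters.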
Number of samples in the first group for each class
| | Nucleosome | Linker | Total |
|---|---|---|---|
| HS | 2273 | 2300 | 4573 |
| DM | 2900 | 2850 | 5750 |
| E | 2567 | 2608 | 5175 |
| Y | 1880 | 1740 | 3620 |
HS represents the Homo sapiens group; DM represents the Drosophila melanogaster group; E represents the Caenorhabditis elegans group; Y represents the yeast group
Number of samples in the second group for each class
| | Nucleosome | Linker | Total |
|---|---|---|---|
| HS - LC | 97209 | 65563 | 162772 |
| HS - PM | 56404 | 44639 | 101043 |
| HS - 5U | 11769 | 4880 | 16649 |
| DM - LC | 46054 | 30458 | 76512 |
| DM - PM | 48251 | 28763 | 77014 |
| DM - 5U | 4669 | 2704 | 7373 |
| Y - WG | 39661 | 4824 | 44485 |
| Y - PM | 27373 | 4463 | 31836 |
The labels HS, DM and Y have the same meaning as in Table 2. LC represents the longest chromosome; PM represents the promoter sequences; 5U represents the 5'UTR exon region; WG represents the whole genome
Experimental results for the 20-Fold procedure with the first group of data-sets
| LSTM | ConvNet | CORENup | LeNup | ||
|---|---|---|---|---|---|
| HS | ACC | 0.836 ± 0.03 | 0.83 ± 0.03 | 0.873 ± 0.02 | |
| SENS | 0.898 ± 0.03 | 0.867 ± 0.03 | 0.839 ± 0.03 | ||
| SPEC | 0.792 ± 0.03 | 0.814 ± 0.03 | 0.843 ± 0.02 | ||
| MCC | 0.681 ± 0.002 | 0.666 ± 0.0 | 0.758 ± 0.07 | ||
| AUC | 0.92 ± 0.03 | 0.91 ± 0.03 | 0.928 ± 0.01 | ||
| DM | ACC | 0.854 ± 0.04 | 0.838 ± 0.04 | 0.875 ± 0.02 | |
| SENS | 0.872 ± 0.03 | 0.816 ± 0.03 | 0.876 ± 0.03 | ||
| SPEC | 0.841 ± 0.05 | 0.838 ± 0.06 | 0.74 ± 0.13 | ||
| MCC | 0.71 ± 0.003 | 0.68 ± 0.003 | |||
| AUC | 0.93 ± 0.02 | 0.92 ± 0.02 | 0.937 ± 0.02 | ||
| E | ACC | 0.897 ± 0.03 | 0.895 ± 0.03 | 0.912 ± 0.02 | |
| SENS | 0.938 ± 0.03 | 0.924 ± 0.02 | 0.885 ± 0.02 | ||
| SPEC | 0.865 ± 0.04 | 0.874 ± 0.04 | 0.882 ± 0.03 | ||
| MCC | 0.799 ± 0.002 | 0.795 ± 0.001 | 0.832 ± 0.03 | ||
| AUC | 0.96 ± 0.02 | 0.96 ± 0.02 | |||
| Y | ACC | 0.996 ± 0.05 | 0.996 ± 0.06 | 1.0 ± 0.0 | |
| SENS | 0.998 ± 0.05 | 0.998 ± 0.05 | 0.999 ± 0.005 | ||
| SPEC | 0.995 ± 0.07 | 0.995 ± 0.08 | 1.0 ± 0.0 | ||
| MCC | 0.992 ± 0.003 | 0.993 ± 0.002 | 0.999 ± 0.005 | ||
| AUC | 0.99 ± 0.0 | 0.99 ± 0.0 | 0.99 ± 0.0 |
The two networks LeNup and CORENup outperform the simpler networks in Figs. 2 and 3. Best values are shown in boldface
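The ACC, SENS, SPEC and MCC metrics reported above have standard definitions in terms of confusion-matrix counts. The sketch below uses those textbook definitions and treats nucleosome as the positive class, which is an assumption; the function name and the counts in the example are invented for illustration and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)   # fraction of true positives recovered
    spec = tn / (tn + fp)   # fraction of true negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SENS": sens, "SPEC": spec, "MCC": mcc}

# Hypothetical counts for a single fold, not values from the paper.
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

In a k-fold procedure such as the one in the table, each metric would be computed per fold and then summarized as mean ± standard deviation.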
Time for Epochs
| | CORENup | LeNup | Overhead |
|---|---|---|---|
| HS | 5 | 16 | 0.31 |
| DM | 6 | 20 | 0.30 |
| E | 5 | 18 | 0.27 |
| Y | 4 | 12 | 0.33 |
| HS - LC | 63 | 233 | 0.27 |
| HS - PM | 44 | 158 | 0.28 |
| HS - 5U | 10 | 35 | 0.29 |
| DM - LC | 47 | 171 | 0.27 |
| DM - PM | 57 | 211 | 0.27 |
| DM - 5U | 4 | 13 | 0.30 |
| Y - WG | 36 | 124 | 0.29 |
| Y - PM | 24 | 88 | 0.27 |
Comparison between CORENup and LeNup time per epoch; all times are expressed in seconds. The Overhead column reports the ratio of the CORENup time to the LeNup time
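As a quick check of how the Overhead column is derived, the sketch below recomputes the ratio for two rows of the table. The function name is hypothetical.

```python
def overhead(corenup_seconds, lenup_seconds):
    # Ratio of CORENup time to LeNup time; values below 1 mean CORENup is faster.
    return corenup_seconds / lenup_seconds

hs = overhead(5, 16)        # HS row
hs_lc = overhead(63, 233)   # HS - LC row
rounded = (round(hs, 2), round(hs_lc, 2))
```

Because the per-epoch times in the table are whole seconds, recomputed ratios for other rows may differ from the listed values in the last digit.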
Training time
| | CORENup | LeNup | Overhead |
|---|---|---|---|
| HS | 166.79 | 305 | 0.55 |
| DM | 219.14 | 382.75 | 0.72 |
| E | 207.40 | 312.35 | 0.66 |
| Y | 287.46 | 186.5 | 1.54 |
| HS - LC | 2544.14 | 1426 | 1.78 |
| HS - PM | 1558.92 | 1806 | 0.86 |
| HS - 5U | 316.08 | 750 | 0.42 |
| DM - LC | 2025.52 | 4435 | 0.46 |
| DM - PM | 1823.39 | 3040 | 0.60 |
| DM - 5U | 121.78 | 171 | 0.71 |
| Y - WG | 791.37 | 1294 | 0.61 |
| Y - PM | 500.37 | 1311 | 0.38 |
Comparison between CORENup and LeNup training time; all times are expressed in seconds. The Overhead column reports the ratio of the CORENup time to the LeNup time
Experimental results for the second group of data-sets
| CORENup | LeNup | Best for [ | |
|---|---|---|---|
| HS - LC | 0.912 | 0.65 | |
| HS - PM | 0.875 | 0.67 | |
| HS - 5U | 0.732 | ∼0.7 | |
| DM - LC | 0.724 | ∼0.7 | |
| DM - PM | 0.734 | ∼0.7 | |
| DM - 5U | 0.695 | ∼0.7 | |
| Y - WG | 0.939 | 0.77 | |
| Y - PM | 0.909 | 0.79 |
The AUC values are calculated as explained in the work [35] where the data-sets were originally proposed. The last column reports the results of the best performer among the 8 methods compared in the original paper. Best values are shown in boldface
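For readers unfamiliar with AUC, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is the generic definition and not necessarily the exact procedure of [35]; the function name, scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores: label 1 = nucleosome, 0 = linker (assumed).
value = auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise loop is quadratic in the number of samples, so for data-sets of the size listed above a rank-based implementation would be preferable.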