Guo-Sheng Han, Qi Li, Ying Li.
Abstract
BACKGROUND: Nucleosome positioning is the precise determination of the location of nucleosomes on a DNA sequence. With the continuous advancement of biotechnology and computer technology, biological data are growing explosively, so developing an efficient nucleosome positioning algorithm is of practical significance. Convolutional neural networks (CNNs) can capture local features in DNA sequences but ignore the order of bases, whereas a bidirectional recurrent neural network can make up for this shortcoming and extract long-term dependency features of the DNA sequence.
Keywords: Bidirectional recurrent neural network; Convolutional neural network; Deep learning; Nucleosome positioning; Word vector
Year: 2022 PMID: 35418074 PMCID: PMC9006412 DOI: 10.1186/s12864-022-08508-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The histograms show the overall accuracy of nucleosome positioning using SVM with different values of k and different vector dimensions. a H. sapiens achieves the highest classification accuracy with k = 6 and a vector dimension of 200; b C. elegans achieves the highest classification accuracy with k = 4 and a vector dimension of 100; c D. melanogaster achieves the highest classification accuracy with k = 5 and a vector dimension of 180
DNA sequence vector dimension setting
| Species | k-mer | Vector dimension |
|---|---|---|
| H. sapiens | 6 | 200 |
| C. elegans | 4 | 100 |
| D. melanogaster | 5 | 180 |
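The k values in the table above determine how a DNA sequence is cut into "words" before word vectors are learned. The following is a minimal sketch of that tokenization step only. The function name `kmer_tokens`, the `SETTINGS` dictionary, and the overlapping stride of 1 are assumptions made for illustration; the record does not state the stride, and the embedding training itself is not shown.

```python
def kmer_tokens(sequence, k, stride=1):
    """Split a DNA sequence into k-mer "words".

    stride=1 (fully overlapping k-mers) is an assumption of this sketch.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # The last valid start index is len(seq) - k.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Per-species settings copied from the table above. "dim" is the size of
# the word vector learned for each k-mer; it is listed for reference only.
SETTINGS = {
    "H. sapiens": {"k": 6, "dim": 200},
    "C. elegans": {"k": 4, "dim": 100},
    "D. melanogaster": {"k": 5, "dim": 180},
}

# Toy sequence, not taken from any dataset in the paper.
tokens = kmer_tokens("ACGTACGTAC", SETTINGS["C. elegans"]["k"])
print(tokens)
```

A sequence of length L yields L - k + 1 tokens at stride 1, so the toy 10-base sequence produces seven 4-mers.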
Classification results of SVM and CNN via tenfold cross validation
| Species | H. sapiens | C. elegans | D. melanogaster | |
|---|---|---|---|---|
| SVM | ACC | 0.8589 | 0.8167 | |
| Sn | 0.8944 | 0.7928 | ||
| Sp | 0.824 | 0.8411 | ||
| MCC | 0.7202 | 0.6346 | ||
| CNN | ACC | 0.8443 | ||
| Sn | 0.879 | |||
| Sp | 0.81 | |||
| MCC | 0.6958 | |||
The prediction quality of BiGRU + BiLSTM via tenfold cross validation
| Species | ACC | Sn | Sp | MCC |
|---|---|---|---|---|
| H. sapiens | 0.8428 | 0.8891 | 0.797 | 0.6917 |
| C. elegans | 0.8817 | 0.9119 | 0.8520 | 0.7666 |
| D. melanogaster | 0.8285 | 0.7714 | 0.8867 | 0.6629 |
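The four columns in the table above can be computed from a binary confusion matrix. The sketch below assumes the standard definitions of accuracy (ACC), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC); the record itself does not print the formulas, and the counts used in the example are invented for illustration.

```python
import math


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def classification_metrics(tp, fn, tn, fp):
    """Compute ACC, Sn, Sp and MCC from confusion-matrix counts."""
    acc = _ratio(tp + tn, tp + fn + tn + fp)
    sn = _ratio(tp, tp + fn)   # true positive rate
    sp = _ratio(tn, tn + fp)   # true negative rate
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, not results from the paper.
m = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print({name: round(value, 4) for name, value in m.items()})
```

MCC uses all four cells of the confusion matrix, which is why it can rank models differently from ACC when Sn and Sp are unbalanced.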
The prediction performance of NP_CBiR via tenfold cross validation
| Species | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|
| H. sapiens | 0.8618 | 0.8909 | 0.8330 | 0.7284 | 0.9234 |
| C. elegans | 0.8939 | 0.9427 | 0.8459 | 0.7924 | 0.9530 |
| D. melanogaster | 0.8555 | 0.8769 | 0.8337 | 0.7119 | 0.9251 |
Fig. 2 The ROC curves show the performance of NP_CBiR. a AUC is 0.9234 for H. sapiens; b AUC is 0.9530 for C. elegans; c AUC is 0.9251 for D. melanogaster
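The AUC reported with each ROC curve can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition. The function name `roc_auc` and the toy labels and scores are assumptions for illustration, and this is not necessarily how the authors computed their values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions, not taken from the paper.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of samples; for large datasets a rank-based implementation gives the same value more cheaply.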
Experimental results of the second dataset
| Dataset | Best for Liu | DLNN | CORENup | NP_CBiR |
|---|---|---|---|---|
| H-5U | ∼0.7 | 0.68 | 0.760 | |
| H-LC | ∼0.65 | 0.81 | 0.910 | |
| H-PM | 0.67 | 0.77 | 0.86 | |
| D-5U | ∼0.7 | 0.67 | 0.71 | |
| D-LC | ∼0.7 | 0.71 | 0.72 | |
| D-PM | ∼0.7 | 0.73 | 0.738 |
Due to the limitation of table size, species and region names are abbreviated. H: H. sapiens; D: D. melanogaster; LC: largest chromosome; 5U: 5' UTR exon region; PM: promoter
Classification results of SVM and NP_CBiR
| Dataset | H-5U | H-LC | H-PM | D-5U | D-LC | D-PM |
|---|---|---|---|---|---|---|
| SVM | 0.6890 | 0.7128 | 0.7294 | |||
| NP_CBiR | 0.78 | 0.86 |
Comparison of NP_CBiR with other methods on H. sapiens
| Method | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|
| DLNN | 0.8537 | 0.8834 | 0.8229 | - | - |
| ZCMM | 0.7772 | 0.7487 | 0.8151 | 0.5600 | 0.8610 |
| NP_CBiR | 0.8618 | 0.8909 | 0.8330 | 0.7284 | 0.9234 |
Comparison of NP_CBiR with other methods on C. elegans
| Method | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|
| DLNN | 0.9304 | - | - | ||
| ZCMM | 0.8534 | 0.7880 | 0.8410 | 0.6200 | 0.9120 |
| NP_CBiR | 0.8939 | 0.9427 | 0.8459 | 0.7924 | 0.9530 |
Comparison of NP_CBiR with other methods on D. melanogaster
| Method | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|
| DLNN | 0.8560 | 0.8781 | 0.8333 | - | - |
| ZCMM | 0.7964 | 0.7000 | 0.9110 | ||
| NP_CBiR | 0.8555 | 0.8769 | 0.8337 | 0.7119 | 0.9251 |
Statistical information of the second dataset
| Species | Region | P-S | N-S | Total |
|---|---|---|---|---|
| H. sapiens | LC | 97,209 | 65,563 | 162,772 |
| H. sapiens | PM | 56,404 | 44,639 | 101,043 |
| H. sapiens | 5U | 11,769 | 4,880 | 16,649 |
| D. melanogaster | LC | 46,054 | 30,458 | 76,512 |
| D. melanogaster | PM | 48,251 | 28,763 | 77,014 |
| D. melanogaster | 5U | 4,669 | 2,704 | 7,373 |
Statistical information of the first dataset
| Species | P-S | N-S | Total |
|---|---|---|---|
| H. sapiens | 2273 | 2300 | 4573 |
| C. elegans | 2567 | 2608 | 5175 |
| D. melanogaster | 2900 | 2850 | 5750 |
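The results tables report tenfold cross validation on these datasets. The sketch below shows one way the sample indices of a dataset could be partitioned into ten folds, using the H. sapiens total from the table above as the sample count. The function name `kfold_indices`, the fixed random seed, and the plain (non-stratified) shuffle are assumptions; the record does not say how the folds were drawn.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Indices are shuffled once, then dealt round-robin into folds so that
    fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        # Every fold except the i-th is used for training.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 4573 is the H. sapiens total in the table above.
splits = list(kfold_indices(4573))
for fold_id, (train, test) in enumerate(splits):
    print(fold_id, len(train), len(test))
```

Each sample appears in exactly one test fold, so the reported metrics are averaged over predictions for every sequence in the dataset.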
Fig. 3 DNA sequence word vector representation flowchart
Fig. 4 Nucleosome positioning model based on CNN and word vector
Fig. 5 Nucleosome positioning model based on BiGRU + BiLSTM and word vector
Fig. 6 Nucleosome positioning model based on hybrid model and word vector