Yun Yang1, Jing Guo2, Pei Wang2, Yaowei Wang3, Minghao Yu3, Xiang Wang3, Po Yang4, Liang Sun5.
Abstract
The recent outbreak of COVID-19 has infected millions of people around the world, leading to a global emergency. During a virus outbreak, it is crucial to identify the carriers of the virus promptly and precisely so that the animal origins can be isolated to prevent further infection. Traditional identification relies on field and laboratory research, which delays the response to emerging epidemics. With the development of machine learning, recent research has demonstrated that viral hosts can be predicted efficiently. However, limited annotated virus data and imbalanced host information prevent these approaches from achieving better results. To ensure highly reliable prediction of the animal origins of COVID-19, we extend transfer learning and ensemble learning to present a hybrid transfer learning model. When predicting the hosts of a newly discovered virus, our model offers a novel solution that uses a related virus domain as an auxiliary to help build a robust model for the target virus domain. Simulation results on several UCI benchmarks and viral genome datasets demonstrate that our model outperforms classical methods when target training sets are limited and classes are imbalanced. The feasibility of our approach is evaluated by setting the coronavirus as the target domain and other related viruses as the source domain. Finally, we present the predicted animal reservoirs of COVID-19 for further analysis.
Keywords: COVID-19; Ensemble learning; Hosts prediction; Machine learning; Transfer learning; Virus origins
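The abstract names imbalanced host information as one obstacle. As a generic illustration only (the record does not say how the authors handle imbalance), one common remedy is inverse-frequency class weighting, sketched below. The function name `balanced_class_weights` and the host labels are hypothetical and invented for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight}, where weight = n_samples / (n_classes * count).

    Rare classes get weights above 1, frequent classes below 1, and the
    weighted sample count still sums to n_samples.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical host labels, made up for illustration (not from the paper).
hosts = ["bat"] * 6 + ["bird"] * 3 + ["human"] * 1
weights = balanced_class_weights(hosts)
```

Such weights could be passed to any learner that accepts per-class or per-sample weights, so that errors on rare hosts cost more during training.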
Year: 2021 PMID: 33711547 PMCID: PMC7942058 DOI: 10.1016/j.jbi.2021.103736
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 8.000
Fig. 1. An overview of the proposed AI-based viral origins prediction.
Fig. 2. The phylogenetic tree of the viral genome sequences. The tree is built by proportional sampling from the collected datasets to show the variation among different virus species. Blue leaf nodes denote the other related virus species, red leaf nodes denote coronaviruses, and COVID-19 is highlighted in orange.
The descriptions of the virus datasets.
| 79 | 73 | 6 | – | |
| 36 | 33 | 3 | – | |
| 35 | 32 | 3 | – | |
| 78 | 72 | 6 | – | |
| 47 | 40 | 7 | – | |
| 111 | 107 | 4 | – | |
| 146 | 140 | 6 | – | |
| 62 | 53 | 9 | – | |
| 147 | 141 | 6 | – | |
| 34 | 25 | 9 | – | |
| 775 | 716 | 59 | 308 | |
Fig. 3. The flow chart of the hybrid transfer learning method.
The descriptions of the UCI medical benchmarks.
| Datasets | ||||||
|---|---|---|---|---|---|---|
| 20 | 346 | 126 | 122 | 62 | 36 | |
| 12 | 270 | 83 | 100 | 67 | 20 | |
| 8 | 277 | 32 | 30 | 164 | 51 | |
| 18 | 1151 | 194 | 193 | 417 | 347 | |
| 54 | 303 | 129 | 40 | 87 | 47 | |
| 15 | 470 | 55 | 268 | 15 | 132 | |
| 7 | 365 | 149 | 100 | 56 | 60 | |
| 16 | 368 | 160 | 110 | 72 | 28 | |
| 32 | 668 | 37 | 535 | 8 | 88 | |
| 27 | 2643 | 1633 | 131 | 798 | 81 | |
| 22 | 195 | 31 | 19 | 116 | 29 | |
| 5 | 830 | 23 | 44 | 380 | 383 | |
| 14 | 569 | 185 | 93 | 172 | 119 | |
Classification accuracy of different components of HTL on benchmarks.
| Datasets | |||||
|---|---|---|---|---|---|
| 81.5 ± 3.5 | 79.1 ± 2.3 | 79.3 ± 3.8 | 87.3 ± 2.1 | ||
| 80.3 ± 3.1 | 79.8 ± 2.8 | 81.4 ± 3.6 | 82.1 ± 2.3 | ||
| 72.3 ± 1.2 | 74.1 ± 2.3 | 73.1 ± 3.5 | 78.1 ± 2.2 | ||
| 66.1 ± 0.6 | 65.2 ± 0.7 | 68.2 ± 1.3 | 69.4 ± 1.0 | ||
| 70.2 ± 2.3 | 72.4 ± 1.9 | 71.5 ± 2.1 | 74.9 ± 1.8 | ||
| 85.1 ± 1.1 | 83.2 ± 1.2 | 84.5 ± 2.3 | 80.1 ± 1.7 | ||
| 72.1 ± 2.4 | 75.9 ± 3.5 | 76.3 ± 2.1 | 76.2 ± 1.4 | ||
| 69.5 ± 1.9 | 70.8 ± 1.3 | 72.1 ± 1.5 | 76.3 ± 2.1 | ||
| 88.4 ± 2.2 | 87.5 ± 2.1 | 87.1 ± 3.6 | 87.6 ± 1.2 | ||
| 93.2 ± 1.3 | 92.4 ± 2.5 | 93.2 ± 1.6 | 88.1 ± 0.3 | ||
| 76.8 ± 3.1 | 78.5 ± 4.3 | 77.9 ± 3.5 | 79.3 ± 3.9 | ||
| 71.3 ± 2.3 | 72.4 ± 3.2 | 72.1 ± 2.7 | 73.5 ± 2.8 | ||
| 90.1 ± 0.4 | 91.6 ± 0.8 | 91.8 ± 1.2 | 91.3 ± 0.8 | ||
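The entries above are reported as accuracy ± standard deviation. A minimal sketch of how such an entry can be produced from repeated runs follows. It assumes the sample standard deviation (the record does not say whether sample or population deviation was used), and the run accuracies are invented for illustration.

```python
from statistics import mean, stdev


def summarize_accuracy(run_accuracies):
    """Format repeated-run accuracies (in percent) as 'mean ± std'.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    avg = mean(run_accuracies)
    spread = stdev(run_accuracies)
    return f"{avg:.1f} ± {spread:.1f}"


# Hypothetical accuracies from five repeated runs (not taken from the paper).
runs = [80.0, 82.0, 84.0, 78.0, 81.0]
summary = summarize_accuracy(runs)
```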
Classification accuracy of different transfer learning methods on benchmarks
| 83.6 ± 3.3 | 87.6 ± 2.6 | 80.7 ± 2.3 | ||
| 70.8 ± 1.3 | 78.9 ± 1.1 | 75.4 ± 1.2 | ||
| 69.1 ± 3.2 | 71.4 ± 3.6 | 73.1 ± 2.3 | ||
| 62.8 ± 1.3 | 64.4 ± 1.2 | 65.7 ± 1.3 | ||
| 72.5 ± 2.8 | 74.8 ± 1.7 | 73.4 ± 2.9 | ||
| 80.2 ± 2.4 | 82.1 ± 1.7 | 80.1 ± 1.7 | ||
| 67.2 ± 2.9 | 68.8 ± 1.7 | 77.4 ± 2.1 | ||
| 74.7 ± 3.4 | 76.1 ± 1.0 | 75.7 ± 1.6 | ||
| 91.4 ± 0.9 | 90.3 ± 0.7 | 87.6 ± 1.2 | ||
| 92.5 ± 0.7 | 90.6 ± 0.2 | 88.1 ± 0.3 | ||
| 81.5 ± 0.5 | 81.2 ± 1.2 | 80.5 ± 3.4 | ||
| 70.4 ± 1.3 | 71.2 ± 1.6 | 69.8 ± 2.3 | ||
| 89.8 ± 1.5 | 90.7 ± 0.9 | 90.5 ± 1.5 |
The compositions of datasets
| Baseline | Training dataset | Testing dataset |
|---|---|---|
Classification accuracy on virus datasets
| ACC ± std | ACC ± std | ACC ± std | ||
|---|---|---|---|---|
| 22.6%±5.6 | 24.4%±4.8 | 28.9%±5.3 | ||
| 21.3%±2.4 | 28.6%±2.6 | 29.1%±5.5 | ||
| 28.9%±4.1 | 36.7%±4.5 | 39.3%±6.2 | ||
| – | 34.4%±3.9 | 31.5%±4.2 | ||
| 25.2%±1.9 | 26.7%±3.3 | 24.5%±3.6 | ||
| 24.7%±7.4 | 28.6%±5.8 | 32.3%±5.7 | ||
| 29.2%±4.5 | 28.1%±5.4 | 23.3%±4.7 | ||
| – | 32.2%±6.5 | 35.4%±7.6 | ||
| 19.0%±2.1 | 25.5%±4.6 | 27.2%±5.5 | ||
| 17.1%±3.9 | 29.0%±5.9 | 23.3%±6.6 | ||
| 10.3%±5.9 | 22.7%±5.6 | 27.8%±4.6 | ||
| – | 19.5%±5.5 | 25.1%±7.0 | ||
| 15.4%±4.4 | 22.0%±4.7 | 29.6%±6.4 | ||
| 19.2%±2.9 | 33.9%±6.3 | 36.3%±7.7 | ||
| 27.1%±5.0 | 26.9%±6.6 | 33.9%±5.5 | ||
| – | 30.6%±5.3 | 38.4%±9.0 | ||
| 30.1%±2.3 | 35.4%±3.3 | 44.7%±3.6 | ||
| 34.7%±3.9 | 36.1%±4.5 | 48.1%±4.2 | ||
| 28.3%±5.9 | 30.4%±5.1 | 35.1%±4.8 | ||
Fig. 4. ROC curves for different amounts of target training data. In each experiment group, the five best-performing methods are selected and their AUC is compared with that of HTL.
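For readers unfamiliar with the AUC used in this comparison, the sketch below computes it for a binary problem as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. This is the standard rank-based definition, not code from the paper. Host prediction is multi-class, so applying it per host in a one-vs-rest fashion is an assumption here, and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the
    positive has the higher score; tied pairs contribute 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for one host class (one-vs-rest).
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
```

The pairwise loop is quadratic in the number of samples, which is fine for small evaluation sets like the ones described here; a sort-based rank computation would be preferable for large ones.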
Fig. 5. Statistics of the prediction results and the heat map of host probabilities for COVID-19. In (a), the depth of the color represents the host probability for each virus; (b) illustrates the prediction results for the complete genome sequences of COVID-19; (c) records the Top-3 prediction confidences for the refseq and the average over the complete genome sequences.
The sequence identities with the refseq of COVID-19 (NC_045512.2).
| NC_004718.3 | Coronavirus | SARS coronavirus | 80.3% | Pterobat |
| NC_014470.1 | Coronavirus | Bat coronavirus | 78.0% | Pterobat |
| NC_016991.1 | Coronavirus | White-eye coronavirus HKU16 | 68.9% | Neoaves |
| NC_016992.1 | Coronavirus | Sparrow coronavirus HKU17 | 67.8% | Neoaves |
| NC_030886.1 | Coronavirus | Rousettus bat coronavirus | 67.4% | Pterobat |
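The record does not state how the identity percentages above were computed. As an illustration only, one simple definition is the share of matching columns in a pairwise alignment, sketched below. The function name `percent_identity`, the gap handling, and the toy sequences are assumptions made for this example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns where both sequences carry the same base.

    Assumptions: inputs are already aligned to equal length, columns that
    are gaps in both sequences are ignored, and a gap opposite a base
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned fragments, invented for illustration (not real genome data).
identity = percent_identity("ACGT-ACGTA", "ACGTTACCTA")
```

Other conventions exist, for example dividing by the length of the shorter sequence or excluding all gapped columns, and they can give noticeably different percentages for the same pair.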
Fig. 6. The phylogenetic tree constructed from the refseq of COVID-19 (NC_045512.2) and the Top-20 related sequences.