Thanh Thi Nguyen, Mohamed Abdelrazek, Dung Tien Nguyen, Sunil Aryal, Duc Thanh Nguyen, Sandeep Reddy, Quoc Viet Hung Nguyen, Amin Khatami, Thanh Tam Nguyen, Edbert B Hsu, Samuel Yang.
Abstract
The origin of the COVID-19 virus (SARS-CoV-2) has been intensely debated in the scientific community since the first infected cases were detected in December 2019. The disease has caused a global pandemic, leading to the deaths of thousands of people across the world, so finding the origin of this novel coronavirus is important for responding to and controlling the pandemic. Recent comparative studies of its genomic sequences suggest that bats or pangolins might be the hosts for SARS-CoV-2. This paper investigates the origin of SARS-CoV-2 by applying artificial intelligence (AI)-based unsupervised learning algorithms to raw genomic sequences of the virus. More than 300 genome sequences of COVID-19 infected cases collected from different countries are explored and analysed using unsupervised clustering methods. The results obtained from various AI-enabled experiments using clustering algorithms demonstrate that all examined SARS-CoV-2 genomes belong to a cluster that also contains bat and pangolin coronavirus genomes. This provides evidence strongly supporting the scientific hypotheses that bats and pangolins are probable hosts for SARS-CoV-2. At the whole-genome analysis level, our findings also indicate that bats are more likely than pangolins to be the hosts for the COVID-19 virus.
Keywords: AI; Artificial intelligence; Bat; COVID-19; Machine learning; Origin; Pandemic; Pangolin; SARS-CoV-2
Year: 2022 PMID: 35599960 PMCID: PMC9110011 DOI: 10.1016/j.mlwa.2022.100328
Source DB: PubMed Journal: Mach Learn Appl ISSN: 2666-8270
Table 1. Number of COVID-19 sequences collected from different countries.
| Countries | Number of sequences | Countries | Number of sequences |
|---|---|---|---|
| USA | 258 | India | 2 |
| China | 49 | Brazil | 1 |
| Japan | 5 | Italy | 1 |
| Spain | 4 | Peru | 1 |
| Taiwan | 3 | Nepal | 1 |
| Vietnam | 2 | South Korea | 1 |
| Israel | 2 | Australia | 1 |
| Pakistan | 2 | Sweden | 1 |
Table 2. Reference viruses from major virus classes at a high taxonomic level - Set 1.
| Virus (Accession Number) | Taxonomy | Virus (Accession Number) | Taxonomy |
|---|---|---|---|
| Human adenovirus D8 (AB448767) | Adenoviridae | Murine leukemia virus (AB187566) | Ortervirales |
| TT virus sle1957 (AM711976) | Anelloviridae | Human papillomavirus type 69 (AB027020) | Papillomaviridae |
| Staphylococcus phage S13’ (AB626963) | Caudovirales | Adeno-associated virus - 6 (AF028704) | Parvoviridae |
| Chili leaf curl virus-Oman (KF229718) | Geminiviridae | Cotesia plutellae polydnavirus (AY651828) | Polydnaviridae |
| Meles meles fecal virus (JN704610) | Genomoviridae | Aves polyomavirus 1 (AF118150) | Polyomaviridae |
| Chlamydia phage 3 (AJ550635) | Microviridae | Middle East respiratory syndrome (MERS) CoV (NC_019843) | Riboviria |
Table 3. Reference viruses within the Riboviria realm - Set 2.
| Virus (Accession Number) | Taxonomy | Virus (Accession Number) | Taxonomy |
|---|---|---|---|
| Grapevine rupestris stem pitting-associated virus (GRSPV) 1 (AF057136) | Betaflexiviridae | Lymantria dispar cypovirus 14 (AF389452) | Reoviridae |
| Cucumber mosaic virus (AJ276479) | Bromoviridae | Hybrid snakehead virus (KC519324) | Rhabdoviridae |
| Chiba virus (AB042808) | Caliciviridae | Rice tungro spherical virus (NC_001632) | Secoviridae |
| Mercadeo virus (NC_027819) | Flaviviridae | Bulbul CoV HKU11-796 (FJ376620) | Coronaviridae; DeltaCoV |
| Bunyamwera virus (NC_001925) | Peribunyaviridae | Avian infectious bronchitis virus (AIBV) (AY646283) | Coronaviridae; GammaCoV |
| Rice grassy stunt tenuivirus (NC_002323) | Phenuiviridae | Human CoV NL63 (NC_005831) | Coronaviridae; AlphaCoV |
| Theiler’s-like virus of rats (AB090161) | Picornaviridae | SARS CoV BJ01 (AY278488) | Coronaviridae; BetaCoV |
| Turnip mosaic virus (AB194796) | Potyviridae | | |
Table 4. Reference viruses in the genera AlphaCoV and BetaCoV - Set 3.
| Virus (Accession Number) | Taxonomy | Virus (Accession Number) | Taxonomy |
|---|---|---|---|
| Transmissible gastroenteritis virus (TGEV) (NC_038861) | AlphaCoV | SARS CoV BJ01 (AY278488) | BetaCoV; Sarbecovirus |
| Mink CoV WD1127 (NC_023760) | AlphaCoV | Bat SARS CoV RsSHC014 (KC881005) | BetaCoV; Sarbecovirus |
| Porcine epidemic diarrhea virus (PEDV) (NC_003436) | AlphaCoV | Bat SARS CoV WIV1 (KF367457) | BetaCoV; Sarbecovirus |
| Rhinolophus bat CoV HKU2 (NC_009988) | AlphaCoV | Bat SARS CoV Rp3 (DQ071615) | BetaCoV; Sarbecovirus |
| Human CoV 229E (NC_002645) | AlphaCoV | Bat SARS CoV Rs672/2006 (FJ588686) | BetaCoV; Sarbecovirus |
| Human CoV NL63 (NC_005831) | AlphaCoV | Bat SARS CoV Rf1 (DQ412042) | BetaCoV; Sarbecovirus |
| Human CoV OC43 (NC_006213) | BetaCoV; Embecovirus | Bat SARS CoV Longquan-140 (KF294457) | BetaCoV; Sarbecovirus |
| Murine hepatitis virus (MHV) (AC_000192) | BetaCoV; Embecovirus | Bat SARS CoV HKU3–1 | BetaCoV; Sarbecovirus |
| Rousettus bat CoV HKU9 (NC_009021) | BetaCoV; Nobecovirus | Bat SARS CoV ZXC21 (MG772934) | BetaCoV; Sarbecovirus |
| Rousettus bat CoV GCCDC1 (NC_030886) | BetaCoV; Nobecovirus | Bat SARS CoV ZC45 (MG772933) | BetaCoV; Sarbecovirus |
| MERS CoV (NC_019843) | BetaCoV; Merbecovirus | Bat CoV RaTG13 (MN996532) | BetaCoV; Sarbecovirus |
| Pipistrellus bat CoV HKU5 (NC_009020) | BetaCoV; Merbecovirus | Guangxi pangolin CoV GX/P4L (EPI_ISL_410538) | BetaCoV; Sarbecovirus |
| Tylonycteris bat CoV HKU4 (NC_009019) | BetaCoV; Merbecovirus | Guangxi pangolin CoV GX/P1E (EPI_ISL_410539) | BetaCoV; Sarbecovirus |
| Bat Hp-betaCoV/Zhejiang2013 (NC_025217) | BetaCoV; Hibecovirus | Guangxi pangolin CoV GX/P5L (EPI_ISL_410540) | BetaCoV; Sarbecovirus |
| SARS CoV BtKY72 (KY352407) | BetaCoV; Sarbecovirus | Guangxi pangolin CoV GX/P5E (EPI_ISL_410541) | BetaCoV; Sarbecovirus |
| Bat CoV BM48-31/BGR/2008 (GU190215) | BetaCoV; Sarbecovirus | Guangxi pangolin CoV GX/P2V (EPI_ISL_410542) | BetaCoV; Sarbecovirus |
| SARS CoV LC5 (AY395002) | BetaCoV; Sarbecovirus | Guangxi pangolin CoV GX/P3B (EPI_ISL_410543) | BetaCoV; Sarbecovirus |
| SARS CoV SZ3 (AY304486) | BetaCoV; Sarbecovirus | Guangdong pangolin CoV (EPI_ISL_410721) | BetaCoV; Sarbecovirus |
| SARS CoV Tor2 (AY274119) | BetaCoV; Sarbecovirus | | |
Fig. 1. Dendrogram plots showing hierarchical clustering results using only the reference sequences in Set 1 (Table 2) with a fixed cut-off parameter (top), and using a set that merges 16 representative SARS-CoV-2 sequences with the reference sequences at the same cut-off (bottom). A number at the beginning of each virus name indicates the cluster to which that virus belongs after clustering.
Fig. 2. Phylogenetic trees showing DBSCAN results using only the reference sequences in Set 1 (Table 2) with the search radius parameter equal to 0.7 (top), and using a set that merges SARS-CoV-2 sequences and reference sequences with the search radius also set to 0.7 (bottom). Because Set 1 includes representatives of major virus classes, and the minimum number of neighbours is set to 3 while the search radius is set to 0.7, DBSCAN treats the individual viruses as outliers (top). When the dataset is expanded to include SARS-CoV-2 sequences, DBSCAN forms cluster “1”, which includes all SARS-CoV-2 sequences and the MERS CoV, the representative of the Riboviria realm (bottom).
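The behaviour described in the Fig. 2 caption (isolated viruses flagged as outliers, a dense group forming cluster "1") can be illustrated with a minimal pure-Python DBSCAN over a precomputed distance matrix. This is only a sketch under stated assumptions, not the authors' implementation: the function name `dbscan`, the toy matrix `toy_dist`, and the convention that a point counts as its own neighbour are all illustrative choices, and label 0 is used here to mark outliers.

```python
def dbscan(dist, eps, min_pts):
    """Cluster items from a symmetric distance matrix (list of lists).

    Returns one label per item: 0 marks an outlier, 1, 2, ... are cluster ids.
    Assumption: a point counts as its own neighbour when testing min_pts.
    """
    n = len(dist)
    UNVISITED, NOISE = None, 0
    labels = [UNVISITED] * n

    def neighbours(i):
        # All items within the search radius eps of item i (including i).
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical toy distances: items 0-2 are close together, items 3-4 are isolated.
toy_dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
labels = dbscan(toy_dist, eps=0.7, min_pts=3)
```

With a search radius of 0.7 and a minimum of 3 neighbours, the three mutually close items form one cluster while the two isolated items stay outliers, which mirrors the qualitative pattern the caption reports for widely separated reference viruses.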
Fig. 3. Dendrogram plots showing hierarchical clustering results using only the reference sequences in Set 2 (Table 3) with the cut-off parameter equal to 0.001 (top), and using a set that merges SARS-CoV-2 sequences and reference sequences with the cut-off also set to 0.001 (bottom).
Fig. 4. Phylogenetic trees showing DBSCAN results using only the reference sequences in Set 2 (Table 3) with the search radius parameter equal to 0.6 (top), and using a set that merges SARS-CoV-2 sequences and reference sequences with the search radius also set to 0.6 (bottom).
Fig. 5. Distances between each of the reference genomes in Set 3 (Table 4) and the 334 SARS-CoV-2 genomes, where the latter are ordered by release date (earliest to latest) over a period of approximately 3 months, from late December 2019 to late March 2020. These pairwise distances are computed with the Jukes–Cantor method using whole genomes. The line at the bottom, for instance, represents the distances between the bat CoV RaTG13 genome and each of the 334 SARS-CoV-2 genomes.
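For readers unfamiliar with the Jukes–Cantor method named in the Fig. 5 caption, the standard correction converts the observed proportion of differing sites $p$ into a distance $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$. The sketch below assumes the two sequences are already aligned to equal length and skips any site that is not A, C, G or T in both; the function names are illustrative, and the alignment step the authors used is not described in this section.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned nucleotide sequences.

    Assumption: sites with gaps or ambiguous bases in either sequence are skipped.
    """
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p) for aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    p = p_distance(seq_a, seq_b)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

Computing this value for one reference genome against each SARS-CoV-2 genome in turn would give one line of the kind plotted in Fig. 5, provided the genomes have first been aligned.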
Fig. 6. Similarities of the genomes of bat CoV RaTG13, Guangdong pangolin CoV and Guangxi pangolin CoV GX/P4L to the SARS-CoV-2/Australia/VIC01/2020 genome sequence.
Fig. 7. Set 3 of reference sequences: results obtained by using hierarchical clustering, shown as dendrogram plots. Numbers at the beginning of each virus label indicate the cluster of which that virus is a member as a result of the clustering algorithm. (A) With the cut-off equal to 0.7 and using Set 3 of reference sequences only: there are 6 clusters, where cluster “5” covers all examined viruses in the Sarbecovirus sub-genus. (B) With the cut-off still equal to 0.7 and the dataset now merging the reference sequences with 16 representative SARS-CoV-2 sequences (merged set). (C)-(F) Using the merged set with four further cut-off values.
Fig. 8. Set 3 of reference sequences: results obtained by using DBSCAN, shown as phylogenetic trees. Numbers at the beginning of each virus label indicate the cluster of which that virus is a member as a result of the clustering algorithm. (A) With the search radius parameter equal to 0.55 and using Set 3 of reference sequences only: there are 3 clusters, where cluster “1” covers all examined viruses in the Sarbecovirus sub-genus. (B) With the search radius still equal to 0.55 and the dataset now merging the reference sequences with 16 representative SARS-CoV-2 sequences (merged set). (C)-(F) Using the merged set with four further search radius values.
Fig. 9. Evolutionary distances between reference genomes and 16 representative SARS-CoV-2 genomes from 16 countries, based on the maximum composite likelihood method. Each line is composed of 16 data points representing 16 pairwise distances. The bottom line, for instance, shows the distances between bat CoV RaTG13 and each of the 16 representative genomes on the x-axis.
Fig. 10. Hierarchical clustering results using pairwise distances based on the maximum composite likelihood method, with the cut-off parameter set to (A) 0.1, (B) 0.01, (C) 0.001, and (D) 0.0001.
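To show how lowering a cut-off splits clusters apart, as the four panels of Fig. 10 do, here is a minimal agglomerative clustering sketch over a precomputed distance matrix. It rests on assumptions that this section does not confirm: the cut-off is treated as a plain distance threshold, average linkage is used, and the function name `agglomerative_clusters` and the matrix `toy_matrix` are hypothetical. The criterion behind the authors' cut-off parameter may differ.

```python
def agglomerative_clusters(dist, cutoff):
    """Average-linkage agglomerative clustering with a distance cut-off.

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than `cutoff`. Returns one label (1, 2, ...) per item.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break  # no remaining pair is close enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(dist)
    for cluster_id, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = cluster_id
    return labels


# Hypothetical toy distances: items 0 and 1 are near-identical, item 2 is a
# close relative, and item 3 is distant from all of them.
toy_matrix = [
    [0.0,    0.0005, 0.05, 0.5],
    [0.0005, 0.0,    0.05, 0.5],
    [0.05,   0.05,   0.0,  0.5],
    [0.5,    0.5,    0.5,  0.0],
]
for cutoff in (0.1, 0.01, 0.001, 0.0001):
    print(cutoff, agglomerative_clusters(toy_matrix, cutoff))
```

On this toy matrix, a cut-off of 0.1 keeps the close relative in the same cluster as the near-identical pair, 0.01 and 0.001 separate it, and 0.0001 leaves every item in its own cluster. Running progressively smaller cut-offs in this way probes which reference genome remains grouped with the query genomes the longest.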
Accession numbers of 334 SARS-CoV-2 genome sequences obtained from NCBI GenBank, sorted by date released.
MN908947, MN985325, MN975262, MN938384, MN988713, MN997409, MN994468, MN994467, MN988669, MN988668, MN996531, MN996530, MN996529, MN996528, MN996527, MT007544, MT019533, MT019532, MT019531, MT019530, MT019529, MT020881, MT020880, MT027064, MT027063, MT027062, MT039890, MT039888, MT039887, MT039873, MT049951, MT044258, MT044257, MT066176, MT066175, MT072688, MT093631, MT093571, MT106054, MT106053, MT106052, MT118835, MT123293, MT123292, MT123291, MT123290, LC528233, LC528232, MT126808, MT135044, MT135043, MT135042, MT135041, MT152824, MT050493, MT012098, LC529905, MT159722, MT159721, MT159720, MT159719, MT159718, MT159717, MT159716, MT159715, MT159714, MT159713, MT159712, MT159711, MT159710, MT159709, MT159708, MT159707, MT159706, MT159705, MT121215, MT066156, MT163719, MT163718, MT163717, MT163716, MT184913, MT184912, MT184911, MT184910, MT184909, MT184908, MT184907, MT188341, MT188340, MT188339, MT192773, MT192772, MT192765, MT192759, MT198652, MT226610, MT233523, MT233522, MT233519, MT240479, MT246667, MT246490, MT246489, MT246488, MT246487, MT246486, MT246485, MT246484, MT246482, MT246481, MT246480, MT246479, MT246478, MT246477, MT246476, MT246475, MT246474, MT246473, MT246472, MT246471, MT246470, MT246469, MT246468, MT246467, MT246466, MT246464, MT246462, MT246461, MT246460, MT246459, MT246458, MT246457, MT246456, MT246455, MT246454, MT246453, MT246452, MT246451, MT246450, MT246449, MT233526, MT253710, MT253709, MT253708, MT253707, MT253706, MT253705, MT253704, MT253703, MT253702, MT253701, MT253700, MT253699, MT253698, MT253697, MT253696, MT251980, MT251979, MT251978, MT251977, MT251976, MT251975, MT251974, MT251973, MT251972, LC534419, LC534418, MT259287, MT259286, MT259285, MT259284, MT259282, MT259281, MT259280, MT259278, MT259277, MT259275, MT259274, MT259273, MT259271, MT259269, MT259268, MT259267, MT259266, MT259264, MT259263, MT259261, MT259260, MT259258, MT259257, MT259256, MT259254, MT259253, MT259252, MT259251, MT259250, MT259249, MT259248, MT259247, MT259246, MT259245, MT259244, MT259243, MT259241, MT259239, MT259237, MT259236, MT259235, MT259231, MT259230, MT259229, MT259228, MT259227, MT259226, MT258383, MT258382, MT258381, MT258380, MT258379, MT258378, MT258377, MT263469, MT263468, MT263467, MT263465, MT263464, MT263463, MT263462, MT263459, MT263458, MT263457, MT263456, MT263455, MT263454, MT263453, MT263452, MT263451, MT263450, MT263449, MT263448, MT263447, MT263446, MT263445, MT263444, MT263443, MT263442, MT263441, MT263440, MT263439, MT263438, MT263437, MT263436, MT263435, MT263434, MT263433, MT263432, MT263431, MT263430, MT263429, MT263428, MT263426, MT263425, MT263424, MT263423, MT263422, MT263421, MT263420, MT263419, MT263418, MT263417, MT263416, MT263415, MT263414, MT263413, MT263412, MT263411, MT263410, MT263408, MT263406, MT263405, MT263404, MT263403, MT263402, MT263400, MT263399, MT263398, MT263396, MT263395, MT263394, MT263392, MT263391, MT263390, MT263388, MT263387, MT263386, MT263384, MT263383, MT263382, MT263381, MT263074, MT262993, MT262916, MT262915, MT262914, MT262913, MT262912, MT262911, MT262910, MT262909, MT262908, MT262907, MT262906, MT262905, MT262904, MT262903, MT262902, MT262901, MT262900, MT262899, MT262898, MT262897, MT262896, MT276598, MT276597, MT276331, MT276330, MT276329, MT276328, MT276327, MT276326, MT276325, MT276324, MT276323.