Ahmad M Abu Haimed1, Tanzila Saba1, Ayman Albasha1, Amjad Rehman1, Mahyar Kolivand2.
Abstract
This research presents a reverse engineering approach to discovering the patterns and evolution behavior of SARS-CoV-2 using AI and big data. We studied five viral families (Orthomyxoviridae, Retroviridae, Filoviridae, Flaviviridae, and Coronaviridae) that emerged over the past one hundred years, in order to capture their similarities, common characteristics, and evolution behavior for prediction concerning SARS-CoV-2, and to show that reverse engineering using artificial intelligence (AI) and big data is efficient and opens wide horizons. The results show that SARS-CoV-2 shares the same most active amino acids (S, L, and T) with the mentioned viral families, which is known to affect the building function of the proteins. We also devised a mathematical formula for calculating the evolution difference percentage between viruses with respect to their phylogenic tree. It shows that SARS-CoV-2 has mutated quickly relative to the time since it emerged. A Long Short-Term Memory (LSTM) network, using the phylogenic tree data as a corpus, is used to predict the next evolved instance of SARS-CoV-2. This paper demonstrates the prediction process on the ORF7a protein of SARS-CoV-2 as a first stage toward predicting the complete mutant virus. Overall, this research focuses on breaking the virus down into its primary factors by reverse engineering with AI and big data, in order to understand viral similarities, patterns, and evolution behavior and to predict future mutations of the virus in a systematic and logical way.
Keywords: COVID19; Healthcare; Long Short-term Memory (LSTM); Public health; SARS-CoV-2; Viral reverse engineering
Year: 2021 PMID: 33824882 PMCID: PMC8016547 DOI: 10.1016/j.eti.2021.101531
Source DB: PubMed Journal: Environ Technol Innov ISSN: 2352-1864
Viral families data identifications.
| Family | Viruses | Sample dates | Sample size | Sample locations |
|---|---|---|---|---|
| Orthomyxoviridae | H1N1 (Spanish Flu) | 1934–2013 | 5 viruses | USA, Puerto Rico |
| Retroviridae | HIV-1 | 1981–2020 | 2 viruses | USA |
| Filoviridae | EBOV (Ebola Virus) | 1976–2020 | 1 virus | USA |
| Flaviviridae | HCV (Hepatitis-C) | 1989–2020 | 1 virus | USA |
| Coronaviridae | HCoV-OC43 | 1960–2020 | 7 viruses | USA |
Fig. 1. (a) Genetic Analysis (GA) for getting genome indicators. (b) Genetic Analysis (GA) for getting the phylogenic tree and PEFD.
Fig. 2. Genome represented as codons of three nucleotides, each codon corresponding to one amino acid.
Fig. 3. The mRNA structure with corresponding codons.
Fig. 4. The training formation of the genomes in the machine.
Fig. 5. Genetic reverse engineering process.
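The codon-to-amino-acid mapping behind Figs. 2 and 3 can be illustrated with a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1); the function name `translate` and the example sequences are hypothetical and are not taken from the paper's pipeline.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# over the bases T, C, A, G in that order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate a coding sequence into one-letter amino acids.

    Reads the sequence three nucleotides at a time from position 0 and
    stops at the first stop codon. Accepts mRNA (U) or DNA (T) letters.
    Ambiguous bases such as N are not handled and raise KeyError.
    """
    seq = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCTTAA")` reads the codons ATG, GCT, and TAA, giving methionine, alanine, and a stop. Any trailing one or two nucleotides that do not form a full codon are ignored.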
Influenza virus identification.
| Virus Name | Geo location | Collection date |
|---|---|---|
| H1N1 (Spanish Flu) | USA, Puerto Rico | 1934 |
| H2N2 (Asian Flu) | Korea | 1986 |
| H3N2 (Hong Kong Flu) | USA, New York | 2004 |
| H5N1 (Avian Flu) | China, Guangdong | 1996 |
| H1N1 (Swine Flu) | USA, California | 2009 |
H1N1 (Spanish flu)’s genome data.
| No. | Genome/segment | Accessions | Total protein sequences | Functional proteins | Total amino acids | Highest amino acid occurrence (%) |
|---|---|---|---|---|---|---|
| 1 | Neuraminidase (NA) | NC_002018.1 | 38 | 5 | 430 | L: 12.3% |
| 2 | Hemagglutinin (HA) | NC_002017.1 | 42 | 8 | 548 | S: 8.4% |
| 3 | Matrix Protein (M1) | NC_002016.1 | 27 | 3 | 316 | G: 7% |
| 4 | RNA Polymerase (PA) | NC_002022.1 | 3 | 1 | 742 | E: 10.5% |
| 5 | RNA Polymerase (PB1) | NC_002021.1 | 5 | 1 | 776 | L: 7.6% |
| 6 | RNA Polymerase (PB2) | NC_002023.1 | 3 | 1 | 778 | R: 8.2% |
| 7 | Nucleoprotein (NP) | NC_002019.1 | 2 | 1 | 520 | R: 9.9% |
| 8 | Non-Structural Protein (NS1) | NC_002020.1 | 15 | 2 | 280 | L: 13.6% |
Fig. 6. Amino acid percentages per segment.
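The "highest amino acid occurrence (%)" column can be read as the share of the most frequent residue in a protein sequence. The sketch below shows one way such a figure could be computed. It is an illustration under that assumption, not the authors' implementation; the function names and toy sequences are hypothetical.

```python
from collections import Counter


def amino_acid_percentages(protein: str) -> dict[str, float]:
    """Return the percentage share of each one-letter residue in a sequence."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def most_frequent(protein: str) -> list[tuple[str, float]]:
    """Return every residue tied for the highest share as (letter, percent)."""
    shares = amino_acid_percentages(protein)
    if not shares:
        return []
    top = max(shares.values())
    # Sort alphabetically so that ties are reported in a stable order.
    return sorted((aa, pct) for aa, pct in shares.items() if pct == top)
```

With the toy sequence `"MKLLSLLT"`, leucine accounts for 4 of 8 residues, so `most_frequent` reports L at 50%. Ties are returned together, which matches rows that list two residues with the same percentage.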
H1N1 (Swine flu)’s genome data.
| No. | Genome/segment | Accessions | Total protein sequences | Functional proteins | Total amino acids | Highest amino acid occurrence (%) |
|---|---|---|---|---|---|---|
| 1 | Neuraminidase (NA) | NC_026434.1 | 1 | 1 | 469 | G: 9.6%, I: 9.6% |
| 2 | Hemagglutinin (HA) | NC_026433.1 | 1 | 1 | 566 | S: 8.3% |
| 3 | Matrix Protein (M1) | NC_026431.1 | 6 | 1 | 322 | A: 9.3% |
| 4 | RNA Polymerase (PA) | NC_026437.1 | 1 | 1 | 716 | E: 10.6% |
| 5 | RNA Polymerase (PB1) | NC_026435.1 | 1 | 1 | 757 | T: 7.8% |
| 6 | RNA Polymerase (PB2) | NC_026438.1 | 1 | 1 | 759 | V: 8.3% |
| 7 | Nucleoprotein (NP) | NC_026436.1 | 1 | 1 | 498 | R: 10% |
| 8 | Non-Structural Protein (NS1) | NC_026432.1 | 6 | 1 | 282 | L: 10.3% |
Fig. 7. Amino acid percentages per segment.
HIV identification.
| Virus name | Geo location | Release date |
|---|---|---|
| HIV-1 | USA | 2015 |
| HIV-2 | USA | 2015 |
HIV genome data.
| Name | Total proteins | Accessions | Functional proteins | Total amino acids | Highest amino acids occurrence (%) | Total PEFD |
|---|---|---|---|---|---|---|
| HIV-1 | 243 | NC_001802.1 | 35 | 2818 | R: 9.5% | 147.26 |
| HIV-2 | 135 | NC_001722.1 | 25 | 3319 | R: 8.6% | 143.5 |
Fig. 8. Amino acid percentages per virus.
Fig. 9. HIV-1 PEFD graph.
Fig. 10. HIV-2 PEFD graph.
Ebola virus identifications.
| Virus name | Geo location | Release date |
|---|---|---|
| EBOV | USA | 2015 |
Ebola virus genome data.
| Name | Total proteins | Accessions | Functional proteins | Total amino acids | Highest amino acids occurrence (%) | Total PEFD |
|---|---|---|---|---|---|---|
| EBOV | 465 | NC_014373.1 | 70 | 5849 | L: 9.7% | 193.1 |
Fig. 11. Amino acid percentages.
Fig. 12. Ebola virus PEFD.
Fig. 13. Amino acid percentages.
Hepatitis-C (HCV)’s identification.
| Virus name | Geo location | Release date |
|---|---|---|
| Hepatitis-C (HCV) | USA | 2018 |
Hepatitis-C virus (HCV)’s genome data.
| Virus name | Total proteins | Accessions | Functional proteins | Total amino acids | Highest amino acids occurrence (%) | Total PEFD |
|---|---|---|---|---|---|---|
| Hepatitis-C (HCV) | 103 | NC_038882.1 | 44 | 3097 | S: 12.7% | 31.9 |
Fig. 14. Hepatitis-C (HCV)’s PEFD.
Coronaviruses identifications.
| Virus name | Geo location | Collection date |
|---|---|---|
| HCoV-229E | – | – |
| HCoV-OC43 | USA | – |
| SARS-CoV | Toronto, Canada | – |
| HCoV-NL63 | – | – |
| HCoV-HKU1 | USA | 2010 |
| MERS-CoV | – | 2012 |
| SARS-CoV-2 | China | 2019 |
Coronavirus’s genome data.
| Genome | Accessions | Total protein sequences | Functional proteins | Total amino acids | Highest amino acids occurrence (%) | Total PEFD |
|---|---|---|---|---|---|---|
| HCoV-229E | NC_002645.1 | 762 | 28 | 8344 | C: 10%, L: 10% | 86.4 |
| HCoV-OC43 | NC_006213.1 | 794 | 92 | 9454 | L: 18.4% | 9.8 |
| SARS-CoV | NC_004718.3 | 804 | 43 | 8381 | L: 12.6% | 0.59 |
| HCoV-NL63 | NC_005831.2 | 370 | 60 | 9625 | L: 15.7% | 28 |
| HCoV-HKU1 | KF686346.1 | 775 | 41 | 9193 | L: 9.6% | 147.6 |
| MERS-CoV | NC_019843.3 | 273 | 68 | 9645 | L: 14.2% | 3.94 |
| SARS-CoV-2 | NC_045512 | 960 | 113 | 9350 | L: 18.3% | 0.134 |
Fig. 15. Amino acid percentages per virus.
Fig. 16. HCoV-229E PEFD.
Fig. 17. HCoV-OC43 PEFD.
Fig. 18. SARS-CoV PEFD.
Fig. 19. HCoV-NL63 PEFD.
Fig. 20. HCoV-HKU1’s PEFD.
Fig. 21. MERS-CoV’s PEFD.
Fig. 22. SARS-CoV-2’s PEFD.
Fig. 23. ORF7a protein prediction: an LSTM neural network predicts the protein sequence, and the “Robetta” structure prediction tool from the Baker lab at the University of Washington predicts its structure (Main, 2020).
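An LSTM trained for sequence prediction typically learns from fixed-length windows paired with the next symbol. The sketch below shows how a protein corpus might be turned into such training pairs. The window length, the integer encoding, and the function name `make_windows` are assumptions made for illustration, since this record does not describe the paper's actual data preparation, and the example input is a toy string rather than a real ORF7a sequence.

```python
def make_windows(sequences, window=5):
    """Build next-residue training pairs from protein sequences.

    Returns the alphabet, a list of integer-encoded input windows, and a
    list of integer-encoded targets (the residue following each window).
    The window length is a hypothetical choice, not a value from the paper.
    """
    alphabet = sorted(set("".join(sequences)))
    index = {aa: i for i, aa in enumerate(alphabet)}
    inputs, targets = [], []
    for seq in sequences:
        # Slide one residue at a time; each window predicts the next residue.
        for start in range(len(seq) - window):
            chunk = seq[start:start + window]
            inputs.append([index[aa] for aa in chunk])
            targets.append(index[seq[start + window]])
    return alphabet, inputs, targets
```

The resulting integer windows and targets are the kind of input a recurrent model could consume after an embedding or one-hot step. Sequences no longer than the window contribute no pairs.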