Zaid Almubaid, Hisham Al-Mubaid.
Abstract
We present an analysis and comparison study of genetic variants and mutations in about 1200 genomes of the SARS-CoV-2 virus sampled across the first seven months of 2020. The study includes 12 sets of about 100 genomes each, collected between January and September. We analyzed the mutations, the trends in mutation frequency and count over time, and the trends across genomes from January through September. We show that certain mutations in the SARS-CoV-2 genome are not occurring randomly, as has been commonly believed. This finding agrees with, and therefore supports, other recently published research in this domain. This study includes approximately 1000 genomes and identified over 35 different mutations, most of which are common to almost all genome groups. The ratios (frequency percentages) of some mutations fluctuate over time as the virus adapts to various environmental factors, climate, and populations. One of the interesting findings in this paper is that the coding region for the NSP13 protein is, at the nucleotide level, relatively conserved compared with other protein regions in the ORF1ab gene, which makes this protein a good candidate for developing drug targets and treatments for COVID-19. Although this outcome was already reported by other researchers, we corroborate their result using a different approach and another experimental setting with almost one thousand complete genome sequences. We present and discuss these results and findings with tables and illustrative figures.
Keywords: 2019 H-CoV 2; Genetic variants; Novel coronavirus; SARS-CoV-2; SARS-CoV-2 genetic variants; SARS-CoV-2 genome
Year: 2021 PMID: 33681535 PMCID: PMC7917442 DOI: 10.1016/j.genrep.2021.101064
Source DB: PubMed Journal: Gene Rep ISSN: 2452-0144
Fig. 1. Structure of the SARS-CoV-2 genome in three different views: (a) basic illustration of the complete genome structure; (b) illustration of coronavirus 2 isolate Wuhan-Hu-1, NC_045512 (complete genome, 29,903 bp), as presented in GenBank/NCBI (https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=graph); (c) another view of the NC_045512.2 reference genome from NCBI showing the locations of all genes. {Note: (b) and (c) are both courtesy of the US National Center for Biotechnology Information (NCBI), www.ncbi.nlm.nih.org.}
Description and details of all genes and proteins of the SARS-CoV-2.
(a) The description and division of regions of a complete SARS-CoV-2 genome (reference genome NC_045512 from GenBank).
(b) This table shows the location and number of amino acids for each gene/protein {the first gene, ORF1ab, expresses a polyprotein comprising the 16 nonstructural proteins (NSPs) shown in (c)}.
(c) The 16 NSPs {nonstructural proteins} from the SARS-CoV-2 polyprotein.
(a) NC_045512.2 (29,903 nt)

| No. | Start | Stop | Gene symbol | Strand | NCBI gene ID |
|---|---|---|---|---|---|
| 1 | 266 | 21555 | ORF1ab | Plus | 43740578 |
| 2 | 21563 | 25384 | S | Plus | 43740568 |
| 3 | 25393 | 26220 | ORF3a | Plus | 43740569 |
| 4 | 26245 | 26472 | E | Plus | 43740570 |
| 5 | 26523 | 27191 | M | Plus | 43740571 |
| 6 | 27202 | 27387 | ORF6 | Plus | 43740572 |
| 7 | 27394 | 27759 | ORF7a | Plus | 43740573 |
| 8 | 27756 | 27887 | ORF7b | Plus | 43740574 |
| 9 | 27894 | 28259 | ORF8 | Plus | 43740577 |
| 10 | 28274 | 29533 | N | Plus | 43740575 |
| 11 | 29558 | 29674 | ORF10 | Plus | 43740576 |
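The start and stop coordinates in table (a) are 1-based and inclusive, so a gene's length in nucleotides is `stop - start + 1`. The following minimal Python sketch recomputes the lengths from the coordinates listed above; the names `GENES` and `gene_length` are illustrative and do not come from the paper.

```python
# Gene coordinates copied from table (a) for reference genome NC_045512.2.
# The dictionary and function names are illustrative assumptions.
GENES = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
    "ORF3a": (25393, 26220),
    "E": (26245, 26472),
    "M": (26523, 27191),
    "ORF6": (27202, 27387),
    "ORF7a": (27394, 27759),
    "ORF7b": (27756, 27887),
    "ORF8": (27894, 28259),
    "N": (28274, 29533),
    "ORF10": (29558, 29674),
}


def gene_length(start, stop):
    """Length in nucleotides of a 1-based, inclusive interval."""
    return stop - start + 1


lengths = {name: gene_length(start, stop) for name, (start, stop) in GENES.items()}

for name, n in lengths.items():
    print(f"{name:7s} {n:6d} nt")
```

For the S gene this gives 3822 nt, which agrees with the "3822 nt 21563–25384" entry in the gene description table.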
Fig. 2. This figure shows the division of the genome into genes/proteins and highlights ORF1ab and the spike protein.
More details about the main genes (and coded proteins) in SARS-CoV-2.
| Gene | Gene symbol | Gene ID | Gene type | Gene description |
|---|---|---|---|---|
| S surface glycoprotein | S | 43740568 | Protein coding | Surface glycoprotein (3822 nt 21563–25384) |
| M membrane glycoprotein [severe acute respiratory syndrome coronavirus 2] | M | 43740571 | Protein coding | Gene description |
| ORF1ab ORF1a polyprotein;ORF1ab polyprotein [severe acute respiratory syndrome coronavirus 2] | ORF1ab | 43740578 | Protein coding | Gene description |
| ORF3a protein [severe acute respiratory syndrome coronavirus 2] | ORF3a | 43740569 | Protein coding | Gene description |
Details of the twelve datasets of genome sequences.
| Dataset | Number of genomes | Collection date | Mean length (nt) |
|---|---|---|---|
| S1-genomes-01-01-to-01-31 | ~100 | 1–31 January | 29,858 |
| S2-101-genomes-01-01-to-02-29 | ~100 | 1 Jan.–29 Feb. | 29,921 |
| S3-103-genomes-02-01-to-02-29 | ~100 | 1 Feb.–29 Feb. | 29,752 |
| S4-99-genomes-02-01-03-25-. | ~100 | 2 Feb.–25 March | 29,676 |
| S5-93-genomes-3-1-3-26 | ~100 | 1–26 March | 29,839 |
| S6-95-genomes-04-21-04-30 | ~100 | 21–30 April | 29,703 |
| S7_100-genomes-05-01-to-05-07 | ~100 | 1–7 May | 29,666 |
| S8-100-genomes-05-22-to-05-31 | ~100 | 22–31 May | 29,822 |
| S9-128-genomes-06-09-06-12 | ~100 | 9–12 June | 29,686 |
| S10-101-genomes-07-01-07-17 | ~100 | 1–17 July | 29,836 |
| S11-100-genomes-08-01-08-31 | ~100 | 1–31 August | 29,824 |
| S12-100-genomes-09-01-09-30 | ~100 | 1–30 September | 29,886 |
Total number of genome sequences available at NCBI based on the collection month starting from December 2019.
| Month of collection | No. of genome sequences |
|---|---|
| December/2019 | 22 |
| January | 399 |
| February | 555 |
| March | 6189 |
| April | 5157 |
| May | 2508 |
| June | 3005 |
| July | 4545 |
| August | 652 |
| September | 200 |
Fig. 3. Visualization of the genomic sequences showing the alignment and genetic variants.
(a) The 23406A>G variant in dataset S5, shown with the NCBI MSA viewer (partial view of positions 23354 to 23463 for a sample of 30 genomes).
(b) Visualization of the multiple sequence alignment using the Jalview tool.
(c) Dataset S1c-100-genome aligned with Clustal MSA; partial view in Jalview.
Note: Y is C or T; R is A or G; W is A or T.
Note: (a) is courtesy of the US National Center for Biotechnology Information NCBI. (b) and (c) are courtesy of Jalview tool (Waterhouse et al., n.d.): https://www.jalview.org/.
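The ambiguity symbols in the note to Fig. 3 (Y, R, W) are part of the IUPAC nucleotide code, in which one letter stands for a set of possible bases. As a minimal sketch, the lookup below covers only the three symbols named in that note plus the four plain bases; the names `AMBIGUITY`, `possible_bases`, and `is_compatible` are illustrative, and the full IUPAC table contains further symbols that are not handled here.

```python
# Ambiguity symbols listed in the note for Fig. 3 (a subset of the IUPAC
# nucleotide codes). All names in this sketch are illustrative assumptions.
AMBIGUITY = {
    "Y": frozenset("CT"),  # Y is C or T
    "R": frozenset("AG"),  # R is A or G
    "W": frozenset("AT"),  # W is A or T
}


def possible_bases(symbol):
    """Return the set of bases that one alignment symbol may stand for."""
    s = symbol.upper()
    if len(s) == 1 and s in "ACGT":
        return frozenset(s)
    return AMBIGUITY[s]  # raises KeyError for symbols not covered here


def is_compatible(symbol, base):
    """True if `base` is one of the bases that `symbol` can represent."""
    return base.upper() in possible_bases(symbol)


print(sorted(possible_bases("Y")))
print(is_compatible("R", "g"))
```

Such a check is useful when counting variants, because a column containing Y is compatible with both a c>t change and the unchanged base.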
The most common genetic variants found in the 100 genomes of dataset S7 (s7_100-genomes-05-01-to-05-07) which includes genomes collected during 1–7 May.
| Results from S7 100-genomes-05-01-to-05-07 |
|---|
| 100% occupancy and 100% identity start at position 142 |
| Last 100% occupancy at 29,666 (last 100% identity at 29,652, 100% T) |
| 241t>c 37% (i.e., T is 63% and mutation c is 37%) |
| 1059c>t 36% |
| 2416c>t 14% |
| 2447g>a 3% |
| 3037c>t 41% |
| 4523g>a 13% |
| MT434802: 5698c>t |
| MT434786: 6639a>g |
| 438551: 8653g>t |
| 8782c>t 17% |
| 11083g>t 6% |
| 13265a>t 3% |
| 14408t>c 37% |
| 14805c>t 6% |
| 17747c>t 15% |
| 17845a>g 14% |
| 18060c>t 14% |
| 23403g>a 37% |
| 25563t>g 47% |
| 26144g>a 8% |
| 27964c>t 3% (including MT429186, MT422807, and MT422806) |
| 28883g>c 4% |
| 29540g>a 10% |
| 29711g>t 7% |
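An entry such as "241t>c 37% (i.e., T is 63% and mutation c is 37%)" is the share of each base in one column of the multiple sequence alignment. The sketch below shows one way such percentages could be computed from already aligned sequences; it is not the authors' pipeline, the function name `column_frequencies` is an assumption, and the four-sequence alignment is a made-up toy input rather than data from the study.

```python
from collections import Counter


def column_frequencies(aligned, position):
    """Percentage of each symbol at a 1-based column of an alignment.

    `aligned` is a list of equal-length aligned sequences (strings).
    """
    column = [seq[position - 1].upper() for seq in aligned]
    counts = Counter(column)
    total = len(column)
    return {base: 100.0 * n / total for base, n in counts.items()}


# Toy alignment for illustration only (not data from the study).
toy_alignment = ["ACGT", "ACGT", "ATGT", "ACGT"]
freqs = column_frequencies(toy_alignment, 2)
print(freqs)
```

Positions here are alignment columns. They coincide with reference coordinates only when the alignment is anchored to the reference genome without offsets.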
The genetic variants in the set S8 (100 genomes).
| Results from S8 100-genomes-05-29-to-05-31 |
|---|
| 100% occupancy and 100% identity start at position 50 with 100% C |
| Last 100% occupancy at 29,806 (also last 100% identity at 29,806, 100% A) |
| 241t>c 46% (i.e., T is 54% and mutation c is 46%) |
| MT539162: 490t>a 4% (in 4 sequences) |
| MT535481: 833t>c 3% |
| 1059c>t 19% |
| 2416c>t 14% |
| MT536953: 370g>t |
| MT539159: 2243g>a |
| 3037t>c 47% |
| 3177c>t 4% (including MT539163) |
| 4084c>t 7% |
| 6512a>c |
| 8782c>t 12% |
| 10129a> |
| MT534285: 11083g>t 3% |
| 12557a>g 25% |
| 14408t>c 46% |
| 14940a>g 5% |
| 15771t>y (or t>k) 15% |
| 17747c>t 6% |
| 18877c>t 10% |
| 23403g>a 46% >>> see this in |
| 24904c>t 25% |
| 25563g>t 33% |
| 25916c>t 25% |
| 27359a>g 25% |
| 27964c>t 11% |
| 28144t>c 12% |
| Near 28878–28896: 22 mutations |
| 29360t> |
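The variant entries in the tables above follow a loose "position, reference base > alternate base, percentage" notation, sometimes prefixed with an accession number and sometimes incomplete. A small parser makes such lists machine-readable. The regular expression and the `Variant` record below are an illustrative sketch of that notation as it appears here, not a standard format; truncated entries are returned with the missing fields set to `None` rather than guessed.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    position: int
    ref: str
    alt: Optional[str]        # None when the entry is truncated, e.g. "10129 a>"
    percent: Optional[float]  # None when no percentage directly follows


# position, optional colon, reference base, ">", optional alternate symbol,
# and an optional percentage immediately after the variant.
PATTERN = re.compile(
    r"(?P<pos>\d+)\s*:?\s*(?P<ref>[acgt])\s*>\s*(?P<alt>[a-z])?"
    r"(?:\s+(?P<pct>\d+(?:\.\d+)?)\s*%)?",
    re.IGNORECASE,
)


def parse_variant(text):
    """Parse one table entry; return None if no variant is found."""
    m = PATTERN.search(text)
    if m is None:
        return None
    alt = m.group("alt")
    pct = m.group("pct")
    return Variant(
        position=int(m.group("pos")),
        ref=m.group("ref").lower(),
        alt=alt.lower() if alt else None,
        percent=float(pct) if pct else None,
    )


for entry in ["25563t>g 47%", "8782 c>t 12%", "mt434786 6639:a>g", "10129 a>"]:
    print(entry, "->", parse_variant(entry))
```

Accession prefixes are skipped because the pattern requires the digits to be followed by a base letter and ">".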
Percentage of 13 genetic variants across ten sets of genomes. Each row represents one set of genomes (about 100 genomes in each set). The 13 genetic variants are shown as column headers 1…13.
| Set | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Variant | 241t>c | 1059c>t: 19% | 3037c>t | 8782C>T in ORF1ab gene | 11083g>t | 14408c>t | 17747c>t | 17858a>g | 18060c>t | 23406a>g in S gene | 25566g>t in orf3a | 28144T>C in ORF8 gene | 29095C>T |
| 1 | 0% | 0% | 0% | 35% | 6% | 0% | 0% | 0% | 6% | 0% | 0% | 35% | 18% |
| 2 | 232t>c: 10% | 1060c>t: 3% | 3% 3037t>c | 8798c>t: 39% | 1% 11069 | 10% 14407c>t | 6% 17749c>t | 10% 17836 | 0% | 0% | 0% | 28192t>c: 42% | 29143c>t: 23% |
| 3 | 0% | 0% | 0% | 8785c>t: 30% | 4% 11086 | 0% | 0% | 0% | 6%: 18063 | 0% | 0% | 28147t>c: 30% | 29098c>t: 9% |
| 4 | 9% | 1071: 2% | 3045: 5% | 8790: 29% | 11091t>g: 46% | 5% | 17755: 14% | 17868: 14% | 18068: 14% | 23414: 5% | 0% | 28155: 29% | 0% |
| 5 | 18%: 241c>t | 10% | 15% | 46% | 35% | 15% | 34% | 34% | 37% | 15% *1 | 11% | 46% (28147) | 0% |
| 6 | 13% | 32% 1059t>c | 13% t>c | 9% | 4% | 13% 14408t>c | 1% | 3% | 3% | 13% 23403 | 28% 25563t>g | 9% | 0% |
| 7 | 37% | 34% | 41% | 17% | 6% | 37% t>c | 15% | 14% | 14% | 37% g>a | 47% 25563t>g | 17% | 0% |
| 8 | 46% | 19% | 46% t>c | 12% | 3% | 46% t>c | 6% | 6% | 6% | 46% 23403g>a | 33% 25563g>t | 12% | 1% |
| 9 | 11.7% 241 t>c | 1163 a>t 10% | 8.60% | 7% | 0% (zero) | 14408t>c: 14% | 1.60% | 17858 1.6% | 1.60% | 23403: 8.6% | 25563: 48.44% | 4.70% | 0% |
| 10 | 19% | 11% | 17% t>c | 8% | 15% | 17% t>c | 4% | 4% | 4% | 23403: 17% | 32% | 6% | 11% |
Fig. 4. Illustration of the percentage of five mutations in eight genomic sets (sets S1…S8) collected between 1 January and 31 May.
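A trend of this kind can be read directly from one column of the percentage table. The sketch below takes the values of the first variant column (241t>c) for the ten sets as listed in that table and reports the peak set and the set-to-set changes. The names `SETS`, `PCT_241`, `peak`, and `changes` are illustrative, and this is a descriptive summary only, not the statistical analysis used in the paper.

```python
# Percentages of variant column 1 (241t>c) for sets 1..10, copied from the
# table of 13 variants across ten sets. All names here are illustrative.
SETS = [f"S{i}" for i in range(1, 11)]
PCT_241 = [0, 10, 0, 9, 18, 13, 37, 46, 11.7, 19]


def peak(labels, values):
    """Return (label, value) of the set with the highest percentage."""
    i = max(range(len(values)), key=values.__getitem__)
    return labels[i], values[i]


def changes(values):
    """Differences between consecutive sets, in percentage points."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


print("peak:", peak(SETS, PCT_241))
print("changes:", changes(PCT_241))
```

The same two helpers can be applied to any other column of the table to compare how different variants rise and fall across the sets.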
Top genetic variants in genome sets S11-August (a); and in S12-September (b).
| (a) | |
|---|---|
| 1059t>c | 42% |
| 25563t>g | 33% |
| 16260c>t | 23% |
| 28821c>a | 22% |
| 28881g>a | 19% |
| 28882g>a | 19% |
| 28883g>c | 19% |
| 27964c>t | 16% |
| 20268a>g | 15% |
| 28854c>t | 13% |
| 11498c>t | 11% |
| 21575c>t | 11% |
| 9115c>t | 10% |
| 19603a>g | 10% |
| 21304c>t | 10% |
The frequent genetic variants that were found in common between the S11-August and S12-September genome sequences.
| 16269c>t |
| 20268a>g |
| 21304c>t |
| 27964c>t |
| 28821c>a |
| 28854c>t |
| 28881g>a |
| 28882g>a |
| 28883g>c |
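A list of common variants like the one above is a set intersection of the per-dataset variant lists. Since only the S11-August list (a) is reproduced here, the sketch below does not recompute the intersection. It only checks which entries of the common list also appear in list (a) and flags any that do not. The variable names are illustrative, and a flagged entry may simply reflect a transcription difference in either list.

```python
# Variants copied from table (a), S11-August, and from the list of variants
# reported as common to S11 and S12. Variable names are illustrative.
S11_TOP = [
    "1059t>c", "25563t>g", "16260c>t", "28821c>a", "28881g>a",
    "28882g>a", "28883g>c", "27964c>t", "20268a>g", "28854c>t",
    "11498c>t", "21575c>t", "9115c>t", "19603a>g", "21304c>t",
]
COMMON = [
    "16269c>t", "20268a>g", "21304c>t", "27964c>t", "28821c>a",
    "28854c>t", "28881g>a", "28882g>a", "28883g>c",
]


def position(variant):
    """Numeric position of an entry written as '<pos><ref>><alt>'."""
    return int(variant[:-3])


# Entries of the common list that are present in the S11 list.
shared = sorted(set(COMMON) & set(S11_TOP), key=position)
# Entries of the common list with no exact match in the S11 list.
unmatched = sorted(set(COMMON) - set(S11_TOP), key=position)

print("shared:", shared)
print("not in S11 list:", unmatched)
```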
| Area or task | References |
|---|---|
| 1. History and | ( |
| 2. | ( |
| 3. | ( |
| 4. Studying the | ( |
| 5. Studying the virus in connection with the | ( |
Note: Literature and publications related to the COVID-19 disease and pandemic (Junejo et al., 2020; El Idrissi, 2020; Rothan and Byrareddy, 2020; Wilder-Smith et al., 2020; Anastassopoulou et al., 2020) focus mainly on outbreak data analysis, timeline and infection rate analysis, prediction of the pandemic's decline, etc.