Hendrick Gao-Min Lim, Shih-Hsin Hsiao, Yang C Fann, Yuan-Chii Gladys Lee.
Abstract
Several variants of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are emerging all over the world. Variant surveillance through genome sequencing has become crucial for determining whether mutations in these variants render the virus more infectious, more potent, or more resistant to existing vaccines and therapeutics. However, repeatedly analyzing large volumes of raw sequencing data with currently available code-based bioinformatics tools is very difficult during this pandemic because experts and computational resources are limited. Therefore, to hasten variant surveillance efforts, we developed an installation-free cloud workflow for robust mutation profiling of SARS-CoV-2 variants from multiple Illumina sequencing data sets. Here, 55 raw sequencing data sets representing four early SARS-CoV-2 variants of concern (Alpha, Beta, Gamma, and Delta) from an open-access database were used to test the performance of our workflow. The workflow automatically identified the mutated sites of the variants, with reliable annotation of the protein-coding genes, in a cost-effective and timely manner for all samples by harnessing parallel cloud computing in a single execution under resource-limited settings. In addition, the workflow can generate a consensus genome sequence that can be shared in public data repositories to support global variant surveillance efforts.
Keywords: COVID-19; Common Workflow Language; Illumina sequencing; SARS-CoV-2; cloud workflow; genomics surveillance; lineage; mutation; parallel computing; variant
Year: 2022 PMID: 35456492 PMCID: PMC9028989 DOI: 10.3390/genes13040686
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.141
SARS-CoV-2 variants of concern (VOCs) and variants of interest (VOIs).
| Variant | Lineage * | Alias | Place of Origin | Time of Origin | Type | Designation |
|---|---|---|---|---|---|---|
| α (Alpha) | B.1.1.7 | - | United Kingdom | September 2020 | VOC | 18 December 2020 |
| β (Beta) | B.1.351 | - | South Africa | May 2020 | VOC | 18 December 2020 |
| γ (Gamma) | P.1 | B.1.1.28.1 | Brazil | November 2020 | VOC | 11 January 2021 |
| δ (Delta) | B.1.617.2 | - | India | October 2020 | VOC | 11 May 2021 |
| ε (Epsilon) | B.1.427/B.1.429 | - | United States | March 2020 | VOI | 5 March 2021 |
| ζ (Zeta) | P.2 | B.1.1.28.2 | Brazil | April 2020 | VOI | 17 March 2021 |
| η (Eta) | B.1.525 | - | multiple countries | December 2020 | VOI | 17 March 2021 |
| θ (Theta) | P.3 | B.1.1.28.3 | Philippines | January 2021 | VOI | 24 March 2021 |
| ι (Iota) | B.1.526 | - | United States | November 2020 | VOI | 24 March 2021 |
| κ (Kappa) | B.1.617.1 | - | India | October 2020 | VOI | 4 April 2021 |
| λ (Lambda) | C.37 | B.1.1.1.37 | Peru | December 2020 | VOI | 14 June 2021 |
| μ (Mu) | B.1.621 | - | Colombia | January 2021 | VOI | 30 August 2021 |
| ο (Omicron) | B.1.1.529 | - | multiple countries | November 2021 | VOC | 26 November 2021 |
SARS-CoV-2, severe acute respiratory syndrome coronavirus 2. * Includes all descendent lineages.
Figure 1. Flowchart diagram for selecting the data sets.
Summary of selected data sets.
| Data Set | No. of Samples: Alpha | No. of Samples: Beta | No. of Samples: Gamma | No. of Samples: Delta | No. of Reads | Sequencer Type | Sequencing |
|---|---|---|---|---|---|---|---|
| PRJNA704235 | - | - | 3 | - | 1,483,054 | MiniSeq | WGS |
| PRJNA708134 | 39 | - | - | - | 16,060,411 | NovaSeq | amplicon |
| PRJNA726840 | 3 | 8 | - | - | 2,801,089 | iSeq | WGS |
| PRJNA726871 | - | - | - | 1 | 128,048 | MiSeq | amplicon |
| PRJNA733209 | - | - | 1 | - | 2,415,993 | MiniSeq | WGS |
| Total | 42 | 8 | 4 | 1 | 22,888,595 | - | - |
WGS, whole-genome sequencing.
Figure 2. Graphical representation of our cloud workflow.
Summary of the cloud workflow performance.
| Step | Name | Instance 1 | Time 2 | Cost 3 |
|---|---|---|---|---|
| Data preprocessing | SRA Download and Set Metadata workflow | c4.8xlarge a | 7 | 0.09 |
| | FastQC tool | c4.2xlarge b | 6 | 0.04 |
| Mutation Profiling | Our cloud workflow * | c4.2xlarge | 7 | 2.09 |
| Total | | | 20 | 2.22 |
1 From Amazon Web Services; 2 in minutes; 3 in US$. a Thirty-six virtual central processing units (vCPUs), 60 gigabytes (GB) of memory, and 1024 GB attached storage; b 8 vCPUs, 15 GB of memory, and 1024 GB attached storage. * Run in parallel with one instance per sample, costing on average US$0.04 per instance.
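The totals and the per-instance average in the footnote can be checked with a few lines of arithmetic. The sketch below copies the values from the table and takes the sample count of 55 from the abstract; the variable names are illustrative only.

```python
# Values copied from the performance table: (time in minutes, cost in US$).
steps = {
    "SRA Download and Set Metadata workflow": (7, 0.09),
    "FastQC tool": (6, 0.04),
    "Mutation profiling workflow": (7, 2.09),
}
N_SAMPLES = 55  # number of raw sequencing data sets stated in the abstract

total_minutes = sum(minutes for minutes, _ in steps.values())
total_cost = round(sum(cost for _, cost in steps.values()), 2)

# The mutation profiling step runs one instance per sample in parallel,
# so its cost divided by the sample count gives the average per instance.
per_instance_cost = steps["Mutation profiling workflow"][1] / N_SAMPLES

print(f"total time: {total_minutes} min")
print(f"total cost: US${total_cost:.2f}")
print(f"average cost per instance: US${per_instance_cost:.3f}")
```

Rounded to two decimals, the per-instance figure agrees with the US$0.04 quoted in the footnote.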
Figure 3. The most frequent mutations found in our variant of concern (VOC) samples. Mutation percentages highlighted in yellow denote the most common mutations listed in the COVID-19 CG for the corresponding VOCs, while mutation percentages in red denote key mutations listed by the UCSC SARS-CoV-2 Genome Browser. * Most mutations found in the sample, except for its sub-lineage (Q.4) with C23604G.
Key mutations in the Alpha VOC.
| Mutation | Type | Level | Protein | Codon Change | Consequence for the AA Sequence | Corresponding Protein Annotation of UCSC SARS-CoV-2 |
|---|---|---|---|---|---|---|
| C3267T | SNV | moderate | ORF1ab | aCt3002aTt | T1001I | nsp3 |
| C5388A | SNV | moderate | ORF1ab | gCt5123gAt | A1708D | nsp3 |
| T6954C | SNV | moderate | ORF1ab | aTa6689aCa | I2230T | nsp3 |
| TCTGGTTTT11288- | deletion | moderate | ORF1ab | TCTGGTTTT | SGF3675:3677- | nsp6 |
| C14408T | SNV | moderate | ORF1ab | cCt14144cTt | P4715L | nsp12 |
| TACATG21765- | deletion | moderate | S | aTACATGtc | IHV68:70I | S |
| TTA21991- | deletion | moderate | S | gtTTAt | VY143:144V | S |
| A23063T | SNV | moderate | S | Aat1501Tat | N501Y | S |
| C23271A | SNV | moderate | S | gCt1709gAt | A570D | S |
| A23403G | SNV | moderate | S | gAt1841gGt | D614G | S |
| C23604A | SNV | moderate | S | cCt2042cAt | P681H | S |
| C23709T | SNV | moderate | S | aCa2147aTa | T716I | S |
| T24506G | SNV | moderate | S | Tca2944Gca | S982A | S |
| G24914C | SNV | moderate | S | Gac3352Cac | D1118H | S |
| C27972T | SNV | high | ORF8 | Caa79Taa | Q27stop | ORF8 |
| G28048T | SNV | moderate | ORF8 | aGa155aTa | R52I | ORF8 |
| A28111G | SNV | moderate | ORF8 | tAc218tGc | Y73C | ORF8 |
| G28280C | SNV | moderate | N | Gat7Cat | D3H | N |
| A28281T | SNV | moderate | N | gAt8gTt | D3V | N |
| T28282A | SNV | moderate | N | gaT9gaA | D3E | N |
| G28881A | SNV | moderate | N | aGg608aAg | R203K | N |
| G28883C | SNV | moderate | N | Gga610Cga | G204R | N |
| C28977T | SNV | moderate | N | tCt704tTt | S235F | N |
AA, amino acid; SNV, single nucleotide variation; ORF, open reading frame; nsp, non-structural protein; S, spike; N, nucleocapsid.
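The codon-change and amino-acid columns are related by a simple rule, which the following sketch illustrates for SNV rows. It assumes, based on the rows above, that an entry such as `aCt3002aTt` gives the reference codon, the 1-based position of the changed base within the coding sequence, and the altered codon. The function name `aa_consequence` is hypothetical and not part of the workflow; the codon table is the standard genetic code.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed layout: <reference codon><CDS position of changed base><altered codon>
CODON_CHANGE = re.compile(r"^([ACGTacgt]{3})(\d+)([ACGTacgt]{3})$")


def aa_consequence(codon_change: str) -> str:
    """Turn an SNV codon change such as 'aCt3002aTt' into 'T1001I'."""
    match = CODON_CHANGE.match(codon_change)
    if match is None:
        raise ValueError(f"unsupported codon change: {codon_change!r}")
    ref_codon, cds_pos, alt_codon = match.group(1), int(match.group(2)), match.group(3)
    # Codons are consecutive triplets, so 1-based base positions 1-3 map to codon 1.
    codon_number = (cds_pos - 1) // 3 + 1
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return f"{ref_aa}{codon_number}{alt_aa}"


for entry in ("aCt3002aTt", "Aat1501Tat", "gAt1841gGt", "Caa79Taa"):
    print(entry, "->", aa_consequence(entry))
```

A stop codon is written as `*` in this sketch, so the ORF8 row listed as Q27stop comes out as `Q27*`. Deletion rows use a different notation and are not handled here.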
Key mutations in the Beta VOC.
| Mutation | Type | Level | Protein | Codon Change | Consequence for the AA Sequence | Corresponding Protein Annotation of UCSC SARS-CoV-2 |
|---|---|---|---|---|---|---|
| C1059T | SNV | moderate | ORF1ab | aCc794aTc | T265I | nsp2 |
| G5230T | SNV | moderate | ORF1ab | aaG4965aaT | K1655N | nsp3 |
| A10323G | SNV | moderate | ORF1ab | aAg10058aGg | K3353R | nsp5 |
| TCTGGTTTT11288- | deletion | moderate | ORF1ab | TCTGGTTTT | SGF3675:3677- | nsp6 |
| C14408T | SNV | moderate | ORF1ab | cCt14144cTt | P4715L | nsp12 |
| A21801C | SNV | moderate | S | gAt239gCt | D80A | S |
| CTTTACTTG22281- | deletion | moderate | S | aCTTTACTTGct719-727act | TLLA240:243T | S |
| G22813T | SNV | moderate | S | aaG1251aaT | K417N | S |
| G23012A | SNV | moderate | S | Gaa1450Aaa | E484K | S |
| A23063T | SNV | moderate | S | Aat1501Tat | N501Y | S |
| A23403G | SNV | moderate | S | gAt1841gGt | D614G | S |
| C23664T | SNV | moderate | S | gCa2102gTa | A701V | S |
| G25563T | SNV | moderate | ORF3a | caG171caT | Q57H | ORF3a |
| C26456T | SNV | moderate | E | cCt212cTt | P71L | E |
| C28887T | SNV | moderate | N | aCt614aTt | T205I | N |
E, envelope.
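The labels in the Mutation column follow a compact pattern: reference base(s), genomic position, then either the alternate base or `-` for a deletion. A minimal parsing sketch, assuming only the two forms that appear in these tables, could look as follows. The `Mutation` type and `parse_mutation` function are illustrative names, not part of the published workflow.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    ref: str   # reference base(s)
    pos: int   # 1-based genomic position of the first reference base
    alt: str   # alternate base, or "-" for a deletion
    kind: str  # "SNV" or "deletion"


# One or more reference bases, a position, then one base or "-".
MUTATION_LABEL = re.compile(r"^([ACGT]+)(\d+)([ACGT]|-)$")


def parse_mutation(label: str) -> Mutation:
    """Parse labels such as 'C1059T' or 'CTTTACTTG22281-'."""
    match = MUTATION_LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if alt == "-":
        kind = "deletion"
    elif len(ref) == 1:
        kind = "SNV"
    else:
        # Multi-base substitutions do not occur in the tables above.
        raise ValueError(f"unsupported mutation label: {label!r}")
    return Mutation(ref, pos, alt, kind)


print(parse_mutation("C1059T"))
print(parse_mutation("CTTTACTTG22281-"))
```

For a deletion, the length of `ref` gives the number of deleted bases, for example nine bases (three codons) for `CTTTACTTG22281-`.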
Key mutations in the Gamma VOC.
| Mutation | Type | Level | Protein | Codon Change | Consequence for the AA Sequence | Corresponding Protein Annotation of UCSC SARS-CoV-2 |
|---|---|---|---|---|---|---|
| C3828T | SNV | moderate | ORF1ab | tCa3563tTa | S1188L | nsp3 |
| A5648C | SNV | moderate | ORF1ab | Aaa5383Caa | K1795Q | nsp3 |
| TCTGGTTTT11288- | deletion | moderate | ORF1ab | TCTGGTTTT | SGF3675:3677- | nsp6 |
| C14408T | SNV | moderate | ORF1ab | cCt14144cTt | P4715L | nsp12 |
| G17259T | SNV | moderate | ORF1ab | gaG16995gaT | E5665D | nsp13 |
| C21614T | SNV | moderate | S | Ctt52Ttt | L18F | S |
| C21621A | SNV | moderate | S | aCc59aAc | T20N | S |
| C21638T | SNV | moderate | S | Cct76Tct | P26S | S |
| G21974T | SNV | moderate | S | Gat412Tat | D138Y | S |
| G22132T | SNV | moderate | S | agG570agT | R190S | S |
| A22812C | SNV | moderate | S | aAg1250aCg | K417T | S |
| G23012A | SNV | moderate | S | Gaa1450Aaa | E484K | S |
| A23063T | SNV | moderate | S | Aat1501Tat | N501Y | S |
| A23403G | SNV | moderate | S | gAt1841gGt | D614G | S |
| C23525T | SNV | moderate | S | Cat1963Tat | H655Y | S |
| C24642T | SNV | moderate | S | aCt3080aTt | T1027I | S |
| G25088T | SNV | moderate | S | Gtt3526Ttt | V1176F | S |
| T26149C | SNV | moderate | ORF3a | Tcc757Ccc | S253P | ORF3a |
| G28167A | SNV | moderate | ORF8 | Gaa274Aaa | E92K | ORF8 |
| C28512G | SNV | moderate | N | cCa239cGa | P80R | N |
| A28877T | SNV | moderate | N | Agt604Tgt | S202C | N |
| G28878C | SNV | moderate | N | aGt605aCt | S202T | N |
| G28881A | SNV | moderate | N | aGg608aAg | R203K | N |
| G28883C | SNV | moderate | N | Gga610Cga | G204R | N |
Key mutations in the Delta VOC.
| Mutation | Type | Level | Protein | Codon Change | Consequence for the AA Sequence | Corresponding Protein Annotation of UCSC SARS-CoV-2 |
|---|---|---|---|---|---|---|
| C14408T | SNV | moderate | ORF1ab | cCt14144cTt | P4715L | nsp12 |
| C21618G | SNV | moderate | S | aCa56aGa | T19R | S |
| T22917G | SNV | moderate | S | cTg1355cGg | L452R | S |
| C22995A | SNV | moderate | S | aCa1433aAa | T478K | S |
| A23403G | SNV | moderate | S | gAt1841gGt | D614G | S |
| C23604G | SNV | moderate | S | cCt2042cGt | P681R | S |
| C25469T | SNV | moderate | ORF3a | tCa77tTa | S26L | ORF3a |
| T26767C | SNV | moderate | M | aTc245aCc | I82T | M |
| T27638C | SNV | moderate | ORF7a | gTt245gCt | V82A | ORF7a |
| C27752T | SNV | moderate | ORF7a | aCa359aTa | T120I | ORF7a |
| G28881T | SNV | moderate | N | aGg608aTg | R203M | N |
| G29402T | SNV | moderate | N | Gat1129Tat | D377Y | N |
M, membrane.
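The genomic positions in the Mutation column and the coding-sequence positions in the codon column differ by a per-gene offset. The sketch below shows that relationship using coding-sequence start coordinates from the NC_045512.2 (Wuhan-Hu-1) reference annotation. Using that reference is an assumption, since this section does not name the reference genome, although the table rows are consistent with it. The function names are hypothetical.

```python
# 1-based CDS start coordinates, assumed to follow the NC_045512.2 annotation.
CDS_START = {
    "ORF1ab": 266,
    "S": 21563,
    "ORF3a": 25393,
    "E": 26245,
    "M": 26523,
    "ORF7a": 27394,
    "ORF8": 27894,
    "N": 28274,
}

# ORF1ab is translated with a -1 ribosomal frameshift: base 13468 is read twice,
# so coding positions from that site onward are shifted by one.
ORF1AB_SLIP_SITE = 13468


def cds_position(gene: str, genome_pos: int) -> int:
    """Map a 1-based genomic position to a 1-based position within the gene's CDS."""
    offset = genome_pos - CDS_START[gene] + 1
    if offset < 1:
        raise ValueError(f"position {genome_pos} lies upstream of {gene}")
    if gene == "ORF1ab" and genome_pos >= ORF1AB_SLIP_SITE:
        # Base 13468 itself is ambiguous; this sketch reports its second reading.
        offset += 1
    return offset


def codon_number(cds_pos: int) -> int:
    """Return the 1-based codon index containing a 1-based CDS position."""
    return (cds_pos - 1) // 3 + 1


for gene, pos in (("ORF1ab", 14408), ("S", 22917), ("M", 26767), ("N", 29402)):
    cds = cds_position(gene, pos)
    print(gene, pos, "-> CDS position", cds, "codon", codon_number(cds))
```

For example, T22917G in S maps to coding position 1355 and codon 452, matching the L452R row, and C14408T in ORF1ab maps to 14144 and codon 4715 once the frameshift is taken into account.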