Genome-wide in silico identification and characterization of Simple Sequence Repeats in diverse completed SARS-CoV-2 genomes.

Rasel Siddiqe, Ajit Ghosh.

Abstract

Simple sequence repeats (SSRs), or microsatellites, are short repeat sequences that have been extensively studied in eukaryotic (plants) and prokaryotic (bacteria) organisms. Compared to other organisms, the presence and incidence of SSRs in viral genomes are less studied. With the emergence of novel infectious viruses over the past few decades, it is imperative to study the genetic diversity of such viruses to predict their evolutionary and functional changes over time. Following the emergence of SARS-CoV-2, we assembled 121 complete genomes reported from 31 countries across six continents for the identification and characterization of SSRs. Using two independent SSR identification tools, we found remarkable consistency in the microsatellite pattern (38–42 per genome) across the 121 analyzed SARS-CoV-2 genomes, indicating their important role in genome stability. Among the identified motifs, trinucleotide and hexanucleotide repeats were the most abundant forms, followed by mono- and dinucleotide repeats. There were no tetra- or pentanucleotide repeats in the analyzed SARS-CoV-2 genomes. The discovery of microsatellites in SARS-CoV-2 genomes may become useful for population genetics, evolutionary analysis, strain identification and the study of genetic variation.
© 2021 Elsevier Inc. All rights reserved.

Keywords:  COVID-19, coronavirus disease 2019; Comparative genomics; Genome sequence; HCV, hepatitis C virus; Microsatellite; RA, relative abundance; RD, relative density; SARS-CoV-2 virus; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; SSR, simple sequence repeats; Simple sequence repeat; SpliMNPV, Spodoptera littoralis multiple nucleopolyhedrovirus

Year:  2021        PMID: 33521382      PMCID: PMC7835092          DOI: 10.1016/j.genrep.2021.101020

Source DB:  PubMed          Journal:  Gene Rep        ISSN: 2452-0144


Introduction

Coronavirus disease 2019 (COVID-19) is an acute respiratory infectious disease caused by a novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus belongs to the genus Betacoronavirus in the subfamily Coronavirinae, family Coronaviridae, order Nidovirales (Saha et al., 2020; Weiss and Leibowitz, 2011). According to serotype and genomic characteristics, coronaviruses are divided into four major genera: alpha and beta, which primarily infect mammals, and gamma and delta, which predominantly infect birds (Tang et al., 2015). Coronaviruses are enveloped, unsegmented, positive-sense single-stranded RNA viruses with a genome length of 26 to 32 kilobases (Wang et al., 2020). The SARS-CoV-2 genome possesses 14 ORFs, which code for 27 proteins (Wu et al., 2020). In recent years, there have been three large-scale epidemic outbreaks of coronaviruses: SARS-CoV in 2003, MERS-CoV in 2012 and SARS-CoV-2 in 2019 (Khan et al., 2020; Zhou et al., 2020). COVID-19 was initially reported from China but rapidly spread all over the world (Guo et al., 2020). As of 30 November 2020, more than 63 million COVID-19 cases had been diagnosed worldwide, with more than 1.4 million deaths (https://www.worldometers.info/coronavirus/). SARS-CoV-2 has caused a state of alarm across the world due to its high infection rate and its mortality among elderly and immune-deficient individuals. Because knowledge of this novel virus is very limited, it has been transmitted at a high rate across all age groups and diverse demographic populations. Thus, the study of its genome sequence and comparative genomics has attracted much attention. Moreover, advancements in sequencing technologies and analysis tools have accelerated this process to an unprecedented speed. The first three novel coronavirus genomes (GISAID accession IDs: EPI_ISL_402119, EPI_ISL_402120 and EPI_ISL_402121) were sequenced from Wuhan (Wu et al., 2020).
Currently, over 94,000 SARS-CoV-2 genomes have been sequenced and deposited in public databases such as GenBank (Benson et al., 2000) and GISAID (Shu and McCauley, 2017). For understanding the molecular genetics, evolutionary genomics and other important features of these viruses, a reliable biomarker such as SSRs could be an excellent tool. Simple sequence repeats (SSRs) are short tandem repeat sequences found across the genomes of all organisms. SSRs are essentially sequences of varying lengths containing repeats of 1–6 nucleotides. Several characteristics are associated with SSR sequences: they are present ubiquitously in any genome (Li et al., 2004); their accumulation has been associated with variation in genome size (Gao and Qi, 2007); they can exist in both coding and non-coding sequences (Riley and Krieger, 2009); and they are highly variable and polymorphic in nature (Kim et al., 2008). SSRs are also associated with recombination hotspots and random integration, which may explain how pathogenic organisms use this variability to combat host immune responses (Zhao et al., 2012). One of the most extensive applications of SSRs is as genetic markers (Heesacker et al., 2008; Temnykh et al., 2001). Notable results have also been obtained using SSRs in genome mapping and in ecological and evolutionary biology. Although several independent studies have focused on SSRs in viral genomes, a distinct distribution pattern is yet to be established (Chen et al., 2011). Viral SSRs are capable of generating genomic diversity that in turn manifests as phenotypic changes (Li et al., 2004). Genome features including length and GC content largely influence their occurrence (Dieringer and Schlötterer, 2003; Kelkar et al., 2008).
Here, we have investigated the distribution, size and GC content variability among 121 SARS-CoV-2 genome sequences isolated from different countries and identified the prevalence of SSR markers.

Methods and materials

Genome sequence collection

Complete genome sequences of SARS-CoV-2 (121) were acquired from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). Sequences were collected from 31 countries (Table S1) and selected according to the date of data deposition ranging from early January 2020 to late June 2020. The sequence data were processed in FASTA format.
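As an illustration of the kind of FASTA processing described above, the following minimal Python sketch reads FASTA records and computes the G/C content of each sequence. It is not the authors' script: the function names `read_fasta` and `gc_content` and the toy record are assumptions introduced here for illustration only.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return G/C content as a percentage of the sequence length."""
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Toy record for demonstration only; this is not a real genome.
demo = StringIO(">toy_record\nATGCGC\nATAT\n")
records = list(read_fasta(demo))
for name, seq in records:
    print(name, len(seq), round(gc_content(seq), 2))
```

In practice the same functions could be applied to a file opened with `open(path)`, one record per downloaded genome.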

Simple sequence repeat identification

Two SSR identification tools were used in the study. First, the Simple Sequence Repeat Identification Tool (SSRIT, https://archive.gramene.org/db/markers/ssrtool) was used to detect perfect SSR motifs in the sequences in FASTA format. The minimum number of repeats was set to 5 for dimers; 3 for trimeric, tetrameric and pentameric repeats; and 2 for hexameric repeats. The resulting configuration is thus 5-3-3-3-2 for the minimum number of repeats. As SSRIT cannot detect monomeric repeats, we employed a second tool, IMEx-web: Imperfect Microsatellite Extraction Webserver (http://43.227.129.132:8008/IMEX/). IMEx advanced mode was used to identify the perfect microsatellites in the complete genomic sequences. The minimum repeat number was set to 10 for monomers; 5 for dimers; 3 for trimeric, tetrameric and pentameric repeats; and 2 for hexameric repeats. The resulting configuration was 10-5-3-3-3-2 for the minimum number of repeats.
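The thresholds above can be made concrete with a small regular-expression scan for perfect repeats. The sketch below is only an approximation written for illustration: it does not reproduce the internal algorithms of SSRIT or IMEx, and the names `MIN_REPEATS`, `is_reducible` and `find_perfect_ssrs`, as well as the toy sequence, are assumptions. Motifs that are themselves repeats of a shorter unit (for example "AAAAAA" as a hexamer) are skipped so that each tract is reported at its shortest motif length.

```python
import re

# Minimum repeat counts per motif length (mono- to hexanucleotide),
# following the 10-5-3-3-3-2 configuration described in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 3, 4: 3, 5: 3, 6: 2}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_n in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least (min_n - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_n - 1))
        pos = 0
        while pos < len(seq):
            m = pattern.match(seq, pos)
            if m and not is_reducible(m.group(1)):
                hits.append((m.start(), m.group(1), len(m.group(0)) // size))
                pos = m.end()  # continue after the reported tract
            else:
                pos += 1
    return sorted(hits)


# Toy sequence for demonstration only; it is not taken from any genome.
demo = "GG" + "CAA" * 4 + "TT" + "A" * 10 + "G" + "TC" * 5 + "G" + "ACGTAC" * 2
for start, motif, count in find_perfect_ssrs(demo):
    print(start, "(%s)%d" % (motif, count))
```

Passing a dictionary without the key `1`, such as `{2: 5, 3: 3, 4: 3, 5: 3, 6: 2}`, mimics the 5-3-3-3-2 configuration in which monomeric repeats are not reported.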

Calculation of relative density (RD) and relative abundance (RA)

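To assess the significance of SSRs in a genome, the relative abundance (RA) and relative density (RD) of the SSRs were calculated using the following equations:

$$\mathrm{RA} = \frac{\text{number of SSRs}}{\text{genome size (kb)}}, \qquad \mathrm{RD} = \frac{\text{total length of SSRs (bp)}}{\text{genome size (kb)}}$$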
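A worked example may help here. The sketch below uses the definitions given in the Results (RA as the number of repeats per kb and RD as the total repeat length per kb) and applies them to genome S1 (MT476385). The function names are assumptions, and the SSR count of 41 is inferred from the per-class RA values in Table 2 rather than stated directly in the text.

```python
def relative_abundance(n_ssrs, genome_bp):
    """Number of SSRs per kilobase of genome."""
    return n_ssrs / (genome_bp / 1000.0)


def relative_density(ssr_bp, genome_bp):
    """Total SSR length (bp) per kilobase of genome."""
    return ssr_bp / (genome_bp / 1000.0)


# S1 (MT476385): genome size 29,902 bp (Table 1), total SSR length 429 bp (Table 2).
# Assumption: 41 SSRs (1 mono + 2 di + 20 tri + 18 hexa), inferred from Table 2.
ra = relative_abundance(41, 29902)
rd = relative_density(429, 29902)
print(round(ra, 3), round(rd, 2))
```

The resulting RA matches the 1.371 listed for S1 in Table 2, and the RD agrees with the tabulated 14.34 to within 0.01.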

Programming

Python (IDLE 5.8.2) was used to manage and keep track of the data collected in CSV format. Subsequently, all analysis of the various repeats in individual sequences was carried out using Python. Lastly, the Matplotlib PyPlot module was used to generate the bar charts.

Statistical analysis

Correlations of total relative abundance and relative density with genome size were computed using Microsoft Excel 2016.
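The study reports that the correlations were computed in Microsoft Excel. For readers who prefer a scripted check, an equivalent Pearson correlation can be sketched in plain Python as below. The function name `pearson_r` is an assumption, and the five data points are simply the S1 to S5 genome sizes (Table 1) and total IMEx RA values (Table 2), used only to show the call; they are not expected to reproduce the coefficients reported for all 121 genomes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Illustration only: genome sizes (bp) and total IMEx RA for S1-S5.
sizes = [29902, 29832, 29903, 29816, 29758]
total_ra = [1.371, 1.307, 1.371, 1.342, 1.378]
r = pearson_r(sizes, total_ra)
print("r = %.3f, R2 = %.3f" % (r, r * r))
```

For a simple two-variable linear fit, the coefficient of determination R2 is the square of r, which is why both values can be printed from the same call.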

Results

Collection and distribution of SARS-CoV-2 genome sequences

We analyzed the presence of perfect SSRs over 6 bp long in a pool of 121 completely sequenced SARS-CoV-2 genomes with an average size of 29,855 bases (range 29,574 to 29,945 bases). The sequences were sampled from 31 countries across 6 continents (Table 1). To ensure diversity, a maximum of 12 sequences was taken from China, while a single genome sequence was taken from each of Nepal, Turkey, Sweden, Peru, Uruguay, and South Africa. The accession number, size, attributed region and GC content of each genome sequence are summarized in Table 1.
Table 1

List of analyzed completed SARS-CoV-2 genomes along with their attributed regions, genome size and G/C content.

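No | Accession | Size (bp) | Country | G/C content (%) | No | Accession | Size (bp) | Country | G/C content (%) | No | Accession | Size (bp) | Country | G/C content (%)
S1 | MT476385 | 29,902 | BGD | 37.96 | S41 | MT499208 | 29,873 | POL | 37.99 | S81 | MT121215 | 29,945 | CHN | 37.91
S2 | MT635672 | 29,832 | BGD | 37.99 | S42 | MT499209 | 29,903 | POL | 37.95 | S82 | MN938384 | 29,838 | CHN | 38.02
S3 | MT607246 | 29,903 | BGD | 37.95 | S43 | MT499210 | 29,899 | POL | 37.94 | S83 | MT259229 | 29,864 | CHN | 38.01
S4 | MT577359 | 29,816 | BGD | 38.01 | S44 | MT450872 | 29,782 | SRB | 38.01 | S84 | MT259230 | 29,866 | CHN | 38.01
S5 | MT539160 | 29,758 | BGD | 38.01 | S45 | MT459979 | 29,782 | SRB | 38.01 | S85 | MT446312 | 29,879 | CHN | 37.99
S6 | MT502774 | 29,859 | BGD | 38.01 | S46 | MT324062 | 29,903 | ZAF | 37.96 | S86 | MT123290 | 29,891 | CHN | 38.00
S7 | MT126808 | 29,876 | BRA | 38.00 | S47 | MT304475 | 29,882 | KOR | 37.98 | S87 | MT281577 | 29,903 | CHN | 37.97
S8 | MT350282 | 29,903 | BRA | 37.96 | S48 | MT304474 | 29,882 | KOR | 37.98 | S88 | MT470176 | 29,903 | FRA | 37.96
S9 | MT256924 | 29,782 | COL | 38.01 | S49 | MT039890 | 29,903 | KOR | 37.96 | S89 | MT470177 | 29,903 | FRA | 37.97
S10 | MT470219 | 29,903 | COL | 37.96 | S50 | MT292571 | 29,782 | ESP | 38.01 | S90 | MT470178 | 29,903 | FRA | 37.96
S11 | MT371568 | 29,740 | CZE | 37.87 | S51 | MT292574 | 29,782 | ESP | 38.00 | S91 | MT470179 | 29,903 | FRA | 37.96
S12 | MT371572 | 29,756 | CZE | 38.00 | S52 | MT292569 | 29,782 | ESP | 38.02 | S92 | MT320538 | 29,882 | FRA | 37.99
S13 | MT371573 | 29,756 | CZE | 38.00 | S53 | MT359865 | 29,890 | ESP | 37.98 | S93 | MT459847 | 29,812 | GRC | 38.01
S14 | MT358641 | 29,903 | DEU | 37.97 | S54 | MT371047 | 29,903 | LKA | 37.96 | S94 | MT459924 | 29,818 | GRC | 38.01
S15 | MT318827 | 29,870 | DEU | 38.00 | S55 | MT371048 | 29,903 | LKA | 37.96 | S95 | MT459899 | 29,818 | GRC | 38.00
S16 | MT358642 | 29,903 | DEU | 37.96 | S56 | MT371050 | 29,903 | LKA | 37.97 | S96 | MT459897 | 29,818 | GRC | 38.01
S17 | MT358638 | 29,903 | DEU | 37.97 | S57 | MT093571 | 29,886 | SWE | 38.00 | S97 | MT459867 | 29,818 | GRC | 38.01
S18 | MT459985 | 29,903 | GUM | 37.95 | S58 | MT374114 | 29,901 | TWN | 37.96 | S98 | MT459862 | 29,812 | GRC | 38.01
S19 | MT459986 | 29,903 | GUM | 37.96 | S59 | MT374102 | 29,901 | TWN | 37.97 | S99 | MT270814 | 29,764 | HKG | 38.02
S20 | MT459987 | 29,890 | GUM | 37.96 | S60 | MT370516 | 29,900 | TWN | 37.97 | S100 | MT215195 | 29,764 | HKG | 38.03
S21 | MT320891 | 29,822 | IRN | 38.00 | S61 | MT066176 | 29,870 | TWN | 38.01 | S101 | MT365031 | 29,891 | HKG | 37.99
S22 | MT447177 | 29,793 | IRN | 38.01 | S62 | MT066175 | 29,870 | TWN | 38.01 | S102 | MT365030 | 29,891 | HKG | 37.99
S23 | MT276597 | 29,851 | ISR | 38.02 | S63 | MT447155 | 29,805 | THA | 38.02 | S103 | MT114412 | 29,889 | HKG | 37.99
S24 | MT276598 | 29,870 | ISR | 38.00 | S64 | MT447159 | 29,834 | THA | 38.01 | S104 | MT230904 | 29,891 | HKG | 37.98
S25 | MT077125 | 29,785 | ITA | 38.02 | S65 | MT447165 | 29,671 | THA | 37.97 | S105 | MT415321 | 29,903 | IND | 37.97
S26 | MT066156 | 29,867 | ITA | 38.01 | S66 | MT447176 | 29,840 | THA | 37.99 | S106 | MT415320 | 29,901 | IND | 37.97
S27 | MT428551 | 29,900 | KAZ | 37.96 | S67 | MT327745 | 29,832 | TUR | 38.01 | S107 | MT477885 | 29,899 | IND | 37.96
S28 | MT428552 | 29,903 | KAZ | 37.97 | S68 | MT466071 | 29,903 | URY | 37.97 | S108 | MT012098 | 29,854 | IND | 38.02
S29 | MT428553 | 29,903 | KAZ | 37.96 | S69 | MT192772 | 29,891 | VNM | 37.98 | S109 | MT050493 | 29,851 | IND | 38.01
S30 | MT372482 | 29,865 | MYS | 37.64 | S70 | MT192773 | 29,890 | VNM | 37.98 | S110 | MT467260 | 29,800 | IND | 38.01
S31 | MT372481 | 29,898 | MYS | 37.94 | S71 | MT007544 | 29,893 | AUS | 37.97 | S111 | MT467253 | 29,800 | IND | 37.99
S32 | MT372480 | 29,868 | MYS | 37.94 | S72 | MT450935 | 29,805 | AUS | 38.02 | S112 | LC542976 | 29,903 | JPN | 37.97
S33 | MT072688 | 29,811 | NPL | 38.02 | S73 | MT450932 | 29,802 | AUS | 38.02 | S113 | LC529905 | 29,903 | JPN | 37.97
S34 | MT396266 | 29,880 | NLD | 37.98 | S74 | MT451783 | 29,802 | AUS | 37.73 | S114 | LC542809 | 29,903 | JPN | 37.96
S35 | MT457399 | 29,876 | NLD | 37.99 | S75 | MT451755 | 29,812 | AUS | 37.94 | S115 | MT444626 | 29,840 | USA | 37.94
S36 | MT457396 | 29,877 | NLD | 38.00 | S76 | LR757998 | 29,866 | CHN | 37.99 | S116 | MT380730 | 29,882 | USA | 37.98
S37 | MT240479 | 29,836 | PAK | 37.99 | S77 | LR757996 | 29,868 | CHN | 38.00 | S117 | MT380731 | 29,882 | USA | 37.99
S38 | MT262993 | 29,836 | PAK | 38.02 | S78 | MT253710 | 29,781 | CHN | 38.02 | S118 | MT159712 | 29,882 | USA | 37.99
S39 | MT500122 | 29,819 | PAK | 38.02 | S79 | MT253700 | 29,781 | CHN | 38.02 | S119 | MT159717 | 29,882 | USA | 37.99
S40 | MT263074 | 29,856 | PER | 38.01 | S80 | MT049951 | 29,903 | CHN | 37.97 | S120 | MN985325 | 29,882 | USA | 38.00
S121 | MT326173 | 29,574 | USA | 37.95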

Country tri-letter code legend in Supplementary Table 3.

Incident frequency of SSRs

The incident frequency of SSRs in the 121 genomes varied only negligibly (Fig. 1), regardless of region or SSR search tool. No tetrameric or pentameric repeats were observed in any of the sequences. IMEx and SSRIT provided almost identical data for di-, tri-, and hexameric repeats, with few exceptions. The total number of SSRs per sequence ranged from 38 to 42 with IMEx, which detects monomeric repeats, and from 38 to 41 with SSRIT, which does not. Thus, the totals from the two tools differed mainly in sequences with monomeric repeats detected by IMEx. Sequences without monomeric repeats, such as S2 (MT635672), show an equal number of repeats from IMEx and SSRIT (Fig. 1). The average number of trimeric repeats was ~20 (19.95041322), with a highest value of 20 and a lowest of 18. The average number of hexameric repeats was ~18 (17.97520661), with a highest value of 18 and a lowest of 16. Almost all sequences contained 2 dimeric, 20 trimeric and 18 hexameric repeats. The exceptions were four sequences with fewer trimeric repeats, MT635672 (S2), MT502774 (S6), MT372482 (S30) and MT451783 (S74), and three sequences with fewer hexameric repeats, MT372482 (S30), MT039890 (S49) and MT447176 (S66) (Fig. 2, Fig. 3).
Fig. 1

Comparison of the total number of SSRs identified using the IMEx and SSRIT tools. SSRIT cannot detect monomeric repeats in the analyzed genomes, whereas IMEx can; this accounts for the variation in the total number of identified SSR motifs shown in the figure.

Fig. 2

Analysis of SSRs found in IMEx tool. (A) Analysis of total SSR per genome and (B) relative density and abundance of the identified SSR repeats present.

Fig. 3

Analysis of SSRs found in SSRIT tool. (A) Analysis of total SSR per genome and (B) relative density and abundance of the identified SSR repeats present.

Calculation of RA and RD

Relative abundance (RA) and relative density (RD) of SSRs were calculated as the number of repeats per kilobase (kb) and the total length of repeats (bp) per kb, respectively (Fig. 2, Fig. 3). RA was calculated for each type of repeat (monomeric, dimeric, trimeric and hexameric, denoted RA1, RA2, RA3 and RA6) as well as for the total number of repeats in a sequence (Table 2, Table 3). The SSRs identified by both IMEx and SSRIT showed little variation among the 121 genome sequences. Similarly, RD was calculated as the total length of repeats divided by the genome size in kb for all repeats detected by both tools. RD values varied more for the IMEx-derived SSRs because of the inconsistent presence of monomeric repeats (Fig. 2B). The highest values of total RA and RD from the IMEx tool were 1.42 and 14.89, while the lowest were 1.27 and 13.29, respectively (Table 2). Likewise, the highest values of total RA and RD for the SSRIT tool were 1.37 and 14.36, while the lowest were 1.27 and 13.45, respectively (Table 3).
Table 2

Relative abundance (RA) and relative density (RD) in SARS-CoV-2 genome sequence calculated for IMEx derived SSR repeats.

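No | RA1 | RA2 | RA3 | RA6 | RA (total) | Length (bp) | RD | No | RA1 | RA2 | RA3 | RA6 | RA (total) | Length (bp) | RD
S1 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S61 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S2 | 0.000 | 0.067 | 0.637 | 0.603 | 1.307 | 410 | 13.74 | S62 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S3 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S63 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S4 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05 | S64 | 0.000 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04
S5 | 0.000 | 0.067 | 0.672 | 0.638 | 1.378 | 431 | 14.48 | S65 | 0.000 | 0.067 | 0.674 | 0.607 | 1.281 | 419 | 14.12
S6 | 0.000 | 0.067 | 0.636 | 0.603 | 1.306 | 410 | 13.73 | S66 | 0.000 | 0.067 | 0.670 | 0.570 | 1.307 | 407 | 13.63
S7 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35 | S67 | 0.034 | 0.067 | 0.670 | 0.603 | 1.374 | 429 | 14.38
S8 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S68 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S9 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S69 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S10 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S70 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S11 | 0.000 | 0.067 | 0.672 | 0.605 | 1.345 | 419 | 14.08 | S71 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S12 | 0.000 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.08 | S72 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S13 | 0.000 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.08 | S73 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S14 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S74 | 0.000 | 0.067 | 0.604 | 0.604 | 1.275 | 401 | 13.45
S15 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S75 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S16 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S76 | 0.033 | 0.067 | 0.670 | 0.603 | 1.373 | 429 | 14.36
S17 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S77 | 0.033 | 0.067 | 0.670 | 0.603 | 1.373 | 429 | 14.36
S18 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S78 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06
S19 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S79 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06
S20 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35 | S80 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S21 | 0.034 | 0.067 | 0.671 | 0.604 | 1.375 | 429 | 14.38 | S81 | 0.033 | 0.067 | 0.668 | 0.601 | 1.369 | 429 | 14.32
S22 | 0.000 | 0.067 | 0.671 | 0.604 | 1.343 | 419 | 14.06 | S82 | 0.000 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04
S23 | 0.033 | 0.067 | 0.670 | 0.603 | 1.373 | 429 | 14.37 | S83 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.03
S24 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S84 | 0.033 | 0.067 | 0.670 | 0.603 | 1.373 | 429 | 14.36
S25 | 0.034 | 0.067 | 0.671 | 0.604 | 1.377 | 429 | 14.40 | S85 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S26 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S86 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S27 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S87 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68
S28 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S88 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S29 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S89 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S30 | 0.033 | 0.100 | 0.603 | 0.536 | 1.373 | 397 | 13.29 | S90 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S31 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S91 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S32 | 0.068 | 0.068 | 0.678 | 0.611 | 1.425 | 439 | 14.89 | S92 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S33 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05 | S93 | 0.034 | 0.067 | 0.671 | 0.604 | 1.375 | 429 | 14.39
S34 | 0.000 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S94 | 0.034 | 0.067 | 0.671 | 0.604 | 1.375 | 429 | 14.38
S35 | 0.000 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S95 | 0.034 | 0.067 | 0.671 | 0.604 | 1.375 | 429 | 14.38
S36 | 0.000 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S96 | 0.000 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S37 | 0.034 | 0.067 | 0.670 | 0.603 | 1.374 | 429 | 14.37 | S97 | 0.000 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S38 | 0.000 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04 | S98 | 0.034 | 0.067 | 0.671 | 0.604 | 1.375 | 429 | 14.39
S39 | 0.000 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05 | S99 | 0.000 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.07
S40 | 0.000 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03 | S100 | 0.000 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.07
S41 | 0.000 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S101 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S42 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S102 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S43 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S103 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S44 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S104 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S45 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S105 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S46 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S106 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S47 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35 | S107 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68
S48 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35 | S108 | 0.000 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03
S49 | 0.033 | 0.067 | 0.669 | 0.569 | 1.338 | 417 | 13.94 | S109 | 0.000 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03
S50 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S110 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.06
S51 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S111 | 0.000 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.06
S52 | 0.000 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S112 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S53 | 0.000 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S113 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34
S54 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S114 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68
S55 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S115 | 0.000 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.04
S56 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S116 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S57 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35 | S117 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S58 | 0.033 | 0.067 | 0.669 | 0.602 | 1.371 | 429 | 14.34 | S118 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S59 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S119 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S60 | 0.067 | 0.067 | 0.669 | 0.602 | 1.405 | 439 | 14.68 | S120 | 0.033 | 0.067 | 0.669 | 0.602 | 1.372 | 429 | 14.35
S121 | 0.000 | 0.068 | 0.676 | 0.609 | 1.353 | 419 | 14.16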

Relative abundance (RA), RA1 = Monomeric repeats, RA2 = Dimeric repeats, RA3 = Trimeric repeats, RA6 = Hexameric repeats, RD = Relative Density.

Table 3

Relative abundance (RA) and relative density (RD) in SARS-CoV-2 genome sequence calculated for SSRIT derived SSR repeats.

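No. | RA2 | RA3 | RA6 | RA (total) | Length (bp) | RD | No. | RA2 | RA3 | RA6 | RA (total) | Length (bp) | RD
S1 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S61 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S2 | 0.067 | 0.637 | 0.603 | 1.307 | 410 | 13.74 | S62 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S3 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S63 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S4 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05 | S64 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04
S5 | 0.067 | 0.672 | 0.638 | 1.378 | 431 | 14.48 | S65 | 0.067 | 0.607 | 0.607 | 1.281 | 401 | 13.51
S6 | 0.067 | 0.636 | 0.603 | 1.306 | 410 | 13.73 | S66 | 0.067 | 0.670 | 0.570 | 1.307 | 407 | 13.63
S7 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S67 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04
S8 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S68 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S9 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S69 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S10 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S70 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S11 | 0.067 | 0.672 | 0.605 | 1.345 | 419 | 14.08 | S71 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S12 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.08 | S72 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S13 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.08 | S73 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S14 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S74 | 0.067 | 0.604 | 0.604 | 1.275 | 401 | 13.45
S15 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S75 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S16 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S76 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S17 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S77 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S18 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S78 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06
S19 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S79 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06
S20 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S80 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S21 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05 | S81 | 0.067 | 0.668 | 0.601 | 1.336 | 419 | 13.99
S22 | 0.067 | 0.671 | 0.604 | 1.343 | 419 | 14.06 | S82 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04
S23 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03 | S83 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.03
S24 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S84 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02
S25 | 0.067 | 0.671 | 0.604 | 1.343 | 419 | 14.06 | S85 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S26 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S86 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S27 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S87 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S28 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S88 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S29 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S89 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S30 | 0.100 | 0.670 | 0.569 | 1.339 | 417 | 13.96 | S90 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S31 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S91 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S32 | 0.068 | 0.678 | 0.611 | 1.357 | 419 | 14.21 | S92 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S33 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05 | S93 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S34 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S94 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S35 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S95 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S36 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S96 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S37 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04 | S97 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05
S38 | 0.067 | 0.670 | 0.603 | 1.341 | 419 | 14.04 | S98 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.05
S39 | 0.067 | 0.671 | 0.604 | 1.341 | 419 | 14.05 | S99 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.07
S40 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03 | S100 | 0.067 | 0.672 | 0.605 | 1.344 | 419 | 14.07
S41 | 0.067 | 0.670 | 0.603 | 1.339 | 419 | 14.02 | S101 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S42 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S102 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S43 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S103 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S44 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S104 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S45 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S105 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S46 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S106 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S47 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S107 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S48 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02 | S108 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03
S49 | 0.067 | 0.669 | 0.569 | 1.304 | 407 | 13.61 | S109 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.03
S50 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S110 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.06
S51 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S111 | 0.067 | 0.671 | 0.604 | 1.342 | 419 | 14.06
S52 | 0.067 | 0.672 | 0.604 | 1.343 | 419 | 14.06 | S112 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S53 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S113 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S54 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S114 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01
S55 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S115 | 0.067 | 0.670 | 0.603 | 1.340 | 419 | 14.04
S56 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S116 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S57 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S117 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S58 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S118 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S59 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S119 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S60 | 0.067 | 0.669 | 0.602 | 1.338 | 419 | 14.01 | S120 | 0.067 | 0.669 | 0.602 | 1.339 | 419 | 14.02
S121 | 0.068 | 0.676 | 0.609 | 1.353 | 419 | 14.16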

Relative abundance (RA), RA1 = Monomeric repeats, RA2 = Dimeric repeats, RA3 = Trimeric repeats, RA6 = Hexameric repeats, RD = Relative Density.

Motifs types in analyzed genomes

IMEx analysis showed that 50 sequences contain no monomeric repeat, 59 contain exactly one, and the remaining 12 contain two. Of the 59 sequences with a single monomeric repeat, 45 contained (A)n and 14 contained (T)n (Table S2). The 12 sequences with two monomeric repeats carried both (A)n and (T)n (Fig. 4). Every sequence except S30 (MT372482) contained exactly two dimeric repeats, with the predominant motifs (TC)n and (GT)n; a third dimeric repeat, (AT)n, was found only in S30 (MT372482). Among the trimeric repeats, motifs (TTC)n and (CTT)n occurred twice in all analyzed sequences (Fig. 4). Occurrences are counted across all sequences: for instance, a motif present twice in each sequence has a total occurrence of 242 (121 sequences × 2). Motifs (AAG)n and (GAA)n also occurred twice in every sequence except S6 (MT502774) and S2 (MT635672), which had (AAG)n once, and S30 (MT372482), which had both (AAG)n and (GAA)n once. Motifs (AGT)n and (CTG)n were present once in each sequence except S74 (MT451783). Motif (CAA)n was the only trimer repeated four times in a cluster in every sequence, while the other trimeric motifs were repeated three times. Nineteen different hexameric motifs were identified in the analyzed sequences. Among them, (TAGTCA)n and (TACTTG)n were absent in S30 (MT372482), and (GTTTTCT)n and (GGCTTT)n were missing in S49 (MT039890) and S66 (MT447176). Exceptionally, the (AATAGG)n motif was found in only one sequence, S5 (MT539160). All other hexameric repeats were found exactly once in every sequence. These SSR markers were distributed in the ORF1ab, S, ORF3ab, ORF7a, and N regions of the SARS-CoV-2 genome (Fig. 5). A maximum of 24 motifs were present in the ORF1ab region, followed by 5 motifs each in the S, ORF3ab, and N regions, and only one motif in the ORF7a region.
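The counting convention described above (total occurrence of a motif summed over all genomes) can be sketched with a simple tally. The function name `total_motif_occurrences` and the toy input are hypothetical and serve only to reproduce the 121 × 2 = 242 arithmetic; real input would be the per-genome motif lists exported from the SSR tools.

```python
from collections import Counter


def total_motif_occurrences(per_genome_motifs):
    """Sum motif occurrences over all genomes.

    per_genome_motifs: iterable of lists, one list of motif strings per
    genome; a motif listed twice in a genome is counted twice.
    """
    totals = Counter()
    for motifs in per_genome_motifs:
        totals.update(motifs)
    return totals


# Hypothetical toy input: 121 genomes, each carrying (TTC)n twice and (CAA)n once.
genomes = [["TTC", "TTC", "CAA"] for _ in range(121)]
totals = total_motif_occurrences(genomes)
print(totals["TTC"], totals["CAA"])  # 242 121
```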
Fig. 4

The differential occurrence of individual SSR motifs. The figure shows the occurrence of each unique mono-, di-, tri- and hexanucleotide motif in the 121 analyzed SARS-CoV-2 genomes. The TTC, GAA, AAG and CTT trinucleotide repeats are present twice per genome, while the remaining repeats are present only once per genome.

Fig. 5

Distribution of the identified SSR motifs across the genome of SARS-CoV-2. The figure shows the occurrence of different SSR motifs in the ORF1ab, S, ORF3ab, ORF7, and N regions of SARS-CoV-2 genomes. The number of repeats of each motif is also shown.

Correlation studies

The correlation of genome size and G/C content with the relative abundance (RA) and relative density (RD) of SSRs was determined. For the SSRs detected by IMEx, genome size correlated positively with both total RA (r = 0.52, R2 = 0.271, P < 0.05) and RD (r = 0.419, R2 = 0.176, P < 0.05), whereas the coefficients for G/C content were −0.102 (R2 = 0.010, P > 0.1) and 0.147 (R2 = 0.022, P > 0.1) for RA and RD, respectively. Surprisingly, total RA and RD obtained from the SSRIT tool correlated negatively with genome size, with coefficients of −0.0595 (R2 = 0.003, P > 0.1) and −0.107 (R2 = 0.011, P > 0.1), respectively. Further analysis suggested that the SSRIT-derived RA and RD are both positively correlated with G/C content, with coefficients of 0.310 (R2 = 0.096, P < 0.05) and 0.331 (R2 = 0.110, P < 0.05), respectively. Since the genome sizes of the analyzed viruses are very similar to one another, a significant correlation was not expected.

Discussion

Owing to advances in next-generation DNA sequencing technologies, microbial genomes can be sequenced in an increasingly efficient, fast, and cheap manner, with multiple copies at a time (Alam et al., 2014a; Atia et al., 2016). As a result, over 114,000 SARS-CoV-2 genomic sequences became available in public databases within only a few months (https://www.gisaid.org/). This accumulation of a huge dataset has allowed us to unravel the genomic complexities and genetic distribution/variation present in SARS-CoV-2 genomes isolated from across the globe. These genome sequences represent a potentially valuable resource for mining SSR markers of both clinical and evolutionary significance. The presence and variation of SSRs across genomes of the same species have been studied extensively in different viruses, including Spodoptera littoralis multiple nucleopolyhedrovirus (SpliMNPV) (Atia et al., 2016), potexvirus (Alam et al., 2014a), Human Immunodeficiency Virus (Chen et al., 2009), Mycobacteriophage (Alam et al., 2019) and Hepatitis C (Chen et al., 2011), to identify the correlation between the diversity, incidence and complexity of repeats, genome size and host range (Zhao et al., 2012). In the present study, we explored 121 SARS-CoV-2 genomes from 31 countries covering 6 continents for the identification, abundance, and composition of SSRs and observed a total of 38–42 repeats per genome. The SSR incidence in the SARS-CoV-2 genome is similar to that of potyviruses (23–45 SSRs) (Zhao et al., 2011) and Human Immunodeficiency Virus isolates (22–48 SSRs) (Chen et al., 2009); higher than that of tobamovirus (11–36 SSRs) (Alam et al., 2014a), potexvirus (11–30 SSRs) (Alam et al., 2014a) and geminivirus (4–19 SSRs) (George et al., 2012); and lower than that of Spodoptera littoralis multiple nucleopolyhedrovirus, with 55 repeats (Atia et al., 2016).
Although genome size and host are important factors in determining the occurrence of SSRs (Zhao et al., 2012), SSR incident frequency varies widely across all of these studied genomes. We calibrated our identification tools so that tandem repeat sequences below 6 bp and above 15 bp are not counted. The minimum number of repeats for each type follows the 10-5-3-3-3-2 configuration for mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats. We identified a strikingly similar pattern in all 121 genomes, which might be due to the high level of sequence conservation in SARS-CoV-2. Independent studies on vertebrate and plant genomes have provided a basis for categorizing the most common SSR motifs. The most common SSR motif in animals and invertebrates is (GT)n (Stallings et al., 1991), whereas in plants it is (AT)n (Lagercrantz et al., 1993), and in insects the most common motif is thought to be (CT)n (Paxton et al., 1996). Dinucleotide repeats AT/TA and AG/GA were found to be the two most prominent forms in Closteroviridae, the largest RNA virus family (George et al., 2016). Following a similar trend, SSR analysis of viral genomes revealed the most common motif to be (AT)n (Zhao et al., 2012). SARS-CoV-2 deviates from this trend: its most common repeats are the trimeric (TTC)n and (CTT)n motifs, which were present multiple times in all of the analyzed genomes. In the case of the SARS-CoV-2 genome, the hexameric motif was the most abundant type of repeat (49%), followed by trinucleotide repeats (42%), while the other two types, mono- and dimeric repeats, were present at 4% (Table S3); tetra- and pentanucleotide repeats were non-existent. In partial agreement with our results, trinucleotide SSRs were found to be the most frequent type in SpliMNPV and Human Immunodeficiency Virus Type 1 (HIV-1).
However, the genome of hepatitis C virus (HCV) possessed predominantly mono-, di-, and trinucleotide repeats, with other types rarely present (Chen et al., 2011). In contrast, mononucleotide repeats were the most abundant form in the genomes of 30 alphaviruses (Alam et al., 2014b), Herpes Simplex Virus Type 1 (Deback et al., 2009), and different ssDNA viruses (Jain et al., 2014), followed by di- and trinucleotide repeats. Although tetra- and pentanucleotide microsatellites are rare in diverse Geminivirus (George et al., 2012) and HCV (Chen et al., 2011) genomes, SARS-CoV-2 genomes showed a complete absence of these motifs (Fig. 4). The level of repetitiveness and the incidence of SSR sequences have been readily correlated with genome size and G/C content (Zhao et al., 2012). Several reports have established a positive correlation between SSR content and genome size in fungal (Karaoglu et al., 2005) and plant genomes (Morgante et al., 2002). A weak influence of genome size and GC content on the number, relative abundance, and relative density of microsatellites has been established in various analyzed HCV genomes (Chen et al., 2011). Our findings suggest that relative abundance and relative density are positively correlated with genome size and that the correlation is statistically significant. In contrast, the correlation with G/C content is positive but not statistically significant. Regarding the distribution patterns of SSRs in SARS-CoV-2, it can be concluded that there is no significant pattern in the distribution of SSRs in viral genomes. It can also be said that the number of SSRs present in a genome cannot be considered proportional to genome size, as the sequences used in this study were broadly similar in size (Table 1). A similar study conducted on diverse HIV-1 genomes revealed no directly proportional relationship between genome size and total SSR content (Chen et al., 2009).
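The quantities behind these correlations can be sketched as follows. This section does not define relative abundance (RA) or relative density (RD), so the sketch assumes the definitions conventional in the SSR literature: RA as the number of SSRs per kb of genome, and RD as the total SSR length in bp per kb of genome. The function names and all numbers in the example are hypothetical and are not data from this study. Assessing statistical significance would additionally require a test on the coefficient, which is omitted here.

```python
import math


def relative_abundance(n_ssrs, genome_len_bp):
    """SSRs per kb of genome (assumed conventional definition of RA)."""
    return n_ssrs / (genome_len_bp / 1000)


def relative_density(total_ssr_bp, genome_len_bp):
    """bp covered by SSRs per kb of genome (assumed definition of RD)."""
    return total_ssr_bp / (genome_len_bp / 1000)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical genome lengths (bp) and SSR counts, for illustration only.
genome_lengths = [29800, 29850, 29900, 29880]
ssr_counts = [38, 40, 42, 41]

ra_values = [relative_abundance(c, g) for c, g in zip(ssr_counts, genome_lengths)]
print(pearson_r(genome_lengths, ra_values))
```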
We conducted this study in the hope of documenting and establishing the SSR patterns present in SARS-CoV-2, as well as the particular motifs that are present in the genome. Further studies could aim to detail the presence of these repeat motifs in coding and non-coding regions of the genome in order to predict regions prone to mutation.

Conclusion

Our findings should help to advance knowledge of the functional, physiological, and evolutionary significance of the various SSRs. Repetitive sequences are considered hot spots for recombination, which might play a significant role in the ability of SARS-CoV-2 to adapt rapidly to different environments and to genetic variation among hosts. Genome-wide extraction of microsatellites across 121 SARS-CoV-2 genomes revealed the presence of 38–42 SSRs per genome. Although the positions of these SSRs within the coding regions of the genome are yet to be fully characterized, they could help in assigning the functional variations of this virus observed in different regions.

Role of funding sources

There was no funding received to carry out this work.

CRediT authorship contribution statement

R.S. and A.G. performed all the computational work, wrote the main manuscript, and prepared the tables and figures. A.G. conceptualized the idea, supervised the entire study, and was involved in the analysis and interpretation of the data. All authors reviewed and approved the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial or personal conflicts.
  34 in total

1.  Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes.

Authors:  Michele Morgante; Michael Hanafey; Wayne Powell
Journal:  Nat Genet       Date:  2002-01-22       Impact factor: 38.330

2.  Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species.

Authors:  Daniel Dieringer; Christian Schlötterer
Journal:  Genome Res       Date:  2003-10       Impact factor: 9.043

3.  Differential distribution and occurrence of simple sequence repeats in diverse geminivirus genomes.

Authors:  B George; Ch Mashhood Alam; S K Jain; Ch Sharfuddin; S Chakraborty
Journal:  Virus Genes       Date:  2012-08-18       Impact factor: 2.332

Review 4.  Coronavirus pathogenesis.

Authors:  Susan R Weiss; Julian L Leibowitz
Journal:  Adv Virus Res       Date:  2011       Impact factor: 9.937

5.  In- silico exploration of thirty alphavirus genomes for analysis of the simple sequence repeats.

Authors:  Chaudhary Mashhood Alam; Avadhesh Kumar Singh; Choudhary Sharfuddin; Safdar Ali
Journal:  Meta Gene       Date:  2014-10-06

6.  A pneumonia outbreak associated with a new coronavirus of probable bat origin.

Authors:  Peng Zhou; Xing-Lou Yang; Xian-Guang Wang; Ben Hu; Lei Zhang; Wei Zhang; Hao-Rui Si; Yan Zhu; Bei Li; Chao-Lin Huang; Hui-Dong Chen; Jing Chen; Yun Luo; Hua Guo; Ren-Di Jiang; Mei-Qin Liu; Ying Chen; Xu-Rui Shen; Xi Wang; Xiao-Shuang Zheng; Kai Zhao; Quan-Jiao Chen; Fei Deng; Lin-Lin Liu; Bing Yan; Fa-Xian Zhan; Yan-Yi Wang; Geng-Fu Xiao; Zheng-Li Shi
Journal:  Nature       Date:  2020-02-03       Impact factor: 69.504

Review 7.  The origin, transmission and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak - an update on the status.

Authors:  Yan-Rong Guo; Qing-Dong Cao; Zhong-Si Hong; Yuan-Yang Tan; Shou-Deng Chen; Hong-Jun Jin; Kai-Sen Tan; De-Yun Wang; Yan Yan
Journal:  Mil Med Res       Date:  2020-03-13

8.  Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference.

Authors:  Tae-Sung Kim; James G Booth; Hugh G Gauch; Qi Sun; Jongsun Park; Yong-Hwan Lee; Kwangwon Lee
Journal:  BMC Genomics       Date:  2008-01-23       Impact factor: 3.969

9.  Inferring the hosts of coronavirus using dual statistical models based on nucleotide composition.

Authors:  Qin Tang; Yulong Song; Mijuan Shi; Yingyin Cheng; Wanting Zhang; Xiao-Qin Xia
Journal:  Sci Rep       Date:  2015-11-26       Impact factor: 4.379

Review 10.  The genetic sequence, origin, and diagnosis of SARS-CoV-2.

Authors:  Huihui Wang; Xuemei Li; Tao Li; Shubing Zhang; Lianzi Wang; Xian Wu; Jiaqing Liu
Journal:  Eur J Clin Microbiol Infect Dis       Date:  2020-04-24       Impact factor: 3.267
