Literature DB >> 34391794

Genome-wide covariation in SARS-CoV-2.

Evan Cresswell-Clay1, Vipul Periwal2.   

Abstract

The SARS-CoV-2 virus causing the global pandemic is a coronavirus with a genome of about 30Kbase length. The design of vaccines and choice of therapies depends on the structure and mutational stability of encoded proteins in the open reading frames (ORFs) of this genome. In this study, we computed, using Expectation Reflection, the genome-wide covariation of the SARS-CoV-2 genome based on an alignment of ≈130000 SARS-CoV-2 complete genome sequences obtained from GISAID. We used this covariation to compute the Direct Information between pairs of positions across the whole genome, investigating potentially important relationships within the genome, both within each encoded protein and between encoded proteins. We then computed the covariation within each clade of the virus. The covariation detected recapitulates all clade determinants and each clade exhibits distinct covarying pairs.
Copyright © 2021. Published by Elsevier Inc.

Entities:  

Keywords:  Analysis; Genome; SARS-coV-2

Mesh:

Year:  2021        PMID: 34391794      PMCID: PMC8360999          DOI: 10.1016/j.mbs.2021.108678

Source DB:  PubMed          Journal:  Math Biosci        ISSN: 0025-5564            Impact factor:   2.144


Introduction

Severe Acute Respiratory Syndrome (SARS) is a viral respiratory disease caused by the SARS-associated coronavirus. In December 2019, this pneumonia-like disease re-emerged in the Chinese city of Wuhan and the novel beta-coronavirus 2 (SARS-CoV-2) was identified as the causative agent [1]. The genome was first characterized by Wu et al. in December 2019 [2]. Since then the SARS-CoV-2 virus has spread relentlessly all over the world and been declared a worldwide pandemic with 79 million cases leading to 1.7 million deaths to date [3], [4]. SARS-CoV-2 is an +ssRNA virus belonging to the coronaviridae family major genera Betacoronavirus [5]. The viral genome encodes several open reading frames (ORFs): ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF10. These ORFs encode for several non-structural proteins (NSPs) while there are specific regions encoding the spike glycoprotein (S), envelope (E), membrane glycoprotein (M), and the nucleocapsid protein (N). The genome (NC_045512.2, 29 870 nucleotides long) of the virus can be broken into 11 encoding regions: ORF1ab (266-21555), S (21563-25384), ORF3a (25393- 26220), E (26245-26472), M (26523-27191), ORF6 (27202-27387), ORF7a (27394-27759), ORF7b (27756-27887), ORF8 (27894-28259), N (8274-29533), ORF10 (29558-29674) [6]. While the reference genome is used for most investigations, there is also an abundance of data available which can be used to monitor variations in the genome and analyze the evolution and nature of the virus. This data was assembled by GISAID to document different strains of the virus in a new database: EpiCoV. With the first viral entry on January 10 2020, the database has grown to 292,000 submissions [7]. In this study we use 137 636 of these strains to analyze the evolution of the virus. In our analysis we develop a co-evolutionary interaction network of nucleotide positions using an entropy-based method to infer genome-level interaction in the SARS-CoV-2 genome.

Coevolution

The variation of the virus’s genetic structure is of considerable medical and biological importance for prevention, diagnosis, and therapy. Mutations in the viral genome allows us to investigate potentially important relationships within the genome. Comparative RNA sequence analysis has long been used to investigate co-evolution via covariance of nucleotide mutations (30,31) with difficulty arising in the separating of indirect and direct interactions that lead to such co-variation. A similar issue in inferring protein residue interactions was addressed by Lapedes et al. [8] and many later groups [9], [10], [11] in a statistical physics methodology that has come to be called Direct Coupling Analysis (DCA), and which has successfully inferred direct interactions (DIs) in proteins as well as between proteins. For RNAs, the DCA-based methods infer physical interactions, both secondary and tertiary, between nucleotides in an RNA molecule by analyzing the co-evolutionary signals of nucleotides across sequences in the RNA family [12]. More recently, this analysis has been applied to the SARS-CoV-2 genome by Zeng et al. [13]. In this paper we will utilize a form of logistic regression optimization, Expectation Reflection [14], [15], with DCA to infer DI in the SARS-CoV-2 genome. These interactions may also provide information on protein–protein interaction. Additionally this analysis could be useful in vaccine development, aiding in efforts to mitigate “escape pathways” for the virus to use in future strains [16]. Comparing Incidence across Clades. We plot total number of retained columns (incidence) against the allowed conserved percentage per a given nucleotide position for both ORF1ab and S regions (across all clades). Comparing the Incidence of S and ORF1ab. We plot the percentage of retained columns (incidence) against the allowed conserved percentage per a given nucleotide position (in the full genome data set).

SARS-CoV-2 genome analysis using expectation reflection

Our analysis begins with the acquisition and alignment of genome sequence data, described in Section 4.2. Once the data is aligned we pre-process the aligned sequences by removing sequence positions which contain 95% or more conservation as discussed in Section 2.1. We infer covarying positions from the curated-aligned genome data using the Ising model of statistical physics pioneered by Lapedes et al. [8] which is outlined in Section 4.1. The resulting genome-wide interactions are discussed by encoding region in Section 2.2. Our analysis includes the presentation of position interaction maps and tabulation of the strongest resulting DI pairs. For the strongest DI pairs we also present the single site amino acid (AA) frequency as well as the AA-pair counts. Similar analysis is also applied to the G, GR, GH, S and V clades in Section 2.3. hCoV-19 Genome-Wide Interaction Map. We infer covariation in nucleotide positions across sequences.

Results

Clade incidence

As in any ab initio inference problem, when inferring co-evolutionary interactions between nucleotide positions in the SARS CoV-2 genome, we must consider certain properties of the data at hand. As an example, the length of the full genome is approximately 29 000 nucleotide positions. However, when we consider genome positions in which no single nucleotide is expressed more than 95% of the time (95% conserved), the relevant positions are reduced by approximately two-thirds. Decreasing the conserved percentage decreases the number of allowed repetitions in a given position (data column). In other words, as we decrease the threshold for conserved columns, the condition for variation at a given position becomes more stringent and the number of columns retained for analysis will decrease. As in the example above, going from 100% (full genome, all columns allowed) to 95% conservation removes a significant number of genome positions. Because inference with ER relies on mutations at a given position we must consider the resulting number of columns, or incidence, after such curation. In addition, region-specific incidence may also underline importance to the efficacy of the virus because higher incidence represents more variation and mutation. We also consider the clade-specific incidence of the full genome since we will consider genome interactions in different clades in future sections. In Fig. 1 we plot the incidence of the ORF1ab and S regions for different thresholds of allowed conservation. These incidence curves are given for the different clade data sets. Fig. 1 shows that region incidence varies between clades and that the incidence of different encoding regions is affected differently for a given clade. For example, consider the full genome data set against the S clade set in Fig. 1. The full genome data set (blue) has one of the highest ORF1ab incidence curves, but the same level of incidence is not necessarily expressed in the S region. In contrast, the S clade (purple) shows middling incidence for ORF1ab and one of the strongest incidence curves for the S encoding region.
Fig. 1

Comparing Incidence across Clades. We plot total number of retained columns (incidence) against the allowed conserved percentage per a given nucleotide position for both ORF1ab and S regions (across all clades).

Top 10 ORF1ab DI pairs. Listing the strongest DI pair positions (bolded positions not in ORF1ab). This analysis can also give some intuition on the nature of individual encoding regions by quantifying the level of variability expressed within a given clade. For example, in Fig. 1, we see differences in which clades express higher incidence between the ORF1ab and S encoding regions. We can apply the same principle to compare the ORF1ab and S regions in the full genome data set. In Fig. 2, we plot the incidence of ORF1ab and S in the full genome sequence set as we increase the allowed conserved percentage per given nucleotide position. This figure shows different levels of position-wise variability for the two encoding regions.
Fig. 2

Comparing the Incidence of S and ORF1ab. We plot the percentage of retained columns (incidence) against the allowed conserved percentage per a given nucleotide position (in the full genome data set).

For the remainder of the paper we will set the conservation threshold to 95%. This is in order to retain a significant number of positions for the subsequent analysis. We must also consider that the size of a given clade, or genome data set will affect the incidence and variability. Specifically, as the number of sequence considered changes, the level of variability, enforced by the conservation thresholds, will be altered as well. Therefore, when considering smaller data sets, such as the S and V clades, we must keep in mind that the incidence is affected by the cardinality of the set itself.

Genome wide analysis

We begin by inferring interactions between nucleotide positions across the entire genome. Fig. 3 shows a gray-scale of DI calculated from inferred couplings between positions using all available sequences. Position pairs which showed significant coevolution () are emphasized. In Fig. 3, the full interaction map is dominated by interactions in ORF1ab (positions 266-21556), S (positions 21564–25384), and ORF7ab (27395-27888). Proximal nucleotide positions (diagonal of the interaction map) express strong covariation as shown by the thick black diagonal bar. In fact, we suppress the emphasis of for proximal pairs () in all interaction maps due to the prevalence of proximal pair interactions with . However, there are several off-diagonal position pairs (far apart in the genome) which show strong covariation. This shows potential evolutionary links between specific positions or regions in different parts of the genome. As an example we can consider the top 5 DI pairs for the entire genome in Table given below. While most of the position pairs are proximal, the strongest interaction is more than 20 000 nucleotides apart.
Fig. 3

hCoV-19 Genome-Wide Interaction Map. We infer covariation in nucleotide positions across sequences.

In order to further explore features from the full interaction map we will divide the full genome into different encoding regions, focusing on those regions which show significant incidence.

ORF1ab

The ORF1ab region of SARS-CoV-2 genome is an important polyprotein gene which encodes 16 nonstructural proteins important to the life cycle of the virus. Because of this importance, some of the proteins encoded in this region have been proposed as potential targets for antiviral therapy [17], [18], [19]. Fig. 3 shows that the region also has the largest number of non-conserved nucleotide positions, or position incidence, (as described in Section 2.1) of the major encoding regions. This increased incidence is expressed in the cardinality of the interaction map (see Fig. 4).
Fig. 4

ORF1ab Interaction Map.

Table 1 shows the top 10 DI pairs in ORF1ab with bolded positions representing distal (non ORF1ab) positions. Note that more than half of the top 10 DI pairs in Table 1 are distal which may be an indicator of the significance of the region to other encoding regions (and their resulting functions).
Table 1

Top 10 ORF1ab DI pairs. Listing the strongest DI pair positions (bolded positions not in ORF1ab).

Position 1Position 2DI
1440830370.456
1059255630.441
14805261440.380
7540234010.343
14408234040.340
3037234030.335
21254222270.329
18555234010.313
21367213690.291
1163234010.283
ORF1ab Interaction Map. While the interaction map and DI pairs are a useful overview of genome coevolution, it is important to consider whether these coevolving positions result in alterations of amino acids encoded by the interacting positions. Table 2 gives the resulting proportion of amino acids (AA) encoded by the nucleotide positions outlined in Table 1. The table shows frequent occurrence for a dominant AA at any given position. This analysis can be extended to consider the AA pair counts for a given genome position pair. Table 3 shows the AA pair counts of the dominant DI pairs in ORF1ab. In the given position pair matrices we see three cases arise: A high AA-pair count on the diagonal, on a single row/column, or in a single element. A high AA-pair count on the diagonal means the prevalence of two AA-pairings resulting from the coevolution of the two positions. The AA-count between positions 14 408 and 23 404 is a good example of this first case. Dominance of a given row or column suggests that one position remains mostly fixed while the other position expresses variability, as seen in the AA-counts for the position position pairing of 3037 and 23 403 where 3037 almost always encodes Phenylalanine. The final case in Table 3 is a single AA-pair expressed dominantly for a given position as seen with positions 21 367 and 21 369. This case shows little coevolution and generally occurs in AA-pairs from positions with lower DI where both positions have a dominant AA in their single site AA frequency (see 21 367 and 21 369 in Table 2). We continue with the same analysis for the S encoding region.
Table 2

ORF1ab Single Site AA Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF1ab.

Single site Freqs.
14408L: .857, P: .139
3037F:.995
1059T:832,I:.164
25563Q:.783,H:.215
14805Y:.998
26144G:.945,V:.040
7540T:.997
23401Q:.999
23404G:.861,D:.138
23403G:.861,D:.138
21254A:.933,X:.106
22227A:.956,V:.029
18555D:.999
21367I:.941,X:.059
21369I:.941,X:059
1163I:.938,F:.059
Table 3

ORF1ab AA Pair Counts. Encoded amino acid counts for high ranked DI positions in ORF1ab. Considering the most prevalent amino acids for each position.

3037 14408FOther
L117404497
P19006117
Other54468
ORF1ab Single Site AA Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF1ab. ORF1ab AA Pair Counts. Encoded amino acid counts for high ranked DI positions in ORF1ab. Considering the most prevalent amino acids for each position.

Spike glycoprotein

The Spike protein encoding region (S) plays a vital role in viral entry into the host cells [20]. As a result this region is considered a key target in current vaccine development [21]. In Fig. 5 and Table 4 we present both the interaction map and the top ranked DI pairings respectively for the S region gene positions. In addition to this DI analysis we present the single site frequencies and AA-pair non-singular AA-counts in Table 5, Table 6. For the S encoding region we only include the top 3 rated DI amino acid pairings. Regardless of this curation, the AA-pair matrices in Table 6 show a single AA-pair count prevalence for all but the strongest DI pairing. This pairing, between positions 22 363 and 22 384, shows a strong diagonal. However, the alternate can be any amino acid pairing which shows that the main trend is the threonine-valine combination.
Fig. 5

S Interaction Map.

Table 4

Top 10 S DI pairs. Listing the strongest DI pair positions (bolded positions not in S).

Position 1Position 2DI
22363223840.915
2340175400.343
23401185550.313
22497224950.301
22353223550.292
2340111630.283
23401166470.275
22334223360.263
22539225410.245
22526225240.239
Table 5

S Single Site AA Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in S.

Single Site Freqs.
22363V:.894,X:.106
22384T:.896,X:.104
23401Q:.998,X:.001
7540T:.997,X: .003
18555D:.997,X:.003
22497I:.893,X:.107
22495G:.893,X:.107
22353A:.891,X:.109
1163I:.938,X:.057
16647T:.989,X:.011
22334W:.937,X:.062
22336W:.938,X:.062
22539I:.951,X:.049
22541V:.994,X:.005
22526P:.947,X:.053
22524Q:.937,X:.062
22354A:.891,X:.109
22488E:.893,X:.106
Table 6

S AA Pair Counts. Encoded amino acid counts for high ranked DI positions in S. Considering the most prevalent amino acids for each position.

22363 22384TXOther
V12255642613
X736138821
Other8113
Top 10 S DI pairs. Listing the strongest DI pair positions (bolded positions not in S). S Single Site AA Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in S. S AA Pair Counts. Encoded amino acid counts for high ranked DI positions in S. Considering the most prevalent amino acids for each position.

ORF3a

The ORF3a gene region encodes a unique membrane protein with a 3-membrane structure and it is essential for the pathogenesis of the disease [22], [23]. The interaction map of ORF3a expresses minimal incidence so it is not shown. However, the interactions in ORF3a are still important to consider. The region itself shows little variation (only 5 pairs have DI) with the exception of position 25 563 as seen in previous work [24]. Regardless, in Table 7 we are able to show that 25 563 interacts with several other regions including ORF1ab and S with the strongest coevolution occurring with position 1059 in the ORF1ab region. We consider the encoded amino acids for positions coevolving in ORF3a in Table 8, Table 9. In Table 9 we see a new case in the AA-pair distribution with position pairs of 25 563 and both 22 992 and 25 429. In both these pairs, the majority of the AA pair count is the primary AA for each position. However, the second most prevalent pairing is in the primary–secondary AA couples for the position. While a high count on the matrix diagonal represents the prevalence of two AA pairs, the distribution in these pairs shows more variety with 3 significant pairings expressed in the position interaction.
Table 7

Top 5 ORF3a DI pairs. Listing the strongest DI pair positions (bolded positions not in ORF3a).

Position 1Position 2DI
2556310590.441
25563224440.202
25563229920.155
25563254290.129
25563202680.117
Table 8

ORF3a Single Site Amino Acid Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF3a.

Single Site Freqs.
25563Q:.783,H:.215
1059T:.832,I:.164
22444D:.891,X:.109
22992S:.914,N:.046,X:.039
25429V:.893,L:.105
20268L:.961,X:.039
Table 9

ORF3a AA Pair Counts. Encoded amino acid counts for high ranked DI positions in ORF3a. Considering the most prevalent amino acids for each position.

25563 1059QHOther
T1070277275222
I2542223645
Other4349944
Top 5 ORF3a DI pairs. Listing the strongest DI pair positions (bolded positions not in ORF3a).

ORF7ab

ORF7ab contains a viral antagonist of host restriction factor BST-2/Tetherin and induces apoptosis [25]. We present the interaction map for both ORF7a and ORF7b separately in Fig. 6, Fig. 7. We also present the top 10 DI pairs for each region in Table 10.
Fig. 6

ORF7a Interaction Map.

Fig. 7

ORF7b Interaction Map.

Table 10

Top 10 ORF7ab DI pairs. Listing the strongest DI pair positions.

Position 1Position 2DI
27801278030.365
27688277920.333
27698277000.299
27579275810.295
27803278050.277
27794277960.268
27804278060.248
27805278070.227
27758277610.209
27688277520.188
It is important to note that there was little significant coevolution between positions in ORF7ab and other regions of the genome. We will only present the AA single site frequencies, in Table 11 for this region because of the single AA dominance at each position considered. In addition to this strong dominance () of the primary AA at each position, the secondary AA at all positions was undefined (X).
Table 11

Single Site Amino Acid Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF7ab.

Single Site Freqs.
27801F:.934,X:.065
27803F:.934,X:.065
27792F:.929,X:.071
27688P:.904,X:.096
27698L:.904,X:.1
27700I: .903,X.1
27579I:.902, X:.097
27581F:.901,X: .099
27805L:.941,X:058
27794F:.929,X:071
27696F:.931,X:.069
27804L:.941,X:058
27806L:.941,X:058
27807L:.988,X:.011
27761I:.902,X:.097
27752T:.902,X:.098
ORF3a Single Site Amino Acid Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF3a. ORF3a AA Pair Counts. Encoded amino acid counts for high ranked DI positions in ORF3a. Considering the most prevalent amino acids for each position. Top 10 ORF7ab DI pairs. Listing the strongest DI pair positions. Single Site Amino Acid Frequencies. We consider the distributions of resulting amino acids from the nucleotide positions which had the highest DI in ORF7ab. S Interaction Map.

Nucleocapsid

We conclude our complete genome analysis with a brief description of our findings of coevolution in the Nucleocapsid (N) encoding region. The N protein plays varied roles in the regulation of the infected cell metabolism and packaging of the viral genome. Therefore, the protein plays an important role in both replication and transcription [26]. The N region contains an specific nucleotide variation in the triplet 28 881, 28 882, and 28 883 discussed in previous work [24]. This triplet, appropriately, presents the only DI in the region, as presented in Table 12.
Table 12

Nucleocapsid (N) DI pair. The single pairing is also one of the strongest DI in the genome.

Position 1Position 2DI
28881288830.801
Nucleocapsid (N) DI pair. The single pairing is also one of the strongest DI in the genome.

Clades

When investigating genetic variance, it is useful to stratify available data to understand and analyze genomic diversity. Analysis of genetic variance plays a crucial role in expanding knowledge and developing prevention strategies. Previous work has developed phylogenetic trees and divided the SARS-CoV-2 genome both genomically and geographically into clades [27]. We extend this analysis to see how such stratification affects the virus’s genome-wide covariation. Before presenting our results on clades, it is important to note that our initial results yielded many previously defined clade determinants [27]. These determinant nucleotide (NT) positions are bolded in Table 13. In the following sections we apply our method to these clades and investigate the resulting change in the coevolution of NT positions across the genome.
Table 13

Clade Determinants in Genome-Wide Analysis. We outline the clade determinants from DI presented in Section 2.2 (bolded).

Clade Determinants Present in DI
Position 1Position 2DIRegionClade
1440830370.456ORF1abG
1059255630.441ORF1abGH
14805261440.380ORF1abV
14408234040.340ORF1abG
3037234030.335ORF1abG, GR
2556310590.441ORF3aG
25563224440.202ORF3aG
25563229920.155ORF3aG
25563254290.129ORF3aG
25563202680.117ORF3aG
28881288830.801NGR
Clade Determinants in Genome-Wide Analysis. We outline the clade determinants from DI presented in Section 2.2 (bolded).

G clade

We begin our clade analysis with the largest existing clade. The G clade is stratified by the most common set of events, a quadruplet of mutations: C241T, C3037T, C14408T, A23403G [27]. Extracting genome sequences with these features we re-apply ER, resulting in the interaction map in Fig. 8. Comparing the full interaction map (Fig. 3) and the G clade interaction map (Fig. 8), we see a drastic change in incidence.
Fig. 8

G Clade Full Genome Interaction Map.

Both genome sequence sets have the same curation applied, with pre-processing removing genome positions which were conserved. However, within the G clade we see a severely decreased incidence in ORF1ab, with positions from NT 19 300 to 19 500 no longer showing sufficient variation. This change in incidence is expressed in Fig. 9 as a decrease in cardinality of the interaction map from the full genome set to the G clade genome set.
Fig. 9

ORF1ab Interaction Maps. Showing the interaction map for the ORF1ab region with both the full genome sequence set (left), as well as the G clade genome sequence set (right).

The incidence change in the S encoding region is much less severe. The NT range from 2200 to 24 000 appears mostly unchanged, showing the same pattern of significant DI (, red dots). In addition to the changed incidence of ORF1ab and S. While variation of the different encoding regions differs in ORF1ab and S, there are alterations in the top ranked DI positions throughout both regions. ORF7a Interaction Map. ORF7b Interaction Map. G Clade Full Genome Interaction Map. ORF1ab Interaction Maps. Showing the interaction map for the ORF1ab region with both the full genome sequence set (left), as well as the G clade genome sequence set (right). First consider Table 14, which gives the top ranked DI pairs in ORF1ab for the full genome sequence set and the G clade genome sequence set side by side. None of the top ranked DI pairs from the full genome analysis remain in the top ranked pairs for the G clade. This is not due to an overall loss in information, as the DI magnitude remains in the same region (DI). Most of the removed pairs were G clade determinants, therefore these positions are fixed in the G clade, specifically positions 14 408, 3037, 23 403 (and proximal positions). By fixing these positions we effectively removed the variation of positions represented in 6 of the top 10 pairs. However, it seems that the connection between ORF1ab and other encoding regions, specifically the S region, was retained. Consider the NT position 23 403 from the S encoding region, which was fixed in the G clade genome sequence set. During the stratification of the G clade genome sequence set, the variation at 23 403 (likely 23 401 and 23 404 as well) was removed. This small NT group in region S accounted for 4 of the top 10 DI pairs for ORF1ab. However, while 23 403 and its corresponding positions no longer coevolved with ORF1ab, the NT position 22 870 expressed a very strong connection with ORF1ab in the G clade.
Table 14

ORF1ab top ranked DI pairs. We show the 10 top ranked DI pairs for ORF1ab generated from the full genome sequence set and the G clade genome sequence set.

Full Genome
Position 1Position 2DI
1440830370.456
1059255630.441
14805261440.380
7540234010.343
14408234040.340
3037234030.335
21255222270.329
18555234010.313
21367213690.291
1163234010.283
ORF1ab top ranked DI pairs. We show the 10 top ranked DI pairs for ORF1ab generated from the full genome sequence set and the G clade genome sequence set. This trend continues in the S encoding region. Table 15 shows the top ranked DI pairs in the S encoding region for the full genome sequence set and the G clade genome sequence set. As before, none of the top ranked DI pairs from the full set survive the stratification which creates the G clade genome sequence set.
Table 15

S top ranked DI pairs. We show the 10 top ranked DI pairs for S generated from the full genome sequence set and the G clade genome sequence set.

Full Genome
Position 1Position 2DI
22363223840.915
2340175400.343
23401185550.313
22497224950.301
22353223550.292
2340111630.283
23401166470.275
22334223360.263
22539225410.245
22526225240.239
22354224880.218
S top ranked DI pairs. We show the 10 top ranked DI pairs for S generated from the full genome sequence set and the G clade genome sequence set.

Other clades

We conclude our analysis with an overview of the remaining clades presented in previous work: GR, GH, S, and V clades [27]. Specifically we present the interaction maps for the different clades (Fig. 10, Fig. 11, Fig. 12, Fig. 13).
Fig. 10

GH Clade Interaction Map. Showing the interaction map for the GH clade genome sequence set.

Fig. 11

GR Clade Interaction Map. Showing the interaction map for the GR clade genome sequence set.

Fig. 12

S Clade Interaction Map. Showing the interaction map for the S clade genome sequence set.

Fig. 13

V Clade Interaction Map. Showing the interaction map for the V clade genome sequence set.

With the given number of sequences in the smaller clades, the sample size limits the inference. This is evident when considering the shift in DI threshold in the Fig. 10, Fig. 11, Fig. 12, Fig. 13 with significant DI changing . Due to the smaller sample sizes in these clades, we forego further analysis as the inference would be unreliable.

Discussion

In this work we have presented a novel analysis of the SARS CoV-2 genome by finding genome interactions both within and across encoding regions. In our analysis of the SARS CoV-2 genome we have presented several perspectives on nucleotide interactions in the genome. These interactions showed both proximal interactions within individual encoding regions as well as distal interactions between different encoding regions throughout the genome. We inferred nucleotide position interactions for the entire genome as well as the separate encoding regions. Particular attention was given to the ORF1ab and S regions, which demonstrated the highest variability in the given data set. We were able to draw analogous conclusions from previous work by inferring the most common variations previously reported [24]. Additionally, our genome-wide interaction maps expressed determinant positions of all clades available at the time our data was acquired [27]. Generating interaction maps of individual clades showed clade-specific coevolution of nucleotide positions. We further considered the level of variability, both within regions of the full genome data set and for different clades. This was accomplished by varying the threshold for conserved columns while considering the retained column incidence. This relationship shows nucleotide variability is different both between encoding regions of the full genome and between different clades. Region-specific incidences are not consistent between clades, with individual regions expressing different variability in different clades. Comparison of region-specific incidence can also give intuition on the level of variability or specific regions within specific clades. GH Clade Interaction Map. Showing the interaction map for the GH clade genome sequence set. GR Clade Interaction Map. Showing the interaction map for the GR clade genome sequence set. Our work shares significant similarities in results with the work of Zeng et al. [13] in which 50 000 SARS-CoV-2 genomes were analyzed using pseudo-likelihood maximization [28]. We inferred 5 of the 8 epistatic links described in their work with the notable exception of not inferring 3 interaction pairs: , and in the ORF1ab region. This exclusion is likely due to the larger data-set used in our analysis as those links were removed during post-processing due to insignificant DI scoring (). Another notable difference between the results and methodology of this work and that of [13] is in the consideration of inherited significance. It should be noted that phylogenetic statistical analysis is a deep subject and there is a great deal of subtlety regarding the applicability of simple approaches like randomization [29]. [13] utilize phylogenetic randomization to filter out insignificant inferences that arise only due to phylogenetic relationships. We simply use data-driven clade stratification to remove such effects. This stratification in turn yields novel sets of interactions which are clade specific and therefore are unlikely to be influenced by shared inheritance. In the absence of clade determinant positions however, the use of phylogenetic randomization by [13] would be necessary. S Clade Interaction Map. Showing the interaction map for the S clade genome sequence set. V Clade Interaction Map. Showing the interaction map for the V clade genome sequence set. Future extensions of this analysis provide several avenues of investigations. First, as the database of SARS CoV-2 genomes grows, the incidence and overall variability will increase, yielding further insights into genome interactions. Additionally, the availability of data over longer time periods will allow for chronological compartmentalization of genome data sets and interaction maps can be compared across the temporal evolution of the virus. Second, this analysis can also be applied to diseases for which there is more data available as the importance of genome interactions is not SARS-CoV-2 specific.

Materials and methods

Ising model covariation analysis

We use the standard Ising model analogy first applied to protein sequences by Lapedes et al. [8]. This basic method has been applied by many groups with small refinements on how the couplings of the Ising model are actually computed. We formulated the Ising model coupling calculation in terms of logistic regression with one-hot encoding of the genome sequences. As there is no closed form solution to logistic regression problems, essentially all methods for finding coefficients are iterative. The particular algorithm we used to find the logistic regression coefficients was Expectation Reflection which has been demonstrated to perform well in the limit of for small sample sizes [15], but the results do not differ appreciably from logistic regression with regularization as implemented in the Scikit-learn Python package, for example. Indeed, the Scikit-learn logistic regression implementation turns out to be considerably faster than our ER implementation. Here we outline the method and how it is applied to infer connections between genome positions. We begin with a given genetic sequence, In order to translate this into a binary variable sequence required for the Ising model we can use a OneHot transformation [30]. This transformation converts a nucleotide into a binary representation, which allows us to convert the genome sequence set into a binary sequence set. Therefore the previous sequence in Eq. (1) becomes , Now that we have established the conversion of a set of genetic sequences into a set of binary variables we can continue to the computation of the interactions between these binary variables. Given a binary variable representing the th sequence such that with sequences, or states, which the SARS-CoV-2 virus genome can take. An individual sequence has the form where is the number of binary variables representing the nucleotides such that with the length genome . We can then assume that given a current sequence , an individual position in a future sequence changes stochastically according to the following conditional probability, where the local field, , where represents the connection between positions and . is a function of the current sequence state and expresses the influence of a given sequence position on the future state of sequence position . This conditional relationship allows us to iteratively search for the ideal using all available sequences. The method, and resulting algorithm, is discussed in further detail in previous work [14], [15]. The concept of how to go from the coefficient matrix to a measure of interaction strength between sequence position pairs we take from [11], [12], [28], [31]. They defined a Direct Information (DI) between all position pairs using the computed interaction matrix.

Genome data: Acquisition and alignment

The data used for both the full genome-wide covariation and the clade-specific genome-wide covariation was acquired from GISAID [7]. We downloaded all 137 636 available complete SARS-CoV-2 sequences on October 19, 2020. The resulting sequences were then aligned on the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD. MAFFT [32] was used to align the sequences.

Code and data

The raw and processed data, along with the code and instructions for both the processing and analysis of the data is available on github at https://github.com/nihcompmed/SARS-CoV-2-genome.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Position 1Position 2DI
2237223840.915
28881288830.801
29700297210.670
2413130.665
1440830370.456
  23 in total

1.  Network inference in stochastic systems from neurons to currencies: Improved performance at small sample size.

Authors:  Danh-Tai Hoang; Juyong Song; Vipul Periwal; Junghyo Jo
Journal:  Phys Rev E       Date:  2019-02       Impact factor: 2.529

2.  Coordinate linkage of HIV evolution reveals regions of immunological vulnerability.

Authors:  Vincent Dahirel; Karthik Shekhar; Florencia Pereyra; Toshiyuki Miura; Mikita Artyomov; Shiv Talsania; Todd M Allen; Marcus Altfeld; Mary Carrington; Darrell J Irvine; Bruce D Walker; Arup K Chakraborty
Journal:  Proc Natl Acad Sci U S A       Date:  2011-06-20       Impact factor: 11.205

Review 3.  Inverse statistical physics of protein sequences: a key issues review.

Authors:  Simona Cocco; Christoph Feinauer; Matteo Figliuzzi; Rémi Monasson; Martin Weigt
Journal:  Rep Prog Phys       Date:  2018-03

4.  Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction.

Authors:  Eleonora De Leonardis; Benjamin Lutz; Sebastian Ratz; Simona Cocco; Rémi Monasson; Alexander Schug; Martin Weigt
Journal:  Nucleic Acids Res       Date:  2015-09-29       Impact factor: 16.971

Review 5.  Coronavirus spike proteins in viral entry and pathogenesis.

Authors:  T M Gallagher; M J Buchmeier
Journal:  Virology       Date:  2001-01-20       Impact factor: 3.616

6.  An 81-Nucleotide Deletion in SARS-CoV-2 ORF7a Identified from Sentinel Surveillance in Arizona (January to March 2020).

Authors:  LaRinda A Holland; Emily A Kaelin; Rabia Maqsood; Bereket Estifanos; Lily I Wu; Arvind Varsani; Rolf U Halden; Brenda G Hogue; Matthew Scotch; Efrem S Lim
Journal:  J Virol       Date:  2020-07-01       Impact factor: 5.103

Review 7.  Viral and cellular RNA helicases as antiviral targets.

Authors:  Ann D Kwong; B Govinda Rao; Kuan-Teh Jeang
Journal:  Nat Rev Drug Discov       Date:  2005-10       Impact factor: 84.694

8.  Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China.

Authors:  Aiping Wu; Yousong Peng; Baoying Huang; Xiao Ding; Xianyue Wang; Peihua Niu; Jing Meng; Zhaozhong Zhu; Zheng Zhang; Jiangyuan Wang; Jie Sheng; Lijun Quan; Zanxian Xia; Wenjie Tan; Genhong Cheng; Taijiao Jiang
Journal:  Cell Host Microbe       Date:  2020-02-07       Impact factor: 21.023

9.  The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2.

Authors: 
Journal:  Nat Microbiol       Date:  2020-03-02       Impact factor: 17.745

10.  Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine.

Authors:  Wanbo Tai; Lei He; Xiujuan Zhang; Jing Pu; Denis Voronin; Shibo Jiang; Yusen Zhou; Lanying Du
Journal:  Cell Mol Immunol       Date:  2020-03-19       Impact factor: 11.530

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.