
Randomness for Nucleotide Sequences of SARS-CoV-2 and Its Related Subfamilies.

Ray-Ming Chen

Abstract

The origin and evolution of SARS-CoV-2 have been an important issue in tackling COVID-19. Research on these topics would enhance our knowledge of this virus and help us develop vaccines or predict its paths of mutation. There is much theoretical and clinical research in this area. In this article, we devise a structural metric which directly measures the structural differences between any two nucleotide sequences. In order to explore the mechanisms of how the evolution works, we associate the nucleotide sequences of SARS-CoV-2 and its related families with degrees of randomness. Since the distances between randomly generated nucleotide sequences are tightly concentrated around a mean with low variance, they qualify as good candidates for a fundamental reference. Such a reference can then be applied to measure the randomness of other Coronaviridae sequences. Our findings show that the relative randomness ratios are very consistent and concentrated. This result indicates that their randomness is very stable and predictable. The findings also reveal the evolutionary behaviours of the Coronaviridae and all its subfamilies.
Copyright © 2020 Ray-Ming Chen.

Year:  2020        PMID: 33273962      PMCID: PMC7683162          DOI: 10.1155/2020/8819942

Source DB:  PubMed          Journal:  Comput Math Methods Med        ISSN: 1748-670X            Impact factor:   2.238


1. Introduction

COVID-19 has had a huge impact on all walks of life. To develop stable and trustworthy vaccines [1, 2], one needs to track and analyse the properties of SARS-CoV-2, which, together with MERS-CoV [3] and SARS-CoV, belongs to the betacoronavirus subfamily. Besides, one also needs to compare the properties of its related families: alphacoronavirus, deltacoronavirus, and gammacoronavirus [4]. In the Coronaviridae, betacoronavirus is the most deadly subfamily. In this category, SARS-CoV, MERS-CoV, and SARS-CoV-2 emerged in 2003, 2012, and 2019, respectively. To evaluate and analyse their properties, there are many genomic, clinical, statistical, and analytical tools available. Among all the theoretical and clinical research, genetic analysis provides a straightforward way to delve into the structures of Coronaviridae [5, 6]. Some researchers focus on geographic, demographic, and genomic analysis to extract patterns of the viruses [7, 8]. Though the origin and evolution of these viruses were studied previously (for example, MERS-CoV [9] and SARS [10, 11]), there is still a long way to go to map out the interaction of these viruses. Currently, there are many theories and much evidence about the mechanisms regulating the evolution and mutation of SARS-CoV-2 [12-14]. Nonetheless, a decisive account of such mechanisms still depends on further research and findings. In this article, we analyse their properties from the viewpoint of randomness, i.e., the degree of randomness of their nucleotide sequences. We devise a structural metric which is applied to measure the distances between all sorts of Coronaviridae nucleotide sequences and randomly generated nucleotide sequences. These distances indicate how far the Coronaviridae sequences are from random nucleotide sequences. We utilise the data of coronavirus genomes from NCBI datasets [15]. Then, we measure the distances for each individual subfamily of the Coronaviridae.
Our results show that this structural metric is well suited to revealing the properties of randomness. In particular, the relative distances between the random sequences are fairly stable and concentrated, which makes the concept of randomness feasible. From these settings, we then calculate the relative randomness ratios (RRR) and extract our findings from them. The method for implementing this notion is described in Section 3, the results of the implementation are listed in Section 4, and the conclusions are reached in Section 5.

2. Theoretical Settings

In order to clearly measure the distances between structures, we devise a structural metric in this section, which is applied in the later sections. For any vector $v$, we use $v_j$ to denote its $j$th element and $|v|$ to denote its length. We also use $\|v\|$ to denote its Euclidean norm.

2.1. Common Finite Interval (CFI)

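Let AFS denote the set of all the ascending finite sequences. Let $u, v \in$ AFS be arbitrary. Define the greatest lower bound of their common interval by $a = \max\{u_1, v_1\}$. Define the least upper bound by $b = \min\{u_{|u|}, v_{|v|}\}$. Let $u_{[a,b]}$ denote the subsequence of $u$ whose elements lie between $a$ and $b$. Let $\mathrm{Set}(u)$ denote the set of all the elements of $u$. Let finite $K \subseteq \mathbb{R}$ be arbitrary. Let Sort(K) ∈ FINI denote the vector obtained by sorting all the elements in $K$ in ascending order. Define a difference operator Diff over finite vectors by $\mathrm{Diff}(v) = (v_2 - v_1, v_3 - v_2, \ldots, v_n - v_{n-1})$, where $n = |v|$.

Definition 1 .

For any $v \in$ AFS, any $a < b$, define $v_{[a,b]}$ by $v_{[a,b]} = (v_j : a \le v_j \le b)$, with the elements kept in their original order.

Definition 2 .

(common subsequence). If $u, v \in$ AFS, we define $\mathrm{CS}(u, v)$ by $\mathrm{CS}(u, v) = \mathrm{Sort}\bigl(\mathrm{Set}(u_{[a,b]}) \cup \mathrm{Set}(v_{[a,b]})\bigr)$, where $a = \max\{u_1, v_1\}$ and $b = \min\{u_{|u|}, v_{|v|}\}$. This serves as the common structure between two structures.

Definition 3 .

(ascending finite sequences). Let $[a, b]_<$ (where $a < b$) denote the set of all the ascending real vectors whose first element is $a$ and last element is $b$. Let FINI be the union of all $[a, b]_<$ for any $a < b$, i.e., $\mathrm{FINI} = \bigcup\{[a, b]_< : a < b,\ a, b \in \mathbb{R}\}$.

Definition 4 .

(structural metric). Define a distance function δ over FINI by $\delta(u, v) = \dfrac{\|\mathrm{Diff}(u)\| + \|\mathrm{Diff}(v)\|}{2} - \|\mathrm{Diff}(\mathrm{CS}(u, v))\|$, which is the form evaluated base by base in Example 1.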

Claim 1 .

δ is a metric on AFS.

Proof

It can be proved, according to Definition 4, by taking all the possible cases regarding their relations of intervals into consideration.

Claim 2 .

If $d_1, d_2, \cdots, d_n$ is a set of metrics over a set $K$ and $\alpha_1, \ldots, \alpha_n$ are positive weights, then $d(a, b) = \sum_{i=1}^{n} \alpha_i \cdot d_i(a, b)$ is also a metric on $K$.

Proof

It follows immediately from the definitions of a metric.
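
The only axiom that needs a line of algebra is the triangle inequality. A short sketch, assuming the weights $\alpha_i$ are positive (an assumption made here; the claim itself does not state it) and writing $n$ for the number of metrics:

```latex
\begin{aligned}
d(a,c) &= \sum_{i=1}^{n} \alpha_i\, d_i(a,c)
        \le \sum_{i=1}^{n} \alpha_i \bigl( d_i(a,b) + d_i(b,c) \bigr) \\
       &= \sum_{i=1}^{n} \alpha_i\, d_i(a,b) + \sum_{i=1}^{n} \alpha_i\, d_i(b,c)
        = d(a,b) + d(b,c).
\end{aligned}
```

Non-negativity and symmetry carry over term by term, and with positive weights $d(a, b) = 0$ forces every $d_i(a, b) = 0$, hence $a = b$.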

Example 1 .

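Suppose two nucleotide sequences N1 and N2 are given. Let $p_{iQ}$ denote the ascending vector of positions of nitrogenous base $Q$ in sequence $i$. Let $p_{12Q}$ denote the common subsequence of $p_{1Q}$ and $p_{2Q}$. Then, the results are presented in Table 1. Let BASES = {“A”, “C”, “G”, “T”}. Now we define
$$\delta(N_1, N_2) = \frac{1}{4}\sum_{Q \in \mathrm{BASES}} \delta_Q(p_{1Q}, p_{2Q}) = \frac{1}{4}\sum_{Q \in \mathrm{BASES}} \left[\frac{\|p_{1Q}\| + \|p_{2Q}\|}{2} - \|p_{12Q}\|\right],$$
where the last equality comes directly from Definition 4 and each norm is that of the corresponding difference vector in Table 1. Since $\delta_A = (\sqrt{41} + \sqrt{71})/2 - \sqrt{26} \approx 2.32$, $\delta_C = (\sqrt{41} + \sqrt{143})/2 - \sqrt{33} \approx 3.44$, $\delta_G = (\sqrt{69} + \sqrt{106})/2 - \sqrt{26} \approx 4.20$, and $\delta_T = (\sqrt{48} + \sqrt{45})/2 - \sqrt{24} \approx 1.92$, we obtain δ(N1, N2) = (2.32 + 3.44 + 4.20 + 1.92)/4 = 2.97.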
Table 1

Position, difference vectors, and norms: N1 and N2.

Name   Position (index)                  Difference vector       Norm
p1A    (1, 5, 10)                        (4, 5)                  √41
p1C    (2, 3, 6, 11, 13, 14, 15)         (1, 3, 5, 2, 1, 1)      √41
p1G    (7, 9, 17, 18)                    (2, 8, 1)               √69
p1T    (4, 8, 12, 16)                    (4, 4, 4)               √48
p2A    (4, 5, 10, 14, 19, 21)            (1, 5, 4, 5, 2)         √71
p2C    (1, 2, 3, 7, 17, 22)              (1, 1, 4, 10, 5)        √143
p2G    (8, 11, 15, 24)                   (3, 4, 9)               √106
p2T    (6, 9, 12, 13, 16, 18, 20, 23)    (3, 3, 1, 3, 2, 2, 3)   √45
p12A   (4, 5, 10)                        (1, 5)                  √26
p12C   (2, 3, 6, 7, 11, 13, 14, 15)      (1, 3, 1, 4, 2, 1, 1)   √33
p12G   (8, 9, 11, 15, 17, 18)            (1, 2, 4, 2, 1)         √26
p12T   (6, 8, 9, 12, 13, 16)             (2, 1, 3, 1, 3)         √24
The weights are all preset to 1/4, one for each nitrogenous base. These values could also be adjusted according to professional judgement. For example, the weights could be determined by the relative frequencies of the bases. Example 1 lays the foundation for our later arithmetical calculations.
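
The computation in Example 1 can be retraced with a short Python sketch (the article itself used R; Python is used here purely for illustration). The strings `N1` and `N2` are not given in the text: they are rebuilt from the position lists in Table 1, and the function names are hypothetical. The distance follows the form used in Example 1, with equal weights of 1/4.

```python
import math

BASES = "ACGT"

def positions(seq, base):
    """Ascending list of 1-based positions of `base` in `seq`."""
    return [i for i, ch in enumerate(seq, start=1) if ch == base]

def diff_norm(v):
    """Euclidean norm of the difference vector of v (0.0 if v has < 2 entries)."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(v, v[1:])))

def common_sequence(u, v):
    """Sorted union of the elements of u and v lying in their common interval."""
    if not u or not v:
        return []
    a, b = max(u[0], v[0]), min(u[-1], v[-1])
    return sorted({x for x in u + v if a <= x <= b})

def delta_base(u, v):
    """Structural distance for one base, in the form used in Example 1."""
    return (diff_norm(u) + diff_norm(v)) / 2 - diff_norm(common_sequence(u, v))

def delta(n1, n2):
    """Equal-weight (1/4 each) average of the four per-base distances."""
    return sum(delta_base(positions(n1, q), positions(n2, q)) for q in BASES) / 4

# Assumption: sequences rebuilt from the position lists of Table 1.
N1 = "ACCTACGTGACTCCCTGG"
N2 = "CCCAATCGTAGTTAGTCTATACTG"

print(common_sequence(positions(N1, "A"), positions(N2, "A")))  # [4, 5, 10]
print(round(delta(N1, N2), 2))                                  # 2.97
```

Replacing the constant 1/4 by base-specific weights would give the frequency-weighted variant mentioned above.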

3. Methods

There are several steps for calculating the relative randomness ratios (RRR). Generate a set of 1000 random nucleotide sequences whose lengths are all fixed at 30000. The generated random (nucleotide) sequences are presented in Table 2.
Table 2

1000 sampled random nucleotide sequences.

Sample   Random sequence                          Length
s1       CCTTTCGTTGCTCAT ⋯ GTTTATGGTACGCAGC       30000
s2       TGAGTATCTGGATCC ⋯ GCCACATGGCCAGTCC       30000
⋮
s999     TCGAGTGTCGGACTC ⋯ ATCCGGAGTTCTCCGA       30000
s1000    TAATCCAAAACAATA ⋯ AGCCTTAGGTCCTATT       30000
Each sequence is regarded as a node. We then calculate the distance matrix for these nodes. The metric is a weighted combination of four metrics, each of which measures the structural distance with respect to one nitrogenous base. A concrete computation is shown in Example 1.
Some patterned nucleotide sequences are created and their distances to the random sequences are calculated. These sequences are nonessential; they are generated only for comparative purposes. The rule-generated nucleotide sequences and their distances are presented in Table 3.
Table 3

Distances between patterned sequences and random ones.

Name   [s1, s2, s3, ⋯, s998, s999, s1000]                 Min     Max     Mean    Sd.
q1     [106.9, 107.3, 107.0, ⋯, 108.7, 108.0, 107.5]      105.6   109.6   107.4   0.62
q2     [114.7, 114.1, 114.2, ⋯, 115.2, 115.1, 114.8]      112.8   116.4   114.7   0.58
q3     [110.1, 110.5, 110.4, ⋯, 111.7, 111.3, 111.0]      108.9   113.1   110.7   0.62
The structural distances between SARS-CoV-2 nucleotide sequences and random ones are calculated. The results are presented in Table 4.
Table 4

Distance and randomness ratio between SARS-CoV-2 and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    1589      29903   133.78  138.00  135.92  0.72  130.66     1.04
2    1772      29671   133.50  138.03  135.71  0.71  130.52     1.04
3    3834      29903   133.73  137.95  135.88  0.72  130.60     1.04
4    483       29798   133.73  137.92  135.85  0.72  130.58     1.04
5    1333      29869   133.94  137.92  135.84  0.72  130.63     1.04
6    4515      29862   133.94  137.92  135.84  0.72  130.67     1.04
7    4100      29846   133.72  137.94  135.85  0.72  130.66     1.04
8    1005      29855   133.68  137.91  135.82  0.72  130.70     1.04
9    1132      29743   133.70  137.92  135.85  0.72  130.62     1.04
10   4218      29857   133.35  137.93  135.68  0.72  130.50     1.04
11   3391      29835   133.73  137.96  135.88  0.72  130.65     1.04
12   2187      29816   133.41  137.89  135.74  0.70  130.62     1.04
13   2802      29782   133.48  137.64  135.73  0.69  130.61     1.04
14   1125      29726   133.39  137.81  135.76  0.70  130.55     1.04
15   1681      29903   133.72  137.92  135.85  0.72  130.59     1.04
16   3388      29834   133.72  138.00  135.91  0.72  130.57     1.04
17   3407      29834   133.41  138.10  135.70  0.70  130.50     1.04
18   2030      29835   133.77  137.99  135.91  0.72  130.53     1.04
19   1800      29827   133.75  137.94  135.88  0.72  130.75     1.04
20   2023      29808   133.77  137.99  135.91  0.72  130.65     1.04
The structural distances between MERS-CoV nucleotide sequences and random ones are calculated. The results are presented in Table 5.
Table 5

Distance and randomness ratio between MERS-CoV and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    394       30123   131.29  135.49  133.26  0.71  130.88     1.02
2    315       30123   131.16  135.40  133.22  0.73  130.90     1.02
3    324       30123   131.12  135.37  133.22  0.71  130.50     1.02
4    381       30123   130.59  135.05  132.91  0.69  131.07     1.01
5    46        30094   131.77  136.29  133.88  0.74  131.09     1.02
6    392       30123   130.34  135.47  133.06  0.69  130.93     1.02
7    282       30123   130.75  135.36  133.00  0.70  130.77     1.02
8    6         30081   131.26  135.40  133.24  0.71  131.10     1.02
9    210       30096   131.27  135.52  133.26  0.71  130.97     1.02
10   386       30123   131.28  135.30  133.23  0.71  130.88     1.02
11   484       30096   130.75  135.06  133.03  0.71  130.32     1.02
12   506       30118   130.84  135.24  133.03  0.71  131.07     1.01
13   241       30123   130.69  135.23  133.02  0.70  130.81     1.02
14   359       30123   130.87  135.24  133.05  0.71  130.88     1.02
15   209       30096   131.22  135.47  133.23  0.71  130.85     1.02
16   469       29455   130.35  135.42  133.07  0.69  130.82     1.02
17   59        29919   130.70  135.90  133.22  0.74  130.80     1.02
18   366       30123   130.77  134.99  133.00  0.70  130.93     1.02
19   354       30123   130.88  135.26  133.05  0.71  130.41     1.02
20   128       30118   130.79  135.39  133.03  0.70  130.78     1.02
The structural distances between SARS-CoV nucleotide sequences and random ones are calculated. The results are presented in Table 6.
Table 6

Distance and randomness ratio between SARS-CoV and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    10218     29849   130.21  134.75  132.30  0.71  130.22     1.02
2    7750      29782   130.20  134.79  132.30  0.71  130.12     1.02
3    6483      29782   129.81  134.56  132.22  0.73  130.15     1.02
4    805       29882   129.98  134.51  132.27  0.72  130.18     1.02
5    2660      29900   129.50  134.75  132.26  0.71  130.29     1.02
6    1856      29865   130.17  134.74  132.31  0.70  130.68     1.01
7    7126      29835   130.25  134.57  132.31  0.73  130.43     1.01
8    87        29767   130.14  134.69  132.23  0.71  130.61     1.01
9    3289      29882   130.32  134.66  132.31  0.72  130.43     1.01
10   5307      29868   130.22  134.54  132.29  0.73  130.18     1.02
11   9593      29858   130.17  134.55  132.21  0.72  130.43     1.01
12   8925      29867   130.02  134.50  132.19  0.72  130.03     1.02
13   6020      29836   130.17  134.55  132.21  0.72  130.29     1.01
14   7029      29769   130.03  134.48  132.20  0.72  130.07     1.02
15   4783      29860   130.18  134.55  132.21  0.72  130.27     1.01
16   1804      29902   130.15  134.71  132.24  0.71  130.12     1.02
17   6852      29842   130.13  134.70  132.24  0.71  130.43     1.01
18   2415      29812   130.01  134.30  132.18  0.72  130.07     1.02
19   681       29890   130.12  134.46  132.25  0.72  130.52     1.01
20   3075      29808   130.11  134.46  132.23  0.73  130.41     1.01
The structural distances between alphacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 7.
Table 7

Distance and randomness ratio between alphacoronavirus and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    328       27993   126.25  131.20  128.95  0.71  126.58     1.02
2    205       27998   133.48  138.02  136.05  0.69  125.78     1.08
3    881       28029   133.65  138.18  135.79  0.74  125.28     1.08
4    137       27410   130.80  135.79  133.18  0.73  129.76     1.03
5    4         29355   130.70  135.07  132.75  0.69  128.91     1.03
6    877       27516   130.02  134.36  131.92  0.72  127.54     1.03
7    739       28009   129.30  133.56  131.45  0.70  128.01     1.03
8    723       28029   129.84  134.29  131.87  0.72  127.80     1.03
9    615       27489   133.74  138.24  135.92  0.73  126.11     1.08
10   140       27413   133.70  138.41  135.99  0.70  125.35     1.08
11   529       28595   129.90  134.35  131.89  0.71  127.58     1.03
12   764       28173   128.37  132.50  130.34  0.71  126.47     1.03
13   36        29295   127.71  132.27  129.86  0.70  125.47     1.03
14   118       29357   129.94  134.27  131.98  0.69  127.77     1.03
15   917       27165   129.79  133.95  131.95  0.72  127.72     1.03
16   686       28038   132.14  136.16  134.10  0.69  129.04     1.04
17   547       28521   126.37  131.24  129.01  0.71  126.92     1.02
18   820       28038   125.67  130.34  127.76  0.71  124.56     1.03
19   393       27993   125.36  130.07  127.45  0.70  124.85     1.02
20   238       27998   125.45  130.13  127.55  0.70  124.63     1.02
The structural distances between deltacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 8.
Table 8

Distance and randomness ratio between deltacoronavirus and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    6         25422   125.14  129.84  127.57  0.70  122.91     1.04
2    117       25393   125.22  129.71  127.22  0.71  122.84     1.04
3    91        25399   127.27  131.43  129.42  0.73  123.05     1.05
4    33        25422   123.34  128.16  125.41  0.71  123.02     1.02
5    116       25414   119.74  124.91  122.56  0.69  120.45     1.02
6    87        25413   119.90  124.63  122.44  0.70  120.41     1.02
7    63        25420   122.77  127.27  124.98  0.70  122.04     1.02
8    73        25406   121.12  125.58  123.49  0.71  121.71     1.01
9    65        25420   123.14  128.15  125.31  0.72  123.23     1.02
10   138       26227   123.57  128.48  125.89  0.72  122.02     1.03
11   120       25403   124.56  129.30  126.97  0.72  122.48     1.04
12   129       25424   127.75  132.35  129.91  0.72  122.31     1.06
13   90        25414   120.35  124.48  122.43  0.71  120.53     1.02
14   107       25422   120.46  124.58  122.35  0.70  120.46     1.02
15   22        25408   120.41  124.54  122.48  0.71  120.22     1.02
16   4         26552   120.23  124.43  122.38  0.71  120.48     1.02
17   29        25422   120.33  124.48  122.45  0.71  120.46     1.02
18   119       25413   120.32  124.48  122.43  0.71  120.22     1.02
19   34        25438   120.33  124.50  122.44  0.71  120.41     1.02
20   131       26487   120.39  124.53  122.48  0.71  120.55     1.02
The structural distances between gammacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 9.
Table 9

Distance and randomness ratio between gammacoronavirus and random sequences.

No.  Sequence  Length  Min     Max     Mean    Sd.   Mean rand  RRR
1    134       27676   131.36  135.84  133.55  0.72  125.93     1.06
2    339       27603   130.72  135.93  133.04  0.71  125.69     1.06
3    385       27733   130.77  135.76  133.14  0.74  125.04     1.06
4    384       27755   130.64  135.54  132.79  0.74  124.90     1.06
5    87        27691   131.14  135.36  133.28  0.72  125.41     1.06
6    267       27675   130.46  135.08  132.82  0.74  125.49     1.06
7    151       27388   130.27  134.91  132.76  0.72  125.59     1.06
8    37        27690   130.98  135.48  133.13  0.70  125.50     1.06
9    47        27616   131.93  136.30  133.94  0.72  125.73     1.07
10   137       27618   130.94  135.36  133.25  0.72  125.39     1.06
11   88        27630   130.89  135.31  133.09  0.71  125.63     1.06
12   42        27620   131.08  135.88  133.47  0.71  125.95     1.06
13   238       27685   131.33  135.72  133.46  0.73  125.47     1.06
14   317       27590   142.40  147.90  145.41  0.73  130.85     1.11
15   133       27617   130.66  135.16  132.98  0.71  125.82     1.06
16   278       27686   130.74  135.31  133.08  0.71  125.44     1.06
17   144       27682   132.28  136.87  134.46  0.72  125.82     1.07
18   241       27685   131.28  135.79  133.44  0.71  125.37     1.06
19   334       27474   130.71  135.07  132.67  0.71  125.63     1.06
20   378       27642   129.80  135.01  132.44  0.70  125.72     1.05
RRR for each subfamily is calculated; the way to calculate it is explained in Section 4.2.

4. Results

We use R (version 4.0.2), in particular the package “Biostrings”, to implement the theoretical setting. Following the procedures described in Section 3, we present the results in this section. We set the length of each random nucleotide sequence to 30000, which is approximately the genome length in the SARS-CoV virus family. We also use R to generate 1000 sample sequences for our experiment (a number limited by the capacity of our computers).

4.1. Experiment: Randomness of Nucleotide Sequences

Through Definition 4 and Example 1, we obtain the distance matrix of the 1000 random sequences. After removing the diagonal, we calculate some descriptive values for the 999 × 999 elements: the minimum, maximum, mean, and standard deviation of the whole distance matrix. The minimum is 127.1 and the maximum is 134.7. The mean is 130.88 and the standard deviation is 0.83. Since the standard deviation is very small, the structural distance between any pair of random nucleotide sequences is highly concentrated around the mean; this is a good referential property for our further analysis. Now, let us demonstrate the distances between some patterned sequences and random sequences.
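
A scaled-down version of this experiment can be sketched in Python (the article used R with Biostrings; everything below is illustrative). The sketch assumes the per-base distance in the form used in Example 1, uses 20 sequences of length 1000 instead of 1000 sequences of length 30000 so that it runs in well under a second, and fixes an arbitrary seed. The function names are hypothetical, and the resulting numbers are not expected to match the values reported above.

```python
import math
import random
import statistics

BASES = "ACGT"

def base_positions(seq):
    """Map each base to the ascending list of its 1-based positions in seq."""
    pos = {q: [] for q in BASES}
    for i, ch in enumerate(seq, start=1):
        pos[ch].append(i)
    return pos

def diff_norm(v):
    """Euclidean norm of the difference vector of v."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(v, v[1:])))

def common_sequence(u, v):
    """Sorted union of the elements of u and v lying in their common interval."""
    if not u or not v:
        return []
    a, b = max(u[0], v[0]), min(u[-1], v[-1])
    return sorted({x for x in u + v if a <= x <= b})

def delta_from_positions(pos1, pos2):
    """Equal-weight (1/4 each) structural distance between two position maps."""
    total = 0.0
    for q in BASES:
        u, v = pos1[q], pos2[q]
        total += (diff_norm(u) + diff_norm(v)) / 2 - diff_norm(common_sequence(u, v))
    return total / 4

rng = random.Random(0)          # assumption: arbitrary seed, for repeatability only
n_seqs, length = 20, 1000       # assumption: far smaller than the article's setup
seqs = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n_seqs)]
maps = [base_positions(s) for s in seqs]

# The matrix is symmetric, so each unordered pair (i < j) is computed once;
# the diagonal (zero self-distances) is excluded, as in the text.
dists = [delta_from_positions(maps[i], maps[j])
         for i in range(n_seqs) for j in range(i + 1, n_seqs)]

summary = {
    "min": min(dists),
    "max": max(dists),
    "mean": statistics.mean(dists),
    "sd": statistics.stdev(dists),
}
print(summary)
```

Raising `n_seqs` and `length` toward the article's values makes the run time grow quadratically in the number of sequences, which is consistent with the capacity limits mentioned above.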

Example 2 .

Suppose A, C, G, T are bundled and repeated 7500 times with |q1| = 30000; moreover, AA, CC, GG, TT are bundled and repeated 3750 times with |q2| = 30000; finally, AACGAT (one period of the Fibonacci sequence F taken mod 4, where 1, 2, 3, and 4 are identified with “A”, “C”, “G”, and “T”, respectively, and the residue 0 is read as 4) is bundled and repeated 5000 times with |q3| = 30000, as shown in the following:
q1 = (“A”, “C”, “G”, “T”, “A”, “C”, “G”, “T”, ⋯, “A”, “C”, “G”, “T”)
q2 = (“A”, “A”, “C”, “C”, “G”, “G”, “T”, “T”, ⋯, “G”, “G”, “T”, “T”)
q3 = (“A”, “A”, “C”, “G”, “A”, “T”, “A”, “A”, ⋯, “C”, “G”, “A”, “T”)
The distances between each q and the random sequences are listed in Table 3. The structural distances between patterned sequences and random ones clearly differ from the distances among the random sequences themselves.
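
The three patterned sequences can be generated as in the following Python sketch (illustrative only; the helper names are hypothetical). It also checks that one period of the Fibonacci numbers mod 4, with residue 0 mapped to “T”, yields the block AACGAT, and that each repeated pattern has length 30000.

```python
def repeat_pattern(pattern, length):
    """Repeat `pattern` until the sequence has exactly `length` symbols."""
    reps, remainder = divmod(length, len(pattern))
    if remainder:
        raise ValueError("length must be a multiple of the pattern length")
    return pattern * reps

def fibonacci_mod4_pattern():
    """One period (6 terms) of the Fibonacci numbers mod 4, as bases.

    Residues 1, 2, 3 map to A, C, G; residue 0 is read as 4 and maps to T.
    """
    letters = {1: "A", 2: "C", 3: "G", 4: "T"}
    a, b = 1, 1
    out = []
    for _ in range(6):              # the Fibonacci numbers mod 4 repeat every 6 terms
        out.append(letters[a % 4 or 4])
        a, b = b, a + b
    return "".join(out)

q1 = repeat_pattern("ACGT", 30000)                    # 7500 repetitions
q2 = repeat_pattern("AACCGGTT", 30000)                # 3750 repetitions
q3 = repeat_pattern(fibonacci_mod4_pattern(), 30000)  # 5000 repetitions

print(fibonacci_mod4_pattern())    # AACGAT
print(len(q1), len(q2), len(q3))   # 30000 30000 30000
```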

4.2. Distance for Nucleotide Sequences

We import SARS-CoV-2 genomic codes and save them in S4DSC2 [15]. Since S4DSC2 is too large (4617 sequences, {s1, s2, ⋯, s4617}) to be handled by our computer, we sample only 20 of them. The results are presented in Table 4, where column “Sequence” is the order of the sampled sequence in the data set; “Min” and “Max” are the minimal and maximal distances between the given sequence and the random sequences, respectively; “Mean” is the average distance between the given sequence and the random sequences; “Sd” is the standard deviation of this set of distances; “Mean rand” is the average distance of the distance matrix of random sequences; and “RRR” is the relative randomness ratio, which is “Mean” divided by “Mean rand.” The columns have the same meanings in the later tables, so we do not repeat the explanation.
For MERS-CoV, the size of data downloaded is 530. We sample 20 of them randomly. The results are presented in Table 5.
For SARS-CoV, the size of data downloaded is 10647. We sample 20 of them randomly. The results are presented in Table 6.
For alphacoronavirus, the size of data downloaded and filtered is 1002. We sample 20 of them randomly. The results are presented in Table 7.
For deltacoronavirus, the size of data downloaded and filtered is 149. We sample 20 of them randomly. The results are presented in Table 8.
For gammacoronavirus, the size of data downloaded and filtered is 427. We sample 20 of them randomly. The results are presented in Table 9.
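
The RRR column can be rechecked from the “Mean” and “Mean rand” columns with a few lines of Python (illustrative; the function name is hypothetical). The three value pairs are copied from Tables 4, 5, and 9.

```python
def relative_randomness_ratio(mean_dist, mean_rand):
    """RRR = mean distance to the random sequences / mean distance among them."""
    if mean_rand <= 0:
        raise ValueError("mean_rand must be positive")
    return mean_dist / mean_rand

# (Mean, Mean rand) pairs copied from the tables.
rows = {
    "SARS-CoV-2, Table 4, row 1": (135.92, 130.66),
    "MERS-CoV, Table 5, row 1": (133.26, 130.88),
    "gammacoronavirus, Table 9, row 14": (145.41, 130.85),
}

for label, (mean_dist, mean_rand) in rows.items():
    print(label, round(relative_randomness_ratio(mean_dist, mean_rand), 2))
# SARS-CoV-2, Table 4, row 1 1.04
# MERS-CoV, Table 5, row 1 1.02
# gammacoronavirus, Table 9, row 14 1.11
```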

5. Conclusion

By observing all the results presented in the tables, we reach the following statements:
(i) The structural distances between random (nucleotide) sequences are highly concentrated, with a low standard deviation. This feature justifies their referential role under the structural metric.
(ii) The patterned nucleotide sequences have lower means and lower standard deviations in their distances to random sequences.
(iii) The relative randomness ratios (RRR) for Coronaviridae, which lie between 1.01 and 1.08 apart from one gammacoronavirus sample at 1.11 (Table 9), are much closer to the complete randomness ratio (i.e., 1) than those of the patterned nucleotide sequences, which lie around 0.84 in our examples.
(iv) Overall, the randomness of betacoronavirus is higher than that of alphacoronavirus or deltacoronavirus, which in turn are higher than the structural distances between SARS-CoV-2 and random sequences. This could probably explain why the mutations of betacoronavirus are higher than those of other subfamilies.
(v) In the betacoronavirus, the RRR of SARS-CoV-2 is almost fixed at 1.04. This indicates that the mutations of SARS-CoV-2 are stabilized at this moment.
These findings provide some insight into the degree of structural randomness of SARS-CoV-2 and its related families. Linking this knowledge to other research results and findings would help us map out the dynamical structures and evolution of these viruses.
  13 in total

1.  Identification of Geographic Specific SARS-Cov-2 Mutations by Random Forest Classification and Variable Selection Methods.

Authors:  Manoj Kandpal; Ramana V Davuluri
Journal:  Stat Appl       Date:  2020-06-30

Review 2.  Viral evolution and the emergence of SARS coronavirus.

Authors:  Edward C Holmes; Andrew Rambaut
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2004-07-29       Impact factor: 6.237

3.  Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic.

Authors:  Maciej F Boni; Philippe Lemey; Xiaowei Jiang; Tommy Tsan-Yuk Lam; Blair W Perry; Todd A Castoe; Andrew Rambaut; David L Robertson
Journal:  Nat Microbiol       Date:  2020-07-28       Impact factor: 17.745

Review 4.  Origin and evolution of pathogenic coronaviruses.

Authors:  Jie Cui; Fang Li; Zheng-Li Shi
Journal:  Nat Rev Microbiol       Date:  2019-03       Impact factor: 60.633

5.  Demographic science aids in understanding the spread and fatality rates of COVID-19.

Authors:  Jennifer Beam Dowd; Liliana Andriano; David M Brazel; Valentina Rotondi; Per Block; Xuejie Ding; Yan Liu; Melinda C Mills
Journal:  Proc Natl Acad Sci U S A       Date:  2020-04-16       Impact factor: 11.205

6.  Genetic diversity and evolution of SARS-CoV-2.

Authors:  Tung Phan
Journal:  Infect Genet Evol       Date:  2020-02-21       Impact factor: 3.342

7.  SARS-CoV-2 variants: Relevance for symptom granularity, epidemiology, immunity (herd, vaccines), virus origin and containment?

Authors:  Antoine Danchin; Kenneth Timmis
Journal:  Environ Microbiol       Date:  2020-05-19       Impact factor: 5.476

8.  A global survey of potential acceptance of a COVID-19 vaccine.

Authors:  Jeffrey V Lazarus; Scott C Ratzan; Adam Palayew; Lawrence O Gostin; Heidi J Larson; Kenneth Rabin; Spencer Kimball; Ayman El-Mohandes
Journal:  Nat Med       Date:  2020-10-20       Impact factor: 53.440

9.  The proximal origin of SARS-CoV-2.

Authors:  Kristian G Andersen; Andrew Rambaut; W Ian Lipkin; Edward C Holmes; Robert F Garry
Journal:  Nat Med       Date:  2020-04       Impact factor: 87.241
