
Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling.

Yee Leng Yap¹, Xue Wu Zhang, Antoine Danchin.

Abstract

BACKGROUND: The exact origin of the cause of the Severe Acute Respiratory Syndrome (SARS) is still an open question. The genomic sequence relationship of SARS-CoV with 30 different single-stranded RNA (ssRNA) viruses of various families was studied using two non-standard approaches. Both approaches began with the vectorial profiling of the tetra-nucleotide usage pattern V for each virus. In approach one, a distance measure on the vectors V, based on the correlation coefficient, was devised to construct a relationship tree by the neighbor-joining algorithm. In approach two, a multivariate factor analysis was performed to derive the embedded tetra-nucleotide usage patterns. These patterns were subsequently used to classify the selected viruses.
RESULTS: Both approaches yielded relationship outcomes that are consistent with the known virus classification. They also indicated that the genomes of RNA viruses from the same family conform to a specific pattern of word usage. Based on the correlation of the overall tetra-nucleotide usage patterns, the Transmissible Gastroenteritis Virus (TGV) and the Feline CoronaVirus (FCoV) are closest to SARS-CoV. Surprisingly, the RNA viruses that do not go through a DNA stage also displayed a remarkable discrimination against the CpG and UpA di-nucleotides (z = -77.31 and -52.48, respectively) and selection for UpG and CpA (z = 65.79 and 49.99, respectively). Potential factors influencing these biases are discussed.
CONCLUSION: The study of genomic word usage is a powerful method to classify RNA viruses. The congruence of the relationship outcomes with the known classification indicates that there exist phylogenetic signals in the tetra-nucleotide usage patterns, which are most prominent in the replicase open reading frames.

Substances：
RNA, Viral

Year: 2003 PMID： 14499005 PMCID： PMC222961 DOI： 10.1186/1471-2105-4-43

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

Severe Acute Respiratory Syndrome (SARS), a newly identified infectious disease, has imperilled the health of the human population in more than 30 nations. By July 2, 2003, it had claimed over 812 lives and infected more than 8442 people (9.61% death rate) [1] since its outbreak in November 2002 in the province of GuangDong, People's Republic of China. By May 15, 2003, the primary etiological agent for SARS was found to fulfil Koch's postulates through experimental infection of cynomolgus macaques (Macaca fascicularis) [2]. Chronicles of the discovery of the SARS CoronaVirus (SARS-CoV) can be found in articles [e.g. [3,4]] and websites [e.g. [5]]. A common question arises when investigating viral evolution: what hallmark, in terms of genome sequence or RNA word usage, could be used to trace back the emergence of a new pathogen in humans or animals? In particular, CoronaViruses are prone to recombination [6,7] and, like all other viruses, they mutate at a high frequency [8]. This makes it extremely hazardous to try to trace the origin of the virus. Nevertheless, this prompted us to investigate their relationships using RNA word usage, hoping to identify RNA viruses that display a similar word usage pattern. Such RNA viruses might hint at the origin of SARS-CoV. This study will contribute to our understanding of the RNA word usage of SARS-CoV and some other pathogenic RNA viruses. In the present study, we explored the relationships of 31 RNA viruses, including SARS-CoV, which are known to cause diseases in their corresponding hosts with either similar symptoms or infectiousness, based on their global tetra-nucleotide usage pattern. Preliminary analysis of the sequence data indicated that there are 11–14 open reading frames in the SARS-CoV genome [9-11]. The overall gene order for this novel pathogen supported its placement in the family Coronaviridae, which includes the animal and human CoronaViruses.
It should be emphasized that the sequence similarity shown is attributed mainly to the large RNA-dependent RNA polymerase (replication enzyme or RdRp) residing in the first two open reading frames (ORFs). These two ORFs constitute more than 65% (>20 kb) of the total genome size and these regions are more conserved in their nucleotide sequences due to their specialized role for viral RNA replication. Therefore, the possible relationship based on the sequence of the replication enzyme alone was also investigated.

Results and Discussion

Mono-nucleotide bias

Table 1 presents the breakdown of the RNA sequence into mononucleotide frequencies for the 31 viral genomes in our dataset. Except for the Rabbit Hemorrhagic disease Virus (RHV), which uses the four nucleotides in approximately equal proportions, the RNA viruses have a biased genome composition. Bovine CoronaVirus (BCoV) and Human CoronaVirus 229E (HCoV) favor the U nucleotide (35.5% and 34.6%) at the expense of the C nucleotide (15.3% and 16.7%). Relatively strong nucleotide biases are visible in the other genomes, and we mention a few of the extremes. The highest base counts are 28.4% G in the Yellow Fever Virus (YFV), 38.9% A in the Respiratory Syncytial Virus (RSV), 35.5% U in the Bovine CoronaVirus (BCoV) and 28.5% C in the Foot-and-Mouth disease Virus (FMV). The lowest base counts are 15.8% G in the Human Respiratory syncytial Virus (HRV), 21.2% A in the Equine arteritis Virus (EV1), 20.9% U in the Igbo Ora Virus (IOV) and 13.6% C in the Bovine ephemeral Fever Virus (BFV). The A nucleotide is the most frequent base among RNA viruses (ranging from 21.2% to 38.9%), and C is the most variable nucleotide (ranging from 13.6% to 33.1%).
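The per-genome breakdown described above can be sketched in a few lines. This is a minimal illustration only: the function name `base_composition` and the handling of DNA-style records (T read as U) are assumptions made here, not part of the original analysis.

```python
from collections import Counter

def base_composition(seq):
    """Return the percentage of G, A, U and C, plus the A+U content."""
    seq = seq.upper().replace("T", "U")   # assumption: accept DNA-style records too
    counts = Counter(b for b in seq if b in "GAUC")   # ignore ambiguous symbols
    total = sum(counts.values())
    pct = {b: 100.0 * counts[b] / total for b in "GAUC"}
    pct["A+U"] = pct["A"] + pct["U"]
    return pct

# Toy sequence, not a real genome
print(base_composition("GGAAUUUUCC"))
```

Applied to a full genome record, the returned values correspond to the G, A, U, C and A+U% columns of Table 1.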

Table 1

RNA virus in current study.

		Virus Name	Type	Accession Number	DNA Stage	Segment	Acronym	Size (nt)	G	A	U	C	A+U%
ssRNA positive-strand viruses	1	Avian infectious bronchitis virus	ss-RNA	NC_001451	No	1	ABV	27608	21.7	28.9	33.2	16.2	62.1
	2	Bovine coronavirus	ss-RNA	NC_003045	No	1	BCoV	31028	21.8	27.4	35.5	15.3	62.9
	3	Equine arteritis virus	ss-RNA	NC_002532	No	1	EV1	12704	26.0	21.2	27.1	25.6	48.3
	4	Human coronavirus 229E	ss-RNA	NC_002645	No	1	HCoV	27317	21.6	27.2	34.6	16.7	61.7
	5	Lactate dehydrogenase-elevating virus	ss-RNA	NC_002534	No	1	LDV	14225	25.9	23.1	28.2	22.6	51.3
	6	Murine hepatitis virus	ss-RNA	NC_001846	No	1	MHV	31357	23.9	26.0	32.3	17.9	58.2
	7	Porcine epidemic diarrhea virus	ss-RNA	NC_003436	No	1	PDV	28033	22.8	24.7	33.2	19.2	58.0
	8	Porcine reproductive and respiratory syndrome virus	ss-RNA	NC_001961	No	1	PRV	15428	26.2	21.7	25.3	26.7	47.0
	9	SARS coronavirus	ss-RNA	NC_004718	No	1	SAR	29751	20.8	28.5	30.7	20.0	59.2
	10	Feline coronavirus	ss-RNA	AY204704	No	1	FCoV	9979	22.6	27.9	29.2	20.3	57.2
	11	Simian hemorrhagic fever virus	ss-RNA	NC_003092	No	1	SFV	15717	22.6	22.5	27.4	27.5	49.9
	12	Transmissible gastroenteritis virus	ss-RNA	NC_002306	No	1	TGV	28586	20.6	29.5	32.9	17.0	62.4
	13	Avian encephalomyelitis virus	ss-RNA	NC_003990	No	1	AEV	7055	25.7	27.0	28.3	19.0	55.3
	14	Bovine viral diarrhea virus genotype 2	ss-RNA	NC_002032	No	1	BDV	12255	25.2	32.7	22.3	19.8	54.9
	15	Foot-and-mouth disease virus C	ss-RNA	NC_002554	No	1	FMV	8115	25.6	24.8	21.2	28.5	45.9
	16	Igbo Ora virus	ss-RNA	NC_001924	No	1	IOV	11821	24.1	31.1	20.9	24.0	51.9
	17	Poliovirus	ss-RNA	NC_002058	No	1	PV1	7440	23.0	29.7	24.0	23.3	53.7
	18	Rabbit hemorrhagic disease virus	ss-RNA	NC_001543	No	1	RHV	7437	25.5	25.9	23.9	24.7	49.8
	19	Tamana bat virus	ss-RNA	NC_003996	No	1	TBV	10053	21.5	33.2	28.3	16.9	61.6
	20	Yellow fever virus	ss-RNA	NC_002031	No	1	YFV	10862	28	27	23	21	50

ssRNA negative-strand viruses	21	Avian paramyxovirus 6	ss-RNA	NC_003043	No	1	APV	16236	23	29	25	23	54
	22	Bovine ephemeral fever virus	ss-RNA	NC_002526	No	1	BFV	14900	20	38	28	14	66
	23	Bovine respiratory syncytial virus	ss-RNA	NC_001989	No	1	BRV	15140	17	38	29	17	66
	24	Canine distemper virus	ss-RNA	NC_001921	No	1	CDV	15690	22	31	26	21	57
	25	Human respiratory syncytial virus	ss-RNA	NC_001781	No	1	HRV	15225	16	39	28	18	67
	26	Hantaan virus	ss-RNA	AF345636	Yes	2	HV1	11772	21	33	29	17	62
	27	Influenza B virus	ss-RNA	NC_002208	Yes	8	IBV	14452	22	36	24	18	60
	28	Measles virus	ss-RNA	NC_001498	No	1	MV1	15894	24	29	23	24	53
	29	Respiratory syncytial virus	ss-RNA	NC_001803	No	1	RSV	15191	16	39	28	18	67
	30	Reston Ebola virus	ss-RNA	NC_004161	No	1	REV	18891	20	31	28	21	59
	31	Tioman virus	ss-RNA	NC_004074	No	1	TV2	15522	21	30	26	22	57

Information about the 31 RNA viruses investigated in this study. Their accession number, acronym, genome size, number of segments and whether they undergo a DNA stage are tabulated. The nucleotide composition and A+U content are also shown.

From the standpoint of the overall genomic composition analysis, the G+C content is an interesting property of a genome, in that the overall content often correlates with the pathogenicity of the organism [12]. Most pathogen genomes have a low G+C content, while some, such as that of Mycobacterium tuberculosis, have a relatively high G+C content. Therefore, as expected, we noted in Table 1 that most of the pathogenic viruses are A+U-rich (>50%), except for Porcine reproductive and Respiratory syndrome Virus (PRV), Equine arteritis virus (EV1), Rabbit hemorrhagic disease virus (RHV), Simian hemorrhagic Fever Virus (SFV) and Foot-and-Mouth disease Virus C (FMV).

Di-nucleotide bias

The frequencies of occurrence for di-nucleotides were compared to the random RNA counterparts having the same base proportion in order to compute the z value that reflects their di-nucleotide bias (Table 2). Among the 31 virus sequences examined, the di-nucleotide frequencies were not randomly distributed; only a few di-nucleotides starting with a purine residue were present at the expected frequencies (ApC, ApG, GpC, |z| < 3). A remarkable deviation from the expected frequencies occurs for the di-nucleotides CpG and UpA (suppression or under-representation, z < -50) as well as for the di-nucleotides CpA and UpG (enhancement or over-representation, z > 40). These di-nucleotide biases, together with the mono-nucleotide bias [13], have a direct impact on the codon usage of viruses. For example, in the codon usage for the 24 protein coding sequences in human CoronaVirus 229E (Table 3), only 2.85% of codons contain the under-represented CpG di-nucleotide whereas 11.26% of the codons contain the over-represented CpA di-nucleotide (the aggregate codon usage containing each di-nucleotide subword without mono- and di-nucleotide bias is close to 6.25%).
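The comparison against random counterparts with the same base proportion can be sketched with a simple shuffling experiment. Equation 1 is not reproduced in this section, so the form z = (N(w) - E(w)) / sd, with E(w) and sd estimated from shuffled copies of the sequence, is an assumption here. The names `count_word` and `shuffle_z` are hypothetical, and the sketch is not expected to reproduce the tabulated z values.

```python
import random
from statistics import mean, pstdev

def count_word(seq, word):
    """Count overlapping occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def shuffle_z(seq, word, n_shuffles=200, seed=0):
    """Return (N(w), E(w), z) using shuffles that keep the base composition."""
    rng = random.Random(seed)
    observed = count_word(seq, word)
    bases = list(seq)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)                      # same bases, random order
        counts.append(count_word("".join(bases), word))
    expected = mean(counts)
    sd = pstdev(counts)
    z = (observed - expected) / sd if sd > 0 else 0.0
    return observed, expected, z

# Toy sequence in which CG never occurs although C and G are both abundant
n_w, e_w, z = shuffle_z("G" * 50 + "C" * 50, "CG")
print(n_w, round(e_w, 1), round(z, 1))
```

A strongly negative z, as in this toy case, corresponds to the under-representation reported for CpG and UpA; a strongly positive z corresponds to over-representation.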

Table 2

Di-nucleotide bias for six RNA viruses.

	BCoV			MHV			SARS			ABV			HCoV			PDV			Average z value across 31 viruses
Di-nucleotide	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z
CG	497	1034	-103.81	798	1342	-104.39	566	1235	-121.14	486	976	-109.69	487	979	-95.94	684	1226	-102.03	-77.31
GC	1344	1037	62.19	1694	1341	62.33	1432	1236	36.12	1147	970	35.96	1164	976	37.86	1416	1228	35.05	5.74
AU	2845	3007	-26.44	2499	2614	-19.25	2234	2594	-58.91	2200	2642	-76.99	2092	2556	-81.97	1976	2296	-55.43	-15.54
UA	2818	3000	-30.10	2404	2616	-35.12	2080	2594	-87.64	2409	2641	-42.42	2033	2554	-84.51	1965	2299	-53.83	-52.48
AG	1824	1848	-4.25	1968	1941	4.77	1749	1760	-2.00	1844	1728	21.13	1416	1601	-34.78	1537	1579	-7.12	3.80
GA	1629	1849	-39.08	1745	1941	-32.74	1677	1764	-16.43	1505	1730	-39.47	1397	1598	-36.09	1358	1581	-38.05	-1.33
AC	1371	1303	12.93	1384	1458	-13.50	1978	1695	50.18	1474	1292	35.28	1558	1236	58.96	1594	1332	50.25	5.42
CA	1594	1297	56.03	1705	1453	46.19	2203	1695	87.29	1603	1290	59.90	1638	1234	74.68	1783	1327	83.96	49.99
CU	1801	1674	22.52	1874	1806	12.28	2190	1814	67.50	1661	1487	31.50	1724	1568	28.13	1953	1784	29.95	16.50
UC	1179	1674	-88.35	1296	1802	-94.30	1552	1815	-46.36	1127	1482	-65.41	1130	1568	-79.37	1410	1781	-67.80	-17.49
GU	2449	2394	9.10	2473	2402	11.92	1868	1898	-5.35	2154	1982	29.46	2240	2044	34.60	2262	2119	23.86	-7.13
UG	3101	2392	120.25	3146	2408	128.13	2663	1897	137.30	2476	1983	87.74	2898	2040	152.24	2814	2117	126.99	65.79

The di-nucleotide bias in six RNA viruses. The z value quantifies the di-nucleotide bias as defined in equation 1. N (w) and E (w) are actual and expected frequency of occurrence for word w. The last column is the average z value across 31 RNA viruses.

Table 3

Codon usage for Human CoronaVirus 229E (HCoV).

Amino Acid	Codon	Usage/%	Amino Acid	Codon	Usage/%
Arg	CGU	1.04	Ile	AUU	3.34
	CGC	0.41		AUC	0.74
	CGA	0.17		AUA	1.35
	CGG	0.13	Gly	GGU	4.12
	AGA	1.23		GGC	1.43
	AGG	0.36		GGA	0.67
Leu	UUA	1.49		GGG	0.22
	UUG	2.96	Val	GUU	6.00
	CUU	2.48		GUC	1.23
	CUC	0.46		GUA	1.09
	CUA	0.65		GUG	1.90
	CUG	0.63	Lys	AAA	3.15
Ser	UCU	2.70		AAG	2.31
	UCC	0.66	Asn	AAU	4.15
	UCA	1.37		AAC	1.82
	UCG	0.20	Gln	CAA	2.04
	AGU	1.86		CAG	1.17
	AGC	0.71	His	CAU	1.14
Thr	ACU	3.23		CAC	0.46
	ACC	0.76	Glu	GAA	2.81
	ACA	2.21		GAG	1.21
	ACG	0.29	Asp	GAU	3.09
Pro	CCU	1.65		GAC	1.96
	CCC	0.35	Tyr	UAU	3.00
	CCA	1.07		UAC	1.46
	CCG	0.19	Cys	UGU	2.26
Ala	GCU	3.58		UGC	0.95
	GCC	0.83	Phe	UUU	4.59
	GCA	1.80		UUC	1.10
	GCG	0.42

The relative usage of synonymous codons in the 24 known CDSs of Human Corona Virus 229E (HCoV).
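The two aggregate figures quoted in the text for HCoV (2.85% of codons containing CpG and 11.26% containing CpA) can be recovered by summing the Table 3 entries whose codon contains the di-nucleotide. The sketch below transcribes Table 3 into a dictionary; the function name `share_with_dinucleotide` is hypothetical, and the CCU entry, which is garbled in the source table, is read as 1.65 (an assumption that does not affect either sum).

```python
# Codon usage (%) for HCoV, transcribed from Table 3
USAGE = {
    "CGU": 1.04, "CGC": 0.41, "CGA": 0.17, "CGG": 0.13, "AGA": 1.23, "AGG": 0.36,
    "UUA": 1.49, "UUG": 2.96, "CUU": 2.48, "CUC": 0.46, "CUA": 0.65, "CUG": 0.63,
    "UCU": 2.70, "UCC": 0.66, "UCA": 1.37, "UCG": 0.20, "AGU": 1.86, "AGC": 0.71,
    "ACU": 3.23, "ACC": 0.76, "ACA": 2.21, "ACG": 0.29,
    "CCU": 1.65,  # garbled in the source table; assumed value
    "CCC": 0.35, "CCA": 1.07, "CCG": 0.19,
    "GCU": 3.58, "GCC": 0.83, "GCA": 1.80, "GCG": 0.42,
    "AUU": 3.34, "AUC": 0.74, "AUA": 1.35,
    "GGU": 4.12, "GGC": 1.43, "GGA": 0.67, "GGG": 0.22,
    "GUU": 6.00, "GUC": 1.23, "GUA": 1.09, "GUG": 1.90,
    "AAA": 3.15, "AAG": 2.31, "AAU": 4.15, "AAC": 1.82,
    "CAA": 2.04, "CAG": 1.17, "CAU": 1.14, "CAC": 0.46,
    "GAA": 2.81, "GAG": 1.21, "GAU": 3.09, "GAC": 1.96,
    "UAU": 3.00, "UAC": 1.46, "UGU": 2.26, "UGC": 0.95,
    "UUU": 4.59, "UUC": 1.10,
}

def share_with_dinucleotide(usage, dinucleotide):
    """Sum the usage of all codons containing the di-nucleotide at positions 1-2 or 2-3."""
    return sum(u for codon, u in usage.items() if dinucleotide in codon)

print(round(share_with_dinucleotide(USAGE, "CG"), 2))  # 2.85
print(round(share_with_dinucleotide(USAGE, "CA"), 2))  # 11.26
```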

In double-stranded DNA genomes, the deficiency in the di-nucleotide CpG is often attributed to the fact that CpG sites are targets for methyltransferase activity that leads to cytosine deamination [14,15]. It is, however, unlikely that a deamination mechanism that alters the genetic content at the DNA level would affect the RNA content of most RNA viruses, which have no DNA stage. There might exist specific cytosine RNA methylases that could be responsible for this effect [16]. However, it is more consistent to propose that, unlike cytosine deamination in the DNA realm, the dominating process in RNA viruses is cytosine deamination that converts cytosine to uracil (C → U) instead of thymine (T). As a consequence of this mechanism, the di-nucleotide CpG changes to either UpG or CpA in the direct/complementary strands of RNA viruses, causing the over-representation of UpG and CpA (z > 19). Interestingly, there is experimental evidence in vitro that the rate of cytosine deamination is more than 100 times faster in the single-stranded than in the double-stranded state [17]. Apart from the under-representation of CpG and the over-representation of CpA and UpG, the observed scarcity of the di-nucleotide UpA in RNA may be explained by its chemical lability [18]. The UpA dinucleotide is chemically the most unstable among the 16 dinucleotides. Furthermore, UpA appears to be a preferential target for ribonucleases [19]. This lability would create a selection pressure against the di-nucleotide UpA in RNA viruses. If we choose a critical value for z (|z| = 3.29) that only allows a 1 in 1000 chance of error when classifying a word as biased (over/under-represented), all di-nucleotides show some kind of bias in their usage pattern across the 31 different viruses (Table 4, derived from the complete form of Table 2 provided as additional file 1). The causes of these biases await further investigation.
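The link between the critical value |z| = 3.29 and the 1 in 1000 error rate can be checked numerically, assuming the z values follow a standard normal distribution under the null model (an assumption made for this sketch). The helper names `two_sided_p` and `percent_biased` are hypothetical; the example list holds the twelve SARS-CoV z values shown in Table 2, not the complete set of 16 di-nucleotides used for Table 4.

```python
import math

def two_sided_p(z):
    """Two-sided tail probability of a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def percent_biased(z_values, critical=3.29):
    """Percentage of words whose |z| exceeds the critical value."""
    flagged = sum(1 for z in z_values if abs(z) > critical)
    return 100.0 * flagged / len(z_values)

print(round(two_sided_p(3.29), 4))  # about 0.001, i.e. 1 in 1000

# SARS-CoV z values for the twelve di-nucleotides listed in Table 2
sars_z = [-121.14, 36.12, -58.91, -87.64, -2.00, -16.43,
          50.18, 87.29, 67.50, -46.36, -5.35, 137.30]
print(round(percent_biased(sars_z), 1))  # 11 of the 12 listed words exceed the threshold
```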

Table 4

Overall statistics for biased di-nucleotides and tetra-nucleotides.

Percentage of di-nucleotides that can be used to discriminate between viruses (|z| > 3.29)	Percentage of tetra-nucleotides that can be used to discriminate between viruses (|z| > 3.29)	Virus	Percentage of biased di-nucleotides (|z| > 3.29)/%	Percentage of biased tetra-nucleotides (|z| > 3.29)/%
100%	96.09%	BCoV	93.8	29.7
		MHV	93.8	28.1
		SARS	81.3	34.4
		ABV	81.3	27.3
		HCoV	93.8	31.3
		PDV	81.3	28.5
		TGV	87.5	31.6
		LDV	93.8	19.5
		PRV	93.8	15.6
		SFV	93.8	16.0
		FCoV	75.0	11.7
		EV1	87.5	14.5
		TBV	75.0	21.9
		AEV	93.8	11.7
		PV1	87.5	11.7
		YFV	93.8	29.3
		BDV	87.5	17.6
		RHV	93.8	9.4
		FMV	87.5	12.1
		IOV	75.0	9.8
		HV1	62.5	12.5
		RSV	87.5	18.8
		HRV	87.5	19.1
		BRV	93.8	19.9
		TV2	81.3	15.2
		REV	87.5	18.4
		MV1	81.3	15.2
		CDV	75.0	16.0
		APV	93.8	11.7
		BFV	81.3	15.2
		IBV	87.5	23.4

The percentage of di-nucleotides and tetra-nucleotides that show strong biases (|z| > 3.29) in 31 RNA viruses (right). For di-nucleotides, all 16 (100%) show strong biases in some or all of the 31 RNA viruses. For tetra-nucleotides, 246 (96%) show strong biases in some or all of the 31 RNA viruses.


Tetra-nucleotide bias

Inspection of the tetra-nucleotide usage pattern for RNA viruses (additional file 2) reveals considerable differences. The frequencies of occurrence for tetra-nucleotides were compared to artificial chromosomes constructed as random RNA sequences having the same nucleotide succession up to order three to compute the z values that reflect their tetra-nucleotide bias in the corresponding virus (Table 5). If we choose a critical value for z (|z| = 3.29) that only allows a chance of 1 in 1000 error for classifying a word as over/under-represented, 96% of the tetra-nucleotides show a strong bias in their usage pattern across 31 viruses (shown in Table 4, derived from the complete form of Table 5 provided as the additional file 1). This indicated strongly that tetra-nucleotides are being used in a different manner between different viruses, providing us with a tool to study the relationships between viruses based on the tetra-nucleotide bias exhibited in their genomes.
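One way to approximate such a comparison without simulating random sequences is the maximal-order Markov estimate, in which the expected count of a tetra-nucleotide is derived from the observed tri- and di-nucleotide counts. This analytic approach is swapped in here for illustration and is not necessarily the procedure behind Table 5; the variance expression is the approximation commonly paired with this estimator, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count overlapping words of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tetra_z_profile(seq):
    """Return {word: z} for all 256 tetra-nucleotides of an RNA sequence."""
    n2, n3, n4 = (kmer_counts(seq, k) for k in (2, 3, 4))
    profile = {}
    for word in map("".join, product("ACGU", repeat=4)):
        left, right, mid = n3[word[:3]], n3[word[1:]], n2[word[1:3]]
        if mid == 0:
            profile[word] = 0.0           # word cannot occur; treated as unbiased
            continue
        expected = left * right / mid     # E(w1w2w3w4) = N(w1w2w3) N(w2w3w4) / N(w2w3)
        var = expected * (mid - left) * (mid - right) / mid ** 2
        profile[word] = (n4[word] - expected) / math.sqrt(var) if var > 0 else 0.0
    return profile

# Toy sequence: ACGU occurs twice as often as its tri-nucleotides predict
profile = tetra_z_profile("ACGU" * 20 + "GCGA" * 20)
print(round(profile["ACGU"], 2))
```

The resulting 256-value dictionary plays the role of the usage vector V for one virus; words with |z| above the chosen critical value would be classified as biased.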

Table 5

Tetra-nucleotide bias for three RNA viruses. The tetra-nucleotide bias in three viruses. z value quantifies the tetra-nucleotide bias, as defined in equation (1). N (w) and E (w) are actual and expected frequency of occurrence for word w.

	BCoV			MHV			SARS				BCoV			MHV			SARS
Tetra-nucleotide	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z	Tetra-nucleotide	N(w)	E(w)	z	N(w)	E(w)	z	N(w)	E(w)	z
AAAA	148	206.2	-7.4	147	145.8	0.2	222	216.5	0.7	UAAA	264	226.2	4.6	187	170.9	2.2	170	183	-1.7
AAAC	110	103.7	1.1	98	91.2	1.3	154	148.1	0.9	UAAC	78	105.1	-4.8	85	122.4	-6.1	123	128	-0.8
AAAG	184	169.7	2.0	173	133.6	6.2	165	158.8	0.9	UAAG	205	171.7	4.6	193	165.3	3.9	107	134.4	-4.3
AAAU	217	220	-0.4	179	164.9	2.0	213	200.6	1.6	UAAU	322	309.9	1.3	245	259.8	-1.7	166	193.9	-3.6
AACA	133	114.1	3.2	113	112.1	0.2	215	175.7	5.4	UACA	178	163.7	2.0	122	123.4	-0.2	230	200.3	3.8
AACC	76	61.3	3.4	107	75.2	6.7	102	92.5	1.8	UACC	97	82.9	2.8	106	98.5	1.4	118	97.1	3.8
AACG	29	40.7	-3.3	35	61.8	-6.2	44	66.3	-5.0	UACG	50	54.4	-1.1	58	72.6	-3.1	46	63.3	-3.9
AACU	91	121.5	-5.0	84	106.5	-4.0	171	168.5	0.4	UACU	196	205.2	-1.2	153	168.2	-2.1	195	192.2	0.4
AAGA	172	157.9	2.0	176	136.4	6.2	184	161.8	3.2	UAGA	128	123.4	0.8	119	124.7	-0.9	102	119.6	-2.9
AAGC	137	103.8	5.9	140	103	6.6	96	112.8	-2.9	UAGC	79	98.7	-3.6	82	118.7	-6.1	71	84.8	-2.7
AAGG	133	121.3	1.9	159	122.4	6.0	140	117.3	3.8	UAGG	73	78.8	-1.2	67	121.3	-9.0	74	75.6	-0.3
AAGU	191	180.6	1.4	179	163.1	2.3	136	139.4	-0.5	UAGU	171	213	-5.2	161	190.7	-3.9	101	126.8	-4.2
AAUA	189	215.2	-3.2	148	182.1	-4.6	113	154.1	-6.0	UAUA	251	237	1.7	192	189	0.4	99	136.5	-5.8
AAUC	100	104.5	-0.8	75	93.2	-3.4	93	121.9	-4.8	UAUC	84	112.3	-4.9	86	99.9	-2.5	84	116.1	-5.4
AAUG	246	229.3	2.0	234	232.1	0.2	230	201.5	3.7	UAUG	310	271.5	4.3	278	238.1	4.7	189	190.3	-0.2
AAUU	265	265.5	-0.1	212	207.8	0.5	211	212	-0.1	UAUU	314	345	-3.0	253	248.5	0.5	190	211.8	-2.7
ACAA	144	137.1	1.1	115	114.1	0.2	269	204.1	8.3	UCAA	131	130	0.2	136	117.7	3.1	202	174.1	3.8
ACAC	84	66.4	3.9	88	75.4	2.6	168	142.2	3.9	UCAC	53	60.4	-1.7	57	67.1	-2.2	130	109.6	3.5
ACAG	118	105.7	2.2	108	104.9	0.5	151	145.1	0.9	UCAG	107	122.1	-2.5	105	106.7	-0.3	110	121.4	-1.9
ACAU	128	123.5	0.7	106	122.7	-2.7	186	172.9	1.8	UCAU	84	124.6	-6.6	88	117.8	-5.0	153	146.7	0.9
ACCA	105	76.9	5.8	116	85	6.1	161	117.7	7.3	UCCA	68	73.4	-1.1	74	80.6	-1.3	76	95.4	-3.6
ACCC	56	37.3	5.6	84	57.6	6.3	54	60.7	-1.6	UCCC	31	37.4	-1.9	45	56	-2.7	31	44.6	-3.7
ACCG	24	35.5	-3.5	52	57.4	-1.3	31	48.8	-4.6	UCCG	15	26	-3.9	37	55.1	-4.4	19	29.7	-3.5
ACCU	83	77.7	1.1	97	97.7	-0.1	139	111.2	4.8	UCCU	74	101.4	-4.9	103	102.4	0.1	80	107.1	-4.8
ACGA	32	44.5	-3.4	31	56.5	-6.2	40	64.4	-5.5	UCGA	18	41.4	-6.6	42	50.4	-2.2	43	67.5	-5.4
ACGC	29	34.1	-1.6	49	52.5	-0.9	31	54.8	-5.8	UCGC	30	34.1	-1.3	45	54	-2.2	38	56.2	-4.4
ACGG	26	31.3	-1.7	46	48.5	-0.7	26	41	-4.2	UCGG	19	29.1	-3.4	33	49.8	-4.3	16	39.9	-6.9
ACGU	47	60.5	-3.1	53	72.1	-4.1	53	72.6	-4.2	UCGU	51	74.1	-4.9	59	74.8	-3.3	73	84.4	-2.3
ACUA	141	127.9	2.1	119	121.3	-0.4	166	167.3	-0.2	UCUA	116	130.2	-2.3	115	124.5	-1.5	130	140.9	-1.7
ACUC	49	68.2	-4.2	61	68.9	-1.7	119	115	0.7	UCUC	52	67.1	-3.3	69	69.6	-0.1	82	108.4	-4.6
ACUG	144	131.5	2.0	126	141.4	-2.4	159	163.3	-0.6	UCUG	119	135.5	-2.6	117	135.5	-2.9	133	141.4	-1.3
ACUU	142	160.9	-2.7	116	132.1	-2.5	207	184.9	3.0	UCUU	195	191.8	0.4	153	142.8	1.6	219	182.6	4.9
AGAA	147	141.6	0.8	162	126.6	5.7	144	158.7	-2.1	UGAA	174	195.3	-2.8	154	176.8	-3.1	164	180.2	-2.2
AGAC	67	71.9	-1.1	87	80.6	1.3	114	117.2	-0.5	UGAC	86	101.8	-2.8	118	127.6	-1.5	153	151.9	0.2
AGAG	107	88.9	3.5	115	103.4	2.1	146	112.3	5.8	UGAG	96	127.7	-5.1	144	167	-3.2	117	136.2	-3.0
AGAU	177	170.4	0.9	158	145.2	1.9	128	141.1	-2.0	UGAU	314	311.7	0.2	243	261	-2.0	215	196.5	2.4
AGCA	113	105.8	1.3	102	112.1	-1.7	105	118.8	-2.3	UGCA	181	166.3	2.1	187	161.5	3.6	166	182	-2.2
AGCC	77	54.1	5.7	91	80.7	2.1	68	55.4	3.1	UGCC	102	81.6	4.1	144	122.4	3.6	114	81.6	6.5
AGCG	48	44.9	0.8	62	71.2	-2.0	32	46.4	-3.8	UGCG	52	65.5	-3.0	86	97.6	-2.1	58	58.5	-0.1
AGCU	126	122.6	0.6	132	132	0.0	140	146.9	-1.0	UGCU	270	218.4	6.4	254	226.8	3.3	315	224.5	11.0
AGGA	116	96.4	3.6	114	117.8	-0.6	138	99.5	7.0	UGGA	171	154.4	2.4	187	147.4	5.9	152	126	4.2
AGGC	65	61.9	0.7	114	104.6	1.7	92	80.3	2.4	UGGC	144	103.8	7.2	184	144.4	6.0	141	103.4	6.7
AGGG	55	59.1	-1.0	88	79.5	1.7	53	60.7	-1.8	UGGG	81	105	-4.3	90	118.9	-4.8	59	74.4	-3.2
AGGU	137	143.8	-1.0	128	150	-3.3	129	119.5	1.6	UGGU	307	302.3	0.5	260	236.3	2.8	200	173.5	3.7
AGUA	137	159.2	-3.2	124	155.6	-4.6	115	116.9	-0.3	UGUA	228	233.8	-0.7	202	215.5	-1.7	161	165.6	-0.6
AGUC	62	77.6	-3.2	75	93.9	-3.5	76	87.3	-2.2	UGUC	116	116.5	-0.1	159	143.6	2.3	141	129.8	1.8
AGUG	152	156.2	-0.6	187	173.8	1.8	127	120.7	1.0	UGUG	266	246.9	2.2	300	255.8	5.0	214	170.9	6.0
AGUU	222	239.6	-2.1	214	206	1.0	126	161.7	-5.1	UGUU	498	407.8	8.1	415	346.1	6.7	274	252	2.5
AUAA	228	220.7	0.9	189	188.5	0.1	129	152.4	-3.4	UUAA	322	269.8	5.8	258	235.5	2.7	195	202.8	-1.0
AUAC	124	129.2	-0.8	100	112.8	-2.2	100	132.3	-5.1	UUAC	185	173.5	1.6	158	155.6	0.3	186	179.8	0.8
AUAG	120	141.9	-3.3	120	135	-2.3	65	91.7	-5.1	UUAG	141	177.5	-5.0	131	183	-7.0	112	119.5	-1.2
AUAU	205	237.9	-3.9	151	185.9	-4.7	99	144.4	-6.9	UUAU	397	385.6	1.1	309	269.1	4.4	191	226.5	-4.3
AUCA	105	122	-2.8	77	99.2	-4.1	139	136.8	0.3	UUCA	127	155	-4.1	132	126.1	1.0	206	180	3.5
AUCC	59	62.9	-0.9	63	65.5	-0.6	54	65.9	-2.7	UUCC	66	80.1	-2.9	75	90.9	-3.0	71	87.9	-3.3
AUCG	31	46.9	-4.2	42	57.2	-3.7	31	59.6	-6.7	UUCG	33	55.3	-5.5	64	69.3	-1.2	56	71.8	-3.4
AUCU	108	129.1	-3.4	87	109.2	-3.9	108	137.1	-4.5	UUCU	193	201.6	-1.1	133	151.2	-2.7	226	189.5	4.8
AUGA	204	212.3	-1.0	203	202.8	0.0	189	198.6	-1.2	UUGA	237	239.3	-0.3	197	213.4	-2.0	189	186.8	0.3
AUGC	186	151.1	5.2	194	164.2	4.2	179	154.3	3.6	UUGC	197	174.1	3.2	188	184.4	0.5	185	162.5	3.2
AUGG	211	180.8	4.1	197	179.4	2.4	185	143.1	6.4	UUGG	213	230.6	-2.1	208	185.6	3.0	153	143.2	1.5
AUGU	296	273.3	2.5	275	269.9	0.6	218	197.4	2.7	UUGU	415	363.3	4.9	368	298	7.4	245	204.3	5.2
AUUA	239	253.9	-1.7	191	216	-3.1	190	192	-0.3	UUUA	407	345.3	6.0	303	257.1	5.2	204	204.1	0.0
AUUC	106	126.1	-3.3	100	110.2	-1.8	127	136.8	-1.5	UUUC	141	161.3	-2.9	109	146.1	-5.6	187	162.4	3.5
AUUG	245	253.5	-1.0	206	211.6	-0.7	208	176	4.4	UUUG	367	357.7	0.9	318	271.3	5.2	207	194.8	1.6
AUUU	361	337.8	2.3	287	251.6	4.1	197	205.6	-1.1	UUUU	454	495.8	-3.4	296	325	-2.9	215	245.2	-3.5
GAAA	118	124	-1.0	104	111.4	-1.3	142	140.8	0.2	CAAA	128	133.8	-0.9	160	108.5	9.0	221	182.3	5.2
GAAC	58	64.6	-1.5	63	75.1	-2.5	89	96.9	-1.5	CAAC	83	71	2.6	93	74.9	3.8	166	128.5	6.0
GAAG	136	125.8	1.7	153	123.2	4.9	126	117.7	1.4	CAAG	108	111	-0.5	135	111.5	4.0	158	132.5	4.0
GAAU	118	140.8	-3.5	119	142.4	-3.6	90	125.7	-5.8	CAAU	144	147.5	-0.5	126	139.1	-2.0	178	170.2	1.1
GACA	82	83	-0.2	99	91	1.5	162	128.5	5.4	CACA	82	78.3	0.8	84	80.6	0.7	168	150.9	2.5
GACC	37	39.2	-0.6	61	69.1	-1.8	66	68.5	-0.5	CACC	59	44	4.1	76	59.3	3.9	98	80.1	3.6
GACG	27	33	-1.9	46	51	-1.3	33	52.9	-5.0	CACG	28	31.5	-1.1	40	46	-1.6	27	50.4	-6.0
GACU	82	86.6	-0.9	88	94.7	-1.3	101	115.3	-2.4	CACU	108	75.3	6.8	97	93.1	0.7	184	151.4	4.8
GAGA	73	77.8	-1.0	104	100.3	0.7	105	111.8	-1.2	CAGA	125	106.4	3.3	123	98.1	4.6	141	128.8	2.0
GAGC	52	60.1	-1.9	66	90.9	-4.8	83	74.9	1.7	CAGC	96	71.9	5.2	99	82.8	3.2	95	96.6	-0.3
GAGG	73	68.6	1.0	112	108.6	0.6	95	89.1	1.1	CAGG	94	84.9	1.8	106	93.4	2.4	102	93.4	1.6
GAGU	103	100.9	0.4	128	134	-0.9	108	98.6	1.7	CAGU	108	128.7	-3.3	132	127.4	0.7	98	127.2	-4.7
GAUA	149	172.1	-3.2	127	145.4	-2.8	81	111	-5.2	CAUA	88	110.8	-3.9	92	112.5	-3.5	99	118.2	-3.2
GAUC	70	86.1	-3.2	63	73	-2.1	55	75.7	-4.3	CAUC	49	56.5	-1.8	45	67.1	-4.9	100	91.9	1.5
GAUG	231	209.7	2.7	237	199.5	4.8	198	159	5.6	CAUG	110	117.8	-1.3	119	143.9	-3.8	153	138.3	2.3
GAUU	205	201.7	0.4	159	176.8	-2.4	125	128.6	-0.6	CAUU	166	149	2.5	160	159	0.1	196	173	3.2
GCAA	104	114.7	-1.8	137	123.3	2.2	133	131	0.3	CCAA	84	81.4	0.5	126	90.8	6.7	119	107.2	2.1
GCAC	70	65.3	1.1	77	74.2	0.6	99	92	1.3	CCAC	71	41.7	8.2	74	55.9	4.4	81	77.7	0.7
GCAG	131	102.4	5.1	157	113.3	7.5	80	100.5	-3.7	CCAG	67	64.5	0.6	90	80.7	1.9	95	77.5	3.6
GCAU	120	109.4	1.8	128	140.9	-2.0	112	104.2	1.4	CCAU	81	71.9	1.9	94	97.4	-0.6	97	91	1.1
GCCA	84	57.8	6.2	111	97.4	2.5	99	76.5	4.7	CCCA	46	41.6	1.2	83	62.7	4.7	56	60.1	-1.0
GCCC	34	34.5	-0.2	75	63.8	2.5	35	36.6	-0.5	CCCC	28	22.2	2.2	43	47.5	-1.2	18	28.8	-3.6
GCCG	29	29.7	-0.2	51	60.5	-2.2	21	31.7	-3.4	CCCG	17	20.4	-1.4	45	39.4	1.6	16	20.1	-1.6
GCCU	84	66.8	3.8	122	106.2	2.8	75	71.9	0.7	CCCU	58	43.9	3.8	76	78	-0.4	48	60.7	-3.0
GCGA	30	38.9	-2.6	42	57.3	-3.7	36	43.5	-2.1	CCGA	25	27.9	-1.0	45	47.1	-0.6	16	36.2	-6.0
GCGC	31	31.7	-0.2	65	57.4	1.8	38	41	-0.8	CCGC	20	21.8	-0.7	50	47	0.8	21	32.1	-3.6
GCGG	21	31.7	-3.4	43	56.8	-3.3	23	29.6	-2.2	CCGG	11	20.9	-3.9	36	44.9	-2.4	13	21.2	-3.2
GCGU	63	55.9	1.7	87	82.1	1.0	47	52.9	-1.5	CCGU	29	38.2	-2.7	54	68.1	-3.1	37	41.8	-1.3
GCUA	165	131.3	5.4	162	144.3	2.7	153	140.7	1.9	CCUA	85	77	1.7	83	96.5	-2.5	104	88.6	3.0
GCUC	58	58.8	-0.2	75	80.5	-1.1	89	98.1	-1.7	CCUC	38	40.1	-0.6	79	58.6	4.8	63	65.2	-0.5
GCUG	136	131.5	0.7	187	173.4	1.9	196	145.3	7.6	CCUG	89	80.4	1.7	118	108.1	1.7	70	89	-3.7
GCUU	167	147.3	3.0	158	162.5	-0.6	180	149.5	4.5	CCUU	86	97.4	-2.1	119	113.3	1.0	105	104.6	0.1
GGAA	86	82.1	0.8	83	103.4	-3.7	103	86	3.3	CGAA	23	42.7	-5.5	40	58.5	-4.4	37	55.5	-4.5
GGAC	51	48.4	0.7	57	67.7	-2.4	68	72.3	-0.9	CGAC	24	22.5	0.6	32	34.3	-0.7	27	46.9	-5.3
GGAG	81	66.6	3.2	109	95.9	2.4	92	70.1	4.8	CGAG	17	29.5	-4.2	42	53.5	-2.9	35	49.8	-3.8
GGAU	122	127	-0.8	139	124.7	2.3	80	83.7	-0.7	CGAU	41	63.1	-5.0	46	63.3	-4.0	36	56	-4.9
GGCA	93	70	5.0	142	99.4	7.7	108	83.7	4.8	CGCA	38	40.7	-0.8	67	63.3	0.8	46	58.2	-2.9
GGCC	34	33.7	0.1	74	74.8	-0.2	33	39.1	-1.8	CGCC	19	17.1	0.8	50	45.6	1.2	15	27.1	-4.2
GGCG	28	32.2	-1.3	57	62.9	-1.4	33	40.5	-2.1	CGCG	17	14.9	1.0	32	39.2	-2.1	21	23.4	-0.9
GGCU	95	88.7	1.2	135	117.9	2.9	115	94.9	3.8	CGCU	36	44.4	-2.3	61	73.5	-2.7	46	74.1	-5.9
GGGA	38	53	-3.7	52	65.7	-3.1	36	48.5	-3.3	CGGA	15	26.4	-4.0	35	51	-4.1	18	33.8	-4.9
GGGC	20	37.4	-5.1	64	68.8	-1.1	36	38.3	-0.7	CGGC	21	19	0.8	45	47.2	-0.6	20	29.3	-3.1
GGGG	26	41.9	-4.5	23	53.8	-7.6	20	31.4	-3.7	CGGG	10	19.9	-4.0	27	39.1	-3.5	12	17.5	-2.4
GGGU	88	95	-1.3	88	100.4	-2.2	52	63.8	-2.7	CGGU	31	50.2	-4.9	52	67.2	-3.4	28	55.5	-6.7
GGUA	147	153.8	-1.0	113	130.8	-2.8	106	102.8	0.6	CGUA	55	53.6	0.3	52	71.6	-4.2	52	60.2	-1.9
GGUC	51	70.4	-4.2	61	76.8	-3.3	40	71.3	-6.7	CGUC	16	24.9	-3.2	29	41.9	-3.6	36	41.9	-1.6
GGUG	160	161.8	-0.3	179	171.9	1.0	135	119.9	2.5	CGUG	60	64.9	-1.1	84	90.6	-1.3	69	71.3	-0.5
GGUU	205	201.3	0.5	175	181.2	-0.8	127	123.2	0.6	CGUU	59	83.2	-4.8	88	104.4	-2.9	53	81.6	-5.8
GUAA	165	174.4	-1.3	135	160.2	-3.6	101	130.5	-4.7	CUAA	154	145.5	1.3	128	140	-1.8	141	153.8	-1.9
GUAC	99	109.2	-1.8	86	110.2	-4.2	143	109	5.9	CUAC	112	88.1	4.6	95	87.5	1.5	160	140.2	3.0
GUAG	112	118.4	-1.1	104	136.9	-5.1	96	88.6	1.4	CUAG	78	86.6	-1.7	74	103.6	-5.3	75	99.2	-4.4
GUAU	191	195.4	-0.6	166	172	-0.8	94	118.6	-4.1	CUAU	163	148.3	2.2	182	150.6	4.7	177	162	2.1
GUCA	85	95.2	-1.9	105	98.5	1.2	114	113.3	0.1	CUCA	59	75	-3.4	73	82.5	-1.9	137	130.2	1.1
GUCC	30	52.2	-5.6	59	73.4	-3.1	35	52.7	-4.4	CUCC	33	41	-2.3	62	58.7	0.8	46	62.3	-3.7
GUCG	33	39.6	-1.9	35	59	-5.7	40	51.3	-2.9	CUCG	21	32	-3.5	39	46	-1.9	43	59.9	-4.0
GUCU	97	109.1	-2.1	125	120.1	0.8	104	112.9	-1.5	CUCU	84	82	0.4	110	94.8	2.8	126	136.9	-1.7
GUGA	122	162.3	-5.8	152	163.3	-1.6	131	127	0.6	CUGA	107	124.3	-2.8	108	139.4	-4.8	141	150.6	-1.4
GUGC	113	115.2	-0.4	149	148.5	0.1	130	110.9	3.3	CUGC	109	91.9	3.2	141	110.9	5.2	159	128.2	4.9
GUGG	158	146.3	1.8	180	158.8	3.1	109	101	1.4	CUGG	121	98.1	4.2	136	123.2	2.1	106	104.8	0.2
GUGU	245	218.3	3.3	269	223.8	5.5	174	129.3	7.2	CUGU	151	157.5	-0.9	164	179.7	-2.1	152	159.3	-1.1
GUUA	255	244.6	1.2	237	225.1	1.4	126	168.7	-6.0	CUUA	143	152.3	-1.4	125	150.3	-3.8	164	163.1	0.1
GUUC	104	123	-3.1	119	116.6	0.4	97	114.8	-3.0	CUUC	68	80.6	-2.5	76	81.3	-1.1	148	124.3	3.9
GUUG	280	254.7	2.9	283	248	4.0	165	169.1	-0.6	CUUG	168	147.9	3.0	154	152.9	0.2	190	154.7	5.2
GUUU	344	316.4	2.8	253	239.5	1.6	192	171	2.9	CUUU	211	212.4	-0.2	191	177.1	1.9	209	183.8	3.4

Approach one – Sequence Relationship of Viruses based on The Correlation of Tetra-nucleotide Bias

Two relationship trees were derived, one from the entire genome and the other from the replication enzyme (Figure 1). The result based on the replication enzyme sequence was included because these regions in RNA viruses are subject to a strong selective pressure to ensure successful replication of their own RNA in the host cell. The two distance trees can be clustered distinctly into two major groups of viruses. Interestingly, this clustering validates our approach, since these clusters are consistent with biological properties of the viruses: Group #1 corresponds to all positive strand ssRNA viruses while Group #2 corresponds to negative strand ssRNA viruses. Each group must have followed a different evolutionary path, leading to its distinct pattern of tetra-nucleotide usage. The classification into the two main groups of viruses (positive/negative strand ssRNA viruses) demonstrates a level of congruence with the taxonomy of the viruses [20] and indicates that there exists a relationship signal in tetra-nucleotide usage patterns.
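The distance matrix that feeds a neighbor-joining tree can be sketched as follows. The exact distance formula is not given in this section, so d = 1 - r, with r the Pearson correlation between two z-value profiles, is an assumption; the function names are hypothetical. The short profiles below are the first eight rows of Table 5 (AAAA to AACU), used only as a toy input; a real analysis would use all 256 words.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def correlation_distances(profiles):
    """Return {(name_a, name_b): 1 - r} for every pair of profiles."""
    names = list(profiles)
    return {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}

# z values for AAAA..AACU taken from Table 5
profiles = {
    "BCoV": [-7.4, 1.1, 2.0, -0.4, 3.2, 3.4, -3.3, -5.0],
    "MHV":  [0.2, 1.3, 6.2, 2.0, 0.2, 6.7, -6.2, -4.0],
    "SARS": [0.7, 0.9, 0.9, 1.6, 5.4, 1.8, -5.0, 0.4],
}
distances = correlation_distances(profiles)
for pair in [("BCoV", "MHV"), ("BCoV", "SARS"), ("MHV", "SARS")]:
    print(pair, round(distances[pair], 3))
```

With this definition, identical profiles are at distance 0 and perfectly anti-correlated profiles at distance 2; the full matrix could then be passed to any neighbor-joining implementation.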

Figure 1

Two relationship trees based on the correlation coefficients of tetra-nucleotide usage bias. The distance tree for 31 RNA viruses based on the tetra-nucleotide usage pattern for the entire genome (right) and the replication enzyme (left). The correlation distances are shown on top of each branch.

Inside both relationship trees, Avian Encephalomyelitis Virus (AEV), Lactate Dehydrogenase-elevating Virus (LDV), Porcine Reproductive and respiratory syndrome Virus (PRV), Equine arteritis Virus (EV1), Rabbit Hemorrhagic disease Virus (RHV) and Yellow Fever Virus (YFV) form the outermost group of viruses, exhibiting differences in their tetra-nucleotide usage pattern. Among the positive strand ssRNA viruses, CoronaViruses form the largest cluster. SARS-CoV is found at the basal position of the other CoronaVirus types and remains closest to the Transmissible Gastroenteritis Virus (TGV) and Feline CoronaVirus (FCoV). This placement is consistent with the findings from two seminal papers [9,10] where SARS-CoV was classified in a separate group from the rest of the known CoronaViruses. In addition, both distance trees suggested that the Bovine CoronaVirus (BCoV) and the Mouse Hepatitis Virus (MHV) should be grouped together whereas the Human CoronaVirus 229E (HCoV) is the closest to the Porcine epidemic Diarrhea Virus (PDV). For the negative strand ssRNA viruses, there are two obvious classes that have evolved through different branches of word usage pattern. The first class covers Hantaan Virus (HV1), Reston Ebola Virus (REV), Bovine Ephemeral Fever Virus (BFV), Bovine Respiratory syncytial Virus (BRV), Respiratory Syncytial Virus (RSV) and Human Respiratory syncytial Virus (HRV). The second class covers the remaining negative strand ssRNA viruses.

Approach two – Sequence Relationship of Viruses based on The Factors of the Tetra-nucleotide Usage Pattern [21-23]

The overall tetra-nucleotide usage pattern (additional file 2) was decomposed into several eigen-vectors using a factor analysis algorithm. These eigen-vectors are the uncorrelated components embedded within the overall tetra-nucleotide usage pattern. Three eigen-vectors, which carry 83.3% of the variance of the viral tetra-nucleotide usage patterns, were retained (Figure 2). In the plots against these retained eigen-vectors (Figure 3, Figure 4, Figure 5 and Figure 6), the negative strand ssRNA viruses separate clearly from the positive strand ssRNA viruses. This is most obvious when the axes of projection are the 1st and 3rd eigen-vectors. This indicates that both types of viruses have a complex mixture of tetra-nucleotide usage patterns and that these patterns change from one family of viruses to another.

Figure 2

Relationship between the number of eigen-vectors retained and the percentage of the variance they represent in the entire usage patterns for 31 viruses. Each consecutive factor is defined to identify a usage pattern that is not captured by the preceding eigen-vectors, so consecutive factors are independent of each other. In addition, the consecutive eigen-vectors are extracted in order of diminishing importance.

Figure 3

3-D plot for the vectorial profiling of each virus onto the three eigen-vectors. The tetra-nucleotide usage patterns V for the replicase open reading frame in each virus have been redisplayed on the 1st, 2nd and 3rd eigen-vector axes ('o' represents positive strand ssRNA virus; 'x' represents negative strand ssRNA virus). The two families of viruses clustered into two different regions of the plot.

Figure 4

2-D plots for Figure 3 with different viewpoint specifications. The tetra-nucleotide usage patterns for the replicase open reading frame in each virus have been redisplayed on the (1st vs 2nd), (1st vs 3rd) and (2nd vs 3rd) eigen-vector axes ('o' represents positive strand ssRNA virus; 'x' represents negative strand ssRNA virus). For the top figure, the order for 'o' is [15,17,12,16,8,14,9,11,13,4,7,3,10,6,2,5,1]* (left to right), whereas 'x' is [24,27,25,28,22,26,18,23,20,19,21]* (left to right). For the middle figure, the order for 'o' is [15,17,12,16,8,14,9,11,13,4,7,3,10,6,2,5,1]* (left to right), whereas 'x' is [24,27,25,28,22,26,18,23,20,19,21]* (left to right). For the bottom figure, the order for 'o' is [15,17,12,16,8,14,9,11,13,4,7,3,10,6,2,5,1]* (left to right), whereas 'x' is [24,27,25,28,22,26,18,23,20,19,21]* (left to right). *The corresponding virus for each number follows Figure 3.

Figure 5

3-D plot for the vectorial profiling of each virus onto the three eigen-vectors. The tetra-nucleotide usage patterns in additional file 2 (entire genome) for each virus have been redisplayed on the 1st, 2nd and 3rd eigen-vector axes ('o' represents positive strand ssRNA virus; 'x' represents negative strand ssRNA virus). The two families of viruses clustered into three different regions of the plot.

Figure 6

2-D plots for Figure 5 with different viewpoint specifications. The tetra-nucleotide usage patterns in additional file 2 (entire genome) for each virus have been redisplayed on the (1st vs 2nd), (1st vs 3rd) and (2nd vs 3rd) eigen-vector axes ('o' represents positive strand ssRNA virus; 'x' represents negative strand ssRNA virus). For the top figure, the order for 'o' is [3,7,1,4,2,5,6,17,13,20,10,16,9,8,11,15,12,14,18,19]* (left to right), whereas 'x' is [26,22,30,23,24,28,31,27,25,29,21]* (left to right). For the middle figure, the order for 'o' is [3,7,1,4,2,5,6,17,13,20,10,16,9,8,11,15,12,14,18,19]* (left to right), whereas 'x' is [26,22,30,23,24,28,31,27,25,29,21]* (left to right). For the bottom figure, the order for 'o' is [3,7,1,4,2,5,6,17,13,20,10,16,9,8,11,15,12,14,18,19]* (left to right), whereas 'x' is [26,22,30,23,24,28,31,27,25,29,21]* (left to right). *The corresponding virus for each number follows Figure 5.

In the result based on the replication enzyme sequence (Figure 3 and Figure 4), we observed a clear split between the two main families of RNA viruses (positive/negative strand ssRNA viruses). All viruses that belong to a specific family clustered closely together. This points to an interesting hypothesis: the replication enzyme sequences of closely related RNA viruses adopt closely linked patterns of word usage. In addition, it is clear that viruses from different family groups adopt different strategies of word usage.

However, when the tetra-nucleotide usage patterns V derived from the entire genome are projected onto the 1st, 2nd and 3rd eigen-vector axes (Figure 5 and Figure 6), the separation between viruses shows a different outcome. The two main families of viruses were grouped into three clusters, two of which belong to the positive strand ssRNA viruses. It is particularly interesting that all viruses in the upper left corner belong to the CoronaVirus family. Unexpectedly, the Hantaan Virus (HV1) is the only negative strand ssRNA virus with a high loading on the eigen-vector that corresponds to the tetra-nucleotide usage pattern of the positive strand ssRNA viruses.

It is important to understand what factor analysis provides and how it differs from the previous method of generating relationship trees from correlation coefficients. The previous method, based on the correlation coefficient of word usage patterns, treats the vectorial profile V of each virus as a whole entity. Factor analysis, in contrast, considers the vectorial profile V as a superposition of many patterns that can be separated into mutually uncorrelated patterns of word usage. Each eigen-vector represents an embedded component of RNA word usage shared by a group of viruses that are presumably under the same selection pressures. By projecting the overall usage patterns onto these eigen-vectors, it is possible to identify groups of viruses that adopt a common strategy of word usage.

Conclusion

Using the two approaches to study the tetra-nucleotide usage pattern in RNA viruses, we reached the following conclusions:

1. Based on the correlation of the overall tetra-nucleotide usage patterns, the Transmissible Gastroenteritis Virus (TGV) and the Feline CoronaVirus (FCoV) are closest to SARS-CoV.
2. Based on the three most significant eigen-vectors, the genomes of the viruses from the same family conform to a similar tetra-nucleotide usage pattern, irrespective of their genome size.
3. The study of word usage is a powerful method to classify RNA viruses. The congruence of the relationship trees with the known classification indicates that there exist phylogenetic signals in tetra-nucleotide usage patterns, and this signal is most prominent in the replicase open reading frames.

Methods

Dataset

We focused our study on the genomic sequences (their translated strand) of ssRNA viruses (Table 1), comprising 20 species of positive strand ssRNA viruses and 11 species of negative strand ssRNA viruses. We are aware that these viruses constitute completely different species, most probably unrelated to one another. They are included in a common study in an attempt to identify relevant features from purely statistical background properties. The selection covers viruses that are known to cause disease in their hosts. The acronym for each virus is shown in the table and is used throughout this study. All sequences corresponding to the translated strand were retrieved from GenBank, and the accession number and genomic size (in nucleotides) of each virus are provided for reference. For the present study, two sets of data were generated from the complete sequence of each virus. Dataset 1 covered the entire genome and dataset 2 covered only the replicase open reading frame. The flowchart for studying the tetra-nucleotide usage pattern in the 31 viruses is shown in Figure 7.

Figure 7

Flowchart for studying the tetra-nucleotide usage pattern. The FA and NJ algorithms stand for factor analysis [21-23] and neighbor joining [29] algorithm.

Computer hardware and software

A Sun Fire 6800 server with 24 CPUs (each running at a clock speed of 900 MHz) was employed throughout this study. The computation of the correlation coefficients and the factor analysis algorithm were implemented in the Matlab technical programming language.

Method for counting the frequency of occurrence for RNA words

It is necessary to address the question of how we counted the number of times each tetra-nucleotide (for example 'GAGA' or any other tetra-nucleotide) appeared in a given genome. For this study, we adopted the convention of not counting overlapping words [24]. Take the sequence 'UAUGAGAGAUCCGAGA' as an example. With second or higher overlapping occurrences not counted, the tetra-nucleotide 'GAGA' is counted as occurring only twice, namely at positions 4–7 and 13–16. The occurrence at positions 6–9 is omitted because it overlaps with 'GAGA' at positions 4–7. However, when we counted the tetra-nucleotide 'UGAG', the occurrence at positions 3–6 was still registered, even though positions 4–6 had already been recorded when counting 'GAGA'. In short, frequency counting was started anew each time we changed from one tetra-nucleotide to another; this preserves the correlation between tetra-nucleotides that share an overlapping subword (e.g. 'UAGA' and 'GACA'). A table showing the frequencies of tetra-nucleotides is given in additional file 2.
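The counting convention can be illustrated with a minimal Python sketch. It assumes a greedy left-to-right scan, which is one plain reading of the rule described here; the function names `count_nonoverlapping` and `usage_profile` are hypothetical and this is not the authors' Matlab code.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet assumed for the word set


def count_nonoverlapping(seq, word):
    """Count occurrences of `word`, skipping any that overlap a counted one."""
    count = 0
    pos = seq.find(word)
    while pos != -1:
        count += 1
        # Resume the search after the end of the counted occurrence.
        pos = seq.find(word, pos + len(word))
    return count


def usage_profile(seq, n=4):
    """Return {word: count} for all 4**n words; each word is counted anew."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    return {w: count_nonoverlapping(seq, w) for w in words}


example = "UAUGAGAGAUCCGAGA"
profile = usage_profile(example)
print(profile["GAGA"], profile["UGAG"], len(profile))  # 2 1 256
```

Because each word is scanned independently, 'UGAG' is still found at positions 3–6 even though those positions overlap a counted 'GAGA'.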

Vectorial profiling (V) of the viral RNA genome word usage pattern

The nucleotide composition has been suggested to be a specific characteristic of different virus phylogenies [25]. Because most viral genomes are short, and because we lack a priori information on the tempo and modes of evolution of RNA viruses, we proceeded as follows. We created a vector, V = [C1, C2, ... Ci, ... Ck], with each element representing the frequency of a specific RNA word of length n. The number of components (k) in V increases exponentially with word size (n): k = 4^n. In order to use V for discrimination between viruses, two criteria must be met. First, V must contain sufficient components (di-nucleotide k = 16; tri-nucleotide k = 64; tetra-nucleotide k = 256); second, the frequencies of the tetra-nucleotides must show a prominent bias (over- or under-representation) that is unique to a family of viruses.

For the first criterion, there are pros and cons to choosing either longer or shorter words. Shorter words represent the viral genome inadequately because long motifs are neglected, but they have the advantage of saving computational time. Longer words, on the other hand, raise a problem of computational tractability because the word set to explore is larger (k = 4^n), although they have the advantage of accounting for the correlation of their sub-words. In addition, the number of occurrences of each word falls rapidly as word length increases, which prevents accurate statistical analysis. We chose tetra-nucleotides for our study because they provide 256 vector components (additional file 2) and account for correlation of sub-words up to order three.

For the second criterion, the bias in RNA word usage was examined. The bias in usage of words of size n is influenced by the bias of words of size less than n [26]. Therefore, in order to evaluate the true bias of words of size m, the frequencies of word usage in the original sequence must be compared with those in model chromosomes that take into account the biases of word sizes m - 1, m - 2, ... 1. These model chromosomes were generated according to a Markov model of order m - 1. This can be achieved by shuffling m - 1 viral nucleotides as one whole unit, so that the nucleotide successions up to order m - 1 are preserved. Several statistical approaches have been proposed for quantifying word biases [27,28]. In this study, we employed the z statistic (Equation 1) for di-nucleotide and tetra-nucleotide biases [27,28]. The z value is a measure of the bias of a word, with values close to zero meaning no bias, negative values meaning under-representation and positive values meaning over-representation of the word w in the RNA text:

$$ z(w) = \frac{N(w) - E(w)}{\sqrt{\mathrm{Var}(w)}} \qquad (1) $$

where w is a word of size m; N(w) is the observed count in the actual viral RNA; E(w) and Var(w) are the expected count and variance for w derived from the 100 artificial chromosomes that preserve the nucleotide succession up to order m - 1.
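As a rough illustration of the z statistic described above, the following Python sketch estimates the expected count and variance from shuffled model sequences. It rests on several assumptions that are not confirmed by the text: the model sequences are built by shuffling fixed, non-overlapping blocks of m - 1 nucleotides, the population variance is used, and the names `block_shuffle`, `z_score` and `word_bias` are hypothetical.

```python
import random
import statistics


def block_shuffle(seq, block, rng):
    """Shuffle `seq` in units of `block` nucleotides (assumed model generator)."""
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    rng.shuffle(blocks)
    return "".join(blocks)


def z_score(observed, model_counts):
    """z = (N - E) / sqrt(Var), with E and Var taken from the model counts."""
    expected = statistics.mean(model_counts)
    variance = statistics.pvariance(model_counts)  # population variance (assumption)
    if variance == 0:
        raise ValueError("model counts have zero variance; z is undefined")
    return (observed - expected) / variance ** 0.5


def word_bias(seq, word, n_models=100, seed=0):
    """Bias of `word` in `seq` against `n_models` block-shuffled sequences."""
    rng = random.Random(seed)
    block = max(1, len(word) - 1)  # shuffle m - 1 nucleotides as one unit
    models = [block_shuffle(seq, block, rng) for _ in range(n_models)]
    # str.count counts non-overlapping occurrences, matching the convention above.
    return z_score(seq.count(word), [m.count(word) for m in models])
```

A negative return value from `word_bias` would indicate under-representation of the word relative to the shuffled models, and a positive value over-representation.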

Approach one – sequence relationship of viruses based on the correlation of tetra-nucleotide bias

A scale-invariant parameter, the correlation coefficient r, was employed to compare the word usage patterns of viruses. The correlation coefficient r measures the degree of linear relationship between two vectors. Here, the two vectors are the tetra-nucleotide word usage patterns V of two viral genomes. The magnitude of r indicates how much of the change of pattern in tetra-nucleotide word usage in one virus is explained by the change in another. The value of r always lies between -1 and +1, and the relationship between the two variables approaches perfect linearity as r approaches the extreme values (+/-1). However, perfect positive correlation (r = 1) does not mean identity of the paired V but, rather, identity up to positive linearity, that is, identity between the paired standardized values. This scale invariance is a crucial property of r because it enables the comparison of viral genomes despite their differences in genomic size. A positive r indicates positive association whereas a negative r indicates negative association between two usage patterns. For this study, the correlation coefficient r for, say, virus 1 and virus 2 is defined in the standard way as follows:

$$ r = \frac{1}{k-1} \sum_{i=1}^{k} \left( \frac{V_{1,i} - \bar{V}_1}{S_{V_1}} \right) \left( \frac{V_{2,i} - \bar{V}_2}{S_{V_2}} \right) \qquad (2) $$

where V_1, V_2 are the vectors representing the tetra-nucleotide usage patterns, each with k components; S_{V_1}, S_{V_2} are the (sample) standard deviations of V_1, V_2; and \bar{V}_1, \bar{V}_2 are the means of V_1, V_2.

Then, the distance between the tetra-nucleotide usage patterns of two viruses is defined as follows:

$$ D_{ij} = 1 - r_{ij} \qquad (3) $$

where D_{ij} is the distance between the tetra-nucleotide usage patterns of virus i and virus j, and r_{ij} is the correlation coefficient between the tetra-nucleotide usage patterns of virus i and virus j.

Prior to the construction of a relationship tree, the pair-wise distance matrix M of size 31 by 31 was constructed (see additional file 3). Each row/column corresponds to a specific virus, and the entry at the intersection of row X and column Y is the distance (1 - r) between virus X and virus Y. The diagonal entries of this matrix are 0. Because M is symmetric, only its lower or upper triangle is required to construct a relationship tree. From this triangular matrix, the neighbor-joining (NJ) algorithm was used to construct the relationship tree (Figure 1). The neighbor-joining method is based on the minimum-distance principle. Details of the NJ algorithm are available in [29].
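The distance computation can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the usual Pearson formula for r, the helper names `pearson_r` and `distance_matrix` are hypothetical, and the three short vectors stand in for real 256-component usage profiles.

```python
import math


def pearson_r(v1, v2):
    """Pearson correlation coefficient between two equal-length vectors."""
    k = len(v1)
    m1, m2 = sum(v1) / k, sum(v2) / k
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in v1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in v2))
    return cov / (s1 * s2)


def distance_matrix(profiles):
    """Pair-wise D = 1 - r for a {name: vector} mapping; the diagonal is 0."""
    names = list(profiles)
    return {
        a: {b: 0.0 if a == b else 1.0 - pearson_r(profiles[a], profiles[b])
            for b in names}
        for a in names
    }


# Toy profiles (hypothetical values, not taken from the additional files).
toy = {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [4, 3, 2, 1]}
M = distance_matrix(toy)
print(round(M["A"]["B"], 6), round(M["A"]["C"], 6))  # 0.0 2.0
```

Profiles A and B differ only by a scale factor, so their distance is 0, which illustrates why r allows genomes of different sizes to be compared. The resulting matrix could then be passed to any neighbor-joining implementation.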

Approach two – sequence relationship of viruses based on the factors of the tetra-nucleotide usage pattern

Factor analysis is a statistical method that reveals simpler patterns within a complex set of tetra-nucleotide usage patterns V (additional file 2). It seeks to discover whether the observed usage patterns can be explained in terms of a much smaller number of uncorrelated pattern sets called factors (eigen-vectors). Suppose we take a simple case where there are 31 viruses, each represented by two components (x, y) in vector V (x and y represent the frequencies of occurrence of two specific tetra-nucleotides). Then, in a scatter-plot, we can think of the first eigen-vector as the original X-axis rotated so that it approximates the regression line. This type of rotation maximizes the variance of the variables (x, y) along the eigen-vector. The remaining variability around the first eigen-vector is captured by the subsequent eigen-vectors. In this manner, consecutive eigen-vectors are extracted with diminishing importance. Each eigen-vector represents an embedded RNA word usage pattern shared by a group of viruses that are presumably under the same selection pressures. We implemented the factor analysis algorithm [21-23] in the Matlab technical programming language and computed a set of eigen-vectors. Then, the original usage pattern V of each virus was re-mapped onto the new coordinate system defined by these eigen-vectors. The difference between approach two and approach one is discussed in the results and discussion section.
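The two-component case described above can be made concrete with a small Python sketch. It is a stand-in rather than the authors' factor analysis: it performs a principal-axis (eigen) decomposition of the 2 x 2 covariance matrix in closed form, and the function name `principal_axes_2d` and the toy points are hypothetical.

```python
import math


def principal_axes_2d(points):
    """First eigen-vector, its share of variance, and the projected scores."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix.
    mid = (a + c) / 2
    half = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half, mid - half
    # Eigen-vector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    total = lam1 + lam2
    explained = lam1 / total if total > 0 else 0.0
    # Re-map each point onto the first eigen-vector.
    scores = [(p[0] - mx) * axis[0] + (p[1] - my) * axis[1] for p in points]
    return axis, explained, scores


axis, explained, scores = principal_axes_2d([(1, 2), (2, 4), (3, 6)])
```

For these toy points, which lie exactly on the line y = 2x, the first axis points along that line and carries all of the variance. With 256 components per virus the same idea applies, but a numerical eigen-solver would be needed in place of the closed form.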

Authors' contributions

YLY participated in the design and performed the statistical analysis. AD participated in the design and overall coordination of this study. XWZ participated in the design of the study. All authors read and approved the final manuscript.

Additional File 1

The RNA word biases of different sizes in RNA viruses. These tables show the di-nucleotide, tetra-nucleotide and penta-nucleotide biases for 31 RNA viruses.

Additional File 2

Vectorial profiling of tetra-nucleotide usage pattern in seven RNA viruses. The tetra-nucleotide frequencies of occurrence in seven viral genomes. Each column represents a tetra-nucleotide usage pattern V for a single virus. We derived the correlation coefficient (r) by comparing any two columns. This parameter r indicates the likeness of word usage patterns in any two viruses.

Additional File 3

The distance matrices. Each entry in matrix M is computed using Equation 3. The correlation coefficient (r) in Equation 3 is obtained by comparing any two columns of the tetra-nucleotide usage patterns table in additional file 2.

22 in total

Review 1. Ecological fitness, genomic islands and bacterial pathogenicity. A Darwinian view of the evolution of microbes.

Authors: J Hacker; E Carniel
Journal: EMBO Rep Date: 2001-05 Impact factor: 8.807

2. SARS update.

Authors: James Maskalyk; John Hoey
Journal: CMAJ Date: 2003-05-13 Impact factor: 8.262

3. SARS Web information.

Authors: John S James
Journal: AIDS Treat News Date: 2003-04-04

4. Identification of a novel coronavirus in patients with severe acute respiratory syndrome.

Authors: Christian Drosten; Stephan Günther; Wolfgang Preiser; Sylvie van der Werf; Hans-Reinhard Brodt; Stephan Becker; Holger Rabenau; Marcus Panning; Larissa Kolesnikova; Ron A M Fouchier; Annemarie Berger; Ana-Maria Burguière; Jindrich Cinatl; Markus Eickmann; Nicolas Escriou; Klaus Grywna; Stefanie Kramme; Jean-Claude Manuguerra; Stefanie Müller; Volker Rickerts; Martin Stürmer; Simon Vieth; Hans-Dieter Klenk; Albert D M E Osterhaus; Herbert Schmitz; Hans Wilhelm Doerr
Journal: N Engl J Med Date: 2003-04-10 Impact factor: 91.245

5. Over- and underrepresentation of short DNA words in herpesvirus genomes.

Authors: M Y Leung; G M Marsh; T P Speed
Journal: J Comput Biol Date: 1996 Impact factor: 1.479

6. The nonenzymatic hydrolysis of oligoribonucleotides. VII. Structural elements affecting hydrolysis.

Authors: A Bibillo; M Figlerowicz; K Ziomek; R Kierzek
Journal: Nucleosides Nucleotides Nucleic Acids Date: 2000 May-Jun Impact factor: 1.381

7. High frequency RNA recombination in porcine reproductive and respiratory syndrome virus occurs preferentially between parental sequences with high similarity.

Authors: Joke J F A van Vugt; Torben Storgaard; Martin B Oleksiewicz; Anette Bøtner
Journal: J Gen Virol Date: 2001-11 Impact factor: 3.891

8. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.

Authors: S Schbath; B Prum; E de Turckheim
Journal: J Comput Biol Date: 1995 Impact factor: 1.479

9. A novel coronavirus associated with severe acute respiratory syndrome.

Authors: Thomas G Ksiazek; Dean Erdman; Cynthia S Goldsmith; Sherif R Zaki; Teresa Peret; Shannon Emery; Suxiang Tong; Carlo Urbani; James A Comer; Wilina Lim; Pierre E Rollin; Scott F Dowell; Ai-Ee Ling; Charles D Humphrey; Wun-Ju Shieh; Jeannette Guarner; Christopher D Paddock; Paul Rota; Barry Fields; Joseph DeRisi; Jyh-Yuan Yang; Nancy Cox; James M Hughes; James W LeDuc; William J Bellini; Larry J Anderson
Journal: N Engl J Med Date: 2003-04-10 Impact factor: 91.245

10. Aetiology: Koch's postulates fulfilled for SARS virus.

Authors: Ron A M Fouchier; Thijs Kuiken; Martin Schutten; Geert van Amerongen; Gerard J J van Doornum; Bernadette G van den Hoogen; Malik Peiris; Wilina Lim; Klaus Stöhr; Albert D M E Osterhaus
Journal: Nature Date: 2003-05-15 Impact factor: 49.962
