
De novo Genome Assembly and Single Nucleotide Variations for Soybean Mosaic Virus Using Soybean Seed Transcriptome Data.

Yeonhwa Jo¹, Hoseong Choi¹, Miah Bae¹, Sang-Min Kim², Sun-Lim Kim², Bong Choon Lee², Won Kyong Cho¹, Kook-Hyung Kim¹.

Abstract

Soybean is the most important legume crop in the world. Several diseases in soybean lead to serious yield losses in major soybean-producing countries. Moreover, soybean can be infected by diverse viruses. Recently, we carried out a large-scale screening to identify viruses infecting soybean using available soybean transcriptome data. Of the screened transcriptomes, one soybean transcriptome generated for seed development analysis contained several virus-associated sequences. In this study, we identified five viruses, including soybean mosaic virus (SMV), infecting soybean by de novo transcriptome assembly followed by a BLAST search. We assembled a nearly complete consensus genome sequence of SMV China using transcriptome data. Based on phylogenetic analysis, the consensus genome sequence of SMV China was closely related to SMV isolates from South Korea. We examined single nucleotide variations (SNVs) for SMV in the soybean seed transcriptome, revealing 780 SNVs, which were evenly distributed along the SMV genome. Four substitution types, C-U, U-C, A-G, and G-A, were the most frequently identified. This result demonstrated the quasispecies variation of the SMV genome. Taken together, this study carried out bioinformatics analyses to identify viruses using soybean transcriptome data. In addition, we demonstrated the application of soybean transcriptome data for virus genome assembly and SNV analysis.


Keywords: de novo genome assembly; single nucleotide variation; soybean mosaic virus

Year: 2017 PMID: 29018311 PMCID: PMC5624490 DOI: 10.5423/PPJ.OA.03.2017.0060

Source DB: PubMed Journal: Plant Pathol J ISSN: 1598-2254 Impact factor: 1.795

Soybean (Glycine max (L.) Merr.) is the most important legume crop, representing 50% of the global legume crop area and 68% of global legume production (Herridge et al., 2008). Soybean is consumed as a health food that provides a rich source of protein, and it is also used for vegetable oil production (Messina, 1999; Pimentel and Patzek, 2005). Moreover, soybean plays an important role in dinitrogen (N2) fixation, which is an important natural process (Herridge et al., 2008). Several diseases in soybean, such as cyst nematode, brown spot, charcoal rot, and sclerotinia stem rot, lead to yield losses in major soybean-producing countries (Wrather et al., 2001). In addition, soybean can be infected by diverse viruses. Although only a small number of the viruses infecting soybean cause serious economic problems in soybean production, it is still important to control and manage viral diseases in soybean (Hill and Whitham, 2014). The best-known soybean virus is soybean mosaic virus (SMV), a member of the family Potyviridae, which causes soybean mosaic disease. In addition, bean pod mottle virus (BPMV), soybean vein necrosis virus, tobacco ringspot virus, soybean dwarf virus, peanut mottle virus, peanut stunt virus, and alfalfa mosaic virus are important viruses infecting soybean (Hill and Whitham, 2014). Many plant viruses have been identified based on viral disease symptoms and several detection methods. However, virus infection in plants does not always cause disease symptoms, and plants showing viral disease symptoms are very often co-infected by different viruses. Recent advances in next generation sequencing (NGS) technology have led to the identification of numerous known as well as novel viruses by means of metagenomics (Barba et al., 2014; Massart et al., 2014). Not only NGS data generated for virus detection but also many plant transcriptome datasets contain virus sequences, which might be amplified along with the transcripts of the infected host (Burger and Maree, 2015; Jo et al., 2016).
The identification of virus sequences in a plant transcriptome is no longer surprising, because most plant viruses are RNA viruses and many of them carry a poly(A) tail, which is readily primed by oligo d(T) primers during cDNA synthesis. Recently, we carried out a large-scale screening to identify viruses infecting soybean worldwide using available soybean transcriptome data. Among these datasets, we found that a soybean transcriptome generated for seed development analysis contained many virus sequences. In this study, we conducted bioinformatics analyses for virus identification, virus genome assembly, phylogenetic analysis, and single nucleotide variations of SMV.

Materials and Methods

Plant materials, library preparation, and next generation sequencing

The plant material used for RNA-Seq was soybean cultivar Heinong44. Plants were grown in the experimental station in Beijing from May to August as described in the previous study (Song et al., 2013). Total RNAs were extracted from seeds at six different developmental stages, which were classified according to seed weight. The cDNA was synthesized from poly(A)-containing RNAs. A single RNA-Seq library was constructed and sequenced by single-end sequencing using the Illumina HiSeq 2000 system. The raw data are available in the SRA database (http://www.ncbi.nlm.nih.gov/sra/SRR1777405).

Raw data processing and de novo transcriptome assembly

All bioinformatics analyses were performed on a workstation (four 16-core CPUs and 256 GB RAM) running Linux Mint version 17. We downloaded the raw data from the SRA database using the SRA toolkit (Leinonen et al., 2011). The raw SRA data were converted to FASTQ files using the SRA toolkit. For the de novo assembly of transcriptomes, we used Trinity version 2.0.6 (Haas et al., 2013). De novo transcriptome assembly was performed with default parameters according to the manuals provided by the developers.

Identification of viruses and sequence alignment

To identify virus-associated contigs, we conducted a BLAST search using standalone BLAST version 2.1.19 installed on the Linux system (Madden, 2013). All assembled contigs were subjected to a MEGABLAST search, which is optimized for highly similar sequences, against complete reference sequences for viruses and viroids (http://www.ncbi.nlm.nih.gov/genome/viruses/) with an E value of 1e-5 as the cutoff. In addition, all raw data were converted to FASTA files using the SRA toolkit and subjected to a MEGABLAST search against the viral reference database with the same E value cutoff of 1e-5. We used the Burrows–Wheeler Aligner (BWA) software with default parameters to align sequences on the reference virus genome (Li and Durbin, 2009).
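The E value filtering step can be illustrated with a short Python sketch. It assumes the BLAST results were written in the standard 12-column tabular layout, which matches the column order of Table 2 without the virus name column. The function name `parse_hits` and the last example row are hypothetical and are not part of the original analysis.

```python
from collections import Counter

# Column names of the standard 12-column BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_hits(lines, max_evalue=1e-5):
    """Yield one dict per tab-separated hit whose E value passes the cutoff."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            yield hit

# Three rows taken from Table 2, plus one invented weak hit (hypothetical).
example = [
    "TR2274|c0_g1_i1\tNC_002634.1\t93.13\t233\t16\t0\t2\t234\t8571\t8803\t3.00E-93\t342",
    "TR19277|c0_g2_i1\tNC_003617.1\t75.34\t146\t29\t6\t467\t607\t6837\t6694\t1.00E-08\t63.9",
    "TR29303|c0_g1_i1\tNC_002034.1\t91.28\t298\t26\t0\t4\t301\t1334\t1631\t1.00E-112\t407",
    "contigX\tNC_000000.0\t80.00\t30\t6\t0\t1\t30\t1\t30\t0.5\t30",
]

hits = list(parse_hits(example))
per_subject = Counter(h["sseqid"] for h in hits)  # hits per reference sequence
```

Grouping the surviving hits by subject accession, as `per_subject` does, is one simple way to count how many contigs match each reference virus.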

De novo assembly of SMV genomes

The 79 SMV-associated contigs identified by BLAST were retrieved with the blastdbcmd program in the standalone BLAST system. To assemble the SMV genome, the identified viral contigs were aligned against the SMV reference genome (NC_002634.1) using ClustalW implemented in the MEGA6 program (Tamura et al., 2013). The nearly complete consensus genome of SMV was obtained manually. Raw data were then aligned on the assembled consensus SMV genome by BWA to confirm the sequence. The poly(A) tail at the 3′ end of the assembled SMV genome was removed. In this way, we obtained a nearly complete consensus genome for SMV China from the soybean transcriptome.
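The consensus in this study was curated manually. As a rough illustration of what a consensus sequence is, the following Python sketch applies a simple per-column majority vote to already aligned sequences. This is a swapped-in, simplified technique and not the authors' procedure; the function name and the toy alignment are hypothetical.

```python
from collections import Counter

def majority_consensus(aligned, gap="-"):
    """Return the most common non-gap base in each alignment column.

    Columns that contain only gaps are reported as 'N'.
    Ties are resolved in favour of the base seen first in the column.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    consensus = []
    for column in zip(*aligned):
        bases = [base for base in column if base != gap]
        if bases:
            consensus.append(Counter(bases).most_common(1)[0][0])
        else:
            consensus.append("N")
    return "".join(consensus)

# Toy alignment (hypothetical data, not taken from the study).
toy = ["ACGT-A", "ACGTTA", "AC-TTA", "ATGTTA"]
consensus = majority_consensus(toy)
```

A majority vote hides minor variants, which is why the SNVs are called separately by aligning the raw reads back onto the consensus.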

Identification of SNVs in soybean transcriptome

In order to analyze SNVs of SMV China in the soybean transcriptome, the raw data were aligned on the consensus genome of SMV China using the BWA program with default parameters. The SAM files produced by BWA were converted into BAM files by SAMtools (Li et al., 2009). For SNV calling, we sorted the BAM files and then generated a VCF file using mpileup (Danecek et al., 2011). BCFtools implemented in SAMtools was finally used to call SNVs. The positions of the identified SNVs on the SMV genome were visualized by the Tablet program (Milne et al., 2010).
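Once a VCF file exists, each record can be classified from its REF and ALT columns. The Python sketch below is a minimal illustration that assumes one alternative allele per record; the helper names and the example line (including its quality value) are hypothetical, except for the CAGG to CAGGAGG InDel at nt 640 reported in the Results. Because reads are sequenced as cDNA, U appears as T in the file.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def parse_vcf_line(line):
    """Extract (position, REF, ALT) from one tab-separated VCF data line."""
    cols = line.rstrip("\n").split("\t")
    # VCF fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
    return int(cols[1]), cols[3], cols[4]

def classify_variant(ref, alt):
    """Label a variant as 'indel', 'transition', or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) != 1 or len(alt) != 1:
        return "indel"
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

# Hypothetical record built around the InDel described in the Results.
record = "SMV_China\t640\t.\tCAGG\tCAGGAGG\t50\t.\t."
position, ref, alt = parse_vcf_line(record)
label = classify_variant(ref, alt)
```

Counting the labels over all records gives the numbers of transitions and transversions needed for a Ts/Tv ratio.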

Construction of phylogenetic trees

In order to reveal phylogenetic relationships of the obtained consensus genome for SMV China with known SMV isolates, we generated three phylogenetic trees. The complete genome sequence of SMV isolate China and its two encoded protein sequences (polyprotein and PIPO) were blasted against the NCBI nucleotide and non-redundant protein databases, respectively. The best-matched sequences were retrieved for the construction of the phylogenetic trees. The obtained sequences were aligned by the ClustalW program with default parameters. After alignment, we manually removed unnecessary sequences. The manually edited alignments were used to construct phylogenetic trees with the MEGA6 program. Each phylogenetic tree was constructed by the neighbor-joining method with 1,000 bootstrap replicates.

Results

De novo soybean transcriptome assembly and identification of viruses in the soybean seeds

We screened available soybean transcriptome data deposited in NCBI’s Sequence Read Archive (SRA) database in order to identify viruses infecting soybean. Of the screened soybean transcriptomes, a transcriptome generated for gene expression profiling during soybean seed development contained several virus-associated sequences (accession number SRR1777405) (Song et al., 2013). In order to identify virus-associated contigs, we de novo assembled the soybean transcriptome using the Trinity program, resulting in 116,108 transcripts (contigs) with a contig N50 of 710 bp (Table 1). Next, we blasted the 116,108 transcripts against the viral reference database. After removing redundant sequences and endogenous viral sequences, we identified 83 virus-associated contigs (Table 2). Most contigs (79 contigs) were associated with SMV. The lengths of SMV-associated contigs ranged from 224 to 3,636 nt (Fig. 1A). Four contigs were associated with bean common mosaic virus (BCMV), lettuce infectious yellows virus (LIYV), lettuce chlorosis virus (LCV), and cucumber mosaic virus (CMV), respectively. The lengths of the contigs associated with these four viruses ranged from 232 nt (LCV RNA2) to 1,015 nt (BCMV) (Fig. 1A). Other than the contig associated with LIYV (E value 1E-08), the virus-associated contigs displayed very low E values, indicating the significance of the blast results (Table 2).

Table 1

Summary of de novo soybean transcriptome assembly using Trinity

Accession number	SRR1777405 (a)
Total Trinity transcripts	116108
Percent GC	43.97
Contig N50	710 bp
Median contig length	428 bp
Average contig length	580.18 bp
Total assembled bases	67363642 bp

(a) We assembled raw data from two different libraries using the Trinity program.

The statistics of assembled contigs were calculated by TrinityStats.pl in the Trinity program.
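For readers unfamiliar with the N50 statistic in Table 1, the following Python sketch shows one common definition: the contig length at which the longest contigs, added up in decreasing order, first reach half of all assembled bases. The function name and toy lengths are hypothetical; the last line only rechecks the average contig length from the totals in Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy contig lengths (hypothetical): total 1500, so half is 750.
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)

# Average contig length recomputed from the totals in Table 1.
average_contig = 67363642 / 116108
```

In the toy case, 500 + 400 = 900 is the first running total to reach 750, so the N50 is 400.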

Table 2

Summary of blast results to identify virus-associated contigs

Query id	Subject id	Name of virus	Identity (%)	Alignment length	Mismatches	Gap opens	Query start	Query end	Subject start	Subject end	E value	Bit score
TR2274\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	93.13	233	16	0	2	234	8571	8803	3.00E-93	342
TR3618\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	91.02	256	23	0	1	256	1342	1597	2.00E-94	346
TR3618\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	90.58	276	26	0	1	276	1342	1617	2.00E-100	366
TR3858\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	97.35	264	7	0	1	264	910	1173	2.00E-125	449
TR3858\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	96.6	235	8	0	1	235	939	1173	1.00E-107	390
TR4672\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	96.55	261	9	0	1	261	9036	9296	2.00E-120	433
TR4672\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	97.7	261	6	0	1	261	9036	9296	2.00E-125	449
TR5077\|c1_g1_i1	NC_002634.1	Soybean mosaic virus	94.19	258	15	0	3	260	4680	4937	9.00E-109	394
TR5077\|c1_g1_i2	NC_002634.1	Soybean mosaic virus	91.47	258	22	0	3	260	4680	4937	4.00E-97	355
TR5102\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	91.96	224	18	0	1	224	7552	7329	6.00E-85	315
TR5869\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	91.98	212	17	0	5	216	7243	7032	6.00E-80	298
TR5869\|c0_g2_i1	NC_002634.1	Soybean mosaic virus	92.45	212	16	0	5	216	7243	7032	1.00E-81	303
TR5869\|c0_g3_i1	NC_002634.1	Soybean mosaic virus	92.92	212	15	0	5	216	7243	7032	3.00E-83	309
TR5869\|c0_g4_i1	NC_002634.1	Soybean mosaic virus	92.92	212	15	0	5	216	7243	7032	3.00E-83	309
TR7406\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	94.64	280	15	0	1	280	2677	2956	6.00E-121	435
TR7406\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	92.12	241	19	0	1	241	2677	2917	1.00E-92	340
TR7406\|c0_g1_i3	NC_002634.1	Soybean mosaic virus	94.16	274	16	0	1	274	2677	2950	5.00E-116	418
TR7406\|c0_g1_i4	NC_002634.1	Soybean mosaic virus	93.36	241	16	0	1	241	2677	2917	1.00E-97	357
TR8100\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	97.86	234	5	0	12	245	6060	6293	4.00E-112	405
TR9520\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	95.06	385	19	0	1	385	8268	7884	2.00E-172	606
TR9520\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	96.65	239	8	0	4	242	8122	7884	6.00E-110	398
TR9520\|c0_g1_i3	NC_002634.1	Soybean mosaic virus	94.66	356	19	0	1	356	8268	7913	2.00E-156	553
TR9520\|c0_g1_i4	NC_002634.1	Soybean mosaic virus	94.38	356	20	0	1	356	8268	7913	9.00E-155	547
TR9520\|c0_g1_i5	NC_002634.1	Soybean mosaic virus	95.06	385	19	0	1	385	8268	7884	2.00E-172	606
TR9520\|c0_g1_i6	NC_002634.1	Soybean mosaic virus	96.19	210	8	0	4	213	8122	7913	8.00E-94	344
TR9520\|c0_g1_i7	NC_002634.1	Soybean mosaic virus	96.88	385	12	0	1	385	8268	7884	0	645
TR13605\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	92.25	400	31	0	10	409	8665	9064	8.00E-161	568
TR13605\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	94.75	400	21	0	10	409	8665	9064	2.00E-177	623
TR15892\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	92.64	231	17	0	2	232	5845	5615	2.00E-90	333
TR20496\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	96.88	224	7	0	1	224	2087	1864	3.00E-103	375
TR22770\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	91.67	240	20	0	1	240	6413	6652	2.00E-90	333
TR22770\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	92.53	281	21	0	2	282	6372	6652	2.00E-111	403
TR25078\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	88.54	253	29	0	1	253	8730	8478	1.00E-82	307
TR25078\|c0_g2_i1	NC_002634.1	Soybean mosaic virus	94.72	246	13	0	16	261	8627	8382	2.00E-105	383
TR25078\|c0_g2_i2	NC_002634.1	Soybean mosaic virus	93.7	349	22	0	1	349	8730	8382	2.00E-147	523
TR25078\|c0_g2_i3	NC_002634.1	Soybean mosaic virus	95.72	187	8	0	43	229	8568	8382	5.00E-81	302
TR25078\|c0_g2_i4	NC_002634.1	Soybean mosaic virus	90.91	253	23	0	1	253	8730	8478	1.00E-92	340
TR32819\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	91.7	265	22	0	2	266	2515	2251	6.00E-101	368
TR32819\|c0_g2_i1	NC_002634.1	Soybean mosaic virus	92.08	265	21	0	2	266	2515	2251	1.00E-102	374
TR34507\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	87.27	377	44	4	4	378	3523	3149	1.00E-118	427
TR37651\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	87.61	218	24	3	2	218	410	625	2.00E-65	250
TR37651\|c0_g3_i1	NC_002634.1	Soybean mosaic virus	87.27	487	57	4	2	487	410	892	1.00E-155	551
TR37706\|c0_g2_i1	NC_002634.1	Soybean mosaic virus	90.51	274	24	2	1	273	1128	1400	9.00E-99	361
TR41793\|c1_g1_i1	NC_002634.1	Soybean mosaic virus	92.89	394	28	0	1	394	7483	7876	2.00E-162	573
TR41793\|c1_g1_i2	NC_002634.1	Soybean mosaic virus	93.15	438	29	1	1	437	7483	7920	0	641
TR41793\|c1_g1_i3	NC_002634.1	Soybean mosaic virus	91.55	213	18	0	23	235	7486	7698	8.00E-79	294
TR41793\|c1_g1_i4	NC_002634.1	Soybean mosaic virus	93.93	445	27	0	1	445	7483	7927	0	673
TR41793\|c1_g1_i5	NC_002634.1	Soybean mosaic virus	91.59	226	19	0	1	226	7473	7698	2.00E-84	313
TR41793\|c1_g1_i6	NC_002634.1	Soybean mosaic virus	93.03	445	31	0	1	445	7483	7927	0	651
TR41793\|c1_g1_i7	NC_002634.1	Soybean mosaic virus	91.17	419	37	0	1	419	7473	7891	2.00E-161	569
TR44246\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	87.9	157	18	1	87	242	477	633	2.00E-45	183
TR44822\|c4_g1_i1	NC_002634.1	Soybean mosaic virus	97.83	460	10	0	2	461	843	384	0	795
TR44822\|c4_g1_i2	NC_002634.1	Soybean mosaic virus	97.65	765	18	0	2	766	843	79	0	1314
TR44822\|c4_g1_i3	NC_002634.1	Soybean mosaic virus	97.27	622	14	1	2	623	843	225	0	1051
TR44822\|c4_g2_i1	NC_002634.1	Soybean mosaic virus	90.13	1256	122	2	1	1255	1991	737	0	1631
TR44822\|c4_g2_i2	NC_002634.1	Soybean mosaic virus	91.46	820	70	0	1	820	1918	1099	0	1127
TR44822\|c4_g2_i3	NC_002634.1	Soybean mosaic virus	92.75	483	35	0	1	483	1991	1509	0	699
TR44822\|c4_g2_i4	NC_002634.1	Soybean mosaic virus	88.67	256	29	0	1	256	1617	1362	2.00E-84	313
TR44822\|c4_g2_i5	NC_002634.1	Soybean mosaic virus	93.9	246	15	0	19	264	1853	1608	4.00E-102	372
TR44822\|c4_g2_i6	NC_002634.1	Soybean mosaic virus	94.81	231	12	0	19	249	1853	1623	9.00E-99	361
TR44822\|c4_g2_i7	NC_002634.1	Soybean mosaic virus	94.15	410	24	0	1	410	1918	1509	5.00E-178	625
TR44822\|c5_g1_i1	NC_002634.1	Soybean mosaic virus	95.98	994	40	0	2	995	5991	6984	0	1615
TR44822\|c5_g1_i2	NC_002634.1	Soybean mosaic virus	94.11	3599	207	4	2	3596	5991	9588	0	5467
TR44822\|c5_g2_i1	NC_002634.1	Soybean mosaic virus	93.3	224	15	0	4	227	8124	8347	6.00E-90	331
TR44822\|c5_g1_i3	NC_002634.1	Soybean mosaic virus	96.21	501	19	0	2	502	5991	6491	0	821
TR44822\|c5_g1_i4	NC_002634.1	Soybean mosaic virus	92.81	292	21	0	2	293	5991	6282	1.00E-117	424
TR44822\|c6_g1_i1	NC_002634.1	Soybean mosaic virus	95.07	1015	50	0	1	1015	6049	5035	0	1598
TR44822\|c6_g2_i1	NC_002634.1	Soybean mosaic virus	97.64	212	5	0	10	221	4930	4719	6.00E-100	364
TR44822\|c6_g2_i2	NC_002634.1	Soybean mosaic virus	95.83	240	10	0	1	240	5051	4812	4.00E-107	388
TR44822\|c6_g2_i3	NC_002634.1	Soybean mosaic virus	97.52	1372	34	0	1	1372	5146	3775	0	2346
TR44822\|c6_g2_i4	NC_002634.1	Soybean mosaic virus	96.59	293	10	0	1	293	5146	4854	2.00E-136	486
TR44822\|c6_g3_i1	NC_002634.1	Soybean mosaic virus	95.8	691	27	2	5	694	2746	2057	0	1114
TR44822\|c6_g3_i2	NC_002634.1	Soybean mosaic virus	96.69	877	29	0	2	878	2822	1946	0	1459
TR44822\|c6_g4_i1	NC_002634.1	Soybean mosaic virus	95.39	1149	53	0	1	1149	3889	2741	0	1829
TR44822\|c6_g4_i2	NC_002634.1	Soybean mosaic virus	94.89	333	17	0	1	333	3598	3266	5.00E-147	521
TR45256\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	93.49	261	17	0	4	264	6897	6637	4.00E-107	388
TR45256\|c0_g1_i2	NC_002634.1	Soybean mosaic virus	94.32	229	13	0	2	230	6865	6637	5.00E-96	351
TR47685\|c0_g1_i1	NC_002634.1	Soybean mosaic virus	92.53	281	21	0	1	281	5082	5362	2.00E-111	403
TR47685\|c0_g2_i1	NC_002634.1	Soybean mosaic virus	92.53	281	21	0	1	281	5082	5362	2.00E-111	403
TR44246\|c0_g1_i1	NC_003397.1	Bean common mosaic virus	81.86	408	68	6	490	894	458	862	2.00E-91	339
TR19277\|c0_g2_i1	NC_003617.1	Lettuce infectious yellows virus RNA1	75.34	146	29	6	467	607	6837	6694	1.00E-08	63.9
TR45572\|c0_g2_i1	NC_012910.1	Lettuce chlorosis virus RNA2	87.96	191	22	1	15	205	8555	8366	1.00E-57	224
TR29303\|c0_g1_i1	NC_002034.1	Cucumber mosaic virus RNA1	91.28	298	26	0	4	301	1334	1631	1.00E-112	407

Fig. 1

De novo assembly of the SMV isolate China using transcriptome data. (A) Size distribution of virus-associated contigs. Red bars indicate SMV-associated contigs. The four other viruses are indicated with their respective contig lengths. (B) Alignment of the 79 SMV-associated contigs on the assembled genome of the SMV isolate China using the BWA program. The black bar indicates the reference SMV genome. The sequence alignment was visualized by the Tablet program. (C) Genome organization of the SMV isolate China. The nucleotide positions of the two proteins, GP1 and GP2, are indicated.

De novo genome assembly of SMV from a soybean transcriptome

Of the identified viruses, SMV was by far the most abundant in the soybean seeds. The 79 contigs associated with SMV covered most of the SMV reference genome (Table 2). These 79 contigs were mapped on the SMV reference genome (accession number NC_002634.1) (Eggenberger et al., 1989) (Fig. 1B). After sequence alignment followed by manual modification, we assembled a nearly complete consensus genome of SMV, referred to as SMV China (Fig. 1C). SMV China is composed of 9,507 nucleotides (nt) encoding two proteins, GP1 and GP2. GP1 encodes a polyprotein (nt 54 to 9,254), which is further cleaved into ten mature proteins: P1 (P1 proteinase), HC-Pro (helper component proteinase), P3 (P3 protein), 6K1 (6K1 protein), CI (cylindrical inclusion), 6K2, NIa-VPg (nuclear inclusion protein a-genome linked viral protein), NIa-Pro, NIb (nuclear inclusion protein b), and coat protein (CP). GP2 encodes the PIPO (Pretty Interesting Potyviridae ORF) protein (nt 2,804 to 3,031) (Fig. 1C).
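How far a set of contigs covers a reference can be checked by merging the subject start and end coordinates of the BLAST hits. The Python sketch below does this for seven of the longer SMV hits in Table 2; the function name is hypothetical, and the selection of rows is only an illustrative subset, not the full set of 79 contigs.

```python
def merge_intervals(spans):
    """Merge overlapping or adjacent 1-based inclusive intervals.

    Each span may be given in either orientation, since minus-strand
    hits have a subject start larger than the subject end.
    """
    ordered = sorted((min(a, b), max(a, b)) for a, b in spans)
    merged = []
    for start, end in ordered:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# (subject start, subject end) of seven SMV hits copied from Table 2.
smv_spans = [(843, 79), (1991, 737), (2822, 1946), (3889, 2741),
             (5146, 3775), (6049, 5035), (5991, 9588)]

merged = merge_intervals(smv_spans)
covered_bases = sum(end - start + 1 for start, end in merged)
```

With these seven hits alone, the merged interval already runs from nt 79 to nt 9,588 of the reference, which is consistent with the statement that the contigs cover most of the reference genome.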

Phylogenetic relationships of the SMV isolate China

In order to examine the genetic relationships of the assembled SMV China with known SMV isolates, we constructed phylogenetic trees. The phylogenetic tree based on complete SMV genome sequences showed two groups of SMV isolates (Fig. 2A). SMV China belongs to group B along with two SMV isolates from South Korea. In the tree based on polyprotein sequences, SMV China, placed in group C, was distantly related to the other SMV isolates (Fig. 2B). The phylogenetic tree based on PIPO protein sequences confirmed that SMV China is a member of SMV belonging to group A, which contains seven viruses including BPMV (Fig. 2C). Based on the phylogenetic analyses, the consensus genome of SMV China appears to be genetically close to the SMV isolates from South Korea.

Fig. 2

Phylogenetic relationship of the assembled SMV isolate China with known SMV isolates. Phylogenetic trees of SMV isolates using complete genomes (A), polyproteins (B), and PIPO sequences (C). The respective genome and protein sequences were blasted against the NCBI database, and highly matched sequences were used to construct phylogenetic trees in the MEGA6 program by the neighbor-joining method with 1,000 bootstrap replicates. The Kimura 2-parameter and Poisson substitution models were used for nucleotide and protein sequences, respectively.
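The Kimura 2-parameter model named in the caption estimates the distance between two aligned nucleotide sequences from the proportion of transitional differences P and transversional differences Q as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The Python sketch below is a minimal illustration of that formula only, not of MEGA6 itself; the function name and the toy sequences are hypothetical.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Columns containing gaps or ambiguous bases are ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    p = sum(1 for pair in pairs if pair in TRANSITIONS) / n
    q = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITIONS) / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Toy pair with a single C/T transition in ten sites (hypothetical data).
distance = k2p_distance("ACGTACGTAC", "ACGTACGTAT")
```

For very divergent sequences the argument of the logarithm can become non-positive, in which case the distance is undefined and `math.log` raises an error.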

Single nucleotide variations of SMV in the soybean seeds

It is well known that RNA viruses exist as quasispecies, with several variants present in an infected host. Therefore, we examined single nucleotide variations (SNVs) for SMV in the soybean seeds. The identified SMV China was used as a reference. After BWA alignment of raw data against SMV China, SNVs were identified using SAMtools (Fig. 3A). The SNVs in this study were derived from a population of different isolates. As a result, we identified 780 SNVs (Supplementary Table 1). SNVs were evenly distributed along the SMV genome (Fig. 3B). Most SNVs were single nucleotide polymorphisms (SNPs), except for one InDel (CAGG to CAGGAGG) at nt 640 of SMV China (Supplementary Table 1). Four substitution types, C-U (190 SNVs), U-C (180 SNVs), A-G (168 SNVs), and G-A (155 SNVs), were the most frequently identified (Fig. 3C). Based on the SNV results, the mutation rate for SMV in the soybean seeds, calculated as the number of SNVs divided by the genome length (780/9,507), was 8.2045%, indicating a high level of mutations in the SMV RNA genome. In addition, we calculated the ratio of transitions to transversions (Ts/Tv). The Ts/Tv ratio for SMV China was 8.06 (693/86).
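The summary figures in this paragraph follow directly from the reported counts. The Python sketch below simply reproduces that arithmetic; all numbers are taken from the text, and the variable names are illustrative.

```python
# Counts of the four transition types reported in the text (Fig. 3C).
transition_counts = {"C-U": 190, "U-C": 180, "A-G": 168, "G-A": 155}

transitions = sum(transition_counts.values())
transversions = 86          # denominator of the reported Ts/Tv ratio
indels = 1                  # the single InDel at nt 640
total_snvs = transitions + transversions + indels

genome_length = 9507        # length of the SMV China consensus in nt

ts_tv_ratio = transitions / transversions
variable_site_percent = total_snvs / genome_length * 100
```

The four transition types add up to 693, which together with 86 transversions and one InDel accounts for all 780 SNVs.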

Fig. 3

SNVs of SMV in the soybean seed transcriptome. (A) Raw data were mapped on the genome sequence of SMV isolate China using BWA and visualized by the Tablet program. (B) The positions of the identified single nucleotide variations on the SMV genome were visualized by the Tablet program. Detailed information for the SNVs can be found in Supplementary Table 1. (C) The numbers of identified SNVs of SMV in the soybean seed transcriptome.

The amount of viral RNA in the soybean transcriptome

It is also of interest to examine the amount of viral RNA in the analyzed soybean transcriptome. Of the 116,108 contigs, SMV-associated contigs accounted for 0.068% (79 contigs). The total length of the assembled contigs was 67,363,642 bp and the total length of virus-associated contigs was 36,022 bp, accounting for 0.0535%. Virus-associated reads accounted for 0.0529% (39,403/74,431,152) of all reads. Moreover, we calculated the SMV copy number within the soybean transcriptome, resulting in 414 SMV copies, which is highly correlated with the sequence coverage of the SMV genome. This result indicates high variability of SMV genome.
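The proportions above can be rechecked with a few lines of Python. The last calculation is a sketch of one common way to estimate mean coverage (mapped reads times read length, divided by genome length). The read length of 100 nt is an assumption that is not stated in the text; it is used here only because it gives a value close to the reported 414, and the authors may have calculated the copy number differently.

```python
total_contigs = 116108
smv_contigs = 79
total_contig_bases = 67363642
viral_contig_bases = 36022
total_reads = 74431152
viral_reads = 39403
genome_length = 9507        # SMV China consensus length in nt

ASSUMED_READ_LENGTH = 100   # assumption: not given in the source text

contig_percent = smv_contigs / total_contigs * 100
base_percent = viral_contig_bases / total_contig_bases * 100
read_percent = viral_reads / total_reads * 100

# Mean depth = total mapped bases divided by genome length.
mean_coverage = viral_reads * ASSUMED_READ_LENGTH / genome_length
```

Under that assumption the mean coverage comes out at roughly 414-fold, in line with the value quoted in the text.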

Discussion

The development of NGS has provided a wide variety of DNA and RNA sequencing data (Metzker, 2010). The main purpose of DNA and RNA sequencing is to elucidate the genome and transcriptome of target eukaryotic and prokaryotic organisms (Morozova and Marra, 2008). In the case of bacteria, metagenomics using 16S rRNA sequences, which are highly conserved among bacterial species, is intensively performed to study bacterial communities under specific conditions (Wang and Qian, 2009). However, unlike bacteria, viruses do not share any universally conserved sequence, and the genomes of viruses are mostly very small (Edwards and Rohwer, 2005). Therefore, virus-specific sequencing usually requires a purification step before NGS. For example, extraction of double-stranded RNAs from virus-infected organisms followed by NGS is one of the efficient approaches to identify viruses (Yanagisawa et al., 2016). Moreover, sequencing of small RNAs is an alternative technique for virus identification and genome assembly (Vodovar et al., 2011). In addition, RNA-Seq is a good technique to identify viruses that have a poly(A) tail. Moreover, several recent studies demonstrated that even viruses and viroids without a poly(A) tail can be detected by RNA-Seq (Burger and Maree, 2015; Jo et al., 2016).

In this study, we identified several viruses infecting soybean. The transcriptome analyzed here was originally generated for expression profiling of soybean seed development. Thus, it is not derived from a single condition but from six developmental seed stages, in which several seeds might have been pooled for total RNA extraction. Although we identified five viruses that might infect soybean, the four viruses other than SMV were each identified based on only a single contig, and their presence should be validated by other methods. In many cases, a partial viral sequence or contig is homologous to a closely related virus, not the target virus. Thus, it is possible that the identified virus-associated contigs are derived not from the viruses named here but from other viruses that share similar sequences. SMV is seed-borne and transmitted by aphids (Domier et al., 2011). Soybean seeds infected by SMV often display discoloration and mottling. In addition, BCMV is known as a seed-borne virus (Refugee et al., 1987). Seed-borne viruses can either infect the embryo, as BCMV does, or be carried on the seed coat (Jafarpour et al., 1979). In addition, seed transmission of CMV has been identified in several plants such as pepper, spinach, and lupin (Ali and Kobayashi, 2010; Wylie et al., 1993; Yang et al., 1997). Based on previous knowledge of seed-borne viruses, the identification of SMV, BCMV, and CMV in the soybean seed is not surprising. In addition, the infection of green bean (Phaseolus vulgaris L.) by LCV has been recently reported (Ruiz et al., 2014). However, the infection of soybean seed by LIYV and LCV, which are members of the genus Crinivirus, should be validated.

The soybean transcriptome was derived not from a single soybean seed but from a mixture of seeds, which were divided into six developmental stages. The SMV-associated contigs assembled in this study might be shorter than virus-associated contigs from a single plant because the transcriptome contains several variants of SMV. Therefore, the assembled genome of SMV China is a consensus sequence of several SMV variants. Although SMV-associated sequences accounted for only about 0.05% of the total transcriptome, the coverage of the SMV genome in this study was about 414-fold, and this coverage was also visualized by the alignment of raw data on the genome of SMV China. As a result, there were enough SMV-associated sequence data to assemble the SMV genome de novo. Based on the assembled SMV genome, we could also identify SNVs for SMV. As expected, we found many SNVs, which resulted from a mixture of SMV variants infecting diverse seed samples. However, we could not determine the exact number of variants. Furthermore, the identification of SNVs in SMV demonstrated that mutations were not confined to a specific region but occurred in several regions of the SMV genome. The presence of several SMV variants in the soybean seeds is a very interesting finding, suggesting that SMV replicates actively in the developing seeds; this might be correlated with some of the disease symptoms caused by SMV in soybean seeds. It might be of interest to examine replication rates of SMV in different developmental stages and tissues; this could provide further evidence of the quasispecies nature of SMV in the near future.

Phylogenetic analyses suggested that the identified SMV isolate China was very different from other known SMV isolates based on polyprotein sequences. However, SMV isolate China appears to be closely related to two SMV isolates from South Korea, suggesting a correlation between geographical region and the phylogeny of SMV isolates. Our SNV analysis in the soybean seeds indicates a high level of quasispecies variation for SMV. Mutations were not confined to a specific region but were found in most regions of the SMV genome. Furthermore, we found that A-G and C-U conversions, and the reverse conversions (G-A and U-C), were frequent. Taken together, our bioinformatics analyses using a soybean seed transcriptome identified five viruses infecting the soybean seeds. Of these five viruses, we de novo assembled the genome of SMV isolate China and analyzed SNVs, revealing the quasispecies nature of SMV in the soybean seeds for the first time. The approaches and analyses in this study are valuable for virus-associated studies using NGS-based transcriptome data.

References

1. Olena Morozova; Marco A Marra. Applications of next-generation sequencing technologies in functional genomics. Genomics. 2008-08-24. Review.

2. Koichiro Tamura; Glen Stecher; Daniel Peterson; Alan Filipski; Sudhir Kumar. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol. 2013-10-16.

3. Nicolas Vodovar; Bertsy Goic; Hervé Blanc; Maria-Carla Saleh. In silico reconstruction of viral genomes from small RNAs improves virus-derived small interfering RNA profiling. J Virol. 2011-08-31.

4. Michael L Metzker. Sequencing technologies - the next generation. Nat Rev Genet. 2009-12-08. Review.

5. Leslie L Domier; Houston A Hobbs; Nancy K McCoppin; Charles R Bowen; Todd A Steinlage; Sungyul Chang; Yi Wang; Glen L Hartman. Multiple loci condition seed transmission of soybean mosaic virus (SMV) and SMV-induced seed coat mottling in soybean. Phytopathology. 2011-06.

6. M J Messina. Legumes and soybeans: overview of their nutritional profiles and health effects. Am J Clin Nutr. 1999-09. Review.

7. Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009-06-08.

8. Iain Milne; Micha Bayer; Linda Cardle; Paul Shaw; Gordon Stephen; Frank Wright; David Marshall. Tablet--next generation sequence assembly visualization. Bioinformatics. 2009-12-04.

9. Brian J Haas; Alexie Papanicolaou; Moran Yassour; Manfred Grabherr; Philip D Blood; Joshua Bowden; Matthew Brian Couger; David Eccles; Bo Li; Matthias Lieber; Matthew D MacManes; Michael Ott; Joshua Orvis; Nathalie Pochet; Francesco Strozzi; Nathan Weeks; Rick Westerman; Thomas William; Colin N Dewey; Robert Henschel; Richard D LeDuc; Nir Friedman; Aviv Regev. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013-07-11.

10. Qing-Xin Song; Qing-Tian Li; Yun-Feng Liu; Feng-Xia Zhang; Biao Ma; Wan-Ke Zhang; Wei-Qun Man; Wei-Guang Du; Guo-Dong Wang; Shou-Yi Chen; Jin-Song Zhang. Soybean GmbZIP123 gene enhances lipid content in the seeds of transgenic Arabidopsis plants. J Exp Bot. 2013-08-20.