You-Yu Lin, Chia-Hung Hsieh, Jiun-Hong Chen, Xuemei Lu, Jia-Horng Kao, Pei-Jer Chen, Ding-Shinn Chen, Hurng-Yi Wang.
Abstract
BACKGROUND: The accuracy of metagenomic assembly is usually compromised by high levels of polymorphism, because divergent reads from the same genomic region are recognized as different loci when sequenced and assembled together. A viral quasispecies is a group of abundant, diversified, genetically related viruses found in a single carrier. Current mainstream assembly methods, such as Velvet and SOAPdenovo, were not originally intended for the assembly of such metagenomic data; new methods are therefore needed to provide accurate and informative assembly results for metagenomic data.
Keywords: Assembly pipeline; Hepatitis B virus; Metagenomics; Next generation sequencing; Sequence assembly
Year: 2017 PMID: 28446139 PMCID: PMC5406902 DOI: 10.1186/s12859-017-1630-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Average assembly statistics of all 12 data sets using BBAP with multiple approaches
| | PDᵃ | FDᵇ | SRᶜ | PDRᵈ |
|---|---|---|---|---|
| RRs | 214,942 | 21,494,295 | 21,494,295 | 21,494,295 |
| HQRs | 143,912 | 14,388,844 | 14,388,844 | 14,388,844 |
| URs | 27,150 | 860,144 | 860,144 | 860,144 |
| HRURs | 6,264 | 60,228 | 60,228 | 60,228 |
| RiHRURs | 116,555 | 13,388,423 | 13,388,423 | 13,388,423 |
| Contigs assembledᵉ | 2.1 | 46.0 | 1.0 | 3.9 |
| Max contig length | 3,119 | 1,473 | 3,207 | 3,148 |
| Average contig length | 2,319 | 321 | 3,207 | 1,268 |
| % of Mapped HRURs | 95.9% | 70.3% | 67.4% | 69.9% |
| % of Mapped RiHRURs | 80.4% | 68.7% | 82.7% | 84.5% |
The full data sets were used in the BBAP assembly with FD, SR, and PDR approaches, whereas partial data sets consisting of 1% of randomly selected RRs were used in the BBAP PD assembly approach
ᵃ Partial data set de novo assembly
ᵇ Full data set de novo assembly
ᶜ Sanger reference assembly
ᵈ Partial data set reference assembly of the full data set
ᵉ Only contigs with an assembled length > 150 bp are shown
RRs, raw reads; HQRs, high-quality reads (quality score threshold = 20, i.e., sequencing error rate = 1%); URs, unique representative reads; HRURs, high-redundancy unique representative reads (unique representative reads with redundancy threshold = 5); RiHRURs, reads included in high-redundancy unique representative reads
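The read categories defined above can be illustrated with a minimal Python sketch. It is not BBAP's implementation: it assumes that a read counts as high quality only when every base meets the Phred threshold, that reads are collapsed by exact sequence identity, and that "redundancy threshold = 5" means at least 5 copies. The function names (`phred_to_error_prob`, `summarize_reads`) are hypothetical. The only fact taken as given is the standard Phred relation, under which a quality score of 20 corresponds to an error rate of 1%.

```python
from collections import Counter


def phred_to_error_prob(q):
    """Standard Phred relation: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def is_high_quality(quals, threshold=20):
    # Assumption: every base in the read must meet the threshold.
    return all(q >= threshold for q in quals)


def summarize_reads(reads, quality_threshold=20, redundancy_threshold=5):
    """Count read categories for (sequence, per-base Phred scores) pairs."""
    reads = list(reads)
    # HQRs: reads that pass the quality filter.
    hqrs = [seq for seq, quals in reads
            if is_high_quality(quals, quality_threshold)]
    # URs: one representative per distinct sequence (exact match, assumed).
    counts = Counter(hqrs)
    # HRURs: unique reads seen at least `redundancy_threshold` times (assumed >=).
    hrurs = {seq: n for seq, n in counts.items() if n >= redundancy_threshold}
    return {
        "RRs": len(reads),
        "HQRs": len(hqrs),
        "URs": len(counts),
        "HRURs": len(hrurs),
        # RiHRURs: all reads represented by the HRURs.
        "RiHRURs": sum(hrurs.values()),
    }
```

For example, six copies of one high-quality read, two copies of another, and one low-quality read give 9 RRs, 8 HQRs, 2 URs, 1 HRUR, and 6 RiHRURs under these assumptions.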
Fig. 1 Assembly results of the full and partial D2_1 data sets by (a) BBAP, (b) Velvet, (c) SOAPdenovo, and (d) Genovo. The contigs were aligned to the Sanger reference sequence. MetaVelvet assembly results for both the full and partial D2_1 data sets were identical to those of Velvet and are thus not shown
Comparison of D2_1 assembly results with different methods and different data set sizes
| | Max length | Average length | Number of contigs | % of HBV genome recovered | Contigs that map to reference HBV genome | Contigs with HBV structural variants |
|---|---|---|---|---|---|---|
| BBAP/FD | 998 | 263 | 52 | 100% | 16 | 30 |
| Velvet/Full | 1102 | 303 | 13 | 19% | 4 | 0 |
| MetaVelvet/Full | 1102 | 303 | 13 | 19% | 4 | 0 |
| SOAPdenovo/Full | 934 | 340 | 8 | 14% | 3 | 0 |
| Genovo/Full | 1352 | 395 | 60 | 44% | 4 | 34 |
| BBAP/PD | 2924 | 692 | 6 | 100% | 3 | 3 |
| Velvet/Partial | 2576 | 973 | 3 | 89% | 3 | 0 |
| MetaVelvet/Partial | 2576 | 973 | 3 | 89% | 3 | 0 |
| SOAPdenovo/Partial | 1723 | 390 | 10 | 95% | 10 | 0 |
| Genovo/Partial | 2427 | 481 | 12 | 91% | 4 | 7 |
Fig. 2 Schematic summary of corresponding HBV genome (NC_003977) regions for assembled contigs identified as HBV variants. Arrows indicate the 5′ to 3′ direction. Only reads containing the sequences spanning the junction regions were assembled separately into variant contigs; reads spanning non-junction regions of the variant contigs (dotted lines) were assembled into the main HBV contig. The L1 sequence, which is similar to T5, resulted from HBV variant validation with PCR using specialized primers followed by Sanger sequencing. Positions correspond to NC_003977, with dotted lines representing the remaining portion of the circular HBV genome, and the boxed section indicating the encapsidation signal (or epsilon, ε)
Assembly results of in silico generated data sets from the reference HBV genome by different methodsᵃ
| Data set size | Method | BBAP FD | | | Velvet | | | SOAPdenovo | | | Genovo | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Error rate | 10⁻⁴ | 10⁻³ | 10⁻² | 10⁻⁴ | 10⁻³ | 10⁻² | 10⁻⁴ | 10⁻³ | 10⁻² | 10⁻⁴ | 10⁻³ | 10⁻² |
| 55X | Coverage | 1 | 1 | 1 | 1 | 1 | 1 | | | 1 | 1 | 1 | 1 |
| | Accuracy | 1 | 1 | 0.99 | 1 | 1 | 1 | | | 1 | 1 | 1 | 1 |
| | # of contigs | 1 | 1 | 2 | 1 | 1 | 1 | | | 1 | 1 | 1 | 1 |
| 557X | Coverage | 1 | 1 | 1 | 1 | 0.99 | 1 | | 1 | | 1 | 1 | 1 |
| | Accuracy | 1 | 1 | 1 | 1 | 1 | 0.99 | | 0.99 | | 1 | 1 | 1 |
| | # of contigs | 1 | 1 | 2 | 1 | 1 | 9 | | 1 | | 1 | 1 | 1 |
| 5,579X | Coverage | 1 | 1 | 1 | 0.99 | 0.96 | | 1 | | | 1 | 1 | 1 |
| | Accuracy | 1 | 1 | 1 | 1 | 1 | | 1 | | | 1 | 1 | 0.99 |
| | # of contigs | 1 | 1 | 1 | 1 | 6 | | 1 | | | 1 | 1 | 1 |
| 55,799X | Coverage | 1 | 1 | 1 | 0.98 | | | | | | 1 | 1 | 1 |
| | Accuracy | 1 | 1 | 1 | 1 | | | | | | 1 | 1 | 0.99 |
| | # of contigs | 1 | 1 | 1 | 3 | | | | | | 3 | 2 | 5 |
ᵃ Results represent averages of the assembly results of 5 replicate data sets. Bold areas indicate average assembly results with <80% coverage
Fig. 3 Flow chart summary of BBAP. (a) The pipeline is divided into four major steps: quality control (QC), blast and cluster (BC), alignment and consensus determination (AC), and contig assembly (CA). (b) BBAP can perform de novo assembly, reference assembly, or partial de novo-reference assembly, which includes the initial de novo assembly of a randomly extracted partial data set followed by the reference assembly of the full data set using the assembly results from the initial assembly as reference sequences
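The partial de novo-reference workflow described in the caption can be sketched in Python as a two-stage procedure. This only illustrates the control flow and is not BBAP's code: `sample_partial` and `partial_denovo_reference` are hypothetical names, the assembler callables `denovo_assemble` and `reference_assemble` are placeholders supplied by the caller, and the default 1% sampling fraction is an assumption borrowed from the partial data sets reported in the first table.

```python
import random


def sample_partial(reads, fraction=0.01, seed=0):
    """Randomly extract a partial data set (assumed default: 1% of reads)."""
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    k = max(1, round(len(reads) * fraction))
    return rng.sample(reads, k)


def partial_denovo_reference(reads, denovo_assemble, reference_assemble,
                             fraction=0.01, seed=0):
    """Two-stage assembly: de novo on a subsample, then reference on all reads.

    `denovo_assemble(partial_reads)` and `reference_assemble(all_reads, refs)`
    are hypothetical callables standing in for real assemblers.
    """
    partial = sample_partial(reads, fraction, seed)
    # Stage 1: de novo assembly of the randomly extracted partial data set.
    initial_contigs = denovo_assemble(partial)
    # Stage 2: reference assembly of the full data set, using the
    # stage-1 contigs as the reference sequences.
    return reference_assemble(reads, initial_contigs)
```

Passing the assemblers in as functions keeps the sketch independent of any particular tool, so the same skeleton could wrap whichever de novo and reference assemblers are available.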