Cindy Huang1,2, Vichetra Sam1,3, Sophie Du1,3, Tuan Le1, Anthony Fletcher1, William Lau1, Kathleen Meyer1,3, Esther Asaki1,3, Da Wei Huang1, Calvin Johnson1.
Abstract
The third-generation sequencing technology PacBio has shown an ability to sequence full-length HIV amplicons. Compared with the short reads from Illumina shotgun sequencing, the long reads of PacBio offer a distinct advantage for comprehensively understanding the complexity of viral evolution at the quasispecies level (i.e. maintaining linkage information of variants). However, due to the high-noise nature of PacBio reads, it is still a challenge to build accurate contigs at high sensitivity. Most previously developed NGS assembly tools assume that the input reads are fairly accurate, which is largely true for data derived from Sanger or Illumina technologies. When these tools are applied to high-noise PacBio reads, they are largely driven by noise rather than true signal, which leads to poor results in most cases. In this study, we propose a de novo assembly procedure that comprises a positive-focused strategy and linkage-frequency noise reduction, making it more suitable for high-noise PacBio reads. We further tested the de novo assembly procedure on HIV PacBio benchmark data and clinical samples, where it accurately assembled the dominant and minor populations of HIV quasispecies as expected. The improved de novo assembly procedure shows potential to promote PacBio technology in HIV drug-resistance clinical detection, as well as in broader HIV phylogenetic studies.
Keywords: De Novo Assembly; HIV; PacBio; quasispecies
Year: 2018 PMID: 30310253 PMCID: PMC6166399 DOI: 10.6026/97320630014449
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 2. Linkage-frequency noise reduction. A. The frequency of an individual mutation at a given vertical position is measured as the number of reads containing the mutation divided by the total number of reads. The co-existing frequency of two mutations is measured as the number of reads containing both mutations divided by the total number of reads. True mutations (red circles) tend to co-exist on the same reads at a much higher frequency than random noise mutations (green stars). B. Based on benchmark dataset 1, all possible mutation-mutation pairs were examined. Each dot represents a pair of mutations, plotted by its expected co-existing frequency (vertical frequency of mutation 1 × vertical frequency of mutation 2) against its actual observed co-existing frequency (# reads containing both mutations / # total reads). Red: both mutations in the pair are previously known mutations. Blue: one mutation in the pair is a previously known mutation. Green: both mutations in the pair are unexpected mutations.
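The two quantities compared in panel B can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each read has already been reduced to a set of mutation labels, and the function name `linkage_frequencies` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def linkage_frequencies(reads):
    """Return {(mut_a, mut_b): (expected, observed)} for co-occurring pairs.

    `reads` is a list of sets of mutation labels (an assumed representation).
    expected = freq(mut_a) * freq(mut_b); observed = reads with both / total.
    Pairs that never share a read are not listed (their observed value is 0).
    """
    total = len(reads)
    single = Counter()  # reads containing each individual mutation
    pair = Counter()    # reads containing both mutations of a pair
    for muts in reads:
        single.update(muts)
        pair.update(combinations(sorted(muts), 2))
    result = {}
    for (a, b), n_both in pair.items():
        expected = (single[a] / total) * (single[b] / total)
        observed = n_both / total
        result[(a, b)] = (expected, observed)
    return result


# Toy data (hypothetical): A and B are linked on 4 of 10 reads.
toy_reads = [{"A", "B"}] * 4 + [{"A", "x"}, {"B", "y"}] + [set()] * 4
freqs = linkage_frequencies(toy_reads)
```

In the toy data, `freqs[("A", "B")]` holds an expected frequency of 0.25 (0.5 × 0.5) and an observed frequency of 0.4, which is the kind of excess over the independence expectation that the figure attributes to true mutations.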
Figure 3. Efficacy of the G test. The sensitivity-PPV curve for the test was examined on the PacBio benchmark data to determine the G statistic that maximizes PPV while achieving perfect sensitivity. The optimal G statistic occurs at a value of 940, corresponding to a sensitivity of 1.0, a PPV of 0.695, and an F-measure (harmonic mean of sensitivity and PPV) of 0.82. The plot depicts the F-measure against the full range of G-statistic cutoff values.
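As a rough sketch of the quantities in this caption, the following Python computes a G statistic and the F-measure. The caption does not say how the test is set up, so the 2×2 contingency table of independence used here (reads with both mutations, only one, or neither) is an assumption, and the function names are hypothetical.

```python
import math


def g_statistic(n_both, n_a_only, n_b_only, n_neither):
    """G statistic for independence of two mutations on a 2x2 table.

    G = 2 * sum(O * ln(O / E)), with E = row total * column total / N.
    Cells with O == 0 contribute nothing to the sum.
    """
    observed = [[n_both, n_a_only], [n_b_only, n_neither]]
    total = sum(map(sum, observed))
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                e = row[i] * col[j] / total
                g += o * math.log(o / e)
    return 2.0 * g


def f_measure(sensitivity, ppv):
    """Harmonic mean of sensitivity and PPV."""
    return 2.0 * sensitivity * ppv / (sensitivity + ppv)


# Values reported in the caption: sensitivity 1.0, PPV 0.695.
f_at_cutoff = f_measure(1.0, 0.695)
```

With the reported sensitivity and PPV, `f_at_cutoff` rounds to 0.82, in agreement with the caption. Under this setup, a pair of mutations would be kept when its G statistic exceeds the chosen cutoff (940 on the benchmark data).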
Figure 1. The sequential steps of the de novo assembly procedure. The procedure takes PacBio CCS reads as the only input data, and outputs final contig(s).
De novo assembly results on two PacBio HIV benchmark datasets
| Benchmark Admixture | | | PacBio De Novo Assembly | | |
| Dataset | Strain | Admixture Ratio | Contig | # reads | Accuracy |
| Benchmark 90:10 | HIV pLN4-3 | 90% | contig 1 | 15,020 | Exactly matched |
| Benchmark 90:10 | HIV BN10 | 10% | contig 2 | 764 | Exactly matched |
| Benchmark 99:1 | HIV pLN4-3 | 99% | contig 1 | 16,444 | Exactly matched |
| Benchmark 99:1 | HIV BN10 | 1% | contig 2 | 187 | Exactly matched |
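For orientation, the share of reads behind each contig can be computed directly from the read counts in the table. This is only arithmetic on the reported numbers. The source does not state that per-contig read counts should reproduce the nominal admixture ratios, so the comparison is illustrative, and the variable and function names are hypothetical.

```python
# Read counts per contig, copied from the table above.
contig_reads = {
    "Benchmark 90:10": {"contig 1": 15020, "contig 2": 764},
    "Benchmark 99:1": {"contig 1": 16444, "contig 2": 187},
}


def read_fractions(counts):
    """Fraction of a dataset's assembled reads assigned to each contig."""
    total = sum(counts.values())
    return {contig: n / total for contig, n in counts.items()}


fractions = {name: read_fractions(c) for name, c in contig_reads.items()}

for name, per_contig in fractions.items():
    for contig, frac in per_contig.items():
        print(f"{name} {contig}: {frac:.2%} of reads")
```

The minor contig accounts for roughly 4.8% of the assembled reads in the 90:10 benchmark and roughly 1.1% in the 99:1 benchmark.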