| Literature DB >> 35057847 |
Xiao Luo1,2, Xiongbin Kang1,2, Alexander Schönhuth3,4.
Abstract
Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in prevention, control and treatment of viral diseases. Current methods either can handle only relatively accurate short read data, or collapse haplotype-specific variations into consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirm this novelty and demonstrate the superiority of Strainline.Entities:
Keywords: Genome assembly; Haplotype; Long reads; SARS-CoV-2; Virus
Mesh:
Year: 2022 PMID: 35057847 PMCID: PMC8771625 DOI: 10.1186/s13059-021-02587-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1The workflow of Strainline. The reads with different colors are from different haplotypes or strains. The pink fork represents the sequencing error, i.e. mismatch, insertion or deletion. The two steps Seed-vs-all overlaps and Read clustering are executed simultaneously (see Algorithm 1 for details). Hap1_A and Hap1_B denote two subsequences (not full-length) of haplotype 1 (Hap1), the same for Hap2_A and Hap2_B. There still may be very few remaining sequencing errors in corrected reads such as in the corresponding read cluster of Hap1_A, and incorrectly clustered reads such as the ’blue read’ in the corresponding read cluster of Hap2_B. Nevertheless, these errors will be eliminated through Consensus step
Fig. 2The schematic diagram for the sequencing error correction procedure of raw reads. In the top region, the bold solid line denotes the target raw read i, and the overlapping reads of the target read i are drawn dashed outside of the read alignment pile and solid inside of it. The pink fork represents the sequencing error, i.e. mismatch, insertion or deletion. The read alignment pile is split into k small windows, representing as Win 1, Win 2,... Win k. DBG is short for de Bruijn Graph. Window consensus refers to the highest scoring sequences (see main text for explanations) through the DBGs of small windows over the read alignment pile. The region filled with gray rhombus between two window consensus denotes the overlap between them (30bp). We perform the error correction step for each raw read
Characteristics of benchmarking data sets
| Virus mixture | Virus type | #Strain | Genome size (bp) | Coverage | Divergence (%) | Strain abundance (%) |
|---|---|---|---|---|---|---|
| 5-strain HIV | HIV-1 | 5 | 9478–9719 | 20,000 × | 2.7–5.6 | 10, 15, 20, 25, 30 |
| 6-strain Poliovirus | Poliovirus-2 | 6 | 7428–7460 | 20,000 × | 0.2–5.5 | 2, 4, 8, 16, 20, 50 |
| 6-strain Poliovirus (la1) | Poliovirus-2 | 6 | 7428–7460 | 20,000 × | 0.2–5.5 | 0.1, 1, 2, 8, 20, 68.9 |
| 6-strain Poliovirus (la2) | Poliovirus-2 | 6 | 7428–7460 | 20,000 × | 0.2–5.5 | 0.01, 0.1, 1, 2, 8, 88.89 |
| 10-strain HCV | HCV-1a | 10 | 9273–9311 | 20,000 × | 2.8–7.4 | 5, 6, 7, 8, 9, 11, 12, 13, 14, 15 |
| 15-strain ZIKV | ZIKV | 15 | 10,251–10,269 | 20,000 × | 1.1–15.1 | 2, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 12 |
| 5-strain SARS-CoV-2 | SARS-CoV-2 | 5 | 26,574–29,903 | 20,000 × | 0.3–1.1 | 10, 15, 20, 25, 30 |
| 5-strain SARS-CoV-2 (la) | SARS-CoV-2 | 5 | 26,574–29,903 | 20,000 × | 0.3–1.1 | 0.1, 1, 5, 10, 83.9 |
| 5-strain PVY (Mock) | PVY | 5 | 9694–9701 | 5800 × | 3.6–21.6 | 9.3, 12.7, 21.1, 24.4, 32.5 |
| SARS-CoV-2 (Real) | SARS-CoV-2 | - | - | 12,000 × | - | - |
For each benchmarking data set, we specify the name of virus mixture, virus type, number of strains in the mixture, range of genome size, total sequencing coverage, pairwise divergence, and strain abundance spectrum. The pairwise divergence is equal to 1−ANI, where ANI (Average Nucleotide Identity) is calculated by FastANI [31]. In experimental data sets, 5-strain PVY is a mock community, that is the sequencing data is real, but the mixture is synthetic, whereas SARS-CoV-2 (Real) is a real sample so there is no ground truth for the strains. The data sets 6-strain Poliovirus (la1) and 6-strain Poliovirus (la2) are similar with 6-strain Poliovirus, except the lowest abundance (la) of strains extends to 0.1% and 0.01%, respectively. The data set 5-strain SARS-CoV-2 (la) is similar with 5-strain SARS-CoV-2 except the lowest abundance (la) of strains extends to 0.1%
Benchmarking results for simulated PacBio CLR reads. HC haplotype coverage, ER error rate (mismatches + indels + N’s), MC misassembled contigs proportion. NGA50 is labeled with “-” if the uniquely aligned blocks cover less than half of the reference length. The total sequencing coverage in this table is 20,000 ×. Note that CliqueSNV is a reference-based method, whereas the others are de novo assemblers. For running CliqueSNV, we have tried various strategies (see Additional file 1: Table S2) but only the results of the best strategy are reported here. * If contigs are full-length, this number represents the estimated number of haplotypes or strains in the virus mixture. Wtdbg2 consensus as reference, using reads corrected by Strainline error correction. High quality reference (the highest abundant strain) using reads corrected by Strainline error correction
| #Contigs* | HC (%) | N50 (bp) | NGA50 (bp) | ER (%) | MC(%) | |
|---|---|---|---|---|---|---|
| Strainline | 5 | 99.9 | 9697 | 9697 | 0.002 | 0.0 |
| Canu | 5 | 84.5 | 8227 | 8170 | 0.409 | 20.0 |
| Wtdbg2 | 1 | 15.5 | 7419 | - | 1.820 | 0.0 |
| CliqueSNV | 8 | 77.2 | 7419 | 7419 | 1.106 | 0.0 |
| Strainline | 6 | 99.9 | 7449 | 7444 | 0.074 | 0.0 |
| Canu | 6 | 62.7 | 7040 | 6399 | 0.538 | 0.0 |
| Wtdbg2 | 1 | 14.7 | 6575 | - | 0.244 | 0.0 |
| CliqueSNV | 4 | 49.9 | 7452 | 7438 | 0.433 | 0.0 |
| Strainline | 10 | 99.9 | 9294 | 9292 | 0.056 | 0.0 |
| Canu | 11 | 76.9 | 7703 | 7174 | 0.351 | 0.0 |
| Wtdbg2 | 2 | 13.6 | 7698 | - | 5.077 | 0.0 |
| CliqueSNV | 1 | 10.0 | 9310 | - | 1.963 | 0.0 |
| Strainline | 15 | 99.6 | 10,238 | 10,238 | 0.021 | 0.0 |
| Canu | 13 | 55.7 | 10,233 | 7129 | 0.189 | 0.0 |
| Wtdbg2 | 2 | 10.7 | 8773 | - | 1.693 | 0.0 |
| CliqueSNV | 1 | 6.7 | 10,268 | - | 1.627 | 0.0 |
| Strainline | 7 | 98.6 | 29,017 | 29,009 | 0.047 | 0.0 |
| Canu | 16 | 85.9 | 12,419 | 25,137 | 0.078 | 0.0 |
| Wtdbg2 | 1 | 20.6 | 29,158 | - | 0.360 | 0.0 |
| CliqueSNV | 1 | 21.1 | 29,903 | - | 0.007 | 0.0 |
Benchmarking results for simulated Oxford Nanopore reads. HC haplotype coverage, ER error rate (mismatches + indels + N’s), MC misassembled contigs proportion. NGA50 is labeled with ’-’ if the uniquely aligned blocks cover less than half of the reference length. The total sequencing coverage in this table is 20,000 ×
| #Contigs | HC (%) | N50 (bp) | NGA50 (bp) | ER (%) | MC(%) | |
|---|---|---|---|---|---|---|
| Strainline | 5 | 99.9 | 9702 | 9702 | 0.081 | 0.0 |
| Canu | 2 | 35.8 | 18,151 | 7634 | 1.730 | 50.0 |
| Wtdbg2 | 1 | 18.9 | 9046 | - | 1.327 | 0.0 |
| Strainline | 6 | 100.0 | 7454 | 7453 | 0.051 | 0.0 |
| Canu | 1 | 16.6 | 7446 | - | 0.646 | 0.0 |
| Wtdbg2 | - | - | - | - | - | - |
| Strainline | 10 | 99.9 | 9294 | 9294 | 0.023 | 0.0 |
| Canu | 1 | 10.0 | 9279 | - | 2.619 | 0.0 |
| Wtdbg2 | 2 | 18.4 | 8567 | - | 1.336 | 0.0 |
| Strainline | 16 | 98.3 | 10,244 | 10244 | 0.026 | 0.0 |
| Canu | 1 | 6.6 | 10,251 | - | 0.459 | 0.0 |
| Wtdbg2 | 3 | 17.1 | 9490 | - | 0.564 | 0.0 |
| Strainline | 6 | 99.9 | 29,299 | 29071 | 0.041 | 0.0 |
| Canu | 10 | 66.3 | 18,003 | 9492 | 0.062 | 0.0 |
| Wtdbg2 | 1 | 19.0 | 26,767 | - | 0.586 | 0.0 |
Benchmarking results for real Oxford Nanopore reads. HC haplotype coverage, ER error rate (mismatches + indels + N’s), MC misassembled contigs proportion. NGA50 is labeled with ’-’ if the uniquely aligned blocks cover less than half of the reference length. Note that the metrics for the SARS-CoV-2 real sample in this table are not necessarily correct but for reference only, because the ground truth is unknown and we only used the sequence of Wuhan-Hu-1 (NC_045512) as the truth for comparison. Strainline consensus as reference, using reads corrected by Strainline error correction. * Sometimes NGA50 still reports a value (5456 bp) even if HC <50% because contigs have overlaps (See https://github.com/ablab/quast/discussions/ 174 for the detailed explanation)
| #Contigs | HC (%) | N50 (bp) | NGA50 (bp) | ER (%) | MC(%) | |
|---|---|---|---|---|---|---|
| Strainline | 7 | 97.9 | 9538 | 9548 | 0.956 | 0.0 |
| Canu | 3 | 39.9 | 9665 | 5456* | 0.105 | 0.0 |
| Wtdbg2 | 2 | 26.0 | 7632 | - | 4.931 | 0.0 |
| Strainline | 1 | 99.9 | 29,565 | 29,565 | 0.832 | 0.0 |
| Canu | - | - | - | - | - | - |
| Wtdbg2 | 1 | 65.8 | 19,405 | 19,396 | 1.542 | 0.0 |
| CliqueSNV | 1 | 99.9 | 29,565 | 29,565 | 0.859 | 0.0 |
Absolute and relative errors of estimated haplotype abundances by Strainline on different virus mixtures. For each dataset, we present the average error over all assembled strains. Note that for the SARS-CoV-2 (SRP250446) real sample, the abundance estimation error is not shown because the ground truth is unknown. AFE absolute frequency error, RFE relative frequency error
| Datasets | AFE (%) | RFE (%) |
|---|---|---|
| 5-strain HIV | 0.04 | 0.18 |
| 6-strain Poliovirus | 2.17 | 48.35 |
| 10-strain HCV | 0.06 | 0.80 |
| 15-strain ZIKV | 0.18 | 4.93 |
| 5-strain SARS-CoV-2 | 3.40 | 17.49 |
| 5-strain HIV | 0.27 | 1.48 |
| 6-strain Poliovirus | 2.30 | 51.18 |
| 10-strain HCV | 0.28 | 3.05 |
| 15-strain ZIKV | 0.28 | 6.10 |
| 5-strain SARS-CoV-2 | 4.14 | 20.60 |
| 5-strain PVY | 4.48 | 23.07 |
Fig. 3Performance of Strainline with varying strain abundance (= relative frequency) and divergence in two-strain mixtures using simulated PacBio CLR reads. The x, y axis refers to the abundance ratio and the pairwise divergence of two strains in the mixture, respectively. Panels A, B, and C refer to haplotype coverage, error rate (mismatch + indel) and N50 of the resulting assemblies, respectively. The darker colors indicate the better assembly performance
Benchmarking results for 5-strain HIV mixture (PacBio CLR reads) with varying sequencing coverage. HC haplotype coverage, ER error rate (mismatches + indels + N’s), MC misassembled contigs proportion
| #Contigs | HC (%) | N50 (bp) | NGA50 (bp) | ER (%) | MC(%) | |
|---|---|---|---|---|---|---|
| Strainline | 7 | 78.4 | 9517 | 9545 | 1.603 | 0.0 |
| Canu | 4 | 32.9 | 6766 | 5281 | 1.447 | 0.0 |
| Wtdbg2 | 1 | 18.0 | 8687 | - | 2.295 | 0.0 |
| Strainline | 5 | 79.2 | 9557 | 9554 | 1.069 | 0.0 |
| Canu | 5 | 63.6 | 7207 | 7139 | 1.087 | 0.0 |
| Wtdbg2 | 1 | 13.5 | 6516 | - | 1.520 | 0.0 |
| Strainline | 6 | 99.3 | 9644 | 9614 | 0.675 | 0.0 |
| Canu | 5 | 67.4 | 8193 | 7544 | 0.416 | 20.0 |
| Wtdbg2 | 1 | 16.0 | 7741 | - | 1.103 | 0.0 |
| Strainline | 5 | 99.7 | 9690 | 9686 | 0.311 | 0.0 |
| Canu | 4 | 69.3 | 8264 | 8001 | 0.537 | 25.0 |
| Wtdbg2 | 1 | 16.5 | 7950 | - | 2.115 | 0.0 |
| Strainline | 5 | 99.8 | 9696 | 9691 | 0.184 | 0.0 |
| Canu | 4 | 71.7 | 8844 | 8524 | 0.446 | 25.0 |
| Wtdbg2 | 1 | 15.6 | 7528 | - | 2.777 | 0.0 |
| Strainline | 5 | 99.9 | 9697 | 9697 | 0.002 | 0.0 |
| Canu | 5 | 84.5 | 8227 | 8170 | 0.409 | 20.0 |
| Wtdbg2 | 1 | 15.5 | 7419 | - | 1.820 | 0.0 |