Ahmed Al Qaffas, Jenna Nichols, Andrew J Davison, Amine Ourahmane, Laura Hertel, Michael A McVoy, Salvatore Camiolo.
Abstract
Long-read, single-molecule DNA sequencing technologies have triggered a revolution in genomics by enabling the determination of large, reference-quality genomes in ways that overcome some of the limitations of short-read sequencing. However, the greater length and higher error rate of the reads generated on long-read platforms make the tools used for assembling short reads unsuitable for use in data assembly and motivate the development of new approaches. We present LoReTTA (Long Read Template-Targeted Assembler), a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads. The tool was designed to deal with reads originating from viral genomes, which feature high genetic variability, possible multiple isoforms, and the dominant presence of additional organisms in clinical or environmental samples. LoReTTA was tested on a range of simulated and experimental datasets and outperformed established long-read assemblers in terms of assembly contiguity and accuracy. The software runs under the Linux operating system, is designed for easy adaptation to alternative systems, and features an automatic installation pipeline that takes care of the required dependencies. A command-line version and a user-friendly graphical interface version are available under a GPLv3 license at https://bioinformatics.cvr.ac.uk/software/ with the manual and a test dataset.
Keywords: PacBio; de novo assembly; long read assembler; viral genomics
Year: 2021 PMID: 33996146 PMCID: PMC8111061 DOI: 10.1093/ve/veab042
Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577
Figure 1. HCMV genome isomers. Arrows represent the locations and orientations of a (black), b (blue) and c (magenta) and their inverted copies, a′, b′ and c′. Colour gradients in the unique sequences (UL and US) indicate the relative orientations of these regions in the four isomers, which are designated prototype (P), UL inverted (IL), US inverted (IS) and UL and US inverted (ILIS). For simplicity, a single copy of the a/a′ sequence is shown at the termini and internally. However, a proportion of genomes has been reported to contain additional directly repeated copies of a at the left end and internally or to lack a copy of a at the right end.
Figure 2. LoReTTA pipeline. Step A: the reference genome (blue; scale in kbp (k)) is subsampled in sliding windows. Step B: reads (red) are aligned to each window (purple segments represent sequencing errors). Step C: all k-mers are extracted from each read and the most recurrent are selected. Step D: the selected k-mers are used to perform a local contig assembly. Step E: adjacent local contigs are joined by exploiting their overlaps. Step F: gaps due to non-overlapping local contigs are closed using long reads. Step G: the long-read dataset is aligned to the reconstructed genome, and substitutions and indels in more than fifty per cent of reads are emended (green segments represent corrected sequence errors).
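Three of the steps in Figure 2 can be illustrated with a minimal Python sketch: window subsampling (Step A), selection of recurrent k-mers (Step C) and majority-based correction (Step G). This is not LoReTTA's code. The function names and the `window`, `step`, `k` and `min_count` parameters are hypothetical, chosen only for illustration, and the alignment steps are omitted.

```python
from collections import Counter


def sliding_windows(reference, window, step):
    """Step A (sketch): yield (start, subsequence) windows along the reference.

    Trailing bases that do not fit a whole step are not covered here.
    """
    last_start = max(len(reference) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, reference[start:start + window]


def most_recurrent_kmers(reads, k, min_count):
    """Step C (sketch): count all k-mers in the reads, keep the frequent ones.

    min_count is an assumed cut-off, not a documented LoReTTA parameter.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def majority_base(column):
    """Step G (sketch): base seen in more than 50% of reads at a position.

    Returns None when no base exceeds half of the aligned reads.
    """
    if not column:
        return None
    base, n = Counter(column).most_common(1)[0]
    return base if n > len(column) / 2 else None
```

For example, `list(sliding_windows("ACGTACGTAC", 4, 3))` produces three windows starting at positions 0, 3 and 6, and `majority_base("AACC")` returns `None` because neither base is present in more than half of the reads.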
Assembly statistics of simulated long-read datasets for five assemblers.
| Dataset | N50 (nt), L | N50 (nt), R | N50 (nt), C | N50 (nt), F | N50 (nt), Re | Contigs (no.), L | Contigs (no.), R | Contigs (no.), C | Contigs (no.), F | Contigs (no.), Re | Ambiguous (‘N’) nucleotides (no.), L | Ambiguous (‘N’) nucleotides (no.), R | Ambiguous (‘N’) nucleotides (no.), C | Ambiguous (‘N’) nucleotides (no.), F | Ambiguous (‘N’) nucleotides (no.), Re | Assembled nucleotides (no.), L | Assembled nucleotides (no.), R | Assembled nucleotides (no.), C | Assembled nucleotides (no.), F | Assembled nucleotides (no.), Re |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 3,162 | – | – | – | 3,182 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 3,162 | – | – | – | 3,182 |
|  | 3,175 | – | – | – | 3,182 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 3,175 | – | – | – | 3,182 |
|  | 3,175 | – | – | – | 3,182 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 3,175 | – | – | – | 3,182 |
|  | 9,019 | – | – | – | 9,570 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 9,019 | – | – | – | 9,570 |
|  | 9,283 | – | – | – | 9,550 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 9,283 | – | – | – | 9,550 |
|  | 9,361 | – | – | – | 9,477 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 9,361 | – | – | – | 9,477 |
| SARS-CoV- | 28,470 | 28,927 | 24,058 | 29,183 | 29,500 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 28,470 | 28,927 | 24,058 | 29,183 | 29,500 |
| SARS-CoV- | 28,558 | 29,016 | 18,218 | 29,004 | 29,419 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 28,558 | 29,016 | 18,218 | 29,004 | 29,419 |
| SARS-CoV- | 28,811 | 29,054 | 25,772 | 28,211 | 29,519 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 28,811 | 29,054 | 25,772 | 28,211 | 29,519 |
|  | 89,878 | 90,320 | 61,255 | 90,490 | 91,729 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 89,878 | 90,320 | 68,019 | 90,490 | 91,729 |
|  | 90,507 | 91,164 | 41,996 | 91,298 | 91,475 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 90,507 | 91,164 | 65,526 | 91,298 | 91,475 |
|  | 90,726 | 91,005 | 11,569 | 91,126 | 91,440 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 90,726 | 91,005 | 16,853 | 91,126 | 91,440 |
| HSV- | 149,635 | 129,262 | 31,652 | 107,923 | 152,014 | 1 | 2 | 4 | 2 | 1 | 196 | 2 | 0 | 0 | 0 | 149,668 | 162,428 | 134,065 | 151,615 | 152,014 |
| HSV- | 150,121 | 151,163 | 12,973 | 138,690 | 152,332 | 1 | 1 | 5 | 2 | 1 | 169 | 1 | 0 | 0 | 0 | 150,116 | 151,163 | 66,536 | 151,614 | 152,332 |
| HSV- | 150,049 | 136,648 | 19,618 | 138,690 | 152,143 | 1 | 2 | 7 | 2 | 1 | 154 | 2 | 0 | 0 | 0 | 150,042 | 176,004 | 101,745 | 151,615 | 152,143 |
|  | 233,519 | 206,327 | 29,649 | 199,577 | 235,651 | 1 | 2 | 8 | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 233,518 | 250,544 | 199,855 | 235,062 | 235,651 |
|  | 235,355 | 235,025 | 12,645 | 270,553 | 235,656 | 1 | 1 | 5 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 235,354 | 235,025 | 55,398 | 270,553 | 235,656 |
|  | 234,848 | 254,672 | 11,475 | 270,551 | 235,651 | 1 | 1 | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 234,848 | 254,672 | 78,734 | 270,551 | 235,651 |
|  | 233,677 | 264,101 | 29,407 | 235,068 | 235,651 | 1 | 1 | 8 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 233,675 | 264,101 | 188,267 | 235,068 | 235,651 |
|  | 233,675 | 201,971 | 29,460 | 196,437 | 235,668 | 1 | 2 | 11 | 3 | 1 | 0 | 2 | 0 | 0 | 0 | 233,671 | 254,556 | 228,380 | 236,989 | 235,668 |
|  | 234,596 | 251,519 | 19,633 | 231,802 | 235,454 | 1 | 1 | 15 | 2 | 1 | 1,052 | 1 | 0 | 0 | 0 | 234,613 | 251,519 | 262,827 | 236,111 | 235,454 |
|  | 232,884 | 218,646 | 16,414 | 231,982 | 235,650 | 1 | 3 | 22 | 2 | 1 | 3,407 | 3 | 0 | 0 | 0 | 234,750 | 299,671 | 325,122 | 238,019 | 235,650 |
|  | 235,646 | 251,335 | – | – | 235,650 | 1 | 1 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 251,335 | – | – | 235,650 |
|  | 233,519 | 252,791 | – | – | 235,650 | 1 | 1 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 252,791 | – | – | 235,650 |
|  | 233,519 | 251,362 | – | – | 235,650 | 1 | 1 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 251,362 | – | – | 235,650 |
|  | 233,519 | 251,350 | – | – | 235,650 | 1 | 1 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 251,350 | – | – | 235,650 |
|  | 233,519 | 202,242 | – | – | 235,650 | 1 | 2 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 262,935 | – | – | 235,650 |
|  | 233,519 | 202,242 | – | – | 235,650 | 1 | 2 | – | – | 1 | 0 | 1 | 0 | 0 | 0 | 233,518 | 262,935 | – | – | 235,650 |
|  | 233,519 | 257,666 | – | – | 235,650 | 1 | 1 | – | – | 1 | 0 | 2 | 0 | 0 | 0 | 233,518 | 257,666 | – | – | 235,650 |
|  | 233,519 | 257,666 | – | – | 235,651 | 1 | 1 | – | – | 1 | 0 | 2 | 0 | 0 | 0 | 233,518 | 257,666 | – | – | 235,651 |
–, no assembly was produced when computation was complete (HBV and HCV), or assembly had not finished after an impractical length of time (more than seven days), at which point computation was terminated (HCMVclinical and HCMVmeta).
L, LoReTTA; R, Raven; C, Canu; F, Flye; Re, Rebaler.
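For readers unfamiliar with the N50 column, the sketch below computes it under the standard definition: the length of the shortest contig among the longest contigs that together span at least half of the total assembled length. The function name `n50` is illustrative, and the source does not state which tool computed the tabulated values.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Under this definition a single-contig assembly has an N50 equal to the length of that contig, so N50 only separates assemblers when an assembly is fragmented into several contigs.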
Comparative statistics for the simulated long-read datasets as determined by the software QUAST using the genomes from which the reads were generated as references.
| Dataset | Genome coverage (%), L | Genome coverage (%), R | Genome coverage (%), C | Genome coverage (%), F | Genome coverage (%), Re | Mismatches per 100 kb (no.), L | Mismatches per 100 kb (no.), R | Mismatches per 100 kb (no.), C | Mismatches per 100 kb (no.), F | Mismatches per 100 kb (no.), Re | Indels per 100 kb (no.), L | Indels per 100 kb (no.), R | Indels per 100 kb (no.), C | Indels per 100 kb (no.), F | Indels per 100 kb (no.), Re | Misassemblies (no.), L | Misassemblies (no.), R | Misassemblies (no.), C | Misassemblies (no.), F | Misassemblies (no.), Re |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 99.4 | – | – | – | 100.0 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 0.0 | 0 | – | – | – | 0 |
|  | 99.8 | – | – | – | 100.0 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 0.0 | 0 | – | – | – | 0 |
|  | 99.8 | – | – | – | 100.0 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 0.0 | 0 | – | – | – | 0 |
|  | 93.5 | – | – | – | 100.0 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 10.4 | 0 | – | – | – | 0 |
|  | 96.2 | – | – | – | 100.0 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 51.8 | 0 | – | – | – | 0 |
|  | 97.0 | – | – | – | 98.2 | 0.0 | – | – | – | 0.0 | 0.0 | – | – | – | 42.2 | 0 | – | – | – | 0 |
| SARS-CoV- | 95.2 | 96.5 | 80.0 | 97.6 | 98.2 | 0.0 | 0.0 | 25.1 | 0.0 | 0.0 | 0.0 | 3.5 | 861.0 | 0.0 | 10.2 | 0 | 0 | 0 | 0 | 0 |
| SARS-CoV- | 95.5 | 97.0 | 60.8 | 97.0 | 98.3 | 0.0 | 0.0 | 0.0 | 0.0 | 10.2 | 0.0 | 0.0 | 176.0 | 0.0 | 20.4 | 0 | 0 | 0 | 0 | 0 |
| SARS-CoV- | 96.3 | 97.2 | 86.2 | 94.3 | 97.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.9 | 42.7 | 0.0 | 3.4 | 0 | 0 | 0 | 0 | 0 |
|  | 98.0 | 98.5 | 74.0 | 98.7 | 100.0 | 0.0 | 0.0 | 16.2 | 0.0 | 1.1 | 0.0 | 0.0 | 374.4 | 2.2 | 15.2 | 0 | 0 | 0 | 0 | 0 |
|  | 98.7 | 99.4 | 67.5 | 99.5 | 99.5 | 0.0 | 0.0 | 19.4 | 0.0 | 0.0 | 0.0 | 1.1 | 443.9 | 0.0 | 4.4 | 0 | 0 | 0 | 0 | 0 |
|  | 98.9 | 99.2 | 18.3 | 99.4 | 99.5 | 0.0 | 0.0 | 29.9 | 0.0 | 0.0 | 0.0 | 1.1 | 698.4 | 0.0 | 8.8 | 0 | 0 | 0 | 0 | 0 |
| HSV- | 98.6 | 100.0 | 87.5 | 94.0 | 100.0 | 0.7 | 30.9 | 21.8 | 0.0 | 0.7 | 4.7 | 157.2 | 642.8 | 0.0 | 2.0 | 0 | 2 | 1 | 1 | 0 |
| HSV- | 98.8 | 99.5 | 39.9 | 95.6 | 100.0 | 0.0 | 0.0 | 13.2 | 0.0 | 0.7 | 6.0 | 2.6 | 385.7 | 0.0 | 5.0 | 0 | 0 | 0 | 1 | 0 |
| HSV- | 98.8 | 99.8 | 56.9 | 95.6 | 100.0 | 0.0 | 5.3 | 34.7 | 0.0 | 0.0 | 6.0 | 84.3 | 723.0 | 0.0 | 2.6 | 0 | 3 | 1 | 1 | 0 |
|  | 99.1 | 99.8 | 83.4 | 98.9 | 100.0 | 0.0 | 4.3 | 23.4 | 0.0 | 0.0 | 0.4 | 44.2 | 580.3 | 0.0 | 0.8 | 0 | 0 | 1 | 0 | 0 |
|  | 99.8 | 98.9 | 21.0 | 99.8 | 100.0 | 0.0 | 0.0 | 28.3 | 0.0 | 0.0 | 0.0 | 0.9 | 432.6 | 0.0 | 1.3 | 0 | 2 | 0 | 1 | 0 |
|  | 99.7 | 98.9 | 29.8 | 98.9 | 100.0 | 0.0 | 0.4 | 44.1 | 0.0 | 0.0 | 0.0 | 10.3 | 988.2 | 0.0 | 2.1 | 0 | 2 | 0 | 1 | 0 |
|  | 99.2 | 100.0 | 77.8 | 99.8 | 100.0 | 0.0 | 98.0 | 34.9 | 0.0 | 0.0 | 0.4 | 135.8 | 716.3 | 0.0 | 3.8 | 0 | 3 | 1 | 0 | 0 |
|  | 99.2 | 99.9 | 87.9 | 98.9 | 100.0 | 0.0 | 29.3 | 293.2 | 0.9 | 0.0 | 1.3 | 74.3 | 742.4 | 0.4 | 1.2 | 0 | 1 | 1 | 2 | 0 |
|  | 99.0 | 100.0 | 84.5 | 99.4 | 99.5 | 24.0 | 46.3 | 889.9 | 9.4 | 3.4 | 10.3 | 62.0 | 950.2 | 8.1 | 7.2 | 0 | 2 | 1 | 1 | 0 |
|  | 98.0 | 99.8 | 92.1 | 98.1 | 100.0 | 325.1 | 699.0 | 1321.2 | 883.4 | 153.1 | 29.9 | 156.9 | 1267.3 | 109.8 | 57.6 | 0 | 3 | 2 | 1 | 1 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 0.4 | – | – | 0.0 | 0.4 | 22.5 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 10.2 | – | – | 0.0 | 0.4 | 76.8 | – | – | 1.3 | 0 | 2 | – | – | 0 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 0.9 | – | – | 0.0 | 0.4 | 24.6 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 0.4 | – | – | 0.0 | 0.4 | 20.4 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 99.8 | – | – | 100.0 | 0.0 | 3.0 | – | – | 0.0 | 0.4 | 44.7 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 99.8 | – | – | 100.0 | 0.0 | 3.0 | – | – | 0.0 | 0.4 | 44.7 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 3.8 | – | – | 0.0 | 0.4 | 44.6 | – | – | 1.3 | 0 | 1 | – | – | 0 |
|  | 99.1 | 100.0 | – | – | 100.0 | 0.0 | 3.8 | – | – | 0.0 | 0.4 | 44.6 | – | – | 1.3 | 0 | 1 | – | – | 0 |
–, no assembly was produced when computation was complete (HBV and HCV), or assembly had not finished after an impractical length of time (more than seven days), at which point computation was terminated (HCMVclinical and HCMVmeta).
L, LoReTTA; R, Raven; C, Canu; F, Flye; Re, Rebaler.
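The "per 100 kb" columns are rates rather than raw counts. A minimal sketch of the normalisation follows, assuming the rate is the number of events divided by the number of aligned bases and scaled to 100,000. The function name `per_100kb` and its arguments are illustrative and are not part of QUAST's interface.

```python
def per_100kb(event_count, aligned_bases):
    """Scale a raw mismatch or indel count to a rate per 100,000 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return event_count * 100_000 / aligned_bases
```

With this scaling, two mismatches across 200,000 aligned bases give a rate of 1.0 per 100 kb, which makes assemblies of genomes of very different sizes directly comparable.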
Assembly statistics for four experimental datasets.
| Dataset | N50 (nt), L | N50 (nt), R | N50 (nt), C | N50 (nt), F | N50 (nt), Re | Contigs (no.), L | Contigs (no.), R | Contigs (no.), C | Contigs (no.), F | Contigs (no.), Re | Ambiguous (‘N’) nucleotides (no.), L | Ambiguous (‘N’) nucleotides (no.), R | Ambiguous (‘N’) nucleotides (no.), C | Ambiguous (‘N’) nucleotides (no.), F | Ambiguous (‘N’) nucleotides (no.), Re | Assembled nucleotides (no.), L | Assembled nucleotides (no.), R | Assembled nucleotides (no.), C | Assembled nucleotides (no.), F | Assembled nucleotides (no.), Re |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HBV | 3,223 | – | – | – | 2,676 | 1 | – | – | – | 1 | 0 | – | – | – | 0 | 3,223 | – | – | – | 2,676 |
| HSV-1 | 151,797 | 28,815 | 18,382 | – | 151,254 | 1 | 15 | 69 | – | 1 | 994 | 15 | 0 | – | 0 | 151,754 | 326,010 | 930,393 | – | 151,254 |
| HCMV | 236,253 | 234,751 | 42,630 | 199,299 | 234,164 | 1 | 1 | 36 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 236,253 | 234,751 | 463,210 | 237,725 | 234,164 |
| PaP1 | 91,687 | 91,032 | 9,305 | – | 91,137 | 1 | 1 | 7 | – | 1 | 0 | 1 | 0 | – | 0 | 91,687 | 91,032 | 53,737 | – | 91,137 |
–, no assembly was produced when computation was complete (HBV), or assembly had not finished after an impractical length of time (more than seven days), at which point computation was terminated (HSV-1 and PaP1).
L, LoReTTA; R, Raven; C, Canu; F, Flye; Re, Rebaler.
Comparative statistics for four experimental datasets as determined by the software QUAST using the relevant deposited sequences as references.
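| Dataset | Genome coverage (%), L | Genome coverage (%), R | Genome coverage (%), C | Genome coverage (%), F | Genome coverage (%), Re | Mismatches per 100 kb (no.), L | Mismatches per 100 kb (no.), R | Mismatches per 100 kb (no.), C | Mismatches per 100 kb (no.), F | Mismatches per 100 kb (no.), Re | Indels per 100 kb (no.), L | Indels per 100 kb (no.), R | Indels per 100 kb (no.), C | Indels per 100 kb (no.), F | Indels per 100 kb (no.), Re | Misassemblies (no.), L | Misassemblies (no.), R | Misassemblies (no.), C | Misassemblies (no.), F | Misassemblies (no.), Re |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HSV-1 | 98.6 | 86.2 | 98.6 | – | 99.1 | 0.7 | 33.5 | 462.9 | – | 1.3 | 10.6 | 83.7 | 923.9 | – | 13.9 | 0 | 39 | 65 | – | 0 |
| HCMV | 100.0 | 99.3 | 100.0 | 99.5 | 99.1 | 0.9 | 0.9 | 0.9 | 0.9 | 1.7 | 3.4 | 3.0 | 7.6 | 2.6 | 4.7 | 0 | 2 | 1 | 0 | 0 |
| PaP1 | 99.9 | 99.3 | 56.6 | – | 99.3 | 1.1 | 1.1 | 40.4 | – | 1.1 | 1.1 | 4.4 | 369.7 | – | 9.9 | 0 | 0 | 0 | – | 0 |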
–, no assembly was produced when computation was complete (HBV), or assembly had not finished after an impractical length of time (more than seven days), at which point computation was terminated (HSV-1 and PaP1).
L, LoReTTA; R, Raven; C, Canu; F, Flye; Re, Rebaler.