Nathan LaPierre1, Rob Egan2, Wei Wang3, Zhong Wang4,5,6.
Abstract
BACKGROUND: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area.Entities:

Keywords: Deep learning; Long sequence reads; Oxford Nanopore; de novo assembly
Year: 2019 PMID: 31694525 PMCID: PMC6833143 DOI: 10.1186/s12859-019-3103-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of MiniScrub. The Convolutional Neural Network (CNN) is trained to predict sequence segment percent identity (percent match to reference) from read-to-read overlaps. To generate ground-truth percent identity for read segments, reads are generated from known genomes in a reference database, then GraphMap [26] is used to map those reads to the reference, from which we calculate the percentage of bases from each read segment that match the reference genome. We also use MiniMap2 to generate read-to-read mappings, then encode the information into an RGB “pileup” image for each read, which is then split into shorter segments. We then train the CNN to learn the segment percent identity from the pileup images and save the model. On the user side, users run MiniMap2 on their set of reads and specify a cutoff threshold for read segments to scrub. The learned CNN model then predicts read segment percent identity and scrubs the segments below the quality threshold, outputting a new FASTQ file with the scrubbed reads
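As a toy illustration of the pileup encoding the caption describes, the sketch below packs read-to-read overlaps into a fixed-size RGB image. `encode_pileup`, its channel assignments, and the (depth, length) = (24, 48) shape are hypothetical simplifications; MiniScrub's actual encoding is built from minimizer matches and base qualities.

```python
import numpy as np

def encode_pileup(read_len, overlaps, depth=24, length=48):
    """Toy RGB 'pileup' encoding (hypothetical simplification of MiniScrub's scheme).

    overlaps: list of (start, end, quality) tuples for reads mapping onto this
    read, with quality in [0, 1]. Row 0 is reserved for the read itself; each
    overlap fills the columns it spans in its own row, up to `depth` rows.
    """
    img = np.zeros((depth, length, 3), dtype=np.uint8)
    img[0, :, 0] = 255  # red channel marks the read itself across all positions
    for row, (start, end, qual) in enumerate(overlaps[: depth - 1], start=1):
        c0 = int(start / read_len * length)           # first pileup column spanned
        c1 = max(c0 + 1, int(end / read_len * length))  # last column (at least one wide)
        img[row, c0:c1, 1] = 255              # green: an overlapping read covers these columns
        img[row, c0:c1, 2] = int(qual * 255)  # blue: overlap quality
    return img

# a 1000 bp read with two overlapping reads
img = encode_pileup(1000, [(0, 600, 0.9), (400, 1000, 0.7)])
```

The full image can then be cut column-wise into segments (e.g. 48 columns each) before being fed to the CNN.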
Results from training and testing on different datasets
| | LC train, LC test | LC train, HC test | HC train, LC test | HC train, HC test |
|---|---|---|---|---|
| Mean Sq. Error | 0.00300 | 0.00447 | 0.00312 | 0.00391 |
| Pearson | 0.827 | 0.747 | 0.809 | 0.772 |
| Spearman | 0.805 | 0.795 | 0.778 | 0.802 |
| Sensitivity | 0.950 | 0.891 | 0.938 | 0.889 |
| Specificity | 0.681 | 0.734 | 0.681 | 0.751 |
“LC” is a low complexity, high coverage (140 × to 204 ×) community derived from 747,598 reads from only two species, Escherichia coli (204 × coverage) and Sphingomonas koreensis (140 × coverage). “HC” is a high complexity, low coverage (0.005 × to 64 ×) community derived from 260,930 reads from 26 species, described in [31]. The cutoff point for the sensitivity/specificity results was set at 0.8. We use the notation “LC train, HC test” to mean training the model on the LC data and testing it on the HC data
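The table's metrics could be computed from predicted versus ground-truth segment identities roughly as follows. This is a sketch assuming a segment counts as a "positive" (to be scrubbed) when its true identity falls below the cutoff, with Spearman taken as the Pearson correlation of the ranks; `segment_metrics` is a hypothetical helper.

```python
import numpy as np

def segment_metrics(true_id, pred_id, cutoff=0.8):
    """Evaluation metrics for predicted segment percent identity (a sketch;
    a segment is a 'positive' to scrub when its TRUE identity < cutoff)."""
    t, p = np.asarray(true_id, float), np.asarray(pred_id, float)
    mse = float(np.mean((t - p) ** 2))
    pearson = float(np.corrcoef(t, p)[0, 1])
    # Spearman correlation = Pearson correlation of the ranks
    rank = lambda x: np.argsort(np.argsort(x))
    spearman = float(np.corrcoef(rank(t), rank(p))[0, 1])
    scrub_true, scrub_pred = t < cutoff, p < cutoff
    sens = float((scrub_true & scrub_pred).sum() / max(scrub_true.sum(), 1))
    spec = float((~scrub_true & ~scrub_pred).sum() / max((~scrub_true).sum(), 1))
    return mse, pearson, spearman, sens, spec

# four segments: two should be scrubbed (true identity below 0.8)
m = segment_metrics([0.70, 0.85, 0.90, 0.60], [0.72, 0.88, 0.84, 0.65])
```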
Fig. 2 Density scatter plot showing average read quality improvement by MiniScrub versus raw reads. The X-axis shows read percent identity to the reference while the Y-axis shows read length. Raw reads are in blue while scrubbed reads are in red. The darkness of the color indicates increased “density” – more reads fall into a darker region of the graph than the lighter areas. MiniScrub scrubs out most of the low-quality segments in low-quality reads while leaving high-quality reads intact, increasing average read percent identity by over 3%, from 83.1% to 86.2%. Average read length decreased from 2673 bases to 1594 bases due to splitting reads where low-quality segments were removed. Reads > 25 kbp have low density and are not shown, in order to keep the substantive portion of the graph relatively large
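The splitting behavior described above (reads broken into shorter subreads where low-quality segments are removed) can be sketched as below. `scrub_read`, its fixed-length segments, and the `min_len` filter are illustrative assumptions rather than MiniScrub's exact logic, which operates on minimizer-based segments.

```python
def scrub_read(seq, qual, seg_pred, seg_len, cutoff=0.8, min_len=100):
    """Split a read into subreads by removing segments predicted below `cutoff`.

    seg_pred: predicted percent identity for each consecutive `seg_len`-base
    segment. Returns a list of (sequence, quality-string) subreads; runs of
    kept segments shorter than `min_len` bases are discarded.
    """
    subreads, start = [], None
    for i, p in enumerate(seg_pred):
        lo, hi = i * seg_len, min((i + 1) * seg_len, len(seq))
        if p >= cutoff:
            start = lo if start is None else start  # open/extend a kept run
            end = hi
        elif start is not None:                     # low-quality segment ends a run
            if end - start >= min_len:
                subreads.append((seq[start:end], qual[start:end]))
            start = None
    if start is not None and end - start >= min_len:  # flush the final run
        subreads.append((seq[start:end], qual[start:end]))
    return subreads

# a 1000 bp read whose third 250 bp segment is predicted low-quality
subs = scrub_read("A" * 250 + "C" * 250 + "G" * 250 + "T" * 250,
                  "I" * 1000, [0.9, 0.9, 0.5, 0.85], 250)
```

Each surviving subread would then be written to the output FASTQ as its own record, which is why scrubbing lowers the average read length while raising average identity.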
MiniScrub improves read error correction in the metagenome setting
| | No scrubbing | With scrubbing |
|---|---|---|
| Avg. coverage pct. | 47.04% | |
| Avg. mean coverage depth | 3.08 | |
| Pct. of genomes above 1.0 coverage depth | 46.67% | |
Reads from the high complexity (HC) dataset [31], both with and without scrubbing beforehand, were corrected using Canu’s [19] error correction module and then aligned to their reference genomes with GraphMap [26]. The statistics in the table are averages across all source genomes that had non-zero coverage. Applying scrubbing before read error correction improves average coverage percentage and average mean coverage depth across the source genomes, and leads to a larger number of source genomes having a mean coverage depth of at least 1.0. Best performance numbers are shown in bold
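The coverage statistics in the table could be derived from read alignments along these lines. `coverage_stats` is a hypothetical helper, with alignments given as simple (genome, start, end) intervals rather than parsed GraphMap output; per-base counting is used for clarity rather than speed.

```python
def coverage_stats(genome_lengths, alignments):
    """Per-genome coverage breadth and mean depth from read alignments (a sketch).

    genome_lengths: {genome_name: length in bases}
    alignments: iterable of (genome_name, start, end) half-open intervals.
    Returns {genome_name: (fraction of bases covered, mean coverage depth)}.
    """
    depth = {g: [0] * n for g, n in genome_lengths.items()}
    for g, start, end in alignments:
        for i in range(start, min(end, genome_lengths[g])):
            depth[g][i] += 1  # one more read covers this base
    stats = {}
    for g, d in depth.items():
        covered = sum(1 for x in d if x > 0)
        stats[g] = (covered / len(d), sum(d) / len(d))  # (breadth, mean depth)
    return stats

# toy 100 bp genome with two overlapping read alignments
stats = coverage_stats({"genomeA": 100}, [("genomeA", 0, 50), ("genomeA", 25, 75)])
```

Averaging the breadth and mean-depth values over all genomes with non-zero coverage would give the table's first two rows; counting genomes with mean depth ≥ 1.0 gives the third.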
MiniScrub reduces downstream assembly errors
| | MECAT Raw | MiniScrub + MECAT | Canu Raw | MiniScrub + Canu |
|---|---|---|---|---|
| % genome assembled | 79.39% | 99.69% | | |
| NGA50 | 242478 | 696460 | | |
| LGA50 | 12 | 5 | | |
| # of contigs | 38 | 19 | | |
| # mis-assembled contigs | 28 | 2 | 2 | |
| # local mis-assemblies | 209 | 5 | | |
| # indels > 5 bp | 1099 | 84 | | |
| Runtime (hours) | | | 80 | 9 |
MiniScrub significantly improves assembly, tested with MECAT [32], increasing genome coverage and NGA50 while limiting LGA50, mis-assemblies, mismatches, and indels. Canu’s assembly had slightly reduced errors and mis-assemblies when reads were preprocessed with MiniScrub, but the assembly was more fractured, likely due in part to resolving large mis-assemblies and indels. Notably, Canu assembly of raw reads took about 3.5 days, while the MiniScrub+Canu pipeline took about 9 hours, likely due to a reduction in the amount of error correction needed in the latter situation. Results were evaluated using QUAST [33]. Best performance numbers are shown in bold
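For reference, NGA50 and LGA50 follow the NG50/LG50 definitions applied to aligned blocks: sort blocks by length and accumulate until half the reference size is reached. Below is a minimal sketch under that definition (the function name is illustrative; QUAST additionally breaks contigs at mis-assembly boundaries before computing these statistics).

```python
def nga50_lga50(aligned_block_lengths, ref_genome_size):
    """NGA50 and LGA50 from aligned-block lengths (a sketch).

    NGA50: length of the block at which the running sum of blocks, taken
    longest-first, first reaches half the REFERENCE genome size.
    LGA50: the number of blocks needed to reach that point.
    """
    total, half = 0, ref_genome_size / 2
    for i, length in enumerate(sorted(aligned_block_lengths, reverse=True), 1):
        total += length
        if total >= half:
            return length, i
    return None, None  # assembly covers less than half the reference

# five aligned blocks against a 200 bp reference: 50+40+30 >= 100
result = nga50_lga50([50, 40, 30, 20, 10], 200)
```

Using the reference size (rather than the assembly size, as in NA50) penalizes assemblies that leave parts of the genome unassembled, which is why NGA50 pairs naturally with the "% genome assembled" row.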
Performance with different parameter settings
| | Default | (Length, Depth) = (36, 36) | (w,k) = (7,17) |
|---|---|---|---|
| Mean Sq. Error | 0.00300 | 0.00329 | 0.00305 |
| Pearson | 0.827 | 0.821 | 0.830 |
| Spearman | 0.805 | 0.786 | 0.810 |
| Sensitivity | 0.950 | 0.934 | 0.914 |
| Specificity | 0.681 | 0.693 | 0.780 |
w and k refer to the minimizer parameters, while Length and Depth refer to the length and depth of each pileup image segment, which correspond to the number of minimizers in that read segment and the number of matching reads used. The default settings are (w,k) = (5,15) and (Length, Depth) = (48, 24). The columns show the performance when varying one of these settings, with a cutoff of 0.8
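The (w,k) parameters refer to minimizer selection: the smallest k-mer in each window of w consecutive k-mers. Below is a minimal sketch using plain lexicographic order (real implementations such as MiniMap2 hash the k-mers and also consider reverse complements); `minimizers` is an illustrative name.

```python
def minimizers(seq, w=5, k=15):
    """(w,k)-minimizers of a sequence (a sketch).

    Returns (position, k-mer) pairs: the lexicographically smallest k-mer in
    each window of w consecutive k-mers, with consecutive duplicates collapsed.
    """
    if len(seq) < w + k - 1:          # too short to hold a full window
        return []
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picks = []
    for i in range(len(kmers) - w + 1):
        window = kmers[i:i + w]
        j = min(range(w), key=lambda x: window[x])  # index of smallest k-mer
        if not picks or picks[-1] != (i + j, window[j]):
            picks.append((i + j, window[j]))        # keep each minimizer once
    return picks

# small parameters for illustration
picks = minimizers("ACGTACGT", 2, 3)
```

Larger w and k (as in the (7,17) column) select fewer, sparser minimizers, which shortens the pileup representation of each read segment.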