Wenjing Zhang, Neng Huang, Jiantao Zheng, Xingyu Liao, Jianxin Wang, Hong-Dong Li.
Abstract
The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics owing to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and thereby improve the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative to alignment-based algorithms for sequence-quality evaluation.
Keywords: genomics; read quality assessment; third-generation sequencing
Year: 2019 PMID: 30646604 PMCID: PMC6356754 DOI: 10.3390/genes10010044
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The workflow of the Read Quality Evaluation and Selection Tool (REQUEST). The method consists of three steps: (1) compiling the training data of high- and low-quality reads; (2) splitting the training set into two parts to build the linear model and cross-score the reads; (3) selecting the top-scored reads and evaluating them. SQ stands for the sequencing read quality score computed by REQUEST.
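The three steps in the workflow can be illustrated with a minimal Python sketch. It is not the REQUEST implementation: the use of trinucleotide frequencies as the only features, the 1/0 labels for high- and low-quality reads, the ordinary least-squares fit, and all function names (`trinuc_freqs`, `cross_score`, `select_top`) are assumptions made for illustration.

```python
import itertools
import numpy as np

# All 64 trinucleotides in a fixed order; each one is a feature column.
TRINUCS = ["".join(p) for p in itertools.product("ACGT", repeat=3)]
INDEX = {k: i for i, k in enumerate(TRINUCS)}


def trinuc_freqs(read):
    """Relative frequency of each trinucleotide in one read."""
    counts = np.zeros(len(TRINUCS))
    for i in range(len(read) - 2):
        j = INDEX.get(read[i:i + 3].upper())
        if j is not None:  # skip windows containing N or other symbols
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


def fit_linear(X, y):
    """Ordinary least squares with an intercept column (assumed model form)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (last entry is the intercept)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return A @ coef


def cross_score(reads, labels, seed=0):
    """Split the reads into two halves; score each half with the model
    fitted on the other half. Labels: 1 = high quality, 0 = low quality
    (an assumed encoding)."""
    X = np.array([trinuc_freqs(r) for r in reads])
    y = np.asarray(labels, dtype=float)
    order = np.random.default_rng(seed).permutation(len(reads))
    half_a, half_b = order[: len(order) // 2], order[len(order) // 2:]
    scores = np.empty(len(reads))
    scores[half_b] = predict(fit_linear(X[half_a], y[half_a]), X[half_b])
    scores[half_a] = predict(fit_linear(X[half_b], y[half_b]), X[half_a])
    return scores


def select_top(reads, scores, proportion):
    """Keep the top-scoring fraction of reads (e.g. 0.95, 0.90, 0.85, 0.80)."""
    keep = max(1, int(round(proportion * len(reads))))
    top = np.argsort(scores)[::-1][:keep]
    return [reads[i] for i in top]
```

Calling `select_top(reads, cross_score(reads, labels), 0.90)` would retain the 90% of reads with the highest scores. No reference genome enters at any point, which is the property the abstract emphasizes.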
Figure 2. Distribution of nucleotide combinations of genome, high-quality, and low-quality reads of four example trinucleotides: (a) ACC; (b) CTC; (c) GAC; (d) GCA. The green, blue, and red lines represent the data from genome (gold-standard error-free reads), high-quality (corrected reads), and low-quality reads (raw reads), respectively.
Summary of the results of Escherichia coli in selection, correction, and contigs. REQUEST—Read Quality Evaluation and Selection Tool.
Selection

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 31,858 | 64,218 | 99 | 7668 | 27,869 | 87.48 | 84.16 | 88.33 |
| Random | 95 | 30,265 | 62,072 | 99 | 7669 | 26,471 | 87.50 | 84.16 | 88.33 |
| Random | 90 | 28,672 | 62,072 | 99 | 7670 | 25,078 | 87.50 | 84.16 | 88.32 |
| Random | 85 | 27,079 | 61,357 | 100 | 7670 | 23,685 | 87.51 | 84.15 | 88.32 |
| Random | 80 | 25,486 | 59,926 | 100 | 7671 | 22,288 | 87.50 | 84.15 | 88.32 |
| REQUEST | 95 | 30,265 | 64,218 | 99 | 7875 | 27,238 | 90.00 | | |
| REQUEST | 90 | 28,672 | 64,218 | 99 | 7964 | 26,192 | 91.35 | | |
| REQUEST | 85 | 27,079 | 64,218 | 99 | 8028 | 25,016 | 92.38 | | |
| REQUEST | 80 | 25,486 | 64,218 | 99 | 8082 | 23,766 | 93.25 | | |

Correction

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 26,034 | 33,912 | 2000 | 8144 | 25,775 | 99.01 | 96.67 | 98.36 |
| Random | 95 | 24,252 | 33,882 | 2000 | 8147 | 24,011 | 99.01 | 96.78 | 98.36 |
| Random | 90 | 22,943 | 33,817 | 2001 | 8143 | 22,715 | 99.01 | 96.76 | 98.34 |
| Random | 85 | 21,626 | 33,724 | 2001 | 8137 | 21,411 | 99.00 | 96.73 | 98.31 |
| Random | 80 | 20,303 | 33,519 | 2001 | 8130 | 20,101 | 99.00 | 96.70 | 98.28 |
| REQUEST | 95 | 25,715 | 33,886 | 2001 | 8162 | 25,469 | 99.04 | | |
| REQUEST | 90 | 24,906 | 33,886 | 2001 | 8224 | 24,670 | 99.05 | | |
| REQUEST | 85 | 23,968 | 33,880 | 2001 | 8279 | 23,731 | 99.01 | | |
| REQUEST | 80 | 22,883 | 33,880 | 2000 | 8335 | 22,673 | 99.08 | | |

Contigs

| Method | P (%) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 2 | 4636 | 6 | 2305 | 4636 | 1655.60 | 99.86 |
| Random | 95 | 3 | 3724 | 3 | 3294 | 3724 | 3201.91 | 99.98 |
| Random | 90 | 4 | 2958 | 3 | 2606 | 2947 | 2438.54 | 99.97 |
| Random | 85 | 5 | 3463 | 3 | 3153 | 3380 | 3032.37 | 99.92 |
| Random | 80 | 7 | 2496 | 3 | 2444 | 1970 | 1864.48 | 99.81 |
| REQUEST | 95 | 2 | 4641 | 5 | 2530 | 4641 | 2529.56 | |
| REQUEST | 90 | 2 | 4639 | 7 | 3587 | 4639 | 3587.13 | |
| REQUEST | 85 | 3 | 4635 | 5 | 3956 | 4635 | 3956.42 | |
| REQUEST | 80 | 3 | 4636 | 5 | 3957 | 4636 | 3956.57 | |
P indicates the proportion of retained reads; Num indicates the number of reads; Max, Min, and Mean indicate the maximum, minimum, and mean read lengths, respectively; n indicates the number of alignments; R indicates the aligned rate; I indicates the identity; MA indicates misassemblies; GF indicates genome fraction.
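The footnote does not give a formula for the aligned rate. A minimal sketch, assuming R is simply the number of alignments divided by the number of reads, reproduces the value reported for all raw Escherichia coli reads; the function name `aligned_rate` is hypothetical. The "Random" rows deviate slightly from this ratio, so they may have been computed differently (for example, averaged over several random draws).

```python
def aligned_rate(n_alignments, n_reads):
    """Aligned rate in percent, assuming R = n / Num * 100."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 100.0 * n_alignments / n_reads


# "All reads" row of the Escherichia coli table: n = 27,869, Num = 31,858.
print(round(aligned_rate(27_869, 31_858), 2))  # 87.48
```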
Summary of the results of Yersinia pestis in selection, correction, and contigs.
Selection

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 28,429 | 61,191 | 125 | 7679 | 26,989 | 94.93 | 83.44 | 86.70 |
| Random | 95 | 27,007 | 61,191 | 125 | 7680 | 25,628 | 94.90 | 83.44 | 86.70 |
| Random | 90 | 25,586 | 61,191 | 125 | 7689 | 24,277 | 94.88 | 83.44 | 86.70 |
| Random | 85 | 24,164 | 61,191 | 145 | 7679 | 22,928 | 94.89 | 83.44 | 86.70 |
| Random | 80 | 22,743 | 53,492 | 125 | 7686 | 21,573 | 94.86 | 83.44 | 86.70 |
| REQUEST | 95 | 27,008 | 61,191 | 184 | 7785 | 26,181 | 96.94 | | |
| REQUEST | 90 | 25,586 | 61,191 | 184 | 7827 | 25,024 | 97.80 | | |
| REQUEST | 85 | 24,164 | 61,191 | 184 | 7869 | 23,750 | 98.29 | | |
| REQUEST | 80 | 22,743 | 61,191 | 184 | 7904 | 22,402 | 98.50 | | |

Correction

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 25,776 | 57,301 | 2000 | 7229 | 24,769 | 96.09 | 96.96 | 98.09 |
| Random | 95 | 23,953 | 33,843 | 2001 | 7170 | 23,946 | 99.97 | 97.12 | 98.14 |
| Random | 90 | 22,633 | 33,587 | 2001 | 7157 | 22,627 | 99.97 | 97.11 | 98.12 |
| Random | 85 | 21,315 | 33,289 | 2000 | 7139 | 21,310 | 99.98 | 97.11 | 98.10 |
| Random | 80 | 19,974 | 33,730 | 2001 | 7117 | 19,969 | 99.98 | 97.10 | 98.09 |
| REQUEST | 95 | 25,357 | 56,560 | 2000 | 7263 | 25,350 | 99.97 | | |
| REQUEST | 90 | 24,449 | 56,560 | 2000 | 7336 | 24,442 | 99.97 | | |
| REQUEST | 85 | 23,312 | 56,587 | 2000 | 7399 | 23,305 | 99.97 | | |
| REQUEST | 80 | 22,028 | 57,044 | 2000 | 7468 | 22,022 | 99.97 | | |

Contigs

| Method | P (%) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 4 | 4646 | 30 | 940 | 4646 | 377.69 | 99.96 |
| Random | 95 | 5 | 2749 | 28 | 835 | 2310 | 370.53 | 99.72 |
| Random | 90 | 8 | 2174 | 25 | 816 | 1642 | 345.93 | 99.55 |
| Random | 85 | 11 | 1756 | 28 | 771 | 1141 | 301.66 | 99.28 |
| Random | 80 | 19 | 1194 | 27 | 593 | 471 | 224.73 | 98.54 |
| REQUEST | 95 | 6 | 4641 | 31 | 1012 | 4641 | 377.70 | |
| REQUEST | 90 | 4 | 4658 | 31 | 798 | 4658 | 377.69 | |
| REQUEST | 85 | 4 | 4645 | 29 | 1012 | 4645 | 377.69 | |
| REQUEST | 80 | 7 | 2571 | 30 | 798 | 2571 | 282.40 | |
Summary of the results of Drosophila biarmipes in selection, correction, and contigs.
Selection

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 1,375,649 | 93,368 | 61 | 4102 | 845,134 | 61.44 | 79.57 | 82.58 |
| Random | 95 | 1,306,867 | 93,368 | 61 | 4102 | 802,968 | 61.44 | 79.57 | 82.24 |
| Random | 90 | 1,260,870 | 93,368 | 61 | 4101 | 760,614 | 60.32 | 79.57 | 82.58 |
| Random | 85 | 1,192,229 | 93,368 | 61 | 4101 | 718,489 | 60.26 | 79.57 | 82.58 |
| Random | 80 | 1,123,446 | 93,368 | 61 | 4102 | 676,352 | 60.20 | 79.57 | 82.58 |
| REQUEST | 95 | 1,306,867 | 93,368 | 83 | 4298 | 844,504 | 64.62 | | |
| REQUEST | 90 | 1,260,870 | 93,368 | 83 | 4503 | 841,439 | 66.73 | | |
| REQUEST | 85 | 1,192,229 | 93,368 | 83 | 4725 | 833,911 | 69.95 | | |
| REQUEST | 80 | 1,123,446 | 93,368 | 105 | 4950 | 818,457 | 72.85 | | |

Correction

| Method | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 628,180 | 53,163 | 2000 | 6743 | 625,472 | 99.57 | 89.25 | 94.68 |
| Random | 95 | 594,932 | 52,702 | 2000 | 6654 | 592,270 | 99.55 | 89.22 | 94.67 |
| Random | 90 | 571,876 | 52,531 | 2000 | 6579 | 558,019 | 97.58 | 89.21 | 94.68 |
| Random | 85 | 536,463 | 49,260 | 2000 | 6452 | 522,184 | 97.34 | 89.19 | 94.68 |
| Random | 80 | 498,685 | 47,746 | 2000 | 6297 | 483,630 | 96.98 | 89.19 | 94.69 |
| REQUEST | 95 | 634,003 | 53,154 | 2000 | 6713 | 630,933 | 99.52 | 89.10 | 94.55 |
| REQUEST | 90 | 633,478 | 53,154 | 2000 | 6715 | 629,206 | 99.33 | 89.11 | 94.56 |
| REQUEST | 85 | 632,026 | 53,154 | 2000 | 6719 | 629,206 | 99.55 | 89.14 | 94.56 |
| REQUEST | 80 | 627,427 | 53,157 | 2000 | 6731 | 575,145 | 91.67 | 89.25 | 94.72 |

Contigs

| Method | P (%) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| All reads | 100 | 2185 | 673 | 10,602 | 304 | 67 | 31.00 | 55.65 |
| Random | 95 | 2051 | 530 | 9689 | 216 | 57 | 27.00 | 46.36 |
| Random | 90 | 1868 | 301 | 8973 | 176 | 50 | 23.00 | 36.92 |
| Random | 85 | 1635 | 226 | 8165 | 160 | 43 | 16.00 | 28.09 |
| Random | 80 | 1385 | 191 | 7376 | 112 | 39 | 10.00 | 20.90 |
| REQUEST | 95 | 2164 | 552 | 10,815 | 307 | 68 | 31.00 | |
| REQUEST | 90 | 2142 | 552 | 10,732 | 234 | 68 | 31.00 | |
| REQUEST | 85 | 2132 | 552 | 10,616 | 234 | 68 | 31.00 | |
| REQUEST | 80 | 2113 | 552 | 10,734 | 234 | 67 | 31.00 | |
Figure 3. Relationship between identity and predicted quality (SQ) score. The identity was grouped into 65–70%, 70–75%, 75–80%, 80–85%, 85–90%, 90–95%, and 95–100%. For each group, the distribution of SQ scores was plotted. (a) Comparison of Escherichia coli; (b) comparison of Yersinia pestis; (c) comparison of Drosophila biarmipes.
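The grouping described in the caption can be sketched as follows. This is an illustrative reconstruction only: the handling of bin boundaries (lower edge inclusive, 100% assigned to the last group), the exclusion of reads below 65%, and the names `identity_bin` and `group_scores` are assumptions, not details given in the source.

```python
from collections import defaultdict

# Group edges taken from the caption: 65-70%, 70-75%, ..., 95-100%.
BIN_EDGES = [65, 70, 75, 80, 85, 90, 95, 100]


def identity_bin(identity):
    """Label of the 5% identity group, or None if outside 65-100%."""
    if identity < BIN_EDGES[0] or identity > BIN_EDGES[-1]:
        return None
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        # Lower edge inclusive; exactly 100% falls into the last group.
        if identity < hi or hi == BIN_EDGES[-1]:
            return f"{lo}-{hi}%"


def group_scores(identities, scores):
    """Collect the SQ scores of the reads falling into each identity group."""
    groups = defaultdict(list)
    for ident, sq in zip(identities, scores):
        label = identity_bin(ident)
        if label is not None:
            groups[label].append(sq)
    return dict(groups)
```

The per-group lists returned by `group_scores` are what a box plot or density plot per identity group would be drawn from.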