Liang Zhao, Jin Xie, Lin Bai, Wen Chen, Mingju Wang, Zhonglei Zhang, Yiqi Wang, Zhe Zhao, Jinyan Li.
Abstract
BACKGROUND: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is finding a good frequency cutoff f0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
Keywords: Error correction; Next-generation sequencing; z-score
Year: 2018 PMID: 30598110 PMCID: PMC6311904 DOI: 10.1186/s12864-018-5272-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Frequency distribution of both error-free and error-containing k-mers for an NGS data set. The frequency distribution of erroneous k-mers is represented by the dashed orange line, while the distribution of correct k-mers is shown as the dashed sky-blue line. The solid black line is the distribution of all k-mers. The α-labeled area is the proportion of correct k-mers with frequency less than f0, while the β-labeled area is the proportion of erroneous k-mers with frequency greater than f0.
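The cutoff idea in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`count_kmers`, `split_by_cutoff`, `alpha_beta`) are hypothetical, treating a frequency equal to f0 as solid is an assumption, and α and β are computed over distinct k-mers given a known set of correct k-mers, which is only available for synthetic or reference-aligned data.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (overlapping windows) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f0):
    """Split k-mers into solid and weak sets.

    Assumption: a k-mer whose frequency equals f0 is treated as solid.
    """
    solid = {kmer for kmer, c in counts.items() if c >= f0}
    weak = set(counts) - solid
    return solid, weak


def alpha_beta(counts, correct_kmers, f0):
    """Return (alpha, beta) as described in the Fig. 1 caption.

    alpha: share of correct k-mers with frequency below f0.
    beta:  share of erroneous k-mers with frequency above f0.
    Assumption: shares are taken over distinct k-mers.
    """
    correct = [k for k in counts if k in correct_kmers]
    wrong = [k for k in counts if k not in correct_kmers]
    alpha = sum(counts[k] < f0 for k in correct) / len(correct) if correct else 0.0
    beta = sum(counts[k] > f0 for k in wrong) / len(wrong) if wrong else 0.0
    return alpha, beta
```

Raising f0 in this sketch shrinks β but grows α, which is the trade-off the caption describes.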
Fig. 2 Illustration of the forward and backward search to correct sequencing errors. The forward search proceeds from the first k-mer to the last k-mer. At each step, the last base of the k-mer is substituted by its alternatives to check solidity. Conversely, the backward search proceeds from the last k-mer to the first k-mer. In contrast to the forward search, the first base of each k-mer is altered rather than the last one.
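The two search directions in the Fig. 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions rather than the published algorithm: the function names are hypothetical, the solid k-mer set is passed in as a plain Python set, and a substitution is accepted only when exactly one alternative base yields a solid k-mer (an assumption chosen here to avoid ambiguous fixes).

```python
BASES = "ACGT"


def forward_search(read, k, solid):
    """Scan k-mers from first to last; for a non-solid k-mer, try
    replacing its LAST base with each alternative."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if kmer in solid:
            continue
        prefix = kmer[:-1]
        candidates = [b for b in BASES if b != kmer[-1] and prefix + b in solid]
        if len(candidates) == 1:  # assumption: accept only unambiguous fixes
            bases[i + k - 1] = candidates[0]
    return "".join(bases)


def backward_search(read, k, solid):
    """Scan k-mers from last to first; for a non-solid k-mer, try
    replacing its FIRST base with each alternative."""
    bases = list(read)
    for i in range(len(bases) - k, -1, -1):
        kmer = "".join(bases[i:i + k])
        if kmer in solid:
            continue
        suffix = kmer[1:]
        candidates = [b for b in BASES if b != kmer[0] and b + suffix in solid]
        if len(candidates) == 1:  # assumption: accept only unambiguous fixes
            bases[i] = candidates[0]
    return "".join(bases)
```

In this sketch an error in the first base of a read is never the last base of any k-mer, so only the backward pass can reach it; that is one reason to run both directions.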
The data sets that are used for evaluating the performance of error correction models
| Data set | Genome name | Genome size (bp) | Error rate (%) | Read length (bp) | Coverage | Number of reads | Insert length | Is synthetic |
|---|---|---|---|---|---|---|---|---|
| R1 | S. aureus | 2,821,361 | 1.28 | 101 | 46.3 × | 1,294,104 | 180 | No |
| R2 | R. sphaeroides | 4,603,110 | 1.08 | 101 | 45.0 × | 2,050,868 | 180 | No |
| R3 | H. chromosome 14 | 88,218,286 | 0.52 | 101 | 41.8 × | 36,504,800 | 155 | No |
| R4 | B. impatiens | 249,185,056 | 0.86 | 124 | 150.8 × | 303,118,594 | 400 | No |
| S1 | H. chromosome 14 | 88,218,286 | 0.97 | 101 | 41.8 × | 36,504,800 | 180 | Yes |
| S2 | B. impatiens | 249,185,056 | 0.98 | 124 | 150.8 × | 303,118,594 | 400 | Yes |
Error-correction performance comparison between ZEC, Lighter, Racer, BLESS2, Musket, BFC, SGA and MEC
| Data | Corrector | Gain | Recall | Precision | Pber (%) |
|---|---|---|---|---|---|
| R1 | ZEC | 0.908 | 0.912 | 0.996 | 0.102 |
| R1 | Lighter | 0.839 | 0.845 | 0.994 | 0.163 |
| R1 | Racer | 0.760 | 0.822 | 0.929 | 0.190 |
| R1 | BLESS2 | 0.189 | 0.409 | 0.650 | 0.879 |
| R1 | Musket | 0.499 | 0.628 | 0.830 | 0.448 |
| R1 | SGA | 0.746 | 0.815 | 0.922 | 0.202 |
| R1 | BFC | 0.753 | 0.817 | 0.927 | 0.196 |
| R1 | MEC | | 0.911 | 0.998 | 0.102 |
| R2 | ZEC | | 0.663 | 0.894 | 0.537 |
| R2 | Lighter | 0.226 | 0.329 | 0.762 | 1.076 |
| R2 | Racer | 0.364 | 0.450 | 0.839 | 0.780 |
| R2 | BLESS2 | 0.318 | 0.405 | 0.806 | 0.890 |
| R2 | Musket | 0.265 | 0.364 | 0.786 | 0.984 |
| R2 | SGA | 0.331 | 0.423 | 0.822 | 0.843 |
| R2 | BFC | 0.306 | 0.400 | 0.811 | 0.893 |
| R2 | MEC | 0.570 | 0.631 | 0.912 | 0.541 |
| R3 | ZEC | | 0.923 | 0.884 | 0.087 |
| R3 | Lighter | 0.445 | 0.764 | 0.706 | 0.256 |
| R3 | Racer | 0.562 | 0.814 | 0.764 | 0.196 |
| R3 | BLESS2 | 0.130 | 0.641 | 0.556 | 0.438 |
| R3 | Musket | 0.533 | 0.802 | 0.749 | 0.211 |
| R3 | SGA | 0.567 | 0.818 | 0.765 | 0.194 |
| R3 | BFC | 0.603 | 0.833 | 0.783 | 0.176 |
| R3 | MEC | 0.788 | 0.852 | 0.930 | 0.117 |
| R4 | ZEC | | 0.833 | 0.905 | 0.137 |
| R4 | Lighter | 0.126 | 0.408 | 0.591 | 0.688 |
| R4 | Racer | 0.313 | 0.541 | 0.703 | 0.484 |
| R4 | BLESS2 | -0.517 | 0.018 | 0.003 | 0.862 |
| R4 | Musket | 0.502 | 0.660 | 0.807 | 0.320 |
| R4 | SGA | 0.542 | 0.690 | 0.823 | 0.289 |
| R4 | BFC | 0.195 | 0.457 | 0.636 | 0.607 |
| R4 | MEC | 0.705 | 0.806 | 0.889 | 0.201 |
| S1 | ZEC | | 0.935 | 0.982 | 0.056 |
| S1 | Lighter | 0.791 | 0.851 | 0.934 | 0.130 |
| S1 | Racer | 0.882 | 0.916 | 0.964 | 0.071 |
| S1 | BLESS2 | 0.634 | 0.740 | 0.875 | 0.243 |
| S1 | Musket | 0.819 | 0.871 | 0.944 | 0.111 |
| S1 | SGA | 0.810 | 0.865 | 0.940 | 0.117 |
| S1 | BFC | 0.866 | 0.903 | 0.961 | 0.081 |
| S1 | MEC | 0.899 | 0.916 | 0.982 | 0.063 |
| S2 | ZEC | | 0.894 | 0.956 | 0.109 |
| S2 | Lighter | 0.058 | 0.329 | 0.548 | 0.891 |
| S2 | Racer | 0.168 | 0.408 | 0.630 | 0.720 |
| S2 | BLESS2 | 0.311 | 0.509 | 0.719 | 0.543 |
| S2 | Musket | 0.232 | 0.453 | 0.672 | 0.636 |
| S2 | SGA | 0.075 | 0.342 | 0.562 | 0.862 |
| S2 | BFC | 0.751 | 0.822 | 0.920 | 0.157 |
| S2 | MEC | 0.849 | 0.887 | 0.959 | 0.122 |
The numbers in bold face are the best gain achieved for each data set
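This record does not define the table's metrics. The sketch below assumes the definitions commonly used in read-error-correction benchmarks: TP counts erroneous bases that were fixed, FP counts correct bases that were wrongly changed, and FN counts erroneous bases left unfixed. Under those assumed definitions, gain can also be derived from recall and precision alone. The function names are hypothetical.

```python
def correction_metrics(tp, fp, fn):
    """Return (gain, recall, precision) under the assumed definitions:
    gain = (TP - FP) / (TP + FN), recall = TP / (TP + FN),
    precision = TP / (TP + FP)."""
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("metrics are undefined when a denominator is zero")
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    gain = (tp - fp) / (tp + fn)
    return gain, recall, precision


def gain_from_recall_precision(recall, precision):
    """Algebraic consequence of the definitions above:
    gain = recall * (2 - 1 / precision)."""
    return recall * (2.0 - 1.0 / precision)
```

As a consistency check under these assumptions, the R1 row for ZEC (recall 0.912, precision 0.996) gives `gain_from_recall_precision(0.912, 0.996)` of about 0.908, which agrees with the tabulated gain. A negative gain, as in the R4 row for BLESS2, means more errors were introduced than removed.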
Fig. 3 Relation between k-mer frequency and GC-content. The bottom left panel shows the smoothed scatter plot of k-mer frequency against GC-content, the top left panel shows the distribution of k-mer frequency, and the bottom right panel shows the distribution of GC-content. k-mers with high GC-content clearly have relatively low frequency. The data shown in this example were obtained from H. chromosome 14 with a k-mer size of 25.
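The quantity plotted in Fig. 3 can be reproduced from a k-mer count table with a short helper. This is a minimal sketch: the function names are hypothetical, and grouping by the number of G/C bases (rather than smoothing a scatter plot as in the figure) is a simplification chosen here.

```python
from collections import defaultdict


def gc_content(kmer):
    """Fraction of bases in the k-mer that are G or C."""
    kmer = kmer.upper()
    return (kmer.count("G") + kmer.count("C")) / len(kmer)


def mean_frequency_by_gc(counts):
    """Map number of G/C bases in a k-mer to the mean frequency of
    the k-mers with that many G/C bases."""
    groups = defaultdict(list)
    for kmer, freq in counts.items():
        gc_bases = kmer.upper().count("G") + kmer.upper().count("C")
        groups[gc_bases].append(freq)
    return {gc: sum(f) / len(f) for gc, f in sorted(groups.items())}
```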
Fig. 4 Relation between z-score and k-mer frequency. The level of shading represents the density of the distribution: the darker the color, the more k-mers are present. The k-mers highlighted in the red box have frequencies below nine, so they are very likely to be treated as weak by all existing k-mer based approaches. However, their very high z-scores indicate that they should be treated as solid k-mers. The data shown here were obtained from B. impatiens with a k-mer size of 25.
Fig. 5 The proportion of k-mers refined by z-score. The refinements are of two kinds: weak k-mers with a high z-score (moved to the solid k-mer set), and solid k-mers with a low z-score (excluded from the solid k-mer set).
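The two refinement moves in the Fig. 5 caption can be sketched as below. This record does not say how the z-score is computed, so the sketch makes a loud assumption: a k-mer's z-score is the standard score of its frequency within the group formed by itself and its observed single-substitution neighbours. The thresholds `z_high` and `z_low` and all function names are hypothetical.

```python
from statistics import mean, pstdev

BASES = "ACGT"


def neighbours(kmer):
    """Yield every k-mer that differs from `kmer` by one substitution."""
    for i, old in enumerate(kmer):
        for b in BASES:
            if b != old:
                yield kmer[:i] + b + kmer[i + 1:]


def z_score(kmer, counts):
    """ASSUMPTION: z-score of the k-mer's frequency relative to itself
    plus its observed one-substitution neighbours."""
    group = [counts[kmer]] + [counts[n] for n in neighbours(kmer) if n in counts]
    sd = pstdev(group)
    if sd == 0:
        return 0.0
    return (counts[kmer] - mean(group)) / sd


def refine_solid_set(counts, f0, z_high, z_low):
    """Frequency cutoff first, then the two z-score refinements:
    promote weak k-mers with z >= z_high, drop solid k-mers with z < z_low."""
    solid = set()
    for kmer, freq in counts.items():
        z = z_score(kmer, counts)
        if freq >= f0:
            if z >= z_low:      # solid by frequency and not suspiciously low
                solid.add(kmer)
        elif z >= z_high:       # weak by frequency but stands out locally
            solid.add(kmer)
    return solid
```

Under this assumed scoring, a k-mer with frequency 5 surrounded by neighbours of frequency 1 is promoted, while a k-mer with frequency 10 next to a neighbour of frequency 100 is dropped, mirroring the low-frequency, high z-score cases highlighted in Fig. 4.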
Fig. 6 Memory saving analysis on the six data sets. The x-axis shows the memory-saving ratio between the actual memory allocation and the raw input size, while the y-axis shows the proportion of the input held by a bit vector.