Jiaqi Liu, Jiayin Wang, Xiao Xiao, Xin Lai, Daocheng Dai, Xuanping Zhang, Xiaoyan Zhu, Zhongmeng Zhao, Juan Wang, Zhimin Li.
Abstract
BACKGROUND: Third-generation sequencing technology, which features longer read lengths, is a major advance over next-generation sequencing and has greatly promoted biological research. However, third-generation sequencing data have high sequencing error rates, which inevitably affect downstream analysis. Although sequencing error rates have been improving in recent years, large amounts of data were produced at high error rates, and discarding them would be a huge waste. Error correction for third-generation sequencing data is therefore especially important. Existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms. Error correction algorithms for heterozygous loci are therefore lacking, especially at low coverage.
Keywords: Error correction method; Heterozygous variant; Hybrid correction method; PacBio sequencing; Probabilistic model; Sequencing analysis; Sequencing error
Year: 2020 PMID: 33208104 PMCID: PMC7677778 DOI: 10.1186/s12864-020-07008-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The voting rule in existing hybrid correction methods incorrectly handles heterozygous
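The failure mode named in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the authors' method: the pileup column, the function names, and the use of the [0.20,0.80] interval as an allele-fraction test are all assumptions made here for illustration. The sketch shows how a plain majority vote collapses a site carrying two alleles into a single base.

```python
from collections import Counter

def majority_vote(bases):
    """Return the most frequent base in a pileup column (plain voting rule)."""
    return Counter(bases).most_common(1)[0][0]

def allele_fractions(bases):
    """Return the fraction of the column supporting each observed base."""
    counts = Counter(bases)
    total = len(bases)
    return {base: count / total for base, count in counts.items()}

def looks_heterozygous(bases, low=0.20, high=0.80):
    """Assumed test: the top allele fraction lies inside [low, high]."""
    top_fraction = max(allele_fractions(bases).values())
    return low <= top_fraction <= high

# Hypothetical short-read bases aligned to one site with alleles A and G.
column = list("AAAAAAGGGG")

consensus = majority_vote(column)      # voting keeps only one allele
fractions = allele_fractions(column)   # both alleles are well supported
print(consensus, fractions, looks_heterozygous(column))
```

Under these assumptions, voting reports only the majority base even though 40% of the reads support the other allele, so the second allele would be "corrected" away.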
Fig. 2 Accuracy of QIHC with different coverages of S
The comparisons on accuracy between and
| | Coverage of | | | | |
|---|---|---|---|---|---|
| Heterozygous interval | 5 × | 10 × | 15 × | 20 × | 50 × |
| [0.20,0.80] | 0.948 | 0.958 | 0.954 | 0.942 | 0.940 |
| | 0.092 | 0.230 | 0.322 | 0.316 | 0.288 |
| [0.25,0.75] | 0.938 | 0.946 | 0.936 | 0.922 | 0.914 |
| | 0.086 | 0.206 | 0.290 | 0.294 | 0.258 |
| [0.30,0.70] | 0.826 | 0.876 | 0.866 | 0.836 | 0.838 |
| | 0.066 | 0.168 | 0.224 | 0.236 | 0.200 |
| [0.35,0.65] | 0.666 | 0.750 | 0.702 | 0.676 | 0.710 |
| | 0.022 | 0.110 | 0.138 | 0.166 | 0.138 |
The comparisons on accuracy among , and
| Heterozygous interval | [0.20,0.80] | | | | | [0.25,0.75] | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Coverage of | 3 × | 5 × | 10 × | 12 × | 15 × | 3 × | 5 × | 10 × | 12 × | 15 × |
| | 0.888 | 0.954 | 0.978 | 0.95 | 0.972 | 0.878 | 0.940 | 0.972 | 0.940 | 0.956 |
| | 0.780 | 0.922 | 0.968 | 0.958 | 0.988 | 0.724 | 0.896 | 0.962 | 0.948 | 0.982 |
| | 0.112 | 0.234 | 0.336 | 0.344 | 0.368 | 0.102 | 0.222 | 0.304 | 0.302 | 0.336 |
Fig. 3 Accuracy of QIHC, Canu and Jabba
The comparisons on heterozygosity quality between and
| Difference value | Negative | Draw | Positive | |
|---|---|---|---|---|
| 210 | 31 | 245 | ||
| excellence | good | |||
| 57 | 181 | |||
| 246 | 37 | 211 | ||
| excellence | good | |||
| 20 | 191 | |||
The comparisons on accuracy with different sequencing error rates of between and
| Heterozygous interval | [0.20,0.80] | | | | [0.25,0.75] | | | |
|---|---|---|---|---|---|---|---|---|
| Error rate of | 20% | 15% | 10% | 5% | 20% | 15% | 10% | 5% |
| | 0.920 | 0.868 | 0.882 | 0.882 | 0.902 | 0.850 | 0.862 | 0.870 |
| | 0.868 | 0.852 | 0.864 | 0.866 | 0.846 | 0.796 | 0.802 | 0.836 |
| Heterozygous interval | [0.30,0.70] | | | | [0.35,0.65] | | | |
| Error rate of | 20% | 15% | 10% | 5% | 20% | 15% | 10% | 5% |
| | 0.816 | 0.778 | 0.760 | 0.788 | 0.618 | 0.560 | 0.536 | 0.592 |
| | 0.780 | 0.710 | 0.690 | 0.768 | 0.592 | 0.520 | 0.504 | 0.626 |
Fig. 4 Density of four different Beta distributions
The comparisons on accuracy with different prior probabilities of homozygosity on
| Heterozygous interval | [0.20,0.80] | [0.25,0.75] | [0.30,0.70] |
|---|---|---|---|
| Prior probability | |||
| 0.10 | 0.652 | 0.638 | 0.576 |
| 0.25 | 0.940 | 0.936 | 0.844 |
| 0.35 | 0.950 | 0.936 | 0.872 |
| 0.50 | 0.964 | 0.944 | 0.866 |
| 0.75 | 0.952 | 0.940 | 0.868 |
Fig. 5 The dependency relationships and distribution among Ref, and R
Fig. 6 The possible distributions of bases