Krzysztof Kotlarz, Magda Mielczarek, Tomasz Suchocki, Bartosz Czech, Bernt Guldbrandtsen, Joanna Szyda.
Abstract
A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNP calls as correct or incorrect. The deep learning algorithms were implemented in Keras. Five algorithms were tested: (i) the basic, naïve algorithm; (ii) the naïve algorithm modified by imposing different weights on the incorrect and correct SNP classes when calculating the loss metric; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs so that their number matched 30%, 60%, or 100% of the number of correct SNPs. The training data set comprised data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set comprised data from one bull, with 749,506 correct (98.05%) and 14,908 incorrect SNPs. For a rare-event classification problem such as the detection of incorrect SNPs in NGS data, the most parsimonious naïve model and the model with weighted SNP classes gave the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, yielding an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less closely adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.
Keywords: Classification; Keras; Next-generation sequencing; Python; SNP calling; SNP microarray; TensorFlow
Year: 2020 PMID: 32996082 PMCID: PMC7652806 DOI: 10.1007/s13353-020-00586-0
Source DB: PubMed Journal: J Appl Genet ISSN: 1234-1983 Impact factor: 3.240
Fig. 1. Implementation scheme for the NAÏVE deep learning algorithm for SNP classification in Keras
Characteristics of the analysed data sets
| SNP characteristic | Training data: correct | Training data: incorrect | Validation data: correct | Validation data: incorrect |
|---|---|---|---|---|
| % of SNPs | 97.94% | 2.06% | 98.05% | 1.95% |
| Genotype count, 0/0 | 882,838 | 19,725 | 299,804 | 6,037 |
| Genotype count, 0/1 | 571,549 | 12,910 | 193,755 | 4,270 |
| Genotype count, 1/1 | 773,608 | 14,285 | 255,947 | 4,633 |
| Mean DP ± SD | 37.91 ± 11.66 | 30.68 ± 13.21 | 37.61 ± 11.95 | 33.51 ± 13.08 |
| DP range | 1–587 | 1–457 | 1–587 | 1–457 |
| Mean DP2 ± SD | 9.53 ± 4.16 | 6.20 ± 4.09 | 9.27 ± 3.59 | 7.13 ± 4.33 |
| DP2 range | 1–159 | 1–113 | 1–192 | 1–184 |
| Mean QUAL ± SD | 484.15 ± 430.14 | 365.74 ± 318.29 | 479.85 ± 429.82 | 405.4 ± 340.08 |
| QUAL range | 10.00–3829.35 | 10.00–4516.94 | 10.00–3829.35 | 10.0–4516.90 |
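The class imbalance in this table (about 2% incorrect SNPs) is what the weighted and re-sampled variants described in the abstract try to address. The record does not state which weights were used or exactly how the re-sampling was carried out, so the following is only a minimal sketch under assumptions: the inverse-frequency weighting heuristic, the helper names `inverse_frequency_weights` and `oversample_incorrect`, the label coding (1 = incorrect SNP) and the toy labels are all hypothetical.

```python
import numpy as np

# Training-set class counts as reported in the abstract.
N_CORRECT = 2_227_995
N_INCORRECT = 46_920


def inverse_frequency_weights(n_correct, n_incorrect):
    """Class weights proportional to 1 / class size (a common heuristic).

    ASSUMPTION: the source does not say which weights the authors used.
    Label 0 = correct SNP, label 1 = incorrect SNP.
    """
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def oversample_incorrect(labels, target_fraction, rng):
    """Return row indices with incorrect SNPs drawn with replacement.

    The number of drawn incorrect SNPs equals target_fraction times the
    number of correct SNPs (0.3, 0.6 or 1.0 in the abstract).
    ASSUMPTION: every incorrect row in the result is a random draw; the
    source does not say whether the original rows were all kept.
    """
    correct_idx = np.flatnonzero(labels == 0)
    incorrect_idx = np.flatnonzero(labels == 1)
    n_target = int(round(target_fraction * correct_idx.size))
    resampled = rng.choice(incorrect_idx, size=n_target, replace=True)
    idx = np.concatenate([correct_idx, resampled])
    rng.shuffle(idx)  # mix the classes so batches are not ordered by label
    return idx


weights = inverse_frequency_weights(N_CORRECT, N_INCORRECT)

# Toy labels with roughly the same share of incorrect SNPs (hypothetical data).
rng = np.random.default_rng(0)
toy_labels = (rng.random(10_000) < 0.02).astype(int)
resampled_sizes = {}
for fraction in (0.3, 0.6, 1.0):
    idx = oversample_incorrect(toy_labels, fraction, rng)
    resampled_sizes[fraction] = int(np.sum(toy_labels[idx] == 1))
```

A dictionary shaped like `weights` can be passed to Keras through the `class_weight` argument of `Model.fit`; whether the authors did it this way is not stated in the record.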
Fig. 2. Probabilities of each SNP being incorrect, estimated based on the training data set, by the different algorithms
Fig. 3. Probability cutoff values for SNP classification into the correct or incorrect group, estimated by the different algorithms based on the optimisation either for the F1 or for the SUMSS metric
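Choosing a probability cutoff by optimising a metric can be sketched as a simple grid search. This is an illustration, not the authors' procedure: the grid, the function names and the toy data are hypothetical, and SUMSS is assumed here to mean sensitivity plus specificity, which this record does not define.

```python
import numpy as np


def confusion_counts(y_true, prob_incorrect, cutoff):
    """Count TP, FP, FN, TN, treating 'incorrect SNP' (label 1) as positive."""
    pred = prob_incorrect >= cutoff
    truth = y_true == 1
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn


def f1_metric(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sumss_metric(tp, fp, fn, tn):
    """ASSUMPTION: SUMSS = sensitivity + specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity


def best_cutoff(y_true, prob_incorrect, metric, grid=None):
    """Return the first cutoff on the grid that maximises the metric."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # hypothetical search grid
    scores = [metric(*confusion_counts(y_true, prob_incorrect, c)) for c in grid]
    return float(grid[int(np.argmax(scores))])


# Toy data (hypothetical): four correct SNPs and two incorrect SNPs.
y_toy = np.array([0, 0, 0, 0, 1, 1])
p_toy = np.array([0.05, 0.15, 0.25, 0.555, 0.705, 0.905])
cutoff_f1 = best_cutoff(y_toy, p_toy, f1_metric)
cutoff_sumss = best_cutoff(y_toy, p_toy, sumss_metric)
```

On real, overlapping probability distributions the two metrics generally favour different cutoffs, because F1 ignores true negatives while a sensitivity-plus-specificity sum weights both classes equally.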
Fig. 4. Classification of training data by the different algorithms, based on the probability cutoff thresholds estimated for the F1 or SUMSS metrics. The numbers above the columns give TP (percentage of true positive results), TN (percentage of true negative results) and F1 (value of the F1 metric)
Fig. 5. Classification of validation data by the different algorithms, based on the probability cutoff thresholds estimated for the F1 or SUMSS metrics. The numbers above the columns give TP (percentage of true positive results), TN (percentage of true negative results) and F1 (value of the F1 metric)
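As a rough arithmetic check, the F1 score can be recomputed from the rounded rates quoted in the abstract for the validation data (19% of incorrect SNPs detected, 99% of correct SNPs retained). The sketch assumes that an incorrect SNP is the positive class, which matches the abstract's wording but is not stated explicitly in this record.

```python
# Validation-set class sizes and rounded rates, as quoted in the abstract.
N_INCORRECT = 14_908   # positives (ASSUMPTION: incorrect SNP = positive class)
N_CORRECT = 749_506    # negatives
TP_RATE = 0.19         # share of truly incorrect SNPs classified as incorrect
TN_RATE = 0.99         # share of truly correct SNPs classified as correct

tp = TP_RATE * N_INCORRECT        # incorrect SNPs flagged as incorrect
fn = N_INCORRECT - tp             # incorrect SNPs missed
fp = (1 - TN_RATE) * N_CORRECT    # correct SNPs wrongly flagged

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these rounded inputs the result is about 0.22, close to the reported 0.21; a difference of that size is within what rounding the two rates to whole percentages can produce. The low precision illustrates why F1 stays small for a rare class: even a 1% false positive rate among roughly 750,000 correct SNPs yields more false positives than true positives.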