Francesc Muyas, Mattia Bosio, Anna Puig, Hana Susak, Laura Domènech, Georgia Escaramis, Luis Zapata, German Demidov, Xavier Estivill, Raquel Rabionet, Stephan Ossowski.
Abstract
In recent years, next-generation sequencing (NGS) has become a cornerstone of clinical genetics and diagnostics. Many clinical applications require high precision, especially when rare events such as somatic mutations in cancer or genetic variants causing rare diseases need to be identified. Although random sequencing errors can be modeled statistically and deep sequencing minimizes their impact, systematic errors remain a problem even at high depth of coverage. Understanding their source is crucial to increasing the precision of clinical NGS applications. In this work, we studied the relation between recurrent biases in allele balance (AB), systematic errors, and false positive variant calls across a large cohort of human samples analyzed by whole exome sequencing (WES). We modeled the AB distribution for biallelic genotypes in 987 WES samples in order to identify positions that recurrently deviate significantly from the expectation, a phenomenon we termed allele balance bias (ABB). Furthermore, we developed a genotype callability score based on ABB for all positions of the human exome, which detects false positive variant calls that passed state-of-the-art filters. Finally, we demonstrate the use of ABB for detecting false associations proposed by rare variant association studies. Availability: https://github.com/Francesc-Muyas/ABB.
Keywords: allele balance; false positive variant calls; genetic variant detection; systematic NGS errors
Year: 2018 PMID: 30353964 PMCID: PMC6587442 DOI: 10.1002/humu.23674
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
Figure 1 (a) Observed (bars) and expected (density) allele balance (AB) distributions split by genotype. (b) Gaussian mixture model (GMM) of the allele balance deviation devAB, separating non‐deviated (0) and deviated (1) positions. (c) Precision‐Recall curves and PR‐AUC for the logistic regression model LR‐1. The color gradient on the right shows the LR response value (probability of belonging to class 1) obtained by logistic regression. (d) Correlation of LR response and precision. Precision was measured in the test and validation sets using labels defined by the GMM. Confidence levels were defined by visual inspection.
Distribution of ABB genotype callability levels in the whole exome, germline SNV calls, and somatic SNV calls
| ABB callability | Whole exome | Germline SNV | Somatic SNV |
|---|---|---|---|
| High confidence [0–0.15) | 99.736% | 44.955% | 80.286% |
| Medium confidence [0.15–0.75) | 0.205% | 44.585% | 5.771% |
| Low confidence [0.75–0.9) | 0.033% | 6.665% | 5.865% |
| Very low confidence [0.9–1] | 0.025% | 3.796% | 8.077% |
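The callability ranges in the table above can be applied to a per-position score with a simple lookup. The following is a minimal sketch, assuming the ABB score is available as a float in [0, 1]; the function name `callability_level` is hypothetical and is not taken from the ABB repository.

```python
from bisect import bisect_right

# Bin edges copied from the table above. Each range is closed on the left
# and open on the right, except the last one, which includes 1.
_EDGES = [0.15, 0.75, 0.9]
_LABELS = [
    "High confidence",
    "Medium confidence",
    "Low confidence",
    "Very low confidence",
]


def callability_level(abb_score: float) -> str:
    """Map an ABB score in [0, 1] to its callability level (hypothetical helper)."""
    if not 0.0 <= abb_score <= 1.0:
        raise ValueError(f"ABB score must lie in [0, 1], got {abb_score!r}")
    # bisect_right counts the edges <= score, which is the index of the bin.
    return _LABELS[bisect_right(_EDGES, abb_score)]


if __name__ == "__main__":
    for score in (0.02, 0.15, 0.8, 0.95):
        print(score, callability_level(score))
```

Because the ranges are half-open, a score of exactly 0.15 falls into the medium level and a score of exactly 0.9 into the very low level.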
Figure 2 (a) ABB classifications of heterozygous SNPs reported by GATK HaplotypeCaller. Shape of AB distribution of variants identified by GATK + VQSR (left); AB distribution of low (red) compared with high (green) confidence positions (middle); and AB distribution after ABB filtering (right). (b) ROC curve of Sanger validation results compared with ABB (AUC = 0.778). (c) Proportion of True Positive (TP) and False Positive (FP) variants in four ABB genotype callability ranges.
Enrichment of somatic SNV calls in dbSNP and COSMIC, separated by ABB callability range
| ABB callability | Novel | COSMIC | dbSNP |
|---|---|---|---|
| All SNVs [0–1] | 80.89% | 4.53% | 14.58% |
| High confidence [0–0.15) | 85.60% | 4.89% | 9.51% |
| Mid confidence [0.15–0.75) | 68.08% | 3.47% | 28.45% |
| Low confidence [0.75–0.9) | 67.95% | 4.06% | 27.99% |
| Very low confidence [0.9–1] | 52.60% | 2.02% | 45.38% |
Row 1 shows results for the complete call set used as baseline.
*P value < 10E−3.
***P value < 2E−16.
Results of Sanger validation grouped by ABB genotype callability levels
| ABB callability | SNVs | TP | TP rate | FP | FP rate | Failed | Fail rate |
|---|---|---|---|---|---|---|---|
| High confidence [0–0.15) | 42 | 38 | 100.00% | 0 | 0.00% | 4 | 9.52% |
| Mid confidence [0.15–0.75) | 73 | 55 | 84.62% | 10 | 15.38% | 8 | 10.96% |
| Low confidence [0.75–0.9) | 46 | 20 | 68.97% | 9 | 31.03% | 17 | 36.96% |
| Very low confidence [0.9–1] | 48 | 21 | 50.00% | 21 | 50.00% | 6 | 12.50% |
Failed Sanger sequencing experiments were ignored for the FP and TP rate calculation.
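The rate columns in the Sanger table use two different denominators: TP and FP rates are computed over successful experiments only, while the fail rate is computed over all tested SNVs. A minimal sketch of that arithmetic follows; the function name `sanger_rates` is hypothetical, and the counts in the usage example are copied from the mid confidence row.

```python
def sanger_rates(tp: int, fp: int, failed: int) -> dict:
    """Compute TP, FP, and fail rates from Sanger validation counts.

    Failed experiments are excluded from the TP/FP denominators, as stated
    in the table footnote, but counted in the fail-rate denominator.
    """
    evaluable = tp + fp          # experiments that produced a result
    total = evaluable + failed   # all SNVs sent for validation
    if evaluable == 0:
        raise ValueError("no successful Sanger experiments to evaluate")
    return {
        "tp_rate": tp / evaluable,
        "fp_rate": fp / evaluable,
        "fail_rate": failed / total,
    }


if __name__ == "__main__":
    # Mid confidence row: 73 SNVs = 55 TP + 10 FP + 8 failed.
    rates = sanger_rates(tp=55, fp=10, failed=8)
    for name, value in rates.items():
        print(f"{name}: {value:.2%}")
```

For the mid confidence row this gives 55/65 = 84.62% and 10/65 = 15.38% for the TP and FP rates, and 8/73 = 10.96% for the fail rate, matching the table.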
Enrichment of ABB very low confidence (VLC) positions in public variant databases
| Database | Total positions | VLC Obs. | VLC Freq. Obs. | Ratio Obs./Exp. |
|---|---|---|---|---|
| Exome | 81,609,944 | 20,725 | 0.03% | 1 |
| dbSNP | 3,172,724 | 12,787 | 0.40% | 15.87 |
| EVS | 1,840,709 | 1,114 | 0.06% | 2.38 |
| 1000GP | 2,653,982 | 4,690 | 0.18% | 6.96 |
| ExAC | 2,662,396 | 3,510 | 0.13% | 5.19 |
The fraction of VLC positions in the exome was used as expected value.
*P value < 10E−16.
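The Obs./Exp. column can be reproduced from the counts in the table by dividing each database's VLC frequency by the exome-wide VLC frequency. The sketch below assumes only those counts; the function name `vlc_enrichment` is hypothetical, and it does not reproduce the significance test behind the reported P value.

```python
# Exome-wide counts from the first table row, used as the expectation.
EXOME_TOTAL = 81_609_944
EXOME_VLC = 20_725


def vlc_enrichment(vlc_observed: int, total_positions: int) -> float:
    """Ratio of a database's VLC frequency to the exome-wide VLC frequency."""
    if total_positions <= 0:
        raise ValueError("total_positions must be positive")
    observed_freq = vlc_observed / total_positions
    expected_freq = EXOME_VLC / EXOME_TOTAL
    return observed_freq / expected_freq


if __name__ == "__main__":
    # (total positions, VLC positions) per database, copied from the table.
    databases = {
        "dbSNP": (3_172_724, 12_787),
        "EVS": (1_840_709, 1_114),
        "1000GP": (2_653_982, 4_690),
        "ExAC": (2_662_396, 3_510),
    }
    for name, (total, vlc) in databases.items():
        print(f"{name}: {vlc_enrichment(vlc, total):.2f}")
```

Note that the ratios are computed from unrounded frequencies; dividing the rounded percentages shown in the table (for example 0.40% by 0.03%) would give noticeably different values.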