Niko Popitsch, Anna Schuh, Jenny C Taylor.
Abstract
MOTIVATION: The increasing adoption of clinical whole-genome resequencing (WGS) demands highly accurate and reproducible variant calling (VC) methods. The observed discordance between state-of-the-art VC pipelines, however, indicates that current practice still suffers from non-negligible numbers of false-positive and false-negative SNV and INDEL calls, which were shown to be enriched among discordant calls and in genomic regions with low sequence complexity.
Year: 2016 PMID: 27605105 PMCID: PMC5903559 DOI: 10.1093/bioinformatics/btw587
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Concordance scoring example. The figure shows three call set groups (grouped horizontal black lines) derived from three different VC pipelines. The called genotypes of the individual calls are represented as follows: black rectangles: homozygous variant calls; white circles: heterozygous variant calls; shaded rectangles: INDELs. The letters y and n below each group indicate genotype concordance or discordance among the VC pipelines; p is the index of the polymorphic positions; the two n counts give the numbers of concordant and discordant decisions, respectively; s is the concordance score calculated from these counts and a pair of weights. Note that position 4 shows a concordant call because the extended INDEL intervals (indicated by arrows) overlap. INDEL calls are also compared based on their called genotype.
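The INDEL comparison mentioned in the caption (overlap of extended intervals plus agreement of the called genotype) can be illustrated with a minimal sketch. The `IndelCall` type, the `pad` extension width and the function names are hypothetical and chosen only for illustration; the source does not state how far the intervals are extended.

```python
from typing import NamedTuple, Tuple


class IndelCall(NamedTuple):
    """Hypothetical representation of one INDEL call."""
    start: int      # inclusive start coordinate
    end: int        # inclusive end coordinate
    genotype: str   # called genotype, e.g. "0/1" or "1/1"


def extended_interval(call: IndelCall, pad: int) -> Tuple[int, int]:
    """Extend the call interval by `pad` bases on both sides (assumed scheme)."""
    return call.start - pad, call.end + pad


def indels_concordant(a: IndelCall, b: IndelCall, pad: int = 5) -> bool:
    """Two INDEL calls agree if their extended intervals overlap
    and their called genotypes are identical."""
    a_start, a_end = extended_interval(a, pad)
    b_start, b_end = extended_interval(b, pad)
    intervals_overlap = a_start <= b_end and b_start <= a_end
    return intervals_overlap and a.genotype == b.genotype
```

With this scheme, two nearby INDEL calls with the same genotype count as concordant even if their reported coordinates differ slightly, which is the situation shown at position 4 of the figure.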
Fig. 2. Statistics and evaluation results. (a) Results for evaluation experiment 1. The performance metrics shown were calculated by splitting the chr20 call sets from the described 219 WGS datasets into two random subsets, using one for training RG and the other as ground truth. The x-axis plots the percentage size of the training set. For each training set size, the experiment was repeated 10 times to avoid random artefacts. The plotted solid line corresponds to the mean value of the respective measure; the light-coloured corridor depicts the standard deviation. (b) Results for evaluation experiment 2. Accuracy, false discovery rate and negative predictive value boxplots for the various partition sets were calculated using 34 independent WGS samples as ground truth. A complete set of performance metrics is given in Supplementary Fig. S4. (c) Venn diagram showing the overlaps between four genomic partitions. The numbers show percentages of covered genomic positions that are considered reliable. 2.1% of the human reference genome (excluding assembly gaps) is not covered by any of the sets. (d) Bar plot showing the percentage of the human genome considered to be reliable/concordant per partition. (e) Results for evaluation experiment 3. The bar plot shows the precision for classifying heterozygous CHM1 calls as false positives. All call set labels are explained in the main text. Here, the combined RG + UM75 partition as well as other RG-derived sets show low and constant false-positive rates of around 5-7%, thereby outperforming other methods such as GIAB or PLAT by about 3-4X.
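The three measures named in panel (b) follow their standard confusion-matrix definitions, sketched below. The function name and the example counts are illustrative assumptions; which class counts as "positive" in the paper's evaluation is not stated in this excerpt.

```python
import math
from typing import Dict


def performance_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard definitions of accuracy, false discovery rate (FDR)
    and negative predictive value (NPV) from confusion-matrix counts.
    Undefined ratios (zero denominator) are returned as NaN."""

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),  # correct / all decisions
        "fdr": ratio(fp, tp + fp),                      # false positives among positive calls
        "npv": ratio(tn, tn + fn),                      # true negatives among negative calls
    }


# Made-up counts, for illustration only
example = performance_metrics(tp=90, fp=10, tn=80, fn=20)
```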
This table summarizes the results of several statistics for discordant calls derived from over 34 million polymorphic genomic positions in the described 219 WGS datasets.
| Category | Discordant variant positions… | SNV | INDEL | Figure |
|---|---|---|---|---|
| Variant calling and cohort related | Are less abundant in the cohort | Yes | No | S7 |
| | Show less classification agreement (more intermediate scores) | Yes | Yes | S10 |
| | Have less contributing datasets in the cohort | No | No | S11 |
| | Have lower variant qualities | Yes | Yes | S17 |
| | Potentially violate Hardy-Weinberg equilibrium more often | Yes | n/a | S19 |
| | Are more often multi-allelic | Yes | n/a | S22a |
| Genomic location related | Are more abundant in low-mappability regions | Yes | Yes | S14 |
| | Are less covered | Yes | Yes | S15 |
| | Are closer to adjacent INDELs | Yes | Yes | S16 |
| | Are closer to INDEL locations in the cohort | Yes | Yes | S16 |
| | Are less abundant in genes/exons | Yes | Yes | S20 |
| External DB related | Are less abundant in dbSNP | Yes | Yes | S8 |
| | Have allele frequencies that correlate worse with population AFs | Yes | Yes | S12 |
| | Are considered less deleterious by CADD | Yes | n/a | S18 |
| Sequence context related | Are enriched in annotated LCR | Yes | Yes | S20 |
| | Show reduced Ts/Tv ratios | Yes | n/a | S22ff |
| | Are enriched in AT rich regions | Partially | Partially | S23ff |
| | Show reduced sequence context complexity | Yes | Yes | S23ff |
A more detailed discussion of these data is given in the Supplement.
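One of the sequence-context statistics in the table, the transition/transversion (Ts/Tv) ratio, can be computed from a list of SNV alleles as in the minimal sketch below. The function name and the `(ref, alt)` input format are assumptions for illustration and do not reflect the authors' implementation.

```python
import math
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# substitutions; every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Return the Ts/Tv ratio for an iterable of (ref, alt) SNV alleles.
    Returns NaN if no transversions are present."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, skip
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else math.nan
```

A call set whose ratio is noticeably lower than that of a trusted reference set would be consistent with the "reduced Ts/Tv ratios" entry in the table.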