Suyash S Shringarpure, Rasika A Mathias, Ryan D Hernandez, Timothy D O'Connor, Zachary A Szpiech, Raul Torres, Francisco M De La Vega, Carlos D Bustamante, Kathleen C Barnes, Margaret A Taub.
Abstract
Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X).
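One way to picture how array genotypes from the sequenced samples could supply training labels is sketched below. It is an illustrative assumption, not the authors' pipeline: the site keys, the encoding of genotypes as non-reference allele counts, and the function name `label_calls` are all hypothetical, and the Random Forest classifier itself is not shown.

```python
def label_calls(called_sites, array_genotypes):
    """Label called variant sites using array genotypes from the same sample.

    called_sites: iterable of (chrom, pos) sites the caller reported as variant.
    array_genotypes: dict mapping (chrom, pos) to the non-reference allele
        count (0, 1 or 2) observed on the genotype array (hypothetical encoding).
    Returns a dict mapping each site present on the array to True when the
    array also shows a non-reference allele, and False when the array shows a
    homozygous-reference genotype. Sites absent from the array stay unlabeled.
    """
    labels = {}
    for site in called_sites:
        if site in array_genotypes:
            labels[site] = array_genotypes[site] > 0
    return labels


# Toy data (made up for illustration): three called sites, two on the array.
calls = [("22", 100), ("22", 200), ("22", 300)]
array = {("22", 100): 1, ("22", 200): 0}
labels = label_calls(calls, array)
print(labels)
```

Under this assumed scheme, only sites covered by the array contribute labeled training examples; the fitted classifier would then be used to score the remaining sites.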
Year: 2017 PMID: 28035032 PMCID: PMC5408850 DOI: 10.1093/bioinformatics/btw786
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overlap between the three call sets for variants on chromosome 22 for our individual of interest.
Fig. 2. Feature importance for the Random Forest classifier distinguishing calls made by different calling algorithms, limited to individual-level features. The scale on the x-axis is unitless but indicates the relative importance of the different features.
Fig. 3. Feature importance for call set-specific classifiers based on Omni genotype data. Note that the frequency features refer to the estimates of the allele frequency from the call set being studied. Also note that the INDELPROX variable has a value of 0 for RTG.
Fig. 4. Zoomed-in detail of ROC curves for call set-specific classifiers based on Omni genotype data.
Fig. 5. Call set-specific classifier scores for all sites, stratified by the call sets in which the site was found. Shown are calls from Illumina (left), RTG (center) and GATK (right). Colors represent the number of call sets in which a particular variant was observed, with pink for one call set, green for two call sets and blue for all three call sets.
Fig. 6. Call set-specific classifier scores for all sites, stratified by whether the site was found in both Sample 1 and Sample 1R or in only one sample. Shown are calls from Illumina (left), RTG (center) and GATK (right). Colors represent the number of replicates in which a particular variant was observed, with pink for one replicate only and green for both replicates.
Predicted number of true variant sites from three different call sets using fitted call set-specific classifiers
| Variant caller | Total sites | Predicted variant sites | Rate |
|---|---|---|---|
| Illumina | 57 825 | 54 023 | 93% |
| RTG | 54 275 | 45 297 | 83% |
| GATK | 51 201 | 49 485 | 97% |
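The Rate column is the number of predicted variant sites divided by the total number of sites in each call set. A minimal sketch that recomputes it from the counts in the table above; the names `call_sets` and `predicted_rate` and the rounding to whole percentages are illustrative assumptions.

```python
# Counts copied from the table above: (total sites, predicted variant sites).
call_sets = {
    "Illumina": (57825, 54023),
    "RTG": (54275, 45297),
    "GATK": (51201, 49485),
}


def predicted_rate(total, predicted):
    """Return predicted variant sites as a whole-number percentage of total sites."""
    return round(100 * predicted / total)


rates = {caller: predicted_rate(*counts) for caller, counts in call_sets.items()}
for caller, rate in rates.items():
    print(f"{caller}: {rate}%")
```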
Number of predicted variant sites for the Illumina classifier, with percentages indicating fraction of total common SNPs (out of 50 547) or rare SNPs (out of 7278). The results are stratified by allele frequency, with common SNPs having frequency 5% or larger
| Method | Common SNPs (n = 50 547) | Rare SNPs (n = 7278) |
|---|---|---|
| No stratification | 47 828 (95%) | 6354 (87%) |
| Variable thresholds | 48 870 (97%) | 6043 (83%) |
| Separate training | 47 547 (94%) | 5280 (73%) |
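The stratification and the percentages in the table above can be checked with a short sketch. The 5% frequency cutoff and all counts come from the caption and table; the helper names `stratum` and `pct` are hypothetical, and rounding to whole percentages is an assumption.

```python
COMMON_TOTAL = 50547  # common SNPs (allele frequency of 5% or larger)
RARE_TOTAL = 7278     # rare SNPs (allele frequency below 5%)


def stratum(allele_frequency, cutoff=0.05):
    """Assign a site to the common or rare stratum by its allele frequency."""
    return "common" if allele_frequency >= cutoff else "rare"


def pct(count, total):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# Predicted variant sites per method: (common SNPs, rare SNPs).
predicted = {
    "No stratification": (47828, 6354),
    "Variable thresholds": (48870, 6043),
    "Separate training": (47547, 5280),
}

fractions = {
    method: (pct(common, COMMON_TOTAL), pct(rare, RARE_TOTAL))
    for method, (common, rare) in predicted.items()
}
for method, (common_pct, rare_pct) in fractions.items():
    print(f"{method}: common {common_pct}%, rare {rare_pct}%")
```

The two denominators sum to 57 825, which matches the total number of Illumina sites reported in the preceding table.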