Rachel L Goldfeder, James R Priest, Justin M Zook, Megan E Grove, Daryl Waggott, Matthew T Wheeler, Marc Salit, Euan A Ashley.
Abstract
BACKGROUND: As whole exome sequencing (WES) and whole genome sequencing (WGS) transition from research tools to clinical diagnostic tests, it is increasingly critical for sequencing methods and analysis pipelines to be technically accurate. The Genome in a Bottle Consortium has recently published a set of benchmark SNV, indel, and homozygous reference genotypes for the pilot whole genome NIST Reference Material based on the NA12878 genome.
Year: 2016 PMID: 26932475 PMCID: PMC4774017 DOI: 10.1186/s13073-016-0269-0
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Complexity of the genome. a The genome consists of several (overlapping) regions. Eighty-six percent of 35 bp sequences and 95 % of 100 bp sequences are unique to one location in the reference genome. b A total of 50.6 % of the non-N reference genome falls into a repeat (data from RepeatMasker). c Exon count and number of exonic bases vary greatly from gene to gene (data from RefSeq). d An unrooted phylogenetic tree derived from multiple alignment of cDNA sequences of 10 voltage-gated sodium channel genes within the human genome illustrates the complex evolutionary relationships among paralogous sequences, which complicate short-read alignment in next-generation sequencing. A related voltage-gated calcium channel gene, CACNA1L, is included as an outgroup
Sensitivities for whole genome sequencing (WGS) and whole exome sequencing (WES) SNVs
| Function | Gene set | WGS SNV sensitivity | WES SNV sensitivity |
|---|---|---|---|
| Non-synonymous | ClinVarOMIM | 0.979 (0.970,0.985) | 0.936 (0.923,0.948) |
| Non-synonymous | Exome | 0.979 (0.975,0.982) | 0.936 (0.930,0.942) |
| Splicing | ClinVarOMIM | 0.889 (0.565,0.994) | 0.556 (0.267,0.811) |
| Splicing | Exome | 0.951 (0.865,0.983) | 0.629 (0.505,0.738) |
| Synonymous | ClinVarOMIM | 0.988 (0.982,0.992) | 0.952 (0.942,0.961) |
| Synonymous | Exome | 0.985 (0.983,0.988) | 0.952 (0.947,0.956) |
| Truncating | ClinVarOMIM | 1.000 (0.646,1.000) | 1.000 (0.646,1.000) |
| Truncating | Exome | 1.000 (0.924,1.000) | 0.915 (0.801,0.966) |
| Whole genome | N/A | 0.954 (0.954,0.955) | 0.053 (0.053,0.053) |
Sensitivity for different categories of potentially functional variants across different gene categories. Parentheses contain 95 % binomial confidence intervals
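The sensitivities and intervals in the table above can be illustrated with a minimal sketch. The source says only "95 % binomial confidence intervals" and does not name the method, so the Wilson score interval used here is an assumption, and the function names and counts are hypothetical, not taken from the paper.

```python
import math


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of benchmark variants that the pipeline recovered."""
    total = true_positives + false_negatives
    if total <= 0:
        raise ValueError("need at least one benchmark variant")
    return true_positives / total


def wilson_interval(successes: int, trials: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion.

    Assumption: the paper does not state which binomial interval it used;
    Wilson is one common choice. z = 1.959964 corresponds to 95 % coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: 7 benchmark variants, all 7 detected.
    low, high = wilson_interval(7, 7)
    print(f"sensitivity={sensitivity(7, 0):.3f} CI=({low:.3f},{high:.3f})")
```

With very few benchmark variants in a category, the interval stays wide even at perfect sensitivity, which is why the small ClinVarOMIM splicing and truncating sets carry much wider intervals than the exome-wide rows.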
Sites with falsely-called variants in one or more technologies and their presence in several databases
| | Sites (n) |
|---|---|
| Total variants with bias | 39,301 |
| Total variants with bias in databases | 7,467 |
| ClinVar | 4 |
| ESP | 38 |
| 1000 Genomes | 89 |
| dbSNP (v138) | 7,363 |
| COSMIC | 123 |
Fig. 2 a The fraction of each ACMG gene within GIAB high-confidence regions. b Violin plots showing the distribution of the fraction of each gene in the GIAB high-confidence regions for NA12878 for relevant gene sets: ACMG reportable genes, genes with variants in OMIM or ClinVar, and all genes. c Boxplots showing the distribution of the fraction of the first, second, middle, penultimate, and last exon of ClinVar or OMIM genes in the GIAB high-confidence regions
Reasons for low-confidence bases in ACMG genes
| Reason for low confidence | Percentage of bases |
|---|---|
| CNVs or other SVs that have been reported in dbVar for NA12878 | 47 |
| STRs in RepSeqSTRdb | 34 |
| Regions with known segmental duplications | 15 |
| Simple repeats from RepeatMasker | 1.7 |
| <3 datasets have at least 5 reads with mapping quality >10 | 1.3 |
| Abnormal allele balance | 0.17 |
| Unresolved conflicting genotypes after arbitration | 0.03 |
| Calls with support from <3 datasets after arbitration | 0.0082 |
| Local alignment problems | 0.0041 |
Fig. 3 a The number of sites in the genome where each 35 bp sequence appears for Genome in a Bottle high-confidence and low-confidence regions. b The fraction of each RepeatMasker repeat class in high-confidence regions
Genomic context of ClinVar (likely) pathogenic SNVs
| | n | % |
|---|---|---|
| Total likely pathogenic or pathogenic SNVs | 15,735 | |
| Likely pathogenic or pathogenic SNVs in high-confidence regions | 12,138 | 77.14 |
| Likely pathogenic or pathogenic SNVs that start a 35 bp unique sequence (a) | 15,289 | 97.17 |
| Likely pathogenic or pathogenic SNVs that start a 100 bp alignable sequence (b) | 15,438 | 98.11 |
| Total likely pathogenic or pathogenic SNVs with ≥ level 2 ClinVar review status | 1,212 | |
| Likely pathogenic or pathogenic SNVs with ≥ level 2 ClinVar review status in high-confidence regions | 998 | 82.34 |
| Likely pathogenic or pathogenic SNVs with ≥ level 2 ClinVar review status that start a 35 bp unique sequence (a) | 1,190 | 98.18 |
| Likely pathogenic or pathogenic SNVs with ≥ level 2 ClinVar review status that start a 100 bp alignable sequence (b) | 1,195 | 98.60 |
(a) The 35 bp sequence that starts at the SNV's genomic locus is present only once in the whole reference genome (hg19)
(b) The 100 bp sequence (with up to two mismatches) that starts at the SNV's genomic locus is present only once in the whole reference genome (hg19)
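The uniqueness criterion in the footnotes can be sketched on a toy sequence. This is an illustration under stated assumptions, not the authors' procedure: the function names and the toy reference are hypothetical, only exact matches on a single strand are counted, and the up-to-two-mismatch rule used for the 100 bp alignable sequences is not handled.

```python
from collections import Counter


def kmer_counts(reference: str, k: int) -> Counter:
    """Count every exact k-mer in the reference (forward strand only)."""
    return Counter(reference[i:i + k] for i in range(len(reference) - k + 1))


def starts_unique_kmer(reference: str, pos: int, k: int, counts: Counter) -> bool:
    """True if the k-mer starting at 0-based `pos` occurs exactly once."""
    kmer = reference[pos:pos + k]
    # A k-mer truncated by the end of the sequence cannot be evaluated.
    return len(kmer) == k and counts[kmer] == 1


def fraction_unique(reference: str, positions, k: int) -> float:
    """Fraction of the given positions that start a unique k-mer."""
    counts = kmer_counts(reference, k)
    positions = list(positions)
    if not positions:
        raise ValueError("no positions supplied")
    hits = sum(starts_unique_kmer(reference, p, k, counts) for p in positions)
    return hits / len(positions)


if __name__ == "__main__":
    toy_reference = "ACGTACGTTTGACC"  # hypothetical sequence, k=4 instead of 35
    print(fraction_unique(toy_reference, [0, 4, 5, 10], k=4))
```

On a real genome the k-mer table would not fit comfortably in a Python `Counter`; precomputed mappability tracks are the usual substitute, and the sketch is meant only to make the definition concrete.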
Fig. 4 Bar graphs displaying the fraction of ClinVar pathogenic or likely pathogenic SNVs in high-confidence regions, unique sequences (35 bp), and alignable sequences (100 bp). The black line represents the genome-wide value
Fig. 5 ClinVar variants within ACMG genes in the ExAC database. Depth of coverage in log2 space versus the number of samples that could not be called for that variant. Point size is proportional to the quality score from GATK during joint calling. Orange indicates that the variant is in a high-confidence NA12878 region, while blue indicates a low-confidence region. Triangles highlight variants that failed VQSR filtering