Alexandra R Buckley, Kristopher A Standish, Kunal Bhutani, Trey Ideker, Roger S Lasken, Hannah Carter, Olivier Harismendy, Nicholas J Schork.
Abstract
BACKGROUND: Cancer research to date has largely focused on somatically acquired genetic aberrations. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. Here we called germline variants on 9618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types.
Keywords: Batch effects; Cancer genomics; Cancer germline; GATK; Genetic association testing; TCGA; Variant annotation; Variant calling; Whole exome sequencing; Whole genome amplification
Year: 2017 PMID: 28606096 PMCID: PMC5467262 DOI: 10.1186/s12864-017-3770-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Overview of technical covariates for pan-cancer samples. For each covariate and cancer type, color represents the fraction of total samples. Fraction of total samples sums to 1 for each covariate and cancer type. Red indicates higher heterogeneity. Year first published included for context. TCGA cancer abbreviations: ACC, adrenocortical carcinoma; BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; ESCA, esophageal carcinoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PCPG, pheochromocytoma and paraganglioma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; SARC, sarcoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; TGCT, testicular germ cell tumors; THCA, thyroid carcinoma; UCEC, uterine corpus endometrioid carcinoma; UCS, uterine carcinosarcoma; UVM, uveal melanoma
Fig. 2 WGA increases LOF variant burden. a LOF variant burden includes both SNV and indels. Red line indicates expected LOF burden from ExAC (155). b Individual LOF variant burden in cancers with WGA samples plotted by WGA status. * = Wilcoxon rank sum test p < 0.05, ** = Wilcoxon rank sum test p < 0.001. c Individual LOF variant burden in n = 13 samples that have both DNA and WGA samples available. ** = Wilcoxon paired rank sum test p < 0.001
Variance in LOF SNV and indel burden explained by technical covariates
| | Sum. Sq. | Df | F value | p value | % Var. Exp. |
|---|---|---|---|---|---|
| LOF SNV | | | | | |
| C20X | 1785.49 | 1 | 52.95 | 3.72e-13 | 0.0056 |
| WGA | 156.89 | 1 | 4.65 | 3.10e-02 | 0.0005 |
| Center | 716.08 | 2 | 10.61 | 2.48e-05 | 0.0023 |
| BWA | 79.59 | 5 | 0.47 | 7.97e-01 | 0.0003 |
| RACE | 30698.90 | 5 | 182.10 | 1.33e-184 | 0.0973 |
| Residuals | 281966.85 | 8363 | | | 0.8940 |
| LOF Indel | | | | | |
| C20X | 52930.43 | 1 | 153.90 | 4.95e-35 | 0.0072 |
| WGA | 3744887.28 | 1 | 10888.62 | 0.0000 | 0.5080 |
| Center | 383585.43 | 2 | 557.65 | 4.53e-228 | 0.0520 |
| BWA | 169507.9 | 5 | 98.57 | 2.76e-101 | 0.0229 |
| RACE | 146904.86 | 5 | 85.42 | 7.59e-88 | 0.0199 |
| Residuals | 2876257.21 | 8363 | | | 0.3900 |
ANOVA results table
Sum. Sq.: Sum of Squares; Df: Degrees of Freedom; % Var. Exp.: Percent variance explained by each factor (factor Sum. Sq./total Sum. Sq.)
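The "% Var. Exp." column can be rechecked directly from the sums of squares. The following is a minimal Python sketch, not the authors' code: it takes the LOF indel rows of the table as input, applies the stated definition (factor Sum. Sq./total Sum. Sq.), and also recomputes each F value as the factor mean square divided by the residual mean square, which is the standard ANOVA definition. The names `LOF_INDEL`, `variance_explained`, and `f_values` are hypothetical.

```python
# Sum of squares and degrees of freedom, copied from the LOF indel rows of the table.
LOF_INDEL = {
    "C20X": (52930.43, 1),
    "WGA": (3744887.28, 1),
    "Center": (383585.43, 2),
    "BWA": (169507.9, 5),
    "RACE": (146904.86, 5),
    "Residuals": (2876257.21, 8363),
}


def variance_explained(rows):
    """Fraction of the total sum of squares attributed to each row."""
    total = sum(ss for ss, _ in rows.values())
    return {name: ss / total for name, (ss, _) in rows.items()}


def f_values(rows, residual_key="Residuals"):
    """F statistic per factor: factor mean square / residual mean square."""
    res_ss, res_df = rows[residual_key]
    residual_ms = res_ss / res_df
    return {
        name: (ss / df) / residual_ms
        for name, (ss, df) in rows.items()
        if name != residual_key
    }


fractions = variance_explained(LOF_INDEL)
f_stats = f_values(LOF_INDEL)
for name in LOF_INDEL:
    f_text = f"{f_stats[name]:.2f}" if name in f_stats else "-"
    print(f"{name:<10} fraction={fractions[name]:.4f} F={f_text}")
```

With these inputs the WGA fraction comes out at about 0.508, in line with the table, which reports the column as a fraction rather than a percentage.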
Fig. 3 Characteristics of variant calls in WGA samples. a Individual LOF indel burden vs. individual LOF SNV burden. Color indicates WGA status. b Total number of variant calls plotted by WGA status. c Number of genes with 0 read depth across 16,824 genes. d Fraction of insertions and deletions in n = 5654 WGA-enriched and n = 34,880 non-enriched indels. Shading indicates LOF status. e Size in base pairs of WGA-enriched and non-enriched indels. f Density plot showing distribution of insertion and deletion size for WGA-enriched and non-enriched indels. g Individual burden of LOF indels for all indels, homopolymer + indels, indels 15 base pairs or longer, and other indels. Color indicates WGA status. Indel burden calculated using GATK VQSR TS99 filter
Fraction of WGA-enriched and non-enriched indels in three indel categories
| | % Other Indels | % Homopolymer Indels | % Large Indels |
|---|---|---|---|
| WGA-enriched | 47.78 | 27.13 | 25.07 |
| Non-enriched | 83.52 | 9.63 | 6.83 |
Homopolymer indels: indels with a single-base repeat of 4 or more bases directly proximal to the indel; Large indels: indels with 15 or more inserted or deleted bases; Other indels: all indels that do not fit either of the previous criteria
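The three categories defined above can be expressed as a small classifier. This is a hedged Python sketch rather than the authors' implementation: the function names, the flank-based inputs, and the reading of "directly proximal" as "immediately adjacent" are all assumptions. The source also does not say which category wins when an indel is both large and next to a homopolymer, so the precedence used here (homopolymer first) is an assumption as well.

```python
HOMOPOLYMER_MIN_RUN = 4  # "4 or more" single-base repeat
LARGE_INDEL_MIN_BP = 15  # "15 or more inserted or deleted bases"


def leading_run_length(seq):
    """Length of the run of identical bases at the start of seq."""
    run = 0
    for base in seq:
        if base != seq[0]:
            break
        run += 1
    return run


def classify_indel(ref, alt, left_flank, right_flank):
    """Assign an indel to 'homopolymer', 'large', or 'other'.

    ref/alt are allele strings; left_flank/right_flank are the reference
    bases immediately before and after the indel (hypothetical inputs).
    Assumption: the homopolymer check takes precedence over the size check.
    """
    # Reverse the left flank so the bases nearest the indel come first.
    left_run = leading_run_length(left_flank.upper()[::-1])
    right_run = leading_run_length(right_flank.upper())
    if max(left_run, right_run) >= HOMOPOLYMER_MIN_RUN:
        return "homopolymer"
    if abs(len(ref) - len(alt)) >= LARGE_INDEL_MIN_BP:
        return "large"
    return "other"


# A single-base insertion sitting next to a TTTT run.
print(classify_indel("A", "AT", "CCGA", "TTTTG"))
```

Swapping the order of the two checks would move indels that meet both criteria into the large category, so the precedence should be confirmed against the paper's methods before the sketch is used to reproduce the percentages in the table.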
Fig. 4 A comparison of indel filtering strategies. a Individual LOF indel burden for all indel filter methods in order of decreasing stringency. b Percent of variation in individual LOF indel burden explained by technical covariates for each filter method
Metrics of filter stringency and efficacy
| Filter | LOF indel sites | Median LOF indel burden | Fraction discordant indels removed | Fraction concordant indels removed | Indel overlap with ExAC |
|---|---|---|---|---|---|
| VQSR 90 | 6212 | 53 | 0.8667 | 0.4514 | 0.7079 |
| VQSR 95 | 9177 | 59 | 0.8064 | 0.3760 | 0.6776 |
| Hardfilter | 24212 | 91 | 0.3600 | 0.0210 | 0.3527 |
| VQSR 99 | 26134 | 98 | 0.2763 | 0.1100 | 0.5394 |
GATK VQSR 90 is the only filter capable of eliminating the significant association between WGA and LOF indel burden; however, it does so at the cost of over 75% of all LOF indel sites (Additional file 1: Table S10). From this we can conclude that WGA artifactual indels closely resemble true indels, preventing VQSR from selectively removing artifactual indels.
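The cost in LOF indel sites can be approximated from the site counts in the filter table. The sketch below is illustrative Python under one loudly stated assumption: it uses VQSR 99, the least stringent filter listed, as the baseline, whereas the source figure is taken from Additional file 1: Table S10 and may use a different denominator. The names `LOF_INDEL_SITES` and `fraction_sites_lost` are hypothetical.

```python
# LOF indel site counts copied from the filter table.
LOF_INDEL_SITES = {
    "VQSR 90": 6212,
    "VQSR 95": 9177,
    "Hardfilter": 24212,
    "VQSR 99": 26134,
}


def fraction_sites_lost(sites, baseline="VQSR 99"):
    """Fraction of baseline LOF indel sites missing under each filter.

    Assumption: the least stringent filter in the table is the baseline.
    """
    base = sites[baseline]
    return {name: 1 - count / base for name, count in sites.items()}


lost = fraction_sites_lost(LOF_INDEL_SITES)
for name, value in lost.items():
    print(f"{name:<10} loses {value:.1%} of baseline sites")
```

Against this baseline, VQSR 90 loses roughly 76% of LOF indel sites, which is consistent with the "over 75%" stated in the text.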
Fig. 5 Association testing between germline LOF variant burden and cancer type. a Quantile-quantile plots from logistic regression association testing between germline LOF burden and ovarian cancer for three indel filter methods. n = number of genes tested. Red line indicates significance cutoff and red points indicate associations significant at p < 1.61 × 10⁻⁷. BRCA1/2 associations highlighted. b Number of significant cancer type-gene associations in each cancer type for three indel filter methods. Color indicates cancer types with WGA samples