Dorota H Sendorek, Cristian Caloian, Kyle Ellrott, J Christopher Bare, Takafumi N Yamaguchi, Adam D Ewing, Kathleen E Houlahan, Thea C Norman, Adam A Margolin, Joshua M Stuart, Paul C Boutros.
Abstract
BACKGROUND: The clinical sequencing of cancer genomes to personalize therapy is becoming routine across the world. However, concerns over patient re-identification from these data lead to questions about how tightly access should be controlled. Re-identifying patients from somatic variant data is not thought to be possible. However, somatic variant detection pipelines can mistakenly identify germline variants as somatic ones, a process called "germline leakage". The rate of germline leakage across different somatic variant detection pipelines is not well understood, and it is uncertain whether somatic variant calls should be considered re-identifiable. To fill this gap, we quantified germline leakage across 259 sets of whole-genome somatic single nucleotide variant (SNV) predictions made by 21 teams as part of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.
Keywords: Cancer genomics; Germline contamination; Germline leakage; Mutation calling; Next-generation sequencing; Patient identifiability; SNV; Single nucleotide variant
Year: 2018 PMID: 29385983 PMCID: PMC5793408 DOI: 10.1186/s12859-018-2046-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 GermlineFilter workflow for the SMC Challenge. Locally, tumour-normal BAM files are submitted to a germline caller (e.g. GATK) to create a germline SNP call VCF file, which is then hashed and encrypted. The encrypted, hashed germline calls can now be moved to any server and used to filter for germline leakage in somatic SNV call VCF files. The output is the germline count found in the somatic calls. To quantify germline leakage using the Challenge submissions, the germline variant VCF file was created by the Challenge administrators "in-house" on a private server. The somatic SNV prediction VCF files were provided by the teams participating in the Challenge.
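The hash-and-compare step in this workflow can be sketched in a few lines of Python. This is a minimal illustration, not the GermlineFilter implementation: the function names, the `chrom:pos:ref:alt` key format, and the use of keyed SHA-256 (HMAC) are all assumptions made here, and the encryption step is omitted entirely.

```python
import hashlib
import hmac


def variant_keys(vcf_lines):
    """Yield one 'chrom:pos:ref:alt' key per ALT allele from VCF text lines."""
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            yield f"{chrom}:{pos}:{ref}:{allele}"


def hash_key(key, secret):
    """Keyed hash of a variant key (HMAC-SHA-256 is an assumption of this sketch)."""
    return hmac.new(secret, key.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_germline(vcf_lines, secret):
    """Done locally: turn germline calls into a set of opaque hashes."""
    return {hash_key(k, secret) for k in variant_keys(vcf_lines)}


def count_germline_leakage(somatic_vcf_lines, germline_hashes, secret):
    """Done on any server: count somatic calls whose hash matches a germline hash."""
    somatic = set(variant_keys(somatic_vcf_lines))
    return sum(hash_key(k, secret) in germline_hashes for k in somatic)


# Toy data (hypothetical variants, for illustration only).
germline_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\tPASS\t.",
    "2\t2000\t.\tC\tT,G\t.\tPASS\t.",
]
somatic_vcf = [
    "1\t1000\t.\tA\tG\t.\tPASS\t.",  # matches a germline call
    "2\t2000\t.\tC\tG\t.\tPASS\t.",  # matches the second ALT allele
    "3\t3000\t.\tG\tA\t.\tPASS\t.",  # no germline match
]

secret = b"example-secret"  # hypothetical key
germline_hashes = hash_germline(germline_vcf, secret)
leak_count = count_germline_leakage(somatic_vcf, germline_hashes, secret)
print(leak_count)
```

Only the hashes leave the local machine in this sketch, so the server that counts matches never sees the germline genotypes in the clear. A keyed hash is used here because the space of possible SNPs is small enough that an unkeyed hash could be reversed by enumeration.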
Fig. 2 Assessment of somatic SNV prediction accuracy against germline leakage. (a) F1-scores for each submission are plotted against the germline count (as determined by GermlineFilter). Submissions for different tumours are colour-coded (IS1 = orange, IS2 = green, IS3 = purple). The grey area represents 30–80 counts: the minimum number of independent SNPs required to correctly identify a subject, according to Lin et al. [15]. (b) Proportions of germline calls as found in total submission calls (upper panel) and in false positive submission calls (lower panel) per tumour. The horizontal red lines indicate the 30 count mark (the lower bound of the 30–80 SNP range mentioned above).
Fig. 3 Germline leakage across all tumours (IS1, IS2, IS3) and SNV-calling algorithms. Teams are consistently colour-coded across multiple tumours. Barplots show F1-scores from each team's top-scoring submission. Leaked variants are displayed below with their corresponding chromosomes. Variant bars that overlap horizontally represent recurrent germline leaks.