James X Sun1, Yuting He1, Eric Sanford1, Meagan Montesion1, Garrett M Frampton1, Stéphane Vignot2,3, Jean-Charles Soria2, Jeffrey S Ross1,4, Vincent A Miller1, Phil J Stephens1, Doron Lipson1, Roman Yelensky1.
Abstract
A key constraint in genomic testing in oncology is that matched normal specimens are not commonly obtained in clinical practice. Thus, while well-characterized genomic alterations do not require normal tissue for interpretation, for a significant number of alterations it will be unknown whether they are germline or somatic in the absence of a matched normal control. We introduce SGZ (somatic-germline-zygosity), a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing (MPS) of cancer specimens. The method does not require a patient-matched normal control, enabling broad application in clinical research. SGZ predicts the somatic vs. germline status of each identified alteration by modeling the alteration's allele frequency (AF), taking into account the tumor content, tumor ploidy, and local copy number. Prediction accuracy depends on sequencing depth and copy number model fit, which are achieved in our clinical assay by sequencing to high depth (>500x) using MPS, covering 394 cancer-related genes and over 3,500 genome-wide single nucleotide polymorphisms (SNPs). Calls are made using a statistic based on read depth and local variability of SNP AF. To validate the method, we first evaluated performance on samples from 30 lung and colon cancer patients, for whom we sequenced tumors and matched normal tissue. We examined predictions for 17 somatic hotspot mutations and 20 common germline SNPs in 20,182 clinical cancer specimens. To assess the impact of stromal admixture, we examined three cell lines, which were titrated with their matched normal to six levels (10-75%). Overall, predictions were made in 85% of cases, with 95-99% of variants predicted correctly, significantly outperforming a basic approach based on AF alone.
We then applied the SGZ method to the COSMIC database of known somatic variants in cancer and found >50 that are in fact more likely to be germline.
Year: 2018 PMID: 29415044 PMCID: PMC5832436 DOI: 10.1371/journal.pcbi.1005965
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. SGZ method overview.
Panel A gives an overview of the SGZ pipeline. Key components include fitting an optimal copy number model to the genome‐wide log‐ratio and minor allele frequency profiles (B), and modeling the expected allele frequencies of germline, somatic, and subclonal somatic mutations (C). In panel B, the dots in the top panel correspond to log ratios at each exon sequenced, segmented and fitted to discrete copy number levels, while the dots in the bottom panel are germline SNP minor allele frequencies. In panel C, examples of expected variant allele frequencies are shown for various scenarios of copy number and tumor purity. The expected allele frequencies are shown for germline (blue), somatic (red), and subclonal somatic (yellow).
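The expected allele frequencies described for panel C can be illustrated with a standard tumor-normal mixture calculation. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the parameters `purity`, `copy_number`, `variant_copies`, and the `clonal_fraction` treatment of subclonal mutations are introduced here, normal cells are assumed diploid, and germline variants are assumed heterozygous in normal tissue.

```python
def expected_af(purity, copy_number, variant_copies, germline):
    """Expected variant allele frequency in a tumor/normal mixture.

    purity          -- fraction of tumor cells in the specimen (0..1)
    copy_number     -- total tumor copies at the locus (assumed known)
    variant_copies  -- tumor copies carrying the variant
    germline        -- True if normal cells also carry one variant copy
    """
    # Assumption: normal cells are diploid; a germline variant is heterozygous there.
    normal_variant_copies = 1 if germline else 0
    variant_dna = purity * variant_copies + (1 - purity) * normal_variant_copies
    total_dna = purity * copy_number + (1 - purity) * 2
    return variant_dna / total_dna


def expected_subclonal_af(purity, copy_number, clonal_fraction):
    """Somatic variant on one copy, present in only a fraction of tumor cells.

    clonal_fraction is a hypothetical parameter used here for illustration.
    """
    total_dna = purity * copy_number + (1 - purity) * 2
    return purity * clonal_fraction / total_dna


if __name__ == "__main__":
    # Diploid locus, one variant copy, at several purities.
    for purity in (0.3, 0.5, 0.8):
        g = expected_af(purity, 2, 1, germline=True)
        s = expected_af(purity, 2, 1, germline=False)
        sub = expected_subclonal_af(purity, 2, clonal_fraction=0.5)
        print(f"purity={purity:.1f} germline={g:.3f} somatic={s:.3f} subclonal={sub:.3f}")
```

Under these assumptions, a heterozygous germline variant at a diploid locus stays at 0.50 regardless of purity, while a clonal somatic variant on one copy sits at half the purity (0.25 at 50% purity), which is the separation the AF model exploits.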
Fig 2. Copy number detection overview.
Aligned DNA sequences of the tumor specimen are normalized against a process‐matched normal, producing log‐ratio and minor allele frequency (MAF) data. Next, whole‐genome segmentation is performed using a circular binary segmentation (CBS) algorithm on the log‐ratio data. Then, a copy number model fitted by Gibbs sampling and a grid‐based model are fit to the segmented log‐ratio and MAF data, producing genome‐wide copy number estimates. Finally, the degrees of fit of the candidate models returned by Gibbs sampling and grid sampling are compared, and the optimal model is selected by an automated heuristic.
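The grid-based fitting step can be sketched as a search over candidate purity and ploidy values, scoring each candidate by how well integer copy number states explain the segmented log-ratio and MAF data. This is a minimal least-squares sketch under assumptions, not the authors' model: segmentation (CBS) and the Gibbs sampler are not implemented, the squared-error score replaces whatever likelihood the method actually uses, and all function names, the `max_copies` limit, and the grid values are hypothetical.

```python
import math


def expected_log_ratio(purity, ploidy, copy_number):
    """log2 coverage of a segment relative to the sample-wide average."""
    segment_dna = purity * copy_number + 2 * (1 - purity)
    average_dna = purity * ploidy + 2 * (1 - purity)
    return math.log2(segment_dna / average_dna)


def expected_snp_maf(purity, copy_number, minor_copies):
    """Expected minor allele frequency of a heterozygous germline SNP."""
    return (purity * minor_copies + (1 - purity)) / (
        purity * copy_number + 2 * (1 - purity)
    )


def segment_error(purity, ploidy, log_ratio, maf, max_copies=8):
    """Smallest squared error over integer (total, minor) copy states."""
    best = float("inf")
    for total in range(max_copies + 1):
        for minor in range(total // 2 + 1):
            err = (expected_log_ratio(purity, ploidy, total) - log_ratio) ** 2
            err += (expected_snp_maf(purity, total, minor) - maf) ** 2
            best = min(best, err)
    return best


def grid_fit(segments, purities, ploidies):
    """Return (error, purity, ploidy) minimizing total error over the grid.

    segments is a list of (log_ratio, maf) pairs, one per segment.
    Purity values must be below 1.0 so that a zero-copy state stays finite.
    """
    best = None
    for purity in purities:
        for ploidy in ploidies:
            total = sum(segment_error(purity, ploidy, lr, maf) for lr, maf in segments)
            if best is None or total < best[0]:
                best = (total, purity, ploidy)
    return best


if __name__ == "__main__":
    # Simulated, noise-free segments at purity 0.6 and ploidy 2.0 (illustrative only).
    states = [(2, 1), (3, 1), (1, 0), (4, 2)]
    segments = [
        (expected_log_ratio(0.6, 2.0, c), expected_snp_maf(0.6, c, m)) for c, m in states
    ]
    purities = [i / 20 for i in range(2, 20)]
    ploidies = [1.5 + 0.25 * k for k in range(9)]
    print(grid_fit(segments, purities, ploidies))
```

Purity and ploidy are not always identifiable from such data, since several combinations can fit equally well, which is one plausible reason to compare candidate models from more than one fitting procedure before selecting one.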
Validation of somatic and germline predictions.
| Validation study | Call rate | Somatic variants predicted correctly | Germline variants predicted correctly |
|---|---|---|---|
| All variants in 30 lung & colon samples with matched-normal as gold standard ( | 100% (568/568) | 67% (255/380) | 87% (164/188) |
| All variants in 30 lung & colon samples with matched-normal as gold standard ( | 85% (480/568) | 95% (312/327) | 99% (151/153) |
| All variants in 3 cell lines with varying proportions of tumor-normal admixture ( | 100% (215/216) | 92% (83/90) | 41% (51/125) |
| All variants in 3 cell lines with varying proportions of tumor-normal admixture ( | 83% (184/222) | 97% (60/62) | 97% (118/122) |
| 17 somatic hotspot mutations and 20 common germline variants in 20,182 clinical samples ( | 100% (12506/12506) | 95% (7213/7560) | 51% (2537/4946) |
| 17 somatic hotspot mutations and 20 common germline variants in 20,182 clinical samples ( | 84% (9829/11646) | 96% (5325/5540) | 97% (4172/4289) |
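Each cell in the table pairs a percentage with the counts behind it. The short sketch below recomputes the percentages from counts copied from the table; the reading of the columns (call rate as variants with a call over all variants, accuracy as correct predictions over called variants of that class) and the helper names are assumptions made for illustration.

```python
def percent(numerator, denominator):
    """Percentage of numerator over denominator."""
    return 100.0 * numerator / denominator


def summarize(called, total, somatic_correct, somatic_called,
              germline_correct, germline_called):
    """Recompute the three reported percentages from raw counts."""
    return {
        "call_rate": percent(called, total),
        "somatic_accuracy": percent(somatic_correct, somatic_called),
        "germline_accuracy": percent(germline_correct, germline_called),
    }


if __name__ == "__main__":
    # Counts from the lung & colon row with an 85% call rate.
    row = summarize(480, 568, 312, 327, 151, 153)
    for name, value in row.items():
        print(f"{name}: {value:.1f}%")
```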
Fig 3. Breakdown of no-calls made by SGZ.
Reasons behind no-calls made by SGZ are shown for (left) all variants in 30 lung and colon samples and (right) 17 somatic hotspot mutations and 20 common germline variants within 20,182 clinical samples.
SGZ performance as a function of tumor purity in the cell line dataset.
| Tumor Purity | 10% | 20% | 30% | 40% | 50% | 70% |
|---|---|---|---|---|---|---|
| Call Rate | 0.94 | 0.89 | 0.83 | 0.78 | 0.75 | 0.80 |
| Germline Accuracy | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 0.88 |
| Somatic Accuracy | 1.00 | 1.00 | 0.92 | 1.00 | 0.90 | 1.00 |
Tumor zygosity predictions of somatic mutations in 20,182 clinical samples.
| Gene | Amino acid affected | Gene type | Samples with mutation | Mutations with LOH | LOH enrichment ratio |
|---|---|---|---|---|---|
| | V600 | Oncogene | 279 | 6.8% | 0.61 |
| | L858R | Oncogene | 116 | 4.3% | 0.63 |
| | R132H | Oncogene | 131 | 0.8% | 0.06 |
| | G12 | Oncogene | 1444 | 16.6% | 1.21 |
| | Q61 | Oncogene | 198 | 13.1% | 0.68 |
| | H1047 | Oncogene | 347 | 11.5% | 0.86 |
| | All substitutions | Suppressor | 308 | 81.8% | 3.54 |
| | All substitutions | Suppressor | 307 | 90.6% | 2.75 |
| | All substitutions | Suppressor | 4666 | 91.8% | 3.74 |
1 The enrichment ratio is computed with respect to the background LOH percentage, which is measured in non-mutated samples at the genomic locations in each gene.
†Includes all missense mutations of the codon.
‡All missense and nonsense substitutions of confirmed somatic status in COSMIC or consensus splice site variants. Samples with compound heterozygous mutations in a gene are excluded as they are not expected to be under LOH.
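The enrichment ratio defined in the footnote is the LOH percentage among mutated samples divided by the background LOH percentage. The sketch below shows that arithmetic and, as an illustration only, back-calculates the background percentage implied by a table row; the background values themselves are not reported in this section, so the back-calculated figures are approximate and the function names are hypothetical.

```python
def loh_enrichment(mutated_loh_pct, background_loh_pct):
    """Ratio of LOH percentage in mutated samples to background LOH percentage."""
    return mutated_loh_pct / background_loh_pct


def implied_background(mutated_loh_pct, enrichment_ratio):
    """Background LOH percentage implied by a reported row (rounding error applies)."""
    return mutated_loh_pct / enrichment_ratio


if __name__ == "__main__":
    # (mutated LOH %, enrichment ratio) pairs copied from the table rows.
    rows = {"V600": (6.8, 0.61), "G12": (16.6, 1.21)}
    for label, (loh_pct, ratio) in rows.items():
        background = implied_background(loh_pct, ratio)
        print(f"{label}: implied background LOH ~ {background:.1f}%")
```

A ratio above 1 means LOH is more frequent in mutated samples than in non-mutated samples at the same locus, as in the suppressor rows; a ratio below 1 means it is less frequent, as in most oncogene rows.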
Likely mis-annotations of somatic status in COSMIC: variants predicted by SGZ to be germline in multiple samples in the Foundation Medicine sample set.
| Gene | Protein change | Status in COSMIC v62 | Entries in COSMIC | dbSNP ID | Common SNP in 1000 Genomes | P-value |
|---|---|---|---|---|---|---|
| | P925T | Confirmed somatic | 1 | rs148884710 | No | 8.0E-235 |
| | P25L | Confirmed somatic | 1 | rs35460768 | No | 3.0E-191 |
| | V32G | Confirmed somatic | 1 | rs56048668 | No | 3.4E-181 |
| | I1307K | Confirmed somatic | 1 | rs1801155 | No | 1.5E-159 |
| | Y791F | Confirmed somatic | 1 | rs77724903 | No | 6.4E-124 |
| | V509A | Confirmed somatic | 1 | rs63751005 | No | 3.0E-84 |
| | L3614P | Confirmed somatic | 1 | rs146191865 | Yes | 7.6E-71 |
| | T244I | Confirmed somatic | 2 | rs6897932 | Yes | 2.3E-60 |
| | S893L | Confirmed somatic | 4 | rs142047649 | No | 5.7E-47 |
| | S978P | Unknown | 1 | rs139552233 | No | 2.0E-45 |
†The listed mutations have “confirmed somatic” status in COSMIC but are likely mis-annotated: the number of references supporting that status is low, while SGZ predicted these variants to be germline in multiple samples. Furthermore, although the mutations are not necessarily common SNPs, each has a dbSNP entry, which further supports germline status.
‡Probability of being somatic, given multiple SGZ predictions for each variant.