Zhi Yang1, Priyatama Pandey1, Darryl Shibata2, David V Conti1, Paul Marjoram1, Kimberly D Siegmund1.
Abstract
We propose a hierarchical latent Dirichlet allocation model (HiLDA) for characterizing somatic mutation data in cancer. The method allows us to infer mutational patterns and their relative frequencies in a set of tumor mutational catalogs and to compare the estimated frequencies between tumor sets. We apply our method to two datasets, one containing somatic mutations in colon cancer by the time of occurrence, before or after tumor initiation, and the second containing somatic mutations in esophageal cancer by sex, age, smoking status, and tumor site. In colon cancer, the relative frequencies of mutational patterns were found significantly associated with the time of occurrence of mutations. In esophageal cancer, the relative frequencies were significantly associated with the tumor site. Our novel method provides higher statistical power for detecting differences in mutational signatures.
Keywords: Colorectal cancer; Deconvolution; Latent Dirichlet allocation; Mutational signatures; Somatic mutation
Year: 2019 PMID: 31523512 PMCID: PMC6717498 DOI: 10.7717/peerj.7557
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
List of notation.
| Total number of mutational catalogs (indexed by | |
| Number of observed mutations in | |
| Number of features to include. Here, we use the nucleotide substitution, flanking bases and transcription strand (indexed by | |
| Vector of the maximum numbers of possible values, ( | |
| Total number of mutational signatures (indexed by | |
| Observed mutation characteristic vector, ( | |
| Index of the latent assignment for | |
| Probability vector of signature | |
| Probability vector of observing any of | |
| A tuple of probability vectors with length | |
| A vector indicating group membership of the samples. ( | |
| A tuple of concentration parameters of a Dirichlet distribution with length | |
| A tuple of expected values of |
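The quantities listed above fit the usual generative reading of a latent Dirichlet allocation model with group-specific concentration parameters. The following is a minimal sketch under that assumption, not the authors' implementation: each catalog draws a signature probability vector from a Dirichlet distribution chosen by its group, each mutation draws a latent signature, and each feature value is drawn independently from that signature's per-feature probability vector. All function and argument names (`sample_dirichlet`, `sample_catalog`, `simulate_catalogs`, `alpha_by_group`) are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_catalog(n_mutations, exposure, signatures, rng):
    """Simulate one mutational catalog.

    exposure      -- probability vector over signatures for this catalog
    signatures    -- signatures[k][l] is the probability vector of feature l
                     under signature k (assumed independent across features)
    """
    catalog = []
    for _ in range(n_mutations):
        # Latent assignment: which signature generated this mutation.
        k = rng.choices(range(len(exposure)), weights=exposure)[0]
        # One value per feature, drawn from that signature's distributions.
        mutation = tuple(
            rng.choices(range(len(f_kl)), weights=f_kl)[0]
            for f_kl in signatures[k]
        )
        catalog.append(mutation)
    return catalog


def simulate_catalogs(groups, n_mutations, alpha_by_group, signatures, seed=0):
    """Simulate one catalog per sample, using the sample's group to pick alpha."""
    rng = random.Random(seed)
    catalogs = []
    for g, n in zip(groups, n_mutations):
        exposure = sample_dirichlet(alpha_by_group[g], rng)
        catalogs.append(sample_catalog(n, exposure, signatures, rng))
    return catalogs
```

In this sketch a difference between groups enters only through `alpha_by_group`, which is what makes a comparison of expected signature frequencies between tumor sets meaningful.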
Comparing mutational exposures from two sets of mutational catalogs, Side A and Side B, in the USC data.
| Tests | Side A–Side B | HiLDA-CI [95% C.I.] | HiLDA-Wald | TS-Wilcoxon |
|---|---|---|---|---|
| Δ1 | 0.002 | [−0.079, 0.083] | 0.986 | 0.780 |
| Δ2 | 0.000 | [−0.029, 0.029] | 0.988 | 0.897 |
| Δ3 | −0.002 | [−0.083, 0.086] | 0.961 | 0.985 |
| Bayes Factor | | | | |
Notes.
Δk, the difference in the mean exposure of signature k between groups 1 and 2.
95% credible interval from the posterior distribution.
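As an illustration of how a 95% credible interval can be read off a posterior distribution, the sketch below takes a list of posterior draws of a difference in mean exposure and returns the equal-tailed interval, then checks whether that interval excludes zero. The helper names `credible_interval` and `excludes_zero` are hypothetical, and the equal-tailed percentile construction is an assumption; the source does not state which interval type was used.

```python
def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws (linear interpolation)."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0

    def quantile(p):
        # Position of the p-th quantile on the 0..n-1 index scale.
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    return quantile(tail), quantile(1.0 - tail)


def excludes_zero(interval):
    """True when the whole interval lies strictly on one side of zero."""
    lower, upper = interval
    return lower > 0.0 or upper < 0.0
```

Applied to the intervals reported in the tables, `excludes_zero((-0.079, 0.083))` is false, while an interval such as `(0.035, 0.099)` lies entirely above zero.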
Figure 1. The numbers of somatic mutations in 32 mutational catalogs obtained from 16 colon cancer patients in the USC data and their mutation spectra.
(A) The number of somatic mutations in 16 tumors, each of which contributes two mutational catalogs denoted as trunk (dark blue) and branch (light blue). (B) The percentage bar plot of relative frequencies for six substitution types in 16 trunk mutational catalogs. (C) The percentage bar plot of relative frequencies for six substitution types in 16 branch mutational catalogs.
Figure 2. Mutational exposures and three mutational signatures from the analysis of 16 trunk mutational catalogs and 16 branch mutational catalogs in the USC data (16 colon cancer patients).
(A) Barplot of the somatic mutation counts, by signature type, sorted in descending order of the total number of mutations. Each grouped pair contains the trunk mutations and the branch mutations. The y-axis shows the total number of mutations. (B) Barplot of the somatic mutation counts, again by signature type and sorted in descending order of the total number of mutations. Again, each grouped pair contains the trunk mutations and the branch mutations, but now the y-axis is rescaled to show proportions rather than total mutation count. (C) The same data as in Fig. 2B, but now separated into trunk and branch mutations. Within each group the plots are sorted by the exposure frequency of the first signature (yellow). (D) The yellow mutational signature with four flanking bases. (E) The orange mutational signature with four flanking bases. (F) The red mutational signature with four flanking bases. (G) The distributions of mutational exposures of the three mutational signatures by group, where the branch mutational catalogs are highlighted in pink and the trunk ones in blue.
Comparing mutational exposures in colorectal cancer from two sets of mutational catalogs, trunk and branch, in the USC data.
| Tests | Branch-Trunk | HiLDA-CI [95% C.I.] | HiLDA-Wald | TS-Wilcoxon |
|---|---|---|---|---|
| Δ1 | −0.210 | [−0.295, −0.127] | <0.0001 | 0.0002 |
| Δ2 | 0.064 | [0.035, 0.099] | 0.0001 | 0.0075 |
| Δ3 | 0.146 | [0.056, 0.231] | 0.0011 | <0.0001 |
| Bayes Factor | | | | |
Notes.
Δk, the difference in the mean exposure of signature k between groups 1 and 2.
95% credible interval from the posterior distribution.
Figure 3. Estimated mutational exposures and posterior distributions of mean differences in mutational exposures from the analysis of the EAC data (146 esophageal adenocarcinoma patients).
(A) Barplot of mean mutational exposures of three signatures by sex, age groups, smoking status, and tumor sites derived from pmsignature. The significance level of the TS approach is denoted by asterisks (**, <0.005; *, <0.05). The mutational exposures do not sum to one since the frequency of remaining mutations (those not assigned to these three signatures) is not displayed. (B) 95% credible intervals of mean differences in mutational exposures of four signatures derived from HiLDA-CI, with the significance level of the HiLDA-Wald test (**, <0.005; *, <0.05). The differences in mean exposures from HiLDA can differ from those estimated by pmsignature due to the covariate distribution in the hierarchical model.
The false positive rates (FPR), true positive rates (TPR), and updated true positive rates of both the two-stage method and HiLDA.
The false positive rates (n = 1,000) and true positive rates (n = 200) of both the two-stage method and HiLDA when applied to the simulated data.
| Rate | Method | Δ1 | Δ2 | Δ3 |
|---|---|---|---|---|
| FPR | HiLDA-CI | 4.8% | 5.0% | 5.1% |
| FPR | HiLDA-Wald | 5.1% | 3.7% | 5.4% |
| FPR | TS-Wilcoxon | 4.3% | 5.2% | 4.3% |
| TPR | HiLDA-CI | 99.5% | 85.5% | 91.5% |
| TPR | HiLDA-Wald | 99.5% | 80.5% | 92.5% |
| TPR | TS-Wilcoxon | 99.0% | 77.5% | 88.0% |
Notes.
Percentage of 95% credible intervals that exclude zero.
Percentage of P-values <0.05 after applying the Wald test to the posterior distribution.
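One way to read the second note is sketched below. It assumes the Wald statistic is the posterior mean of the difference divided by its posterior standard deviation, compared against a standard normal distribution; the source does not spell out the exact form, so treat this as an illustration only. The names `wald_p_value` and `rejection_rate` are hypothetical. The rejection rate over simulated datasets corresponds to a false positive rate when no true difference exists and to a true positive rate when one does.

```python
from statistics import NormalDist, fmean, stdev


def wald_p_value(draws):
    """Two-sided Wald-type p-value from posterior draws of a difference.

    Assumption: z = posterior mean / posterior standard deviation,
    referred to a standard normal distribution.
    """
    z = fmean(draws) / stdev(draws)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated datasets whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)
```

For example, `rejection_rate` applied to p-values from datasets simulated under no group difference would be expected to sit near the nominal 5% level, in line with the false positive rates in the table.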