Damian Wojtowicz, Itay Sason, Xiaoqing Huang, Yoo-Ah Kim, Mark D M Leiserson, Teresa M Przytycka, Roded Sharan.
Abstract
Knowing the activity of the mutational processes shaping a cancer genome may provide insight into tumorigenesis and personalized therapy. It is thus important to characterize the signatures of active mutational processes in patients from their patterns of single base substitutions. However, mutational processes do not act uniformly on the genome, leading to statistical dependencies among neighboring mutations. To account for such dependencies, we develop the first sequence-dependent model, SigMa, for mutation signatures. We apply SigMa to characterize genomic and other factors that influence the activity of mutation signatures in breast cancer. We show that SigMa outperforms previous approaches, revealing novel insights on signature etiology. The source code for SigMa is publicly available at https://github.com/lrgr/sigma.
Keywords: Breast cancer; Hidden Markov model; Mutation signature; Mutational process
Year: 2019 PMID: 31349863 PMCID: PMC6660659 DOI: 10.1186/s13073-019-0659-1
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Overview of the SIGMA model. The input data consists of (a) a set of predefined signatures that form an emission matrix E (here, for simplicity, represented over six mutation types) and (b) a sequence of mutation categories from a single sample and a distance threshold separating sky and cloud mutation segments. c The SIGMA model has two components: (top) a multinomial mixture model (MMM) for isolated sky mutations and (bottom) an extension of a hidden Markov model (HMM) capturing sequential dependencies between close-by cloud mutations; all model parameters are learned from the input data in an unsupervised manner. d SIGMA finds the most likely sequence of signatures that explains the observed mutations in sky and clouds.
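The split into sky and cloud segments described in the caption can be illustrated with a minimal Python sketch. It is not taken from the SigMa source code. It assumes that a cloud is a maximal run of at least two consecutive mutations on one chromosome whose neighboring distances are at most the threshold, and that every other mutation counts as sky. The function name `split_sky_clouds` and the inclusive comparison are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def split_sky_clouds(
    positions: List[int], threshold: int = 2000
) -> Tuple[List[int], List[List[int]]]:
    """Split mutation positions (one sample, one chromosome) into
    isolated 'sky' mutations and 'cloud' runs of close-by mutations.

    Assumption: two neighbors belong to the same run when their
    distance is <= threshold; a run of length 1 is a sky mutation.
    """
    sky: List[int] = []
    clouds: List[List[int]] = []
    run: List[int] = []

    def close_run() -> None:
        # A single-mutation run is isolated (sky); longer runs are clouds.
        if len(run) == 1:
            sky.append(run[0])
        elif len(run) > 1:
            clouds.append(list(run))

    for pos in sorted(positions):
        if run and pos - run[-1] > threshold:
            close_run()
            run = []
        run.append(pos)
    close_run()
    return sky, clouds


sky, clouds = split_sky_clouds([100, 500, 1200, 10000, 50000, 50100])
print(sky)     # isolated mutations
print(clouds)  # runs of close-by mutations
```

Under these assumptions, the sky mutations would then be passed to the mixture model and each cloud to the HMM component as its own sequence.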
Fig. 2 a Comparative assessment of model performance on held-out data for MMM and SIGMA across different distance thresholds. SIGMA at a threshold of 2000 bp shows the best performance by maximizing the log-likelihood (the y-axis has a customized scale with a scale break). b Comparison of fraction of signature 1 mutations found in CpG islands in sky and clouds. Both NMF and SIGMA show significant depletion of signature 1 in CpG islands with respect to randomized data, with SIGMA exhibiting more pronounced depletions, particularly in clouds. We performed 1000 permutations of signature assignments preserving mutation trinucleotide context within each sample. We used a one-sided Wilcoxon signed-rank test to compare the observed and randomized numbers of signature 1 in CpG islands. c Spearman correlation comparison of APOBEC3A/B expression with signature 2 and 13 activities across samples. For signature 2, the mutation counts in clouds with SIGMA are positively correlated with APOBEC3A/B expression while the NMF-based counts have zero or negative correlation in both sky and clouds. Signature 13 mutation counts are positively correlated in both models. In b and c, the significance level was categorized as * P value (P) < 0.05; ** P < 5×10⁻³; *** P < 5×10⁻⁵. All bar plots show mean values with standard error of the mean (small black bars) from 31 random initializations of MMM and SIGMA models.
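The randomization in panel b can be sketched in Python as follows. This is one possible reading of "preserving mutation trinucleotide context within each sample", not the authors' code: signature labels are shuffled only among mutations of the same sample that share a mutation category, so each category keeps its original set of labels. The function name `permute_within_context` and the string encoding of the categories are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def permute_within_context(
    contexts: Sequence[Hashable],
    signatures: Sequence[int],
    rng: random.Random,
) -> List[int]:
    """Shuffle signature labels among mutations of one sample that
    share the same trinucleotide mutation category."""
    # Group mutation indices by their mutation category.
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, ctx in enumerate(contexts):
        groups[ctx].append(idx)

    permuted = list(signatures)
    for idxs in groups.values():
        labels = [signatures[i] for i in idxs]
        rng.shuffle(labels)  # permute labels inside this category only
        for i, label in zip(idxs, labels):
            permuted[i] = label
    return permuted


# Hypothetical sample: five mutations in two categories.
contexts = ["ACA>T", "ACA>T", "TCG>A", "TCG>A", "TCG>A"]
signatures = [1, 2, 3, 3, 13]
print(permute_within_context(contexts, signatures, random.Random(0)))
```

Repeating this 1000 times per sample and counting signature 1 mutations inside CpG islands for each permutation would yield the randomized counts against which the observed counts are compared.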
Fig. 3 a Distribution of distance between consecutive mutations in clouds of various sizes (number of mutations in a cloud). b Difference between NMF and SIGMA in mutation signatures assigned to mutations is higher for cloud mutations. c Comparison of exposure to mutation signatures in sky and cloud regions based on SIGMA signature assignments. d Frequency distribution of the 12 mutation signatures (assigned by SIGMA) over replication time. The red line is the distribution over replication time from early to late for mutations in clouds. The blue line is the distribution of trends for sky mutations downsampled to the number of mutations found in clouds. The sampling was repeated 1000 times, and the 95% confidence intervals of the downsampled sky mutation frequencies are shown. All results show mean values with standard error of the mean (small vertical bars) from 31 random initializations of SIGMA.
Fig. 4 Enrichment of transition frequencies between mutation signatures in sequence-dependent cloud segments across all samples. a Enrichment represented as Pearson residuals between observed and expected signature frequencies shows a strong enrichment of self-transitions. b Enrichment computed in the same way but ignoring self-transitions to correctly estimate the enrichment of transitions between different signatures while accounting for the enrichment for self-transitions. Mean values of enrichment from random initializations of SIGMA are shown.
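For readers unfamiliar with Pearson residuals, the quantity in panel a can be sketched in Python. The sketch uses the standard definition (observed minus expected, divided by the square root of expected) and assumes that the expected count for a transition from signature i to signature j is obtained under independence from the row and column totals of the transition count matrix. How the authors derived their expected frequencies, and how panel b excludes self-transitions, is not stated in the caption, so this illustrates the general technique only. The function name `pearson_residuals` is hypothetical.

```python
import math
from typing import List


def pearson_residuals(counts: List[List[float]]) -> List[List[float]]:
    """Pearson residuals (O - E) / sqrt(E) for a square matrix of
    transition counts, with E taken from the independence model
    E[i][j] = row_total[i] * col_total[j] / grand_total."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]

    residuals: List[List[float]] = []
    for i, row in enumerate(counts):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            # Guard against empty rows or columns (expected == 0).
            out_row.append(
                (observed - expected) / math.sqrt(expected) if expected > 0 else 0.0
            )
        residuals.append(out_row)
    return residuals


# Hypothetical counts for two signatures with frequent self-transitions.
transitions = [[30, 10], [10, 30]]
for row in pearson_residuals(transitions):
    print([round(x, 3) for x in row])
```

Positive residuals on the diagonal correspond to enriched self-transitions, and negative off-diagonal residuals to depleted transitions between different signatures.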
Fig. 5 Spearman correlation coefficients between demographic or clinical features and mutations attributed to each signature in sky and cloud regions. Only significant correlations with a p-value cutoff of 0.001 are shown. Barplots show mean correlations with standard error of the mean (small black bars) from 31 random initializations of SIGMA.