Nikta Feizi, Qian Liu, Leigh Murphy, Pingzhao Hu.
Abstract
In-silico classification of the pathogenic status of somatic variants has shown promise in promoting the clinical utilization of genetic tests. The majority of available classification tools are designed around the characteristics of germline variants or a combination of germline and somatic variants. The significance of somatic variants in cancer initiation and progression calls for classifiers that are trained on cancer somatic variants and specialized for classifying their pathogenic status. We established a gold standard exclusively for cancer somatic single nucleotide variants (SNVs) collected from the Catalogue of Somatic Mutations in Cancer (COSMIC). We developed two support vector machine (SVM) classifiers based on genomic features of cancer somatic SNVs located in coding and non-coding regions of the genome, respectively. The SVM classifiers achieved areas under the ROC curve of 0.94 and 0.89 for classifying the pathogenic status of coding and non-coding cancer somatic SNVs, respectively. Our models outperform two well-known classification tools, FATHMM-XF and CScape, in classifying both coding and non-coding cancer somatic variants. Furthermore, we applied our models to predict the pathogenic status of somatic variants identified in young breast cancer patients from the METABRIC and TCGA-BRCA studies. Using a classification threshold of 0.8, our "coding" model predicted 1,853 positive SNVs (out of 6,910) from the TCGA-BRCA dataset and 500 positive SNVs (out of 1,882) from the METABRIC dataset. Through comparative survival analysis of the positive predictions from our models, we identified a young-specific pathogenic somatic variant with potential for the prognosis of early-onset breast cancer in young women.
Keywords: breast cancer; computational classification; pathogenic status; somatic variants; survival analysis
Year: 2022 PMID: 35116056 PMCID: PMC8804317 DOI: 10.3389/fgene.2021.805656
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. A flowchart overview of the steps of the study: modelling and application of computational algorithms based on supervised classification methods.
Four major genomic feature groups characterizing the SNVs in gold standard datasets.
| Feature group | Description | Example |
|---|---|---|
| Structural and genomic context features | Characterize sequence attributes of the mutation location. These features estimate the disruption to the mutation's surrounding sequence in both coding and non-coding regions | Percentage of GC in a ±75 bp window |
| Epigenetic features | Describe epigenetic changes such as histone modifications and methylation alterations | Maximum H3K4 methylation level from ENCODE |
| Genomic distance features | Measure the distance between a given SNV and critical functional and structural genomic elements such as transcription start and end sites | Minimum distance to Transcribed Sequence Start (TSS) |
| Genomic conservation features | Measure the evolutionary conservation at the mutation alignment sites, helping the trained models learn the relationship between these measurements and the pathogenicity of the SNVs | Scores from PhastCons and PhyloP |
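As an illustration of the structural feature group, the "percentage of GC in a ±75 bp window" example can be sketched as follows. This is a minimal sketch, not the authors' feature pipeline: the function name `gc_percent`, the 0-based indexing, the clipping at sequence ends, and the handling of ambiguous bases are all assumptions.

```python
def gc_percent(sequence: str, position: int, flank: int = 75) -> float:
    """Percentage of G/C bases in a window of +/- `flank` bp around `position`.

    Assumptions (not from the source): `position` is a 0-based index into
    `sequence`, the window is clipped at the sequence ends, and bases other
    than A/C/G/T (for example N) are ignored.
    """
    start = max(0, position - flank)
    end = min(len(sequence), position + flank + 1)
    window = sequence[start:end].upper()
    called = sum(window.count(base) for base in "ACGT")
    if called == 0:
        return float("nan")  # no callable bases in the window
    gc = window.count("G") + window.count("C")
    return 100.0 * gc / called


# Toy usage on a short made-up sequence with a 1 bp flank.
print(round(gc_percent("ACGT", 1, flank=1), 2))
```

With real data the sequence would come from the reference genome around each SNV; here the toy string only demonstrates the window arithmetic.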
FIGURE 2. ROC curves of the models designed for classifying cancer somatic variants from coding (A) and non-coding (B) regions of the genome.
FIGURE 3. ROC curves comparing the performance of our model (SVM) with FATHMM-XF and CScape for somatic cancer variants in coding (A) and non-coding (B) regions of the genome.
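The area under the ROC curve summarized in these figures equals the probability that a randomly chosen pathogenic SNV receives a higher score than a randomly chosen non-pathogenic one. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = pathogenic) and classifier scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based formulation would be preferable for large datasets.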
Number of pathogenic SNVs predicted by the SVM models in the prediction datasets at the optimum cut-offs [0.55 for METABRIC and TCGA-coding (TCGA-CD), and 0.41 for TCGA-noncoding (TCGA-NC)].
| Dataset | No. of pathogenic predictions | No. of pathogenic predictions overlapping training SNVs | No. of affected genes | Frequency ≥2: No. of SNVs | Frequency ≥2: No. of affected genes | Frequency ≥3: No. of SNVs | Frequency ≥3: No. of affected genes | Frequency ≥4: No. of SNVs | Frequency ≥4: No. of affected genes |
|---|---|---|---|---|---|---|---|---|---|
| METABRIC | 959 | 27 | 154 | 52 | 18 | 17 | 5 | | 3 |
| TCGA-CD | 3,510 | 59 | 2,537 | 232 | 184 | 6 | 2 | | 2 |
| TCGA-NC | 943 | 4 | 331 | | 25 | 0 | 0 | 0 | 0 |
Values in bold are the numbers of SNVs that were used in the SNV-level survival analysis.
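The counts in this table combine two steps: calling an SNV pathogenic when its model score passes the cut-off, and then keeping SNVs that recur. A minimal sketch of that logic follows, assuming that "frequency" means the number of distinct patients carrying the SNV and that scores at or above the cut-off count as positive; neither detail is stated in the source, and the function name and toy scores are hypothetical.

```python
def recurrent_positive_snvs(calls, cutoff, min_patients=2):
    """Return {snv_id: n_patients} for SNVs predicted pathogenic in enough patients.

    `calls` is an iterable of (patient_id, snv_id, score) tuples.
    Assumption: a score >= cutoff is a positive (pathogenic) prediction.
    """
    patients_by_snv = {}
    for patient_id, snv_id, score in calls:
        if score >= cutoff:
            patients_by_snv.setdefault(snv_id, set()).add(patient_id)
    return {
        snv_id: len(patients)
        for snv_id, patients in patients_by_snv.items()
        if len(patients) >= min_patients
    }


# Hypothetical scores; 0.55 is the cut-off quoted for METABRIC and TCGA-CD.
toy_calls = [
    ("P1", "17:7674220_C > T", 0.91),
    ("P2", "17:7674220_C > T", 0.88),
    ("P3", "14:104780214_C > T", 0.60),
    ("P4", "3:179218303_G > A", 0.40),
]
print(recurrent_positive_snvs(toy_calls, cutoff=0.55))
```

Raising `min_patients` to 3 or 4 would correspond to the stricter frequency columns.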
An overview of the genes harboring the recurrent pathogenic SNVs predicted by our models. The “SNV ID” column shows the ID of the recurrent SNV that affects the gene mentioned in “Gene” column. “Ref” column shows the nucleotide in the reference genome sequence and “Alt” column shows the alternative nucleotide that is substituted for the reference nucleotide. The highlighted row shows the SNV that appeared as significant through our subsequent survival analysis.
| Cohort | Gene | SNV ID | Ref | Alt | SNV position [Chr: Position (base pair: GRCh38)] | SNV consequence from VEP Ensembl |
|---|---|---|---|---|---|---|
| METABRIC | AKT1 | 14:104780214_C > T | C | T | 14:104780214 | Missense |
| METABRIC | PIK3CA | | T | A | 3:179203765 | Missense |
| METABRIC | PIK3CA | | G | A | 3:179218294 | Missense |
| METABRIC | PIK3CA | | G | A | 3:179218303 | Missense |
| METABRIC | PIK3CA | | A | T | 3:179234297 | Missense |
| METABRIC | TP53 | 17:7673802_C > T | C | T | 17:7673802 | Missense |
| METABRIC | TP53 | 17:7674220_C > T | C | T | 17:7674220 | Missense |
| METABRIC | TP53 | 17:7674221_G > A | G | A | 17:7674221 | Missense |
| METABRIC | TP53 | 17:7675088_C > T | C | T | 17:7675088 | Missense |
| TCGA-CD | PIK3CA | | T | A | 3:179203765 | Missense |
| TCGA-CD | PIK3CA | | T | A | 3:179218294 | Missense |
| TCGA-CD | PIK3CA | | G | A | 3:179218303 | Missense |
| TCGA-CD | TP53 | 17:7675088_C > T | C | T | 17:7675088 | Missense |
| TCGA-NC | ZFP30 | 19:37613150_G > A | G | A | 19:37613150 | Missense |
| TCGA-NC | CLIC3 | 9:136993900_A > C | A | C | 9:136993900 | Regulatory_region_SNV |
| TCGA-NC | AC211476.2 | 7:72926895_G > C | G | C | 7:72926895 | Missense |
| TCGA-NC | ZNF512 | 2:27578227_C > T | C | T | 2:27578227 | Missense |
| TCGA-NC | KRTAP19-11P | 21:30541689_G > A | G | A | 21:30541689 | Missense |
| TCGA-NC | AL034345.1 | 6:38924007_C > G | C | G | 6:38924007 | Missense |
| TCGA-NC | PGAM1P6 | 2:23869699_C > A | C | A | 2:23869699 | Missense |
| TCGA-NC | ZDHHC11B | 5:711218_G > C | G | C | 5:711218 | Noncoding_exon_SNV |
| TCGA-NC | AC120498.10 | 16:1220974_G > A | G | A | 16:1220974 | Missense |
| TCGA-NC | RF00092 | 1:37880149_C > G | C | G | 1:37880149 | Missense |
| TCGA-NC | MIR519A2 | 19:53761153_G > A | G | A | 19:53761153 | Mature miRNA variant |
| TCGA-NC | AL049555.1 | 6:54941625_C > T | C | T | 6:54941625 | Missense |
| TCGA-NC | PLIN5 | 19:4538646_C > T | C | T | 19:4538646 | Missense |
| TCGA-NC | CDC27P1 | 2:132257729_T > G | T | G | 2:132257729 | Noncoding_exon_SNV |
SNVs in bold overlap with the somatic pathogenic SNVs in the training set; in other words, they are known somatic pathogenic SNVs.
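The SNV IDs in this table follow the pattern `chromosome:position_Ref > Alt`, so the Ref, Alt and position columns can be recovered from the ID alone. A small parsing sketch is shown below; the `Snv` type and `parse_snv_id` function are hypothetical helpers, not part of any published tool.

```python
import re
from typing import NamedTuple


class Snv(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Pattern inferred from the IDs in the table, e.g. "17:7674220_C > T".
_SNV_ID = re.compile(r"^([0-9A-Za-z]+):(\d+)_([ACGT])\s*>\s*([ACGT])$")


def parse_snv_id(snv_id: str) -> Snv:
    """Split an ID such as '14:104780214_C > T' into its components."""
    match = _SNV_ID.match(snv_id.strip())
    if match is None:
        raise ValueError(f"unrecognized SNV ID: {snv_id!r}")
    chrom, pos, ref, alt = match.groups()
    return Snv(chrom, int(pos), ref, alt)


print(parse_snv_id("17:7674220_C > T"))
```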
FIGURE 4. Results from disease-specific survival (DSS) analysis comparing the survival time of breast cancer patients with and without the mutation 17:7674220_C > T. The difference between the two groups of patients (with and without the mutation) is significant among young individuals (under 45 years of age; p-value = 0.037), but not significant in older patients (p-value = 0.88).
FIGURE 5. Results from overall survival (OS) analysis comparing the survival time of breast cancer patients with and without the mutation 17:7674220_C > T. The difference between the two groups of patients (with and without the mutation) is significant among young individuals (under 45 years of age; p-value = 0.018), but not significant in older patients (p-value = 0.98).
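The captions do not name the test behind these p-values. A log-rank test is a common way to compare survival between two groups, so the sketch below assumes it; the function name `logrank_test` and the toy follow-up times are hypothetical and unrelated to the study cohorts.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value) with 1 degree of freedom.

    `times_*` are follow-up times; `events_*` are 1 for an observed event
    and 0 for a censored observation.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data: carriers (group A) have shorter follow-up times than non-carriers.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(chi2, p)
```

To mirror the stratified comparison in the captions, the same test would be run separately on patients under 45 and on older patients.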
Number of genes affected at each frequency threshold. The threshold is the minimum number of positive somatic point mutations that a gene harbors. The highlighted counts were used in the gene-level survival analysis.
| Frequency threshold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Count of genes-METABRIC | 154 | 106 | 74 | 55 | 44 | 36 | 29 | 27 | 23 |
| Count of genes-TCGA-coding | 2,539 | 412 | 92 | 29 | 11 | 7 | 3 | 2 | 2 |
| Count of genes-TCGA-non-coding | 330 | 22 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
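Each column of this table counts the genes that harbor at least a given number of positive SNVs. A minimal sketch of that tally, assuming one gene label per positive SNV as input (the function name and toy gene list are hypothetical):

```python
from collections import Counter


def genes_per_threshold(positive_snv_genes, thresholds=range(1, 10)):
    """Map each threshold k to the number of genes with at least k positive SNVs.

    `positive_snv_genes` holds one gene symbol per positive SNV, so a gene
    appears once for every positive SNV it harbors.
    """
    per_gene = Counter(positive_snv_genes)
    return {k: sum(1 for c in per_gene.values() if c >= k) for k in thresholds}


# Toy input: TP53 with 4 positive SNVs, PIK3CA with 2, AKT1 with 1.
toy_genes = ["TP53"] * 4 + ["PIK3CA"] * 2 + ["AKT1"]
print(genes_per_threshold(toy_genes, thresholds=range(1, 6)))
```

The counts are non-increasing in the threshold by construction, matching the shape of each row in the table.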
FIGURE 6. Results from disease-free survival (DFS) analysis comparing the survival time of breast cancer patients with a mutated or unmutated MUC16 gene. The difference between the two groups (with and without mutations in the MUC16 gene) is significant among young individuals (under 45 years of age; p-value = 0.011), but not significant in older patients (p-value = 0.14).
FIGURE 7. Results from overall survival (OS) analysis comparing the survival time of breast cancer patients with a mutated or unmutated MUC16 gene. The difference between the two groups (with and without mutations in the MUC16 gene) is significant among young individuals (under 45 years of age; p-value = 0.011), but not significant in older patients (p-value = 0.14).
Significant (adjusted p-value < 0.05) gene sets in which our candidate gene lists are overrepresented. For each library, only the five most significant gene sets are reported.
| Gene list | Reactome 2016 | Panther 2016 | KEGG 2019 human | GO biological process 2018 | GO molecular function 2018 | ChEA 2016 |
|---|---|---|---|---|---|---|
| A | Chromatin modifying enzymes | EGF receptor signaling pathway | Endometrial cancer | Positive regulation of nucleic acid-templated transcription | Protein kinase activity | AR_22383394_ChIP-Seq_PROSTATE_CANCER_Human |
| A | Chromatin organization | p53 pathway feedback loops 2 | Hepatocellular carcinoma | Positive regulation of gene expression | Protein kinase binding | STAT3_23295773_ChIP-Seq_U87_Human |
| A | Diseases of signal transduction | Angiogenesis | Pathways in cancer | Positive regulation of transcription, DNA-templated | Transcription coactivator activity | SMAD4_21799915_ChIP-Seq_A2780_Human |
| A | PI-3K cascade:FGFR1 | Insulin/IGF pathway-protein kinase B signaling cascade | Human papillomavirus infection | Phosphatidylinositol 3-kinase signaling | Ubiquitin protein ligase binding | ZNF217_24962896_ChIP-Seq_MCF-7_Human |
| A | PI-3K cascade:FGFR2 | Apoptosis signaling pathway | Breast cancer | Chromatin disassembly | Ubiquitin-like protein ligase binding | DROSHA_22980978_ChIP-Seq_HELA_Human |
| B | Neuronal System | Endothelin signaling pathway | Endometrial cancer | Calcium ion import | Calcium ion transmembrane transporter activity | STAT3_23295773_ChIP-Seq_U87_Human |
| B | Transmission across Chemical Synapses | p53 pathway feedback loops 2 | PI3K-Akt signaling pathway | Axonogenesis | ATPase activity | TCF4_23295773_ChIP-Seq_U87_Human |
| B | PI-3K cascade:FGFR1 | p53 pathway | Pathways in cancer | Calcium ion transmembrane transport | Calcium channel activity | SMAD4_21799915_ChIP-Seq_A2780_Human |
| B | PI-3K cascade:FGFR2 | Ionotropic glutamate receptor pathway | Breast cancer | Protein phosphorylation | Motor activity | AR_22383394_ChIP-Seq_PROSTATE_CANCER_Human |
| B | PI-3K cascade:FGFR3 | Wnt signaling pathway | Pathways in cancer | Calcium ion transport | Voltage-gated cation channel activity | PAX3-FKHR_20663909_ChIP-Seq_RHABDOMYOSARCOMA_Human |
| C | PI-3K cascade:FGFR1 | p53 pathway feedback loops 2 | Endometrial cancer | Protein phosphorylation (GO:0006468) | MAP kinase kinase activity | STAT3_23295773_ChIP-Seq_U87_Human |
| C | PI-3K cascade:FGFR2 | EGF receptor signaling pathway | Gastric cancer | Protein autophosphorylation (GO:0046777) | Calcium ion transmembrane transporter activity | TCF4_23295773_ChIP-Seq_U87_Human |
| C | PI-3K cascade:FGFR3 | Endothelin signaling pathway | Thyroid hormone signaling pathway | Calcium ion import (GO:0070509) | Protein kinase activity (GO:0004672) | SMAD4_21799915_ChIP-Seq_A2780_Human |
| C | PI-3K cascade:FGFR4 | p53 pathway | Central carbon metabolism in cancer | Peptidyl-serine phosphorylation (GO:0018105) | ATPase activity (GO:0016887) | AR_22383394_ChIP-Seq_PROSTATE_CANCER_Human |
| C | PI3K events in ERBB4 signaling | Wnt signaling pathway | Breast cancer | Phosphorylation (GO:0016310) | ATP-dependent microtubule motor activity, minus-end-directed (GO:0008569) | DROSHA_22980978_ChIP-Seq_HELA_Human |
| D | Chromatin modifying enzymes | CCKR signaling map ST | Endometrial cancer | Regulation of megakaryocyte differentiation (GO:0045652) | ATP-dependent microtubule motor activity, minus-end-directed (GO:0008569) | AR_19668381_ChIP-Seq_PC3_Human |
| D | Chromatin organization | Wnt signaling pathway | Human papillomavirus infection | Regulation of myeloid cell differentiation (GO:0045637) | ATP-dependent microtubule motor activity (GO:1990939) | TCF4_23295773_ChIP-Seq_U87_Human |
| D | Developmental Biology | Huntington disease | Hepatocellular carcinoma | Cellular response to caffeine (GO:0071313) | Ligand-gated calcium channel activity (GO:0099604) | SMAD4_21799915_ChIP-Seq_A2780_Human |
| D | PI3K/AKT Signaling in Cancer | p53 pathway | Lysine degradation | Response to caffeine (GO:0031000) | Protein kinase binding (GO:0019901) | STAT3_23295773_ChIP-Seq_U87_Human |
| D | PKMTs methylate histone lysines | Beta1 adrenergic receptor signaling | Huntington disease | Regulation of cardiac muscle cell contraction (GO:0086004) | ATPase activity (GO:0016887) | ZNF217_24962896_ChIP-Seq_MCF-7_Human |
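The source does not state which statistical test or multiple-testing correction produced the adjusted p-values in this table. As an illustration only, the sketch below assumes a one-sided hypergeometric test for overrepresentation followed by Benjamini-Hochberg adjustment; the function names and all numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background):
    """P(X >= overlap) when `list_size` genes are drawn from `background`
    genes, of which `set_size` belong to the gene set."""
    upper = min(list_size, set_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 gene sets tested against a 50-gene candidate list
# drawn from a 20,000-gene background; tuples are (overlap, gene set size).
raw = [hypergeom_pvalue(k, 50, size, 20000) for k, size in [(6, 300), (2, 150), (1, 400)]]
print(benjamini_hochberg(raw))
```

Gene sets whose adjusted value falls below 0.05 would then be ranked, and the top five per library reported.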