Zexian Zeng, Chengsheng Mao, Andy Vo, Xiaoyu Li, Janna Ore Nugent, Seema A Khan, Susan E Clare, Yuan Luo.
Abstract
BACKGROUND: Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. We aim to develop a novel method to effectively explore the landscape of genetic variants, including germline variants and small insertions and deletions, for cancer type prediction.
Keywords: Cancer; Classification; Convolutional neural network; Germline variants; Somatic mutation; Whole-exome sequencing
Year: 2021 PMID: 34689757 PMCID: PMC8543824 DOI: 10.1186/s12859-021-04400-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The number of samples of each cancer and the corresponding number of germline variants and somatic mutations
| Cancer | Cancer samples (n) | Blood samples (n) | Adjacent normal samples (n) | Germline moderate | Germline high | Somatic moderate | Somatic high |
|---|---|---|---|---|---|---|---|
| Breast cancer | 959 | 936 | 23 | 10,911 (71) | 641 (5) | 65 (10) | 16 (3) |
| Colorectal cancer | 420 | 395 | 25 | 10,331 (241) | 621 (15) | 293 (79) | 81 (18) |
| Brain cancer | 763 | 763 | 0 | 9,572 (55) | 550 (3) | 94 (30) | 23 (6) |
| Uterus cancer | 530 | 507 | 23 | 10,894 (103) | 650 (7) | 645 (152) | 160 (28) |
| Lung cancer | 730 | 713 | 17 | 9,717 (37) | 555 (2) | 217 (15) | 43 (3) |
| Kidney cancer | 332 | 332 | 0 | 10,882 (124) | 634 (8) | 51 (6) | 14 (1) |
| Prostate cancer | 440 | 440 | 0 | 9,744 (53) | 558 (4) | 34 (22) | 7 (3) |
Variants annotated as moderate effect are defined as missense mutations and in-frame shifts; variants annotated as high effect are defined as nonsense mutations.
*Numbers in parentheses are standard deviations (SD).
Fig. 1 Feature generation for the proposed models. a The transcript sequences were retrieved from RefSeq and assembled into a consensus matrix. b Each patient’s germline variants were embedded in the consensus matrix, forming a germline raw sequence for each sample. The brown dots are germline variants, including polymorphisms, deletions, and insertions. As an illustration, single nucleotide polymorphisms were identified and embedded in transcripts A, E, and H. An in-frame deletion was embedded in transcript B and an in-frame insertion in transcript C. A frameshift deletion and a frameshift insertion were embedded in transcripts D and E, respectively. Transcripts F and G remained unchanged. c Each patient’s somatic mutations were embedded in the germline raw sequence (from b), forming a combined germline and cancer raw sequence. The green dots are somatic mutations, including SNVs, insertions, and deletions. As an illustration, the tissue gained somatic mutations in transcripts A and E, a stop loss in transcript F, and a frameshift deletion in transcript G.
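The embedding step described in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the `Variant` record, the 0-based coordinate convention, the toy reference string, and the choice to give somatic positions relative to the germline sequence are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record (not from the paper)."""
    pos: int  # 0-based start position in the sequence being edited
    ref: str  # bases replaced ("" for a pure insertion)
    alt: str  # bases written in their place ("" for a pure deletion)


def embed_variants(sequence: str, variants: list[Variant]) -> str:
    """Return `sequence` with all (non-overlapping) variants applied."""
    seq = sequence
    # Apply right to left so an indel never shifts a position still to be edited.
    for v in sorted(variants, key=lambda v: v.pos, reverse=True):
        if seq[v.pos:v.pos + len(v.ref)] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        seq = seq[:v.pos] + v.alt + seq[v.pos + len(v.ref):]
    return seq


# Toy reference transcript (illustrative only).
reference = "ATGGCCAAGTGA"

# Step b: germline SNP (C>T) plus an in-frame 3-base deletion.
germline = [Variant(4, "C", "T"), Variant(6, "AAG", "")]
germline_seq = embed_variants(reference, germline)

# Step c: a somatic 1-base insertion on top of the germline sequence.
# Simplification: the position is given in germline-sequence coordinates.
somatic = [Variant(3, "", "A")]
cancer_seq = embed_variants(germline_seq, somatic)
```

Here `germline_seq` keeps the reading frame (the length stays a multiple of three), while the somatic insertion in `cancer_seq` shifts it. Overlapping variants are not handled in this sketch.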
Fig. 2 The architecture of the convolutional neural network. Component a is the input layer, a one-hot encoding in which the number of columns equals 64 (the total number of possible codons) and the number of rows equals the number of codons in the transcript. Component b is the encoder, a sequence of layers, each consisting of a convolutional layer followed by a Leaky Rectified Linear Unit and an average pooling layer. The number of convolutional layers is determined by the gene length. Component c is a fully connected layer that combines all the outputs from component b and has k outputs for k diseases.
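The input encoding and one encoder layer from the Fig. 2 caption could look roughly like the NumPy sketch below. The codon ordering, the handling of incomplete or ambiguous codons, the kernel width, the number of output channels, the leaky slope, and the pooling size are all assumptions; the caption specifies only the 64-column one-hot input and the convolution, Leaky ReLU, average pooling order.

```python
from itertools import product

import numpy as np

BASES = "ACGT"
# All 64 possible codons in a fixed order; the ordering itself is an assumption.
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def one_hot_codons(seq: str) -> np.ndarray:
    """Encode a transcript as a (number of codons, 64) one-hot matrix."""
    n_codons = len(seq) // 3  # assumption: an incomplete trailing codon is dropped
    mat = np.zeros((n_codons, len(CODONS)), dtype=np.float32)
    for row in range(n_codons):
        col = CODON_INDEX.get(seq[3 * row:3 * row + 3])
        if col is not None:  # assumption: codons with non-ACGT symbols stay all-zero
            mat[row, col] = 1.0
    return mat


def conv1d_valid(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1-D convolution along the codon axis, without padding.

    x: (length, in_channels); kernel: (width, in_channels, out_channels).
    """
    width = kernel.shape[0]
    # windows has shape (length - width + 1, in_channels, width)
    windows = np.lib.stride_tricks.sliding_window_view(x, width, axis=0)
    return np.einsum("lcw,wco->lo", windows, kernel)


def leaky_relu(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    return np.where(x > 0, x, slope * x)


def avg_pool(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Average non-overlapping groups of `size` rows; leftover rows are dropped."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).mean(axis=1)


def encoder_layer(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """One encoder block: convolution, then Leaky ReLU, then average pooling."""
    return avg_pool(leaky_relu(conv1d_valid(x, kernel)))


x = one_hot_codons("ATGGCCAAGTGA")    # toy transcript, 4 codons
rng = np.random.default_rng(0)
kernel = rng.normal(size=(2, 64, 8))  # random weights, for shape illustration only
h = encoder_layer(x, kernel)
```

Stacking `encoder_layer` repeatedly shortens the codon axis at each step, which is one plausible reason the number of layers would depend on gene length; the exact rule is not given in the caption.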
Fig. 3 Prediction accuracy comparisons between DeepCues and baseline models. The compared methods are penalized logistic regression (LR), support vector machine (SVM) with a linear kernel, Gradient Boosting Decision Tree (GBDT), and Multilayer Perceptron (MLP).
Precision and recall for our proposed model
| Cancer | Germline sequence: Precision | Germline sequence: Recall | Germline sequence: F-measure | Cancer sequence: Precision | Cancer sequence: Recall | Cancer sequence: F-measure |
|---|---|---|---|---|---|---|
| Breast | 81.9% (2.7%) | 83.4% (4.2%) | 81.4% (1.8%) | 85.6% (2.0%) | 90.7% (1.8%) | |
| Colorectal | 85.9% (1.9%) | 83.9% (2.1%) | 84.6% (0.8%) | 84.7% (4.5%) | 87.1% (2.6%) | 84.7% (2.1%) |
| Brain | 73.0% (1.5%) | 66.5% (4.3%) | 68.5% (1.9%) | |||
| Uterus | 76.3% (4.9%) | 62.7% (6.9%) | 64.2% (3.9%) | 85.3% (1.8%) | 68.2% (3.3%) | |
| Lung | 64.8% (3.2%) | 75.5% (4.7%) | 67.7% (1.5%) | 70.3% (4.9%) | 78.4% (5.5%) | 71.2% (2.0%) |
| Kidney | 76.9% (2.5%) | 71.5% (2.8%) | 73.4% (1.1%) | 77.6% (4.7%) | 68.7% (5.9%) | 69.5% (2.3%) |
| Prostate | 70.8% (4.2%) | 55.7% (6.0%) | 58.9% (3.2%) | 65.6% (5.5%) | 50.5% (9.7%) | 49.1% (5.9%) |
The experiment was replicated 10 times; the numbers in parentheses are standard errors. Bolded numbers are those that improved significantly in the cancer sequence compared with the germline sequence.
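As a reminder of how a mean and standard error over replicated runs are typically obtained, here is a minimal sketch. The per-run values are hypothetical placeholders, not results from the study, and the use of the sample standard deviation (n - 1 denominator) is an assumption about the convention.

```python
import math
import statistics


def mean_and_standard_error(values: list[float]) -> tuple[float, float]:
    """Return (mean, standard error), with SE = sample SD / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return statistics.fmean(values), sd / math.sqrt(n)


# HYPOTHETICAL recall values for 10 replicated runs; NOT taken from the study.
runs = [0.80, 0.82, 0.84, 0.86, 0.88, 0.80, 0.82, 0.84, 0.86, 0.88]
mean, se = mean_and_standard_error(runs)
```

The standard error shrinks with the square root of the number of runs, so it is smaller than the standard deviation reported in the sample-count table.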
The confusion matrix for our proposed model
| True class | Germline sequence: predicted positive | Germline sequence: predicted negative | Cancer sequence: predicted positive | Cancer sequence: predicted negative |
|---|---|---|---|---|
| Breast | ||||
| Positive | 161 | 31 | 169 | 22 |
| Negative | 36 | 608 | 39 | 604 |
| Colorectal | ||||
| Positive | 73 | 11 | 72 | 11 |
| Negative | 15 | 737 | 8 | 743 |
| Brain | ||||
| Positive | 124 | 29 | 100 | 53 |
| Negative | 67 | 616 | 53 | 628 |
| Uterus | ||||
| Positive | 64 | 42 | 62 | 44 |
| Negative | 22 | 708 | 16 | 713 |
| Lung | ||||
| Positive | 88 | 58 | 94 | 52 |
| Negative | 33 | 657 | 60 | 628 |
| Kidney | ||||
| Positive | 48 | 19 | 49 | 19 |
| Negative | 18 | 751 | 26 | 741 |
| Prostate | ||||
| Positive | 57 | 31 | 50 | 38 |
| Negative | 30 | 719 | 36 | 710 |
Each number is the average over the 10 runs of test-set prediction.
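Precision, recall, and F-measure follow directly from the counts in a one-vs-rest confusion matrix. The sketch below applies the standard definitions to the Breast, germline-sequence entries above (true positives 161, false negatives 31, false positives 36). The function name is illustrative, and metrics computed from run-averaged counts need not equal the run-averaged metrics reported in the precision and recall tables.

```python
def precision_recall_f1(tp: float, fn: float, fp: float) -> tuple[float, float, float]:
    """Standard one-vs-rest metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Breast, germline sequence: true-positive row is (161, 31), true-negative row is (36, 608).
p, r, f = precision_recall_f1(tp=161, fn=31, fp=36)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```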
Precision and recall for our proposed model
| Cancer | Germline sequence: Precision | Germline sequence: Recall | Germline sequence: F-measure | Cancer sequence: Precision | Cancer sequence: Recall | Cancer sequence: F-measure |
|---|---|---|---|---|---|---|
| Breast | 87.1% (2.8%) | 91.9% (1.6%) | 89.0% (1.2%) | 89.1% (1.6%) | 87.2% (2.2%) | 87.8% (0.7%) |
| Colorectal | 87.8% (1.7%) | 94.0% (1.6%) | 90.6% (0.8%) | 86.1% (3.8%) | 90.8% (2.9%) | 87.5% (2.2%) |
| Uterus | 90.0% (1.8%) | 72.8% (6.8%) | 78.0% (5.5%) | 84.9% (3.1%) | 71.4% (2.8%) | 76.6% (1.0%) |
| Brain | 88.3% (4.4%) | 58.6% (5.2%) | 67.9% (2.2%) | 77.3% (2.5%) | ||
| Lung | 69.1% (4.9%) | 79.9% (7.2%) | 69.8% (2.7%) | 73.7% (3.1%) | 75.6% (4.0%) | 73.4% (1.8%) |
| Kidney | 81.9% (2.7%) | 81.3% (2.0%) | 81.3% (1.5%) | 70.0% (3.8%) | 76.0% (5.0%) | 71.2% (3.3%) |
| Prostate | 76.4% (6.1%) | 67.3% (7.2%) | 66.0% (3.7%) | 72.2% (2.7%) | 65.5% (3.5%) | 68.1% (2.3%) |
The experiment was replicated 10 times; the numbers in parentheses are standard errors. Bolded numbers are those that improved significantly in the cancer sequence compared with the germline sequence.
The top 20 genes relevant to breast cancer derived from the 985 pathogenetic transcripts and the 1970 transcripts
| 985 transcripts: Germline | 985 transcripts: Cancer | 1970 transcripts: Germline | 1970 transcripts: Cancer |
|---|---|---|---|
| FOXP1 | CASP8 | ||
| APC | TCF3 | PALB2 | |
| PAFAH1B2 | |||
| PPP2R1A | CCNE1 | ANK1 | |
| CHD2 | NCOA4 | ITK | |
| TCF3 | |||
| MSH6 | PALB2 | CCNE1 | |
| CNBD1 | FOXP1 | ||
| LEF1 | |||
| TBX3 | |||
| TP53 | KRAS | ||
| CHD4 | BLM | ||
| ITK | SDHAF2 | ||
| NRAS | ZNF507 | ||
| MAP3K1 | PICALM | ||
| HLA-A | TERT | LEF1 | |
Bold genes in the 985-transcript columns are those found among the COSMIC top 20 genes. Bold genes in the 1970-transcript columns are those in the unknown transcripts.