Yuchen Yuan1,2, Yi Shi3, Xianbin Su1, Xin Zou1, Qing Luo1, David Dagan Feng4, Weidong Cai5, Ze-Guang Han6.
Abstract
BACKGROUND: With the developments of DNA sequencing technology, large amounts of sequencing data have been produced that provides unprecedented opportunities for advanced association studies between somatic mutations and cancer types/subtypes which further contributes to more accurate somatic mutation based cancer typing (SMCT). In existing SMCT methods however, the absence of high-level feature extraction is a major obstacle in improving the classification performance.Entities:
Keywords: Cancer type prediction; Convolutional neural network; Copy number aberration; Deep learning; HiC; Somatic mutation
Year: 2018 PMID: 30367576 PMCID: PMC6101087 DOI: 10.1186/s12864-018-4919-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Architecture of our proposed 1D CNN
| Layer | Type | Output size | Conv (size, channel, pad) | Max pooling |
|---|---|---|---|---|
| input | in | 32768*1*ch | N/A | N/A |
| conv1 | c + r + p | 8192*1*32 | 3*1, 32, 1 | 4*1 |
| conv2 | c + r + p | 2048*1*64 | 3*1, 64, 1 | 4*1 |
| conv3 | c + r + p | 512*1*128 | 3*1, 128, 1 | 4*1 |
| conv4 | c + r + p | 128*1*256 | 3*1, 256, 1 | 4*1 |
| conv5 | c + r + p | 32*1*512 | 3*1, 512, 1 | 4*1 |
| conv6 | c + r | 1*1*4096 | 32*1, 4096, 0 | N/A |
| fc7 | fc + r + d | 1*1*4096 | 1*1, 4096, 0 | N/A |
| fc8 | fc | 1*1*25 | 1*1, 25, 0 | N/A |
| loss | sm + log | 1*1 | N/A | N/A |
Annotations - in: input layer; c: convolutional layer; r: ReLU layer; p: pooling layer; fc: fully connected layer; d: dropout layer; sm: softmax layer; log: log loss layer; ch: number of input channels (depending on whether the HiC data is used); asterisk (*): multiplication
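As a sanity check, the spatial sizes in the 1D CNN table follow from the standard convolution/pooling output-size arithmetic. A minimal sketch in Python (the helper names `conv_out` and `pool_out` are illustrative, not from the paper):

```python
def conv_out(n, kernel, pad=0, stride=1):
    """Output length along one spatial axis: floor((n + 2*pad - kernel)/stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

def pool_out(n, window):
    """Output length after non-overlapping max pooling with the given window."""
    return n // window

# Trace the spatial dimension through conv1..conv5 of the table above:
# each is a 3*1 convolution with padding 1 (size-preserving) plus 4*1 pooling.
n = 32768
for _ in range(5):
    n = pool_out(conv_out(n, 3, pad=1), 4)   # 8192, 2048, 512, 128, 32
print(n)                                      # 32 (conv5's output size)
print(conv_out(n, 32))                        # conv6: 32*1 kernel, no padding -> 1
```

This confirms why conv6 needs a 32*1 kernel with zero padding: it collapses the remaining 32-sample axis to a single position before the fully connected layers.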
Architecture of our proposed 2D CNN
| Layer | Type | Output size | Conv (size, channel, pad) | Max pooling |
|---|---|---|---|---|
| input | in | 176*176*ch | N/A | N/A |
| conv1 | c + r + p | 88*88*32 | 3*3, 32, 1 | 2*2 |
| conv2 | c + r + p | 44*44*64 | 3*3, 64, 1 | 2*2 |
| conv3 | c + r + p | 22*22*128 | 3*3, 128, 1 | 2*2 |
| conv4 | c + r + p | 11*11*256 | 3*3, 256, 1 | 2*2 |
| conv5 | c + r | 1*1*1024 | 11*11, 1024, 0 | N/A |
| fc6 | fc + r + d | 1*1*1024 | 1*1, 1024, 0 | N/A |
| fc7 | fc | 1*1*25 | 1*1, 25, 0 | N/A |
| loss | sm + log | 1*1 | N/A | N/A |
Annotations - in: input layer; c: convolutional layer; r: ReLU layer; p: pooling layer; fc: fully connected layer; d: dropout layer; sm: softmax layer; log: log loss layer; ch: number of input channels (depending on whether the HiC data is used); asterisk (*): multiplication
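The 2D table checks out the same way: four size-preserving 3*3 convolutions each followed by 2*2 pooling halve the 176*176 input down to 11*11, which the 11*11 unpadded conv5 kernel then collapses to 1*1. A brief sketch (the helper name `conv_out` is illustrative):

```python
def conv_out(n, kernel, pad=0, stride=1):
    """Output length along one spatial axis: floor((n + 2*pad - kernel)/stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

n = 176
for _ in range(4):                     # conv1..conv4: 3*3 conv (pad 1) + 2*2 pooling
    n = conv_out(n, 3, pad=1) // 2     # 88, 44, 22, 11
print(n)                               # 11
print(conv_out(n, 11))                 # conv5: 11*11 kernel, no padding -> 1
```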
Fig. 1 Performance of our proposed method with different design options. a With different HiC data configurations. From left to right: baseline model (2D CNN); baseline with hESC only; baseline with IMR90 only; baseline with both types of HiC data. The last configuration yields the best performance. b With different network and HiC combinations. From left to right: 1D CNN without HiC data; 1D CNN with HiC data; 2D CNN without HiC data; 2D CNN with HiC data. The last configuration yields the best performance
Evaluation of SVM with different kernel types
| Kernel | Linear | Polynomial | RBF |
|---|---|---|---|
| Accuracy | 0.317 | 0.322 | 0.275 |
Evaluation of KNN with different numbers of neighbors (columns) and p values (rows)
| p \ neighbors | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| 1 | 0.257 | 0.259 | 0.262 | 0.265 | 0.266 |
| 2 | 0.263 | 0.273 | 0.283 | 0.279 | 0.277 |
| 3 | 0.254 | 0.259 | 0.264 | 0.258 | 0.262 |
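The p parameter here is presumably the order of the Minkowski distance (p = 1: Manhattan, p = 2: Euclidean), as in common KNN implementations such as scikit-learn's `KNeighborsClassifier`. A self-contained sketch of how p and the neighbor count interact (all names hypothetical, not the paper's code):

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, labels, query, k=5, p=2):
    """Rank training points by Minkowski distance and vote among the k nearest."""
    ranked = sorted(range(len(train)), key=lambda i: minkowski(train[i], query, p))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated clusters.
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5), k=3, p=2))   # a
```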
Evaluation of NB with different data distribution assumptions
| Distribution | Bernoulli | Multinomial | Gaussian |
|---|---|---|---|
| Accuracy | 0.161 | 0.238 | 0.139 |
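The gap between the three NB variants plausibly reflects how each models a feature vector x: Bernoulli reduces features to presence/absence, multinomial keeps the counts, and Gaussian assumes continuous normally distributed values. A toy sketch of the first two class-conditional log-likelihoods (hypothetical helper names, uniform toy parameters theta):

```python
import math

def bernoulli_loglik(x, theta):
    """Features treated as binary present/absent events."""
    return sum(math.log(t if xi > 0 else 1 - t) for xi, t in zip(x, theta))

def multinomial_loglik(x, theta):
    """Features treated as counts from a class-specific multinomial (up to a constant)."""
    return sum(xi * math.log(t) for xi, t in zip(x, theta))

# The same count vector scores differently under the two assumptions:
x, theta = [2, 1], [0.5, 0.5]
print(bernoulli_loglik(x, theta))     # both features "present"
print(multinomial_loglik(x, theta))   # counts 2 and 1 weighted individually
```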
Fig. 2 Performance of our proposed method against three widely adopted data classifiers. a The comparison methods use raw CNA input data (without HiC). From left to right: our method, SVM (polynomial kernel), KNN (number of neighbors = 5 and p = 2), and NB (multinomial distribution). Our method shows a significant advantage over the comparison methods. b The comparison methods use both CNA and HiC as input data. From left to right: our method, SVM (polynomial kernel), KNN (number of neighbors = 5 and p = 2), and NB (multinomial distribution). Our method shows an even greater advantage over the comparison methods