N Özlem Özcan Şimşek, Arzucan Özgür, Fikret Gürgen.
Abstract
BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be used to diagnose many diseases, such as cancer. Raw human genome data is too large for computational systems to handle easily, so a compact and accurate representation of the valuable information in DNA is needed. Complex genetic disorders often result from mutations in multiple genes, and not every mutation contributes equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures together with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is.
Keywords: BM25; DNA mutations; Disease classification; Gene weighting; Information retrieval; Machine learning; tf-idf; tf-rf
Year: 2019 PMID: 31195961 PMCID: PMC6567431 DOI: 10.1186/s12859-019-2868-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
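
The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes that tf is the mutation count of a gene in one patient, that idf takes the standard form log(N / df), and that rf takes the form log2(2 + a / max(1, c)) familiar from text categorization, where a and c count the patients with and without the target disease who carry a mutation in the gene. The toy patients, the gene selection, and all function names are hypothetical, and the paper's exact formulas may differ.

```python
import math

# Hypothetical toy data: each patient is (cancer_type, {gene: mutation_count}).
patients = [
    ("lung",   {"TP53": 3, "KRAS": 2}),
    ("lung",   {"KRAS": 1, "EGFR": 1}),
    ("breast", {"TP53": 1, "BRCA1": 4}),
    ("breast", {"BRCA1": 2}),
]

def idf(gene, patients):
    """Inverse document frequency: log(N / number of patients mutated in gene)."""
    df = sum(1 for _, genes in patients if genes.get(gene, 0) > 0)
    return math.log(len(patients) / df) if df else 0.0

def rf(gene, patients, positive_class):
    """Relevance frequency (assumed form): log2(2 + a / max(1, c))."""
    a = sum(1 for label, genes in patients
            if label == positive_class and genes.get(gene, 0) > 0)
    c = sum(1 for label, genes in patients
            if label != positive_class and genes.get(gene, 0) > 0)
    return math.log2(2 + a / max(1, c))

def weight_patient(genes, global_weight):
    """Multiply each gene's tf (mutation count) by a collection-level weight."""
    return {gene: tf * global_weight(gene) for gene, tf in genes.items()}

# tf-idf and tf-rf vectors for the first patient.
tf_idf = weight_patient(patients[0][1], lambda g: idf(g, patients))
tf_rf = weight_patient(patients[0][1], lambda g: rf(g, patients, "lung"))
print(tf_idf)
print(tf_rf)
```

In this toy set KRAS is mutated only in the lung patients, so its rf for lung is log2(4) = 2, while TP53, mutated in one patient of each class, gets the smaller weight log2(3).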
Fig. 1 Overview of the proposed system
The list of cancer types and sample counts in the BOUN10CANCER dataset
| Cancer type | Sample count |
|---|---|
| Lung | 1232 |
| Breast | 1080 |
| Brain | 1028 |
| Kidney | 734 |
| Colorectal | 656 |
| Thyroid | 504 |
| Prostate | 503 |
| Skin | 472 |
| Stomach | 441 |
| Liver | 378 |
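
The class sizes above are imbalanced, so a majority-class baseline is a useful reference point when reading the accuracy figures. The following sketch only re-uses the sample counts from the table; the variable names are illustrative.

```python
# Sample counts copied from the BOUN10CANCER table above.
sample_counts = {
    "Lung": 1232, "Breast": 1080, "Brain": 1028, "Kidney": 734,
    "Colorectal": 656, "Thyroid": 504, "Prostate": 503, "Skin": 472,
    "Stomach": 441, "Liver": 378,
}

total = sum(sample_counts.values())
shares = {cancer: n / total for cancer, n in sample_counts.items()}

# A classifier that always predicts the largest class reaches this accuracy.
majority_class = max(sample_counts, key=sample_counts.get)
baseline_accuracy = sample_counts[majority_class] / total

print(total)                                        # 7028
print(majority_class, f"{baseline_accuracy:.1%}")   # Lung 17.5%
```

Assuming the test folds keep roughly these proportions, a model whose accuracy sits near 17.5% is doing no better than always predicting lung cancer.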
Fig. 2 The effect of the smoothing parameter k in the BM25 calculations for term frequency
Parameter tuning results for the parameter k in the BM25-tf formula
| k | Accuracy | F-Score | Precision | Recall |
|---|---|---|---|---|
| 0.6 | 75.20 ± 1.21 | 75.89 ± 1.14 | 76.59 ± 1.11 | 75.20 ± 1.21 |
| [row shown in bold in the source; values missing] | | | | |
| 1.0 | 75.24 ± 1.60 | 75.87 ± 1.55 | 76.51 ± 1.53 | 75.24 ± 1.60 |
| 1.2 | 74.60 ± 1.60 | 75.32 ± 1.49 | 76.07 ± 1.39 | 74.60 ± 1.60 |
| 1.4 | 74.24 ± 1.00 | 74.88 ± 0.76 | 75.54 ± 0.65 | 74.24 ± 1.00 |
| 1.6 | 74.43 ± 1.39 | 75.35 ± 1.28 | 76.30 ± 1.42 | 74.43 ± 1.39 |
| 1.8 | 74.21 ± 1.46 | 74.89 ± 1.40 | 75.58 ± 1.38 | 74.21 ± 1.46 |
| 2.0 | 74.73 ± 1.32 | 75.53 ± 1.21 | 76.36 ± 1.16 | 74.73 ± 1.32 |
The row with the highest scores is shown in bold.
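
To make the role of k concrete, here is a small sketch of a saturating term frequency. It assumes the common BM25 term-frequency component without document-length normalization, tf * (k + 1) / (tf + k); the exact BM25-tf formula used in the paper is not reproduced in this excerpt, so treat the function as an illustration only.

```python
def bm25_tf(tf, k):
    """Saturating term frequency (assumed form): tf * (k + 1) / (tf + k)."""
    if tf < 0 or k <= 0:
        raise ValueError("tf must be non-negative and k must be positive")
    return tf * (k + 1) / (tf + k)

# Compare how quickly the weight saturates for a few of the tuned k values.
for k in (0.6, 1.0, 2.0):
    curve = [round(bm25_tf(tf, k), 2) for tf in (0, 1, 2, 5, 20)]
    print(f"k={k}: {curve}")
```

Under this form a single mutation always maps to 1, and the weight can never exceed k + 1. A smaller k therefore flattens the contribution of heavily mutated genes sooner.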
The range (or list) of parameters used in the parameter tuning phase for the classification models
| Algorithm | Parameter | Range or Values |
|---|---|---|
| KNN | k | [2, 150] |
| SVM | Kernel | linear, polynomial, rbf |
| SVM | Polynomial degree | [2, 5] |
| SVM | Gamma | [10⁻⁴, 10⁻¹] |
| SVM | Cost | [10¹, 10⁴] |
| Perceptron | Optimization function | Adam, SGD |
| Perceptron | Activation function | ReLU, tanh |
| Perceptron | Hidden layer size | [10, 100] |
| Perceptron | The maximum number of iterations | [100, 300] |
| Feed Forward NN | Optimization function | Adam, SGD |
| Feed Forward NN | Activation function | ReLU, tanh |
| Feed Forward NN | The number of layers | [2, 6] |
| Feed Forward NN | Dropout rate | [0.25, 0.5] |
| Feed Forward NN | The number of nodes in the first layer | [1024, 8192] |
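
A search space like the SVM rows above can be enumerated as a plain grid. The sketch below is only an illustration: the table gives ranges, so sampling gamma and cost at powers of ten and the degree at every integer is an assumption, as are the function and key names.

```python
from itertools import product

def svm_candidates():
    """Yield one parameter dict per SVM configuration in the assumed grid."""
    gammas = [10.0 ** e for e in range(-4, 0)]   # 1e-4 ... 1e-1 (assumed steps)
    costs = [10.0 ** e for e in range(1, 5)]     # 1e1 ... 1e4 (assumed steps)
    degrees = range(2, 6)                        # 2 ... 5
    for kernel in ("linear", "polynomial", "rbf"):
        # The polynomial degree only matters for the polynomial kernel.
        kernel_degrees = degrees if kernel == "polynomial" else [None]
        for degree, gamma, cost in product(kernel_degrees, gammas, costs):
            yield {"kernel": kernel, "degree": degree,
                   "gamma": gamma, "cost": cost}

candidates = list(svm_candidates())
print(len(candidates))   # 96
print(candidates[0])
```

Each candidate would then be scored with cross-validation and the best-scoring configuration kept for each data representation.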
The best parameters found as a result of the parameter tuning phase for the classification models
| Algorithm | Parameter | Value | Data Rep. |
|---|---|---|---|
| KNN | k | 50 | binary |
| KNN | k | 10 | c-score, tf-idf, tf-rf, bm25-tf-idf, bm25-tf-rf |
| SVM-poly | Polynomial degree | 3 | binary, tf-idf, bm25-tf-idf |
| SVM-poly | Polynomial degree | 2 | c-score, tf-rf, bm25-tf-rf |
| SVM-rbf | Gamma | 10⁻⁴ | All |
| SVM-rbf | Cost | 10³ | All |
| SVM-linear | Gamma | 10⁻⁴ | All |
| SVM-linear | Cost | 10² | All |
| Perceptron | Optimization function | SGD | binary |
| Perceptron | Optimization function | Adam | c-score, tf-idf, tf-rf, bm25-tf-idf, bm25-tf-rf |
| Perceptron | Activation function | tanh | binary |
| Perceptron | Activation function | ReLU | c-score, tf-idf, tf-rf, bm25-tf-idf, bm25-tf-rf |
| Perceptron | Hidden layer size | 100 | All |
| Perceptron | The maximum number of iterations | 200 | binary |
| Perceptron | The maximum number of iterations | 300 | c-score, tf-idf, tf-rf, bm25-tf-idf, bm25-tf-rf |
| Feed Forward NN | Optimization function | Adam | All |
| Feed Forward NN | Activation function | ReLU | All |
| Feed Forward NN | The number of layers | 4 | All |
| Feed Forward NN | Dropout rate | 0.25 | All |
| Feed Forward NN | The number of nodes in the first layer | 8192 | All |
Machine learning experiment test results on the gene sets with the binary representation model
| Gene Set | Algorithm | Data Rep. | Accuracy | F-Score | Precision | Recall | ROC-AUC | FPR |
|---|---|---|---|---|---|---|---|---|
| causal | LR | binary | 36.81 ± 0.45 | 36.36 ± 0.50 | 35.93 ± 0.52 | 36.81 ± 0.45 | 0.63 ± 0.03 | 9.03 ± 0.10 |
| causal | SVM-linear | binary | 33.53 ± 0.32 | 32.70 ± 0.99 | 31.92 ± 1.13 | 33.53 ± 0.32 | 0.62 ± 0.05 | 9.38 ± 0.11 |
| causal | Perceptron | binary | 36.74 ± 0.56 | 36.62 ± 0.83 | 36.52 ± 2.56 | 36.74 ± 0.56 | 0.63 ± 0.06 | 10.01 ± 0.10 |
| all | LR | binary | 67.19 ± 0.41 | 68.01 ± 0.01 | 68.01 ± 0.00 | 67.01 ± 0.01 | 0.78 ± 0.01 | 3.85 ± 0.07 |
| all | SVM-linear | binary | 68.46 ± 0.67 | 68.01 ± 0.01 | 69.01 ± 0.01 | 68.01 ± 0.01 | 0.78 ± 0.01 | 4.07 ± 0.09 |
| all | Perceptron | binary | 68.50 ± 0.48 | 69.01 ± 0.01 | 70.01 ± 0.01 | 68.01 ± 0.01 | 0.78 ± 0.03 | 4.07 ± 0.09 |
Machine learning experiment test results on the data representation models of the full gene BOUN10CANCER dataset
| Algorithm | Data Rep. | Accuracy | F-Score | Precision | Recall | ROC-AUC | FPR |
|---|---|---|---|---|---|---|---|
| NB | binary | 33.84 ± 0.83 | 35.25 ± 0.95 | 37.04 ± 1.34 | 33.84 ± 0.83 | 0.62 ± 0.02 | 8.38 ± 0.11 |
| NB | c-score | 31.10 ± 0.86 | 32.72 ± 0.74 | 34.53 ± 1.43 | 31.10 ± 0.86 | 0.59 ± 0.01 | 8.61 ± 0.08 |
| NB | tf-idf | 33.34 ± 0.48 | 35.04 ± 0.60 | 37.03 ± 1.03 | 33.34 ± 0.48 | 0.62 ± 0.02 | 7.99 ± 0.07 |
| NB | tf-rf | [values missing] | | | | | |
| NB | bm25-tf-idf | 32.50 ± 0.96 | 34.19 ± 0.87 | 36.08 ± 1.35 | 32.50 ± 0.96 | 0.60 ± 0.01 | 8.48 ± 0.10 |
| NB | bm25-tf-rf | 37.94 ± 0.63 | 38.99 ± 0.60 | 40.12 ± 1.24 | 37.94 ± 0.63 | 0.62 ± 0.01 | 7.91 ± 0.10 |
| KNN | binary | 11.54 ± 0.85 | 16.87 ± 0.66 | 31.46 ± 2.54 | 11.54 ± 0.85 | 0.50 ± 0.04 | 7.41 ± 0.04 |
| KNN | c-score | 15.87 ± 0.63 | 22.60 ± 0.44 | 39.27 ± 4.21 | 15.87 ± 0.63 | 0.53 ± 0.01 | 7.96 ± 0.07 |
| KNN | tf-idf | [values missing] | | | | | |
| KNN | tf-rf | 19.29 ± 0.44 | 22.23 ± 0.61 | 40.29 ± 0.82 | 19.29 ± 0.44 | 0.55 ± 0.02 | 7.57 ± 0.07 |
| KNN | bm25-tf-idf | 12.72 ± 1.23 | 20.05 ± 0.58 | 47.32 ± 5.85 | 12.72 ± 1.23 | 0.51 ± 0.01 | 8.17 ± 0.37 |
| KNN | bm25-tf-rf | 11.91 ± 1.13 | 19.21 ± 0.50 | 49.74 ± 1.58 | 11.91 ± 1.13 | 0.51 ± 0.01 | 7.88 ± 0.17 |
| SVM-poly | binary | 17.50 ± 0.00 | 5.21 ± 0.00 | 3.06 ± 0.00 | 17.50 ± 0.00 | 0.53 ± 0.00 | 16.34 ± 0.00 |
| SVM-poly | c-score | [values missing] | | | | | |
| SVM-poly | tf-idf | 17.50 ± 0.00 | 5.21 ± 0.00 | 3.06 ± 0.00 | 17.50 ± 0.00 | 0.53 ± 0.00 | 16.35 ± 0.00 |
| SVM-poly | tf-rf | 55.51 ± 0.55 | 56.52 ± 0.65 | 61.40 ± 0.53 | 55.51 ± 0.55 | 0.71 ± 0.03 | 5.16 ± 0.05 |
| SVM-poly | bm25-tf-idf | 36.36 ± 0.66 | 42.64 ± 0.75 | 51.56 ± 0.89 | 36.36 ± 0.66 | 0.62 ± 0.01 | 7.93 ± 0.08 |
| SVM-poly | bm25-tf-rf | 53.41 ± 0.27 | 51.46 ± 0.27 | 63.95 ± 0.65 | 53.41 ± 0.27 | 0.66 ± 0.01 | 7.38 ± 0.04 |
| SVM-rbf | binary | 66.71 ± 0.36 | 67.01 ± 0.00 | 68.01 ± 0.00 | 67.01 ± 0.01 | 0.78 ± 0.01 | 4.04 ± 0.09 |
| SVM-rbf | c-score | 57.35 ± 0.30 | 61.31 ± 0.28 | 65.86 ± 1.10 | 57.35 ± 0.30 | 0.72 ± 0.01 | 7.09 ± 0.05 |
| SVM-rbf | tf-idf | 50.92 ± 0.19 | 44.26 ± 0.20 | 51.64 ± 0.19 | 50.92 ± 0.19 | 0.69 ± 0.02 | 8.30 ± 0.03 |
| SVM-rbf | tf-rf | 69.53 ± 0.71 | 69.82 ± 0.72 | 70.75 ± 0.71 | 69.53 ± 0.71 | 0.78 ± 0.03 | 3.64 ± 0.09 |
| SVM-rbf | bm25-tf-idf | 66.17 ± 0.56 | 66.61 ± 0.60 | 67.20 ± 0.62 | 66.17 ± 0.56 | 0.78 ± 0.01 | 4.40 ± 0.07 |
| SVM-rbf | bm25-tf-rf | [values missing] | | | | | |
| SVM-linear | binary | 68.46 ± 0.67 | 68.01 ± 0.01 | 69.01 ± 0.01 | 68.01 ± 0.01 | 0.78 ± 0.01 | 4.07 ± 0.09 |
| SVM-linear | c-score | 71.91 ± 0.44 | 72.46 ± 0.45 | 73.02 ± 0.44 | 71.91 ± 0.44 | 0.82 ± 0.01 | 3.50 ± 0.09 |
| SVM-linear | tf-idf | 69.54 ± 0.66 | 69.01 ± 0.01 | 70.01 ± 0.01 | 69.01 ± 0.01 | 0.78 ± 0.01 | 3.94 ± 0.06 |
| SVM-linear | tf-rf | 68.80 ± 0.62 | 68.01 ± 0.01 | 69.51 ± 0.01 | 69.01 ± 0.01 | 0.78 ± 0.01 | 3.74 ± 0.09 |
| SVM-linear | bm25-tf-idf | 66.26 ± 0.58 | 66.35 ± 0.60 | 67.94 ± 0.66 | 66.26 ± 0.58 | 0.78 ± 0.01 | 4.31 ± 0.07 |
| SVM-linear | bm25-tf-rf | [values missing] | | | | | |
| LR | binary | 67.19 ± 0.41 | 68.01 ± 0.01 | 68.01 ± 0.00 | 67.01 ± 0.01 | 0.78 ± 0.01 | 3.85 ± 0.07 |
| LR | c-score | 73.50 ± 0.64 | 73.89 ± 0.92 | 74.29 ± 0.66 | 73.50 ± 0.64 | 0.83 ± 0.01 | 3.40 ± 0.08 |
| LR | tf-idf | 63.17 ± 0.30 | 60.01 ± 0.00 | 66.01 ± 0.01 | 63.01 ± 0.00 | 0.74 ± 0.01 | 5.68 ± 0.04 |
| LR | tf-rf | 71.51 ± 0.46 | 72.01 ± 0.01 | 73.01 ± 0.01 | 71.01 ± 0.01 | 0.81 ± 0.01 | 3.24 ± 0.07 |
| LR | bm25-tf-idf | 67.80 ± 0.45 | 68.20 ± 0.47 | 68.61 ± 0.53 | 67.80 ± 0.45 | 0.79 ± 0.01 | 4.09 ± 0.06 |
| LR | bm25-tf-rf | [values missing] | | | | | |
| Perceptron | binary | 68.50 ± 0.48 | 69.01 ± 0.01 | 70.01 ± 0.01 | 68.01 ± 0.01 | 0.78 ± 0.03 | 4.07 ± 0.09 |
| Perceptron | c-score | 71.64 ± 1.54 | 71.76 ± 1.87 | 71.89 ± 1.38 | 71.64 ± 1.54 | 0.81 ± 0.01 | 3.67 ± 0.24 |
| Perceptron | tf-idf | 70.23 ± 0.40 | 70.01 ± 0.00 | 70.01 ± 0.01 | 70.01 ± 0.01 | 0.79 ± 0.01 | 3.83 ± 0.05 |
| Perceptron | tf-rf | 72.07 ± 1.86 | 72.01 ± 0.02 | 74.01 ± 0.01 | 72.01 ± 0.02 | 0.82 ± 0.02 | 3.29 ± 0.12 |
| Perceptron | bm25-tf-idf | 65.52 ± 0.52 | 65.97 ± 0.52 | 66.44 ± 0.56 | 65.52 ± 0.52 | 0.78 ± 0.01 | 4.48 ± 0.08 |
| Perceptron | bm25-tf-rf | [values missing] | | | | | |
| Feed-Forward NN | binary | 69.00 ± 0.76 | 69.52 ± 0.70 | 71.00 ± 0.52 | 69.00 ± 0.81 | 0.79 ± 0.02 | 3.65 ± 0.17 |
| Feed-Forward NN | c-score | 73.74 ± 0.88 | 74.07 ± 0.73 | 74.41 ± 0.67 | 73.74 ± 0.88 | 0.84 ± 0.02 | 3.27 ± 0.24 |
| Feed-Forward NN | tf-idf | 62.91 ± 0.79 | 63.32 ± 0.70 | 65.04 ± 0.52 | 62.91 ± 0.83 | 0.73 ± 0.02 | 4.00 ± 0.10 |
| Feed-Forward NN | tf-rf | 74.13 ± 1.33 | 74.17 ± 1.47 | 75.43 ± 1.07 | 74.13 ± 1.40 | 0.85 ± 0.02 | 3.07 ± 0.24 |
| Feed-Forward NN | bm25-tf-idf | 68.18 ± 1.83 | 68.79 ± 1.28 | 69.42 ± 0.76 | 68.18 ± 1.83 | 0.82 ± 0.02 | 4.07 ± 0.54 |
| Feed-Forward NN | bm25-tf-rf | [values missing] | | | | | |
The row with the best accuracy and F-score for each algorithm is shown in italics. The overall best performance is shown in bold.
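
In most rows of the table above the Recall column equals the Accuracy column. That is what one would expect if precision and recall are support-weighted averages over the ten classes, which is an assumption here since this excerpt does not state the averaging scheme. The sketch below uses a hypothetical three-class confusion matrix to show the identity.

```python
def weighted_scores(confusion):
    """Return (accuracy, weighted precision, weighted recall).

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total

    precision = recall = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])                                # true class c
        predicted = sum(confusion[i][c] for i in range(n_classes)) # predicted as c
        tp = confusion[c][c]
        weight = support / total
        precision += weight * (tp / predicted if predicted else 0.0)
        recall += weight * (tp / support if support else 0.0)
    return accuracy, precision, recall

# Hypothetical confusion matrix, not taken from the paper.
toy = [[8, 2, 0],
       [1, 6, 3],
       [0, 1, 9]]
accuracy, precision, recall = weighted_scores(toy)
print(accuracy, precision, recall)
```

Weighting each class recall by its share of the samples cancels the per-class denominators, so weighted recall reduces to correct predictions divided by all samples, which is accuracy.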
Class-based experiment test results with NN on the full-gene BM25-tf-rf dataset
| Cancer Type | F-Score | Precision | Recall | FPR |
|---|---|---|---|---|
| Lung | 85.47 ± 1.20 | 88.03 ± 2.00 | 83.16 ± 1.94 | 2.42 ± 0.48 |
| Breast | 95.92 ± 1.81 | 94.23 ± 2.41 | 97.69 ± 1.44 | 1.09 ± 0.47 |
| Brain | 69.80 ± 1.23 | 64.19 ± 3.47 | 77.13 ± 1.46 | 2.61 ± 2.32 |
| Kidney | 68.51 ± 1.14 | 73.59 ± 3.48 | 64.23 ± 2.22 | 2.72 ± 0.48 |
| Colorectal | 88.89 ± 1.92 | 88.21 ± 3.45 | 89.93 ± 2.89 | 1.28 ± 0.66 |
| Thyroid | 51.40 ± 3.35 | 47.86 ± 4.29 | 56.54 ± 4.43 | 14.79 ± 1.25 |
| Prostate | 39.80 ± 2.28 | 37.97 ± 4.03 | 42.32 ± 2.41 | 15.36 ± 1.03 |
| Skin | 89.56 ± 1.21 | 95.66 ± 3.49 | 84.38 ± 2.09 | 1.29 ± 0.26 |
| Stomach | 60.30 ± 2.33 | 74.45 ± 4.41 | 51.86 ± 4.50 | 10.31 ± 0.84 |
| Liver | 71.51 ± 2.20 | 83.98 ± 3.67 | 63.28 ± 4.00 | 7.75 ± 0.47 |
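
Per-class figures like those above are typically derived one-vs-rest from a confusion matrix. The following sketch shows that calculation under this assumption; the matrix and the three labels are hypothetical and the function name is illustrative.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest precision, recall, F-score and false positive rate per class."""
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for c, label in enumerate(labels):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                   # class c, predicted as other
        fp = sum(row[c] for row in confusion) - tp    # other classes, predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        fpr = fp / (fp + tn) if fp + tn else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f_score": f_score, "fpr": fpr}
    return metrics

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy = [[8, 2, 0],
       [1, 6, 3],
       [0, 1, 9]]
results = per_class_metrics(toy, ["Lung", "Breast", "Skin"])
for label, scores in results.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

A high FPR for a class means that samples of other cancer types are often assigned to it, which is a different failure from low recall on the class itself.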
Machine learning experiment test results on the separated exonic and intronic mutations
| Mutation Set | Accuracy | F-Score | Precision | Recall | ROC-AUC | FPR |
|---|---|---|---|---|---|---|
| exonic | 54.56 ± 1.18 | 55.52 ± 0.96 | 56.52 ± 0.83 | 54.56 ± 1.18 | 0.67 ± 0.01 | 5.44 ± 0.17 |
| intronic | 74.39 ± 1.58 | 75.54 ± 1.30 | 76.74 ± 1.10 | 74.39 ± 1.58 | 0.83 ± 0.01 | 2.91 ± 0.33 |
| all | 76.44 ± 0.66 | 76.95 ± 0.68 | 77.48 ± 0.78 | 76.44 ± 0.66 | 0.86 ± 0.02 | 2.75 ± 0.13 |
Fig. 3 Heat map of the most effective genes for breast cancer in the NN with the BM25-tf-rf model. A light-colored cell for a gene and a cancer type indicates that the gene is more influential in deciding that cancer type; a dark-colored cell indicates that it is less influential
Fig. 4 Heat map of the most effective genes for lung cancer in the NN with the BM25-tf-rf model. A light-colored cell for a gene and a cancer type indicates that the gene is more influential in deciding that cancer type; a dark-colored cell indicates that it is less influential