Yukun Chen, Jingchun Sun, Liang-Chin Huang, Hua Xu, Zhongming Zhao.
Abstract
Accurate classification of human cancer, including identification of its primary site, is important for understanding the disease and for developing effective therapeutic strategies. The large volume of available somatic mutation data provides an opportunity to investigate cancer classification with machine learning. Here, we explored the patterns of 1,760,846 somatic mutations identified from 230,255 cancer patients, along with gene function information, using a support vector machine. Specifically, we performed a multiclass classification experiment over 17 tumor sites using gene symbol, somatic mutation, chromosome, and gene functional pathway as predictors for 6,751 subjects. The baseline model, which used only gene features, achieved an accuracy of 0.57; adding mutation and chromosome information improved it to 0.62. Among the primary tumor sites, five (large intestine, liver, skin, pancreas, and lung) were predicted with an F-measure of 0.70 or higher. The large intestine model ranked first, with an F-measure of 0.87. The results demonstrate that somatic mutation information is useful for predicting primary tumor sites with machine learning models. To our knowledge, this study is the first investigation of primary site classification using machine learning and somatic mutation data.
Year: 2015 PMID: 26539502 PMCID: PMC4619847 DOI: 10.1155/2015/491502
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Study design using somatic mutations to classify primary tumor sites with a machine learning model. To represent mutations precisely, we generated a feature, gMutation, by binding each mutation to its corresponding gene symbol.
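The gMutation construction described in the Figure 1 caption can be illustrated with a minimal sketch. The record layout, the `GENE|mutation` separator, the feature-name prefixes, and both function names are assumptions made for illustration; the study's actual encoding is not specified here.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class MutationRecord(NamedTuple):
    """One somatic mutation observed in one sample (hypothetical layout)."""
    sample_id: str
    gene: str        # gene symbol
    mutation: str    # mutation label as reported by the source database
    chromosome: str  # chromosome carrying the gene


def sample_features(records: List[MutationRecord]) -> Dict[str, Set[str]]:
    """Collect gene, gMutation, and chromosome features for each sample."""
    features: Dict[str, Set[str]] = {}
    for rec in records:
        feats = features.setdefault(rec.sample_id, set())
        feats.add(f"gene={rec.gene}")
        # gMutation binds the mutation to its gene symbol, so the same
        # mutation label in two different genes yields two distinct features.
        feats.add(f"gMutation={rec.gene}|{rec.mutation}")
        feats.add(f"chromosome={rec.chromosome}")
    return features


def to_binary_matrix(
    features: Dict[str, Set[str]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn per-sample feature sets into 0/1 indicator rows."""
    vocab = sorted(set().union(*features.values()))
    index = {name: i for i, name in enumerate(vocab)}
    matrix: Dict[str, List[int]] = {}
    for sample_id, feats in features.items():
        row = [0] * len(vocab)
        for name in feats:
            row[index[name]] = 1
        matrix[sample_id] = row
    return vocab, matrix
```

Rows produced this way are sparse indicator vectors of the kind a linear classifier such as a support vector machine can consume. Whether the study used binary indicators or counts is not stated in this section.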
Distribution of primary tumor sites.
| Primary tumor site | Number of patients | Percentage (%) |
|---|---|---|
| Lung | 970 | 14.43 |
| Breast | 967 | 14.39 |
| Large intestine | 654 | 9.73 |
| Haematopoietic and lymphoid tissue | 644 | 9.58 |
| Kidney | 491 | 7.31 |
| Ovary | 490 | 7.29 |
| Liver | 400 | 5.95 |
| Central nervous system | 377 | 5.61 |
| Prostate | 374 | 5.56 |
| Endometrium | 261 | 3.88 |
| Pancreas | 252 | 3.75 |
| Autonomic ganglia | 222 | 3.30 |
| Skin | 184 | 2.74 |
| Oesophagus | 174 | 2.59 |
| Urinary tract | 110 | 1.64 |
| Upper aerodigestive tract | 91 | 1.35 |
| Stomach | 60 | 0.89 |
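As a rough check on the class distribution, the percentages can be recomputed from the patient counts, and the share of the largest class gives a majority-class baseline against which the reported accuracies can be read. This is a sketch only; the variable names are illustrative, and the total used is simply the sum of the counts in the table.

```python
# Patient counts copied from the distribution table above.
patients = {
    "Lung": 970,
    "Breast": 967,
    "Large intestine": 654,
    "Haematopoietic and lymphoid tissue": 644,
    "Kidney": 491,
    "Ovary": 490,
    "Liver": 400,
    "Central nervous system": 377,
    "Prostate": 374,
    "Endometrium": 261,
    "Pancreas": 252,
    "Autonomic ganglia": 222,
    "Skin": 184,
    "Oesophagus": 174,
    "Urinary tract": 110,
    "Upper aerodigestive tract": 91,
    "Stomach": 60,
}

total = sum(patients.values())

# Percentage of all patients contributed by each primary site.
percentages = {site: round(100 * n / total, 2) for site, n in patients.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(patients.values()) / total

for site, pct in percentages.items():
    print(f"{site}: {pct:.2f}%")
print(f"total patients: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because the classes are imbalanced (the largest site has roughly 16 times as many patients as the smallest), per-site precision and recall are more informative than overall accuracy alone.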
Mutation description.
| Mutation description | Definition |
|---|---|
| Substitution | A mutation involving the substitution of a single nucleotide |
| Substitution-nonsense | A substitution mutation resulting in a termination codon, foreshortening the translated peptide |
| Substitution-missense | A substitution mutation resulting in an alternate codon, altering the amino acid at this position only |
| Substitution-coding silent | A synonymous substitution mutation which encodes the same amino acid as the wild type codon |
| Substitution-intronic | A substitution mutation outside the coding domains; no interpretation is made as to its effect on splice sites or nearby regulatory regions |
| Insertion | An insertion of novel sequence into the gene |
| Insertion-in frame | An insertion of nucleotides which does not affect the gene's translation frame, leaving the downstream peptide sequence intact |
| Insertion-frameshift | An insertion of novel sequence which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination) |
| Deletion | A deletion of a portion of the gene's sequence |
| Deletion-in frame | A deletion of nucleotides which does not affect the gene's translation frame, leaving the downstream peptide sequence intact |
| Deletion-frameshift | A deletion of nucleotides which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination) |
| Complex | A compound mutation which may involve multiple insertions, deletions, and substitutions |
Micro- and macroaveraged accuracies of seven combinations of gene symbols with three other features.
| Feature combination | Number of features | miAccuracy | maAccuracy (mean) | maAccuracy (SD) |
|---|---|---|---|---|
| Gene | 21,286 | 0.57 | 0.57 | 0.019 |
| Gene + gMutation | 101,151 | 0.58 | 0.58 | 0.019 |
| Gene + Pathway | 21,571 | 0.58 | 0.58 | 0.010 |
| Gene + Chromosome | 21,311 | 0.60 | 0.60 | 0.022 |
| Gene + gMutation + Pathway | 101,436 | 0.60 | 0.60 | 0.013 |
| Gene + gMutation + Chromosome | 101,176 | 0.62 | 0.62 | 0.021 |
| Gene + gMutation + Pathway + Chromosome | 101,461 | 0.60 | 0.60 | 0.015 |
Note: miAccuracy is the microaverage accuracy; maAccuracy is the macroaverage accuracy, reported as the mean and standard deviation (SD) of the 10 per-fold accuracies from 10-fold cross-validation.
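The distinction between the two averages can be sketched as follows. The fold counts are made up for illustration, and the use of the sample standard deviation (rather than the population form) is an assumption; the note above does not say which was used.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Each fold is (number of correct predictions, number of test samples).
Fold = Tuple[int, int]


def micro_accuracy(folds: List[Fold]) -> float:
    """Pool all predictions across folds, then compute one accuracy."""
    correct = sum(c for c, _ in folds)
    total = sum(n for _, n in folds)
    return correct / total


def macro_accuracy(folds: List[Fold]) -> Tuple[float, float]:
    """Compute accuracy per fold, then return the mean and sample SD."""
    per_fold = [c / n for c, n in folds]
    return mean(per_fold), stdev(per_fold)


# Hypothetical folds of unequal size, not data from the study.
example_folds = [(40, 70), (45, 70), (38, 60)]
print(micro_accuracy(example_folds))
print(macro_accuracy(example_folds))
```

When all folds have the same size the two averages coincide, which is consistent with the identical miAccuracy and maAccuracy means in the table.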
Precision, recall, and F-measure for the best predictive model using “Gene,” “gMutation,” and “Chromosome” on each primary tumor site.
| Primary tumor site | Precision | Recall | F-measure |
|---|---|---|---|
| Large intestine | 0.88 | 0.85 | 0.87 |
| Liver | 0.88 | 0.72 | 0.79 |
| Skin | 0.91 | 0.61 | 0.73 |
| Pancreas | 0.75 | 0.67 | 0.71 |
| Lung | 0.66 | 0.75 | 0.70 |
| Endometrium | 0.91 | 0.52 | 0.67 |
| Kidney | 0.72 | 0.62 | 0.66 |
| Haematopoietic and lymphoid tissue | 0.50 | 0.75 | 0.60 |
| Breast | 0.50 | 0.75 | 0.60 |
| Central nervous system | 0.63 | 0.51 | 0.56 |
| Ovary | 0.40 | 0.49 | 0.44 |
| Prostate | 0.46 | 0.35 | 0.40 |
| Autonomic ganglia | 0.45 | 0.28 | 0.34 |
| Oesophagus | 0.81 | 0.20 | 0.31 |
| Urinary tract | 0.83 | 0.09 | 0.16 |
| Upper aerodigestive tract | 1.00 | 0.05 | 0.10 |
| Stomach | 0.60 | 0.05 | 0.09 |
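The F-measure column is consistent with the balanced F-measure (F1), the harmonic mean of precision and recall. The sketch below recomputes it for a few rows; treating the column as F1 is an assumption, and because the tabulated precision and recall are already rounded to two decimals, recomputed values can differ from the reported ones by about 0.01.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the table above.
rows = {
    "Large intestine": (0.88, 0.85, 0.87),
    "Liver": (0.88, 0.72, 0.79),
    "Skin": (0.91, 0.61, 0.73),
    "Urinary tract": (0.83, 0.09, 0.16),
    "Upper aerodigestive tract": (1.00, 0.05, 0.10),
}

for site, (p, r, reported) in rows.items():
    print(f"{site}: recomputed {f_measure(p, r):.3f}, reported {reported:.2f}")
```

The last two rows show why the harmonic mean is used: a precision of 0.83 or even 1.00 cannot compensate for a recall below 0.10, so the F-measure stays low for sites where the model rarely predicts the class at all.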
Summary of genes and samples used in the primary tumor site prediction.
| Primary tumor site | Number of genes | Number of true positives |
|---|---|---|
| Large intestine | 18,066 | 555 |
| Liver | 19,778 | 287 |
| Skin | 10,898 | 113 |
| Pancreas | 3,364 | 170 |
| Lung | 18,423 | 724 |
| Endometrium | 18,234 | 137 |
| Kidney | 10,601 | 302 |
| Haematopoietic and lymphoid tissue | 14,545 | 723 |
| Breast | 6,327 | 486 |
| Central nervous system | 2,773 | 192 |
| Ovary | 8,169 | 238 |
| Prostate | 5,875 | 132 |
| Autonomic ganglia | 1,425 | 62 |
| Oesophagus | 6,200 | 34 |
| Urinary tract | 3,288 | 10 |
| Upper aerodigestive tract | 1,013 | 5 |
| Stomach | 86 | 3 |
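One way to tie this table back to the overall result is to sum the true positives and divide by the number of patients. The sketch below assumes that each true-positive count is the number of correctly classified samples for that site and that the denominator is the sum of the patient counts in the distribution table (6,721); both are readings of the tables rather than statements from the text.

```python
# True positives per site, in the row order of the table above.
true_positives = [
    555, 287, 113, 170, 724, 137, 302, 723, 486,
    192, 238, 132, 62, 34, 10, 5, 3,
]

# Sum of the "Number of patients" column in the distribution table.
total_patients = 6721

total_correct = sum(true_positives)
overall_accuracy = total_correct / total_patients

print(f"correctly classified samples: {total_correct}")
print(f"overall accuracy: {overall_accuracy:.3f}")
```

Under these assumptions the pooled accuracy rounds to 0.62, matching the accuracy reported for the model using Gene, gMutation, and Chromosome.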
Figure 2. Comparison among the five sets of top 50 genes used in the machine learning modeling for five primary tumor sites (large intestine, liver, lung, pancreas, and skin).