Tong Lin, Tiebing Liu, Yucheng Lin, Chaoting Zhang, Lailai Yan, Zhongxue Chen, Zhonghu He, Jingyu Wang.
Abstract
OBJECTIVES: Esophageal squamous cell carcinoma (ESCC) is the predominant form of esophageal carcinoma, with an extremely aggressive nature and a low survival rate. The risk factors for ESCC in the high-incidence areas of China remain unclear. We used machine learning methods to investigate whether altered serum levels of certain chemical elements are associated with ESCC.
SETTINGS: Primary healthcare unit in Anyang city, Henan Province of China.
PARTICIPANTS: 100 patients with ESCC and 100 healthy controls matched for age, sex and region were included.
PRIMARY AND SECONDARY OUTCOME MEASURES: The primary outcome was classification accuracy. The secondary outcome was the p value of the t-test or rank-sum test.
Keywords: Esophageal squamous cell carcinoma; chemical elements; machine learning
Year: 2017 PMID: 28947442 PMCID: PMC5623487 DOI: 10.1136/bmjopen-2016-015443
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Demographic characteristics of normal controls and patients with ESCC from Anyang, China, 2010
| Variable | Case (n=100), n (%) | Control (n=100), n (%) | p Value* |
| Age (years) | | | |
| Median (IQR) | 56 (55–62) | 59 (55–63) | |
| Gender | | | |
| Male | 60 (60) | 60 (60) | |
| Female | 40 (40) | 40 (40) | |
| History of regular alcohol consumption | | | |
| No | 82 (82) | 81 (81) | 0.856 |
| Yes | 18 (18) | 19 (19) | |
| History of regular cigarette smoking | | | |
| No | 54 (54) | 57 (57) | 0.669 |
| Yes | 46 (46) | 43 (43) | |
| Family history of ESCC | | | |
| No | 71 (71) | 83 (83) | 0.044 |
| Yes | 29 (29) | 17 (17) | |
*p Values derived from the χ² test.
ESCC, esophageal squamous cell carcinoma.
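The p values in the table above can be checked from the counts alone. The following is a minimal sketch, assuming a Pearson χ² test on each 2×2 table without a continuity correction (the source does not say whether a correction was applied); the function name `chi2_2x2` and the dictionary keys are illustrative, not taken from the paper.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Counts in the order (case No, case Yes, control No, control Yes), from the table above.
table1 = {
    "alcohol": (82, 18, 81, 19),
    "smoking": (54, 46, 57, 43),
    "family_history": (71, 29, 83, 17),
}
p_values = {name: chi2_2x2(*counts)[1] for name, counts in table1.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

Under that assumption the recomputed values come out close to the tabulated 0.856, 0.669 and 0.044, which suggests the uncorrected test is a reasonable reading of the footnote.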
Classification accuracies (in percentage), with runtime (in seconds) in parentheses, for classifying patients with ESCC
| Classifier | Origin | FFS | PCA | FDA | FDAx | LPP | FA |
| NB | 91.70 (0.14) | 93.35 (0.12) | 90.75 (0.17) | 93.90 (0.09) | 54.45 (0.18) | 89.40 (0.21) | 88.95 (0.22) |
| LR | 95.89 (1.70) | 94.89 (0.58) | 94.05 (0.39) | 94.99 (0.47) | 94.10 (0.63) | 91.31 (0.27) | 94.22 (0.53) |
| NN | 97.01 (5.30) | 95.05 (12.3) | 93.93 (6.40) | 94.81 (7.00) | 94.33 (6.80) | 91.44 (8.50) | 94.54 (8.50) |
| AB | 96.05 (76.8) | 95.19 (17.3) | 88.59 (18.2) | 94.26 (1.80) | 94.68 (19.4) | 71.56 (6.30) | 87.84 (12.1) |
| SVM | 97.23 (2.50) | 96.56 (1.53) | 94.15 (2.30) | 94.90 (0.80) | 92.15 (1.40) | 92.15 (1.50) | 94.25 (1.50) |
| RF | 98.38 (16.3) | 95.40 (15.0) | 91.51 (17.7) | 94.88 (15.8) | 94.23 (18.8) | 89.87 (18.5) | 91.94 (17.7) |
AB, AdaBoost; ESCC, esophageal squamous cell carcinoma; FA, factor analysis; FDA, Fisher discriminant analysis; FDAx, FDA with its variant; FFS, Fisher feature selection; LPP, locality preserving projection; LR, logistic regression; NB, Naive Bayes; NN, neural network; PCA, principal component analysis; RF, Random Forest; SVM, support vector machine.
The projection coefficients w in Fisher discriminant analysis and the Fisher discriminant ratio F used in Fisher feature selection
| Feature | w | F | Feature | w | F | Feature | w | F | Feature | w | F |
| Age | −0.7 | 0.1 | Bi | −1.9 | 19.7 | Se | −0.7 | 5.5 | Rb | −0.9 | 26.0 |
| Gender | 0.0 | 0.0 | Cs | 1.0 | 41.5 | Sr | 5.0 | 340.1 | Hg | −2.2 | 37.5 |
| Smoking | −0.1 | 0.2 | Th | 0.5 | 0.2 | Li | −1.5 | 9.9 | Pb | −0.7 | 6.6 |
| Drinking | −0.2 | 0.0 | U | −2.3 | 72.1 | Ni | 0.7 | 1.5 | Ca | −2.0 | 66.4 |
| Family history | −0.1 | 4.1 | La | 1.5 | 5.2 | Mo | 0.4 | 1.6 | Fe | −0.4 | 0.3 |
| Be | −0.4 | 1.1 | Ce | 2.3 | 1.1 | Ag | 0.3 | 0.0 | K | 1.3 | 96.5 |
| B | 1.0 | 68.3 | V | −1.4 | 41.7 | Cd | −1.6 | 3.2 | Mg | 0.3 | 96.5 |
| Al | −0.3 | 0.0 | Cr | −1.5 | 13.3 | Sn | −0.3 | 14.5 | Na | −1.0 | 43.0 |
| Ti | 0.7 | 47.0 | Mn | 1.3 | 5.5 | Ba | −0.6 | 6.8 | P | 1.7 | 135.4 |
| Ge | −0.7 | 17.8 | Cu | −1.4 | 39.4 | Pt | 0.1 | 0.0 | S | 4.7 | 173.1 |
| As | −1.1 | 17.8 | Zn | 1.3 | 22.1 | Tl | −2.1 | 31.8 | | | |
The two elements carrying the most discriminant information, Sr and S, are marked in bold. Other important elements (P, U, Ca, Tl, Bi and Hg) are marked in italics.
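To illustrate how a Fisher discriminant ratio ranks features, here is a minimal sketch using one common two-class form, F = (μ₁ − μ₂)² / (σ₁² + σ₂²). This is an assumption: the normalization behind the F column above is not stated in this excerpt, so the absolute values produced here need not match it and only the ordering is meaningful. The means and standard deviations are the rounded case and control values for Sr, Hg and Se from the hypothesis-test table later in this document; the function name `fisher_ratio` is illustrative.

```python
def fisher_ratio(mean_a, sd_a, mean_b, sd_b):
    """Two-class Fisher discriminant ratio (assumed form): squared mean gap over summed variances."""
    return (mean_a - mean_b) ** 2 / (sd_a ** 2 + sd_b ** 2)

# (case mean, case SD, control mean, control SD), rounded values from the hypothesis-test table.
summary = {
    "Sr": (34.0, 14.8, 108.0, 37.5),
    "Hg": (0.53, 0.35, 0.30, 0.11),
    "Se": (58.4, 18.0, 63.1, 12.6),
}

ratios = {element: fisher_ratio(*stats) for element, stats in summary.items()}

# Rank elements from most to least discriminative under this ratio.
ranked = sorted(ratios, key=ratios.get, reverse=True)
for element in ranked:
    print(f"{element}: {ratios[element]:.3f}")
```

The ordering agrees with the table: Sr separates the two groups far better than Hg, and Se carries very little discriminant information.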
Figure 1. Concentration distributions of eight important elements and one unimportant element (Se) for patients with oesophageal cancer and healthy controls.
Classification accuracies (in percentage) based on single, pair and triple elements
| Singles | Sr | Tl | Bi | U | Hg | Ca | P | S |
| NB | 94.41 | 67.75 | 59.10 | 76.65 | 64.55 | 74.90 | 81.50 | 85.20 |
| LR | 94.43 | 70.45 | 63.40 | 77.80 | 71.55 | 75.50 | 81.75 | 85.80 |
| NN | 93.65 | 70.30 | 64.50 | 77.10 | 71.45 | 75.85 | 81.90 | 86.30 |
| AB | 92.13 | 68.10 | 62.85 | 76.95 | 70.25 | 73.60 | 79.75 | 85.50 |
| SVM | 93.86 | 68.00 | 57.45 | 77.00 | 64.80 | 74.45 | 82.40 | 85.00 |
| RF | 91.50 | 58.10 | 57.85 | 66.40 | 65.20 | 65.55 | 73.30 | 79.55 |
| Pairs | Sr+U | Sr+Ca | Sr+P | Sr+S | U+P | U+S | Ca+S | P+S |
| NB | 96.38 | 93.15 | 95.52 | 93.72 | 86.82 | 87.65 | 83.35 | 83.25 |
| LR | 95.93 | 94.05 | 95.15 | 93.85 | 87.40 | 88.20 | 87.15 | 85.00 |
| NN | 94.25 | 92.95 | 95.37 | 92.43 | 85.85 | 86.90 | 86.80 | 84.20 |
| AB | 93.65 | 91.15 | 92.87 | 91.78 | 84.30 | 85.60 | 84.35 | 84.80 |
| SVM | 96.35 | 93.05 | 94.86 | 94.01 | 85.70 | 87.65 | 83.55 | 83.10 |
| RF | 94.00 | 91.75 | 94.45 | 92.22 | 82.33 | 84.90 | 85.10 | 82.80 |
| Triples | Sr+U+Ca | Sr+U+P | Sr+U+S | Sr+Ca+P | Sr+Ca+S | Sr+P+S | | |
| NB | 94.10 | 96.90 | 94.65 | 93.10 | 91.90 | 93.43 | | |
| LR | 95.80 | 96.48 | 96.20 | 95.20 | 95.65 | 94.00 | | |
| NN | 94.15 | 95.20 | 94.95 | 94.95 | 94.60 | 94.00 | | |
| AB | 93.35 | 94.85 | 93.40 | 93.85 | 92.40 | 92.10 | | |
| SVM | 95.65 | 96.98 | 95.65 | 94.25 | 93.75 | 93.70 | | |
| RF | 94.00 | 95.23 | 95.40 | 95.15 | 93.45 | 94.25 | | |
The best accuracy of each method is marked with bold font in each row.
AB, AdaBoost; LR, logistic regression; NB, Naive Bayes; NN, neural network; RF, Random Forest; SVM, support vector machine.
Figure 2. Distributions in normalised concentration for pairs of elements (Sr-S, U-P, Tl-Ca and Bi-Hg).
Figure 3. Distributions in normalised concentration for combinations of three elements (Sr-U-P, Sr-U-S, Sr-Ca-P and Sr-P-S).
Classification accuracies (in percentage) on removing a subset of features
| Classifiers | Whole | Whole − DemCha | Whole − LowCon | Whole − DemCha − LowCon |
| NB | 92.17 | 92.23 | 91.38 | 91.74 |
| LR | 95.92 | 96.43 | 95.58 | 95.88 |
| NN | 97.09 | 97.18 | 95.58 | 95.47 |
| AB | 96.90 | 96.89 | 96.30 | 96.63 |
| SVM | 97.35 | 97.95 | 95.71 | 96.08 |
| RF | 98.38 | 98.31 | 96.60 | 96.78 |
‘Whole’ means using all available input features in classification. ‘DemCha’ refers to the five demographic characteristics: age, gender, smoking history, drinking history and family history. ‘LowCon’ means the set of six elements whose concentrations were below the detection limit of the spectrometry.
AB, AdaBoost; LR, logistic regression; NB, Naive Bayes; NN, neural network; RF, Random Forest; SVM, support vector machine.
Results of hypothesis tests: means and standard deviations for cases and controls, and p values of the t-test and rank-sum test (RS test). p values below alpha=5% are boldfaced.
| Feature | Case | Control | t-Test | RS test | Feature | Case | Control | t-Test | RS test |
| Be | 0.10±0.09 | 0.10±0.10 | 0.298 | 0.988 | Bi | 0.19±0.23 | 0.10±0.09 | <0.001 | <0.001 |
| B | 33.5±24.1 | 74.2±45.3 | <0.001 | <0.001 | Cs | 0.53±0.20 | 0.72±0.24 | <0.001 | <0.001 |
| Al | 165±143 | 366±2111 | 0.924 | 0.189 | Th | 0.11±0.43 | 0.10±0.24 | 0.699 | 0.016 |
| Ti | 69.4±16.0 | 118±291 | <0.001 | <0.001 | U | 0.13±0.08 | 0.05±0.10 | <0.001 | <0.001 |
| Ge | 2.08±0.52 | 2.37±0.46 | <0.001 | <0.001 | La | 0.12±0.11 | 0.21±0.53 | 0.024 | 0.017 |
| As | 12.0±4.40 | 14.1±2.45 | <0.001 | <0.001 | Ce | 0.64±0.81 | 0.80±1.04 | 0.286 | <0.001 |
| Se | 58.4±18.0 | 63.1±12.6 | 0.020 | 0.010 | V | 0.78±0.50 | 0.83±4.41 | <0.001 | <0.001 |
| Sr | 34.0±14.8 | 108±37.5 | <0.001 | <0.001 | Cr | 10.1±10.5 | 5.27±7.38 | <0.001 | <0.001 |
| Li | 14.3±15.9 | 9.09±5.41 | 0.002 | 0.691 | Mn | 8.14±11.4 | 5.14±10.5 | 0.020 | <0.001 |
| Ni | 8.34±5.93 | 7.32±7.69 | 0.219 | 0.008 | Cu | 9145±254 | 1122±214 | <0.001 | <0.001 |
| Mo | 4.80±4.60 | 4.09±3.10 | 0.204 | 0.280 | Zn | 655±166 | 774±193 | <0.001 | <0.001 |
| Ag | 0.13±0.10 | 0.54±4.2 | 0.792 | 0.100 | Rb | 150±77.1 | 181±44.4 | <0.001 | <0.001 |
| Cd | 0.40±0.63 | 0.28±0.19 | 0.074 | 0.306 | Hg | 0.53±0.35 | 0.30±0.11 | <0.001 | <0.001 |
| Sn | 3.90±7.90 | 1.19±3.10 | <0.001 | <0.001 | Pb | 7.05±8.14 | 4.64±4.56 | 0.010 | 0.010 |
| Ba | 41.2±81.4 | 18.4±34.4 | 0.010 | 0.266 | Ca | 7.8×10⁴ | 9.3×10⁴ | <0.001 | <0.001 |
| Pt | 18.0±82.7 | 16.4±73.1 | 0.867 | <0.001 | Fe | 2124±2204 | 1966±1100 | 0.588 | 0.311 |
| Tl | 0.35±0.26 | 0.20±0.13 | <0.001 | <0.001 | K | 1.4×10⁵ | 1.6×10⁵ | <0.001 | <0.001 |
| P | 9.0×10⁵ | 1.3×10⁵ | <0.001 | <0.001 | Mg | 1.7×10⁴ | 2.2×10⁴ | <0.001 | <0.001 |
| S | 8.2×10⁵ | 1.2×10⁶ | <0.001 | <0.001 | Na | 3.0×10⁶ | 3.2×10⁶ | <0.001 | <0.001 |
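A two-sample t-test can be approximated from the tabulated summary statistics alone. The sketch below assumes Welch's unequal-variance statistic with n=100 per group (the group sizes given in the abstract) and uses a large-sample normal approximation for the two-sided p value; the source does not state which t-test variant was used, and the function name `welch_t_from_summary` is illustrative. Because the tabulated means and standard deviations are rounded, recomputed p values will only approximately match the table. The rank-sum test cannot be reproduced this way, since it needs the raw measurements.

```python
import math

def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic and a two-sided p value via a normal approximation (large samples)."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = (mean_a - mean_b) / standard_error
    # Two-sided tail probability of a standard normal: erfc(|t| / sqrt(2)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

N = 100  # participants per group, as stated in the abstract

# Case and control summaries for Pb and Li, taken from the table above.
t_pb, p_pb = welch_t_from_summary(7.05, 8.14, N, 4.64, 4.56, N)
t_li, p_li = welch_t_from_summary(14.3, 15.9, N, 9.09, 5.41, N)

print(f"Pb: t = {t_pb:.2f}, p = {p_pb:.4f}")
print(f"Li: t = {t_li:.2f}, p = {p_li:.4f}")
```

For both elements the approximation lands near the tabulated t-test p values (0.010 for Pb, 0.002 for Li). With roughly 100 participants per group the normal approximation differs only slightly from the exact t distribution.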