Shesh N Rai1,2, Sudhir Srivastava2,3, Jianmin Pan1, Xiaoyong Wu1, Somesh P Rai4, Chongkham S Mekmaysy5, Lynn DeLeeuw5, Jonathan B Chaires5,6, Nichola C Garbett5,6.
Abstract
The thermoanalytical technique differential scanning calorimetry (DSC) has been applied to characterize protein denaturation patterns (thermograms) in blood plasma samples and relate these to a subject's health status. The analysis and classification of thermograms is challenging because of the high dimensionality of the dataset. There are various methods for group classification using high-dimensional data sets; however, the impact of using high-dimensional data sets for cancer classification has been poorly understood. In the present article, we propose a statistical approach for data reduction and a parametric method (PM) for modeling high-dimensional data sets for two- and three-group classification using DSC and demographic data. We compared the PM to the non-parametric classification method K-nearest neighbors (KNN) and the semi-parametric classification method KNN with dynamic time warping (DTW-KNN). We evaluated the performance of these methods for multiple two-group classifications: (i) normal versus cervical cancer, (ii) normal versus lung cancer, (iii) normal versus cancer (cervical + lung), (iv) lung cancer versus cervical cancer, as well as for three-group classification: normal versus cervical cancer versus lung cancer. In general, performance for two-group classification was high whereas three-group classification was more challenging, with all three methods predicting normal samples more accurately than cancer samples. Moreover, the specificity of the PM was mostly higher than or the same as that of KNN and DTW-KNN, with lower sensitivity. The performance of KNN and DTW-KNN decreased with the inclusion of demographic data, whereas similar performance was observed for the PM, which could be explained by the fact that the PM uses fewer parameters than the KNN and DTW-KNN methods and is thus less susceptible to overfitting.
More importantly, the accuracy of the PM can be increased by using a greater number of quantile data points and by the inclusion of additional demographic and clinical data, providing a substantial advantage over KNN and DTW-KNN methods.
Year: 2019 PMID: 31430304 PMCID: PMC6701772 DOI: 10.1371/journal.pone.0220765
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The composite line plot and error bar plot.
(A) Composite line plot of HC values at each temperature data point for 97 normal (green), 35 cervical cancer (red) and 54 lung cancer (blue) samples. (B) Composite error bar plot of HC values at each temperature data point for three groups: normal (green), cervical cancer (red) and lung cancer (blue). The circles represent mean values and the error bars represent the 95% confidence interval.
Demographic, clinical and data characteristics of the study group.
| | Control | Cervical cancer | Lung cancer |
|---|---|---|---|
| Male | 50 (51.5) | N/A | 22 (40.7) |
| Female | 47 (48.5) | 35 (100) | 32 (59.3) |
| African-American | 20 (20.6) | 4 (11.4) | 14 (25.9) |
| White | | | |
| Non-Hispanic or Latino | 50 (51.5) | 28 (80.0) | 40 (74.1) |
| Hispanic or Latino | 27 (27.8) | 3 (8.6) | 0 (0) |
| Age range (years) | 18–61 | 26–66 | 42–86 |
| Age (years), mean (SD) | 35.8 (11.2) | 46.5 (11.8) | 61.8 (11.6) |
| Stage I | N/A | 14 | 2 |
| Stage II | N/A | 10 | 3 |
| Stage III | N/A | 7 | 15 |
| Stage IV | N/A | 4 | 30 |
| Limited | N/A | N/A | 2 |
| Not staged | N/A | N/A | 2 |
| Data points per sample, original data: 45°C–90°C; 0.1°C intervals; duplicate scans | 902 | 902 | 902 |
| Data points per sample, truncated, averaged data: 48°C–80°C; 1°C intervals; averaged scans | 33 | 33 | 33 |
| Number of samples | 97 | 35 | 54 |
| Total data points, original data: 45°C–90°C; 0.1°C intervals; duplicate scans | 87,494 | 31,570 | 48,708 |
| Total data points, truncated, averaged data: 48°C–80°C; 1°C intervals; averaged scans | 3,201 | 1,155 | 1,782 |
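The data-point counts in the table follow from simple arithmetic on the temperature grid, which the following minimal sketch illustrates. It assumes the reduction keeps the value at each whole degree from 48°C to 80°C and averages the two duplicate scans; the record does not say whether whole-degree values were selected or interpolated, and `make_scan` and `reduce_scans` are hypothetical helpers operating on synthetic values.

```python
from statistics import mean

def make_scan(offset=0.0):
    """Synthetic thermogram: temperature (°C) -> value, 45.0 to 90.0 in 0.1 steps."""
    # Integer tenths keep the temperature grid free of floating-point drift.
    return {t10 / 10: offset + 0.001 * t10 for t10 in range(450, 901)}

def reduce_scans(scan_a, scan_b, t_min=48, t_max=80):
    """Average duplicate scans, keeping only whole degrees in [t_min, t_max]."""
    return {
        float(t): mean([scan_a[float(t)], scan_b[float(t)]])
        for t in range(t_min, t_max + 1)
    }

scan_a, scan_b = make_scan(0.0), make_scan(0.2)  # hypothetical duplicate scans
reduced = reduce_scans(scan_a, scan_b)

points_original = len(scan_a) + len(scan_b)  # 451 temperatures x 2 scans
points_reduced = len(reduced)                # one averaged value per degree

group_sizes = {"control": 97, "cervical": 35, "lung": 54}  # sample counts from the table
totals_original = {g: n * points_original for g, n in group_sizes.items()}
totals_reduced = {g: n * points_reduced for g, n in group_sizes.items()}

print(points_original, points_reduced)
print(totals_original)
print(totals_reduced)
```

Under these assumptions the per-sample counts (902 and 33) and the per-group totals reproduce the values in the table, a reduction of roughly 27-fold per sample.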
Fig 2. Overview of the classification methods.
(A) Parametric method, (B) K-nearest neighbors method and (C) Dynamic time warping-K-nearest neighbors method.
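To make the difference between panels (B) and (C) concrete, the sketch below classifies a curve by majority vote among its nearest training curves, once with Euclidean distance and once with a dynamic time warping distance. It is an illustration under stated assumptions, not the authors' implementation: the value of `k`, the absolute-difference cost inside `dtw`, and the toy curves are all hypothetical.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Pointwise distance between two curves sampled on the same grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic time warping distance with an absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def knn_predict(train, query, k=3, distance=euclidean):
    """train is a list of (curve, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy curves: "normal" peaks early, "cancer" peaks later and lower.
train = [
    ([0.0, 1.0, 3.0, 1.0, 0.0, 0.0], "normal"),
    ([0.0, 1.1, 2.9, 1.0, 0.1, 0.0], "normal"),
    ([0.1, 0.9, 3.1, 1.1, 0.0, 0.0], "normal"),
    ([0.0, 0.5, 1.5, 2.0, 1.5, 0.5], "cancer"),
    ([0.0, 0.6, 1.4, 2.1, 1.4, 0.6], "cancer"),
    ([0.1, 0.5, 1.6, 1.9, 1.5, 0.4], "cancer"),
]
query = [0.0, 0.5, 1.5, 2.0, 1.4, 0.5]

print(knn_predict(train, query, k=3))                # plain KNN
print(knn_predict(train, query, k=3, distance=dtw))  # DTW-KNN
```

The only change between the two calls is the distance function. DTW allows points to be matched across small shifts along the temperature axis, so two curves with the same shape but slightly displaced peaks are treated as close, whereas Euclidean distance compares them strictly point by point.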
Results of the normality test showing p-values at different temperature points using data transformations for three-group classification.
| Temperature (°C) | |||||
|---|---|---|---|---|---|
| 48 | 0 | 0.01 | 0.01 | 0 | 0 |
| 49 | 0 | 0 | 0 | 0 | 0 |
| 50 | 0 | 0 | 0 | 0 | 0 |
| 51 | 0 | 0.01 | 0.01 | 0 | 0 |
| 52 | 0 | 0.05 | 0.05 | 0 | 0 |
| 53 | 0 | 0.11 | 0.13 | 0 | 0 |
| 54 | 0 | 0.64 | 0.66 | 0 | 0 |
| 55 | 0 | 0.35 | 0.37 | 0 | 0 |
| 56 | 0 | 0.57 | 0.57 | 0 | 0 |
| 57 | 0 | 0.76 | 0.77 | 0 | 0 |
| 58 | 0 | 0.47 | 0.47 | 0 | 0 |
| 59 | 0 | 0.34 | 0.3 | 0 | 0 |
| 60 | 0 | 0.04 | 0.02 | 0 | 0 |
| 61 | 0 | 0.28 | 0.13 | 0 | 0 |
| 62 | 0.01 | 0.12 | 0.72 | 0.01 | 0.03 |
| 63 | 0.86 | 0 | 0.02 | 0.86 | 0.8 |
| 64 | 0.08 | 0 | 0 | 0.06 | 0.02 |
| 65 | 0.02 | 0 | 0 | 0.01 | 0.01 |
| 66 | 0.61 | 0.77 | 0.88 | 0.64 | 0.72 |
| 67 | 0.15 | 0.75 | 0.63 | 0.17 | 0.22 |
| 68 | 0.2 | 0.65 | 0.63 | 0.22 | 0.29 |
| 69 | 0.5 | 0.71 | 0.84 | 0.54 | 0.65 |
| 70 | 0.54 | 0.7 | 0.87 | 0.58 | 0.7 |
| 71 | 0.29 | 0.8 | 0.86 | 0.33 | 0.45 |
| 72 | 0.07 | 0.91 | 0.87 | 0.09 | 0.14 |
| 73 | 0 | 0.85 | 0.53 | 0 | 0 |
| 74 | 0 | 0.25 | 0.06 | 0 | 0 |
| 75 | 0 | 0.03 | 0 | 0 | 0 |
| 76 | 0 | 0.02 | 0 | 0 | 0 |
| 77 | 0 | 0.07 | 0.02 | 0 | 0 |
| 78 | 0 | 0.31 | 0.17 | 0 | 0 |
| 79 | 0 | 0.74 | 0.77 | 0 | 0 |
| 80 | 0.18 | 0 | 0 | 0.18 | 0.19 |
| Mean p-value | 0.11 | 0.34 | 0.34 | 0.11 | 0.13 |
| Temperature points with p > 0.05 (n) | 10 | 21 | 21 | 10 | 9 |
| Temperature points with p > 0.05 (%) | 30.3 | | | 30.3 | 27.27 |
Legend: H1 = log(H), H2 = logit(H/0.5),
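The three unlabeled summary rows at the bottom of the table appear to be the mean p-value, the number of temperature points with p > 0.05, and that number as a percentage of the 33 temperature points. The sketch below checks this reading against the first data column; the 0.05 threshold is an assumption inferred from the counts, and the column's transformation is not identified in this record.

```python
from statistics import mean

# p-values from the first data column of the table, 48°C to 80°C.
p_values = (
    [0] * 14                                   # 48–61°C
    + [0.01, 0.86, 0.08, 0.02, 0.61, 0.15,     # 62–67°C
       0.2, 0.5, 0.54, 0.29, 0.07]             # 68–72°C
    + [0] * 7                                  # 73–79°C
    + [0.18]                                   # 80°C
)

ALPHA = 0.05  # assumed significance threshold

mean_p = mean(p_values)
n_above = sum(p > ALPHA for p in p_values)
pct_above = 100 * n_above / len(p_values)

print(round(mean_p, 2), n_above, round(pct_above, 2))
```

For this column the three quantities round to 0.11, 10 and 30.3, matching the first entries of the summary rows. A larger count means the transformed values are compatible with normality at more temperature points.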
Transformation and model selected for different classifications.
| Data used | Classification | Model selected |
|---|---|---|
| DSC data | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G + T4:G |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T2:G + T3:G + T4:G |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G |
| DSC + demographic data | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G + T4:G + Ethnicity + Gender |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G + Ethnicity + Gender |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G + Ethnicity + Gender |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T2:G + T3:G + T4:G |
| | | H1 ~ T1 + T2 + T3 + T4 + G + T1:G + T2:G + T3:G + Ethnicity |
Normal/control (C), cervical cancer (CC), and lung cancer (LC). See Eqs 4 and 5 for definitions of the other terms.
Results of two-group classification methods.
| | Groups | Methods | Acc | Sens | Spec | PPV | NPV | Bal Acc |
|---|---|---|---|---|---|---|---|---|
| | | | 0.86 (0.05) | 0.83 (0.07) | 0.94 (0.06) | 0.97 (0.03) | 0.69 (0.10) | 0.89 (0.04) |
| | | | 0.96 (0.03) | 0.99 (0.02) | 0.86 (0.10) | 0.95 (0.03) | 0.98 (0.04) | 0.93 (0.05) |
| | | | 0.96 (0.03) | 1.00 (0.00) | 0.84 (0.10) | 0.95 (0.03) | 1.00 (0.01) | 0.92 (0.05) |
| | | | 0.85 (0.06) | 0.82 (0.08) | 0.91 (0.07) | 0.94 (0.04) | 0.74 (0.08) | 0.86 (0.05) |
| | | | 0.94 (0.03) | 0.97 (0.03) | 0.90 (0.08) | 0.95 (0.04) | 0.94 (0.05) | 0.93 (0.04) |
| | | | 0.94 (0.03) | 0.96 (0.03) | 0.91 (0.06) | 0.95 (0.03) | 0.94 (0.06) | 0.94 (0.03) |
| | | | 0.59 (0.08) | 0.54 (0.13) | 0.62 (0.12) | 0.50 (0.10) | 0.66 (0.07) | 0.58 (0.08) |
| | | | 0.69 (0.08) | 0.43 (0.15) | 0.87 (0.09) | 0.72 (0.17) | 0.69 (0.06) | 0.65 (0.08) |
| | | | 0.68 (0.07) | 0.38 (0.14) | 0.89 (0.08) | 0.73 (0.17) | 0.68 (0.05) | 0.64 (0.07) |
| | | | 0.86 (0.05) | 0.81 (0.07) | 0.90 (0.05) | 0.90 (0.05) | 0.82 (0.06) | 0.86 (0.04) |
| | | | 0.94 (0.03) | 0.96 (0.03) | 0.92 (0.05) | 0.93 (0.04) | 0.96 (0.04) | 0.94 (0.03) |
| | | | 0.93 (0.03) | 0.95 (0.04) | 0.92 (0.05) | 0.93 (0.04) | 0.94 (0.04) | 0.93 (0.03) |
| | | | 0.85 (0.05) | 0.82 (0.07) | 0.93 (0.06) | 0.97 (0.03) | 0.68 (0.09) | 0.88 (0.05) |
| | | | 0.83 (0.05) | 0.96 (0.04) | 0.49 (0.14) | 0.84 (0.04) | 0.86 (0.14) | 0.73 (0.07) |
| | | | 0.89 (0.05) | 0.99 (0.02) | 0.63 (0.16) | 0.88 (0.05) | 0.96 (0.07) | 0.81 (0.08) |
| | | | 0.85 (0.05) | 0.81 (0.07) | 0.91 (0.07) | 0.94 (0.04) | 0.74 (0.08) | 0.86 (0.05) |
| | | | 0.89 (0.04) | 0.91 (0.06) | 0.86 (0.08) | 0.92 (0.04) | 0.84 (0.08) | 0.88 (0.04) |
| | | | 0.81 (0.06) | 1.00 (0.01) | 0.47 (0.16) | 0.78 (0.05) | 0.99 (0.02) | 0.73 (0.08) |
| | | | 0.59 (0.08) | 0.54 (0.13) | 0.62 (0.12) | 0.50 (0.10) | 0.66 (0.07) | 0.58 (0.08) |
| | | | 0.77 (0.06) | 0.60 (0.15) | 0.89 (0.08) | 0.81 (0.12) | 0.77 (0.07) | 0.75 (0.07) |
| | | | 0.67 (0.01) | 0.80 (0.10) | 0.57 (0.15) | 0.58 (0.10) | 0.81 (0.10) | 0.69 (0.09) |
| | | | 0.86 (0.04) | 0.81 (0.07) | 0.91 (0.05) | 0.91 (0.05) | 0.82 (0.06) | 0.86 (0.04) |
| | | | 0.84 (0.04) | 0.88 (0.06) | 0.81 (0.08) | 0.84 (0.06) | 0.86 (0.06) | 0.84 (0.05) |
| | | | 0.87 (0.04) | 0.98 (0.03) | 0.76 (0.09) | 0.82 (0.06) | 0.97 (0.03) | 0.87 (0.05) |
Note: The accuracy measures are denoted by accuracy (Acc), sensitivity (Sens), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV) and balanced accuracy (Bal Acc). The groups are denoted by normal/control (C), cervical cancer (CC) and lung cancer (LC). The classification methods used are our proposed parametric method (PM), KNN and DTW-KNN. Mean values of accuracy measures are shown with standard deviation in parentheses. Mean values less than 0.50 are shaded in red, values from 0.50 to 0.84 are shaded in grey and values greater than or equal to 0.85 are shaded in green.
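For reference, the six measures named in the note can all be computed from the four cells of a two-group confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the study, and which group is treated as the positive class is an assumption left to the caller.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard two-group measures from confusion-matrix counts.

    tp/fn: positive-class samples predicted correctly/incorrectly.
    tn/fp: negative-class samples predicted correctly/incorrectly.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "Acc": (tp + tn) / (tp + fn + fp + tn),
        "Sens": sens,
        "Spec": spec,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Bal Acc": (sens + spec) / 2,
    }

# Hypothetical counts for one train/test split.
metrics = binary_metrics(tp=8, fn=2, fp=3, tn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Balanced accuracy is the mean of sensitivity and specificity, which is why it can differ noticeably from plain accuracy when the two groups have unequal sizes, as they do here (97, 35 and 54 samples).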
Results of three-group classification methods.
| | Methods | Groups | Sens | Spec | PPV | NPV | Bal Acc | Acc |
|---|---|---|---|---|---|---|---|---|
| | | | 0.84 (0.07) | 0.74 (0.08) | 0.78 (0.06) | 0.81 (0.07) | 0.79 (0.05) | 0.65 (0.05) |
| | | | 0.29 (0.13) | 0.88 (0.04) | 0.38 (0.15) | 0.84 (0.03) | 0.59 (0.07) | |
| | | | 0.55 (0.12) | 0.82 (0.05) | 0.55 (0.09) | 0.82 (0.04) | 0.68 (0.06) | |
| | | | 0.97 (0.03) | 0.91 (0.06) | 0.92 (0.04) | 0.96 (0.04) | 0.94 (0.03) | 0.80 (0.04) |
| | | | 0.42 (0.15) | 0.95 (0.03) | 0.70 (0.17) | 0.87 (0.03) | 0.69 (0.07) | |
| | | | 0.77 (0.10) | 0.84 (0.05) | 0.66 (0.07) | 0.90 (0.04) | 0.81 (0.05) | |
| | | | 0.95 (0.04) | 0.91 (0.05) | 0.92 (0.04) | 0.95 (0.04) | 0.93 (0.03) | 0.80 (0.04) |
| | | | 0.35 (0.13) | 0.96 (0.03) | 0.71 (0.18) | 0.86 (0.02) | 0.66 (0.07) | |
| | | | 0.81 (0.09) | 0.82 (0.05) | 0.64 (0.07) | 0.92 (0.04) | 0.81 (0.05) | |
| | | | 0.86 (0.07) | 0.74 (0.08) | 0.78 (0.06) | 0.83 (0.07) | 0.80 (0.05) | 0.65 (0.05) |
| | | | 0.23 (0.12) | 0.90 (0.04) | 0.35 (0.17) | 0.83 (0.02) | 0.56 (0.06) | |
| | | | 0.56 (0.13) | 0.80 (0.06) | 0.53 (0.09) | 0.82 (0.04) | 0.68 (0.07) | |
| | | | 0.90 (0.06) | 0.78 (0.08) | 0.81 (0.05) | 0.88 (0.06) | 0.84 (0.04) | 0.75 (0.04) |
| | | | 0.30 (0.12) | 0.96 (0.03) | 0.65 (0.21) | 0.85 (0.02) | 0.63 (0.06) | |
| | | | 0.80 (0.10) | 0.85 (0.06) | 0.70 (0.08) | 0.92 (0.04) | 0.83 (0.05) | |
| | | | 0.98 (0.02) | 0.72 (0.09) | 0.79 (0.05) | 0.98 (0.03) | 0.85 (0.05) | 0.75 (0.05) |
| | | | 0.50 (0.16) | 0.87 (0.05) | 0.50 (0.14) | 0.88 (0.04) | 0.69 (0.08) | |
| | | | 0.44 (0.15) | 0.96 (0.03) | 0.83 (0.13) | 0.81 (0.04) | 0.70 (0.07) | |
Note: The accuracy measures are denoted by sensitivity (Sens), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), balanced accuracy (Bal Acc) and accuracy (Acc). The groups are denoted by normal/control (C), cervical cancer (CC) and lung cancer (LC). The classification methods used are our proposed parametric method (PM), KNN and DTW-KNN. Mean values of accuracy measures are shown with standard deviation in parentheses. Mean values less than 0.50 are shaded in red, values from 0.50 to 0.84 are shaded in grey and values greater than or equal to 0.85 are shaded in green.
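In the three-group table, every block of three rows carries one extra value, which is consistent with per-group measures computed one-versus-rest plus a single overall accuracy per method. The sketch below shows that calculation for a 3×3 confusion matrix. The matrix entries are hypothetical, and the one-versus-rest reading is an assumption, since the record does not spell out the computation.

```python
def one_vs_rest_metrics(confusion, labels):
    """Per-class measures and overall accuracy from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    per_class = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in confusion) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        per_class[label] = {
            "Sens": sens,
            "Spec": spec,
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
            "Bal Acc": (sens + spec) / 2,
        }
    return per_class, correct / total

labels = ["C", "CC", "LC"]
# Hypothetical test-set counts: rows are true groups, columns are predictions.
confusion = [
    [16, 2, 2],
    [3, 3, 1],
    [2, 2, 7],
]
per_class, overall_acc = one_vs_rest_metrics(confusion, labels)
print(per_class["C"], overall_acc)
```

Because specificity for a small group is dominated by the many samples outside it, a class can show high specificity and low sensitivity at once, the pattern seen in the low-sensitivity, high-specificity rows of the table.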