Wenjuan Peng1, Yuan Sun1, Ling Zhang2.
Abstract
BACKGROUND: Although diagnostic methods for coronary atherosclerosis heart disease (CAD) are constantly being improved, early-stage CAD is still frequently missed because it produces no symptoms. Gene expression levels vary during disease development; therefore, a classifier based on gene expression might contribute to CAD diagnosis. This study aimed to construct genetic classification models for CAD using gene expression data, which may provide new insight into its pathogenesis.
Keywords: Classification model; Coronary atherosclerosis heart disease; Logistic regression; Machine learning; Random forest; Support vector machine
Year: 2022 PMID: 35151267 PMCID: PMC8840658 DOI: 10.1186/s12872-022-02481-4
Source DB: PubMed Journal: BMC Cardiovasc Disord ISSN: 1471-2261 Impact factor: 2.298
Fig. 1 Schematic overview of the study flow. CAD, coronary atherosclerosis heart disease
Information on the downloaded datasets
| Dataset | Case/control | Country | Specimen | Probe number | Platform |
|---|---|---|---|---|---|
| GSE12288 | 110/112 | Switzerland | Peripheral blood | 22483 | GPL96 |
| GSE66360 | 49/50 | USA | Circulating endothelial cells | 47000 | GPL570 |
| GSE7638 | 110/50 | Switzerland | Peripheral monocyte | 14500 | GPL571 |
Fig. 2 Volcano plots of the datasets. Red nodes represent genes with adjusted P < 0.001 and log2FC > 0.263. Blue nodes represent genes with adjusted P < 0.001 and log2FC < −0.263. The horizontal dotted line marks adjusted P = 0.001. The integrated dataset is the combination of GSE12288, GSE7638 and GSE66360. Differentially expressed genes between the case and control groups were identified using the limma package
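The colour rule in the volcano plots can be written as a small function. The following Python sketch is illustrative only and is not the authors' limma workflow; the function name `classify_gene` and the example inputs are hypothetical. The cutoff 0.263 is close to log2(1.2), so it corresponds to roughly a 1.2-fold change.

```python
import math

ADJ_P_CUTOFF = 0.001    # adjusted P threshold from the Fig. 2 caption
LOG2FC_CUTOFF = 0.263   # log2 fold-change threshold from the Fig. 2 caption


def classify_gene(log2_fc, adj_p):
    """Return 'up' (red), 'down' (blue) or 'ns' (not significant)."""
    if adj_p < ADJ_P_CUTOFF and log2_fc > LOG2FC_CUTOFF:
        return "up"
    if adj_p < ADJ_P_CUTOFF and log2_fc < -LOG2FC_CUTOFF:
        return "down"
    return "ns"


# Hypothetical genes, not values from the study.
print(classify_gene(0.50, 1e-5))   # large positive change, small adjusted P
print(classify_gene(-0.40, 1e-4))  # large negative change, small adjusted P
print(classify_gene(0.50, 0.01))   # fails the adjusted P cutoff
print(round(math.log2(1.2), 3))    # fold change that matches the cutoff
```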
Fig. 3 Weighted gene co-expression network analysis. A, B Scale-free network test, from which the soft-thresholding power was set to 5. C Hierarchical clustering. The branches of the tree represent clusters of genes, and the colors below the tree mark the gene modules corresponding to those clusters. D Correlation between gene modules and the trait (disease); red represents a positive correlation and green a negative correlation. E Hub genes. Red nodes represent hub genes selected with the thresholds absolute gene significance > 0.2 and absolute module membership > 0.8. The vertical dotted line marks absolute gene significance = 0.2, and the horizontal dotted line marks absolute module membership = 0.8. CAD, coronary atherosclerosis heart disease
Information on the 33 hub genes identified by weighted gene co-expression network analysis
| Gene symbol | Module | GS | P.GS | MM | P.MM |
|---|---|---|---|---|---|
|  | Black | −0.273 | 1.11E−09 | 0.810 | 8.54E−113 |
|  | Blue | 0.224 | 6.76E−07 | 0.853 | 6.29E−137 |
|  | Blue | 0.203 | 7.61E−06 | 0.845 | 7.38E−132 |
|  | Blue | 0.209 | 4.07E−06 | 0.821 | 3.04E−118 |
|  | Brown | 0.232 | 2.79E−07 | 0.807 | 3.38E−111 |
|  | Green | 0.211 | 3.01E−06 | 0.839 | 2.35E−128 |
|  | Green | 0.240 | 1.03E−07 | 0.842 | 5.09E−130 |
|  | Green | 0.220 | 1.09E−06 | 0.802 | 4.99E−109 |
|  | Green | 0.213 | 2.63E−06 | 0.834 | 2.15E−125 |
|  | Green | 0.210 | 3.35E−06 | 0.815 | 3.28E−115 |
|  | Green | 0.238 | 1.35E−07 | 0.801 | 2.41E−108 |
|  | Green | 0.271 | 1.58E−09 | 0.844 | 2.57E−131 |
|  | Turquoise | −0.298 | 2.63E−11 | 0.877 | 4.70E−154 |
|  | Turquoise | −0.334 | 5.92E−14 | 0.811 | 1.47E−113 |
|  | Turquoise | −0.339 | 2.17E−14 | 0.866 | 9.86E−146 |
|  | Turquoise | −0.222 | 8.74E−07 | 0.808 | 5.01E−112 |
|  | Turquoise | −0.436 | 1.21E−23 | 0.852 | 1.03E−136 |
|  | Turquoise | −0.383 | 3.52E−18 | 0.841 | 1.55E−129 |
|  | Turquoise | −0.254 | 1.64E−08 | 0.808 | 4.63E−112 |
|  | Turquoise | −0.214 | 2.16E−06 | 0.810 | 9.52E−113 |
|  | Turquoise | 0.233 | 2.49E−07 | −0.872 | 9.16E−151 |
|  | Turquoise | −0.237 | 1.55E−07 | 0.827 | 8.63E−122 |
|  | Turquoise | −0.239 | 1.23E−07 | 0.811 | 3.21E−113 |
|  | Turquoise | −0.285 | 2.14E−10 | 0.816 | 7.91E−116 |
|  | Turquoise | −0.223 | 7.75E−07 | 0.813 | 3.04E−114 |
|  | Yellow | 0.296 | 3.58E−11 | 0.841 | 2.05E−129 |
|  | Yellow | 0.254 | 1.56E−08 | 0.834 | 1.42E−125 |
|  | Yellow | 0.222 | 9.12E−07 | 0.870 | 3.90E−149 |
|  | Yellow | 0.259 | 8.04E−09 | 0.808 | 4.25E−112 |
|  | Yellow | 0.205 | 5.73E−06 | 0.816 | 7.83E−116 |
|  | Yellow | 0.203 | 7.19E−06 | 0.822 | 4.14E−119 |
|  | Yellow | 0.243 | 6.66E−08 | 0.882 | 3.17E−158 |
|  | Yellow | 0.252 | 2.25E−08 | 0.875 | 2.03E−152 |
GS, gene significance with coronary atherosclerosis heart disease; P.GS, P value for gene significance with coronary atherosclerosis heart disease; MM, module membership; P.MM, P value for module membership
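The hub-gene criterion (absolute GS > 0.2 and absolute MM > 0.8) is a filter on two columns. The Python sketch below is illustrative only and is not the authors' WGCNA code; `is_hub_gene` is a hypothetical helper. The first two rows reuse GS and MM values from the table, and the third row is a made-up counterexample.

```python
def is_hub_gene(gs, mm, gs_cutoff=0.2, mm_cutoff=0.8):
    """Apply the hub-gene thresholds to absolute GS and absolute MM."""
    return abs(gs) > gs_cutoff and abs(mm) > mm_cutoff


rows = [
    ("Black", -0.273, 0.810),      # values taken from the table
    ("Turquoise", 0.233, -0.872),  # negative MM still passes on absolute value
    ("Example", 0.150, 0.900),     # hypothetical gene failing the GS cutoff
]

for module, gs, mm in rows:
    print(module, is_hub_gene(gs, mm))
```

Using absolute values means that genes negatively correlated with the trait, or negatively correlated with their module eigengene, are kept as long as the association is strong enough.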
Fig. 4 Feature elimination curves of hub genes and heatmaps of the 12 optimal feature genes in the different datasets. A Feature elimination curves of hub genes. Root mean square error (RMSE) was the statistic used to determine the optimal feature genes in the recursive feature elimination analysis. The lowest RMSE corresponds to the optimal feature gene set, on which the models were trained by machine learning methods in 50% of the samples in GSE12288. B–E Heatmaps of the 12 optimal feature genes in the different datasets, drawn with the pheatmap package. Red and blue indicate high and low expression, respectively, of the 12 optimal feature genes among samples. Upregulation, genes more highly expressed in the case group than in the control group. Downregulation, genes expressed at lower levels in the case group than in the control group. The integrated dataset is the combination of GSE12288, GSE7638 and GSE66360
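Choosing the feature set from an elimination curve amounts to computing RMSE for each candidate subset size and taking the minimum. The Python sketch below shows only that selection step, under the assumption that an RMSE has already been obtained for each subset size; it does not reproduce the recursive feature elimination itself. The helper names and the curve values are hypothetical and are not results from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def best_subset_size(rmse_by_size):
    """Return the subset size with the lowest RMSE (smaller size wins ties)."""
    return min(rmse_by_size, key=lambda size: (rmse_by_size[size], size))


# Hypothetical elimination curve: number of genes kept -> cross-validated RMSE.
curve = {33: 0.40, 20: 0.37, 12: 0.35, 5: 0.38}
print(best_subset_size(curve))

# RMSE of a toy prediction with one error out of four binary labels.
print(rmse([1, 0, 1, 0], [1, 0, 0, 0]))
```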
Results of the limma analysis for the 12 optimal feature genes
| Gene symbol | GSE12288 fold change | GSE12288 adjusted P | GSE7638 fold change | GSE7638 adjusted P | GSE66360 fold change | GSE66360 adjusted P | Integrated dataset fold change | Integrated dataset adjusted P |
|---|---|---|---|---|---|---|---|---|
|  | 1.07 | 2.36E−01 | 1.20 | 7.07E−03 | 2.86 | 2.93E−09 | 1.27 | 1.34E−06 |
|  | 1.05 | 4.18E−04 | 1.06 | 1.99E−03 | 1.00 | 9.94E−01 | 1.08 | 6.80E−05 |
|  | 0.88 | 1.02E−22 | 0.95 | 4.57E−02 | 1.02 | 8.29E−01 | 0.90 | 3.94E−12 |
|  | 0.89 | 2.72E−08 | 0.93 | 1.39E−02 | 0.83 | 4.21E−02 | 0.88 | 1.22E−09 |
|  | 0.90 | 1.03E−17 | 0.92 | 6.26E−03 | 1.00 | 9.87E−01 | 0.88 | 7.23E−12 |
|  | 1.03 | 1.91E−01 | 1.29 | 4.81E−19 | 0.88 | 4.74E−01 | 1.17 | 3.29E−06 |
|  | 1.07 | 1.67E−01 | 1.16 | 5.43E−06 | 2.09 | 1.10E−03 | 1.28 | 9.15E−07 |
|  | 1.11 | 4.11E−10 | 1.00 | 9.68E−01 | 1.09 | 4.12E−01 | 1.10 | 8.22E−06 |
|  | 0.78 | 3.49E−22 | 0.90 | 3.42E−02 | 0.92 | 2.26E−01 | 0.84 | 1.05E−15 |
|  | 1.08 | 5.39E−09 | 1.06 | 3.58E−02 | 0.88 | 1.69E−01 | 1.09 | 3.93E−05 |
|  | 1.04 | 3.88E−01 | 1.19 | 1.74E−06 | 2.22 | 7.91E−06 | 1.25 | 2.60E−07 |
|  | 1.07 | 1.17E−04 | 1.27 | 1.02E−08 | 0.81 | 1.22E−01 | 1.15 | 3.07E−06 |
The integrated dataset is the combination of GSE12288, GSE7638 and GSE66360; fold change, the fold change in average gene expression level from the control group to the case group; adjusted P, the P value, adjusted by the Benjamini–Hochberg method, for the comparison of gene expression level between the case and control groups
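The two quantities defined in this footnote can be illustrated in a few lines. The Python sketch below implements the standard Benjamini–Hochberg step-up adjustment and a fold-change helper; it is not the limma implementation used in the study. The fold-change helper assumes that expression values are on a log2 scale, which the source does not state, and all function names and inputs are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def fold_change_from_log2(mean_case_log2, mean_control_log2):
    """Fold change from control to case, ASSUMING log2-scale expression."""
    return 2 ** (mean_case_log2 - mean_control_log2)


# Hypothetical raw P values, not values from the study.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(fold_change_from_log2(8.0, 7.0))
```

Under this convention, a fold change above 1 means higher average expression in the case group and a fold change below 1 means lower average expression.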
Fig. 5 ROC curves of classification by the SVM, RF and LR classifiers in the internal and external validation datasets. SVM, support vector machine; RF, random forest; LR, logistic regression; AUC, area under the ROC curve; ROC, receiver operating characteristic curve
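The AUC summarised in these charts has a direct probabilistic reading: it is the chance that a randomly chosen case receives a higher classifier score than a randomly chosen control, with ties counted as one half. The Python sketch below computes it by pairwise comparison; it is a minimal illustration under that definition, not the software used in the study, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = case, 0 = control) and classifier scores.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for datasets of a few hundred samples; a rank-based formula gives the same value more efficiently.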
Validation and performance evaluation results of the three machine learning classifiers
| Classifiers | AUC (95% CI) | Se (95% CI) | Sp (95% CI) | PPV | NPV | Correct rate |
|---|---|---|---|---|---|---|
| SVM a | 0.996 (0.989, 1.000) | 0.982 (0.906, 1.000) | 0.907 (0.797, 0.969) | 0.918 | 0.946 | 0.946 |
| SVM b | 0.813 (0.761, 0.866) | 0.780 (0.707, 0.842) | 0.717 (0.618, 0.803) | 0.816 | 0.756 | 0.756 |
| RF a | 0.995 (0.988, 1.000) | 0.983 (0.906, 1.000) | 0.907 (0.797, 0.969) | 0.919 | 0.955 | 0.955 |
| RF b | 0.727 (0.665, 0.788) | 0.723 (0.647, 0.791) | 0.525 (0.422, 0.627) | 0.696 | 0.636 | 0.636 |
| LR a | 0.991 (0.971, 1.000) | 0.965 (0.879, 0.996) | 0.982 (0.901, 1.000) | 0.982 | 0.973 | 0.973 |
| LR b | 0.783 (0.725, 0.841) | 0.516 (0.435, 0.596) | 0.869 (0.786, 0.928) | 0.859 | 0.640 | 0.640 |
SVM, support vector machine; RF, random forest; LR, logistic regression; Se, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the ROC curve; ROC, receiver operating characteristic curve
a Verified in 50% of the samples of GSE12288 (111/222)
b Verified in the integrated dataset of GSE7638 and GSE66360 (258)
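The Se, Sp, PPV, NPV and correct rate columns all derive from the four cells of a confusion matrix. The Python sketch below shows the standard definitions; the function name is hypothetical and the counts are invented for illustration, since the source does not report the confusion matrices.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # cases correctly classified
        "specificity": tn / (tn + fp),   # controls correctly classified
        "ppv": tp / (tp + fp),           # predicted cases that are true cases
        "npv": tn / (tn + fn),           # predicted controls that are true controls
        "correct_rate": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts: 50 cases and 50 controls, not data from the study.
for name, value in diagnostic_metrics(tp=45, fn=5, fp=10, tn=40).items():
    print(name, round(value, 3))
```

Sensitivity and specificity depend only on the classifier, whereas PPV and NPV also depend on the ratio of cases to controls in the validation set.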