Hong-Zhi Fang1, Dan-Li Hu1, Qin Li1, Su Tu1. 1. Department of Emergency, The Second Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu 214000, P.R. China.
Abstract
The present study aimed to identify genes associated with increased risk of myocardial infarction (MI) and to construct an early diagnosis model based on support vector machine (SVM) learning. The gene expression profile data of GSE34198, comprising 97 human blood samples (49 from patients with MI and 48 from healthy individuals), were obtained from the Gene Expression Omnibus database. Differentially expressed gene (DEG) screening, DEG enrichment analysis, protein‑protein interaction (PPI) network investigation and clustering analysis were performed. The feature genes were identified using the neighborhood score algorithm. Furthermore, a recursive feature elimination (RFE) algorithm was employed to screen risk factors among feature genes. The SVM prediction model was constructed and validated using the dataset GSE61144. A total of 1,207 DEGs (724 downregulated, 483 upregulated) between the two groups were identified. The PPI network comprised 1,083 nodes and 46,363 edges. In total, 87 genes were selected as candidate genes, and were primarily enriched in functions including 'G‑protein coupled receptor signaling' and pathways such as 'focal adhesion'. Furthermore, 15 genes with a high RFE score were selected to construct an SVM prediction model. The model's average accuracy was 86%. Validation on the independent dataset showed that the predictive precision reached 0.92. High expression levels of the genes vascular endothelial growth factor A, A‑kinase anchoring protein 12 and olfactory receptor 8D2 were potential risk factors for MI. The SVM early diagnosis model constructed from candidate genes could not only predict early MI, but also provide a risk probability according to the severity of MI.
Acute myocardial infarction (MI) is myocardial necrosis caused by acute and persistent ischemia/hypoxia of the coronary artery (1). As a life-threatening disease, MI can be complicated by arrhythmia, shock or heart failure (2). Although classical clinical diagnostic methods, such as characteristic electrocardiogram evolution and dynamic changes of serum biomarkers, have improved the outcome to a certain extent, MI remains a significant problem in terms of morbidity, mortality and healthcare costs globally (3). Therefore, effective identification of risk genes associated with the development of this disease is essential for patients with MI.
Genetic variants play important roles during the progression of MI (4). In certain areas, such as Japan, identification of polymorphisms of candidate genes can be beneficial to reveal the genetic risk of MI (5). The genes encoding proteins that affect hemostasis, such as coagulation factor XIII, play an essential role in the pathogenesis of MI and are ideal candidate genes for assessing the risk of acute MI (6). Bis et al (7) indicated that variation in inflammation-related genes, including those encoding interleukin (IL)-1β, IL-6 and C-reactive protein, is involved in the progression of nonfatal incident MI or the risk of ischemic stroke.
Mathematical modeling is an important tool for the investigation of MI epidemics (8). Support vector machine (SVM) is a supervised learning model used for classification and regression analysis (9). SVM has been successfully employed for the detection of acute MI using serial electrocardiograms (10). An SVM radial-based model provided improved classification performance compared with the linear SVM model, and the use of SVM models could improve disease classification performance (11). Despite these advances in the study of MI pathogenesis and research tools, the genes associated with a risk of MI remain unclear and an early diagnosis model based on SVM is yet to be developed. Thus, an investigation of abnormal genes and their related biological functions might be beneficial to reveal MI risk-associated genes and enable diagnostic model construction.
A previous study has explored the genetic predisposition to acute MI (12). Although genes associated with genetic risk of acute MI were revealed, the detailed molecular mechanisms of candidate genes and associated models for the clinical diagnosis of MI are still unclear. In the present study, an investigation of differentially expressed genes (DEGs), function and pathway enrichment analyses, protein-protein interaction (PPI) network analysis and clustering analysis were performed using previously reported data (12). Furthermore, an SVM prediction model was constructed and validated using other gene expression profiles. These findings may help to identify MI risk-associated genes, and develop an early diagnostic model based on these genes using SVM.
Materials and methods
Data resource
GSE34198 gene expression profile data (12) were downloaded from the Gene Expression Omnibus (GEO) database based on the GPL6102-11574 platform. The dataset was obtained from peripheral blood samples of 97 participants, including 49 samples from patients with acute MI (MI group) and 48 samples from healthy individuals (control group).
Data preprocessing and investigation of DEGs
The downloaded original data were processed using the RMA package (version 0.1.0; http://www.rdocumentation.org/packages/affy/versions/1.50.0/topics/rma) in R software (13). Prior to DEG analysis, the data were standardized using the Z-score method (13). Then, the Limma package (version 3.38.3) (14) in R was used to identify DEGs between the control and MI groups. Genes with P<0.05 and |log fold change (FC)|>1 were considered DEGs.
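The standardization and screening thresholds described above can be illustrated with a minimal Python sketch. It assumes that per-gene log fold changes and P-values have already been computed (in the study this was done with Limma in R); the function names `zscore` and `screen_degs` and the toy values are hypothetical and not taken from the study.

```python
def zscore(values):
    """Standardize values to mean 0 and unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def screen_degs(stats, p_cut=0.05, lfc_cut=1.0):
    """Split genes into up- and downregulated DEGs.

    stats maps gene -> (log_fc, p_value). The thresholds follow the text:
    P < 0.05 and |logFC| > 1.
    """
    up, down = [], []
    for gene, (log_fc, p_value) in stats.items():
        if p_value < p_cut and abs(log_fc) > lfc_cut:
            (up if log_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)


# Hypothetical toy statistics, not values from the study.
toy_stats = {
    "GENE_A": (1.8, 0.001),
    "GENE_B": (-1.2, 0.020),
    "GENE_C": (0.4, 0.001),   # fold change too small
    "GENE_D": (-2.5, 0.300),  # not significant
}
up_genes, down_genes = screen_degs(toy_stats)
```

Only `GENE_A` and `GENE_B` pass both thresholds here; the other two genes each fail one criterion.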
PPI network construction
Based on Human Protein Reference Database protein interaction data (15), the DEGs were mapped to a human protein interaction network, and their interactions were used as edges to construct an MI-specific PPI network. The degree (number of connections of a given protein) was used to evaluate the importance of target genes (16). The PPI network was constructed using Cytoscape (version 3.4.0) software (17). To complement the incomplete gene interaction network, the network was extended by introducing non-DEGs that interacted with at least 20 DEGs.
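The network extension rule and the degree measure can be sketched as follows. This is only an illustration under stated assumptions: the interactome is given as a list of protein pairs, the helper names `extend_network` and `degree` are hypothetical, and the toy example lowers the threshold from 20 to 2 so that it stays readable.

```python
from collections import defaultdict


def extend_network(interactions, degs, min_deg_partners=20):
    """Return (nodes, edges) of a DEG network extended with non-DEGs.

    interactions: iterable of (protein_a, protein_b) pairs from a reference interactome.
    degs: set of DEG symbols.
    A non-DEG is added only if it interacts with at least min_deg_partners DEGs.
    """
    deg_partners = defaultdict(set)
    for a, b in interactions:
        if a in degs and b not in degs:
            deg_partners[b].add(a)
        elif b in degs and a not in degs:
            deg_partners[a].add(b)
    extended = {gene for gene, partners in deg_partners.items()
                if len(partners) >= min_deg_partners}
    allowed = set(degs) | extended
    # Keep undirected edges whose two endpoints are both allowed nodes.
    edges = {tuple(sorted((a, b))) for a, b in interactions
             if a in allowed and b in allowed and a != b}
    nodes = {node for edge in edges for node in edge}
    return nodes, edges


def degree(edges):
    """Number of connections per node."""
    counts = defaultdict(int)
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


# Toy interactome; threshold lowered to 2 for illustration only.
toy_interactions = [("D1", "D2"), ("D1", "X"), ("D2", "X"), ("D3", "Y"), ("X", "Y")]
toy_nodes, toy_edges = extend_network(toy_interactions, {"D1", "D2", "D3"},
                                      min_deg_partners=2)
toy_degree = degree(toy_edges)
```

In the toy data, `X` touches two DEGs and is added, whereas `Y` touches only one and is left out, so `D3` ends up without any edge in the network.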
Feature gene investigation
Disease-related genes often participate in the same disease pathway or biological processes together with their various adjacent proteins. Since the proteins involved in disease pathways and their adjacent proteins are related in terms of expression, the genes that were associated with MI were identified using the neighborhood score (NS score) network algorithm (18). This algorithm combines the FC value of a central node with those of its neighboring nodes to quantify how strongly the node changes in the disease state and how it affects the genes around it, thereby identifying disease-related genes. According to the probability density distribution of the score, the nodes with the highest absolute scores were selected as the candidate feature genes.
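The text does not state the exact NS score formula, so the following Python sketch relies on an assumption: the score is taken as an equally weighted combination of a node's own fold change and the mean fold change of its direct neighbors. The weighting, the function names and the toy values are all hypothetical and may differ from the algorithm in reference (18).

```python
def neighborhood_scores(log_fc, neighbors, self_weight=0.5):
    """Combine each node's fold change with the mean fold change of its neighbors.

    ASSUMPTION: score = self_weight * FC(node) + (1 - self_weight) * mean FC(neighbors).
    The source text does not specify this weighting.
    """
    scores = {}
    for gene, fc in log_fc.items():
        nbrs = [n for n in neighbors.get(gene, ()) if n in log_fc]
        nbr_mean = sum(log_fc[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        scores[gene] = self_weight * fc + (1 - self_weight) * nbr_mean
    return scores


def top_by_abs_score(scores, cutoff):
    """Select nodes whose absolute score reaches the cutoff."""
    return sorted(gene for gene, s in scores.items() if abs(s) >= cutoff)


# Hypothetical toy network: A is connected to B and C.
toy_fc = {"A": 2.0, "B": 1.0, "C": -1.0}
toy_neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
toy_scores = neighborhood_scores(toy_fc, toy_neighbors)
toy_candidates = top_by_abs_score(toy_scores, cutoff=0.8)
```

Under this weighting a node with a modest fold change of its own can still score highly when its neighbors change strongly, which is the intuition the paragraph describes.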
Unsupervised hierarchical clustering analysis
To verify that the candidate feature genes could effectively distinguish the control group from the MI group, an unsupervised hierarchical clustering analysis was performed on all samples based on candidate feature genes. Pearson correlation coefficients were used to calculate a similarity matrix, and average linkage was used to calculate the value of linkage. The clustering results were visualized using a heatmap.
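As a rough illustration of this clustering step, the sketch below implements agglomerative clustering with a distance of 1 minus the Pearson correlation coefficient and average linkage in plain Python. The function names and the four toy samples are hypothetical; a real analysis would typically use an established statistics package and draw the heatmap from the resulting dendrogram.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters=2):
    """Agglomerative clustering with distance 1 - r and average linkage.

    profiles maps sample name -> expression values over the candidate genes.
    """
    names = list(profiles)
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for a, b in combinations(names, 2)}
    clusters = [frozenset([name]) for name in names]

    def linkage(c1, c2):
        # Average of all pairwise sample distances between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical toy profiles: MI samples share one pattern, controls the opposite one.
toy_profiles = {
    "MI_1": [5.0, 1.0, 4.0, 2.0],
    "MI_2": [6.0, 2.0, 5.0, 3.0],
    "CTRL_1": [1.0, 5.0, 2.0, 4.0],
    "CTRL_2": [2.0, 6.0, 3.0, 5.0],
}
toy_clusters = average_linkage(toy_profiles, n_clusters=2)
```

Because correlation ignores overall expression level, `MI_1` and `MI_2` are grouped together even though one profile is uniformly higher than the other.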
Enrichment analysis of DEGs
Using DAVID software (version 6.8) (19), Gene Ontology-biological process (GO-BP) annotation (20) and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis (21) were performed on DEGs. P<0.05 and a count >5 were chosen as the cut-off criteria for the present enrichment analysis. The enrichment process was realized using a corrected Fisher's exact test algorithm (22).
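The idea behind the enrichment test can be shown with a plain one-sided Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution. This sketch does not reproduce the corrected variant used by DAVID, which is more conservative; the function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric upper-tail) P-value, P(X >= k).

    N: genes in the background.
    K: background genes annotated to the term.
    n: genes in the query list.
    k: query genes annotated to the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Toy example: 10 background genes, 3 annotated to a term,
# 3 genes in the query list, 2 of which carry the annotation.
toy_p = enrichment_p(k=2, n=3, K=3, N=10)
```

A term would then be reported only if its P-value and gene count meet the cut-off criteria stated in the text.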
Feature selection of candidate feature genes
To optimize and screen out representative genes that could be used as clinical diagnostic markers for model construction, all candidate feature genes were enrolled for current feature selection. The recursive feature elimination (RFE) algorithm (23) in machine learning was used to evaluate the effectiveness of classifying and identifying patients with different risks through iterative random feature combination.
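A conventional backward form of RFE can be sketched as follows: fit a linear model, drop the feature with the smallest absolute weight, and repeat. This is a minimal illustration under stated assumptions, not the study's implementation. A simple perceptron stands in for the linear SVM that RFE is usually paired with, the function names are hypothetical, and the toy data are invented.

```python
def perceptron_weights(X, y, epochs=20):
    """Fit a simple perceptron and return its weights (stand-in for a linear SVM).

    X: list of feature rows, y: labels coded as +1 / -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w


def recursive_feature_elimination(X, y, names, n_keep, fit=perceptron_weights):
    """Refit repeatedly, each time dropping the feature with the smallest |weight|."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        weights = fit(X_sub, y)
        worst = min(range(len(active)), key=lambda k: abs(weights[k]))
        eliminated.append(names[active.pop(worst)])
    return [names[j] for j in active], eliminated


# Hypothetical toy data: G0 separates the classes, G1 is weak noise, G2 is constant.
toy_X = [[2.0, 0.1, 0.0], [-2.0, 0.1, 0.0], [1.5, -0.1, 0.0], [-1.5, -0.1, 0.0]]
toy_y = [1, -1, 1, -1]
kept, dropped = recursive_feature_elimination(toy_X, toy_y, ["G0", "G1", "G2"], n_keep=1)
```

In practice the number of retained features is chosen by comparing cross-validated accuracy across feature-set sizes, which is what the accuracy-versus-feature-count curve reports.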
SVM model investigation
The confusion matrix is a standard format for precision evaluation. The precision indices derived from it reflect the accuracy of classification from different aspects. A confusion matrix algorithm (24) in SVM was used to construct the confusion matrix. SVM is a blend of linear modeling and instance-based learning (25). An SVM selects a small number of critical boundary samples, called support vectors, from each category and builds a linear discriminant function that separates them as widely as possible (26). Five-fold cross-validation on a receiver operating characteristic (ROC) curve was used to evaluate the effectiveness of the model. To observe the distribution of samples under different characteristics intuitively, the result was visualized via two-dimensional and three-dimensional (3D) images.
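The evaluation quantities named in this paragraph can be illustrated without any particular SVM implementation. The sketch below assumes that class predictions and decision scores are already available from some classifier; the helper names, the contiguous fold layout and the toy labels are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true labels, columns are predicted labels."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical toy predictions, not results from the study.
toy_matrix = confusion_matrix(["MI", "MI", "MI", "CTRL", "CTRL"],
                              ["MI", "MI", "CTRL", "CTRL", "CTRL"],
                              classes=["MI", "CTRL"])
toy_accuracy = per_class_accuracy(toy_matrix)
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
fold_sizes = [len(test) for _, test in k_fold_indices(97, k=5)]
```

With 97 samples, five folds contain 20, 20, 19, 19 and 19 test samples. Real analyses usually shuffle and stratify the samples before splitting, which this sketch omits.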
Validation of independent data
The GSE61144 dataset (27) [seven pre-percutaneous coronary intervention (PCI) samples, seven post-PCI samples and 10 control samples; GPL6106 Sentrix Human-6 v2 Expression BeadChip platform] obtained from the GEO database was used as the validation data in the current study. In the independent data validation process, the normal healthy control group (10 control samples) and the MI disease group (seven pre-PCI samples and seven post-PCI samples) were used as two subgroups to verify the efficacy of the model in predicting patients with MI. The classification model was used to classify and identify 24 samples in the verification data.
Results
Identification of DEGs and PPI network investigation
A total of 1,207 DEGs, which included 724 downregulated and 483 upregulated genes, were obtained between the two groups with thresholds of P<0.05 and |logFC|>1. Based on these DEGs, a PPI network was further constructed (data not shown). There were 1,083 nodes and 46,363 edges in this network. Among the 1,083 nodes, there were 328 upregulated genes, 217 downregulated genes and 538 extended genes directly interacting with at least 20 DEGs.
Candidate gene exploration and unsupervised hierarchical clustering analysis
The probability density distribution of all DEGs was evaluated by calculating the NS score. An NS score of 0.8 indicated that the corresponding nodal degree and FC of genes had a high expression value. Thus, with a cutoff score of 0.8, a total of 87 DEGs, including EHBP1 (NS score=0.96), EXOC6B (NS score=0.96), GRB10 (NS score=0.92), A-kinase anchoring protein 12 (AKAP12; NS score=0.91) and SOX4 (NS score=0.91), were selected as candidate genes. Unsupervised hierarchical clustering was performed for these 87 DEGs (Fig. 1). Almost all the MI samples were clustered in the left cluster, while most of the normal samples were clustered in the right cluster. This indicated that the candidate genes identified by the neighborhood score algorithm could be used to distinguish MI samples from non-MI samples.
Figure 1.
Hierarchical clustering of candidate genes. The x-axis indicates different samples, and the y-axis indicates candidate genes. Blue and red denote the samples in the control group and the MI group, respectively. Gene expression values are expressed as a thermogram. Blue denotes upregulated genes in the MI samples and yellow denotes downregulated genes in the MI samples. MI, myocardial infarction.
Enrichment analysis
The functional enrichment of candidate genes was performed using Fisher's exact test (Table I). The result showed that these genes were mainly enriched in functions such as ‘G-protein coupled receptor signaling’ [olfactory receptor (OR)5I1, OR1A1, ENPP2, CD3E, LHCGR, NPBWR2, HTR4, AKAP12, OR1D2, OR1G1, OR51M1, OR8B8, OR7C1, OR51B5, OR8D2 and GLP1R] and pathways including ‘focal adhesion’ (EGFR, KRAS, PAK3, JUN, TGFA, MAPK8 and CAMK2A).
Table I.
Function and pathway enrichment of DEGs.
A, GO-BP analysis
Term | Count | P-value
ErbB signaling pathway | 7 | 4.903×10−5
Focal adhesion | 8 | 9.329×10−4
Pancreatic cancer | 5 | 1.500×10−3
Renal cell carcinoma | 5 | 1.500×10−3
Neurotrophin signaling pathway | 6 | 2.000×10−3
Olfactory transduction | 10 | 3.000×10−3
cAMP signaling pathway | 7 | 3.900×10−3
GnRH signaling pathway | 5 | 5.100×10−3
Oxytocin signaling pathway | 6 | 5.700×10−3
Choline metabolism in cancer | 5 | 7.300×10−3
Proteoglycans in cancer | 6 | 1.830×10−2
Neuroactive ligand-receptor interaction | 7 | 1.890×10−2
Insulin signaling pathway | 5 | 2.120×10−2
Hepatitis B | 5 | 2.490×10−2
Ras signaling pathway | 6 | 2.920×10−2
B, KEGG analysis
Term | Count | P-value
ErbB signaling pathway | 7 | 4.903×10−5
Focal adhesion | 8 | 9.329×10−4
Pancreatic cancer | 5 | 1.500×10−3
Renal cell carcinoma | 5 | 1.500×10−3
Neurotrophin signaling pathway | 6 | 2.100×10−3
Olfactory transduction | 10 | 3.000×10−3
cAMP signaling pathway | 7 | 3.900×10−3
GnRH signaling pathway | 5 | 5.100×10−3
Oxytocin signaling pathway | 6 | 5.700×10−3
Choline metabolism in cancer | 5 | 7.300×10−3
Proteoglycans in cancer | 6 | 1.830×10−2
Neuroactive ligand-receptor interaction | 7 | 1.890×10−2
Insulin signaling pathway | 5 | 2.120×10−2
Hepatitis B | 5 | 2.490×10−2
Ras signaling pathway | 6 | 2.920×10−2
GO-BP, Gene Ontology-biological process; KEGG, Kyoto Encyclopedia of Genes and Genomes; Count, the number of genes assembled/enriched in certain GO-BP function/KEGG pathway; DEG, differentially expressed gene.
Feature selection and subnetwork analysis for candidate genes
To improve the prediction accuracy, feature selection was performed using the RFE algorithm (Fig. 2). The model had the highest prediction accuracy when 15 features were combined (85%). The gene expression distributions of these 15 genes, HES5, ZNF417, GLRA2, OR8D2, HOXA7, FABP6, MUSK, HTR6, GRIP2, OR51M1, OR1C1, KLRK1, vascular endothelial growth factor A (VEGFA), AKAP12 and RHEB, are shown in Fig. 3. Most of the genes were upregulated in patients with MI, although the expression levels of the OR8D2, OR1C1, HES5 and VEGFA genes were lower in the MI group than those in the control group. These 15 feature genes and non-DEGs that interact with candidate genes were extracted from the PPI network to construct the subnetwork (Fig. 4). There were 107 nodes and 117 edges in the current subnetwork.
Figure 2.
Feature elimination of candidate genes. The x-axis indicates the number of features selected, and the y-axis indicates the prediction accuracy based on the selected feature set. Random combinations of any number of features in all candidate genes were used as the feature to compare prediction accuracy.
Figure 3.
Distribution of candidate genes in the two groups of samples. Red indicates the MI group; green indicates the control group. MI, myocardial infarction.
Figure 4.
Subnetwork constructed by candidate genes. The red polygon denotes upregulated DEGs. The green rhombus denotes downregulated DEGs. The blue square node denotes extended genes that interact directly with DEGs. DEG, differentially expressed gene.
Classification model constructed using candidate genes
A total of 15 genes obtained from the feature selection in this study were used as salient features to construct a classification model based on the SVM classifier (Fig. 5A). Five-fold cross-validation yielded an average area under the curve (AUC) value of 0.86, which was taken to indicate an average prediction accuracy of 86% for the model. To compare in greater detail the accuracy of the SVM classification model in predicting patients at high risk of MI and healthy controls, a confusion matrix was used for visualization (Fig. 5B). The prediction accuracies of the confusion matrix were 88 and 90% for the MI group and control group, respectively. The 3D distribution analysis of prominent features in the MI group and control group is shown in Fig. 5C. Significant differences were evident in the distribution between the two groups of samples. PPP1CC, GLP1R and ERCC3 were the first three genes of significance and were selected as the coordinate axes. Consequently, this indicated that the SVM model constructed in the present study from the MI-specific biomarkers could be used to predict those at high risk of MI.
Figure 5.
Classification model constructed using candidate genes. (A) ROC curve of SVM based on 15 genes as features. The x-axis indicates the false positive rate and the y-axis indicates the true positive rate. The five folds of the cross-validation are represented by five colors. The final fitted average is denoted by the black dotted line. (B) Confusion matrix obtained by constructing SVM classifiers based on 15 genes as features. The rows represent the true labels and the columns represent predicted labels. The more consistent the predicted labels are with the true labels, the more accurate the prediction is, and the closer the color is to red. (C) 3D analysis for efficiency of SVM model based on candidate genes. Red represents myocardial infarction samples and blue represents control samples. The color gradient is caused by overlapping samples; the darker the color, the more samples overlap. The three axes represent PPP1CC, GLP1R and ERCC3, which were the first three genes of significance. ROC, receiver operating characteristic; SVM, support vector machine.
Data validation
The validation of independent data was performed using the GSE61144 dataset obtained from the GEO database. The P-value distribution indicated that the P-values of pre-PCI (average 0.64) and post-PCI (average 0.51) samples were higher compared with those of the control samples (average 0.19), which indicated that the model could distinguish patients with MI from normal individuals. Meanwhile, the average P-value of pre-PCI samples was higher compared with that of post-PCI samples, which indicated that PCI treatment could alleviate the progression of MI (Fig. 6A). Moreover, the validation samples were divided into a control group and a patients with MI group. The ROC curve analysis of validation data showed that the AUC value calculated using the predicted results of the model was 0.92, indicating that the model accurately predicted MI (Fig. 6B).
Figure 6.
Results of independent data validation. (A) Distribution of P-values of samples in the three groups. The red dots represent the control samples. The green triangles represent the seven patients with MI post-PCI. The blue squares represent the patients with MI pre-PCI. The x-axis represents the index (24 samples in the validation set) and the y-axis represents the P-value corrected by control. (B) ROC curve analysis for model efficacy based on validation data. The x-axis represents the FPR and the y-axis represents the TPR. ROC, receiver operating characteristic; MI, myocardial infarction; PCI, percutaneous coronary intervention; AUC, area under curve; FPR, false positive rate; TPR, true positive rate.
Discussion
MI is a disease with high mortality and morbidity worldwide (28). A family history of MI is an important risk factor for MI, and so far, numerous studies have sought to identify genetic factors associated with MI (6,29,30). In the present study, in order to identify the MI-associated risk genes, a total of 1,207 DEGs were explored between two groups from the GSE34198 dataset, followed by a PPI network construction (1,083 nodes and 46,363 edges). A total of 87 candidate genes were identified by evaluating these genes using NS score. The 87 genes were mainly enriched in functions such as ‘G-protein coupled receptor signaling’ and pathways including ‘focal adhesion’. Furthermore, an RFE algorithm was used to screen out 15 genes with the highest prediction accuracy, which were further used to construct a prediction model based on SVM. Finally, a microarray dataset GSE61144 was used to verify that the accuracy of the model was 0.92.
AKAP12 is a member of the AKAP family, and serves an essential role in the morphogenesis of muscles (31). Members of the AKAP family participate in various biological functions associated with the heart, such as heart potassium channel phosphorylation (32) and cardiac muscle contraction (33). AKAP12 also enhances β2-adrenoceptor sensitivity in tracheal smooth muscle (34). In an animal model, a previous study revealed that AKAP12 regulated by heat shock protein A12B participates in ventricular dysfunction during the progression of MI (35). The biological function of AKAP12 is commonly realized by its participation in the G-protein coupled receptor pathway (36). G-protein coupled genes (such as P2RY2) have been shown to play an important role in the progression of atherosclerosis, which can lead to the development of MI (37). The variation of endothelial G-protein coupled receptor pathways in arteries contributes to compensated left ventricular hypertrophy (38). OR8D2 belongs to a subfamily of olfactory receptor genes (39). Aisenberg et al (40) showed that the OR family of genes participates in the biological function of airway smooth muscle and belongs to the superfamily of G-protein coupled receptors. A close relationship between the OR family and G-protein coupled receptors has previously been described (41). In the current study, genes including AKAP12 and OR8D2 were revealed as DEGs between patients with MI and healthy individuals, and thus were selected as candidate genes for MI prediction. Importantly, GO-BP function enrichment analysis showed that AKAP12 and OR8D2 were both associated with ‘G-protein coupled receptor signaling’. Thus, it was hypothesized that AKAP12 and OR8D2 might participate in the progression of MI via ‘G-protein coupled receptor signaling’.
The VEGF gene encodes a potent and selective angiogenic agent that is required for mesangial cell migration and survival (42). Endogenous VEGFA is responsible for mitogenic effects of macrophage chemoattractant protein-1 on vascular smooth muscle cells (43). The upregulation of VEGFA is associated with the progression of MI (44). Gene transfer of VEGF-A165 after MI affects angiogenic and cardiac functions (45). Drugs such as Danshen improve damaged cardiac angiogenesis and cardiac function induced by MI by modulating the VEGFA-related signaling pathway (46). Another previous drug experiment using an animal model indicated that puerarin accelerates cardiac angiogenesis and improves cardiac function of MI by upregulating VEGFA (47). SVM is a machine learning method developed on the basis of statistical learning theory. SVM uses the training error as the constraint condition of the optimization problem, and the minimum of the confidence range value as the optimization goal. SVM is the realization of a structural risk minimization principle (48). Furthermore, SVM was reported to contribute to the detection of an acute MI from a serial electrocardiogram (10). Autoregressive coefficients were demonstrated as being useful to characterize the feature of atrial fibrillation, and this feature could be classified using different statistical classifiers such as kernel SVM (49). Based on SVM, the automated risk identification of MI was realized based on certain features, which included the relative frequency band coefficient (50). In the present study, VEGFA was explored as a DEG and was revealed as a candidate gene for MI prediction. Importantly, the early diagnosis model of SVM constructed using 15 candidate genes, including VEGFA, could be used to predict patients at a high risk for MI.
Thus, it is proposed that the early diagnosis model of SVM can not only predict early MI, but also indicate the probability of risk according to the severity of MI. Genes including VEGFA might be novel candidate risk genes for MI prediction. Furthermore, AKAP12 and OR8D2 may participate in the progression of MI via G-protein coupled receptor signaling.
The present study has several limitations. More factors that may affect the accuracy of the prediction model need to be screened to determine the diagnostic efficacy of these biomarkers. It is also necessary to confirm whether the patient has received relevant treatment, such as nitroglycerin injection, before taking blood samples and whether the patient has other cardiovascular diseases. In addition, these results require validation in a larger cohort of patients with MI. In the future, a prospective study is required to validate the diagnostic potential of these biomarkers. Combining biomarkers with other diagnostic methods is also a worthwhile venture.