
Risk gene identification and support vector machine learning to construct an early diagnosis model of myocardial infarction.

Hong-Zhi Fang1, Dan-Li Hu1, Qin Li1, Su Tu1.   

Abstract

The present study aimed to identify genes associated with increased risk of myocardial infarction (MI) and construct an early diagnosis model based on support vector machine (SVM) learning. The gene expression profile data of GSE34198, containing 97 human blood samples including 49 patients with MI and 48 healthy individuals, were obtained from the Gene Expression Omnibus database. Differentially expressed gene (DEG) screening, DEG enrichment analysis, protein‑protein interaction (PPI) network investigation and clustering analysis were performed. The feature genes were identified using the neighborhood score algorithm. Furthermore, a recursive feature elimination (RFE) algorithm was employed to screen risk factors among feature genes. The SVM prediction model was constructed and validated using the dataset GSE61144. A total of 1,207 DEGs (724 downregulated, 483 upregulated) between the two groups were identified. The PPI network contained 1,083 nodes and 46,363 edges. In total, 87 genes were selected as candidate genes, and were primarily enriched in functions including 'G‑protein coupled receptor signaling' or pathways such as 'focal adhesion'. Furthermore, 15 genes with a high RFE score were selected to construct an SVM prediction model. The model's average accuracy was 86%. Verification on the independent dataset showed that the predictive precision reached 0.92. High expression of the genes vascular endothelial growth factor A, A‑kinase anchoring protein 12 and olfactory receptor 8D2 was a potential risk factor for MI. The SVM early diagnosis model constructed by candidate genes could not only predict early MI, but also provide risk probability according to the severity of MI.

Year:  2020        PMID: 32705275      PMCID: PMC7411293          DOI: 10.3892/mmr.2020.11247

Source DB:  PubMed          Journal:  Mol Med Rep        ISSN: 1791-2997            Impact factor:   2.952


Introduction

Acute myocardial infarction (MI) is myocardial necrosis caused by acute and persistent ischemia/hypoxia of the coronary artery (1). As a life-threatening disease, MI can be complicated by arrhythmia, shock or heart failure (2). Although classical clinical diagnostic methods, such as characteristic electrocardiogram evolution and dynamic changes of serum biomarkers, have improved the outcome to a certain extent, MI remains a significant problem in terms of morbidity, mortality and healthcare costs globally (3). Therefore, effective identification of risk genes associated with the development of this disease is essential for patients with MI. Genetic variants play important roles during the progression of MI (4). In certain areas, such as Japan, identification of polymorphisms of candidate genes can be beneficial to reveal the genetic risk of MI (5). The genes encoding proteins that affect hemostasis, such as coagulation factor XIII, play an essential role in the pathogenesis of MI and are ideal candidate genes for assessing the risk of acute MI (6). Bis et al (7) indicated that the variation in inflammation-related genes, including those encoding interleukin (IL)-1β, IL-6 and C-reactive protein, are involved in the progression of nonfatal incident MI or the risk of ischemic stroke. Mathematical modeling is an important tool for the investigation of MI epidemics (8). Support vector machine (SVM) is a supervised learning model used for classification and regression analysis (9). SVM has been successfully employed for the detection of acute MI using serial electrocardiograms (10). An SVM radial-based model provided improved classification performance compared with the linear SVM model, and the use of SVM models could improve disease classification performance (11). Despite these advances in the study of MI pathogenesis and research tools, the genes associated with a risk of MI remain unclear and an early diagnosis model based on SVM is yet to be developed. 
Thus, an investigation of abnormal genes and their related biological functions might be beneficial to reveal MI risk-associated genes and enable diagnostic model construction. A previous study has explored the genetic predisposition to acute MI (12). Although genes associated with genetic risk of acute MI were revealed, the detailed molecular mechanisms of candidate genes and associated models for the clinical diagnosis of MI are still unclear. In the present study, an investigation of differentially expressed genes (DEGs), function and pathway enrichment analyses, protein-protein interaction (PPI) network analysis and clustering analysis were performed using previously reported data (12). Furthermore, an SVM prediction model was constructed and validated using other gene expression profiles. These findings may help to identify MI risk-associated genes, and develop an early diagnostic model based on these genes using SVM.

Materials and methods

Data resource

GSE34198 gene expression profile data (12) were downloaded from the Gene Expression Omnibus (GEO) database based on the GPL6102-11574 platform. The dataset was obtained from peripheral blood samples of 97 participants, including 49 samples from patients with acute MI (MI group) and 48 samples from healthy individuals (control group).

Data preprocessing and investigation of DEGs

The downloaded original data were processed using the RMA package (version 0.1.0; http://www.rdocumentation.org/packages/affy/versions/1.50.0/topics/rma) in R software (13). To investigate the DEGs among different groups, the Z-score method was used for the standardization of data (13). Then, the Limma package (version 3.38.3) (14) in R was used to reveal DEGs between the control and MI groups. P<0.05 and |log fold change (FC)|>1 were considered to be the standards for the screening of DEGs.
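
The screening thresholds described above can be illustrated with a minimal Python sketch. It assumes that the log fold change and P-value for each gene have already been computed (for example, by a limma-style test, which is not reimplemented here); the function names and the example values are hypothetical and are not taken from GSE34198.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize a list of expression values to mean 0 and SD 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def screen_degs(stats, p_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with P < p_cutoff and |logFC| > lfc_cutoff.

    `stats` maps gene symbol -> (log fold change, P-value).
    Returns (upregulated, downregulated) gene lists, each sorted by name.
    """
    up, down = [], []
    for gene, (log_fc, p_value) in stats.items():
        if p_value < p_cutoff and abs(log_fc) > lfc_cutoff:
            (up if log_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)

# Hypothetical values for illustration only.
example_stats = {
    "GENE_A": (1.8, 0.003),
    "GENE_B": (-1.2, 0.020),
    "GENE_C": (0.4, 0.001),  # significant, but the effect is too small
    "GENE_D": (2.5, 0.300),  # large effect, but not significant
}
up_genes, down_genes = screen_degs(example_stats)
standardized = z_score([1.0, 2.0, 3.0])
```

In this toy input only GENE_A and GENE_B pass both thresholds, which mirrors how the two criteria are applied jointly rather than separately.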

PPI network construction

Based on Human Protein Reference Database protein interaction data (15), the DEGs were mapped to a human protein interaction network, and the interaction relationship was edged to construct an MI-specific PPI network. The degree (number of connections for the target protein) was used to evaluate the important target genes (16). The PPI network was constructed based on Cytoscape (version 3.4.0) software (17). To complement the incomplete gene interaction network, the network was extended by introducing non-DEGs that interacted with at least 20 DEGs.
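
As a rough illustration of the network extension step, the following Python sketch adds non-DEGs that interact with a minimum number of DEGs and then computes node degree. The edge list, gene names and the lowered threshold in the example are hypothetical; the study itself used Human Protein Reference Database interactions, Cytoscape and a threshold of 20.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from (gene, gene) pairs; self-loops are ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def extend_network(edges, degs, min_deg_partners=20):
    """Return (nodes, edges, degree) of the DEG network plus extended genes.

    A non-DEG is added as an extended gene when it interacts with at least
    `min_deg_partners` DEGs in the reference interaction data.
    """
    adj = build_adjacency(edges)
    degs = set(degs)
    extended = {
        gene for gene, nbrs in adj.items()
        if gene not in degs and len(nbrs & degs) >= min_deg_partners
    }
    nodes = (degs & set(adj)) | extended  # DEGs without any interaction drop out
    kept_edges = {
        tuple(sorted((a, b))) for a, b in edges
        if a in nodes and b in nodes and a != b
    }
    degree = defaultdict(int)
    for a, b in kept_edges:
        degree[a] += 1
        degree[b] += 1
    return nodes, kept_edges, dict(degree)

# Hypothetical toy interactions; the threshold is lowered to 2 so it can trigger.
toy_edges = [("D1", "X"), ("D2", "X"), ("D1", "Y"), ("D1", "D2"), ("Y", "Z")]
nodes, kept_edges, degree = extend_network(toy_edges, {"D1", "D2"}, min_deg_partners=2)
```

Here X is retained because it touches both DEGs, whereas Y (one DEG partner) and Z (none) are excluded.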

Feature gene investigation

Disease-related genes often participate in the same disease pathway or biological processes together with their various adjacent proteins. Since the proteins involved in disease pathways and their adjacent proteins are related in terms of expression, the genes that were associated with MI were identified using the neighborhood score (NS score) network algorithm (18). This algorithm calculates the FC value of the central node and its surrounding neighbor nodes to calculate the degree of node changes in the disease state and its impact on other genes around it, so as to identify disease-related genes. According to the probability density distribution of the score, the nodes with the highest absolute scores were selected as the candidate feature genes.
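
The text does not give the scoring formula. The sketch below assumes one commonly cited form of the neighborhood score, in which a node's own fold change and the mean fold change of its network neighbors are weighted equally; this weighting, the function names and the toy values are assumptions for illustration and may differ from the implementation used in the study.

```python
def neighborhood_scores(log_fc, adjacency):
    """Assumed NS score: 0.5 * FC(node) + 0.5 * mean FC of its neighbors.

    `log_fc` maps gene -> log fold change; `adjacency` maps gene -> set of
    neighboring genes. Neighbors without a fold-change value are skipped.
    """
    scores = {}
    for node, fc in log_fc.items():
        nbrs = [n for n in adjacency.get(node, ()) if n in log_fc]
        nbr_mean = sum(log_fc[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        scores[node] = 0.5 * fc + 0.5 * nbr_mean
    return scores

def select_candidates(scores, cutoff):
    """Genes whose absolute score reaches the cutoff, sorted by name."""
    return sorted(g for g, s in scores.items() if abs(s) >= cutoff)

# Hypothetical toy network: A is connected to B and C.
toy_fc = {"A": 2.0, "B": 1.0, "C": -1.0}
toy_adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
toy_scores = neighborhood_scores(toy_fc, toy_adj)
candidates = select_candidates(toy_scores, cutoff=0.8)
```

In the toy network, C changes in the opposite direction to its only neighbor, so its score is pulled towards zero and it falls below the cutoff.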

Unsupervised hierarchical clustering analysis

To verify that the candidate feature genes could effectively distinguish the control group from the MI group, an unsupervised hierarchical clustering analysis was performed on all samples based on candidate feature genes. Pearson correlation coefficients were used to calculate a similarity matrix, and average linkage was used to calculate the value of linkage. The clustering results were visualized using a heatmap.
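
The clustering settings above (Pearson correlation as similarity, average linkage) can be sketched in plain Python as follows. This is a naive illustration under the assumption that distance is defined as 1 minus the Pearson coefficient; the sample names and expression values are hypothetical, and a real analysis would normally rely on an established clustering library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_linkage(profiles, n_clusters=2):
    """Agglomerative clustering with distance 1 - r and average linkage.

    `profiles` maps sample name -> expression values over the same genes.
    Returns a sorted list of clusters, each a sorted list of sample names.
    """
    names = sorted(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[a, b] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)  # average inter-cluster distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Hypothetical expression profiles over four genes.
toy_profiles = {
    "MI_1": [3.0, 1.0, 2.0, 0.0],
    "MI_2": [6.0, 2.0, 4.0, 1.0],
    "CTRL_1": [0.0, 2.0, 1.0, 3.0],
    "CTRL_2": [1.0, 4.0, 2.0, 5.0],
}
toy_clusters = average_linkage(toy_profiles, n_clusters=2)
```

With these toy profiles the two MI samples and the two control samples each form their own cluster, which is the kind of separation the heatmap is intended to show.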

Enrichment analysis of DEGs

Using DAVID software (version 6.8) (19), Gene Ontology-biological function (GO-BP) annotation (20) and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis (21) were performed on DEGs. P<0.05 and a count >5 were chosen as the cut-off criteria for the present enrichment analysis. The enrichment process was realized using a corrected Fisher's exact test algorithm (22).
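
For reference, a plain one-sided Fisher's exact test for term enrichment can be written as a hypergeometric tail probability. The sketch below shows only this uncorrected form; the corrected variant applied by DAVID is not reproduced, and the parameter names and example counts are hypothetical.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric) P-value, P(X >= k).

    N: genes in the background
    K: background genes annotated to the term
    n: genes in the query list
    k: query genes annotated to the term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 2 of 3 query genes hit a term annotated to 3 of 10 genes.
p_example = enrichment_p(k=2, n=3, K=3, N=10)
```

A term would then be reported when its P-value falls below 0.05 and its count satisfies the stated cut-off.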

Feature selection of candidate feature genes

To optimize and screen out representative genes that could be used as clinical diagnostic markers for model construction, all candidate feature genes were enrolled for current feature selection. The recursive feature elimination (RFE) algorithm (23) in machine learning was used to evaluate the effectiveness of classifying and identifying patients with different risks through iterative random feature combination.
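
The exact RFE procedure is not specified in the text. The following sketch shows the standard weight-based form of RFE as an assumption: a linear SVM is refitted on the remaining features and the feature with the smallest absolute weight is removed in each round. The small bias-free SVM trained by subgradient descent is a stand-in written for this example, and the gene names and data are hypothetical.

```python
def fit_linear_svm(X, y, lam=0.01, lr=0.5, epochs=50):
    """Weights of a bias-free linear SVM fitted by full-batch subgradient descent.

    X: list of feature rows; y: labels in {-1, +1}.
    Minimizes lam/2 * |w|^2 + mean hinge loss.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # hinge loss is active for this sample
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def rfe(X, y, names, n_keep):
    """Drop the feature with the smallest |weight| until n_keep features remain.

    Returns (kept feature names, eliminated names in order of removal).
    """
    keep = list(range(len(names)))
    eliminated = []
    while len(keep) > n_keep:
        Xk = [[row[j] for j in keep] for row in X]
        w = fit_linear_svm(Xk, y)
        worst = min(range(len(keep)), key=lambda j: abs(w[j]))
        eliminated.append(names[keep.pop(worst)])
    return [names[j] for j in keep], eliminated

# Hypothetical data: only GENE_A tracks the class label.
toy_X = [
    [2.0, 0.5, -0.5],
    [2.0, -0.5, 0.5],
    [-2.0, 0.5, -0.5],
    [-2.0, -0.5, 0.5],
]
toy_y = [1, 1, -1, -1]
kept, dropped = rfe(toy_X, toy_y, ["GENE_A", "GENE_B", "GENE_C"], n_keep=1)
```

In practice the number of features to keep would be chosen by comparing cross-validated accuracy across feature-set sizes, as in the accuracy curve reported for the candidate genes.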

SVM model investigation

The confusion matrix is a standard format for evaluating classification accuracy, and the indices derived from it reflect different aspects of that accuracy. A confusion matrix algorithm (24) was used to construct the confusion matrix for the SVM classifier. SVM is a blend of linear modeling and instance-based learning (25). An SVM selects a small number of critical boundary samples, called support vectors, from each category and builds a linear discriminant function that separates them as widely as possible (26). Five-fold cross-validation with receiver operating characteristic (ROC) curve analysis was used to evaluate the effectiveness of the model. To observe the distribution of samples under different characteristics intuitively, the result was visualized via two-dimensional and three-dimensional (3D) images.
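
The evaluation quantities mentioned above can be sketched without any external library. The code below builds a confusion matrix, computes the area under the ROC curve as a rank statistic, and produces simple unshuffled five-fold index splits; the labels, scores and the 0.5 decision threshold are hypothetical choices for illustration.

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels and columns are predicted labels."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k=5):
    """Split sample indices into k interleaved test folds (no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]

# Hypothetical labels (1 = MI, 0 = control) and classifier scores.
toy_true = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_matrix = confusion_matrix(toy_true, toy_pred)
toy_auc = roc_auc(toy_true, toy_scores)
toy_folds = k_fold_indices(10, k=5)
```

Note that AUC and accuracy are different quantities: the AUC summarizes ranking quality over all thresholds, whereas the confusion matrix describes performance at one chosen threshold.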

Validation of independent data

The GSE61144 dataset (27) [seven pre-percutaneous coronary intervention (PCI) samples, seven post-PCI samples and 10 control samples; GPL6106 Sentrix Human-6 v2 Expression BeadChip platform] obtained from the GEO database was used as the validation data in the current study. In the independent data validation process, the normal healthy control group (10 control samples) and the MI disease group (seven pre-PCI samples and seven post-PCI samples) were used as two subgroups to verify the efficacy of the model in predicting patients with MI. The classification model was used to classify and identify 24 samples in the verification data.

Results

Identification of DEGs and PPI network investigation

A total of 1,207 DEGs, which included 724 downregulated and 483 upregulated genes, were obtained among the groups with thresholds of P<0.05 and |logFC|>1. Based on these DEGs, a PPI network was further constructed (data not shown). There were 1,083 nodes and 46,363 edges in this network. Among the 1,083 nodes, there were 328 upregulated genes, 217 downregulated genes and 538 extended genes directly interacting with at least 20 DEGs.

Candidate gene exploration and unsupervised hierarchical clustering analysis

The probability density distribution of all DEGs was evaluated by calculating the NS score. An NS score of 0.8 indicated that the corresponding nodal degree and FC of genes had a high expression value. Thus, with a score of 0.8, a total of 87 DEGs, including EHBP1 (NS score=0.96), EXOC6B (NS score=0.96), GRB10 (NS score=0.92), A-kinase anchoring protein 12 (AKAP12; NS score=0.91) and SOX4 (NS score=0.91) were selected as candidate genes. Unsupervised hierarchical clustering was performed for these 87 DEGs (Fig. 1). Almost all the MI samples were clustered in the left cluster, while most of the normal samples were clustered in the right cluster. This indicated that the candidate genes identified by the neighborhood score algorithm could be used to distinguish MI samples from non-MI samples.
Figure 1.

Hierarchical clustering of candidate genes. The x-axis indicates different samples, and the y-axis indicates candidate genes. Blue and red denote the samples in the control group and the MI group, respectively. Gene expression values are expressed as a thermogram. Blue denotes upregulated genes in the MI samples and yellow denotes downregulated genes in the MI samples. MI, myocardial infarction.

Enrichment analysis

The functional enrichment of candidate genes was performed using Fisher's exact test (Table I). The result showed that these genes were mainly enriched in functions such as ‘G-protein coupled receptor signaling’ [olfactory receptor (OR)5I1, OR1A1, ENPP2, CD3E, LHCGR, NPBWR2, HTR4, AKAP12, OR1D2, OR1G1, OR51M1, OR8B8, OR7C1, OR51B5, OR8D2 and GLP1R] and pathways including ‘focal adhesion’ (EGFR, KRAS, PAK3, JUN, TGFA, MAPK8 and CAMK2A).
Table I.

Function and pathway enrichment of DEGs.

A, GO-BP analysis

Term                                      Count   P-value
ErbB signaling pathway                    7       4.903×10−5
Focal adhesion                            8       9.329×10−4
Pancreatic cancer                         5       1.500×10−3
Renal cell carcinoma                      5       1.500×10−3
Neurotrophin signaling pathway            6       2.000×10−3
Olfactory transduction                    10      3.000×10−3
cAMP signaling pathway                    7       3.900×10−3
GnRH signaling pathway                    5       5.100×10−3
Oxytocin signaling pathway                6       5.700×10−3
Choline metabolism in cancer              5       7.300×10−3
Proteoglycans in cancer                   6       1.830×10−2
Neuroactive ligand-receptor interaction   7       1.890×10−2
Insulin signaling pathway                 5       2.120×10−2
Hepatitis B                               5       2.490×10−2
Ras signaling pathway                     6       2.920×10−2

B, KEGG analysis

Term                                      Count   P-value
ErbB signaling pathway                    7       4.903×10−5
Focal adhesion                            8       9.329×10−4
Pancreatic cancer                         5       1.500×10−3
Renal cell carcinoma                      5       1.500×10−3
Neurotrophin signaling pathway            6       2.100×10−3
Olfactory transduction                    10      3.000×10−3
cAMP signaling pathway                    7       3.900×10−3
GnRH signaling pathway                    5       5.100×10−3
Oxytocin signaling pathway                6       5.700×10−3
Choline metabolism in cancer              5       7.300×10−3
Proteoglycans in cancer                   6       1.830×10−2
Neuroactive ligand-receptor interaction   7       1.890×10−2
Insulin signaling pathway                 5       2.120×10−2
Hepatitis B                               5       2.490×10−2
Ras signaling pathway                     6       2.920×10−2

GO-BP, Gene Ontology-biological process; KEGG, Kyoto Encyclopedia of Genes and Genomes; Count, the number of genes assembled/enriched in certain GO-BP function/KEGG pathway; DEG, differentially expressed gene.

Feature selection and subnetwork analysis for candidate genes

To improve the prediction accuracy, feature selection was performed using the RFE algorithm (Fig. 2). The model had the highest prediction accuracy when 15 features were combined (85%). The gene expression distributions of these 15 genes, HES5, ZNF417, GLRA2, OR8D2, HOXA7, FABP6, MUSK, HTR6, GRIP2, OR51M1, OR1C1, KLRK1, vascular endothelial growth factor A (VEGFA), AKAP12 and RHEB, are shown in Fig. 3. Most of the genes were upregulated in patients with MI, although the expression levels of the OR8D2, OR1C1, HES5 and VEGFA genes were lower in the MI group than those in the control group. These 15 feature genes and the non-DEGs that interact with candidate genes were extracted from the PPI network to construct the subnetwork (Fig. 4). There were 107 nodes and 117 edges in the current subnetwork.
Figure 2.

Feature elimination of candidate genes. The x-axis indicates the number of features selected, and the y-axis indicates the prediction accuracy based on the selected feature set. Random combinations of any number of features in all candidate genes were used as the feature to compare prediction accuracy.

Figure 3.

Distribution of candidate genes in the two groups of samples. Red indicates the MI group; green indicates the control group. MI, myocardial infarction.

Figure 4.

Subnetwork constructed by candidate genes. The red polygon denotes upregulated DEGs. The green rhombus denotes downregulated DEGs. The blue square node denotes extended genes that interact directly with DEGs. DEG, differentially expressed gene.

Classification model constructed using candidate genes

A total of 15 genes obtained from the feature selection in this study were used as salient features to construct a classification model based on the SVM classifier (Fig. 5A). Five-fold cross-validation gave an average area under the curve (AUC) value of 0.86, which further indicated that the average prediction accuracy of the model was 86%. To compare in greater detail the accuracy of the SVM classification model in distinguishing patients at high risk of MI from healthy controls, a confusion matrix was used for visualization (Fig. 5B). The prediction accuracies of the confusion matrix were 88 and 90% for the MI group and control group, respectively. The 3D distribution analysis of prominent features in the MI group and control group is shown in Fig. 5C. Significant differences were evident in the distribution between the two groups of samples. PPP1CC, GLP1R and ERCC3 were the first three genes of significance and were selected as the coordinate axes. Consequently, this indicated that the SVM model constructed in the present study using the MI-specific biomarkers could be used to predict those at high risk of MI.
Figure 5.

Classification model constructed using candidate genes. (A) ROC curve of SVM based on 15 genes as features. The x-axis indicates the false positive rate and the y-axis indicates the true positive rate. The five folds of the cross-validation are represented by five colors. The final fitted average is denoted by the black dotted line. (B) Confusion matrix obtained by constructing SVM classifiers based on 15 genes as features. The rows represent the true labels and the columns represent predicted labels. The more consistent the predicted labels are with the true labels, the more accurate the prediction is, and the closer the color is to red. (C) 3D analysis for efficiency of SVM model based on candidate genes. Red represents myocardial infarction samples and blue represents control samples. The color gradient is caused by overlapping samples; the darker the color, the more samples overlap. The three axes represent PPP1CC, GLP1R and ERCC3, which were the first three genes of significance. ROC, receiver operating characteristic; SVM, support vector machine.

Data validation

The validation of independent data was performed using the GSE61144 dataset obtained from the GEO database. The P-value distribution indicated that the P-values of the pre-PCI (average 0.64) and post-PCI (average 0.51) samples were higher than those of the control samples (average 0.19), which indicated that the model could distinguish patients with MI from normal individuals. Meanwhile, the average P-value of the pre-PCI samples was higher than that of the post-PCI samples, which suggested that PCI treatment could alleviate the progression of MI (Fig. 6A). Moreover, the validation samples were divided into a control group and an MI group. ROC curve analysis of the validation data showed that the AUC value calculated using the predicted results of the model was 0.92, which indicated that the model accurately predicted MI (Fig. 6B).
Figure 6.

Results of independent data validation. (A) Distribution of P-values of samples in the three groups. The red dots represent the control samples. The green triangles represent the seven patients with MI post-PCI. The blue squares represent the patients with MI pre-PCI. The x-axis represents the index (24 samples in the validation set) and the y-axis represents the P-value corrected by control. (B) ROC curve analysis for model efficacy based on validation data. The x-axis represents the FPR and the y-axis represents the TPR. ROC, receiver operating characteristic; MI, myocardial infarction; AUC, area under curve; FPR, false positive rate; TPR, true positive rate.

Discussion

MI is a disease with high mortality and morbidity worldwide (28). A family history of MI is an important risk factor for MI, and so far, numerous studies have sought to identify genetic factors associated with MI (6,29,30). In the present study, in order to identify the MI-associated risk genes, a total of 1,207 DEGs were explored between two groups from the GSE34198 dataset, followed by a PPI network construction (1,083 genes and 46,363 edges). A total of 87 candidate genes were identified by evaluating these genes using NS score. The 87 genes were mainly enriched in functions such as ‘G-protein coupled receptor signaling’ and pathways including ‘focal adhesion’. Furthermore, an RFE algorithm was used to screen out 15 genes with the highest prediction accuracy, which were further used to construct a prediction model based on SVM. Finally, a microarray dataset GSE61144 was used to verify that the accuracy of the model was 0.92. AKAP12 is a member of the AKAP family, and serves an essential role in the morphogenesis of muscles (31). Members of the AKAP family participate in various biological functions associated with the heart, such as heart potassium channel phosphorylation (32) and cardiac muscle contraction (33). AKAP12 also enhances β2-adrenoceptor sensitivity in tracheal smooth muscle (34). In an animal model, a previous study revealed that AKAP12 regulated by heat shock protein A12B participates in ventricular dysfunction during the progression of MI (35). The biological function of AKAP12 is commonly realized by its participation in the G-protein coupled receptor pathway (36). G-protein coupled genes (such as P2RY2) have been shown to play an important role in the progression of atherosclerosis, which can lead to the development of MI (37). The variation of endothelial G-protein coupled receptor pathways in arteries contributes to compensated left ventricular hypertrophy (38). OR8D2 belongs to a subfamily of olfactory receptor genes (39). 
Aisenberg et al (40) showed that the OR family of genes participates in the biological function of airway smooth muscle and belongs to the superfamily of G-protein coupled receptors. A close relationship between the OR family and G-protein coupled receptors has previously been described (41). In the current study, genes including AKAP12 and OR8D2 were revealed as DEGs between patients with MI and healthy individuals, and thus were selected as candidate genes for MI prediction. Importantly, GO-BP function enrichment analysis showed that AKAP12 and OR8D2 were both associated with ‘G-protein coupled receptor signaling’. Thus, it was hypothesized that AKAP12 and OR8D2 might participate in the progression of MI via ‘G-protein coupled receptor signaling’. The VEGF gene encodes a potent and selective angiogenic agent that is required for mesangial cell migration and survival (42). Endogenous VEGFA is responsible for mitogenic effects of macrophage chemoattractant protein-1 on vascular smooth muscle cells (43). The upregulation of VEGFA is associated with the progression of MI (44). Gene transfer of VEGF-A165 after MI affects angiogenic and cardiac functions (45). Drugs such as Danshen improve damaged cardiac angiogenesis and cardiac function induced by MI by modulating the VEGFA-related signaling pathway (46). Another previous drug experiment using an animal model indicated that puerarin accelerates cardiac angiogenesis and improves cardiac function of MI by upregulating VEGFA (47). SVM is a machine learning method developed on the basis of statistical learning theory. SVM uses the training error as the constraint condition of the optimization problem, and the minimum of the confidence range value as the optimization goal. SVM is the realization of a structural risk minimization principle (48). Furthermore, SVM was reported to contribute to the detection of an acute MI from a serial electrocardiogram (10). 
Autoregressive coefficients were demonstrated as being useful to characterize the feature of atrial fibrillation, and this feature could be classified using different statistical classifiers such as kernel SVM (49). Based on SVM, the automated risk identification of MI was realized based on certain features, which included the relative frequency band coefficient (50). In the present study, VEGFA was explored as a DEG and was revealed as a candidate gene for MI prediction. Importantly, the early diagnosis model of SVM constructed using 15 candidate genes, including VEGFA, could be used to predict patients at a high risk for MI. Thus, it is proposed that the early diagnosis model of SVM can not only predict early MI, but also indicate the probability of risk according to the severity of MI. Genes including VEGFA might be novel candidate risk genes for MI prediction. Furthermore, AKAP12 and OR8D2 may participate in the progression of MI via G-protein coupled receptor signaling. The present study has several limitations. More factors that may affect the accuracy of the prediction model need to be screened to determine the diagnostic efficacy of these biomarkers. It is also necessary to confirm whether the patient has received relevant treatment, such as nitroglycerin injection, before taking blood samples and whether the patient has other cardiovascular diseases. In addition, these results require validation in a larger cohort of patients with MI. In the future, a prospective study is required to validate the diagnostic potential of these biomarkers. Combining biomarkers with other diagnostic methods is also a worthwhile venture.

1.  Analysis of microarray data using Z score transformation.

Authors:  Chris Cheadle; Marquis P Vawter; William J Freed; Kevin G Becker
Journal:  J Mol Diagn       Date:  2003-05       Impact factor: 5.568

2.  Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

Authors:  Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal:  Nat Protoc       Date:  2009       Impact factor: 13.491

3.  Polymorphisms of genes affecting thrombosis and risk of myocardial infarction.

Authors:  S Kakko; T Elo; J M Tapanainen; H V Huikuri; M J Savolainen
Journal:  Eur J Clin Invest       Date:  2002-09       Impact factor: 4.686

4.  Expression of members of the putative olfactory receptor gene family in mammalian germ cells.

Authors:  M Parmentier; F Libert; S Schurmans; S Schiffmann; A Lefort; D Eggerickx; C Ledent; C Mollereau; C Gérard; J Perret
Journal:  Nature       Date:  1992-01-30       Impact factor: 49.962

5.  Mathematical modelling of tuberculosis epidemics.

Authors:  Juan Pablo Aparicio; Carlos Castillo-Chavez
Journal:  Math Biosci Eng       Date:  2009-04       Impact factor: 2.080

6.  Detection of acute myocardial infarction from serial ECG using multilayer support vector machine.

Authors:  Akshay Dhawan; Brian Wenzel; Samuel George; Ihor Gussak; Bosko Bojovic; Dorin Panescu
Journal:  Conf Proc IEEE Eng Med Biol Soc       Date:  2012

7.  Validation of walk score for estimating neighborhood walkability: an analysis of four US metropolitan areas.

Authors:  Dustin T Duncan; Jared Aldstadt; John Whalen; Steven J Melly; Steven L Gortmaker
Journal:  Int J Environ Res Public Health       Date:  2011-11-04       Impact factor: 3.390

8.  Assessment and diagnostic relevance of novel serum biomarkers for early decision of ST-elevation myocardial infarction.

Authors:  Hun-Jun Park; Ji Heon Noh; Jung Woo Eun; Yoon-Seok Koh; Suk Min Seo; Won Sang Park; Jung Young Lee; Kiyuk Chang; Ki Bae Seung; Pum-Joon Kim; Suk Woo Nam
Journal:  Oncotarget       Date:  2015-05-30

9.  Human Protein Reference Database--2009 update.

Authors:  T S Keshava Prasad; Renu Goel; Kumaran Kandasamy; Shivakumar Keerthikumar; Sameer Kumar; Suresh Mathivanan; Deepthi Telikicherla; Rajesh Raju; Beema Shafreen; Abhilash Venugopal; Lavanya Balakrishnan; Arivusudar Marimuthu; Sutopa Banerjee; Devi S Somanathan; Aimy Sebastian; Sandhya Rani; Somak Ray; C J Harrys Kishore; Sashi Kanth; Mukhtar Ahmed; Manoj K Kashyap; Riaz Mohmood; Y L Ramachandra; V Krishna; B Abdul Rahiman; Sujatha Mohan; Prathibha Ranganathan; Subhashri Ramabadran; Raghothama Chaerkady; Akhilesh Pandey
Journal:  Nucleic Acids Res       Date:  2008-11-06       Impact factor: 16.971

10.  Screening of feature genes in distinguishing different types of breast cancer using support vector machine.

Authors:  Qi Wang; Xudong Liu
Journal:  Onco Targets Ther       Date:  2015-08-27       Impact factor: 4.147
