Dong Zhang1, Liying Guan2, Xiaoming Li2. 1. The Key Laboratory of Cardiovascular Remodeling and Function Research, Chinese Ministry of Education, Chinese National Health Commission and Chinese Academy of Medical Sciences, The State and Shandong Province Joint Key Laboratory of Translational Cardiovascular Medicine, Department of Cardiology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China. 2. Health Management Center, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, Shandong, China.
Abstract
BACKGROUND: Coronary artery disease (CAD) is the leading cause of mortality worldwide. We aimed to screen out potential gene signatures and construct a diagnostic model for CAD. METHOD: We downloaded two mRNA profiles, GSE66360 and GSE60993, and performed analyses of differential expression, gene ontology terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The STRING database was used to identify protein-protein interactions (PPI). PPI network visualization and screening out of key genes were performed using Cytoscape software. Finally, a diagnostic model was constructed. RESULTS: A total of 2127 differentially expressed genes (DEGs) were identified in GSE66360, and 527 DEGs in GSE60993. Of the 153 DEGs from both datasets that showed differential expression between CAD patients and controls, 471 biological process terms, 35 cellular component terms, 17 molecular function terms, and 49 KEGG pathways were significantly enriched. The top 20 key genes in the PPI network were identified, and a diagnostic model constructed from five optimal genes that could efficiently separate CAD patients from controls. CONCLUSION: We identified several potential biomarkers for CAD and built a logistic regression model that will provide a valuable reference for future clinical diagnoses and guide therapeutic strategies.
BACKGROUND:Coronary artery disease (CAD) is the leading cause of mortality worldwide. We aimed to screen out potential gene signatures and construct a diagnostic model for CAD. METHOD: We downloaded two mRNA profiles, GSE66360 and GSE60993, and performed analyses of differential expression, gene ontology terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The STRING database was used to identify protein-protein interactions (PPI). PPI network visualization and screening out of key genes were performed using Cytoscape software. Finally, a diagnostic model was constructed. RESULTS: A total of 2127 differentially expressed genes (DEGs) were identified in GSE66360, and 527 DEGs in GSE60993. Of the 153 DEGs from both datasets that showed differential expression between CAD patients and controls, 471 biological process terms, 35 cellular component terms, 17 molecular function terms, and 49 KEGG pathways were significantly enriched. The top 20 key genes in the PPI network were identified, and a diagnostic model constructed from five optimal genes that could efficiently separate CAD patients from controls. CONCLUSION: We identified several potential biomarkers for CAD and built a logistic regression model that will provide a valuable reference for future clinical diagnoses and guide therapeutic strategies.
Cardiovascular disease (CVD) is the leading cause of death worldwide, and coronary
artery disease (CAD) is considered one of the most serious CVDs.[1,2] CAD is the predominant cause of
ischemic heart disease, which often leads to myocardial infarction and death.[3] Coronary heart disease is always caused by reduced blood flow in the heart
muscle and the accumulation of fat and plaques in the heart’s arteries.[4]Globally, the total number of deaths from CAD increased by 4.1% to 55.8 million
between 2005 and 2015, although age-standardized mortality rates fell by 17.0%
through prevention and treatment strategies established from growing knowledge of
basic CAD pathophysiology.[5] Treatment strategies for patients with complex CAD often include percutaneous
coronary intervention or coronary artery bypass grafting.[6] However, mortality and rehospitalization rates for CAD patients remain high,[7] so the identification and development of specific gene signatures in the
early diagnosis of CAD is crucial to improve treatment.A number of noninvasive diagnostic approaches have been applied to the early
diagnosis of CAD, of which exercise electrocardiography is the most studied and
least accurate for female patients. However, electrocardiography combined with
imaging techniques such as echocardiography or nuclear single photon emission
computed tomography can improve the accuracy of diagnosis.[8] Additionally, a series of key genes has been identified to aid CAD diagnosis
through the analysis of expression profiles. For instance, circulating miR-765 and
miR-149 were identified as potential noninvasive diagnostic biomarkers for geriatric
CAD patients.[9] Zhang et al.[10] showed that several microRNAs and long non-coding RNAs influence the
progression of CAD by regulating the function of vascular endothelial cells, smooth
muscle cells, and macrophages, as well as vascular inflammation and metabolism.
These studies suggested that noninvasive diagnostic approaches have important
diagnostic value in CAD.In this study, we downloaded the mRNA profiles of blood samples from CAD patients and
healthy controls, then analyzed differential gene expression between the two groups
to explore the diagnostic signatures of potential key genes. A protein–protein
interaction (PPI) network was constructed based on 153 differentially expressed
genes (DEGs) that showed significant expression differences. Twenty key genes were
screened out, and five genes closely associated with CAD progression were further
optimized and used to build a logistic regression model with good diagnostic value
for CAD.
Materials and methods
Data collection
mRNA profiles GSE66360[11] and GSE60993[12] were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/). GSE66360 included blood
samples from 49 CADpatients and 50 healthy controls, and mRNA profiles were
detected using the Affymetrix Human Genome U133 Plus 2.0 Array (Thermo Fisher
Scientific, Santa Clara, CA, USA). GSE60993 included blood samples from 26 CAD
patients and seven healthy controls, and mRNA profiles were examined by a
HumanWG-6 v3.0 gene expression beadchip (Illumina, San Diego, CA, USA).
Differential expression analysis
The probes of mRNA profiles were stripped when expression values were 0 in more
than 50% of the samples, and remaining data were standardized using the robust
multi-array method. Differential expression analysis was performed based on the
limma function package of R language,[13] with |log 2 (fold change [FC])| > 1 and false discovery rate (FDR) ≤
0.05 as significant thresholds to screen differentially expressed probes.
Functional enrichment analysis
Gene ontology (GO) analysis (including biological process [BP], molecular
function [MF], and cellular component [CC]) and Kyoto Encyclopedia of Genes and
Genomes (KEGG) pathway enrichment analysis were carried out based on the
clusterProfiler function package of R language.[14] P < 0.05 was used as the threshold to screen significantly enriched GO
terms and KEGG pathways.
Protein–protein interaction (PPI) networks
The STRING database (https://string-db.org/version 11.0) analyzes and predicts
functional connections and interactions of proteins.[15] PPI interactions with a confidence score of ≥ 0.4 were retained after
identification by the STRING database. The PPI network was visualized using
Cytoscape software (https://cytoscape.org/,
version 3.7.2),[16] and key genes of the PPI network were screened by the cytoHubba plug-in
of Cytoscape software based on the Maximal Clique Centrality (MCC)
algorithm.
Construction of the logistic regression model
The glm function in R language[17] was used to construct a multivariate logistic regression model with the
expression value of target genes as the continuous prediction variable and the
sample type as the categorical response value (disease or not). Variables were
then further screened using the stepwise regression method, and used to
reconstruct the model. The model calculated the P value of each variable, and
variables with P < 0.05 were used to construct the final model. Generally,
points with a COOK distance > 0.5 affect the accuracy of the model. The area
under the curve (AUC) value represents the quality of the model, with a larger
AUC value representing a better model.
Results
Identification of DEGs
The mRNA profiles from GSE66360 and GSE60993 databases were standardized, and no
obvious change was seen in the data deviation of each sample (Fig. S1). A total
of 2127 DEGs (925 upregulated and 1202 downregulated) were identified in CAD
patients compared with controls from the GSE66360 dataset (Figure 1a). DEG expression differed
between patients and controls (Figure 1b). A total of 527 DEGs (333 upregulated and 194
downregulated) were identified in CAD patients compared with controls from the
GSE60993 dataset (Figure
1c). DEG expression again differed between patients and controls
(Figure 1d).
Moreover, 153 DEGs from the two datasets simultaneously exhibited notable
differences between CAD patients and controls (Figure 1e).
Figure 1.
Identification of DEGs. (a) Volcano plot of DEGs between CAD patients and
healthy controls from the GSE66360 dataset. The horizontal axis shows
Log2(FC) and the vertical axis shows –Log10(FDR). Red
points represent up-regulated genes, blue points represent
down-regulated genes, and black points indicate no significant
difference. (b) Heatmap of DEGs between CAD patients and healthy
controls from the GSE66360 dataset. The horizontal axis shows genes and
the vertical axis shows samples. Red indicates high expression and blue
indicates low expression. (c) Volcano plot of DEGs between CAD patients
and healthy controls from the GSE60993 data set. The horizontal axis
shows Log2 FC and the vertical axis shows –Log10 (FDR). Red
points represent up-regulated genes and blue points represent
down-regulated genes. (d) Heatmap of DEGs between CAD patients and
healthy controls from the GSE60993 dataset. The horizontal axis shows
genes and the vertical axis shows samples. Red indicates high expression
and blue indicates low expression. (e) Venn diagram of DEGs.
Identification of DEGs. (a) Volcano plot of DEGs between CAD patients and
healthy controls from the GSE66360 dataset. The horizontal axis shows
Log2(FC) and the vertical axis shows –Log10(FDR). Red
points represent up-regulated genes, blue points represent
down-regulated genes, and black points indicate no significant
difference. (b) Heatmap of DEGs between CAD patients and healthy
controls from the GSE66360 dataset. The horizontal axis shows genes and
the vertical axis shows samples. Red indicates high expression and blue
indicates low expression. (c) Volcano plot of DEGs between CAD patients
and healthy controls from the GSE60993 data set. The horizontal axis
shows Log2 FC and the vertical axis shows –Log10 (FDR). Red
points represent up-regulated genes and blue points represent
down-regulated genes. (d) Heatmap of DEGs between CAD patients and
healthy controls from the GSE60993 dataset. The horizontal axis shows
genes and the vertical axis shows samples. Red indicates high expression
and blue indicates low expression. (e) Venn diagram of DEGs.DEGs, differentially expressed genes; CAD, coronary artery disease.
Functional and pathway enrichment analysis
GO and KEGG pathway analyses were performed on the 153 DEGs from the two datasets
showing significant differences between CAD patients and controls. We identified
471 significantly enriched BP terms including neutrophil-mediated immunity, T
cell activation, and leukocyte differentiation and migration (P < 0.05); 35
significantly enriched CC terms including specific granule, tertiary granule,
and secretory granule (P < 0.05); and 17 significantly enriched MF terms
including protein heterodimerization activity, carbohydrate binding, and
signaling pattern recognition receptor activity. We also detected 49
significantly enriched KEGG pathways including tuberculosis, hematopoietic cell
lineage, phagosome, and humanimmunodeficiency virus 1 infection. The full list
of significantly enriched GO terms and KEGG pathways is shown in Table S1. The
top 15 most significantly enriched BP terms are shown in Figure 2a, the top 15 most significantly
enriched CC terms in Figure
2b, the top 15 most significantly enriched MF terms in Figure 2c, and the top 15
most significantly enriched KEGG pathways in Figure 2d.
Figure 2.
Enrichment of GO terms and KEGG pathways based on 153 DEGs. (a) The 15
most significantly enriched BP terms. (b) The 15 most significantly
enriched CC terms. (c) The 15 most significantly enriched MF terms. (d)
The 15 most significantly enriched KEGG pathways. The horizontal axis
shows the number of enriched genes and the vertical axis shows the
corresponding biological process or KEGG pathway. Longer bars represent
more enriched genes. Colors represent the P value.
GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; BP,
biological process; CC, cellular component; MF, molecular function.
Enrichment of GO terms and KEGG pathways based on 153 DEGs. (a) The 15
most significantly enriched BP terms. (b) The 15 most significantly
enriched CC terms. (c) The 15 most significantly enriched MF terms. (d)
The 15 most significantly enriched KEGG pathways. The horizontal axis
shows the number of enriched genes and the vertical axis shows the
corresponding biological process or KEGG pathway. Longer bars represent
more enriched genes. Colors represent the P value.GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; BP,
biological process; CC, cellular component; MF, molecular function.
Key genes in PPI networks
Functional connections and interactions of proteins were predicted by the STRING
database, and interactions between protein pairs with a confidence score of ≥
0.4 were selected. The PPI network was visualized using Cytoscape software
(Figure 3a) and
shown to contain 125 nodes, with a maximum node degree of 47 and minimum node
degree of 1. The integrin alpha M gene had the maximum node degree. CytoHubba
plug-in was used to screen key genes in the PPI network according to the MCC
algorithm, and the top 20 key genes by score are shown in Table S2. The
subnetwork composed of these 20 genes is shown in Figure 3b.
Figure 3.
Construction of the PPI network. (a) PPI network of shared genes. Each
dot represents a node. importance of the gene in the network is
reflected by the degree of the node which is represented by the number
of lines connected to the dot (node). A thicker line reflects a stronger
interaction between two nodes. Red points represent up-regulated genes
and blue points represent down-regulated genes. (b) The network of key
genes selected using the Maximal Clique Centrality algorithm. Important
genes are shown in darker colors.
PPI, protein–protein interaction.
Construction of the PPI network. (a) PPI network of shared genes. Each
dot represents a node. importance of the gene in the network is
reflected by the degree of the node which is represented by the number
of lines connected to the dot (node). A thicker line reflects a stronger
interaction between two nodes. Red points represent up-regulated genes
and blue points represent down-regulated genes. (b) The network of key
genes selected using the Maximal Clique Centrality algorithm. Important
genes are shown in darker colors.PPI, protein–protein interaction.
The logistic regression diagnostic model
We combined the expression value of 20 key genes from GSE66360 and GSE60993
datasets, and removed the batch effect using the sva package[18] of R language. We randomly selected two-thirds of the samples as the
training set, and the remaining samples as the testing set (Table S3). Logistic
regression model 1 was constructed with the expression value of all 20 key genes
as the continuous type prediction variable and the sample type (disease or not)
as the categorical response variable. Eight genes including
CLEC4D, CLEC5A, HP,
LCN2, MMP9, SLC2A3,
SLPI, and TIMP2 were screened out from the
20 predicted to be closely related to CAD progression.Logistic regression model 2 was then constructed with these eight genes as a
variable; the detailed parameters of this model are shown in Table 1. An odds ratio
(OR) value > 1 indicated that genes with high expression promoted the
occurrence of CAD, whereas an OR value < 1 indicated that genes with high
expression inhibited the occurrence of CAD. P values of CLEC4D,
HP, LCN2, MMP9, and
TIMP2 were less than 0.05, suggesting that these genes
contributed markedly to the model. Finally, logistic regression model 3 was
constructed from these five genes. Sample GSM1620895 (with a COOK
distance > 0.5) was found to have little impact on the accuracy of the model
(Figure 4a). The
accuracy of the model was evaluated by the receiver operating characteristic
(ROC) curve; AUC values in the training set and testing set were 0.9295 and
0.8674, respectively (Figure
4b), suggesting that the model constructed by these five genes had a
good diagnostic value in CAD.
Table 1.
Model interpretation of the logistic regression model 2.
GENE
β
SE
OR
95% CI
P-value
CLEC4D
1.9146
0.4868
6.7842
0.4868–6.7842
0.0001
CLEC5A
–0.5006
0.3714
0.6062
0.3714–0.6062
0.1777
HP
–1.1738
0.5409
0.3092
0.5409–0.3092
0.0300
LCN2
–1.1003
0.46
0.3328
0.46–0.3328
0.0167
MMP9
2.0203
0.5648
7.5404
0.5648–7.5404
0.0003
SLC2A3
–0.4833
0.3006
0.6167
0.3006–0.6167
0.1079
SLPI
0.5277
0.3287
1.695
0.3287–1.695
0.1084
TIMP2
–1.0497
0.4608
0.35
0.4608–0.35
0.0227
SE, standard error; OR, odds ratio; CI, confidence interval.
Figure 4.
Construction of the logistic regression diagnostic model. (a) The
logistic regression diagnostic model. Red dashed line indicates the COOK
distance. (b) The ROC curve. The horizontal axis represents the false
positive rate and the vertical axis represents the true positive
rate.
ROC, receiver operating characteristic.
Model interpretation of the logistic regression model 2.SE, standard error; OR, odds ratio; CI, confidence interval.Construction of the logistic regression diagnostic model. (a) The
logistic regression diagnostic model. Red dashed line indicates the COOK
distance. (b) The ROC curve. The horizontal axis represents the false
positive rate and the vertical axis represents the true positive
rate.ROC, receiver operating characteristic.
Discussion
While CVD has traditionally been considered a disease of western society, its global
incidence is on the rise and it is currently more prevalent in low and middle income
countries in Asia and Africa.[19] CAD is a multifactorial inherited disorder associated with at least three
major risk factors: hypertension, diabetes mellitus, and obesity.[20] Although gene expression analysis has had a great impact on the
identification and development of biomarkers in the cardiovascular field, current
risk prediction models only provide a rough estimation of individual risk.[21] Therefore, the identification and development of new biomarkers for CVD risk
prediction is still an urgent need.In recent decades, several genes have been identified or predicted to closely
participate in the occurrence and development of CAD. For instance, the
intercellular adhesion molecule 1 gene polymorphism rs5498 was correlated with a
decreased risk of myocardial infarction and may reduce the risk of CAD.[22] Additionally, gene knockout of transforming growth factor-β, its receptors,
and downstream signaling proteins demonstrated the importance of this pleiotropic
cytokine during vasculogenesis and the maintenance of vascular homeostasis.[23] A meta-analysis suggested that receptor for advanced glycation end products
gene polymorphism Gly82Ser was associated with an increased risk of CAD, especially
in Chinese populations,[24] while another meta-analysis indicated that the T allele of the
5,10-methylenetetrahydrofolate reductase gene rs1801133 polymorphism was a risk
factor for CAD and that this was partly mediated by abnormal lipid levels.[25] Moreover, Hou et al.[26] reported that the interleukin-6 gene –174G/C polymorphism C allele was
associated with increased CAD risk in Caucasians. Nevertheless, early diagnosis and
treatment of CAD is still difficult, so the use of informatics approaches may help
prioritize molecules that are biologically and functionally related to CAD, which
will aid diagnosis and treatment.Herein, a network based on the PPI was constructed with 153 DEGs that showed
significant expression differences between CAD patients and controls in the two GEO
datasets. Twenty key genes were selected (Table S2), of which five optimal genes
(CLEC4D, HP, LCN2,
MMP9, and TIMP2) were predicted to be closely
related to the development of CAD. Although the role of CLEC4D in CAD remains
unclear, necrotic cell sensor CLEC4E was identified to promote a proatherogenic
macrophage phenotype through activation of the unfolded protein response.[27] Graves et al.[28] reported that HP was not only an important antioxidant in vascular
inflammation and atherosclerosis, but also an enhancer of inflammation in cardiac
transplants. LCN2 plays a pivotal role in processes involved in atherogenesis by
promoting the polarization and migration of monocytic cells and the development of
macrophages towards foam cells.[29] Wang et al.[30] found that the expression of MMP9 was significantly upregulated in CAD
samples compared with controls, and participated in the progression of CAD. These
reports confirm that the biomarkers we predicted are functionally similar to known
CAD risk factors, providing a theoretical basis for the subsequent logistic
regression diagnostic model. The accuracy of this model was evaluated by the ROC
curve, identifying AUC values of 0.9295 in the training set and 0.8674 in the
testing set.
Conclusion
Our logistic regression model was shown to have a good diagnostic value in CAD. Our
study therefore provides new insights into the discovery of diagnostic biomarkers in
CAD which could aid early clinical diagnosis and guide therapeutic strategies. Our
results should be verified using a larger sample size.Click here for additional data file.Supplemental material, sj-pdf-1-imr-10.1177_0300060520979856 for Bioinformatics
analysis identifies potential diagnostic signatures for coronary artery disease
by Dong Zhang, Liying Guan and Xiaoming Li in Journal of International Medical
ResearchClick here for additional data file.Supplemental material, sj-xlsx-2-imr-10.1177_0300060520979856 for Bioinformatics
analysis identifies potential diagnostic signatures for coronary artery disease
by Dong Zhang, Liying Guan and Xiaoming Li in Journal of International Medical
ResearchClick here for additional data file.Supplemental material, sj-xlsx-3-imr-10.1177_0300060520979856 for Bioinformatics
analysis identifies potential diagnostic signatures for coronary artery disease
by Dong Zhang, Liying Guan and Xiaoming Li in Journal of International Medical
ResearchClick here for additional data file.Supplemental material, sj-xlsx-4-imr-10.1177_0300060520979856 for Bioinformatics
analysis identifies potential diagnostic signatures for coronary artery disease
by Dong Zhang, Liying Guan and Xiaoming Li in Journal of International Medical
Research
Authors: Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth Journal: Nucleic Acids Res Date: 2015-01-20 Impact factor: 16.971
Authors: Damian Szklarczyk; Annika L Gable; David Lyon; Alexander Junge; Stefan Wyder; Jaime Huerta-Cepas; Milan Simonovic; Nadezhda T Doncheva; John H Morris; Peer Bork; Lars J Jensen; Christian von Mering Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971