Ke-Ke Huang1, Hui-Lei Zheng2, Shuo Li1, Zhi-Yu Zeng1. 1. Department of Cardiology, Institute of Cardiovascular Diseases, the First Affiliated Hospital, Guangxi Medical University, Nanning, China. 2. Department of Health Management, the First Affiliated Hospital, Guangxi Medical University, Nanning, China.
Abstract
Background: Coronary artery disease (CAD) is a multifactorial disease and its pathogenesis remains unclear. We aimed to explore the optimal feature genes (OFGs) for CAD and to investigate the function of immune cell infiltration of CAD. It will be helpful for better understanding of the pathogenesis and the development of genetic prediction of CAD. Methods: Datasets related to CAD were obtained from the Gene Expression Omnibus (GEO) database. Cases from the datasets met diagnostic criteria including clinical symptoms, electrocardiographic (ECG) and angiographic evidence. We identified differentially expressed genes (DEGs) and conducted functional enrichment analysis. OFGs were obtained from the least absolute shrinkage and selection operator (LASSO) algorithm, support vector machine recursive feature elimination (SVM-RFE) algorithm, and random forest (RF) algorithm. CIBERSORT was used to compare immune infiltration between CAD patients and normal controls, and the correlation between OFGs and immune cells was analyzed. Results: DEGs were involved in the interleukin (IL)-17 signaling pathway, nuclear factor (NF)-kappa B signaling pathway, and tumor necrosis factor (TNF) signaling pathway. Gene Ontology (GO) analysis revealed DEGs were enriched in lipopolysaccharide (LPS), tertiary granule, and pattern recognition receptor activity. Disease Ontology (DO) analysis suggested DEGs were enriched in lung disease, arteriosclerotic cardiovascular disease (CVD). Matrix metalloproteinase 9 (MMP9), Pellino E3 ubiquitin protein ligase 1 (PELI1), thrombomodulin (THBD), and zinc finger protein 36 (ZFP36) were screened by the intersection of OFGs obtained from LASSO, SVM-REF, and RF algorithms. CAD patients had a lower proportion of memory B cells (P=0.019), CD8 T cells (P<0.001), resting memory CD4 T cells (P<0.001), regulatory T cells (P=0.028), and gamma delta T cells (P<0.001) than normal controls, while the proportion of activated memory CD4 T cells (P=0.014), resting natural killer (NK) cells (P<0.001), monocytes (P<0.001), M0 macrophages (P=0.023), activated mast cells (P<0.001), and neutrophils (P<0.001) in CAD patients were higher than normal controls. MMP9, PELI1, THBD, and ZFP36 were correlated with immune cells. Conclusions: MMP9, PELI1, THBD, and ZFP36 may be predicted biomarkers for CAD. The OFGs and association between OFGs and immune infiltration may provide potential biomarkers for CAD prediction along with the better assessment of the disease. 2022 Journal of Thoracic Disease. All rights reserved.
Background: Coronary artery disease (CAD) is a multifactorial disease and its pathogenesis remains unclear. We aimed to explore the optimal feature genes (OFGs) for CAD and to investigate the function of immune cell infiltration of CAD. It will be helpful for better understanding of the pathogenesis and the development of genetic prediction of CAD. Methods: Datasets related to CAD were obtained from the Gene Expression Omnibus (GEO) database. Cases from the datasets met diagnostic criteria including clinical symptoms, electrocardiographic (ECG) and angiographic evidence. We identified differentially expressed genes (DEGs) and conducted functional enrichment analysis. OFGs were obtained from the least absolute shrinkage and selection operator (LASSO) algorithm, support vector machine recursive feature elimination (SVM-RFE) algorithm, and random forest (RF) algorithm. CIBERSORT was used to compare immune infiltration between CAD patients and normal controls, and the correlation between OFGs and immune cells was analyzed. Results: DEGs were involved in the interleukin (IL)-17 signaling pathway, nuclear factor (NF)-kappa B signaling pathway, and tumor necrosis factor (TNF) signaling pathway. Gene Ontology (GO) analysis revealed DEGs were enriched in lipopolysaccharide (LPS), tertiary granule, and pattern recognition receptor activity. Disease Ontology (DO) analysis suggested DEGs were enriched in lung disease, arteriosclerotic cardiovascular disease (CVD). Matrix metalloproteinase 9 (MMP9), Pellino E3 ubiquitin protein ligase 1 (PELI1), thrombomodulin (THBD), and zinc finger protein 36 (ZFP36) were screened by the intersection of OFGs obtained from LASSO, SVM-REF, and RF algorithms. CAD patients had a lower proportion of memory B cells (P=0.019), CD8 T cells (P<0.001), resting memory CD4 T cells (P<0.001), regulatory T cells (P=0.028), and gamma delta T cells (P<0.001) than normal controls, while the proportion of activated memory CD4 T cells (P=0.014), resting natural killer (NK) cells (P<0.001), monocytes (P<0.001), M0 macrophages (P=0.023), activated mast cells (P<0.001), and neutrophils (P<0.001) in CAD patients were higher than normal controls. MMP9, PELI1, THBD, and ZFP36 were correlated with immune cells. Conclusions: MMP9, PELI1, THBD, and ZFP36 may be predicted biomarkers for CAD. The OFGs and association between OFGs and immune infiltration may provide potential biomarkers for CAD prediction along with the better assessment of the disease. 2022 Journal of Thoracic Disease. All rights reserved.
Coronary artery disease (CAD) is a multifactorial chronic disease with complex pathology resulted from environmental and genetic factors as well as their interactions (1,2). CAD is characterized by the formation of plaques in the coronary arteries then atherosclerosis occurs. The mechanisms contributing to atherosclerosis are diverse, including dyslipidemia, hypercoagulability, endothelial dysfunction, oxidative stress and inflammation (2). There are multiple factors associated with CAD, containing age, gender, dyslipidemia, hypertension, smoking, diabetes mellitus, obesity and family history (3). In the previous study, researchers have found some biomarkers related to CAD, for instance cardiac troponin T, lipoprotein(a) [Lp(a)], C-reactive protein (CRP) and high-sensitive C-reactive protein (hs-CRP) (4-6). As the most common cardiovascular disease (CVD), CAD causes a heavy burden on human health globally (7-9). The morbidity and mortality of CAD has been continually rising in low- and middle-income countries and is now close to the level of that in developed counties, making it a global issue. According to the 2018 China CVD report, approximately 290 million people suffer from CVD and about 3.79% of them are CAD patients (10).With rapid advancements in technology, the management of CAD is constantly being remodeled and is now more efficiently based on scientific classification and targeted treatment (11). Microarray analysis has been used as a practical method for studying changes in gene expression (12). The Gene Expression Omnibus (GEO) database (13) is a publicly available website supported by the National Center for Biotechnology Information (NCBI) and is used to identify key genes and potential mechanisms of the onset and development of diseases. Therefore, we can detect the gene expression information more efficiently and time-saving by conducting bioinformatics analysis.CAD is life-threatening and in the stage of CAD initiation experts may miss diagnosis for the absence of typical symptoms (14). Meanwhile, the mechanisms of CAD remain still complicated and unclear. While coronary angiography is the gold standard diagnostic technique for CAD, it is invasive and costly. Therefore, combinations of more biomarkers need to be integrated using various methods for creating predictive, diagnostic, or prognostic tools for CAD. Machine learning (ML) has undergone an expansion in its application as a component of artificial intelligence (AI) and has enhanced the efficiency of the health care system (15). A previous study has shown that ML algorithms are effective for risk prediction, diagnosis, and imaging analysis of CVD (16). ML provides a more intelligent approach and increases confidence in the investigation of potential biomarkers compared to traditional methods.The results of accumulating studies are helping researchers better understand the crucial role of immune cell infiltration in the onset and progression of CAD (17,18). CIBERSORT is widely used to investigate the expression of 22 subgroups of immune cells in order to determine the proportions of these immune cells in study samples (19).The purpose of our study is to investigate the potential predictive biomarkers and provide fresh insights into the pathogenesis of CAD and direction for future studies of innovative therapies. If these potential biomarkers indicate the probability of occurrence of CAD accurately then early prevention can be carried out.
Methods
Overview of research procedures
In the present study, CAD-related gene chip data were obtained from GEO open resources. The GEO database was used to conduct bioinformatic analysis of differentially expressed genes (DEGs) between CAD patients and normal controls. Subsequently, we utilized 3 ML algorithms, including least absolute shrinkage and selection operator (LASSO), support vector machine recursive feature elimination (SVM-RFE), and random forest (RF) classifier, to screen optimal feature genes (OFGs) from DEGs for CAD. In addition, we investigated the correlation of OFGs with immune infiltration. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Data collection and preprocessing
The gene expression profiles of datasets GSE66360, GSE61144, GSE60993, and GSE42418 were downloaded from the GEO (13) database (https://www.ncbi.nlm.nih.gov/geo/). GSE66360, GSE61144, and GSE60993, based on GPL570, GPL6106, and GPL6884, respectively, were used to identify the DEGs. GSE66360 contained 49 CAD samples and 50 control samples, GSE61144 consisted of 14 CAD and 10 samples, and GSE60993 was made up of 26 CAD and 7 control samples. The GSE66360, GSE61144, and GSE60993 datasets were normalized and the batch effect was eliminated using the “sva” package in R. We merged the 3 datasets to enlarge the sample size. Subsequently, the merged data became a gene expression profile of 89 CAD patients and 67 normal controls. The GSE42418 dataset, based on GPL13607, was used as the validation set and included 13 CAD samples and 11 control samples.
Identification of DEGs
To screen DEGs between patients and controls, the “limma” package in R was used. A P value <0.05 and fold change (FC) ≥2 (|log2FC| >1) was considered statistically different. Those with log2FC >0 were considered upregulated genes and log2FC <0 downregulated genes. Heatmaps and volcano plots for DEGs were carried out in R using “pheatmap” and “ggplot2” packages, respectively.
Functional enrichment analysis
Gene ontology (GO) analysis [comprising biological processes (BP), molecular functions (MF), and cellular components (CC)], Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, and Disease Ontology (DO) analysis were carried out using “clusterProfile”, “enrichplot”, “ggplot2”, “org.Hs.eg.db”, “GOplot”, and “DOSE” packages in R. A P value <0.05 was used as the threshold to screen significantly enriched GO terms, DO terms, and KEGG pathways.
Screening of OFGs
Three ML methods were used to screen OFGs. LASSO binomial logistic regression was used to select the OFGs from DEGs by applying the “glmnet” package in R. Optimal penalty parameter λ was determined by minimal binomial deviance. The SVM-RFE algorithm was used with “e1071”, “kernlab”, and “caret” packages in R to investigate the point with the smallest cross-validation error to select OFGs. The “randomForest” package in R was used for implementing the RF algorithm to identify the point where error was the most minimal. MeanDecreaseGini score >2 was used as the threshold to determine whether a gene was an OFG. In addition, a Venn diagram visualized the key OFGs obtained from the results of the 3 ML methods (LASSO, SVM-REF, and RF).
Construction of receiver operating characteristic (ROC) curves
The R package “pROC” was used to construct ROC curves and calculate the area under the curve (AUC) for hub genes.
Validation of the OFGs and ROC
The expression matrix of the GSE42418 dataset was used to verify each OFG as well as the ROC.
Infiltrating differential analysis of 22 immune cells
CIBERSORT algorithm in R was used to quantify the proportion of 22 types of immune cells in the merged dataset. We filtered out samples with P<0.05. A bar plot and violin diagram were used to visually represent differences in immune cells between CAD and normal samples.
Correlation analysis between OFGs and infiltrating immune cells
Relationships between the hub genes and infiltrating immune cells were investigated using R software. The analysis results were visualized using the “ggpubr” package in R.
Statistical analysis
All statistical analysis and graphics were conducted with R software (version 4.1.2). Differential expression analysis was performed with the cut-off threshold of P<0.05 and FC ≥2 or |log2FC| >1. A P value of less than 0.05 was two-sided and considered statistically significant.
Results
shows an overview of the present study. We performed differential gene expression analysis () to investigate gene expression in CAD patients and normal controls. When comparing the blood samples of 89 CAD patients and 67 normal controls, 100 upregulated and 5 downregulated genes were identified in the merged dataset (GSE66360, GSE61144, and GSE60993). shows the volcano plot and heatmap for DEGs of the merged dataset.
Figure 1
Schematic overview of study. GO, Gene Ontology; DO, Disease Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; LASSO, least absolute shrinkage and selection operator; SVM-RFE, support vector machine recursive feature elimination; RF, random forest; MMP9, matrix metalloproteinase 9; PELI1, ellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36; ROC, receiver operating characteristic.
Figure 2
Differential expression analysis. (A) Cluster heatmap for DEGs in CAD patients and normal controls. From blue to red represents the low expression to high expression. (B) Volcano plot for DEGs; red dots represent upregulated differential genes, and the green dots represent downregulated differential genes (|log2FC|>1 and adjusted P<0.05). ; Con, control; FC, fold change; DEG, differentially expressed gene; CAD, coronary artery disease.
Schematic overview of study. GO, Gene Ontology; DO, Disease Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; LASSO, least absolute shrinkage and selection operator; SVM-RFE, support vector machine recursive feature elimination; RF, random forest; MMP9, matrix metalloproteinase 9; PELI1, ellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36; ROC, receiver operating characteristic.Differential expression analysis. (A) Cluster heatmap for DEGs in CAD patients and normal controls. From blue to red represents the low expression to high expression. (B) Volcano plot for DEGs; red dots represent upregulated differential genes, and the green dots represent downregulated differential genes (|log2FC|>1 and adjusted P<0.05). ; Con, control; FC, fold change; DEG, differentially expressed gene; CAD, coronary artery disease.GO analysis of the DEGs in the merged dataset revealed the top 10 most significantly enriched BP, CC, and MF items (). In the BP category, upregulated DEGs were enriched in lipopolysaccharide (LPS); for CC, upregulated DEGs were significantly enriched in tertiary granule; and for MF, upregulated DEGs were enriched in pattern recognition receptor activity. KEGG pathway analysis was also performed and upregulated DEGs were abundantly enriched in the interleukin (IL)-17 signaling pathway, nuclear factor (NF)-kappa B signaling pathway, and tumor necrosis factor (TNF) signaling pathway. DO analysis indicated that the DEGs were enriched in lung disease, arteriosclerotic CVD, and atherosclerosis.
Figure 3
Functional enrichment analysis of DEGs. (A) Gene Ontology enrichment analysis; the figure represents biological process, cellular component, and molecular function (top 30 according to adjusted P value, respectively). (B) Circos graph for Gene Ontology enrichment analysis. (C) Disease Ontology enrichment analysis (top 30 according to adjusted P value). (D) Kyoto Encyclopedia of Genes and Genomes enrichment analysis (top 30 according to adjusted P value). GO, Gene Ontology; FC, fold change; DEG, differentially expressed gene.
Functional enrichment analysis of DEGs. (A) Gene Ontology enrichment analysis; the figure represents biological process, cellular component, and molecular function (top 30 according to adjusted P value, respectively). (B) Circos graph for Gene Ontology enrichment analysis. (C) Disease Ontology enrichment analysis (top 30 according to adjusted P value). (D) Kyoto Encyclopedia of Genes and Genomes enrichment analysis (top 30 according to adjusted P value). GO, Gene Ontology; FC, fold change; DEG, differentially expressed gene.
Screening OFGs
Eighteen genes were identified from the CAD-related DEGs using the LASSO algorithm (), and 40 genes were selected using the SVM-REF algorithm. In addition, 8 genes were screened by the RF algorithm. After overlapping the hub genes obtained from the 3 ML methods, 4 candidate hub genes were identified: thrombomodulin (THBD), Pellino E3 ubiquitin protein ligase 1 (PELI1), matrix metallopeptidase 9 (MMP9), and zinc finger protein 36 (ZFP36). The AUC of ROC analysis () was 0.870 for THBD, 0.872 for PELI1, 0.847 for MMP9, and 0.839 for ZFP36.
Figure 4
Three machine learning algorithms were used for OFGs. (A) LASSO algorithm to screen OFGs. (B) SVM-REF algorithm; RMSE was the statistical parameter to determine the optimal feature genes after the analysis of recursive feature elimination algorithm. The lowest RMSE corresponds with the optimal feature genes. (C) RF algorithm to select OFGs; MeanDecreaseGini score >2 was used as the threshold to determine whether a gene was selected. (D) The individual feature selection by LASSO, SVM-RFE, and Random Forest algorithms and the intersection of OFGs obtained from the 3 algorithms. SVM-REF, support vector machine recursive feature elimination; LASSO, least absolute shrinkage and selection operator; OFGs, optimal feature genes; RMSE, root mean square error; RF, random forest.
Figure 5
ROC curves of the predictive efficacy of MMP9, PELI1, THBD, and ZFP36. MMP9, matrix metallopeptidase 9; PELI1, Pellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36; AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic.
Three machine learning algorithms were used for OFGs. (A) LASSO algorithm to screen OFGs. (B) SVM-REF algorithm; RMSE was the statistical parameter to determine the optimal feature genes after the analysis of recursive feature elimination algorithm. The lowest RMSE corresponds with the optimal feature genes. (C) RF algorithm to select OFGs; MeanDecreaseGini score >2 was used as the threshold to determine whether a gene was selected. (D) The individual feature selection by LASSO, SVM-RFE, and Random Forest algorithms and the intersection of OFGs obtained from the 3 algorithms. SVM-REF, support vector machine recursive feature elimination; LASSO, least absolute shrinkage and selection operator; OFGs, optimal feature genes; RMSE, root mean square error; RF, random forest.ROC curves of the predictive efficacy of MMP9, PELI1, THBD, and ZFP36. MMP9, matrix metallopeptidase 9; PELI1, Pellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36; AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic.
Verification of the OFGs
Validation was performed using the GSE42418 dataset [CAD patients (n=13), normal controls (n=11)] to evaluate whether the 4 hub genes were differentially expressed in CAD samples when compared with normal controls (). The expression levels of both PELI1 and ZFP36 were higher in CAD patients than normal controls (P<0.05). However, there was no significant difference in the expression levels of MMP9 and THBD between CAD patients and normal controls. The results of ROC analysis () showed that AUC was 0.552 for THBD, 0.832 for PELI1, 0.727 for MMP9, and 0.769 for ZFP36.
Figure 6
Validation of the OFGs and ROC curves. (A) Expression of PELI1 and ZFP36 in CAD patients compared to normal controls in the validation dataset (only genes with P<0.05 are shown). (B) ROC curves of the predictive efficacy of PELI1 and ZFP36 in the validation set. Con, control; PELI1, Pellino E3 ubiquitin protein ligase 1; AUC, area under the curve; CI, confidence interval; ZFP36, zinc finger protein 36; OFGs, optimal feature genes; ROC, receiver operating characteristic; CAD, coronary artery disease.
Validation of the OFGs and ROC curves. (A) Expression of PELI1 and ZFP36 in CAD patients compared to normal controls in the validation dataset (only genes with P<0.05 are shown). (B) ROC curves of the predictive efficacy of PELI1 and ZFP36 in the validation set. Con, control; PELI1, Pellino E3 ubiquitin protein ligase 1; AUC, area under the curve; CI, confidence interval; ZFP36, zinc finger protein 36; OFGs, optimal feature genes; ROC, receiver operating characteristic; CAD, coronary artery disease.
Immune infiltration analysis
The CIBERSORT algorithm was used to explore the relative proportion of 22 types of immune cells in samples from 88 CAD patients and 67 normal controls (). The bar plot shows the contents of varied subpopulations in each individual clearly (). The violin diagram shows that the proportion of memory B cells (P=0.019), CD8 T cells (P<0.001), resting memory CD4 T cells (P<0.001), regulatory T cells (P=0.028), and gamma delta T cells (P<0.001) in CAD samples were significantly lower than in normal control samples. However, activated memory CD4 T cells (P=0.014), resting natural killer (NK) cells (P<0.001), monocytes (P<0.001), M0 macrophages (P=0.023), activated mast cells (P<0.001), and neutrophils (P<0.001) in CAD samples were significantly higher than in normal control samples ().
Figure 7
Immune cell infiltration analysis. (A) The relative percentage of 22 immune cell subpopulations of the samples from the merged dataset. (B) Correlation heatmap of 22 immune cells: red and blue represent positive and negative correlation, respectively. The square area with a deeper color has a stronger correlation index. (C) Violin diagram displays different fractions of 22 immune cells in CAD and control samples. Con, control; CAD, coronary artery disease.
Immune cell infiltration analysis. (A) The relative percentage of 22 immune cell subpopulations of the samples from the merged dataset. (B) Correlation heatmap of 22 immune cells: red and blue represent positive and negative correlation, respectively. The square area with a deeper color has a stronger correlation index. (C) Violin diagram displays different fractions of 22 immune cells in CAD and control samples. Con, control; CAD, coronary artery disease.
The correlation between MMP9, PELI1, THBD, and ZFP36 and immune cells
Correlation analysis between hub genes and immune cells () revealed that MMP9 had a significant positive correlation with neutrophils (r=0.652, P<0.001), monocytes (r=0.405, P<0.001), and M0 macrophages (r=0.402, P<0.001) and a negative correlation with CD8 T cells (r=−0.426, P<0.001). PELI1 was positively correlated with neutrophils (r=0.583, P<0.001) and activated mast cells (r=0.436, P<0.001) and negatively correlated with CD8 T cells (r=−0.405, P<0.001). THBD was positively correlated with neutrophils (r=0.573, P<0.001) and monocytes (r=0.432, P<0.001) and negatively correlated with CD8 T cells (r=−0.412, P<0.001). ZFP36 was positively correlated with neutrophils (r=0.555, P<0.001) and activated mast cells (r=0.432, P<0.001) and negatively correlated with CD8 T cells (r=−0.405, P<0.001). It can be concluded that MMP9, PELI1, THBD, and ZFP36 were all correlated with immune cells.
Figure 8
Visualization of Spearman correlation between immune cells and the 4 optimal feature genes. The dot with a larger size has a stronger correlation coefficient. The P value is presented by different colors, the dot with a greener color has a smaller P value, while the yellower color has a larger P value. Abs (cor), absolute value (correlation); MMP9, matrix metallopeptidase 9; PELI1, Pellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36.
Visualization of Spearman correlation between immune cells and the 4 optimal feature genes. The dot with a larger size has a stronger correlation coefficient. The P value is presented by different colors, the dot with a greener color has a smaller P value, while the yellower color has a larger P value. Abs (cor), absolute value (correlation); MMP9, matrix metallopeptidase 9; PELI1, Pellino E3 ubiquitin protein ligase 1; THBD, thrombomodulin; ZFP36, zinc finger protein 36.
Discussion
Coronary heart disease (CHD) is characterized as a multifactorial disease and has become an economic burden globally (7). With rapid advances in technology, ML algorithms are being used to achieve a deeper understanding of clinical diagnoses, prediction and treatments through gene expression data (20). In the current study, we investigated the key feature genes associated with CAD by comparing differences in gene expression chips between CAD patients and normal controls. In this study, we combined 3 CAD datasets and then identified 105 DEGs and 4 upregulated hub genes (MMP9, PEIL1, THBD, and ZFP36) using bioinformatics analyses and ML methods, respectively. In addition, we performed functional enrichment. Further, we used the CIBERSORT algorithm to reveal that the 4 OFGs participated in immune cell infiltration.GO enrichment analysis showed that the DEGs screened from the merged dataset were mainly related to LPS, tertiary granule, and pattern recognition receptor activity. Previous studies have demonstrated that LPS is related to heart injury. A study by Lepper et al. reported that the serum LPS-binding protein concentration in CAD patients was significantly increased compared with individuals without coronary atherosclerosis (21). A study conducted by Justo-Junior et al. found higher levels of chemokine and pattern-recognition receptor expressed in patients with unstable angina (22). According to KEGG pathways analysis, DEGs were abundantly enriched in the IL-17 signaling pathway, NF-kappa B signaling pathway, and TNF signaling pathway. It is already known that interferon gamma (IFN-γ) secreted by T cells is highly expressed in atherosclerotic lesions, and that regulation of the IL-17 signaling pathway plays a key role in atherosclerosis (23). NF-kappa B regulates the expression of genes targeting the initiation and progression of atherosclerosis. Under the action of NF-kappa B, multiple processes are integrated in the formation of atherosclerotic plaques (24). As one of the inflammatory markers, TNF was confirmed to be an indicator of increasing risk of CAD. The results of DO analysis indicated that the DEGs were enriched in lung disease, arteriosclerotic CVD, and atherosclerosis.In recent years, the application of AI technology in CVD has made significant progress (20,25). It is important to consider whether the features selected at each point are true biomarkers or false positives (15). The OFGs obtained from ML algorithms may improve the efficiency of clinical diagnosis and prediction and provide more clues to guide doctors to make a diagnosis as ML methods can identify more complex, nonliner relationships. LASSO is a regression-based methodology, and it has the unique feature of penalizing the absolute value of a regression coefficient (26,27). SVM is a powerful ML method for building a classifier (28), and the present study used SVM-REF to screen OFGs because it can select relevant features using a separating hyperplane (28-30). The RF algorithm was used to screen feature items to obtain their importance ranking and is not vulnerable to overfitting. “Gini importance” is a measure of feature importance and is available in RF implementations (31,32). Our main aim in the present study was to screen the OFGs, not to develop a diagnostic or predictive tool. In this study, we preprocessed the original GEO data first, then investigated the OFGs through multiple feature selection algorithms. Therefore, by intergrading with LASSO, RF, and SVM-REF algorithms, 4 OFGs (MMP9, PELI1, THBD, and ZFP36) were eventually selected. Additionally, the SVM-REF and LASSO algorithms reduced the risk of overfitting through cross-validation. The selected 4 OFGs ranked highly in terms of importance.MMP9, also known as GELB, CLG4B, MMP-9, and MANDP2, is a member of the matrix metalloproteinase family and is abundant in the interruption of extracellular matrix in normal physiological processes. MMP9 also plays an important role in disease processes, including in embryonic development, reproduction tissue remodeling, arthritis, and metastasis (33). Goerg et al. (34) reported that downregulation of MMP9 protein expression in the heart by empagliflozin could improve systolic heart function after myocardial infarction (MI) in rats. A study by Mujumdar et al. demonstrated that the activation of MMP9 decreased cardiac tensile strength (35). PELI1 can facilitate the activity of ubiquitin protein ligase, and it participates in negative regulation of necroptotic procedure, protein polyubiquitination, and reaction to LPS (36). A mouse model of MI confirmed that PELI1 was an important downstream target of vascular endothelial growth factor (VEGF), which can salvage impaired collateral blood vessel formation, diminish fibrosis, and improve myocardial function (37). Another study reported that PELI1 was a potential clinical marker for therapies to repair the damaged heart following MI in humans (38). THBD is an intronless gene and is also known as TM, THRM, AHUS6, BDCA3, CD141, BDCA-3, and THPH12. THBD encodes a protein that is an endothelial-specific type I membrane receptor and can bind thrombin. The combination of this protein and thrombin results in the activation of protein C, which degrades clotting factors Va and VIIIa and decreases the amount of thrombin generated. Mutations in THBD are a driver of thromboembolic disease, also known as inherited thrombophilia (39). A study from Iran (40) investigated the association of the rs1042579 single nucleotide polymorphism (SNP) in THBD with the risk of CVD and found that rs1042579 SNP could increase the risk of CVD. A study from Pakistan showed that the relationship between THBD and inflammatory cytokines in CAD helped to identify new prognostic and therapeutic targets for CVD treatments (41). ZFP36 is also known as TTP, G0S24, GOS24, TIS11, NUP475, zfp-36, and RNF162A, and it enables several functions, including 14-3-3 protein binding activity, heat shock protein binding activity, and mRNA 3'-UTR AU-rich region binding activity. Moreover, ZFP36 is extensively involved in cellular response to cytokine stimulation and growth factor stimulation as well as regulation of gene expression (42). Zhang et al. reported that ZFP36 was expressed in vascular endothelial cells and macrophage foam cells of atherosclerosis, and thereby ZFP36 expression may reduce vascular inflammation and prevent or treat atherosclerosis (43).Atherosclerosis is characterized by hyperlipidemia and inflammation, and it is a major cause of CAD (11). Previous studies have shown that inflammatory macrophages and foam cell formation are crucial factors in atherosclerotic plague progression (44,45). T cells target the vessel wall in line with macrophages and react to antigens in the arterial wall (46). After activation of T cells, proinflammatory mediators are produced, exaggerating the inflammatory response, and disease development is worsened (47). In addition to macrophages and T cells, other innate and immune cells contribute to the pathogenesis of atherosclerosis, including neutrophils, B cells, and NK cells (48). Another study showed that CAD patients presented a highly activated CD+CXCR5+T cell subset (49). However, Olson et al. (50) evaluated innate and adaptive immune cells subsets in CHD patients and found that peripheral blood monocyte subsets were not strongly associated with CD4+ naive, memory, CD28−, or T helper cell subsets in MI or MI angina cases. CIBERSORT evaluation in the present study suggested that memory B cells, CD8 T cells, resting memory CD4 T cells, regulatory T cells, and gamma delta T cells in CAD samples infiltrated less than in normal control samples. On the contrary, activated memory CD4 T cells, resting NK cells, monocytes, M0 macrophages, activated mast cells, and neutrophils in CAD samples infiltrated more than in normal control samples. One previous study found that the number of neutral endopeptidase positive neutrophils was higher in acute MI patients with ruptured plaques compared with eroded plaques (51).To further discover the relationship between OGFs and the main pathogenic mechanism of CAD, we analyzed the correlation between immune cells and the 4 selected OFGs. The present study revealed that MMP9 had a significant positive correlation with neutrophils, monocytes, and M0 macrophages, and a negative correlation with CD8 T cells. PELI1 was positively correlated with neutrophils and activated mast cells, and negatively correlated with CD8 T cells. THBD was positively associated with neutrophils and monocytes, while negatively correlated with CD8 T cells. ZFP36 was positively correlated with neutrophils and activated mast cells, and negatively correlated with CD8 T cells. It can be concluded from the results that MMP9, PELI1, THBD, and ZFP36 were all correlated with immune cells.The current study had several strengths. Firstly, our approach identified 4 OFGs for CAD patients using 3 ML feature selection algorithms. The combined application of LASSO, SVM-RFE, and RF in this study to screen the OFGs associated with CAD reduced bias to the maximum extent. Secondly, we merged 3 different datasets to enlarge the sample size and removed the unqualified samples. In addition, we eliminated the batch effect between GEO datasets to make the statistical analyses more trustworthy. Finally, the expression levels of the 4 OFGs were validated in another independent dataset and showed good performance. However, several limitations in this study should be addressed. This was a bioinformatic analysis, and the identified hub genes as well as the interaction of these genes and immune cells need to be confirmed by functional validation in vitro and in vivo. Moreover, despite merging 3 GEO datasets, a larger sample is still needed for better results in the future.
Conclusions
In conclusion, 105 DEGs were identified using bioinformatics analyses, and 4 OFGs were obtained using 3 ML methods, providing a focus for further investigation of prediction for CAD. We investigated immune infiltration in CAD samples using CIBERSORT analysis and found a significant difference in immune infiltration between CAD and normal control samples. The relationship between OFGs and immune infiltration in the occurrence and development of CAD needs more in-depth study.The article’s supplementary files as
Authors: M P Brown; W N Grundy; D Lin; N Cristianini; C W Sugnet; T S Furey; M Ares; D Haussler Journal: Proc Natl Acad Sci U S A Date: 2000-01-04 Impact factor: 11.205
Authors: Justin T Saunders; Vijay Nambi; James A de Lemos; Lloyd E Chambless; Salim S Virani; Eric Boerwinkle; Ron C Hoogeveen; Xiaoxi Liu; Brad C Astor; Thomas H Mosley; Aaron R Folsom; Gerardo Heiss; Josef Coresh; Christie M Ballantyne Journal: Circulation Date: 2011-03-21 Impact factor: 29.690
Authors: Huanchun Zhang; W Robert Taylor; Giji Joseph; Valentina Caracciolo; Donna M Gonzales; Neil Sidell; Emre Seli; Perry J Blackshear; Caleb B Kallen Journal: Arterioscler Thromb Vasc Biol Date: 2013-04-04 Impact factor: 8.311
Authors: A S Justo-Junior; L M Villarejos; X T V Lima; W Nadruz; A C Sposito; R L Mamoni; R Abdalla; J L Fernandes; R T D Oliveira; M H S L Blotta Journal: Cytokine Date: 2018-06-20 Impact factor: 3.861
Authors: Philipp M Lepper; Christian Schumann; Kathy Triantafilou; F Maximilian Rasche; Tibor Schuster; Hedwig Frank; E Marion Schneider; Martha Triantafilou; Maximilian von Eynatten Journal: J Am Coll Cardiol Date: 2007-06-18 Impact factor: 24.094