Yanbao Sun, Qi Zhang, Qi Yang, Ming Yao, Fang Xu, Wenyu Chen.
Abstract
Since SARS-CoV-2 was first reported in Wuhan, China in December 2019, coronavirus disease 2019 (COVID-19) has spread into a global pandemic. Accurate diagnosis is central to preventing the disease. However, owing to the limitations of current detection technology, test results cannot be entirely free of false positives or false negatives. Improving the accuracy of testing calls for the identification of more biomarkers for COVID-19. Using expression data from COVID-19 positive and negative samples, we first screened feature genes with the ReliefF, minimal-redundancy-maximum-relevancy (mRMR), and Boruta_MCFS methods. We then selected 36 optimal feature genes by incremental feature selection (IFS) based on a random forest classifier, and identified their enriched biological functions and signaling pathways with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses. We also performed protein-protein interaction (PPI) network analysis on these feature genes and analyzed the enriched biological functions and signaling pathways of the main submodules. In addition, dimensionality reduction analysis was used to verify whether the 36 feature genes could effectively distinguish positive samples from negative ones. From these results, we infer that the 36 feature genes selected via Boruta_MCFS could serve as biomarkers for COVID-19.
Keywords: COVID-19; bioinformatics; feature selection; gene expression markers; random forest classifier
Year: 2022 PMID: 35812497 PMCID: PMC9258782 DOI: 10.3389/fpubh.2022.901602
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Figure 1. The flowchart of the study.
Figure 2. Comparison of the ReliefF, mRMR, and Boruta_MCFS feature selection methods. The IFS curves of the three methods are based on the random forest classifier. The abscissa represents the number of feature genes, and the ordinate represents the Matthews correlation coefficient (MCC) value.
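The IFS procedure behind these curves can be illustrated with a minimal sketch: given a ranked feature list, evaluate nested top-k subsets and keep the k with the highest MCC. The sketch below is not the authors' pipeline. It assumes binary 0/1 labels, swaps the random forest for a simple nearest-centroid classifier with leave-one-out evaluation so that it runs on the standard library alone, and uses made-up toy data; the function names `mcc`, `nearest_centroid_loo`, and `incremental_feature_selection` are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 = negative, 1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def nearest_centroid_loo(X, y, cols):
    """Leave-one-out predictions from a nearest-centroid classifier.

    Stand-in for the random forest used in the study (an assumption made
    only to keep the sketch dependency-free). Each class needs >= 2 samples.
    """
    preds = []
    for i in range(len(X)):
        dist = {}
        for label in (0, 1):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroid = [sum(r[c] for r in rows) / len(rows) for c in cols]
            dist[label] = sum((X[i][c] - m) ** 2 for c, m in zip(cols, centroid))
        preds.append(min(dist, key=dist.get))
    return preds


def incremental_feature_selection(X, y, ranked):
    """Score the top-1, top-2, ..., top-n ranked features; return (best_k, best_mcc, curve)."""
    curve = [mcc(y, nearest_centroid_loo(X, y, ranked[:k]))
             for k in range(1, len(ranked) + 1)]
    best_k = max(range(len(curve)), key=curve.__getitem__) + 1  # first maximum wins
    return best_k, curve[best_k - 1], curve


# Toy data (hypothetical): column 0 separates the classes, column 1 is noise.
X_toy = [[0.1, 5.0], [0.2, 1.0], [0.0, 3.0], [0.3, 4.0],
         [1.0, 2.0], [1.1, 5.0], [0.9, 1.0], [1.2, 3.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
best_k, best_mcc, curve = incremental_feature_selection(X_toy, y_toy, ranked=[0, 1])
print(best_k, best_mcc, curve)
```

In a real analysis, `ranked` would come from a feature selection method such as ReliefF, mRMR, or Boruta_MCFS, and the evaluator would be the cross-validated classifier of choice; plotting `curve` against k gives an IFS curve of the kind shown in Figure 2.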
The 36 feature genes screened by the Boruta_MCFS feature selection method.
| | | | |
|---|---|---|---|
| PLVAP | SIGLEC1 | SERPING1 | IFIT5 |
| TRO | IFI6 | CXCL10 | ATM |
| TMEM126A | LGR6 | MED9 | PTAFR |
| RTP4 | PBDC1 | LAG3 | RILPL2 |
| NOC3L | PADI2 | SCN2A | PRMT7 |
| ICAM4 | ISG15 | USP18 | CDC42EP3 |
| CXCL11 | HERC5 | OAS3 | COPS5 |
| BST2 | HRASLS2 | DDX58 | PPARD |
| DSC2 | NDUFB9 | NPFFR1 | IFI44 |
Figure 3. Functional enrichment analyses. (A) GO analysis of the 36 feature genes. (B) KEGG analysis of the 36 feature genes.
Figure 4. PPI network analysis. (A) The PPI network based on the 36 feature genes. (B) The number of nodes of the 22 feature genes in the PPI network. (C) Major subsets identified using MCODE. (D) The heat map of the enriched module (the smaller the p-value, the darker the color). (E) The network of the enriched module (the same cluster with the same color). (F) Enriched module network (the smaller the p-value, the darker the color).
Figure 5. PCA performed on the 36 feature genes. Green represents positive samples and red represents negative samples.
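As a rough illustration of this kind of check, the sketch below projects samples onto the first two principal components and tests whether the two groups separate along PC1. It assumes NumPy is available and uses synthetic data (10 positive and 10 negative samples over 36 made-up features), not the study's expression matrix; `pca_project` is a hypothetical helper, and the source does not state which software produced Figure 5.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD.

    Returns (scores, explained_variance_ratio) for the kept components.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # sample coordinates on the PCs
    explained = (S ** 2) / np.sum(S ** 2)        # variance share per component
    return scores, explained[:n_components]


# Synthetic stand-in data (assumption): positives are shifted on every feature.
rng = np.random.default_rng(0)
positive = rng.normal(loc=3.0, scale=1.0, size=(10, 36))
negative = rng.normal(loc=0.0, scale=1.0, size=(10, 36))
X_demo = np.vstack([positive, negative])

scores, explained = pca_project(X_demo)
pc1_pos, pc1_neg = scores[:10, 0], scores[10:, 0]
# The groups separate on PC1 if their score ranges do not overlap.
separated = bool(pc1_pos.min() > pc1_neg.max() or pc1_pos.max() < pc1_neg.min())
print(scores.shape, separated)
```

Scatter-plotting `scores[:, 0]` against `scores[:, 1]`, colored by sample label, would give a plot of the same kind as Figure 5.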