Shijian Ding, Deling Wang, Xianchao Zhou, Lei Chen, Kaiyan Feng, Xianling Xu, Tao Huang, Zhandong Li, Yudong Cai.
Abstract
The heart is an essential organ in the human body. It contains various types of cells, such as cardiomyocytes, mesothelial cells, endothelial cells, and fibroblasts, and the interactions between these cells determine the vital functions of the heart. Identifying the different cell types and revealing their expression rules is therefore crucial. In this study, multiple machine learning methods were used to analyze heart single-cell profiles covering 11 heart cell types. The profiles were first analyzed with the light gradient boosting machine (LightGBM) method to evaluate the importance of each gene feature, producing a ranked feature list. This list was then fed into the incremental feature selection (IFS) method to identify the best features and build the optimal classifiers. The best decision tree (DT) and random forest (RF) classification models achieved the highest weighted F1 scores of 0.957 and 0.981, respectively. The selected features, such as NPPA, LAMA2, and DLC1, together with the classification rules extracted from the optimal DT classifier, are supported by recent research and enrichment analysis as playing crucial roles in cardiac structure and function. In particular, some lncRNAs (LINC02019, NEAT1) were found to be important for recognizing different cardiac cell types. In summary, these findings provide a solid foundation for the development of molecular diagnostics and biomarker discovery for cardiac diseases.
Keywords: biomarker; decision rule; heart cell; machine learning method; single-cell profiles
Year: 2022; PMID: 35207515; PMCID: PMC8877019; DOI: 10.3390/life12020228
Source DB: PubMed; Journal: Life (Basel); ISSN: 2075-1729
Figure 1. Flow chart of the study design. First, the LightGBM method was applied to the single-cell gene expression profiles to rank the gene features into a list. Second, the IFS method was combined with machine learning algorithms to determine the best number of features and to build the optimal classifiers and decision rules. Finally, functional enrichment analysis was performed on the optimal gene feature set.
Sample size of each heart cell type in the single-cell dataset.
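The ranking-then-selection workflow described above can be illustrated with a minimal sketch of incremental feature selection. This is not the authors' implementation: the function name, the `step` parameter, and the `evaluate` callback are assumptions made for illustration. In practice `evaluate` would train a classifier (for example DT or RF) on the given feature subset and return a score such as the weighted F1; here a toy scoring function stands in for that step.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 10,
) -> Tuple[List[str], float]:
    """Score growing prefixes of a ranked feature list and keep the best one.

    `evaluate` is a hypothetical callback that returns a performance score
    (higher is better) for a classifier built on the given feature subset.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    # Try the top `step`, top `2 * step`, ... features in ranked order.
    for k in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_evaluate(subset: List[str]) -> float:
    """Toy stand-in for cross-validated model performance (an assumption)."""
    informative = {"g1", "g2", "g3"}
    # Reward informative features, lightly penalize subset size.
    return len(informative & set(subset)) - 0.01 * len(subset)


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # placeholder gene names
best_subset, best_score = incremental_feature_selection(ranked, toy_evaluate, step=1)
print(best_subset, round(best_score, 2))
```

Because each candidate subset is a prefix of the ranked list, the search is linear in the number of prefixes tried rather than exponential in the number of features.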
| Index | Cell Type | Sample Size |
|---|---|---|
| 1 | Adipocytes | 3799 |
| 2 | Atrial cardiomyocyte | 23,483 |
| 3 | Endothelial | 100,579 |
| 4 | Fibroblast | 59,341 |
| 5 | Lymphoid | 17,217 |
| 6 | Mesothelial | 718 |
| 7 | Myeloid | 23,028 |
| 8 | Neuronal | 3961 |
| 9 | Pericytes | 77,856 |
| 10 | Smooth muscle cells | 16,242 |
| 11 | Ventricular cardiomyocyte | 125,289 |
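The sample sizes above are strongly imbalanced, which is relevant when reading the macro and weighted F1 scores reported for the classifiers. As a quick check, the following sketch sums the table and compares the largest and smallest classes; the variable names are illustrative only.

```python
# Sample sizes copied from the table above.
sample_sizes = {
    "Adipocytes": 3799,
    "Atrial cardiomyocyte": 23483,
    "Endothelial": 100579,
    "Fibroblast": 59341,
    "Lymphoid": 17217,
    "Mesothelial": 718,
    "Myeloid": 23028,
    "Neuronal": 3961,
    "Pericytes": 77856,
    "Smooth muscle cells": 16242,
    "Ventricular cardiomyocyte": 125289,
}

total = sum(sample_sizes.values())
# Fraction of all cells belonging to each cell type.
proportions = {name: n / total for name, n in sample_sizes.items()}

smallest = min(sample_sizes, key=sample_sizes.get)
largest = max(sample_sizes, key=sample_sizes.get)
imbalance_ratio = sample_sizes[largest] / sample_sizes[smallest]

print(f"total cells: {total}")
print(f"{largest} / {smallest} ratio: {imbalance_ratio:.1f}")
```

The table sums to 451,513 cells, with ventricular cardiomyocytes outnumbering mesothelial cells by a factor of roughly 174.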
Top 20 genes in the feature list, as ranked by the LightGBM method.
| Index | Gene Symbol | Index | Gene Symbol |
|---|---|---|---|
| 1 | LINC02019 | 11 | LAMA2 |
| 2 | CAMSAP3 | 12 | NPPA |
| 3 | AC128685.1 | 13 | LINC01958 |
| 4 | AL139125.1 | 14 | LMNTD2 |
| 5 | AL024508.2 | 15 | AC131009.2 |
| 6 | AL121772.1 | 16 | DLC1 |
| 7 | LINC02346 | 17 | AC020978.5 |
| 8 | GLB1L3 | 18 | RYR2 |
| 9 | C22orf15 | 19 | LDB2 |
| 10 | UPK3A | 20 | SPARCL1 |
Figure 2. IFS curves of the DT and RF methods. The number of features is plotted on the X-axis and the performance on the Y-axis. The highest weighted F1 scores obtained by RF and DT are marked.
Performance of the two optimal classifiers.
| Classification Algorithm | Number of Features | ACC | MCC | Macro F1 | Weighted F1 |
|---|---|---|---|---|---|
| Random forest | 470 | 0.981 | 0.977 | 0.973 | 0.981 |
| Decision tree | 380 | 0.957 | 0.945 | 0.934 | 0.957 |
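To make the difference between the Macro F1 and Weighted F1 columns concrete, here is a minimal pure-Python sketch of both averages. It is an illustration under stated assumptions, not the evaluation code used in the study: the function names and the toy labels are hypothetical. Macro F1 averages per-class F1 scores equally, whereas weighted F1 weights each class by its number of true samples.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """F1 score of each class, treating that class as the positive label."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of the per-class F1 scores, weighted by true class frequency."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy labels (hypothetical): a large class and a small, harder class.
y_true = ["endothelial"] * 8 + ["mesothelial"] * 2
y_pred = ["endothelial"] * 8 + ["endothelial", "mesothelial"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy case the small class is classified less accurately, so the macro average falls below the weighted average, the same direction as the gap between the two columns in the table.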
Figure 3. Performance of the two optimal classifiers on each cell type.
Figure 4. Box plot of the performance of the optimal RF classifier on datasets with added noise. The performance is almost the same as that on the original dataset, which supports the robustness of the optimal RF classifier.
Figure 5. Number of rules produced by the optimal DT classifier for each cell type.
Figure 6. GO term and KEGG pathway analysis for the top 470 genes. (A) Top 15 key GO terms. (B) Top five key KEGG pathways.
Indices and passed counts of the 33 selected rules, chosen as the three rules with the highest passed counts in each cell type.
| Rule Index | Cell Type | Passed Counts a | Rule Index | Cell Type | Passed Counts |
|---|---|---|---|---|---|
| Rules_4 | Atrial cardiomyocyte | 14,567 | Rules_12 | Endothelial | 2992 |
| Rules_15 | Atrial cardiomyocyte | 2451 | Rules_159 | Mesothelial | 199 |
| Rules_25 | Atrial cardiomyocyte | 1657 | Rules_269 | Mesothelial | 96 |
| Rules_0 | Ventricular cardiomyocyte | 95,879 | Rules_287 | Mesothelial | 89 |
| Rules_13 | Ventricular cardiomyocyte | 2856 | Rules_20 | Neuronal | 1950 |
| Rules_14 | Ventricular cardiomyocyte | 2728 | Rules_182 | Neuronal | 165 |
| Rules_2 | Fibroblast | 32,635 | Rules_189 | Neuronal | 159 |
| Rules_9 | Fibroblast | 6595 | Rules_36 | Adipocytes | 1227 |
| Rules_19 | Fibroblast | 2219 | Rules_52 | Adipocytes | 866 |
| Rules_11 | Smooth muscle cells | 3115 | Rules_81 | Adipocytes | 486 |
| Rules_17 | Smooth muscle cells | 2242 | Rules_20 | Neuronal | 1950 |
| Rules_29 | Smooth muscle cells | 1565 | Rules_182 | Neuronal | 165 |
| Rules_3 | Pericytes | 21,300 | Rules_189 | Neuronal | 159 |
| Rules_8 | Pericytes | 7142 | Rules_5 | Lymphoid | 9681 |
| Rules_10 | Pericytes | 4448 | Rules_24 | Lymphoid | 1673 |
| Rules_1 | Endothelial | 62,186 | Rules_78 | Lymphoid | 503 |
| Rules_7 | Endothelial | 8820 | | | |
a: “passed counts” indicates the number of samples satisfying the condition of the rule.
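The notion of a passed count can be sketched as follows: a decision rule is a conjunction of threshold conditions on gene expression, and its passed count is the number of samples that satisfy every condition. The rule below, its thresholds, and the toy samples are entirely hypothetical and are not taken from the study; treating a missing gene as zero expression is likewise an assumption of this sketch.

```python
import operator
from typing import Dict, List, Tuple

# A condition is (gene symbol, comparison, threshold).
Condition = Tuple[str, str, float]
OPS = {"<=": operator.le, ">": operator.gt}


def passes(sample: Dict[str, float], rule: List[Condition]) -> bool:
    """True if the sample satisfies every condition of the rule.

    Genes absent from the sample are treated as zero expression (assumption).
    """
    return all(OPS[op](sample.get(gene, 0.0), threshold) for gene, op, threshold in rule)


def passed_count(samples: List[Dict[str, float]], rule: List[Condition]) -> int:
    """Number of samples satisfying the rule."""
    return sum(passes(sample, rule) for sample in samples)


# Hypothetical rule with made-up thresholds, for illustration only.
toy_rule: List[Condition] = [("NPPA", ">", 1.0), ("LAMA2", "<=", 0.5)]

toy_samples = [
    {"NPPA": 2.3, "LAMA2": 0.1},  # satisfies both conditions
    {"NPPA": 0.0, "LAMA2": 0.0},  # fails the NPPA condition
    {"NPPA": 1.5, "LAMA2": 0.9},  # fails the LAMA2 condition
    {"NPPA": 3.0},                # LAMA2 missing, treated as 0.0
]

count = passed_count(toy_samples, toy_rule)
print(count)
```

Ranking the rules of each cell type by this count and keeping the top three would give a selection of the kind summarized in the table above.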