Zhandong Li, Xiaoyong Pan, Yu-Dong Cai.
Abstract
Diabetes is one of the most common diseases and a major threat to human health. Type 2 diabetes (T2D) accounts for about 90% of all cases. With the development of high-throughput sequencing technologies, more and more of the fundamental pathogenesis of T2D at the genetic and transcriptomic levels has been revealed. Recent single-cell sequencing can further reveal the cellular heterogeneity of complex diseases in an unprecedented way. To explore the molecular essence of T2D across multiple cell types, we investigated the expression profiles of 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified differential expression patterns and characteristics at the transcriptomic level that can distinguish the two groups of cells at the single-cell level. The expression profiles were analyzed with several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in the form of rules. These mechanisms involve newly recognized genes and are overlooked by traditional bulk sequencing techniques. In particular, the newly identified T2D genes were shown to follow specific quantitative rules with potential for diabetes prediction, and these rules further indicated several potential functional crosstalks involved in T2D.
Keywords: Monte Carlo feature selection; RIPPER; single-cell sequencing; support vector machine; type 2 diabetes
Year: 2022 PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. Workflow for key gene identification of type 2 diabetes. The MCFS method was used to evaluate the importance of all features (genes). On the one hand, the IFS method with SVM/RF/KNN was applied to the feature list yielded by the MCFS method to extract optimal T2D-associated genes and optimal classifiers. On the other hand, the informative features yielded by the MCFS method were fed into the Johnson reducer and RIPPER algorithms to construct optimal T2D-associated rules.
FIGURE 2. Performance of KNN integrated in IFS using different numbers of features. The y-axis is the F1-measure, and the x-axis is the number of features used. k is the KNN parameter giving the number of nearest neighbors used to make a prediction. KNN yields its best F1-measure of 0.886 when k = 5 and the top 665 features are used.
FIGURE 3. Bar chart showing five measurements for the three optimal classifiers, each based on a different classification algorithm.
FIGURE 4. Performance of RF integrated in IFS using different numbers of features. The y-axis is the F1-measure, and the x-axis is the number of features used. I is the RF parameter giving the number of decision trees. RF yields its best F1-measure of 0.907 when I = 100 and the top 305 features are used.
FIGURE 5. Performance of SVM integrated in IFS using different numbers of features. The y-axis is the F1-measure, and the x-axis is the number of features used. SVM yields its best F1-measure of 0.936 when the kernel is a linear function and the top 745 features are used.
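The IFS procedure behind Figures 2, 4, and 5 can be sketched as a loop over growing prefixes of the MCFS-ranked feature list, keeping the prefix length with the best F1-measure. The following is a minimal sketch, not the authors' implementation: the `evaluate` callback, the `step` parameter, and the toy scores are assumptions introduced for illustration. In practice `evaluate` would train a classifier (SVM, RF, or KNN) on the given features and return its F1-measure.

```python
from typing import Callable, Sequence, Tuple


def f1_measure(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    step: int = 1,  # assumption: the step size is not stated in this section
) -> Tuple[int, float]:
    """Score the top-k ranked features for k = step, 2*step, ... and keep the best k."""
    best_k, best_score = 0, float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        score = evaluate(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy stand-in for a real train-and-score routine (hypothetical values).
toy_scores = {1: 0.5, 2: 0.8, 3: 0.7}
best_k, best_score = incremental_feature_selection(
    ["gene_a", "gene_b", "gene_c"], lambda feats: toy_scores[len(feats)]
)
print(best_k, best_score)
```

Each curve in the figures corresponds to one such sweep, with the reported optimum being the feature count at which the score peaks.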
Top seven genes among the optimal genes for SVM.
| Rank | Gene ID | Gene symbol | RI |
|---|---|---|---|
| 1 | 100128906 | LOC100128906 | 0.1140 |
| 2 | 100873254 | MTND4P24 | 0.1046 |
| 3 | 100271063 | RPS14P1 | 0.1032 |
| 4 | 100652939 | MTND2P28 | 0.0979 |
| 5 | 285045 | LINC00486 | 0.0959 |
| 6 | 729898 | ZBTB8OSP2 | 0.0954 |
| 7 | 391524 | THRAP3P1 | 0.0862 |
Nine classification rules for diabetes generated by the RIPPER algorithm.
| Rule | Criteria | Predicted class |
|---|---|---|
| Rule 1 | Gene ID 100128906 (LOC100128906) ≥ 2.7722 | Non-diabetes |
| | Gene ID 326307 (RPL3P4) ≤ 15.2306 | |
| | Gene ID 8781 (PSPHP1) ≥ 0.0965 | |
| | Gene ID 100873065 (PTCHD1-AS) ≤ 0.1036 | |
| Rule 2 | Gene ID 100462954 (MICOS10P3) ≥ 2.0984 | Non-diabetes |
| | Gene ID 1487 (CTBP1) ≤ 17.3460 | |
| | Gene ID 326307 (RPL3P4) ≤ 6.2868 | |
| | Gene ID 100873254 (MTND4P24) ≥ 3.0364 | |
| Rule 3 | Gene ID 100128906 (LOC100128906) ≥ 49.6340 | Non-diabetes |
| | Gene ID 143244 (EIF5AL1) ≥ 1.0987 | |
| | Gene ID 486 (FXYD2) ≤ 152.8666 | |
| | Gene ID 326307 (RPL3P4) ≤ 11.3894 | |
| | Gene ID 6126 (RPL9P7) ≤ 103.5050 | |
| Rule 4 | Gene ID 100128906 (LOC100128906) ≥ 3.0256 | Non-diabetes |
| | Gene ID 326307 (RPL3P4) ≤ 22.4381 | |
| | Gene ID 100128906 (LOC100128906) ≥ 225.8732 | |
| | Gene ID 388147 (RPL9P9) ≤ 50.3934 | |
| | Gene ID 100271332 (RPL36AP21) ≥ 1.7952 | |
| | Gene ID 222901 (RPL23P8) ≤ 2.6067 | |
| Rule 5 | Gene ID 100652939 (MTND2P28) ≥ 450.8125 | Non-diabetes |
| | Gene ID 4574 (MT-TS1) ≤ 445.4115 | |
| | Gene ID 1487 (CTBP1) ≤ 37.6438 | |
| Rule 6 | Gene ID 285045 (LINC00486) ≤ 0.0930 | Non-diabetes |
| | Gene ID 100873254 (MTND4P24) ≤ 28.2479 | |
| | Gene ID 653147 (RPL26P30) ≥ 5.1856 | |
| | Gene ID 285900 (RPL6P20) ≥ 0.4760 | |
| | Gene ID 643932 (RPS3AP20) ≥ 5.5063 | |
| Rule 7 | Gene ID 100128906 (LOC100128906) ≥ 3.0256 | Non-diabetes |
| | Gene ID 440737 (RPL35P1) ≥ 4.118 | |
| | Gene ID 100271003 (RPL34P18) ≥ 9.0166 | |
| Rule 8 | Gene ID 100128906 (LOC100128906) ≥ 109.2232 | Non-diabetes |
| | Gene ID 100873254 (MTND4P24) ≤ 28.3353 | |
| | Gene ID 644972 (RPS3AP26) ≥ 53.5552 | |
| | Gene ID 644604 (EEF1A1P12) ≤ 7.9556 | |
| Rule 9 | Others | Diabetes |
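To make the rule table concrete, the sketch below applies two of the listed rules (Rule 5 and Rule 7, with thresholds copied from the table) to a single cell's expression profile and falls back to the default Rule 9. It assumes the usual RIPPER reading, in which every condition of a rule must hold for the rule to fire. The `classify` function, the treatment of a missing gene as expression 0.0, and the example expression values are all assumptions for illustration and do not come from the paper.

```python
from typing import Dict, List, Tuple

# (Entrez gene ID, comparison operator, threshold)
Condition = Tuple[int, str, float]

# Two of the eight non-diabetes rules; thresholds are taken from the table.
RULES: List[Tuple[str, List[Condition]]] = [
    ("Rule 5", [(100652939, ">=", 450.8125),   # MTND2P28
                (4574, "<=", 445.4115),        # MT-TS1
                (1487, "<=", 37.6438)]),       # CTBP1
    ("Rule 7", [(100128906, ">=", 3.0256),     # LOC100128906
                (440737, ">=", 4.118),         # RPL35P1
                (100271003, ">=", 9.0166)]),   # RPL34P18
]


def holds(value: float, op: str, threshold: float) -> bool:
    """Evaluate a single threshold condition."""
    return value >= threshold if op == ">=" else value <= threshold


def classify(expression: Dict[int, float]) -> Tuple[str, str]:
    """Return (rule name, label) for the first rule whose conditions all hold."""
    for name, conditions in RULES:
        # Assumption: a gene absent from the profile has expression 0.0.
        if all(holds(expression.get(gene, 0.0), op, thr) for gene, op, thr in conditions):
            return name, "Non-diabetes"
    return "Rule 9", "Diabetes"  # default rule: everything else


# Hypothetical expression profile for one cell, keyed by Entrez gene ID.
cell = {100128906: 10.0, 440737: 5.0, 100271003: 12.0}
print(classify(cell))
```

Because Rules 1 to 8 all predict the same class, the order in which they are checked changes only which rule is reported, not the predicted label.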
Performance of classifiers using informative features yielded by the MCFS method.
| Classification algorithm | F1-measure | Decrement |
|---|---|---|
| KNN (k = 1) | 0.849 | 0.036 |
| KNN (k = 5) | 0.839 | 0.047 |
| KNN (k = 10) | 0.847 | 0.033 |
| RF (I = 20) | 0.889 | 0.014 |
| RF (I = 40) | 0.891 | 0.013 |
| RF (I = 60) | 0.894 | 0.011 |
| RF (I = 80) | 0.894 | 0.010 |
| RF (I = 100) | 0.897 | 0.010 |
| SVM (linear kernel) | 0.882 | 0.054 |
| SVM (polynomial kernel) | 0.859 | 0.035 |
| SVM (RBF kernel) | 0.886 | 0.023 |
| SVM (sigmoid kernel) | 0.631 | 0.056 |
The Decrement column gives the F1-measure of the corresponding optimal classifier minus the F1-measure listed in the second column of this table.
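As a quick consistency check, adding each decrement back to the F1-measure in the table should recover the optimal F1-measure of the corresponding classifier. The short sketch below does this for the three rows whose optima are quoted in the figure captions (0.886 for KNN with k = 5, 0.907 for RF with I = 100, and 0.936 for SVM with a linear kernel). The helper name `implied_optimal_f1` is hypothetical.

```python
# (F1-measure with informative features, decrement), copied from the table.
informative = {
    "KNN (k = 5)": (0.839, 0.047),
    "RF (I = 100)": (0.897, 0.010),
    "SVM (linear kernel)": (0.882, 0.054),
}


def implied_optimal_f1(f1: float, decrement: float) -> float:
    """Optimal F1 implied by the table: listed F1 plus its decrement."""
    return round(f1 + decrement, 3)


implied = {name: implied_optimal_f1(f1, dec) for name, (f1, dec) in informative.items()}
for name, value in implied.items():
    print(f"{name}: {value:.3f}")
```

The three values agree with the best F1-measures reported for the IFS-selected classifiers, so the informative-feature classifiers give up between 0.010 and 0.054 in F1-measure for these settings.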
Significant Gene Ontology enrichment analysis result on rule genes.
| GO ID | Term | p-value | Cluster |
|---|---|---|---|
| GO:1903408 | Positive regulation of sodium:potassium-exchanging ATPase activity | 5.30E-04 | BP |
| GO:0045901 | Positive regulation of translational elongation | 7.00E-04 | BP |
| GO:0045905 | Positive regulation of translational termination | 7.00E-04 | BP |