HuaChun Yin, JingXin Tao, Yuyang Peng, Ying Xiong, Bo Li, Song Li, Hui Yang.
Abstract
In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. Robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), experiments frequently use small sample sizes. Methods that screen reliable and stable features are therefore urgently needed for analyses with limited sample sizes. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method consists of three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking on 165 real datasets. The results showed that, among these methods, MSPJ performed best on most small gene expression datasets, especially those with sample sizes below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
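The first MSPJ component, multiple random sampling with meta-analysis, can be sketched as follows. This is a hypothetical minimal version, not the published MSPJ code: it averages per-gene standardized mean differences (SMDs, computed here as a plain Cohen's d) over repeated random subsamples of the samples, whereas the actual method uses an improved meta-analysis; the helper names `smd` and `meta_smd` are illustrative.

```python
import numpy as np

def smd(case, ctrl):
    """Standardized mean difference (Cohen's d) for one gene."""
    sp = np.sqrt(((case.size - 1) * case.var(ddof=1)
                  + (ctrl.size - 1) * ctrl.var(ddof=1))
                 / (case.size + ctrl.size - 2))   # pooled standard deviation
    return (case.mean() - ctrl.mean()) / sp

def meta_smd(expr_case, expr_ctrl, n_samplings=50, frac=0.8, seed=0):
    """Average per-gene SMDs over repeated random subsamples of the columns
    (samples). Inputs are genes x samples matrices, one per group."""
    rng = np.random.default_rng(seed)
    n_genes = expr_case.shape[0]
    acc = np.zeros(n_genes)
    for _ in range(n_samplings):
        ci = rng.choice(expr_case.shape[1], int(frac * expr_case.shape[1]),
                        replace=False)            # subsample case columns
        ki = rng.choice(expr_ctrl.shape[1], int(frac * expr_ctrl.shape[1]),
                        replace=False)            # subsample control columns
        for g in range(n_genes):
            acc[g] += smd(expr_case[g, ci], expr_ctrl[g, ki])
    return acc / n_samplings                      # combined effect size per gene
```

Genes whose combined effect size is consistently large across subsamples are the stable candidates that the downstream SVM-RFE and permutation-test steps then filter.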
Keywords: AUC, area under the ROC curve (AUC); DEGs, differentially expressed genes; Differentially expressed genes; FDR, false discovery rate; Feature selection; GA, genetic algorithm; GEO, Gene Expression Omnibus; GO, gene ontology; MSPJ, the Joint method of Meta-analysis, SVM-RFE, and Permutation test; Machine learning; RF, random forest; ROC, receiver operating characteristic; Random sampling; SAM, significance analysis of microarrays; SMDs, standardized mean differences; SNR, signal noise ratio; SVM-RFE, support vector machines-recursive feature elimination; Small sample size; mRMR, minimum-redundancy-maximum-relevance
Year: 2022 PMID: 35891786 PMCID: PMC9304602 DOI: 10.1016/j.csbj.2022.07.022
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1 The working mechanism of MSPJ. RPBs, robust potential biomarkers.
Fig. 2 (A) Type I error control for 11 methods applied to small simulated microarray datasets. (B) Type I error control for 11 methods applied to small simulated RNA-seq datasets. For each method, the box plot represents the values obtained from 20 experiments. All simulated datasets had sample sizes < 30 and contained 6,000–20,000 genes. (C) The ranking of time consumption and memory usage for the 11 methods applied to simulated microarray and RNA-seq datasets.
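The permutation test, the third MSPJ component, controls type I error by construction, which is what panels (A) and (B) assess. A generic two-sample version for a single gene can be sketched as follows; this is an illustration, not the authors' code:

```python
import numpy as np

def perm_test(case, ctrl, n_perm=999, seed=0):
    """Two-sided permutation test on the difference of group means for one gene."""
    rng = np.random.default_rng(seed)
    obs = abs(case.mean() - ctrl.mean())          # observed statistic
    pooled = np.concatenate([case, ctrl])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)            # shuffle the group labels
        diff = abs(perm[:case.size].mean() - perm[case.size:].mean())
        hits += diff >= obs
    return (hits + 1) / (n_perm + 1)              # add-one rule avoids p = 0
```

Because the null distribution is built from the data themselves, the resulting p-value is valid without distributional assumptions, which is why permutation tests behave well at the very small sample sizes targeted here.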
Summary of real datasets.
| Technique | Accession number | Sample size | Gene number | Size per class | Organism | Ref. |
|---|---|---|---|---|---|---|
| Microarray | GSE16515 | 20 | 12,937 | 10:10 | Human | |
| RNA-seq | PMID: 21179090 | 18 | 9,300 | 12:6 | Fly | |
| Microarray | GSE10072 | 107 | 12,937 | 58:49 | Human | |
Fig. 3 Comparison of eleven methods with small sample sizes. (A) UpSet plot of DEGs obtained by 11 methods from a microarray dataset. (B) UpSet plot of DEGs obtained by 11 methods from the RNA-seq dataset. (C) The Jaccard scores of DEGs from the microarray dataset. (D) The Jaccard scores of DEGs from the RNA-seq dataset. (E) The Jaccard scores of GO terms from the microarray dataset. (F) The Jaccard scores of GO terms from the RNA-seq dataset. (G) The AUC values of the 11 methods for the microarray dataset. (H) The AUC values of the 11 methods for the RNA-seq dataset.
Fig. 4 Application of different methods to large-scale microarray datasets. Loess smoothing was used for curve fitting with < 1,000 observations, and a generalized additive model with > 1,000 observations. (A) The similarity of DEG analysis methods for small datasets. (B) Similarity of enriched GO terms for DEGs identified using different methods in small datasets. When the Jaccard score > 0.5, Jaccard index = 1 - Jaccard score; otherwise, Jaccard index = Jaccard score. Detailed information is provided in Table A.2. (C)–(F) The AUC, specificity, sensitivity, and accuracy of the top ten DEGs identified using different methods, respectively. The detailed values are reported in Table A.3.
Fig. 5 Application of different methods to large-scale RNA-seq datasets. Loess smoothing was used for curve fitting with < 1,000 observations, and a generalized additive model with > 1,000 observations. (A) Similarity of DEG analysis methods for small datasets. (B) Similarity of enriched GO terms for DEGs identified using different methods in small datasets. When the Jaccard score > 0.5, Jaccard index = 1 - Jaccard score; otherwise, Jaccard index = Jaccard score. Detailed information is provided in Table A.2. (C)–(F) The AUC, specificity, sensitivity, and accuracy of the top ten DEGs identified using different methods, respectively. The detailed values are reported in Table A.3.
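The Jaccard score used throughout these comparisons, and the folding rule quoted in the captions, can be written in a few lines. This is a small sketch of the standard set-similarity definition; the gene symbols in the usage below are arbitrary examples:

```python
def jaccard_score(a, b):
    """Jaccard similarity of two gene (or GO-term) sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def fold_jaccard(score):
    """Folding rule quoted in the figure captions:
    scores above 0.5 are mapped to 1 - score, otherwise kept as-is."""
    return 1.0 - score if score > 0.5 else score
```

For example, `jaccard_score({"TP53", "EGFR", "KRAS"}, {"EGFR", "KRAS", "MYC"})` gives 2/4 = 0.5: two shared genes out of four distinct genes overall.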
Fig. 6 Robust DEG detection at different sampling rates in large datasets. The entire process was repeated 10 times for random sampling, and each repetition employed a different random seed. (A) The number of discovered DEGs at different sampling rates. (B) The Jaccard scores of DEG counts for the comparison between the subset and the original microarray dataset. (C) The Jaccard scores of GO terms for the comparison between the subset and the original microarray dataset. (D) The Jaccard scores of the top 100 DEGs between the subset and the original microarray dataset. (E) The Jaccard scores of GO terms enriched from the top 100 DEGs between the subset and the original microarray dataset. (F)–(I) The AUC, specificity, sensitivity, and accuracy of the sub-sampled microarray datasets, respectively.
| Inputs: Training dataset X0 = [x1, x2, ..., xm]^T; class labels y = [y1, y2, ..., ym]^T |
| Output: Feature ranked list r |
| Initialize: Subset of surviving features s = [1, 2, ..., n]; |
| feature ranked list r = [ ] |
| While s is not empty: |
| 1) Restrict the training examples to the surviving features: X = X0(:, s) |
| 2) Train the SVM on (X, y) to obtain the weight vector w |
| 3) Compute the ranking criterion for each surviving feature: c_i = (w_i)^2 |
| 4) Find the feature with the smallest ranking criterion: f = argmin_i(c_i) |
| 5) Add the feature index to the ranked list: r = [s(f), r] |
| 6) Eliminate the feature index: s = s \ s(f) |
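The elimination loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: to keep the example self-contained, the SVM training step is replaced by a hypothetical ridge-regression stand-in (`ridge_weights`); any linear classifier that yields one weight per feature can be plugged in instead.

```python
import numpy as np

def ridge_weights(X, y, lam=1e-2):
    # Stand-in for SVM training: ridge-regression weights on +/-1 labels.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def rfe_rank(X, y, train_weights=ridge_weights):
    """Recursive feature elimination: train a linear model on the surviving
    features, drop the one with the smallest squared weight, record it, repeat."""
    surviving = list(range(X.shape[1]))         # s = [1, ..., n]
    ranked = []                                 # worst feature first; reversed at end
    while surviving:
        w = train_weights(X[:, surviving], y)   # step 2: weight vector w
        worst = int(np.argmin(w ** 2))          # steps 3-4: c_i = w_i^2, f = argmin
        ranked.append(surviving.pop(worst))     # steps 5-6: record and eliminate
    return ranked[::-1]                         # best-ranked feature first
```

On synthetic data where one feature carries the class signal, that feature survives every elimination round and therefore comes out first in the returned ranking.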