Literature DB >> 35891786

MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning.

HuaChun Yin^1,2,3, JingXin Tao², Yuyang Peng¹, Ying Xiong³, Bo Li², Song Li^1,4, Hui Yang^1,4.

Abstract

In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and insights into molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well-established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), small sample size is frequently used in experiments. Therefore, methods to screen reliable and stable features are urgently needed for analyses with limited sample size. In this study, MSPJ, a new machine learning approach for identifying DEGs was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method consists of three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and permutation test. MSPJ was compared with ten classical methods by 94 simulated datasets and large-scale benchmarking with 165 real datasets. The results showed that, among these methods MSPJ had the best performance in most small gene expression datasets, especially those with sample size below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.

Entities: Chemical

Keywords: AUC, area under the ROC curve (AUC); DEGs, differentially expressed genes; Differentially expressed genes; FDR, false positive rate; Feature selection; GA, genetic algorithm; GEO, Gene Expression Omnibus; GO, gene ontology; MSPJ, the Joint method of Meta-analysis, SVM-RFE, and Permutation test; Machine learning; RF, random forest; ROC, receiver operating characteristic; Random sampling; SAM, significance analysis of microarrays; SMDs, standardized mean differences; SNR, signal noise ratio; SVM-RFE, support vector machines-recursive feature elimination; Small sample size; mRMR, minimum-redundancy-maximum-relevance

Year: 2022 PMID： 35891786 PMCID： PMC9304602 DOI： 10.1016/j.csbj.2022.07.022

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 6.155

Introduction

Transcriptomics technology has become ubiquitous in systems biology, and identifying the differentially expressed genes (DEGs) is a critical step in analyses of these high-throughput data. Analyses of DEGs provide key insights into the mechanisms underlying diseases and a basis for the discovery of diagnostic biomarkers [1], [2], [3]. Meanwhile, the DEG analysis was applied to several various fields, such as ecology, evolution and population genetics [4], [5], [6]. However, identifying DEGs from high-dimensional datasets is a challenging task. In the context of transcriptomic data analyses, many candidate biomarkers are unstable and yield false positive results in clinical applications [7]. The instability of derived biomarkers has many explanations, of which a core reason may be that the sample size of datasets is insufficient [8], [9]. Large-scale datasets are considered to have significant statistical power for DEG identification [10], [11], whereas small sample size poses significant challenges for datasets analysis, including the “curse of dimensionality” and the overfitting of training datasets [12]. Unfortunately, large sample size is not typically feasible in broad laboratory biological experiments owing to the scarcity of specimens or the prohibitive costs of datasets preparation. To mitigate the reduced power and improve the stability of biomarkers in gene expression datasets with small sample sizes, many methods for DEG identification have been proposed (such as, based on noise distribution [13], [14] and optimized machine learning algorithms and prediction models [15], [16]). Apart from strategies to improve statistical robustness in studies with small sample sizes, multiple random sampling is recognized as a good strategy, in which samples are randomly extracted with replacement from the original dataset to construct independent sub-datasets [17]. The use of multiple sub-datasets can effectively adjust the skewed distributions of errors caused by a small sample size. This method of combining multiple datasets can simulate large-scale datasets and provide sufficient statistical power [18]. To improve the consistency of biomarkers derived from multiple datasets, several popular feature selection methods (such as meta-analysis, permutation test and SVM-RFE) were usually used in biomedicine filed [19], [20], [21]. Meta-analysis is a statistical technique for aggregating the results of various independent but related studies to identify DEGs [19], [22], [23], [24]. SVM-RFE is a method for feature selection by iteratively training an SVM classifier with the current set of features and removing the least important feature indicated by the SVM [25]. In cases with very small sample sizes, multiple sub-datasets are generated by random sampling, and the permutation-derived test statistics for each gene in sub-datasets are combined to determine DEGs [26], [27]. In this study, an improved ensemble learning-based method, MSPJ (the Joint method of Meta-analysis, SVM-RFE, and Permutation test) was proposed to discover DEGs adapting to transcriptome dataset with small sample size, so that improve the reproduction of DEGs by integrating results from multiple sub-datasets. With the aim of exploring stable molecular signatures by utilizing small samples to reconstruct large datasets and sub-datasets. MSPJ is expected to provide opportunities to identify reliable, stable biomarkers, irrespective of sample size, in a cost-effective manner.

Materials and methods

Multiple random sampling procedure

Resampling, a simple but powerful statistical technique, is used to obtain multiple subsets of data by random sampling with replacement. In this study, a multiple random sampling strategy was used to draw a sample from the original dataset without replacement to create subsets. Each subset had independent samples, and sizes were not identical. The random sampling method was as follows: Original dataset:}; Sample size: Take from } and from } where 3 & 3 Sampling times, Subset:; Sample size:. Where and represent the E-th sample of experimental group and C-th sample of the control group in the original dataset individually. stands for the original sample size, while and represent original sample size of the experimental and control group, respectively. Similarly, and represent the e-th sample of the experimental group and c-th sample of the control group in the sub-datasets. K denotes the number of random sampling sets. represents the subset sample size, represents the experimental group of subset sample size, represents the control group of subset sample size.

Application of the MSPJ approach to detect differentially expressed genes

To identify stable candidate biomarkers, the MSPJ method was built by integrating three algorithms (meta-analysis, SVM-RFE, and permutation test) for DEG detection. Using this joint method, shared DEGs estimated by the three algorithms were defined as robust biomarkers. An overview of the whole process of MSPJ is provided in Fig. 1. MSPJ procedure contains two steps. Firstly, three sets of DEGs were identified by three methods (meta-analysis based on multiple subsets, SVM-RFE based on nested 5-fold cross-validation, and permutation test in original dataset) individually. Secondly, the overlap of DEGs from three methods was obtained, and the share DEGs in overlapping area were considered as the robust potential biomarkers.

Fig. 1

A working mechanism of the MSPJ. The RPBs represents robust potential biomarkers.

Meta-analysis for gene selection

The meta-analysis strategy was used in the MSPJ method for the initial screening of DEGs by combining the results from multiple sub-datasets [28]. For continuous outcomes according to the following formula: 1) The standard deviation was computed for gene i in every sub-datasets: 2) The standardized mean differences (SMDs) were calculated in experimental and control groups of sub-datasets: ; 3) The weight vectors ; 4) The combining effect sizes were calculated: Where and are number of observations in experimental and control groups, and the and represent the standard deviation in experimental and control groups. The k is the N-th sub-datasets, N is the number of sub-datasets, is the weight vectors. And and denote the mean values in the two groups. A gene was considered up-regulated when SMD > 0.5 with 95% confidence interval did not cross the null hypothesis. In contrast, a gene was defined as a down-regulated gene if SMD < 0.5 with 95% confidence intervals did not exhibit an invalid line cross.

SVM-RFE algorithm for gene ranking

As one of the most widely used backward elimination algorithms, SVM-RFE works by iteratively removing the “worst” gene until the predefined size of the final gene subset is reached [29]. The feature ranking score as the ranking criterion was calculated by the coefficients of the weight vector in each dataset according to the following computing steps:Where is the number of training samples, is the -th training sample, encodes the class label of , is the surviving feature size, is the -th feature index, is the SVM classifier, is the weight vectors, is the ranking criteria and is the smallest ranking criterion.

Permutation test for the detection of DEGs

The permutation test is increasingly used to construct sampling distributions without replacement, especially for data sets with small sample sizes and ambiguous data distributions. Generally, DEGs were identified by a permutation test by the following basic steps: The experimental group: A }; The control group: B };here, is the statistic for the observation dataset, and is the statistic for the permutation subset. and are the mean expression values for the experimental and control groups. and are the mean expression values in experimental and control groups in the permutation dataset. The represents the sample sizes of A, and represents the sample sizes of B. The represents the permutation rounds. compute the difference in mean expression values between experimental and control groups in the original dataset: randomly sampling the mixed data [A, B] into subset and . compute the difference in mean expression values between experimental and control groups in the permuted dataset: repeat steps (2) and (3) times and evaluate all permutations, and return the p-value as the p-value of DEGs when exceeds :

Datasets used to assess the performance of MSPJ

To determine the validity of the MSPJ method for DGE detection, 259 gene expression datasets (containing 94 simulated and 165 real datasets) were used with different sizes generated by DNA microarray or RNA-seq platforms.

Simulated datasets

To test the stability of feature selection methods, several simulated gene expression datasets (including microarray and RNA-seq data) were utilized. Microarray datasets (log2-ratios) were simulated using the madsim package in R [30] and RNA-seq datasets were generated using the SPsimSeq package [31].

Real datasets

A number of real datasets with various sample sizes were applied to further compare MSPJ with established methods. These gene expression datasets were mainly collected from Gene Expression Omnibus (GEO) (). The detailed information for all gene expression datasets is provided in Table A.1.

Data preprocessing

Before DEG detection, raw *.cel files for the microarray datasets were read for RMA normalization and the generation of gene expression data matrices [32]. Then, the genes with low expression values in all samples (threshold < 0.1) were removed in subsequent analyses of each dataset [33]. To reduce sample error resulting from random sampling, the original datasets and subsets were further normalized by the upper quantile method using the preprocessCore package [34]. All computations were implemented in the R environment (v4.0.0).

Comparison with classical methods for gene selection

Ten methods were used for a comparative analysis with the newly developed MSPJ method, including five DEG identification methods (limma [35], significance analysis of microarrays (SAM) [36], T-test, Wilcoxon’s test, multtest [37]) and five gene rank-based methods (RankProd [38], signal noise ratio (SNR), minimum-redundancy-maximum-relevance (mRMR) [39], genetic algorithm (GA) [40], random forest (RF) [41]). Those ten methods have been widely used, and can be categorized into three main types: filter, wrapper, and embedded approach. Among of them, Limma, SAM, T-test, Wilcoxon’s test, multtest, mRMR, RankProd and SNR were the well-known filter approach algorithms [42]. GA was applied as a wrapper feature selection method, and RF was known as an embedded method [43], [44]. The gene expression matrices were used to identify DEGs by each of these 11 methods independently. The DEG identification methods provided directly a subset of genes, and the feature ranking methods provided the ranking. Genes were considered as DEGs when the adjusted p-value was below 0.05 or the arbitrary rank was in the top 30%. A gene set enrichment analysis of gene ontology (GO) terms was implemented using clusterProfiler with an adjusted p-value cutoff of 0.05 [45]. The Jaccard index was used to evaluated consensus DEGs and the related GO terms identified by different methods. The Jaccard index was defined as follows: , where A and B represent the DEGs or related GO terms identified by two methods, respectively.

Application of a classification algorithm for method assessment

Prediction modeling plays an important role in algorithm assessment by classifying experimental and control cases. In this study, the SVM algorithm was implemented using the R package e1071, and selected to assess the stability of biomarkers from MSPJ and conventional methods [46]. The top ten biomarkers were selected to construct the SVM classifier. The nested 5-fold cross-validation strategy was applied to partitioning the training-test datasets and calculating the evaluating indexes. The classification accuracy was used to assess the classification performance of gene features from training datasets. The area under the receiver operating characteristic (ROC) curve (AUC), specificity, sensitivity, and accuracy were used to evaluate the performance of the MSPJ and conditional methods using the pROC package [47].

Results

The type i error control in simulated datasets

The type I error rate was calculated as the proportion of simulations where non-DEGs tested positively, i.e., the false positive rate (FDR). The sample size of all simulated datasets was limited to a maximum of 30 samples, and all non-DEGs were false positive. To better compare feature selection methods based on hypothesis testing, the DEGs were identified using a nominal p-value cutoff of 0.05. As for gene rank-based methods, the DEGs were identified by an arbitrary rank (the preset DEGs number). In most cases, keeping the FDR < 10% is a conservative approach for a well-calibrated test. Under baseline conditions for 20 microarray simulation datasets, only three methods (MSPJ, SAM, and multtest) performed quite well and even in maintained FDR < 5% (Fig. 2A). The RF algorithm performed reasonably well but with the FDR of > 10%. However, limma, T-test, RankProd, and Wilcoxon’s test called around half non-DEGs. The GA, mRMR and SNR methods had a totally different gene list in simulation datasets.

Fig. 2

(A) Type I error control for 11 methods applied to small simulated microarray datasets. (B) Type I error control for 11 methods applied to small simulated RNA-seq datasets. For each method, the box plot represents the values obtained from 20 experiments. All samples in simulated datasets were<30 and contained 6000–20000 genes. (C) The rank of time consuming and memory usage for 11 methods applied to simulated microarray and RNA-seq datasets. As for the simulated RNA-seq datasets, all methods maintained the FDR of > 5% (Fig. 2B). For each of SNR, mRMR, MSPJ, SAM, and Wilcoxon’s test, the median FDR value is simultaneously less than < 30% in most datasets. Of which, SAM has a high upper-quantile, and limma, multtest, T-test, and RankProd showed an FDR of approximately 50%. The FDR values of GA and RF were much higher than those of other methods. It was noted that MSPJ method applied to small datasets showed better performance with respect to type I error control than those of 11 classical methods both in microarray and RNA-seq datasets.

The time consuming and memory usage in simulation datasets

The complexity algorithms may result in high computational cost during the massive iterations of feature identification, especially as the search strategies drift towards more exhaustive [48]. Hence, the time consuming and memory usage of 11 methods were evaluated using the simulation datasets. The detailed score of time consuming and memory usage were normalized across the different methods. For each method, the mean value of time consuming or memory usage obtained from three repeated experiments. The time and memory scores were shifted and scaled to σ = 1 and μ = 0, and then applied the unit probability density function of a normal distribution on these values to get the scores = 1-scores and back into [0,1] range [49]. In the Fig. 2C, all methods have consistency in the time consuming and memory usage both in microarray and RNA-seq technologies. Most of filter methods spend less memory and time than the wrapper and embedded methods in different samples and gene size. In terms of time consuming, limma, SNR, SAM and T-test were rarely influenced by sample and gene size. The time consuming of Wilcoxon’s test, mRMR, multtest and RF increased with sample and gene size. The RankProd, MSPJ and GA spend more time consuming on both microarray and RNA-seq technologies. As for the memory usage, multtest, limma, mRMR, SNR, Wilcoxon’s test and T-test achieved low memory usage. The memory usage of wrapper method (GA and MSPJ) was less than filter methods (SAM and RankPrond) and embedded method RF.

Comparative analysis of DEG detection from real datasets

To compare different methods for identifying DEGs, three datasets with known properties were utilized (Table 1). The Benjamini and Hochberg procedure was applied to statistical tests with no adjustment, like the MSPJ, T-test, and Wilcoxon’s test. As an adjusted p-value of < 0.01 was the criterion for DEG detection methods. The top 30% of genes were considered statistically significant for the gene rank-based methods. The identification of unique and common DEGs, gene set enrichment analyses, and AUC analyses for different sample sizes are important analysis types for a comprehensive evaluation.

Table 1

Summary of real datasets.

Technique	Accession number	Sample size	Gene numer	Size per class	Organism	Ref.
Microarray	GSE16515	20	12,937	10:10	Human	[50]
RNA-seq	PMID: 21179090	18	9,300	12:6	Fly	[51]
Microarray	GSE10072	107	12,937	58:49	Human	[52]

Summary of real datasets. Large-scale datasets (163 datasets with different sample sizes) were utilized to assess the robustness of gene selection methods. Detailed sample information was provided in Table A.1.

Assessment of the application of eleven methods to small datasets

The 11 individual gene selection methods were applied to two real datasets to identify DEGs. The GA, RF, and mRMR methods yielded the most unique DEGs; however, the intersection of DEGs obtained by other methods in microarray data was small (as shown in Fig. 3A). The MSPJ method generated the highest number of genes compared with the DEGs identified methods. The MSPJ, limma, SAM, RankProd, T-test, and Wilcoxon’s test generated more shared DEGs. For RNA-seq datasets, the number of DEGs identified by different methods were similar to those of the microarray datasets (Fig. 3B). The GA and RF methods detected more total and unique DEGs and generated similar results. Apart from DEG counts, the Jaccard scores were used to evaluate DEGs identified using different methods. Jaccard scores closing to 0.5 (between 0.4 and 0.6) indicated that similar unique and common DEGs were detected by two methods. Jaccard scores for comparisons between MSPJ and limma, SAM, T-test, and Wilcoxon’s test were all within 0.40–0.60 for the microarray datasets (Fig. 3C). The Jaccard scores for comparisons between MSPJ and limma, SAM, T-test, and Wilcoxon’s test fell in 0.42 ∼ 0.55 for RNA-seq datasets (Fig. 3D). Overall, the DEGs from MSPJ, limma, SAM, T-test and Wilcoxon’s test methods were very similar; however, the Jaccard scores for comparisons between MSPJ and other four methods were closer to 0.5 than those of comparisons in other methods.

Fig. 3

Comparison of eleven methods with small sample sizes. (A) UpSet plot of DEGs obtained by 11 methods from a microarray dataset. (B) UpSet plot of DEGs from 11 methods from the RNA-seq dataset. (C) The Jaccard scores of DEGs from the microarray dataset. (D) The Jaccard scores of DGEs from the RNA-seq dataset. (E) The Jaccard scores of GO terms from the microarray dataset. (F) The Jaccard scores of GO terms from the RNA-seq dataset. (G) The AUC values of 11 methods for the microarray dataset. (H) The AUC values of 11 methods for the RNA-seq dataset. The Jaccard scores of GO terms were consistent with those of the DEGs enriched by different methods. MSPJ-derived DEGs generated more unique and common biological functions in comparisons with ones of SAM, T-test, and Wilcoxon’s test (Jaccard score: 0.55–0.60) for microarray datasets, as shown in Fig. 3E. For RNA-seq datasets, there existed similar results according to biological functions (Jaccard scores of 0.55 ∼ 0.58) (Fig. 3F). The AUC values revealed that MSPJ, limma, RankProd, T-test and Wilcoxon’s test outperformed other methods in DEG discovery for all datasets (Fig. 3G and H).

Good performance of MSPJ revealed by testing on large-scale datasets

We next compared various methods for analyses of large-scale datasets, 123 microarray datasets and 40 RNA-seq datasets were tested. After identifying DEGs using diverse methods, the Jaccard score was computed to evaluate consensus with respect to DEGs and GO enrichment. Overall, the similarity between the methods was not significant for DEG identification both in microarray and RNA-seq datasets. In microarray datasets, mRMR, SNR, MSPJ, RF, RankProd and GA had higher Jaccard scores than those of other methods for all sample sizes (Fig. 4A). But around 10 to 40 samples, SNR, mRMR, MSPJ, and RF had higher Jaccard scores than those of the other methods. Each method retained its independence for DEG detection on different sample sizes, except RankProd had high Jaccard scores around 20 to 40 samples. For GO terms, most methods showed similar trends in Jaccard scores, while T-test, GA, and limma showed a decrease as the sample size increased (Fig. 4B). Of note, RF, SAM, RankProd, and MSPJ showed high Jaccard scores for sample sizes between 20 and 30. According to Jaccard scores for DEGs and functional enrichment, only RF and MSPJ stayed on top of the momentum under 30 sample sizes. Consistency among the detected DEGs does not fully reflect the advantages and disadvantages of different methods; accordingly, we evaluated the discriminant ability of the model based on the top 10 DEGs between experimental and control groups in the DNA microarray datasets. The top 10 DEGs from five DEG identification methods were ranked by the p-value. Several indices should be considered for the classification and prediction of biomarkers, such as AUC, specificity, sensitivity, and accuracy. For these four indices, MSPJ and SNR showed good performance with small sample sizes (Fig. 4C–F). Then, RankProd, RF, and GA showed similar trends with respect to sample size, behind MSPJ and SNR.

Fig. 4

Application of different methods to large-scale microarray datasets. The value of the smoothing parameter (loess) for a curve fitting was chosen for<1,000 observations, and the generalized additive model was used for>1,000 observations. (A) The similarity of DEG analysis methods for small datasets. (B) Similarity of enriched GO terms for DEGs identified using different methods in small datasets. When Jaccard score > 0.5, Jaccard index = 1 - Jaccard score, else Jaccard index = Jaccard score. Detailed information is provided in Table A.2. (C)∼(F) showed the AUC, specificity, sensitivity and accuracy of the top ten DEGs identified using different methods, respectively. The detailed values are reported in Table A.3. In RNA-seq datasets, SNR, limma, MSPJ, and T-test had higher Jaccard scores than those of the other methods for around 12 to 60 samples (Fig. 5A). For GO terms, MSPJ, and RankProd showed higher Jaccard scores than those of the other methods with samples size between 12 and 50 (Fig. 5B). According to Jaccard scores for DEGs and functional enrichment, only MSPJ kept the higher Jaccard for around 12 to 50 samples. Similarly, according to four indices (AUC, specificity, sensitivity, and accuracy), MSPJ, limma, T-test and SNR showed good performance with sample size between 12 and 70 (Fig. 5C–F).

Fig. 5

Application of different methods to large-scale RNA-seq datasets. The value of the smoothing parameter (loess) for a curve fitting was chosen for<1,000 observations, and the generalized additive model was used for>1,000 observations. (A) Similarity of DEG analysis methods for small datasets. (B) Similarity of enriched GO terms for DEGs identified using different methods in small datasets. When Jaccard score > 0.5, Jaccard index = 1 - Jaccard score, else Jaccard index = Jaccard score. Detailed information is provided in Table A.2. (C)∼(F) showed the AUC, specificity, sensitivity and accuracy of the top ten DEGs identified using different methods, respectively. The detailed values are reported in Table A.3.

Robustness of MSPJ for DEG identification in small datasets

To explore the stability of the performance of MSPJ for feature retrieval in small size datasets, the impact of sample size was investigated by computing overlap in DEGs and GO terms between subsets and the original dataset for 11 methods. Six sub-datasets with 10%, 20%, 30%, 40%, 50%, and 60% of samples from each of two groups were generated by random sampling, and the random sampling was repeated ten times. The average Jaccard index for overlap in the DEGs or GO terms between subsets and the original dataset was evaluated for 11 methods and various small sample sizes. Except for the case where Jaccard value = 1, a high Jaccard score (i. e., the positive value close to 1) indicates that the method is more robust regardless of changes in sample size. AUC, specificity, sensitivity, and accuracy for different sample sizes were also evaluated. One example of DNA microarray dataset GSE10072 containing 107 samples, DEG identification methods were sensitive to changes in sample size (Fig. 6A). The number of DEGs detected by most methods increased with the sampling rate, other than GA, SNR, mRMR and RF. Among all methods, MSPJ showed the least variation in DEG numbers across sample rates. For most methods, except for multtest, SNR, GA, mRMR, and RF, the Jaccard scores of DEGs and GO terms between subsets and the original dataset increased as the sample size increased (Fig, 6B and C). In the assessment of the sensitivity for sample size, mRMR showed complete concordance with DEG detection (Jaccard scores = 1), shown in Fig. 6B-E. Therefore, mRMR will not be compared with other methods in terms of the consistency of DEGs and GO terms. MSPJ clearly showed the highest Jaccard score of DEGs for sub-datasets in the 10–20% range. Up to 30%, MSPJ was only behind the T-test. For GO terms, MSPJ also stably ranked high in the 10–30% range. We also considered the retention of the top 100, 200, and 300 DEGs by different methods with regard to sampling rate between the sub-dataset and original dataset. For both DEGs and GO enrichment terms, RankProd consistently showed the highest Jaccard score for all sampling rates and all top-ranked genes (Fig. 6D and E, Fig. A.1). MSPJ showed a moderate Jaccard score for DEG detection, and the Jaccard scores for MSPJ were among the top five for the evaluation of GO terms for all top ranked genes and 10–30% sampling rates.

Fig. 6

Robust DEG detection for different sampling rates in large datasets. The entire process was repeated 10 times for random sampling, and each repetition employed a different random seed. (A) The number of discovered DEGs for different sampling rates. (B) The Jaccard scores of DEG counts for the comparison between the subset and original microarray dataset. (C) The Jaccard scores of GO terms for the comparison between the subset and microarray original dataset. (D) The Jaccard scores of the top 100 DEGs between the subset and microarray original dataset. (E) The Jaccard scores of GO terms enriched from the top 100 DEGs between the subset and microarray original dataset. (F)∼(I) The AUC, specificity, sensitivity and accuracy of sub-sampling microarray datasets, respectively. Most methods consistently showed high AUC values and high specificity in the 10–60% range, except for mRMR, multtest, GA, SNR, and Wilcoxon’s test (Fig. 6F and G). For sensitivity, GA, RankProd, mRMR, and SNR showed decrease as the sampling rate increased, while other methods maintained a high sensitivity across all sample rates (Fig. 6H). Moreover, the biomarkers identified by GA, mRMR, and SNR had a lower accuracy for the prediction of control and experimental groups than that of other methods, irrespective of the sampling rate (Fig. 6I). Overall, MSPJ, limma, RF, and T-test have superior performance based on four indices for the assessment of classification and prediction.

Discussion

Owing to high experimental costs or low availability of samples, many gene expression datasets have small sample sizes. DEG analysis methods often show poor performance when the sample size is small [53]. The scDEA method proposed by Li et al. [54] was developed for differential expression analyses of scRNA-seq data. Further, the scDEA did not provide a significant advantage over other approaches when the sample size is small. To resolve these issues, an improved gene selection approach named MSPJ was developed in this study by integrating meta-analysis, SVM-RFE, and permutation test frameworks. To our knowledge, this is the first differential expression analysis method specifically targeting small gene expression datasets. In this study, we compared various methods with default parameters, as implemented in widely used packages. Ten representative methods were used for a comparative analysis with MSPJ. Among these, limma is often used for gene discovery by differential expression analyses of microarray and high-throughput PCR data, and SAM is a frequently used nonparametric method for analyses of microarray dataset. The T-test, multtest, Wilcoxon’s test, mRMR and RankProd (rank product method) methods are well-established statistical methodologies for feature selection [38]. The RankProd method is expected to be applicable to small datasets [55]. As a supervised machine learning approach, RF has gained substantial popularity for feature selection [56]. The GA approach, as an unsupervised search method, is often used to select a set of features to discriminate between groups, especially for classification in cases with small sample sizes [57]. Thus, RF and GA were used as representative supervised and unsupervised strategies for our comparative analysis. SNR is a signal–noise-ratio based feature selection method for ranking genes [58]. Our comparison of these distinct approaches for DEG identification provided a general and important reference for research in the field. MSPJ was obviously superior to other methods with respect to type I error control using simulated microarray datasets. Using simulated RNA-seq datasets, MSPJ ranked highly in terms of type I error control, only behind SNR and mRMR. The high noise level of RNA-seq datasets is an issue for the accurate detection of DEGs [59]. Hence, it is reasonable that the SNR method is superior to the MSPJ method for analyses of simulated RNA-seq datasets. Nevertheless, SNR and mRMR had a quite loose on type I error control in microarray datasets. Perez’s research indicated that mRMR algorithm is not suitable for high domain feature problems [60], and this may be the reason why mRMR kept unstable in high microarray and RNA-seq dimensional data, in terms of type I error control. MSPJ had the robust type I error control both on microarray and RNA-seq datasets. In terms of time and memory consumption, MSPJ did not perform well. Because it was based on meta-analysis for 40 sub-datasets and classifier model development provided by machine learning, and the proposed method taken slightly high computational cost for feature selection. However, it was worth to consider exchanging more time consuming and memory usage for robust feature detection based on small samples. Using real datasets for a large-scale microarray and RNA-seq datasets, SNR and MSPJ had an outperformance in terms of similarity of gene detection for <30 samples. In terms of gene enrichment of functional entries, RankProd and MSPJ had good performance under 30 samples. However, MSPJ and SNR were comparable and superior to the other methods in terms of feature gene classification and prediction both in microarray and RNA-seq datasets. In brief, the overall results for 165 bulk datasets revealed that MSPJ showed good performance for DEG detection under small sample sizes. In this study, the DEGs identified by individual methods were also assessed with different sampling rates. In terms of DEG counts, the rank-based methods were not included in the evaluation. For the 10–30% sampling rate, the T-test, MSPJ, limma, and Wilcoxon's test detected the most DEGs. Moreover, we found that the number of DEGs identified using MSPJ was least influenced by sample size, while the number of DEGs obtained by other methods was more sensitive to sample size. Furthermore, comparing the gene identification and GO terms of sub-datasets and the biomarker classification of sub-datasets from different methods, changes in sample size clearly had the least impact on the mRMR, T-test and MSPJ, in the range of 10–30%. The more detailed analysis revealed that RankProd was demonstrated the most robust DEG detection. Although MSPJ was not the most stable with respect to gene ranking, and it still outperformed many methods across different sampling rates. T-test method outperformed than others in several aspects, however, it had some serious limitations in actual scenario, such as the absence of fold change values of genes between controls and cases, and assumptions regarding the distribution of datasets (normal distribution), and so on [61]. The distribution assumption also applied to SAM and limma (normal or Poisson distribution) [35]. MSPJ had no restrictions with respect to the distribution of the data and therefore could be applied to various types of omics datasets, such as proteomics, metabolomics, and single cell RNA-seq datasets. Moreover, the MSPJ method could be used to visualize expression levels of each gene with estimated significance measurements (Fig. A.2). Overall, our results support the performance of MSPJ for DEG identification in datasets with small sample sizes, especially those with<30 samples. Although the datasets considered in our study were large and included datasets from various scientific fields, it is not, strictly speaking, representative of a “population of datasets” and hence the results may not be generalizable. Accordingly, the method does not necessarily perform better on all datasets and its performance is not limited to the sample sizes evaluated. Depending on the research requirements, our method can be an additional option for gene discovery.

Conclusion

Comprehensive analyses were conducted to evaluate the performance of the newly developed method, MSPJ, for analyses of transcriptome datasets with different sample sizes, including simulated and real datasets. Our systematic large-scale comparative analysis using 165 real datasets revealed that MSPJ showed good average prediction performance for biomarkers, with high rates of common and unique of DEGs and gene function identification. The robustness to sample size enables effective DEG detection by MSPJ. The MSPJ method described here is effective under limited sample sizes for gene expression datasets and thus provides stable scores. Our method can be easily applied to high-throughput transcriptional datasets of any size from microarray or RNA-seq experiments. It is even theoretical possible to apply the method to other omics data types due to the free distribution (e.g., metabolomics, proteomics, and single-cell RNA-seq).

Availability of data and code

All artificial datasets used to evaluate the performance of gene selection methods were deposited in the Zenodo repository (). The real datasets were collected from GEO and pre-processed according to the pipeline described in this manuscript. All R source code for MSPJ, dataset analyses and benchmarking in this study was released on GitHub ().

Funding

This work was supported by grants from the Nursery Project of Army Medical University (No. 2019R054), Natural Science Foundation of Chongqing, China (Grant No. CSTC2019JCYJ-MSXMX0527), Open Fund of Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology, Yunnan University, Chongqing Technology Innovation and Application Development Special key Project (cstc2019jscx-dxwtBX0010), and Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202100538).

Authorship statement

Hui Yang and Bo Li conceived the study and provided the relevant bioinformatics technologies; HuaChun Yin and Yuyang Peng performed the datasets analysis. HuaChun Yin, Bo Li and JingXin Tao wrote the R code. Ying Xiong and Song Li supervised the project. Song Li, Bo Li, and Huachun Yin wrote the manuscript.

CRediT authorship contribution statement

HuaChun Yin: Software, Methodology, Validation, Formal analysis, Investigation, Data curation, Resources, Writing – original draft, Writing – review & editing, Visualization. JingXin Tao: Software, Methodology. Yuyang Peng: Validation. Ying Xiong: Supervision. Bo Li: Conceptualization, Software, Methodology, Resources, Writing – original draft, Writing – review & editing. Song Li: Writing – original draft, Writing – review & editing, Supervision. Hui Yang: Conceptualization, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Inputs:

Training dataset {Xi,Yi}i=1N

Output:

Feature ranked list R.

Initialize:

Subset of surviving features,S=[1,2,⋯n]

Feature ranked list R, R=[]

While S is not empty, do:

1). Restrict the features of Xi to the remaining S

2). Train SVM to get weight vectors ω=∑iαiYiXi, i=1,2,3, …,N

3). the ranking score of i-th feature is defined as ∁i=(ωi)2,i = 1,2,3, …,N

4). Find the feature with the smallest ranking criteria f=argmin(∁i)

5). Add feature index f to R, R=f∪R

6). Eliminate the feature index f from S, S=S\f

55 in total

1. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis.

Authors: Fangxin Hong; Rainer Breitling; Connor W McEntee; Ben S Wittner; Jennifer L Nemhauser; Joanne Chory
Journal: Bioinformatics Date: 2006-09-18 Impact factor: 6.937

2. Derivation of stable microarray cancer-differentiating signatures using consensus scoring of multiple random sampling and gene-ranking consistency evaluation.

Authors: Zhi Qun Tang; Lian Yi Han; Hong Huang Lin; Juan Cui; Jia Jia; Boon Chuan Low; Bao Wen Li; Yu Zong Chen
Journal: Cancer Res Date: 2007-10-15 Impact factor: 12.701

3. KnowSeq R-Bioc package: The automatic smart gene expression tool for retrieving relevant biological knowledge.

Authors: Daniel Castillo-Secilla; Juan Manuel Gálvez; Francisco Carrillo-Perez; Marta Verona-Almeida; Daniel Redondo-Sánchez; Francisco Manuel Ortuno; Luis Javier Herrera; Ignacio Rojas
Journal: Comput Biol Med Date: 2021-04-13 Impact factor: 4.589

4. Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data.

Authors: Pilar García-Díaz; Isabel Sánchez-Berriel; Juan A Martínez-Rojas; Ana M Diez-Pascual
Journal: Genomics Date: 2019-11-20 Impact factor: 5.736

5. Characterization of Gene Expression Phenotype in Amyotrophic Lateral Sclerosis Monocytes.

Authors: Weihua Zhao; David R Beers; Kristopher G Hooten; Douglas H Sieglaff; Aijun Zhang; Shanker Kalyana-Sundaram; Christopher M Traini; Wendy S Halsey; Ashley M Hughes; Ganesh M Sathe; George P Livi; Guo-Huang Fan; Stanley H Appel
Journal: JAMA Neurol Date: 2017-06-01 Impact factor: 18.302