A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma.

Omar Abdelwahab^1, Nourelislam Awad^1,2, Menattallah Elserafy^1,3, Eman Badr^1,4.

Abstract

Lung cancer (LC) is one of the most frequently diagnosed cancers worldwide. Of the many types of LC, lung adenocarcinoma (LUAD) is the most common. Although RNA-seq and microarray data provide a vast amount of gene expression data, most of the genes are insignificant to clinical diagnosis. Feature selection (FS) techniques overcome the high dimensionality and sparsity issues of such large-scale data. We propose a framework that applies an ensemble of feature selection techniques to identify genes highly correlated to LUAD. Utilizing LUAD RNA-seq data from The Cancer Genome Atlas (TCGA), we employed mutual information (MI) and recursive feature elimination (RFE) feature selection techniques along with a support vector machine (SVM) classification model. We also utilized Random Forest (RF) as an embedded FS technique. The results were integrated and candidate biomarker genes across all techniques were identified. The proposed framework identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model trained on the expression profiles of the identified biomarkers achieved an accuracy of 97.99%. In addition, differential gene expression analysis showed that all 12 genes were significantly differentially expressed between normal and LUAD tissues, and they are strongly correlated with LUAD according to previous reports. We propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.

Year: 2022 PMID: 36067196 PMCID: PMC9447897 DOI: 10.1371/journal.pone.0269126

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Detecting the genes most strongly correlated with a specific disease has been a major computational problem. Standard statistical methods such as the t-test, linear regression, or negative binomial models are used to identify differentially expressed genes, providing a large number of candidate genes [1-3]. However, only a few of these candidates contribute significantly to the pathology and response to treatment. Therefore, feature selection (FS) techniques have been utilized to identify potential gene biomarkers whose expression profiling can help in phenotypic differentiation [4-8]. FS techniques are used to identify genes whose transcriptomic profiling varies significantly across sample groups. Feature selection reduces the dimensionality of the input data before a predictive model is constructed, without losing relevant information. Additionally, it increases the speed of learning, facilitates generalization, and improves performance [9]. Applying feature selection to large-scale data such as RNA-seq allows important features to be extracted and overcomes the “curse of dimensionality” problem. The curse of dimensionality appears when the number of features grows while the number of samples remains much smaller, as in the RNA-seq data case. Although a higher number of features should in principle provide more information, in practice it includes more redundant and possibly noisy data. More complex models are required to handle such high-dimensional data, which can lead to overfitting [10-12]. Thus, employing multiple feature selection techniques effectively decreases the number of utilized features and identifies the most significant ones.

Different studies have utilized feature selection to detect the transcriptomic signature of different diseases. Huijuan et al. introduced a hybrid FS technique that combines mutual information maximization with an adaptive genetic algorithm and analyzed DNA microarray data from six cancer datasets. The authors showed that utilizing multiple techniques increased classification accuracy and reduced feature dimensionality [4]. Tabl et al. used Chi-square and Info-Gain along with a tree-based model to predict the 5-year survivability of breast cancer patients [11]. Li et al. applied the mutual information method followed by incremental feature selection with a support vector machine (SVM) classifier and selected 23 discriminative genes for osteoarthritis, achieving 97.1% accuracy [13]. Chen et al. utilized the Monte-Carlo feature selection method with an SVM classifier to identify gene expression signatures in multiple types of neural stem cells [14] (the hybrid feature selection methods are reviewed in [11]).

Developing a reliable computational approach to determine gene expression signatures improves the diagnosis of complex diseases, as a small number of correlated genes can be exploited and further investigated in clinical settings. This is especially important for developing countries, where RNA-seq and transcriptome profiling of patients’ samples are too costly to guide the choice of the best therapeutic approach. Thus, analyzing a small set of candidate genes will contribute to more accurate therapy prescription in a cost-efficient manner.

In this article, we propose a framework in which a combination of feature selection methods and a prediction model is used to detect a biomarker profile that differentiates between normal samples and lung adenocarcinoma patients. We selected lung cancer (LC) as it is one of the most prevalent malignancies worldwide and the most common cause of global cancer-associated mortality, with a five-year survival rate. Lung adenocarcinoma (LUAD) is a subtype of lung cancer whose causes are still ambiguous. One of the possible causes might be deficiencies in therapeutic methods and difficulties in early diagnosis. The early diagnosis of cancer contributes to increasing the survival rate, which makes it important to create other diagnostic tools for LUAD [15].

In an attempt to identify the genes most significantly correlated with LUAD, we utilized mutual information (MI) [16] and recursive feature elimination (RFE) feature selection techniques along with the SVM classification model [17]. In addition, we utilized Random Forest (RF) as an embedded FS technique [17]. Our framework takes advantage of filter, wrapper, and embedded feature selection methods. Filter techniques focus mainly on the statistical characteristics of the input data, so features are selected based on the correlation between the feature and the target class, independent of a classification model. MI was utilized to measure the relevance of the features to the classes and the redundancy among them, which reduces the number of highly correlated features. However, it produces a relatively large number of features. Utilizing a wrapper-based technique, where MI was employed with SVM as a classification model, significantly reduced the number of selected features. In this case, the features are selected based on the SVM performance. RFE is another well-known feature reduction technique widely used in machine learning to reduce high-dimensional data despite its high computational time [17-22]. Finally, Random Forest (RF) is used as an embedded technique where feature selection is a part of the classifier construction process. RF is not sensitive to outliers and reduces feature correlations, but it is prone to overfitting [23, 24]. All of these methods have been utilized to identify a specific subset of features as candidate biomarkers. Utilizing multiple FS techniques maximizes their advantages and alleviates their disadvantages. We hypothesize that consensus features among all FS methods yield the most significant biomarkers.

Interestingly, we observed noticeable variations in each technique’s candidate genes, but identifying the common candidates between all techniques yielded 12 genes that are strongly correlated with LUAD, as illustrated later in the discussion section. DESeq2 [25] has been utilized for results verification. It is a standard pipeline that is very commonly used by biologists. Its results are reliable and robust to outliers [26, 27]. Upon performing differential gene expression analysis using DESeq2, the 12 genes were found to be significantly differentially expressed between LUAD and normal samples. Our predictive model trained on gene biomarker profiling achieves an accuracy of 97.99% and is capable of identifying candidates that are highly correlated to LUAD.

Results

A framework to identify genes highly correlated to LUAD

In this study, we propose a framework that applies three feature selection techniques to identify genes highly correlated to LUAD (Fig 1). The LUAD RNA-seq data was obtained from The Cancer Genome Atlas (TCGA-LUAD). Each technique was applied separately, together with an SVM classification model in the case of MI and RFE, to obtain the key features with high diagnostic value. Then, the results were integrated and candidate biomarker genes across all techniques were identified.

Fig 1

An overview of our proposed framework.

Twelve potential biomarker genes are identified by MI-SVM, RFE-SVM, and random forest models

Mutual information selection is used to obtain the best subset of features that can generate the highest accuracy score in differentiating between normal and LUAD/tumor samples. MI ranks the genes in the dataset from the most to the least correlated with the two classes (normal and tumor). Utilizing the MI method, 45292 features (gene expression values) were selected and ranked according to their importance. As a filtering technique, MI produced an enormous number of features and did not reduce the feature space as expected. Based on the ranked feature list, we therefore followed a wrapper method utilizing SVM. We focused on the 1000 highest-ranked features from the MI results. SVM was applied to consecutive feature subsets, starting with the two highest-ranked features. The first 19 MI-ranked features recorded the best weighted accuracy score of 98.64%. Fig 2A illustrates the accuracy achieved by the SVM classifier for the different feature sets. The highest accuracy was achieved at 19 features, followed by a gradual decline as more features were added. The full list of the 19 MI-SVM features is given in S1 Table.
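The filter-then-wrapper procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses a synthetic matrix in place of the TCGA-LUAD expression data, a linear SVM kernel, plain (unweighted) accuracy, and a hypothetical cap `max_k` on the number of ranked features scanned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-genes matrix (assumption, not TCGA-LUAD).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Filter step: score every feature by its mutual information with the label,
# using the training split only, then sort from most to least informative.
mi_scores = mutual_info_classif(X_train, y_train, random_state=0)
ranking = np.argsort(mi_scores)[::-1]

# Wrapper step: fit an SVM on the top-k ranked features for k = 2, 3, ...
max_k = 20  # hypothetical cap; the text scans the top 1000 ranked features
accuracies = {}
for k in range(2, max_k + 1):
    top_k = ranking[:k]
    clf = SVC(kernel="linear").fit(X_train[:, top_k], y_train)
    accuracies[k] = accuracy_score(y_test, clf.predict(X_test[:, top_k]))

# Keep the smallest k that reaches the best accuracy (ties go to the first k).
best_k = max(accuracies, key=accuracies.get)
print(best_k, round(accuracies[best_k], 4))
```

Plotting `accuracies` against `k` gives an incremental feature selection curve of the kind shown in Fig 2A.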

Fig 2

The incremental feature selection curves for the MI-SVM, RFE-SVM, and random forest models.

The number of genes along with the corresponding SVM model weighted accuracy are shown (A and B) while the number of trees versus the RF achieved accuracy is shown in (C). (A) The peak of the curve is achieved at 19 genes with an accuracy of 98.64%. (B) The peak of the curve is achieved at 76 genes with an accuracy of 97.73%. (C) Utilizing 345 trees, the random forest model identified 1261 features and achieved an accuracy of 98.64%.

RFE is a wrapper technique in which the least important features are removed recursively until a desired subset of features is reached based on the chosen predictive model. We performed 1000 iterations to determine the best subset of features starting with one feature. The best weighted accuracy score with the fewest features was 97.73%, obtained with 76 features. Fig 2B illustrates the accuracy scores against the number of RFE-SVM features. The full list of the 76 candidate biomarkers is given in S2 Table.

Random forest is an embedded FS technique, where both feature selection and classification are performed together. In order to determine the best number of trees, we tested different numbers of trees (up to 1000 trees). Utilizing 345 trees, a performance of 98.64% was achieved. The resulting incremental feature selection curve is illustrated in Fig 2C. The random forest was generated using 1261 features, which are listed in S3 Table.

The different techniques were compared in terms of precision, recall, specificity, balanced accuracy, and F1-score (Table 1). The Receiver Operating Characteristic (ROC) metric with stratified 5-fold cross-validation has also been calculated (Fig 3). The results are comparable, although the sets of biomarker genes identified by the individual methods are not identical. Most evaluation metrics for each feature selection method exceeded 93% on the test data. Specificity ranged from 87% to 91%, indicating that some normal samples were misclassified as LUAD. This may be due to the small number of normal samples.
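The wrapper (RFE-SVM) and embedded (RF) selection steps can be illustrated with scikit-learn as below. This is a sketch under stated assumptions: the data are synthetic, the target of 10 features is arbitrary, the SVM kernel is assumed to be linear, and treating every feature with non-zero importance as "selected" by the forest is an assumption, since the text does not state the selection rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (assumption).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# Wrapper: RFE refits the linear SVM and drops the feature with the smallest
# absolute weight at each round (step=1) until 10 features remain.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X, y)
rfe_selected = np.flatnonzero(rfe.support_)

# Embedded: the forest computes feature importances as part of training.
rf = RandomForestClassifier(n_estimators=345, random_state=0).fit(X, y)
rf_selected = np.flatnonzero(rf.feature_importances_ > 0)  # assumed rule

print(len(rfe_selected), len(rf_selected))
```

Repeating the RFE fit for different values of `n_features_to_select` and scoring each subset would trace a curve comparable to Fig 2B.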

Table 1

A detailed evaluation table of MI-SVM, RFE-SVM, and RF models in terms of precision, recall, specificity, F1 score, and the mean AUC.

Technique	MI-SVM	RFE-SVM	RF
Number of Features	19 features	76 features	1261 features
Precision	0.9866	0.9778	0.9865
Recall (Sensitivity)	0.9864	0.9773	0.9864
Specificity	0.8773	0.9167	0.8773
Balanced Accuracy	0.9318	0.9470	0.9318
F1-Score	0.9859	0.9775	0.9859
Mean AUC	0.9940±0.0037	0.9880±0.0089	0.9949±0.0004
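
As a quick consistency check on Table 1, the short sketch below recomputes balanced accuracy from the reported recall and specificity. It assumes the reported balanced accuracy is the mean of sensitivity and specificity, which is the usual two-class definition; the helper name `balanced_accuracy` is illustrative.

```python
# Recall (sensitivity) and specificity as reported in Table 1.
table1 = {
    "MI-SVM": {"recall": 0.9864, "specificity": 0.8773},
    "RFE-SVM": {"recall": 0.9773, "specificity": 0.9167},
    "RF": {"recall": 0.9864, "specificity": 0.8773},
}

def balanced_accuracy(recall, specificity):
    """Mean of sensitivity and specificity for a two-class problem."""
    return (recall + specificity) / 2

balanced = {name: balanced_accuracy(**m) for name, m in table1.items()}
for name, value in balanced.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the Balanced Accuracy row of Table 1 to within rounding.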

Fig 3

ROC and AUC analysis for different feature selection techniques.

(A) MI-SVM model. (B) RFE-SVM model. (C) RF model.

The selected features reported by the MI-SVM, RFE-SVM, and RF were integrated as shown in (Fig 4). Overall, 12 features are reported as common between all methods. However, 44 features were additionally reported as common between at least two of the FS techniques. The MI-SVM and RF have 18 common features, which represent most of the features generated from the MI-SVM algorithm.

Fig 4

A Venn diagram illustrating the number of features of each model and the common features across all techniques.

Regarding the 76 RFE-SVM features, 12 features are common with MI-SVM while 50 are common with RF features (Fig 4). As random forest yielded the largest number of features, it was expected to have more features in common with the other methods. Utilizing multiple well-known FS techniques maximizes the advantages of the individual methods. The list of genes identified by all three methods or at least by two of the methods is presented in Table 2.

Table 2

List of features common between all selection techniques or common between at least two selection techniques.

Features common between all selection techniques (12 genes):
ADRB2, AGER, CAVIN2, CLEC3B, C10orf67, FABP4, FAM107A/DRR1, LOC105376453, RGCC, SFTPC, SLC6A4, STX11

Features common between at least two selection techniques (44 genes):
AC009093.3, AC025048.1, AC104984.4, ADGRE3, ADRB1, ALAS2, ANGPT4, CAV1, CD300LG, CD5L, CHRM1, CLDN18, CLEC4M, EPAS1, ERCC6L, FCN3, FMO2, GPM6A, GYPE, HBA2, HBB, HBM, LANCL1-AS1, LINC00656, LINC00968, NCAPGP2, NCKAP5, OTC, RGS9, RTKN2, S1PR1, SEMA3G, SH3GL3, SMAD6, SOX17, SPAAR, SPOCK2, SSTR4, TEK, TMEM100, TNNC1, TOP2A, VIPR1, WNT3A
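
The integration step amounts to counting how many methods selected each feature. The sketch below shows one way to do this with plain Python sets; the gene names and the helper `consensus` are hypothetical placeholders, not the actual per-method lists from S1 to S3 Tables.

```python
from collections import Counter

def consensus(feature_sets):
    """Return (features chosen by every method,
    features chosen by at least two methods but not all)."""
    counts = Counter(f for s in feature_sets.values() for f in set(s))
    n_methods = len(feature_sets)
    in_all = {f for f, c in counts.items() if c == n_methods}
    in_two_plus = {f for f, c in counts.items() if 2 <= c < n_methods}
    return in_all, in_two_plus

# Hypothetical toy gene lists for the three methods.
selected = {
    "MI-SVM": {"geneA", "geneB", "geneC"},
    "RFE-SVM": {"geneA", "geneD", "geneE"},
    "RF": {"geneA", "geneB", "geneD", "geneF"},
}
in_all, in_two_plus = consensus(selected)
print(sorted(in_all))       # ['geneA']
print(sorted(in_two_plus))  # ['geneB', 'geneD']
```

The same bookkeeping reproduces the counts reported for Fig 4: with pairwise overlaps of 18 (MI-SVM and RF), 12 (MI-SVM and RFE-SVM), and 50 (RFE-SVM and RF), and 12 features shared by all three, the features shared by at least two but not all methods number (18 - 12) + (12 - 12) + (50 - 12) = 44.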

To evaluate the candidate biomarkers reported by all techniques, an SVM model was constructed using only the 12 identified biomarker genes. The model achieved an accuracy score of 0.9799±0.0069 using stratified 5-fold cross-validation. Other evaluation measures have also been computed (Table 3). The proposed model achieved a mean AUC value of 0.9934±0.0022 with stratified 5-fold cross-validation (Fig 5). Furthermore, another SVM classification model was developed using the 56 features (the 12 common features together with the 44 features shared by at least two techniques). This classifier achieved 97.27% accuracy. It is clear that utilizing only 12 genes yields results comparable with those of the individual FS methods, but with a much smaller number of genes. Although the mutual information method performed well with a relatively small number of features, utilizing multiple methods reduces the number of candidate biomarkers while increasing their biological relevance. An external dataset (GSE81809) was used to evaluate the proposed model (Table 3). Overall, all evaluation metrics exceeded 92% on this dataset. Fig 6 illustrates the ROC analysis with an AUC value of 1.0000 using stratified 5-fold cross-validation.

Table 3

Evaluation statistics of the proposed model with the candidate biomarker using the testing samples and the external dataset.

Model	Precision	Recall (Sensitivity)	Specificity	Accuracy	Balanced Accuracy	F1-Score	AUC
Proposed model (Testing)	0.9768	0.9773	0.8359	0.9799±0.0069	0.9066	0.9765	0.9934±0.0022
Proposed model (External dataset)	0.9649	0.9629	0.9259	0.9629	0.9444	0.9623	1.0000

Fig 5

ROC and AUC analysis.

Using the proposed model of the 12 candidate biomarkers with stratified 5-fold cross-validation.

Fig 6

ROC and AUC validation of the proposed model using the external dataset (GSE81809).

To further support the output of our framework, we have developed a random labeled model where the training set labels have been randomized. Five-fold cross-validation was conducted where a balanced accuracy of 0.4990± 0.0020 was achieved. The mean AUC value of 0.5203± 0.0669 using 5-fold cross-validation has also been reported along with the ROC curves (Fig 7). Moreover, we have generated 100 random labeled models. The mean balanced accuracy of the generated models was 0.5208. Fig 8 is a summary figure to illustrate the balanced accuracy achieved by the random models.
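
A randomized-label control of this kind can be sketched as follows. The example is illustrative only: it assumes scikit-learn, a synthetic 12-feature dataset, a linear SVM, and 20 repetitions instead of 100 to keep it quick, and it permutes all labels before cross-validation as a simplification of randomizing the training labels. The helper `mean_balanced_accuracy` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the 12-gene expression profiles (assumption).
X, y = make_classification(n_samples=200, n_features=12, n_informative=6,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rng = np.random.default_rng(0)

def mean_balanced_accuracy(labels):
    """Mean balanced accuracy of a linear SVM over stratified 5-fold CV."""
    scores = cross_val_score(SVC(kernel="linear"), X, labels, cv=cv,
                             scoring="balanced_accuracy")
    return scores.mean()

# Model trained on the true labels versus models trained on shuffled labels.
real_score = mean_balanced_accuracy(y)
random_scores = [mean_balanced_accuracy(rng.permutation(y)) for _ in range(20)]

print(round(real_score, 4), round(float(np.mean(random_scores)), 4))
```

If the features carry real signal, the shuffled-label scores should cluster around chance level (about 0.5 for two classes) while the true-label score stays well above it.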

Fig 7

ROC and AUC analysis for a randomized version of the proposed model.

Fig 8

The balanced accuracy scores that were achieved by running 100 random labeled models.

The X-axis is the attempt number. Y-axis indicates the balanced accuracy score.

The candidate genes identified by the feature selection techniques are differentially expressed between normal and tumor samples

To confirm the output of our framework, we performed differential expression analysis using DESeq2 [25]. We identified 5911 differentially expressed genes (DEGs) between normal (N) and tumor (T) samples (S4 Table). Among the identified DEGs, we found that the 12 common genes obtained by the three different feature selection techniques are downregulated in tumor samples; this was also evident upon plotting the normalized counts of normal versus tumor samples (Fig 9). Similarly, upon plotting the normalized counts for the 44 genes identified by at least two selection techniques, we found that the majority of the genes are downregulated in tumors in comparison to normal samples, with the exception of TOP2A and ERCC6L, which were upregulated in tumor samples (S4 Table and Figs 10–12).

Fig 9

Boxplots representing the expression level of the 12 common candidate genes in LUAD patients in comparison to normal samples.