
Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models.

Abstract

Detecting divergence between oncogenic tumors plays a pivotal role in cancer diagnosis and therapy. This research work was focused on designing a computational strategy to predict the class of lung cancer tumors from the structural and physicochemical properties (1497 attributes) of protein sequences obtained from genes defined by microarray analysis. The proposed methodology involved the use of hybrid feature selection techniques (gain ratio and correlation based subset evaluators with Incremental Feature Selection) followed by Bayesian Network prediction to discriminate lung cancer tumors as Small Cell Lung Cancer (SCLC), Non-Small Cell Lung Cancer (NSCLC) and the COMMON classes. Moreover, this methodology eliminated the need for extensive data cleansing strategies on the protein properties and revealed the optimal and minimal set of features that contributed to lung cancer tumor classification with an improved accuracy compared to previous work. We also attempted to predict via supervised clustering the possible clusters in the lung tumor data. Our results revealed that supervised clustering algorithms exhibited poor performance in differentiating the lung tumor classes. Hybrid feature selection identified the distribution of solvent accessibility, polarizability and hydrophobicity as the highest ranked features with Incremental feature selection and Bayesian Network prediction generating the optimal Jack-knife cross validation accuracy of 87.6%. Precise categorization of oncogenic genes causing SCLC and NSCLC based on the structural and physicochemical properties of their protein sequences is expected to unravel the functionality of proteins that are essential in maintaining the genomic integrity of a cell and also act as an informative source for drug design, targeting essential protein properties and their composition that are found to exist in lung cancer tumors.

Substances：
Proteins

Year: 2013 PMID： 23505559 PMCID： PMC3591381 DOI： 10.1371/journal.pone.0058772

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Oncogenic tumors are the leading cause of death around the world with Lung Cancer bearing the major toll of malignant fatalities [1]–[3]. Smoking and use of tobacco along with diverse environmental carcinogens increased human susceptibility to this deadly ailment [4]–[5]. Gene Polymorphisms concerned with detoxification of carcinogens have been associated with formation of lung tumors. Lung tumors have been broadly categorized as Non-Small Cell Lung Cancer (NSCLC) affecting nearly two-thirds of patients with a low-survival rate and Small Cell Lung Cancer (SCLC), both of which respond to different forms of therapy [6]–[10]. This drives the need to precisely identify pathological differences between these two types of tumors. Gene expression patterns from microarray analysis enabled the sub-categorization of lung cancer types that related to the degree of tumor demarcation, nature of therapy and victim survival rate [11]–[14]. It was an established fact that Lung carcinogenesis was a process that involved gradual phenotypic changes that occurred as a result of onco-gene activation and deactivation of tumor suppressor genes [8]. Reports thus far in literature have failed to identify any reliable biomarkers for this condition since wet-lab experiments often consumed more time, expertise and capital with unsure returns [1] [4]–[6]. Microarray technology has been utilized in the recent past to detect appropriate biomarkers but present methodologies were more susceptible to overlook potential facts contained in patient tissue samples [14]. Hence determination of potential and informative markers (diagnostic and prognostic) from both the biological and molecular perspective is highly essential to study and evaluate the genetic and molecular distinctiveness that characterized tumors and Tumor Node metastasis (TNM) staging in lung carcinogenesis to make possible effective diagnosis, and corroborate therapeutic strategies. 
Several classifiers and data mining models have recently been applied to the categorization of lung cancer tumors. Forty-one samples, characterized by 26 attributes computed from the mass-to-charge ratio (m/z) and peak heights of proteins identified by mass spectroscopy of blood serum from lung cancer patients and unaffected individuals, were used to train a classification and regression tree (CART) model [13]. A percentage train-test approach was used to evaluate the reliability of cDNA microarray-based molecular classification of resected human non-small cell lung cancers (NSCLCs) [14]. In further research, Linear Discriminant Analysis and Artificial Neural Network classification of individual lung cancer cell lines (SCLC and NSCLC) was performed based on DNA methylation markers [13]; the results indicated that Artificial Neural Network analysis of DNA methylation data was a promising technique for developing automated methods for lung cancer classification. In another study, a Support Vector Machine [14] was applied to a lung cancer gene expression database, and the results suggested that incorporating prior knowledge into gene expression-based cancer classification was essential to improve classification accuracy. Automatic classification of lung TNM cancer stages from free-text pathology reports using symbolic rule-based classification has also been attempted [15]. That methodology was assessed using accuracy measures and confusion matrices against a database of multidisciplinary team staging decisions and against a machine learning-based text classification system using support vector machines. The current investigation focused on a recent article by Hosseinzadeh et al. [1] that aimed to classify lung cancer tumors based on structural and physicochemical properties of proteins using bioinformatics models. We chose this paper for three main reasons.
(i) The work is recent and the data are publicly available. (ii) The research involved extensive data cleaning and pre-processing, which can be avoided. (iii) Their work made a few assumptions about the data that are not adopted here. Moreover, the method proposed in this paper achieved higher classification accuracy in differentiating between lung cancer tumors based on protein properties while retaining the original data and eliminating those assumptions. Specifically, this paper makes the following contributions: (a) a new methodology with hybrid feature selection techniques that identifies the optimal protein features distinguishing lung cancer tumors with higher accuracy; (b) elimination of the need for data cleaning and for assumptions on attribute significance; (c) identification of contributing features that are believed to be relevant to drug design targeting the protein properties associated with lung cancer tumors.

Materials and Methods

Dataset

The Gene Set Enrichment Analysis database (GSEA db) [16] was used to obtain the gene sets that contributed to the development of NSCLC and SCLC. These gene sets were derived from the Kyoto Encyclopaedia of Genes and Genomes (KEGG) [17]. A total of 84 genes [17] were present in the SCLC gene set while 54 genes [17] were found to contribute to NSCLC. In order to precisely discriminate between the two classes of tumors, the genes occurring in both gene sets were placed in a separate class called COMMON. After this separation, the SCLC class contained 59 genes, the NSCLC class 29 genes and the COMMON class 25 genes. Proteins for each group of genes were obtained from the Gene Card database [18] and the corresponding protein sequences were extracted from the UniProt Knowledgebase [19]. These sequences were saved as a text file and loaded onto the PROFEAT web server [20]–[21] to compute the structural and physicochemical properties associated with each protein. A total of 1497 attributes were computed and represented as Fi.j.k.l, where ‘i’ signified the feature group, ‘j’ the feature, ‘k’ the descriptor and ‘l’ the descriptor value [20]–[21]. The features and their annotations are provided as File S1. The complete data set, comprising 1497 features and 113 tumor samples [17], was loaded into the WEKA 3.7.7 machine learning software [22] with the tumor type set as the target class. The complete pre-processed dataset is provided as File S2. The difference in sample size compared to previous work is attributed to possible updates to the database. The methodology proposed in this research work is described in the following section.
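
The class sizes quoted above follow from simple set arithmetic: genes shared by both KEGG gene sets go to COMMON, and the remainder of each set forms its own class. The following is a minimal sketch of that partition; the gene identifiers are placeholders invented for illustration (they are not the actual KEGG genes), and only the set sizes are taken from the text.

```python
def split_gene_sets(sclc_genes, nsclc_genes):
    """Partition two gene sets into SCLC-only, NSCLC-only and COMMON classes."""
    sclc, nsclc = set(sclc_genes), set(nsclc_genes)
    common = sclc & nsclc  # genes present in both gene sets
    return {"SCLC": sclc - common, "NSCLC": nsclc - common, "COMMON": common}


# Placeholder identifiers (hypothetical), sized to match the counts in the text.
shared = {f"C{i}" for i in range(25)}
sclc_set = shared | {f"S{i}" for i in range(59)}    # 84 genes in the SCLC gene set
nsclc_set = shared | {f"N{i}" for i in range(29)}   # 54 genes in the NSCLC gene set

classes = split_gene_sets(sclc_set, nsclc_set)
sizes = {name: len(genes) for name, genes in classes.items()}
total = sum(sizes.values())
print(sizes, total)
```

With these sizes the three classes hold 59, 29 and 25 genes, which sum to the 113 samples used in the study.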

Proposed Computational Methodology

The proposed methodology comprised two phases: the training phase and the prediction phase. The training phase incorporated data preparation, feature selection and classification, while the prediction phase involved evaluation of the classifier model using the Jack-knife cross-validation test based on two performance parameters [23]–[24]: the Matthews Correlation Coefficient (MCC) and Accuracy. The diagrammatic representation of the proposed methodology is given in Figure 1. The data preparation step categorized the input gene sets into the SCLC, NSCLC and COMMON classes. This was followed by Hybrid Feature Selection with Incremental Feature Selection. The classification models were then built and compared to identify the best performing computational prediction technique for lung tumor classification using protein structural and physicochemical properties.

Figure 1

Proposed computational methodology for lung tumor classification from protein sequence properties.

Hybrid Feature Selection

Feature ranking presented significant features in the order of their contribution to categorizing the samples under the different target classes [25]–[28]. Since most feature selection algorithms focused on ranking the attributes according to their significance value, the liability of choosing the limiting constraint rested with the user [29]–[31]. Hence in order to automate the process of finding the minimal yet optimal set of features, the ranking feature selection algorithms were followed by Correlation Subset Evaluators [32] that included features highly correlated to the class and least correlated to each other. Since both the ranking and subset evaluators were utilized to obtain the optimal feature set, this was termed the Hybrid Feature Selection strategy. The description of the methods used in this research is detailed below.

Gain Ratio Criterion

The Gain Ratio criterion [33]–[34] revealed the association between an attribute and the class value, being primarily computed from the Information Gain using the Information Entropy (InfoE) values [35]. Let H(S_R) be the entropy of the set of all records S_R, let ‘F’ be the set of all features, and let Value(r,f) be the value of a specific instance r ∈ S_R for the feature f ∈ F. Information Gain for the attribute was computed using Equation (1) as follows [35]:

$$IG(S_R, f) = H(S_R) - \sum_{v \in \mathrm{Values}(f)} \frac{|\{r \in S_R : \mathrm{Value}(r,f) = v\}|}{|S_R|} \, H(\{r \in S_R : \mathrm{Value}(r,f) = v\}) \qquad (1)$$

In order to compute the Intrinsic Value for a test, the following formula was adopted:

$$IV(S_R, f) = - \sum_{v \in \mathrm{Values}(f)} \frac{|\{r \in S_R : \mathrm{Value}(r,f) = v\}|}{|S_R|} \, \log_2 \frac{|\{r \in S_R : \mathrm{Value}(r,f) = v\}|}{|S_R|} \qquad (2)$$

The Information Gain Ratio [33]–[35] was calculated as the ratio between the Information Gain and the Intrinsic Value, according to Equation (3):

$$IGR(S_R, f) = \frac{IG(S_R, f)}{IV(S_R, f)} \qquad (3)$$

The attributes were thus ranked in descending order of their Gain Ratio score and passed to the CFS Subset Evaluator method described below.
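
To make the ranking criterion concrete, the sketch below computes the entropy, Information Gain, Intrinsic Value and Gain Ratio for a single attribute. It is an illustration only, not the WEKA implementation used in the study: it assumes the attribute values have already been discretized into categories (the PROFEAT descriptors are numeric), and the toy values and labels are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one discretized attribute with respect to the class."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)  # class labels grouped by attribute value
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    intrinsic = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    # An attribute with a single value carries no information; avoid dividing by zero.
    return info_gain / intrinsic if intrinsic > 0 else 0.0


# Toy data (hypothetical): one attribute separates the classes, the other does not.
labels = ["SCLC", "SCLC", "NSCLC", "NSCLC"]
informative = ["low", "low", "high", "high"]
uninformative = ["low", "high", "low", "high"]

gr_informative = gain_ratio(informative, labels)
gr_uninformative = gain_ratio(uninformative, labels)
print(gr_informative, gr_uninformative)
```

An attribute that splits the records perfectly by class receives the maximum score, while one whose values are independent of the class receives zero, which is why only attributes with a gain ratio greater than zero are worth passing on to the subset evaluator.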

Correlation Feature Selection (CFS) Subset Evaluator

The CFS hypothesis [36] suggested that the most predictive features needed to be highly correlated to the target class and least correlated to the other predictor attributes. The following equation [36]–[37] gave the merit of a feature subset S consisting of ‘k’ features:

$$Merit_{S_k} = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}} \qquad (4)$$

where $\overline{r_{cf}}$ was the average value of all feature-classification correlations, and $\overline{r_{ff}}$ was the average value of all feature-feature correlations. The CFS criterion [36] was defined as follows:

$$CFS = \max_{S_k} \left[ \frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2 \sum_{i<j} r_{f_i f_j}}} \right] \qquad (5)$$

where the $r_{cf_i}$ and $r_{f_i f_j}$ variables were referred to as correlations. The attributes that showed a high correlation to the target class and least correlation to each other were chosen as the best subset of attributes. The attributes filtered by the CFS Subset Evaluator method were added in an incremental manner to identify the optimal set of features that contributed to lung tumor categorization. This methodology is reported below.
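
The effect of the CFS merit can be seen with a small numerical sketch. The function below evaluates the merit of a subset from its feature-class correlations and its pairwise feature-feature correlations; the correlation values are invented for illustration, and the search over subsets (Best First in the study) is not shown.

```python
from math import sqrt


def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a subset: k * mean(r_cf) / sqrt(k + k(k-1) * mean(r_ff)).

    feature_class_corr: one correlation per feature in the subset.
    feature_feature_corr: one correlation per unordered feature pair.
    """
    k = len(feature_class_corr)
    mean_cf = sum(feature_class_corr) / k
    # A single-feature subset has no pairs, so the redundancy term is zero.
    mean_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
               if feature_feature_corr else 0.0)
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


# Hypothetical correlations: both subsets have equally predictive features,
# but the first pair is highly redundant and the second is nearly independent.
redundant_merit = cfs_merit([0.6, 0.6], [0.9])
complementary_merit = cfs_merit([0.6, 0.6], [0.1])
single_merit = cfs_merit([0.6], [])
print(redundant_merit, complementary_merit, single_merit)
```

The redundant pair scores barely above a single feature, whereas the complementary pair scores clearly higher, which is the behaviour the CFS hypothesis describes.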

Incremental Feature Selection

The predictor attributes generated by the Gain Ratio and CFS Subset Attribute Evaluator (Hybrid Feature Selection) method were later utilized for Incremental Feature Selection (IFS) [38]–[39] to determine the minimal and optimal set of features. On adding each feature, a new feature set was obtained, and the kth feature set could be stated as

$$AT_k = \{f_1, f_2, \ldots, f_k\}, \quad 1 \le k \le M \qquad (6)$$

where M denoted the total number of predictor subsets. For each feature set, the predictor model was constructed and tested through the Jack-knife cross-validation method. The MCC and Accuracy of cross-validation were measured, leading to the formation of the IFS table listing the number of features and the classification accuracy they were able to generate. ‘ATo’ was the minimal and optimal feature set that achieved the highest MCC and accuracy. In order to determine the best classification model for lung tumor classification [40], a total of five benchmark prediction techniques, viz. Support Vector Machine [29], Random Forest [1], Nearest Neighbor algorithm [39], Bayesian Network Learning [22] and Random Committee (Ensemble classifier) [22], were analyzed and compared. Our results affirmed that the Bayesian Network approach generated higher accuracy in tumor classification with the optimal feature set.
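
The IFS procedure can be sketched as a loop over growing prefixes of the ranked feature list. In the sketch below, `evaluate` is a hypothetical callback standing in for "train the classifier on this subset and return its jack-knife score"; the feature names and scores are invented, and the strict comparison is an assumption that keeps the smallest subset when scores tie.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score every prefix of a ranked feature list and keep the best one.

    Returns the best subset, its score and the full IFS table of (k, score) rows.
    """
    best_score, best_subset = float("-inf"), []
    table = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]  # the k-th feature set: top-k ranked features
        score = evaluate(subset)
        table.append((k, score))
        if score > best_score:  # strict '>' keeps the smaller subset on ties
            best_score, best_subset = score, subset
    return best_subset, best_score, table


# Hypothetical ranked features and cross-validation scores, for illustration only.
ranked = ["f1", "f2", "f3", "f4"]
toy_scores = {1: 0.60, 2: 0.72, 3: 0.81, 4: 0.79}

best_subset, best_score, ifs_table = incremental_feature_selection(
    ranked, lambda subset: toy_scores[len(subset)]
)
print(best_subset, best_score)
```

Plotting the second column of the IFS table against the first gives an IFS curve of the kind shown in Figure 2, with the peak marking the optimal feature set.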

Bayesian Network Learning

The learning phase in this approach incorporated the process of finding an appropriate Bayesian network [41] given a data set D over R, where R = {r1, …, rn}, n ≥ 1, was the set of input variables. The classification task consisted of classifying a variable V = v0, called the class variable (NSCLC/SCLC/COMMON), given a set of variables R = r1 … rn. A classifier C: r → v was a function that mapped an instance of ‘r’ to a value of ‘v’. The classifier was learned from a dataset D that consisted of samples over (r, v) [42]. A Bayesian network over a set of variables R consisted of a network structure B_S, a directed acyclic graph (DAG) over the set of variables R, and a set of probability tables [43] given by

$$B_P = \{\, p(r \mid pa(r)) : r \in R \,\} \qquad (7)$$

where pa(r) was the set of parents of r in B_S, and the network represented a probability distribution given by Eq. (8):

$$P(R) = \prod_{r \in R} p(r \mid pa(r)) \qquad (8)$$

The inference made from the Bayesian Network [41]–[43] was to allocate the category with the maximum probability [44]. The Simple Estimator with the K2 local search method using the Bayes score was utilized (default parameters) for the execution of the algorithm in WEKA 3.7.7 [22]. The clustering methods are described in the following section.
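
The factorization and the maximum-probability decision rule can be illustrated with a very small network. The sketch below is not the WEKA BayesNet learner: the structure (a class node V with two child attributes) and every probability in the tables are invented for illustration, and structure learning with K2 is not shown.

```python
def joint_probability(assignment, parents, cpt):
    """Product of p(node | parents(node)) over all nodes of the network."""
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        prob *= cpt[node][parent_values][value]
    return prob


def classify(evidence, class_node, class_values, parents, cpt):
    """Return the class value with the highest joint probability, plus all scores."""
    scores = {
        c: joint_probability({**evidence, class_node: c}, parents, cpt)
        for c in class_values
    }
    return max(scores, key=scores.get), scores


# Hypothetical network: class V is the only parent of attributes r1 and r2.
parents = {"V": (), "r1": ("V",), "r2": ("V",)}
cpt = {
    "V": {(): {"SCLC": 0.5, "NSCLC": 0.3, "COMMON": 0.2}},
    "r1": {
        ("SCLC",): {"high": 0.8, "low": 0.2},
        ("NSCLC",): {"high": 0.3, "low": 0.7},
        ("COMMON",): {"high": 0.5, "low": 0.5},
    },
    "r2": {
        ("SCLC",): {"high": 0.4, "low": 0.6},
        ("NSCLC",): {"high": 0.1, "low": 0.9},
        ("COMMON",): {"high": 0.7, "low": 0.3},
    },
}

predicted, scores = classify(
    {"r1": "high", "r2": "low"}, "V", ["SCLC", "NSCLC", "COMMON"], parents, cpt
)
print(predicted, scores)
```

Because the evidence is the same for every candidate class, comparing joint probabilities is equivalent to comparing the posterior probabilities of the class given the attributes.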

Supervised Clustering

Supervised clustering [45]–[47] deviated from unsupervised clustering in that it was applied to already categorized examples with the prime aim of detecting clusters that had high probability density with respect to a single class. Supervised clustering required the number of clusters to be kept to a minimum, and objects were assigned to clusters using the notion of closeness with respect to a given distance function [48]–[49]. Supervised clustering evaluated a clustering technique based on the following two criteria [47]–[49]: (i) Class impurity, Impurity(X), measured by the percentage of marginal examples in the different clusters of a clustering X, where a marginal example was an example that belonged to a class different from the most frequent class in its cluster; (ii) Number of clusters, k. In this research we compared the classes to cluster evaluation accuracy of seven clustering algorithms [22], namely the Expectation-Maximization (EM) algorithm, COBWEB [22], Hierarchical clustering, K-Means clustering, Farthest First clustering, Density-Based clustering and Filtered clustering. The number of clusters was automatically assigned in the COBWEB algorithm whereas the remaining algorithms allowed the user to select the desired number of clusters [22]. Some algorithms exhibited better performance when all the attributes were included for clustering, while their performance deteriorated on the hybrid feature selection datasets. The performance evaluation methods and parameters are described in the subsequent sections.
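
The class impurity criterion defined above can be computed directly from cluster assignments and class labels. The sketch below returns the impurity as a fraction rather than a percentage; the cluster assignments are invented, and the function is an illustration of the definition rather than WEKA's classes to clusters evaluation.

```python
from collections import Counter


def class_impurity(cluster_ids, class_labels):
    """Fraction of marginal examples: those not in their cluster's majority class."""
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)
    marginal = sum(
        len(labels) - Counter(labels).most_common(1)[0][1]  # size minus majority count
        for labels in members.values()
    )
    return marginal / len(class_labels)


# Hypothetical clustering of six samples into two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["SCLC", "SCLC", "NSCLC", "NSCLC", "NSCLC", "COMMON"]

impurity = class_impurity(cluster_ids, class_labels)
print(impurity)
```

Here each cluster contains one example outside its majority class, so two of the six examples are marginal.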

Jack-knife Cross-Validation Test

Statistical prediction methods [50] were utilized for measuring the predictor performance in order to assess their efficiency in practical applications. In this study, the jack-knife cross-validation method [50]–[51] was used for verification and validation of classifier accuracy, since previous reports have stated it to be the least arbitrary in nature and it is widely used by researchers and practitioners to estimate the performance of predictors. In jack-knife cross-validation [38]–[39] [52], each one of the statistical records in the training dataset was in turn singled out as a test sample and the predictor was trained on the remaining samples. During the jack-knifing process [23]–[24] [39], both the training dataset and testing dataset were actually open, and a statistical sample moved from one group to the other. In this research, the following indexes [50]–[52] were adopted to test the proposed methodology:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (9)$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$

where MCC reflected the Matthews Correlation Coefficient; ACC reflected the accuracy, i.e., the rate of correctly predicted lung cancer tumor class; TP, TN, FP and FN denoted the number of true positives, true negatives, false positives and false negatives, respectively.
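
The jack-knife test and the two indexes can be sketched together. The example below uses a one-dimensional 1-nearest-neighbour rule purely as a stand-in classifier, a two-class toy dataset with invented values, and the binary forms of MCC and accuracy; for the three-class problem in this study the counts would have to be obtained per class (for example one class against the rest), which is an assumption not spelled out in the text.

```python
from math import sqrt


def jackknife_predictions(samples, labels, train, predict):
    """Leave each sample out in turn, train on the rest, and predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(predict(model, samples[i]))
    return predictions


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined


# Stand-in classifier: 1-nearest neighbour on a single numeric attribute.
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Toy data (hypothetical): the third sample lies closer to the other class.
samples = [1.0, 1.2, 2.7, 3.0, 3.1, 2.9]
labels = [1, 1, 1, 0, 0, 0]

preds = jackknife_predictions(samples, labels, train_1nn, predict_1nn)
counts = confusion_counts(labels, preds)
acc_value, mcc_value = accuracy(*counts), mcc(*counts)
print(preds, counts, acc_value, mcc_value)
```

In this toy run one positive sample is misclassified, giving TP = 2, TN = 3, FP = 0 and FN = 1, so the accuracy is 5/6 while the MCC is lower, which illustrates why both indexes are reported.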

Experimental Results and Discussion

The experimental results are discussed in three sections. The foremost describes the ranking of the structural and physicochemical properties according to their gain ratio. The entire list of attributes was ranked and the file is provided as Table S1. The second section deals with the results of Incremental Feature Selection while the final section portrays the comparative performance of the benchmark classification models on the protein sequence properties in categorizing lung tumors.

Hybrid Feature Selection

A total of 1497 attributes were initially loaded as the training data with 113 instances [17]–[18]. No records were duplicated and there were no missing values. On ranking the attributes by the Gain Ratio criterion, a total of 134 attributes were assigned a gain ratio greater than zero. The CFS subset evaluator returned 39 features as the most optimal subset that was highly correlated to the target class but least correlated to each other. These features were then utilized for the Incremental feature Selection process. The results of the Hybrid Feature Selection techniques are given as Table S1.

Incremental Feature Selection

The ranked attributes from the CFS subset evaluator were then input to the classifier in descending order of their rank. At each attribute entry, the MCC and accuracy of the classifier on the Jack-knife test were calculated. Bayesian Network Learning was found to give the highest prediction MCC of 0.812 and accuracy of 87.6% with 36 features. The IFS curves for classifier accuracy and the corresponding MCC are shown in Figure 2. The optimal prediction accuracy with the proposed methodology for each feature subset is given in Table 1. The complete results of the Incremental Feature Selection process on all three Hybrid Feature Selection datasets are given in Table S2.

Figure 2

The IFS curves depicting classification accuracy and MCC in lung tumor categorization.

(A) The IFS curve generated using Classification Accuracy in Lung Tumor categorization. The x-axis represented the number of features while the y-axis represented the jack-knife cross-validation accuracy. The peak of classification accuracy attained was 87.6% with 36 features. The top 36 features derived by Hybrid Feature Selection (Gain Ratio +CFS Subset) approach form the optimal feature set. (B) The IFS curve generated using MCC values obtained from classification algorithms. The peak of MCC is 0.812 with 36 features. The top 36 features derived by the Hybrid Feature Selection approach (Gain Ratio + CFS Subset) formed the optimal feature set.

Table 1

Optimal classification accuracy with filtered subsets and IFS.

Hybrid Feature Selection Technique	Features	Classification Algorithm	Jack-knife Cross-Validation Accuracy (%)
Gain Ratio + CFS Subset	36	Bayesian Network	87.6
Information Gain + CFS Subset	32	Bayesian Network	85
Symmetric Uncertainty + CFS Subset	29	Bayesian Network	85.8

Classifier Models

Benchmark classification models that have been reported [14] [38]–[39] [53]–[54] to generate high accuracy in classification of biological data were compared to determine the optimal prediction technique that generated highest accuracy in prediction. The comparative performance of the classification models with the feature set generated by the Hybrid Feature Selection technique is depicted in Table 2. The performance is compared based on the MCC and prediction accuracy.

Table 2

Comparison of predictor models in lung cancer tumor categorization.

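S.No	Hybrid Feature Selection Technique	Classifier	Training Phase MCC	Training Phase Accuracy (%)	Prediction Phase MCC	Prediction Phase Accuracy (%)
1	Gain Ratio + CFS Subset Evaluator	Bayesian Network	0.895	92.9	0.77	85
2	Gain Ratio + CFS Subset Evaluator	Random Forest	1	100	0.652	78.8
3	Gain Ratio + CFS Subset Evaluator	Nearest Neighbor	1	100	0.507	69
4	Gain Ratio + CFS Subset Evaluator	Support Vector Machine	0.856	91.2	0.603	76.1
5	Gain Ratio + CFS Subset Evaluator	Random Committee	1	100	0.484	69
1	Information Gain + CFS Subset Evaluator	Bayesian Network	0.895	92.9	0.77	85
2	Information Gain + CFS Subset Evaluator	Random Forest	1	100	0.61	76.1
3	Information Gain + CFS Subset Evaluator	Nearest Neighbor	1	100	0.52	69.9
4	Information Gain + CFS Subset Evaluator	Support Vector Machine	0.856	91.2	0.603	76.1
5	Information Gain + CFS Subset Evaluator	Random Committee	1	100	0.553	72.6
1	Symmetric Uncertainty + CFS Subset Evaluator	Bayesian Network	0.895	92.9	0.77	85
2	Symmetric Uncertainty + CFS Subset Evaluator	Random Forest	1	100	0.521	71.7
3	Symmetric Uncertainty + CFS Subset Evaluator	Nearest Neighbor	1	100	0.52	69.9
4	Symmetric Uncertainty + CFS Subset Evaluator	Support Vector Machine	0.84	90.3	0.603	76.1
5	Symmetric Uncertainty + CFS Subset Evaluator	Random Committee	1	100	0.62	77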

Clustering Models

This study utilized seven clustering algorithms [22] in order to compare their performance in categorizing the classes of lung tumors based on the attribute values. The results of generating the clustering algorithms on the dataset before and after performing hybrid feature selection are presented. The classes to cluster evaluation results are portrayed in Table 3. It is evident from the tabulated results that clustering algorithms were not useful in providing any new idea on the attribute significance in detecting clusters since their performance accuracy was substantially low. The discussions on the data and the results are presented in the ensuing section.

Table 3

Classes to cluster evaluation.

S.No	Clustering Model	Classes to Cluster Evaluation Accuracy (%), Pre-Hybrid Feature Selection	Classes to Cluster Evaluation Accuracy (%), Post-Hybrid Feature Selection
1	E-M Algorithm	52.2124	51.3274
2	COBWEB	2.6549	5.3097
3	K-Means	53.0973	51.3274
4	Hierarchical Clustering	51.3274	51.3274
5	Density Based Clustering	53.0973	52.2124
6	Filtered Clustering	53.0973	51.3274
7	Farthest First Clustering	48.6726	46.0176

Discussion

Influence of Structural and Physicochemical Properties

Several studies have addressed lung cancer classification [55]–[65], but the only previous computational study on the influence of protein sequence-based structural and physicochemical properties in the categorization of lung tumors was by Hosseinzadeh et al. [1], who utilized the decision tree generated by the Random Forest classifier to identify the contributing attributes. In this study, we utilized the smallest tree among the 10 decision tree models generated by the Random Forest classifier [66] on the training dataset in order to identify the attributes contributing most to lung tumor classification. Although the Random Committee algorithm also showed 100% accuracy and an MCC of 1 in the training phase, its results on Jack-knife cross-validation were not as high as those of the Random Forest model. The decision tree model with the smallest number of nodes generated by the Random Forest on the training dataset is portrayed in Figure 3. The visualization of this tree made it easier to identify the composition of each protein property in the different types of lung cancer tumors, thus providing a source for drug design targeting the protein composition.

Figure 3

Decision tree model obtained by the Random Forest classifier.

The following novel insights on the protein properties were gained from the Random Forest Model, with a new set of discriminative features being reported for the first time in discriminating the lung tumor classes. Dipeptide composition was the most discriminating feature among the classes. F1.2 [Dipeptide Composition], F5.3 [Distribution Descriptor], F4.1 [Geary Auto-correlation] and F6.1 [Sequence order coupling number] were the subsequent significant protein properties used by the Random Forest Model to discriminate the lung tumor classes. A low value of F5.3.2 [Normalized vdW volumes] and F7.1 [pseudo amino-acid composition] moved the records into the COMMON class. A high F5.3.1 [distribution of hydrophobicity] and F5.3.3 [distribution of polarity] was found among the genes common to both classes of tumors, whereas a lower concentration of the same was found among the NSCLC tumor genes. This directs molecular research to design drugs that would lower the distribution of hydrophobicity and polarity while raising the normalized vdW volumes and pseudo amino-acid composition to target the COMMON classes of tumors. A high dipeptide composition was characteristic of the NSCLC genes and a relatively low value represented the SCLC tumors. A high concentration of F5.3.1 [Distribution of hydrophobicity] and F5.3.7 [distribution of Solvent Accessibility] was evident in the COMMON classes of tumors. These findings suggest designing drugs that raise dipeptide composition to aid in the cure of SCLC tumors and drugs that lower the dipeptide composition to cure NSCLC tumors. Moreover, the design of drugs that lower the distribution of hydrophobicity and solvent accessibility could aid in curing tumors of both kinds. It was evident that a strict demarcation among the tumor categories was a complicated task since many properties were found to exhibit similar composition in both the tumor classes.
However, the proposed methodology was found to differentiate between the tumor classes with a high MCC of 0.812 and a classification accuracy of 87.6%, the highest reported thus far in protein property-based lung tumor categorization.

Comparison to Previous Work

As stated earlier, the only previous computational study on lung tumor categorization based on protein sequence-based structural and physicochemical properties was reported by Hosseinzadeh et al. [1], who compared ten different feature selection techniques and reported that the feature set generated by the Gain Ratio criterion gave the optimal 10-fold cross-validation accuracy of 86% with the Random Forest classifier. Their methodology incorporated 114 sequences with 30 genes in the NSCLC class, 59 in the SCLC class and 25 in the COMMON class of tumors. Moreover, their methodology also involved extensive data cleaning and pre-processing. Here we made use of the 113 sequences [16]–[18] from the KEGG gene sets corresponding to the NSCLC and SCLC tumor classes and segregated the genes under the three classes, viz. NSCLC, SCLC and COMMON. The number of records summed up to 113 with 29 genes [16]–[17] in the NSCLC class. This study was aimed at identifying the minimal and optimal set of features to categorize the lung tumor classes for use in diagnostic practice and drug design. Hence we used the Gain Ratio criterion, Information Gain criterion and Symmetric Uncertainty to rank the features and then applied the Correlation Feature Subset evaluator [22] with a search termination threshold of 5 and the Best First Search approach to identify the smallest subset of features with a high correlation to the target class and least correlation to each other. This resulted in a feature subset with 39 features. On comparing the jack-knife cross-validation accuracy of five benchmark classification models, the Bayesian Network Learning algorithm was found to generate the highest MCC of 0.77 with an accuracy of 85% with all the three hybrid feature selection subsets. On applying Incremental Feature Selection we obtained the optimal feature set of 36 features (a subset of the Gain Ratio + CFS features) generating an accuracy of 87.6%.
The previous work by Hosseinzadeh et al. reported a high accuracy of 86% only on the cleaned data, after removal of duplicate records and correlated records and filtering based on the standard deviation values. When considering the same data, our proposed work achieved a higher accuracy with the original, unmodified data, thus saving computational time by eliminating the data cleaning process. In order to bring out the comparison more clearly, we measured the accuracy of Random Forest with Gain Ratio (the previously proposed classifier model) on the original data; it generated an optimal accuracy of only 79.6% with 26 features from the Gain Ratio + CFS feature set, compared to our proposed method, which produced 87.6% accuracy with 36 features from the same feature subset. We believe our proposed methodology can easily be extended to classify and discriminate between other oncogenic tumors since the original data was retained for computational analysis. However, the previous method appears to have generated a high accuracy (86%) only on the cleaned data, which is a limitation when extending the methodology to other cancer datasets. Moreover, the previously proposed model would entail additional data pre-processing time when applied to new cancer datasets.

Comparison with Other Methods

We compared three feature selection methods [22], namely Information Gain, Symmetric Uncertainty and Gain Ratio. We applied the CFS Subset evaluator to all the feature sets ranked by the three algorithms. All five benchmark classification algorithms [67]–[68] were applied to the reduced feature datasets. The results are tabulated in Table 2. All three feature selection methods displayed consistently high accuracy with the Bayesian Network prediction technique. The optimal accuracy was obtained only during the process of Incremental Feature Selection with the Gain Ratio and CFS subset evaluator combination, which attained an improved accuracy of 87.6% with 36 features. Although the Bayesian Network learning algorithm showed consistent accuracy with the reduced feature sets of the Information Gain and Symmetric Uncertainty ranked features, a substantial decline in accuracy was apparent with these subsets during the process of Incremental Feature Selection, as detailed in Table S2. Hence the Gain Ratio based ranking of features was considered to yield the optimal feature set for lung tumor categorization. The features selected by all three hybrid feature selection techniques and the commonality among the selected features are displayed as a graph using the NodeXL graph visualization software [69] in Figure 4. On careful analysis of the graphical representation of the feature subsets, it could be concluded that many features were commonly filtered by all three hybrid feature selection techniques and hence reasonably similar performance accuracy was evident across the filtered subsets. However, the process of Incremental Feature Selection disclosed the optimal and minimal feature set required for optimum prediction accuracy.

Figure 4

Feature relevance graph.

The hybrid feature selection techniques are represented as solid diamonds. The optimal features filtered by each technique are represented by directed edges from the technique to the feature. Results of each hybrid feature selection technique are represented in different colors.

Benefits of the Bayesian Network Learning Algorithm

Bayesian Networks have been used in several clinical prediction problems [70]–[73]. Previous research has stated that a Bayesian network is a mathematically rigorous way to model a domain problem, being flexible and adaptable to available knowledge, and computationally efficient [72] [74]–[75]. Some notable features of Bayesian Networks [44] for use in clinical prediction are outlined below. A Bayesian network relates only those nodes that are probabilistically linked by some sort of causal dependency. This eliminates the need to store all possible configurations of states: the algorithm stores and works with the possible combinations of states only between sets of related parent and child nodes, which greatly reduces computational complexity. A Bayesian network can use both expert knowledge and data to build models dynamically, and it allows both backward and forward reasoning. The medical domain is one research area where expert knowledge always has room for improvement and where backward reasoning is a definite requirement. Hence the application of computational techniques such as Bayesian Networks to discriminating and classifying tumor classes from protein sequence based physicochemical properties is expected to advance the current state of molecular and biological analysis of oncogenic tumor classes for drug design.
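The storage saving and the two directions of reasoning mentioned above can be made concrete with a small sketch. It assumes a hypothetical three-node network in which the tumor class is the parent of two binarized features; the structure, the node names and all probabilities are made up for illustration and are not values learned in this study.

```python
from math import prod

# Hypothetical structure: each node maps to the list of its parents.
parents = {"tumor": [], "polarizability": ["tumor"], "hydrophobicity": ["tumor"]}
states = {
    "tumor": ["SCLC", "NSCLC", "COMMON"],
    "polarizability": ["low", "high"],
    "hydrophobicity": ["low", "high"],
}

# Made-up probabilities, NOT values from the study.
prior = {"SCLC": 0.3, "NSCLC": 0.5, "COMMON": 0.2}
cpt = {
    "polarizability": {
        "SCLC": {"low": 0.2, "high": 0.8},
        "NSCLC": {"low": 0.6, "high": 0.4},
        "COMMON": {"low": 0.5, "high": 0.5},
    },
    "hydrophobicity": {
        "SCLC": {"low": 0.7, "high": 0.3},
        "NSCLC": {"low": 0.3, "high": 0.7},
        "COMMON": {"low": 0.5, "high": 0.5},
    },
}


def joint(tumor, evidence):
    """P(tumor, evidence) from the factorisation P(T) * prod P(feature | T).

    Unobserved features are leaves here, so they marginalise out to 1.
    """
    p = prior[tumor]
    for feature, value in evidence.items():
        p *= cpt[feature][tumor][value]
    return p


def posterior(evidence):
    """Backward reasoning: P(tumor | observed features) via Bayes' rule."""
    scores = {t: joint(t, evidence) for t in prior}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


def predictive(feature, value):
    """Forward reasoning: P(feature = value), summed over tumor classes."""
    return sum(prior[t] * cpt[feature][t][value] for t in prior)


def free_parameters():
    """Independent probabilities stored by the factored network."""
    return sum(
        (len(states[node]) - 1) * prod(len(states[p]) for p in parents[node])
        for node in parents
    )


# A full joint table needs one entry per configuration, minus one.
full_joint_parameters = prod(len(s) for s in states.values()) - 1

diagnosis = posterior({"polarizability": "high", "hydrophobicity": "low"})
```

In this toy network the factored form stores 8 independent probabilities against 11 for the full joint table. The gap is small with three nodes, but the full table grows exponentially with the number of features while the factored form grows only with the size of each parent and child family.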

Conclusion

Research on the use of computational techniques and predictions on clinical and biological data has intensified in the recent past, because most wet-lab experiments consume considerable human expertise, time and capital with uncertain rewards. This research was aimed at identifying the minimal and optimal set of protein sequence based structural and physicochemical properties for categorizing lung tumors into the NSCLC, SCLC and COMMON tumor classes. The findings of this study are believed to be both a computational and a biological advancement: the former reveals a new combination of feature selection and prediction techniques for categorizing tumor classes with enhanced accuracy, and the latter provides information on protein properties prevalent in lung tumors that could aid diagnostic practice and drug design. Possible extensions to this work include applying this computational framework to the categorization of other oncogenic tumors and detecting properties that could be targeted for cancer therapy. Further computational advancement would require improving the prediction accuracy of the proposed methodology through possible updates to the existing algorithms.

Supporting Information

Attribute description file (DOC). Pre-processed protein based structural and physicochemical data (TXT). Hybrid feature selection results (XLS). Incremental feature selection results (XLS).
