Feature selection and classification of MAQC-II breast cancer and multiple myeloma microarray gene expression data.

Qingzhong Liu, Andrew H Sung, Zhongxue Chen, Jianzhong Liu, Xudong Huang, Youping Deng.

Abstract

Microarray data have a high dimension of variables, but available datasets usually have only a small number of samples, which makes the study of such datasets interesting and challenging. In analyzing microarray data for purposes such as predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can significantly reduce cost while maintaining or improving the classification or prediction accuracy of the learning machines employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods: Support Vector Machine Recursive Feature Elimination (SVMRFE), Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS), and Gradient based Leave-one-out Gene Selection (GLGS). To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with the learning classifier. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

Year: 2009; PMID: 20011240; PMCID: PMC2789385; DOI: 10.1371/journal.pone.0008250

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240


Introduction

Using microarray techniques, researchers can measure the expression levels of tens of thousands of genes in a single experiment. This ability allows scientists to investigate, at a genome-wide level, the functional relationship between genes and the cellular and physiological processes of biological organisms. The preprocessing procedure for raw microarray data consists of background correction, normalization, and summarization. After preprocessing, a high-level analysis, such as gene selection, classification, or clustering, is applied to profile the gene expression patterns [1]. In the high-level analysis, partitioning genes into closely related groups across time and classifying patients into different health statuses based on selected gene signatures have become two main tracks of microarray data analysis in the past decade [2]–[6]. Various standards related to systems biology are discussed by Brazma et al. [7]. When sample sizes are substantially smaller than the number of features or genes, statistical modeling and inference become challenging as the familiar “large p, small n” problem arises. Designing feature selection methods that lead to reliable and accurate predictions by learning classifiers is therefore an issue of great theoretical as well as practical importance in high-dimensional data analysis. To address the “curse of dimensionality” problem, three basic strategies have been proposed for feature selection: filtering, wrapper, and embedded methods. Filtering methods select feature subsets independently of the learning classifiers and do not incorporate learning [8]–[11]. One weakness of filtering methods is that they consider each feature in isolation and ignore possible interactions among features. Yet the combination of certain features may have a net effect that does not necessarily follow from the individual performance of the features in that group [12].
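To make the filtering strategy concrete, here is a minimal sketch that scores each gene in isolation and ranks genes by that score. It assumes Welch's t-statistic as the scoring criterion, which is only one common choice and is not taken from the methods compared in this paper; the function names and the toy expression matrix are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def filter_rank(X, y):
    """Rank genes (columns of X) by |t| between the two classes, best first."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(pos, neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data: rows are samples, columns are genes.
# Gene 1 is a near-copy of gene 0 (redundant); gene 2 is uninformative.
X = [
    [5.0, 5.1, 3.0],
    [5.2, 5.3, 2.0],
    [4.8, 4.9, 4.0],
    [1.0, 1.1, 3.1],
    [1.2, 1.3, 2.1],
    [0.8, 0.9, 3.9],
]
y = [1, 1, 1, 0, 0, 0]

ranking = filter_rank(X, y)
print(ranking)
```

Because each gene is scored on its own, the two nearly identical genes 0 and 1 both land at the top of the ranking, and the uninformative gene 2 comes last. A filter of this kind has no way to notice that the second copy adds little new information.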
A consequence of filtering methods is that we may end up selecting groups of highly correlated features/genes, which present redundant information to the learning classifier and ultimately worsen its performance. Also, if there is a practical limit on the number of features to be chosen, one may not be able to include all informative features. To avoid the weaknesses of filtering methods, wrapper methods wrap around a particular learning algorithm that can assess the selected feature subsets in terms of the estimated classification errors and then build the final classifier [13]. Wrapper methods use a learning machine to measure the quality of subsets of features. One well-known wrapper method for feature/gene selection is Support Vector Machine Recursive Feature Elimination (SVMRFE) [14], which refines the optimum feature set by using Support Vector Machines (SVM). The idea of SVMRFE is that the orientation of the separating hyperplane found by the SVM can be used to select informative features: if the hyperplane is orthogonal to a particular feature dimension, then that feature is informative, whereas a feature whose axis lies parallel to the hyperplane has zero weight and does not affect the decision. In addition to microarray data analysis, SVMRFE has been widely used in high-throughput biological data analyses and other areas involving feature selection and pattern classification [15]. Wrapper methods can noticeably reduce the number of features and significantly improve the classification accuracy [16], [17]. However, wrapper methods have the drawback of high computational load, making them less desirable as the dimensionality increases. Embedded methods perform feature selection simultaneously with the training of learning classifiers to achieve better computational efficiency than wrapper methods while maintaining similar performance. LASSO [18], [19], logistic regression with the regularized Laplacian prior [20], and Bayesian regularized neural network with automatic relevance determination [21] are examples of embedded methods.
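The recursive elimination idea behind SVMRFE can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in this paper or in [14]: it assumes a linear soft-margin SVM without a bias term, trained by plain full-batch subgradient descent, and it removes one feature per round using the squared weight as the ranking criterion. The function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, active, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the L2-regularised hinge loss (no bias term).

    Only the feature indices listed in `active` are used.
    """
    w = {j: 0.0 for j in active}
    n = len(X)
    for _ in range(epochs):
        hinge = {j: 0.0 for j in active}
        for xi, yi in zip(X, y):
            margin = yi * sum(w[j] * xi[j] for j in active)
            if margin < 1.0:  # sample violates the margin
                for j in active:
                    hinge[j] += yi * xi[j]
        for j in active:
            w[j] -= lr * (lam * w[j] - hinge[j] / n)
    return w


def svm_rfe(X, y):
    """Return feature indices ranked from most to least informative."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > 1:
        w = train_linear_svm(X, y, active)
        # The feature with the smallest squared weight is removed first.
        worst = min(active, key=lambda j: w[j] ** 2)
        active.remove(worst)
        eliminated.append(worst)
    return active + eliminated[::-1]


# Hypothetical toy data: feature 0 separates the classes,
# feature 1 is weakly related to the label, feature 2 is constant.
X = [
    [2.0, 0.5, 0.25],
    [2.5, -0.25, 0.25],
    [1.5, 0.25, 0.25],
    [-2.0, -0.5, 0.25],
    [-2.5, 0.25, 0.25],
    [-1.5, -0.25, 0.25],
]
y = [1, 1, 1, -1, -1, -1]

ranking = svm_rfe(X, y)
print(ranking)
```

On this toy data the strongly separating feature 0 survives every elimination round and is ranked first. Retraining the SVM after each removal is what makes the procedure a wrapper method, and it is also the source of the computational cost noted above.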
To improve classification of microarray data, Zhou and Mao proposed the SFS-LS bound and SFFS-LS bound algorithms for optimal gene selection, which combine sequential forward selection (SFS) and sequential floating forward selection (SFFS) with the LS (Least Squares) bound measure [22]. Tang et al. designed two methods of gene selection, leave-one-out calculation sequential forward selection (LOOCSFS) and gradient based leave-one-out gene selection (GLGS) [23]. Diaz-Uriarte and De Andres [24] presented a method for gene selection by calculating the out-of-bag errors with random forest [25]. In human genetic research, exploiting information redundancy from highly correlated genes can potentially reduce the cost and simultaneously improve the reliability and accuracy of the learning classifiers employed in data analysis. To exploit the information redundancy that exists among the huge number of variables and improve the classification accuracy of microarray data, we propose a gene selection method, Recursive Feature Addition (RFA), which is based on supervised learning and similarity measures. We compare RFA with SVMRFE, LOOCSFS, and GLGS on the MAQC-II breast cancer dataset to predict pre-operative treatment response (pCR) and estrogen receptor status (erpos). We also compare RFA with SVMRFE and LOOCSFS on the MAQC-II multiple myeloma dataset to predict the overall survival milestone outcome (OSMO, 730-day cutoff) and the event-free survival milestone outcome (EFSMO, 730-day cutoff).

Results

Results on MAQC-II Breast Cancer Dataset

We compare MSC-based RFA methods with GLGS, LOOCSFS, and SVMRFE on the MAQC-II breast cancer dataset. Tables 1 to 10 list the cancer related genes among the first 100 features selected by these methods. In comparison to GLGS, LOOCSFS, and SVMRFE, RFA is associated with a greater number of currently known cancer related genes for prediction of pCR and a smaller number of currently known cancer related genes for prediction of erpos. Since disease status is not simply related to the number of these cancer related genes, we obtain the prediction performance by running multiple experiments and compare the average prediction performance using the following measurements: testing accuracy, the Matthews Correlation Coefficient (MCC) [26], [27], which has been used by the MAQC-II consortium [28], and area under the receiver operating characteristic curve (AUC) errors, with the classifiers NMSC, NBC, SVM, and UDC (uncorrelated normal based quadratic Bayes classifier) [29].
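For reference, testing accuracy and MCC can both be computed from the four confusion-matrix counts of a binary prediction. The sketch below uses the standard MCC formula; the function names and the example labels are hypothetical and are not results from this study.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical labels for eight test samples (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)            # (3, 1, 3, 1)
print(accuracy(*counts)) # 0.75
print(mcc(*counts))      # 0.5
```

MCC ranges from -1 to 1 and, unlike accuracy, stays near 0 for a classifier that simply predicts the majority class, which is why it is useful when the two classes are unbalanced.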
Table 1

The 34 cancer related genes of the 100 features selected by NBC-MSC on original training group of MAQC-II breast cancer data for pCR prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
ALB | ALB1, ALBUMIN, Albumin 1, Albza, DKFZp779N1935, PRO0883, PRO0903, PRO1341, SA, SERUM ALBUMIN, SERUM ALBUMIN CHAIN A, Serum albumin precursor | albumin | 214837_at
C10ORF81 | 9930023K05RIK, bA211N11.2, FLJ23537, HEL185, MGC99964, RGD1559884, RP11-211N11.2 | chromosome 10 open reading frame 81 | 219857_at
CDK10 | BC017131, MGC112847, PISSLRE | cyclin-dependent kinase 10 | 210622_x_at
CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211889_x_at
CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at
CR1 | C3b/C4b receptor, C3BR, CD35, CD46, Cr1l, Crry, KN, Mcp, mCRY, MGC102484, SCR1 | complement component (3b/4b) receptor 1 (Knops blood group) | 217552_x_at
CXCL3 | Cinc-2, CINC-2a, CINC-2b, Cinc3, Cxcl2, Dcip1, Gm1960, GRO ALPHA, GRO BETA, GRO GAMMA, GRO1, Gro2, GRO3, GROA, GROb, GROg, KC, MGSA, Mgsa-b, MIP-2, MIP-2a, MIP-2b, Mip2 alpha, N51, Scyb, Scyb2, SCYB3 | chemokine (C-X-C motif) ligand 3 | 207850_at
CXCL13 | ANGIE, ANGIE2, BCA-1, BLC, BLC1, BLR1L, CXC CHEMOKINE, Loc498335, SCYB13 | chemokine (C-X-C motif) ligand 13 | 205242_at
DKK1 | Dkk1 predicted, mdkk-1, SK | dickkopf homolog 1 (Xenopus laevis) | 204602_at
DRD2 | D2, D2 DOPAMINE RECEPTOR, D2a dopamine receptor, D2DR, D2R, D2S, DOPAMINE D2 RECEPTOR, Dr2 | dopamine receptor D2 | 216924_s_at
EED | HEED, l(7)5Rn, l7Rn5, lusk, WAIT-1 | embryonic ectoderm development | 209572_s_at
GINS3 | 2700085M18Rik, AI616142, FLJ13912, PSF3, RGD1308153 | GINS complex subunit 3 (Psf3 homolog) | 218719_s_at
GPS2 | AI505953, AMF-1, MGC104294, MGC119287, MGC119288, MGC119289 | G protein pathway suppressor 2 | 209350_s_at
GRIA2 | GLUR-B, GluR-K2, GLUR2, GLUR2 IONOTROPIC, HBGR2 | glutamate receptor, ionotropic, AMPA 2 | 205358_at
GSN | DKFZp313L0718, GELSOLIN, MGC28083, MGC95032 | gelsolin (amyloidosis, Finnish type) | 214040_s_at
IFNAR1 | ALPHA CHAIN OF TYPE I IFNR, AVP, BETA R1, CD118, Ifar, IFN RECEPTOR TYPE 1, IFN TYPE 1 RECEPTOR, IFN-alpha-beta-R, IFN-ALPHA-REC, IFNalpha/betaR, IFNAR, IFNBR, IFRC, Infar, INFAR1, Interferon Receptor, LOC284829, Type I infr | interferon (alpha, beta and omega) receptor 1 | 204191_at
ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204989_s_at
IVD | 1300016K07Rik, 6720455E18Rik, ACAD2, AI463340, Isovaleryl-Coa Dehydrogenase | isovaleryl Coenzyme A dehydrogenase | 216958_s_at
KL | ALPHA KLOTHO, alpha-kl, KLOTHO | klotho | 205978_at
N4BP1 | AI481586, C81621, FLJ31821, KIAA0615, MGC176730, MGC7607, RGD1305179 | NEDD4 binding protein 1 | 32069_at
NAIP | AV364616, BIRC1, BIRC1A, Birc1b, Birc1e, BIRC1F, D13Lsd1, FLJ18088, FLJ42520, FLJ58811, LGN1, LOC652755, Naip-rs1, Naip-rs3, Naip-rs4, Naip-rs4A, Naip1, Naip2, Naip5, Naip6, NLRB1, psiNAIP, RGD1559914 | NLR family, apoptosis inhibitory protein | 204861_s_at
NDST1 | 1200015G06RIK, HSNST, HSST, HSST1, NST1 | N-deacetylase/N-sulfotransferase (heparan glucosaminyl) 1 | 202608_s_at
PAICS | 2610511I09Rik, ADE2, ADE2H1, AIRC, DKFZp781N1372, MGC1343, MGC5024, MGC93240, PAIS | phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase | 214664_at
PAWR | PAR-4 | PRKC, apoptosis, WT1, regulator | 204005_s_at
PDPK1 | MGC20087, MGC35290, PDK1, PRO0461 | 3-phosphoinositide dependent protein kinase-1 | 204524_at
PHLDA1 | DT1P1B11, MGC131738, PHRIP, PQ-RICH, Proline- and glutamine-rich, TDAG, TDAG51 | pleckstrin homology-like domain, family A, member 1 | 218000_s_at
PPP1R15A (includes EG:23645) | 9630030H21, GADD34, MYD116, Myeloid Differentiation, Peg-3, PP1 REGULATORY SUBUNIT, Ppp1r15a | protein phosphatase 1, regulatory (inhibitor) subunit 15A | 202014_at
RASSF1 | 123F2, AA536941, AU044980, D4Mgi37, MGC94319, NORE2A, PTS, Rassf1A, Rassf1B, Rassf1C, RDA32, REH3P21 | Ras association (RalGDS/AF-6) domain family member 1 | 204346_s_at
SYT1 | AW124717, DKFZp781D2042, G630098F17Rik, P65, SVP65, SYNAPTOTAGMIN 1, SYT | synaptotagmin I | 203999_at
TACSTD2 | C80403, EGP-1, GA733, GA733-1, Ly97, M1S1, MGC141612, MGC141613, MGC72570, Prp1, TROP2 | tumor-associated calcium signal transducer 2 | 202286_s_at
TEK | AA517024, CD202B, Hyk, MGC139569, TIE-2, VMCM, VMCM1 | TEK tyrosine kinase, endothelial | 206702_at
TTF1 | AV245725, RGD1565673, Ttf-I | transcription termination factor, RNA polymerase I | 204772_s_at
ZEB1 | 3110032K11Rik, AREB6, BZP, DELTA-EF1, MEB1, MGC133261, NIL-2-A, Nil2, Tcf18, TCF8, TCP8, TF8, TRANSCRIPTION FACTOR 8, ZEB, ZFHEP, Zfhep2, ZFHX1A, Zfx1a, Zfx1ha, [delta]EF1 | zinc finger E-box binding homeobox 1 | 212758_s_at
ZNF10 | KOX1 | zinc finger protein 10 | 216350_s_at
Table 2

The 34 cancer related genes of the 100 features selected by NMSC-MSC on original training group of MAQC-II breast cancer data for pCR prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
ALB | ALB1, ALBUMIN, Albumin 1, Albza, DKFZp779N1935, PRO0883, PRO0903, PRO1341, SA, SERUM ALBUMIN, SERUM ALBUMIN CHAIN A, Serum albumin precursor | albumin | 214837_at
ARAF | 1200013E08Rik, ARAF1, AW495444, PKS, PKS2, RAFA1 | v-raf murine sarcoma 3611 viral oncogene homolog | 201895_at
C10ORF81 | 9930023K05RIK, bA211N11.2, FLJ23537, HEL185, MGC99964, RGD1559884, RP11-211N11.2 | chromosome 10 open reading frame 81 | 219857_at
CDH1 | AA960649, Arc-1, Cadherin 1, CD324, CDHE, CSEIL, E-CADHERIN, E-CADHERIN 120 KDA, ECAD, L-CAM, MGC107495, Um, UVO, uvomorulin | cadherin 1, type 1, E-cadherin (epithelial) | 201131_s_at
CDK10 | BC017131, MGC112847, PISSLRE | cyclin-dependent kinase 10 | 210622_x_at
CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211889_x_at
CEBPE | C/EBP EPSILON, C/EBPe, C/EPBe, CRP1, Gm294, MGC124002, MGC124003 | CCAAT/enhancer binding protein (C/EBP), epsilon | 214523_at
CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at
CR1 | C3b/C4b receptor, C3BR, CD35, CD46, Cr1l, Crry, KN, Mcp, mCRY, MGC102484, SCR1 | complement component (3b/4b) receptor 1 (Knops blood group) | 217552_x_at
CXCL3 | Cinc-2, CINC-2a, CINC-2b, Cinc3, Cxcl2, Dcip1, Gm1960, GRO ALPHA, GRO BETA, GRO GAMMA, GRO1, Gro2, GRO3, GROA, GROb, GROg, KC, MGSA, Mgsa-b, MIP-2, MIP-2a, MIP-2b, Mip2 alpha, N51, Scyb, Scyb2, SCYB3 | chemokine (C-X-C motif) ligand 3 | 207850_at
CXCL13 | ANGIE, ANGIE2, BCA-1, BLC, BLC1, BLR1L, CXC CHEMOKINE, Loc498335, SCYB13 | chemokine (C-X-C motif) ligand 13 | 205242_at
DRD2 | D2, D2 DOPAMINE RECEPTOR, D2a dopamine receptor, D2DR, D2R, D2S, DOPAMINE D2 RECEPTOR, Dr2 | dopamine receptor D2 | 216924_s_at
DYRK1A | 2310043O08Rik, D16Ertd272e, D16Ertd493e, DUAL-SPECIFICITY TYROSINE-(Y)-PHOSPHORYLATION REGULATED KINASE 1A, DYRK, DYRK1, HP86, MGC150253, MGC150254, mmb, MNB, MNBH, Mp86, PSK47 | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A | 211541_s_at
EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at
FAM153A | KIAA0752, NY-REN-7 | family with sequence similarity 153, member A | 211166_at
GINS3 | 2700085M18Rik, AI616142, FLJ13912, PSF3, RGD1308153 | GINS complex subunit 3 (Psf3 homolog) | 218719_s_at
GRIA2 | GLUR-B, GluR-K2, GLUR2, GLUR2 IONOTROPIC, HBGR2 | glutamate receptor, ionotropic, AMPA 2 | 205358_at
GSN | DKFZp313L0718, GELSOLIN, MGC28083, MGC95032 | gelsolin (amyloidosis, Finnish type) | 214040_s_at
IFNAR1 | ALPHA CHAIN OF TYPE I IFNR, AVP, BETA R1, CD118, Ifar, IFN RECEPTOR TYPE 1, IFN TYPE 1 RECEPTOR, IFN-alpha-beta-R, IFN-ALPHA-REC, IFNalpha/betaR, IFNAR, IFNBR, IFRC, Infar, INFAR1, Interferon Receptor, LOC284829, Type I infr | interferon (alpha, beta and omega) receptor 1 | 204191_at
ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204989_s_at
KL | ALPHA KLOTHO, alpha-kl, KLOTHO | klotho | 205978_at
LPAR1 | 5031439C20, AI326300, clone 4.9, EDG2, ENDOTHELIAL DIFFERENTIATION LYSOPHOSPHATIDIC ACID G-PROTEIN-COUPLED RECEPTOR 2, Gpcr26, GPR26, Kdt2, LPA receptor 1, LPA1, LPA1 RECEPTOR, LPA2, LYSOPHOSPHATIDIC ACID G-PROTEIN-COUPLED RECEPTOR, MGC105279, MGC29102, Mrec1.3, rec.1.3, vzg-1 | lysophosphatidic acid receptor 1 | 204037_at
MCF2 | B230117G22Rik, DBL, MGC159138, RGD1566098 | MCF.2 cell line derived transforming sequence | 208017_s_at
MYO10 | AW048724, D15Ertd600e, FLJ10639, FLJ21066, FLJ22268, FLJ43256, KIAA0799, MGC131988, mKIAA0799, Myo10 (predicted), myosin-X | myosin X | 201976_s_at
NAIP | AV364616, BIRC1, BIRC1A, Birc1b, Birc1e, BIRC1F, D13Lsd1, FLJ18088, FLJ42520, FLJ58811, LGN1, LOC652755, Naip-rs1, Naip-rs3, Naip-rs4, Naip-rs4A, Naip1, Naip2, Naip5, Naip6, NLRB1, psiNAIP, RGD1559914 | NLR family, apoptosis inhibitory protein | 204860_s_at
NDST1 | 1200015G06RIK, HSNST, HSST, HSST1, NST1 | N-deacetylase/N-sulfotransferase (heparan glucosaminyl) 1 | 202608_s_at
PHLDA1 | DT1P1B11, MGC131738, PHRIP, PQ-RICH, Proline- and glutamine-rich, TDAG, TDAG51 | pleckstrin homology-like domain, family A, member 1 | 218000_s_at
PPP1R15A (includes EG:23645) | 9630030H21, GADD34, MYD116, Myeloid Differentiation, Peg-3, PP1 REGULATORY SUBUNIT, Ppp1r15a | protein phosphatase 1, regulatory (inhibitor) subunit 15A | 202014_at
RASSF1 | 123F2, AA536941, AU044980, D4Mgi37, MGC94319, NORE2A, PTS, Rassf1A, Rassf1B, Rassf1C, RDA32, REH3P21 | Ras association (RalGDS/AF-6) domain family member 1 | 204346_s_at
SIAH1 | AA982064, AI853500, D9MGI7, FLJ08065, hSIAH1, HUMSIAH, SIAH, Siah1a, Sinh1a | seven in absentia homolog 1 (Drosophila) | 202981_x_at
TTF1 | AV245725, RGD1565673, Ttf-I | transcription termination factor, RNA polymerase I | 204772_s_at
VAMP2 | FLJ11460, mVam2, RATVAMPB, RATVAMPIR, SYB, SYB2, SYNAPTOBREVIN 2, Vamp ii | vesicle-associated membrane protein 2 (synaptobrevin 2) | 201557_at
ZNF10 | KOX1 | zinc finger protein 10 | 216350_s_at
ZNF205 | 4933429B21, AI835008, Krox-8, Zfp13, ZINC FINGER PROTEIN 205, ZNF210 | zinc finger protein 205 | 206416_at
Table 3

The 26 cancer related genes of the 100 features selected by GLGS on original training group of MAQC-II breast cancer data for pCR prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
ADAM15 | MDC15, METARGIDIN, tMDCVI | ADAM metallopeptidase domain 15 | 217007_s_at
ARAF | 1200013E08Rik, ARAF1, AW495444, PKS, PKS2, RAFA1 | v-raf murine sarcoma 3611 viral oncogene homolog | 201895_at
ASNS | AS, ASPARAGINE SYNTHETASE, MGC93148, TS11 | asparagine synthetase | 205047_s_at
B4GALT5 | 9430078I07Rik, AW049941, AW539721, BETA4-GALT-IV, beta4Gal-T5, beta4GalT-V, gt-V, MGC138470 | UDP-Gal:betaGlcNAc beta 1,4- galactosyltransferase, polypeptide 5 | 221485_at
CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211889_x_at
CXCL3 | Cinc-2, CINC-2a, CINC-2b, Cinc3, Cxcl2, Dcip1, Gm1960, GRO ALPHA, GRO BETA, GRO GAMMA, GRO1, Gro2, GRO3, GROA, GROb, GROg, KC, MGSA, Mgsa-b, MIP-2, MIP-2a, MIP-2b, Mip2 alpha, N51, Scyb, Scyb2, SCYB3 | chemokine (C-X-C motif) ligand 3 | 207850_at
CYCS (includes EG:54205) | CYC, CYCS, CYCSA, CYCT, CYCTA, CYTC, CYTOCHROME C, HCS, MGC93634, T-Cc | cytochrome c, somatic | 208905_at
DRD2 | D2, D2 DOPAMINE RECEPTOR, D2a dopamine receptor, D2DR, D2R, D2S, DOPAMINE D2 RECEPTOR, Dr2 | dopamine receptor D2 | 216924_s_at
EGFR | 9030024J15RIK, AI552599, EGF RECEPTOR, EGF-TK, EGFR1, EGFRec, ER2, ERBB, ERBB1, Errp, HER1, MENA, PIG61, wa-2, Wa5 | epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) | 201983_s_at
FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at
FIS1 | 2010003O14Rik, CGI-135, Riken cDNA 2010003o14, TTC11 | fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) | 218034_at
FOXC1 | ARA, ch, fkh-1, FKHL7, FLJ11796, FLJ11796 FIS, FREAC-3, frkhda, IGDA, IHG1, IRID1, Mf1, Mf4, rCG 44068, rCG_44068 | forkhead box C1 | 213260_at
FTL | FERRITIN LIGHT CHAIN, FTL1, Ftl2, L-FERRITIN, MGC102130, MGC102131, MGC118079, MGC118080, MGC71996, RGD1560687, RGD1561055, RGD1566189, YB24D08 | ferritin, light polypeptide | 201265_at
GFRA1 | AU042498, GDNFR, Gdnfr alpha, GDNFRA, GFR-ALPHA-1, GRFA1, MGC23045, Ret, RET1L, RETL1, TRNR1 | GDNF family receptor alpha 1 | 205696_s_at
ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204989_s_at
KL | ALPHA KLOTHO, alpha-kl, KLOTHO | klotho | 205978_at
LEF1 | 3000002B05, AI451430, DKFZp586H0919, LEF, TCF/LEF, TCF1ALPHA | lymphoid enhancer-binding factor 1 | 221557_s_at
LMO4 | A730077C12Rik, Crp3, Etohi4, MGC105593 | LIM domain only 4 | 209205_s_at
LPAR2 | EDG-4, FLJ93869, IPA2, LPA receptor 2, LPA2, LPA2 RECEPTOR, Pbx4, RGD1561336 | lysophosphatidic acid receptor 2 | 206723_s_at
MKI67 | D630048A14Rik, KI-67, KIA, MIB1, MIKI67A, MKI67A | antigen identified by monoclonal antibody Ki-67 | 212022_s_at
NAIP | AV364616, BIRC1, BIRC1A, Birc1b, Birc1e, BIRC1F, D13Lsd1, FLJ18088, FLJ42520, FLJ58811, LGN1, LOC652755, Naip-rs1, Naip-rs3, Naip-rs4, Naip-rs4A, Naip1, Naip2, Naip5, Naip6, NLRB1, psiNAIP, RGD1559914 | NLR family, apoptosis inhibitory protein | 204861_s_at
PPP1R15A (includes EG:23645) | 9630030H21, GADD34, MYD116, Myeloid Differentiation, Peg-3, PP1 REGULATORY SUBUNIT, Ppp1r15a | protein phosphatase 1, regulatory (inhibitor) subunit 15A | 202014_at
RARB | A830025K23, BETA RAR, HAP, LOC51036, NR1B2, RAR BETA, RAR-EPSILON, RRB2 | retinoic acid receptor, beta | 208530_s_at
TEK | AA517024, CD202B, Hyk, MGC139569, TIE-2, VMCM, VMCM1 | TEK tyrosine kinase, endothelial | 206702_at
TTF1 | AV245725, RGD1565673, Ttf-I | transcription termination factor, RNA polymerase I | 204772_s_at
USF2 | bHLHb12, FIP, MGC91056 | upstream transcription factor 2, c-fos interacting | 202152_x_at
Table 4

The 25 cancer related genes of the 100 features selected by LOOCSFS on original training group of MAQC-II breast cancer data for pCR prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
ADAM17 | Alpha Secretase, CD156b, cSVP, MGC71942, TACE, TNFA CONVERTASE | ADAM metallopeptidase domain 17 | 213532_at
APP | A beta 25–35, A-BETA 40, A-BETA 42, AAA, ABETA, ABPP, AD1, Adap, AL024401, AMYLOID BETA, AMYLOID BETA 40, AMYLOID BETA 40 HUMAN, AMYLOID BETA 42, Amyloid beta A4, AMYLOID BETA PEPTIDE 40, Amyloid precursor, Amyloidogenic glycoprotein, App alpha, APPI, appican, BETAAPP, CTFgamma, CVAP, E030013M08RIK, Nexin II, P3, PN2, PreA4, PROTEASE NEXIN2 | amyloid beta (A4) precursor protein | 214953_s_at
ARAF | 1200013E08Rik, ARAF1, AW495444, PKS, PKS2, RAFA1 | v-raf murine sarcoma 3611 viral oncogene homolog | 201895_at
CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211889_x_at
CTSL2 | 1190035F06Rik, Cathepsin l, CATHEPSIN V, CATHL, CATL2, Ctsl, Ctsl1, CTSU, CTSV, fs, MEP, MGC125957, nkt | cathepsin L2 | 210074_at
CYR61 | AI325051, CCN1, Cysteine-rich protein 61, GIG1, IGFBP10, MGC93040 | cysteine-rich, angiogenic inducer, 61 | 210764_s_at
DKK1 | Dkk1 predicted, mdkk-1, SK | dickkopf homolog 1 (Xenopus laevis) | 204602_at
DRD2 | D2, D2 DOPAMINE RECEPTOR, D2a dopamine receptor, D2DR, D2R, D2S, DOPAMINE D2 RECEPTOR, Dr2 | dopamine receptor D2 | 216924_s_at
ETS2 | AU022856, ETS2IT1 | v-ets erythroblastosis virus E26 oncogene homolog 2 (avian) | 201329_s_at
GRIA2 | GLUR-B, GluR-K2, GLUR2, GLUR2 IONOTROPIC, HBGR2 | glutamate receptor, ionotropic, AMPA 2 | 205358_at
ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204990_s_at
KIF3A | 111-11-71, 111-11-86, AF180004, AF180009, AW124694, KIF3, Kifl, Kns3 | kinesin family member 3A | 213623_at
KLF6 | Aa1017, AI448727, BCD1, C86813, COPEB, CPBP, DKFZp686N0199, Erythropoietin 1, FM2, FM6, GBF, Ierepo1, IEREPO3, KRUPPEL LIKE ZINC FINGER PROTEIN ZF9, PAC1, PROTO-ONCOGENE BCD, Proto-oncogene BCD1, R75280, ST12, ZF9 | Kruppel-like factor 6 | 211610_at
LTBP1 | 9430031G15Rik, 9830146M04, MGC163161, TGFB | latent transforming growth factor beta binding protein 1 | 202729_s_at
MCF2 | B230117G22Rik, DBL, MGC159138, RGD1566098 | MCF.2 cell line derived transforming sequence | 208017_s_at
REST | 2610008J04RIK, AA407358, D14MGI11, MGC150099, NRSF, XBR | RE1-silencing transcription factor | 204535_s_at
SDC1 | AA408134, AA409076, BB4, CD138, HSPG, SDC, SYN1, Synd, SYND1, SYNDECA, Syndecan, SYNDECAN-1 | syndecan 1 | 201287_s_at
SEMA3B | AW208495, FLJ34863, LUCA-1, SEMA, SEMA5, SEMAA, semaV | sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3B | 203071_at
SLPI | ALK1, ALP, ANTILEUKOPROTEASE, antileukoproteinase, BLPI, HUSI, HUSI-I, MPI, Secretory Leukoprotease Inhibitor, SLP1, WAP4, WFDC4 | secretory leukocyte peptidase inhibitor | 203021_at
SP1 | 1110003E12RIK, AA450830, AI845540, Sp1 (trans spliced isoform), SP1-1, Trans-acting transcription factor 1 | Sp1 transcription factor | 214732_at
TCF7L2 (includes EG:6934) | LOC679869, mTcf-4B, mTcf-4E, TCF-4, TCF4B, TCF4E, Tcf7l2 | transcription factor 7-like 2 (T-cell specific, HMG-box) | 216511_s_at
TNFAIP3 | A20, MAD6, MGC104522, MGC138687, MGC138688, OTUD7C, TNF ALPHA-INDUCED PROTEIN 3, TNF-inducible early response, TNFA1P2, Tnfip3 | tumor necrosis factor, alpha-induced protein 3 | 202644_s_at
TTF1 | AV245725, RGD1565673, Ttf-I | transcription termination factor, RNA polymerase I | 204772_s_at
TXN | ADF, AW550880, DKFZp686B1993, EOSINOPHIL CYTOTOXICITY FACTOR, MGC151960, MGC61975, THIOREDOXIN, TRX, TRX1, Txn1 | thioredoxin | 208864_s_at
Table 5

The 20 cancer related genes of the 100 features selected by SVMRFE on original training group of MAQC-II breast cancer data for pCR prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
APP | A beta 25–35, A-BETA 40, A-BETA 42, AAA, ABETA, ABPP, AD1, Adap, AL024401, AMYLOID BETA, AMYLOID BETA 40, AMYLOID BETA 40 HUMAN, AMYLOID BETA 42, Amyloid beta A4, AMYLOID BETA PEPTIDE 40, Amyloid precursor, Amyloidogenic glycoprotein, App alpha, APPI, appican, BETAAPP, CTFgamma, CVAP, E030013M08RIK, Nexin II, P3, PN2, PreA4, PROTEASE NEXIN2 | amyloid beta (A4) precursor protein | 214953_s_at
CLEC3B | DKFZp686H17246, TETRANECTIN, TN, TNA, TTN | C-type lectin domain family 3, member B | 205200_at
CXCL9 | BB139920, CMK, crg-10, Humig, MGC105312, MIG, SCYB9 | chemokine (C-X-C motif) ligand 9 | 203915_at
DKK1 | Dkk1 predicted, mdkk-1, SK | dickkopf homolog 1 (Xenopus laevis) | 204602_at
DRD2 | D2, D2 DOPAMINE RECEPTOR, D2a dopamine receptor, D2DR, D2R, D2S, DOPAMINE D2 RECEPTOR, Dr2 | dopamine receptor D2 | 216924_s_at
EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at
ESR1 | AA420328, Alpha estrogen receptor, AU041214, DKFZp686N23123, ER, ER ALPHA, Er alpha (46 kDa isoform), ER66, ERA, ER[a], ESR, ESRA, ESTR, ESTRA, ESTROGEN RECEPTOR ALPHA, ESTROGEN RECEPTOR1, NR3A1, RNESTROR, TERP-1 | estrogen receptor 1 | 215552_s_at
FAM153A | KIAA0752, NY-REN-7 | family with sequence similarity 153, member A | 211166_at
GSN | DKFZp313L0718, GELSOLIN, MGC28083, MGC95032 | gelsolin (amyloidosis, Finnish type) | 214040_s_at
IFNAR1 | ALPHA CHAIN OF TYPE I IFNR, AVP, BETA R1, CD118, Ifar, IFN RECEPTOR TYPE 1, IFN TYPE 1 RECEPTOR, IFN-alpha-beta-R, IFN-ALPHA-REC, IFNalpha/betaR, IFNAR, IFNBR, IFRC, Infar, INFAR1, Interferon Receptor, LOC284829, Type I infr | interferon (alpha, beta and omega) receptor 1 | 204191_at
KL | ALPHA KLOTHO, alpha-kl, KLOTHO | klotho | 205978_at
NAIP | AV364616, BIRC1, BIRC1A, Birc1b, Birc1e, BIRC1F, D13Lsd1, FLJ18088, FLJ42520, FLJ58811, LGN1, LOC652755, Naip-rs1, Naip-rs3, Naip-rs4, Naip-rs4A, Naip1, Naip2, Naip5, Naip6, NLRB1, psiNAIP, RGD1559914 | NLR family, apoptosis inhibitory protein | 204861_s_at
NDST1 | 1200015G06RIK, HSNST, HSST, HSST1, NST1 | N-deacetylase/N-sulfotransferase (heparan glucosaminyl) 1 | 202608_s_at
PPP1R15A (includes EG:23645) | 9630030H21, GADD34, MYD116, Myeloid Differentiation, Peg-3, PP1 REGULATORY SUBUNIT, Ppp1r15a | protein phosphatase 1, regulatory (inhibitor) subunit 15A | 202014_at
PTGIS | CYP8, CYP8A1, MGC126858, MGC126860, Pcs, Pgi2, PGIS, PROSTACYCLIN, Prostacyclin Synthase, PTGI | prostaglandin I2 (prostacyclin) synthase | 211892_s_at
RAD23B | 0610007D13Rik, AV001138, HHR23B, HR23B, MGC112630, mHR23B, P58 | RAD23 homolog B (S. cerevisiae) | 201223_s_at
SP1 | 1110003E12RIK, AA450830, AI845540, Sp1 (trans spliced isoform), SP1-1, Trans-acting transcription factor 1 | Sp1 transcription factor | 214732_at
TACSTD2 | C80403, EGP-1, GA733, GA733-1, Ly97, M1S1, MGC141612, MGC141613, MGC72570, Prp1, TROP2 | tumor-associated calcium signal transducer 2 | 202286_s_at
TUBB3 | 3200002H15Rik, beta-4, M(beta)3, M(beta)6, MC1R, Nst, Tub beta3, TUBB4, Tubulin beta-3, Tubulin beta-III, Tuj1 | tubulin, beta 3 | 202154_x_at
ZNF10 | KOX1 | zinc finger protein 10 | 216350_s_at
Table 6

The 34 cancer related genes of the 100 features selected by NBC-MSC on original training group of MAQC-II breast cancer data for erpos prediction.

Symbol | Synonym(s) | Entrez Gene Name | Affymetrix probe set ID
APP | A beta 25–35, A-BETA 40, A-BETA 42, AAA, ABETA, ABPP, AD1, Adap, AL024401, AMYLOID BETA, AMYLOID BETA 40, AMYLOID BETA 40 HUMAN, AMYLOID BETA 42, Amyloid beta A4, AMYLOID BETA PEPTIDE 40, Amyloid precursor, Amyloidogenic glycoprotein, App alpha, APPI, appican, BETAAPP, CTFgamma, CVAP, E030013M08RIK, Nexin II, P3, PN2, PreA4, PROTEASE NEXIN2 | amyloid beta (A4) precursor protein | 214953_s_at
CCDC28A | 1700009P13Rik, AI480677, C6ORF80, CCRL1AP, DKFZP586D0623, MGC116395, MGC131913, MGC19351, RGD1310326 | coiled-coil domain containing 28A | 209479_at
CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211883_x_at
CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at
CXCL9 | BB139920, CMK, crg-10, Humig, MGC105312, MIG, SCYB9 | chemokine (C-X-C motif) ligand 9 | 203915_at
EIF1 | A121, EIF1A, ISO1, MGC101938, MGC6503, SUI1, SUI1-RS1 | eukaryotic translation initiation factor 1 | 212130_x_at
EMP1 | CL-20, ENP1MR, EPITHELIAL MEMBRANE PROTEIN 1, MGC93627, TMP | epithelial membrane protein 1 | 201325_s_at
EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at
ETV5 | 1110005E01Rik, 8430401F14Rik, ERM | ets variant 5 | 203349_s_at
FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at
FBXL7 | AL023057, D230018M15Rik, FBL6, FBL7, FBP7, MGC102204 | F-box and leucine-rich repeat protein 7 | 213249_at
GHR | AA986417, GHBP, GHR/BP, GROWTH HORMONE RECEPTOR, MGC124963, MGC156665 | growth hormone receptor | 205498_at
GPC3 | DGSX, Glypican 3, GTR2-2, MGC93606, OCI-5, SDYS, SGB, SGBS, SGBS1 | glypican 3 | 209220_at
GPS2 | AI505953, AMF-1, MGC104294, MGC119287, MGC119288, MGC119289 | G protein pathway suppressor 2 | 209350_s_at
IGF2BP3 | 2610101N11Rik, AA522010, Ab2-255, AL022933, AU045931, DKFZp686F1078, IMP-3, KOC1, Koc13, mimp3, Neilsen, RGD1306512, VICKZ3 | insulin-like growth factor 2 mRNA binding protein 3 | 203820_s_at
ING4 | D6Wsu147e, D6Xrf92, MGC12557, MGC156688, my036, p29ING4, p33ING1 ISOLOG | inhibitor of growth family, member 4 | 218234_at
ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204989_s_at
KLF5 | 4930520J07Rik, BTEB2, CKLF, IKLF, Kruppel-like factor 5, mBTEB2 | Kruppel-like factor 5 (intestinal) | 209211_at
MFAP3L | 4933428A15RIK, 5430405D20Rik, AI461995, AW125052, KIAA0626, mKIAA0626, NYD-sp9 | microfibrillar-associated protein 3-like | 210493_s_at
NCOR1 | 5730405M06RIK, A230020K14RIK, hCIT529I10, hN-CoR, KIAA1047, MGC104216, MGC116328, mKIAA1047, N-COR, RIP13, Rxrip13, TRAC1 | nuclear receptor co-repressor 1 | 200854_at
PA2G4 | 38kDa, AA672939, EBP1, HG4-1, Itaf45, MGC94070, p38-2G4, Plfap, PROLIFERATION ASSOCIATED 2G4, Proliveration-associated protein 1 | proliferation-associated 2G4, 38kDa | 214794_at
PCNA | MGC8367, Pcna/cyclin, PCNAR | proliferating cell nuclear antigen | 217400_at
PLD1 | AA536939, C85393, Pc-Plc, Phospholipase D1, PLD1A, PLD1B, Plda, Pldb | phospholipase D1, phosphatidylcholine-specific | 215723_s_at
RPL13A | 1810026N22Rik, 23-KD HIGHLY BASIC, MGC107571, Ribol13a, Ribosomal protein l13a, Tstap198-7, tum-antigen | ribosomal protein L13a | 211942_x_at
SCGB1D2 | BU101, LIPB, LIPOPHILIN B, LPHB | secretoglobin, family 1D, member 2 | 206799_at
SEL1L | AW493766, IBD2, KIAA4137, mKIAA4137, PRO1063, SEL1, SEL1-LIKE, SEL1H | sel-1 suppressor of lin-12-like (C. elegans) | 202063_s_at
SERPINA6 | AI265318, AV104445, CBG, MGC112780 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6 | 206325_at
SMA4 | b55C20.2, FLJ36702, MGC22265, MGC60382, SMA3 | glucuronidase, beta pseudogene | 214850_at
SMC4 | 2500002A22Rik, C79747, CAP-C, DKFZP434F205, HCAP-C, MGC125078, SMC4L1 | structural maintenance of chromosomes 4 | 215623_x_at
SOS1 | 9630010N06, AI449023, GF1, GGF1, GINGF, HGF, MSOS1, NS4, Sos | son of sevenless homolog 1 (Drosophila) | 212780_at
TNFRSF11A | CD265, FEO, LOH18CR1, Ly109, MGC112793, mRANK, ODAR, ODFR, OFE, OPGL receptor, OPTB7, OSTS, PDB2, RANK, RGD1563614, TRANCE-R | tumor necrosis factor receptor superfamily, member 11a, NFKB activator | 207037_at
TNFSF12 | APO3L, DR3L, DR3LG, MGC129581, MGC20669, TWEAK | tumor necrosis factor (ligand) superfamily, member 12 | 205611_at
TRPV6 | ABP/ZF, Cac, CAT, CAT-L, CAT1, Crac, ECAC2, HSA277909, LP6728, Otrpc3, ZFAB | transient receptor potential cation channel, subfamily V, member 6 | 206827_s_at
USP7 | 2210010O09Rik, AA409944, AA617399, AU019296, AW548146, C80752, HAUSP, TEF1 | ubiquitin specific peptidase 7 (herpes virus-associated) | 222032_s_at
Table 7

The 33 cancer-related genes among the 100 features selected by NMSC-MSC on the original training group of the MAQC-II breast cancer data for erpos prediction.

| Symbol | Synonym(s) | Entrez Gene Name | Affymetrix |
| --- | --- | --- | --- |
| BAG1 | BAG1L, Bag1s, RAP46 | BCL2-associated athanogene | 202387_at |
| CCDC28A | 1700009P13Rik, AI480677, C6ORF80, CCRL1AP, DKFZP586D0623, MGC116395, MGC131913, MGC19351, RGD1310326 | coiled-coil domain containing 28A | 209479_at |
| CEACAM1 | bb-1, BGP, BGP1, BGPA, Bgpd, BGPI, BGPR, C-CAM, C-CAM1, CCAM105, CD66, CD66A, Cea-1, Cea-7, CEACAM1-4L, ECTO ATPASE, HV2, mCEA1, Mhv-1, MHVR, MHVR1, mmCGM1, mmCGM1a, mmCGM2, Pp120 | carcinoembryonic antigen-related cell adhesion molecule 1 (biliary glycoprotein) | 211883_x_at |
| CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at |
| EIF1 | A121, EIF1A, ISO1, MGC101938, MGC6503, SUI1, SUI1-RS1 | eukaryotic translation initiation factor 1 | 212130_x_at |
| EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at |
| FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at |
| GBP1 (includes EG:2633) | 5830475C06, GBP1, GBPI, IFI67-K, MAG-1, MGC124334, Mpa-1 | guanylate binding protein 1, interferon-inducible, 67kDa | 202270_at |
| GHR | AA986417, GHBP, GHR/BP, GROWTH HORMONE RECEPTOR, MGC124963, MGC156665 | growth hormone receptor | 205498_at |
| GPC3 | DGSX, Glypican 3, GTR2-2, MGC93606, OCI-5, SDYS, SGB, SGBS, SGBS1 | glypican 3 | 209220_at |
| GPS2 | AI505953, AMF-1, MGC104294, MGC119287, MGC119288, MGC119289 | G protein pathway suppressor 2 | 209350_s_at |
| IMPDH2 | ENSMUSG00000071041, Imp dehydrogenase 2, IMPD, IMPD2, Impdh, IMPDH-II, inosine 5′-phosphate dehydrogenase 2, MGC72938, OTTMUSG00000019498 | IMP (inosine monophosphate) dehydrogenase 2 | 201892_s_at |
| ING4 | D6Wsu147e, D6Xrf92, MGC12557, MGC156688, my036, p29ING4, p33ING1 ISOLOG | inhibitor of growth family, member 4 | 218234_at |
| ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204990_s_at |
| KLF5 | 4930520J07Rik, BTEB2, CKLF, IKLF, Kruppel-like factor 5, mBTEB2 | Kruppel-like factor 5 (intestinal) | 209211_at |
| MAP3K12 | DLK, MUK, PK, ZPK, ZPKP1 | mitogen-activated protein kinase kinase kinase 12 | 205447_s_at |
| MFAP3L | 4933428A15RIK, 5430405D20Rik, AI461995, AW125052, KIAA0626, mKIAA0626, NYD-sp9 | microfibrillar-associated protein 3-like | 210493_s_at |
| NCOR1 | 5730405M06RIK, A230020K14RIK, hCIT529I10, hN-CoR, KIAA1047, MGC104216, MGC116328, mKIAA1047, N-COR, RIP13, Rxrip13, TRAC1 | nuclear receptor co-repressor 1 | 200854_at |
| PA2G4 | 38kDa, AA672939, EBP1, HG4-1, Itaf45, MGC94070, p38-2G4, Plfap, PROLIFERATION ASSOCIATED 2G4, Proliveration-associated protein 1 | proliferation-associated 2G4, 38kDa | 214794_at |
| PCNA | MGC8367, Pcna/cyclin, PCNAR | proliferating cell nuclear antigen | 217400_at |
| PLA2G16 | ADPLA, C78643, H-REV107-1, HRASLS3, HREV107, HREV107-3, MGC118754 | phospholipase A2, group XVI | 209581_at |
| PLD1 | AA536939, C85393, Pc-Plc, Phospholipase D1, PLD1A, PLD1B, Plda, Pldb | phospholipase D1, phosphatidylcholine-specific | 215723_s_at |
| PTPN14 | C130080N23RIK, MGC126803, OTTMUSG00000022087, PEZ, PTP36, PTPD2 | protein tyrosine phosphatase, non-receptor type 14 | 205503_at |
| SCGB1D2 | BU101, LIPB, LIPOPHILIN B, LPHB | secretoglobin, family 1D, member 2 | 206799_at |
| SEL1L | AW493766, IBD2, KIAA4137, mKIAA4137, PRO1063, SEL1, SEL1-LIKE, SEL1H | sel-1 suppressor of lin-12-like (C. elegans) | 202063_s_at |
| SERPINA6 | AI265318, AV104445, CBG, MGC112780 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6 | 206325_at |
| SMC4 | 2500002A22Rik, C79747, CAP-C, DKFZP434F205, HCAP-C, MGC125078, SMC4L1 | structural maintenance of chromosomes 4 | 215623_x_at |
| SPAM1 (includes EG:6677) | 4933439A12Rik, HYA1, HYAL1, HYAL3, HYAL5, MGC108951, MGC26532, PH-20, SPAG15, SPAM1, TESTICULAR HYALURONIDASE | sperm adhesion molecule 1 (PH-20 hyaluronidase, zona pellucida binding) | 210536_s_at |
| TCF3 | A1, AA408400, ALF2, AW209082, bHLHb21, E12, E12/E47, E2-5, E2A, E47, ITF1, ME2, MGC129647, MGC129648, PAN1, Pan2, Tcfe2a, TRANSCRIPTION FACTOR 3, VDIR | transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47) | 213730_x_at |
| TGFA | RATTGFAA, TFGA, TGF ALPHA, TGFAA, wa-1 | transforming growth factor, alpha | 211258_s_at |
| TNFRSF11A | CD265, FEO, LOH18CR1, Ly109, MGC112793, mRANK, ODAR, ODFR, OFE, OPGL receptor, OPTB7, OSTS, PDB2, RANK, RGD1563614, TRANCE-R | tumor necrosis factor receptor superfamily, member 11a, NFKB activator | 207037_at |
| TNFSF12 | APO3L, DR3L, DR3LG, MGC129581, MGC20669, TWEAK | tumor necrosis factor (ligand) superfamily, member 12 | 205611_at |
| USP7 | 2210010O09Rik, AA409944, AA617399, AU019296, AW548146, C80752, HAUSP, TEF1 | ubiquitin specific peptidase 7 (herpes virus-associated) | 222032_s_at |
Table 8

The 40 cancer-related genes among the 100 features selected by GLGS on the original training group of the MAQC-II breast cancer data for erpos prediction.

| Symbol | Synonym(s) | Entrez Gene Name | Affymetrix |
| --- | --- | --- | --- |
| ADRA2A | ADRA-2, ADRA2R, ADRAR, ADRENOCEPTOR ALPHA2A, alpha(2A)AR, ALPHA-2A ADRENERGIC RECEPTOR, ALPHA2-C10, alpha2A, Alpha2a Adrenoceptor, ALPHA2A-AR, AW122659, RATRG20, RG20, ZNF32 | adrenergic, alpha-2A-, receptor | 209869_at |
| ANGPTL4 | ANGIOPOIETIN-LIKE 4, ANGPTL2, ARP4, Bk89, Fasting Induced Adipose Factor, FIAF, Harp, HFARP, LOC362850, Ng27, NL2, PGAR, PGARG, pp1158, PPARG | angiopoietin-like 4 | 221009_s_at |
| BAG1 | BAG1L, Bag1s, RAP46 | BCL2-associated athanogene | 202387_at |
| CCNE1 | AW538188, CCNE, CYCLE, CYCLIN E, Cyclin E1 | cyclin E1 | 213523_at |
| CHRNA5 | AChR alpha5, Acra-5, ALPHA5 NACHR, ALPHA5 NICOTINIC RECEPTOR, LNCR2, MGC124059, MGC124168, nAChR alpha5, sialoprotein | cholinergic receptor, nicotinic, alpha 5 | 206533_at |
| CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at |
| CTSL2 | 1190035F06Rik, Cathepsin l, CATHEPSIN V, CATHL, CATL2, Ctsl, Ctsl1, CTSU, CTSV, fs, MEP, MGC125957, nkt | cathepsin L2 | 210074_at |
| EGFR | 9030024J15RIK, AI552599, EGF RECEPTOR, EGF-TK, EGFR1, EGFRec, ER2, ERBB, ERBB1, Errp, HER1, MENA, PIG61, wa-2, Wa5 | epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) | 201983_s_at |
| EPHA2 | AW545284, ECK, ECKR, Myk2, Sek-2 | EPH receptor A2 | 203499_at |
| EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at |
| FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at |
| GPC3 | DGSX, Glypican 3, GTR2-2, MGC93606, OCI-5, SDYS, SGB, SGBS, SGBS1 | glypican 3 | 209220_at |
| GPS2 | AI505953, AMF-1, MGC104294, MGC119287, MGC119288, MGC119289 | G protein pathway suppressor 2 | 209350_s_at |
| HMGA1 | AL023995, HMG-I(Y), HMG-R, Hmg-y/i, HMGA1A, Hmgi, Hmgi/y, HMGIY, HMGY, MGC102580, MGC12816, MGC4242, MGC4854 | high mobility group AT-hook 1 | 206074_s_at |
| HNRPDL | AA407431, AA959857, D5Ertd650e, D5Wsu145e, HNRNP, HNRNP-D LIKE, hnRNP-DL, JKTBP, JKTBP1, JKTBP2, laAUF1, MGC125262 | heterogeneous nuclear ribonucleoprotein D-like | 209068_at |
| HSD17B2 | 17Hsd, AI194836, AI194967, AI255511, EDH17B2, HSD17, SDR9C2 | hydroxysteroid (17-beta) dehydrogenase 2 | 204818_at |
| ICAM1 | BB2, CD54, ICAM, INTERCELLULAR ADHESION MOLECULE 1, Ly-47, M90551, MALA-2, Melanoma Progression Associated Antigen, MGC6195, P3.58 | intercellular adhesion molecule 1 | 202637_s_at |
| IGF2BP3 | 2610101N11Rik, AA522010, Ab2-255, AL022933, AU045931, DKFZp686F1078, IMP-3, KOC1, Koc13, mimp3, Neilsen, RGD1306512, VICKZ3 | insulin-like growth factor 2 mRNA binding protein 3 | 203820_s_at |
| ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204989_s_at |
| MAP3K12 | DLK, MUK, PK, ZPK, ZPKP1 | mitogen-activated protein kinase kinase kinase 12 | 205447_s_at |
| MFAP3L | 4933428A15RIK, 5430405D20Rik, AI461995, AW125052, KIAA0626, mKIAA0626, NYD-sp9 | microfibrillar-associated protein 3-like | 210493_s_at |
| PCM1 | 2600002H09Rik, 9430077F19Rik, C030044G17Rik, LOC100044052, MGC170660, PTC4 | pericentriolar material 1 | 214118_x_at |
| PCNA | MGC8367, Pcna/cyclin, PCNAR | proliferating cell nuclear antigen | 217400_at |
| PLD1 | AA536939, C85393, Pc-Plc, Phospholipase D1, PLD1A, PLD1B, Plda, Pldb | phospholipase D1, phosphatidylcholine-specific | 215723_s_at |
| PML | 1200009E24Rik, AI661194, MYL, PP8675, Retinoic acid receptor, RGD1562602, RNF71, TRIM19 | promyelocytic leukemia | 211013_x_at |
| PPP3CB | 1110063J16Rik, Calcineurin A Beta, CALCINEURIN A1, CALNA2, CALNB, CnA-gamma, CnAbeta, Protein phosphatase 3 catalytic subunit beta isoform | protein phosphatase 3 (formerly 2B), catalytic subunit, beta isoform | 202432_at |
| PURA | CAGER-1, PUR-ALPHA, PUR1, ssCRE-BP, VACSSBF1 | purine-rich element binding protein A | 204020_at |
| PVR | 3830421F03Rik, CD155, D7ERTD458E, FLJ25946, HVED, mE4, NECL5, NECTIN 2 ALPHA, NECTIN-2, Pe4, PVS, TAA1, TAGE4 | poliovirus receptor | 212662_at |
| RBM5 | D030069N10RIK, FLJ39876, G15, H37, LUCA15, RMB5 | RNA binding motif protein 5 | 201395_at |
| SEL1L | AW493766, IBD2, KIAA4137, mKIAA4137, PRO1063, SEL1, SEL1-LIKE, SEL1H | sel-1 suppressor of lin-12-like (C. elegans) | 202063_s_at |
| SERPINA6 | AI265318, AV104445, CBG, MGC112780 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6 | 206325_at |
| SLC5A3 | AA623876, BF642829, SMIT, Smit1, SMIT2 | solute carrier family 5 (sodium/myo-inositol cotransporter), member 3 | 213164_at |
| SLC7A5 | 4f2 light chain, 4F2LC, CD98, CD98 LIGHT CHAIN, CD98LC, D0H16S474E, D16S469E, E16, E16/TA1, hLAT1, LAT1, MPE16, TA1 | solute carrier family 7 (cationic amino acid transporter, y+ system), member 5 | 201195_s_at |
| SMC4 | 2500002A22Rik, C79747, CAP-C, DKFZP434F205, HCAP-C, MGC125078, SMC4L1 | structural maintenance of chromosomes 4 | 215623_x_at |
| SOS1 | 9630010N06, AI449023, GF1, GGF1, GINGF, HGF, MSOS1, NS4, Sos | son of sevenless homolog 1 (Drosophila) | 212780_at |
| TGFA | RATTGFAA, TFGA, TGF ALPHA, TGFAA, wa-1 | transforming growth factor, alpha | 205016_at |
| TNFRSF11A | CD265, FEO, LOH18CR1, Ly109, MGC112793, mRANK, ODAR, ODFR, OFE, OPGL receptor, OPTB7, OSTS, PDB2, RANK, RGD1563614, TRANCE-R | tumor necrosis factor receptor superfamily, member 11a, NFKB activator | 207037_at |
| UBE2B | 17-kDa Ubiquitin-Conjugating Enzyme E2, 2610301N02RIK, E2-14K, E2-17kDa, E2b, HHR6B, HR6B, LOC81816, RAD6B, UBC2 | ubiquitin-conjugating enzyme E2B (RAD6 homolog) | 211763_s_at |
| USP7 | 2210010O09Rik, AA409944, AA617399, AU019296, AW548146, C80752, HAUSP, TEF1 | ubiquitin specific peptidase 7 (herpes virus-associated) | 222032_s_at |
| VAMP2 | FLJ11460, mVam2, RATVAMPB, RATVAMPIR, SYB, SYB2, SYNAPTOBREVIN 2, Vamp ii | vesicle-associated membrane protein 2 (synaptobrevin 2) | 201557_at |
Table 9

The 44 cancer-related genes among the 100 features selected by LOOCSFS on the original training group of the MAQC-II breast cancer data for erpos prediction.

| Symbol | Synonym(s) | Entrez Gene Name | Affymetrix |
| --- | --- | --- | --- |
| ACP1 | 4632432E04Rik, AI427468, HAAP, LMPTP, LMW-PTP, MGC103115, MGC111030, MGC132904, MGC3499 | acid phosphatase 1, soluble | 215227_x_at |
| BAG1 | BAG1L, Bag1s, RAP46 | BCL2-associated athanogene | 202387_at |
| BCL2 | AW986256, Bcl2 alpha, C430015F12Rik, CED9, D630044D05RIK, D830018M01RIK, LOC100046608, ORF16 | B-cell CLL/lymphoma 2 | 203685_at |
| BTG2 | AA959598, Agl, An, an-1, APRO1, B-cell translocation gene 2, antiproliferative, MGC126063, MGC126064, PC3, TIS21 | BTG family, member 2 | 201236_s_at |
| CCDC28A | 1700009P13Rik, AI480677, C6ORF80, CCRL1AP, DKFZP586D0623, MGC116395, MGC131913, MGC19351, RGD1310326 | coiled-coil domain containing 28A | 209479_at |
| CCNA2 | AA408589, CCN1, CCNA, CYCA, CYCLIN A2, CYCLIN-A, MGC156527, p60 | cyclin A2 | 213226_at |
| CD47 | 9130415E20RIK, AA407862, AI848868, AW108519, B430305P08RIK, CD47 ANTIGEN, CDw149, IAP, Itgp, Locuslink 71587, MER6, MGC93490, OA3 | CD47 molecule | 211075_s_at |
| CFD | ADIPSIN, ADN, Complement Factor D, DF, EVE, FACTOR D, PFD | complement factor D (adipsin) | 205382_s_at |
| CHRNA5 | AChR alpha5, Acra-5, ALPHA5 NACHR, ALPHA5 NICOTINIC RECEPTOR, LNCR2, MGC124059, MGC124168, nAChR alpha5, sialoprotein | cholinergic receptor, nicotinic, alpha 5 | 206533_at |
| CTSL2 | 1190035F06Rik, Cathepsin l, CATHEPSIN V, CATHL, CATL2, Ctsl, Ctsl1, CTSU, CTSV, fs, MEP, MGC125957, nkt | cathepsin L2 | 210074_at |
| CXCL10 | C7, CRG-2, GAMMA-IFN INDUCIBLE EARLY RESPONSE, gIP-10, IFI10, IFNG INDUCIBLE PROTEIN-10, INP10, Interferon-inducible protein-10, IP-10, mob-1, SCYB10, SMALL INDUCIBLE CYTOKINE SUBFAMILY B (Cys-X-Cys), MEMBER 10 | chemokine (C-X-C motif) ligand 10 | 204533_at |
| CYB5A | 0610009N12Rik, CB5, CYB5, Cytb5, CYTOCHROME B5, MCB5, MGC108694, MGC128769 | cytochrome b5 type A (microsomal) | 209366_x_at |
| DDX17 | 2610007K22RIK, A430025E01Rik, AI047725, C80929, DEAD-box protein p72, DEAD/H box RNA helicase p72, DKFZp761H2016, Gm926, MGC109323, MGC79147, P72, RH70 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 17 | 208718_at |
| DNMT3B | DNA MTase HsaIIIB, ICF, M.HsaIIIB, MGC124407 | DNA (cytosine-5-)-methyltransferase 3 beta | 220668_s_at |
| EPHA2 | AW545284, ECK, ECKR, Myk2, Sek-2 | EPH receptor A2 | 203499_at |
| ERBB4 | C-erb-b4, Erbb4 Cyt2, ERBB4 JM-A, HER4, MGC138404, p180erbB4 | v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian) | 214053_at |
| ESR1 | AA420328, Alpha estrogen receptor, AU041214, DKFZp686N23123, ER, ER ALPHA, Er alpha (46 kDa isoform), ER66, ERA, ER[a], ESR, ESRA, ESTR, ESTRA, ESTROGEN RECEPTOR ALPHA, ESTROGEN RECEPTOR1, NR3A1, RNESTROR, TERP-1 | estrogen receptor 1 | 205225_at |
| ETV5 | 1110005E01Rik, 8430401F14Rik, ERM | ets variant 5 | 203349_s_at |
| FAM134B | 1810015C04RIK, AU015349, FLJ20152, FLJ22155, FLJ22179 | family with sequence similarity 134, member B | 218510_x_at |
| FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at |
| GREB1 | 5730583K22Rik, 9130004E13, AF180470, AU023194, KIAA0575, mKIAA0575 | GREB1 protein | 205862_at |
| IL6ST | 5133400A03Rik, AA389424, Ac1055, BB405851, CD130, CDw130, D13Ertd699e, GP130, GP130-RAPS, Il6 transd, IL6R-beta | interleukin 6 signal transducer (gp130, oncostatin M receptor) | 211000_s_at |
| IRS1 | ENSMUSG00000022591, G972R, HIRS-1, IRS1IRM | insulin receptor substrate 1 | 204686_at |
| ITPR1 | D6Pas2, I145TR, InsP3R, INSP3R1, IP3 RECEPTOR, Ip3 Receptor Type 1, IP3R, IP3R1, opt, P400, Pcd6, Pcp-1, SCA15, SCA16 | inositol 1,4,5-triphosphate receptor, type 1 | 203710_at |
| LRP8 | 4932703M08Rik, AA921429, AI848122, APOER2, HSZ75190, LR8B, MCI1 | low density lipoprotein receptor-related protein 8, apolipoprotein e receptor | 208433_s_at |
| MAP3K12 | DLK, MUK, PK, ZPK, ZPKP1 | mitogen-activated protein kinase kinase kinase 12 | 205447_s_at |
| MAPT | AI413597, ALZ50, AW045860, DDPAC, FLJ31424, FTDP-17, MAPTL, MGC134287, MGC138549, MGC156663, MSTD, Mtapt, MTBT1, MTBT2, PHF TAU, PPND, pTau, RNPTAU, TAU, Tau 3r, Tau-1, TAU-FACTOR, Tau40, TAU4R | microtubule-associated protein tau | 203929_s_at |
| MCM7 | AI747533, CDABP0042, CDC47, D16Mgi24, mCDC47, MCM2, Mcmd7, MGC93853, P1.1-MCM3, P1CDC47, P85MCM, PNAS-146 | minichromosome maintenance complex component 7 | 208795_s_at |
| MYO10 | AW048724, D15Ertd600e, FLJ10639, FLJ21066, FLJ22268, FLJ43256, KIAA0799, MGC131988, mKIAA0799, Myo10 (predicted), myosin-X | myosin X | 201976_s_at |
| NCAPG (includes EG:64151) | CAP-G, CHCG, FLJ12450, HCAP-G, MGC126525, NCAPG, NY-MEL-3, RGD1562646 | non-SMC condensin I complex, subunit G | 218663_at |
| NDRG1 | CAP43, CMT4D, DRG1, GC4, HMSNL, N-myc Downstream Regulated 1, N-myc downstream regulatory protein 1, NDR1, NDRL, NMSL, PROXY1, RIT42, RTP, TARG1, TDD5 | N-myc downstream regulated 1 | 200632_s_at |
| NOTCH3 | AW229011, CADASIL, CASIL | Notch homolog 3 (Drosophila) | 203238_s_at |
| NPY1R | MGC109393, NPY receptor Y1, NPY Y1 receptor, NPY-1, NPYIR, NPYR, Y1, Y1 RECEPTOR, Y1-R | neuropeptide Y receptor Y1 | 205440_s_at |
| PCM1 | 2600002H09Rik, 9430077F19Rik, C030044G17Rik, LOC100044052, MGC170660, PTC4 | pericentriolar material 1 | 214118_x_at |
| PDZK1 | 1700023D20Rik, 2610507N21Rik, 4921513F16Rik, AI267131, AI314638, AL022680, CAP70, CLAMP, D3Ertd537e, mPDZK1, NHERF3, PDZD1, Sodium sulfate cotransporter | PDZ domain containing 1 | 205380_at |
| PTPN14 | C130080N23RIK, MGC126803, OTTMUSG00000022087, PEZ, PTP36, PTPD2 | protein tyrosine phosphatase, non-receptor type 14 | 205503_at |
| PVR | 3830421F03Rik, CD155, D7ERTD458E, FLJ25946, HVED, mE4, NECL5, NECTIN 2 ALPHA, NECTIN-2, Pe4, PVS, TAA1, TAGE4 | poliovirus receptor | 212662_at |
| RPS15 | EG633683, MGC111130, RIG | ribosomal protein S15 | 200819_s_at |
| RRM2 | AA407299, MGC113712, MGC116120, R2, Ribonucleoside-diphosphate reductase M2 subunit, Ribonucleotide reductase non-heme subunit, RIBONUCLEOTIDE REDUCTASE SMALL SUBUNIT, Rnr-r2, RNRII, RR2, RR2M | ribonucleotide reductase M2 polypeptide | 209773_s_at |
| SEL1L | AW493766, IBD2, KIAA4137, mKIAA4137, PRO1063, SEL1, SEL1-LIKE, SEL1H | sel-1 suppressor of lin-12-like (C. elegans) | 202063_s_at |
| SLC39A6 | Ermelin, LIV-1 | solute carrier family 39 (zinc transporter), member 6 | 202088_at |
| TMBIM4 | 0610007H07Rik, AU022431, CGI-119, GAAP, MGC73002, S1R, ZPRO | transmembrane BAX inhibitor motif containing 4 | 219206_x_at |
| UBE2B | 17-kDa Ubiquitin-Conjugating Enzyme E2, 2610301N02RIK, E2-14K, E2-17kDa, E2b, HHR6B, HR6B, LOC81816, RAD6B, UBC2 | ubiquitin-conjugating enzyme E2B (RAD6 homolog) | 211763_s_at |
| VEGFA | Gd-vegf, MGC70609, VEGF, Vegf-3, VEGF1, VEGF120, VEGF164, Vegf165, Vegf188, Vegfa 188, VPF, VPF/VEGF | vascular endothelial growth factor A | 210513_s_at |
Table 10

The 40 cancer-related genes among the 100 features selected by SVMRFE on the original training group of the MAQC-II breast cancer data for erpos prediction.

| Symbol | Synonym(s) | Entrez Gene Name | Affymetrix |
| --- | --- | --- | --- |
| ANGPTL4 | ANGIOPOIETIN-LIKE 4, ANGPTL2, ARP4, Bk89, Fasting Induced Adipose Factor, FIAF, Harp, HFARP, LOC362850, Ng27, NL2, PGAR, PGARG, pp1158, PPARG | angiopoietin-like 4 | 221009_s_at |
| APP | A beta 25–35, A-BETA 40, A-BETA 42, AAA, ABETA, ABPP, AD1, Adap, AL024401, AMYLOID BETA, AMYLOID BETA 40, AMYLOID BETA 40 HUMAN, AMYLOID BETA 42, Amyloid beta A4, AMYLOID BETA PEPTIDE 40, Amyloid precursor, Amyloidogenic glycoprotein, App alpha, APPI, appican, BETAAPP, CTFgamma, CVAP, E030013M08RIK, Nexin II, P3, PN2, PreA4, PROTEASE NEXIN2 | amyloid beta (A4) precursor protein | 214953_s_at |
| CHRNB4 | Acrb-4, NACHR BETA4 | cholinergic receptor, nicotinic, beta 4 | 207516_at |
| CREBL2 | AI046348, B230205M03, MGC109304, MGC117311, MGC130380, MGC130381, MGC138362 | cAMP responsive element binding protein-like 2 | 201990_s_at |
| CTSL2 | 1190035F06Rik, Cathepsin l, CATHEPSIN V, CATHL, CATL2, Ctsl, Ctsl1, CTSU, CTSV, fs, MEP, MGC125957, nkt | cathepsin L2 | 210074_at |
| CX3CL1 | AB030188, ABCD-3, AI848747, C3Xkine, CX3C, CX3CL, CXC3, CXC3 CHEMOKINE PRECURSOR, CXC3C, D8Bwg0439e, FK, FKN, FRACTALKINE, NEUROTACTIN, NTN, NTT, SCYD1 | chemokine (C-X3-C motif) ligand 1 | 823_at |
| DACH1 | AI182278, Dac, DACH, E130112M23Rik, FLJ10138 | dachshund homolog 1 (Drosophila) | 205471_s_at |
| E2F1 | E2f, E2f1 predicted, KIAA4009, mKIAA4009, RBAP1, RBBP3, RBP3 | E2F transcription factor 1 | 204947_at |
| EPOR | EP-R, ERYTHROPOIETIN RECEPTOR, MGC108723, MGC138358 | erythropoietin receptor | 215054_at |
| ERBB4 | C-erb-b4, Erbb4 Cyt2, ERBB4 JM-A, HER4, MGC138404, p180erbB4 | v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian) | 214053_at |
| ESR1 | AA420328, Alpha estrogen receptor, AU041214, DKFZp686N23123, ER, ER ALPHA, Er alpha (46 kDa isoform), ER66, ERA, ER[a], ESR, ESRA, ESTR, ESTRA, ESTROGEN RECEPTOR ALPHA, ESTROGEN RECEPTOR1, NR3A1, RNESTROR, TERP-1 | estrogen receptor 1 | 217190_x_at |
| FAM152A | 5830417C01Rik, C1orf121, CGI-146, DKFZp586C1019, FLJ21998, PNAS-4 | family with sequence similarity 152, member A | 212371_at |
| FBLN1 | FBLN, FIBULIN 1 | fibulin 1 | 207834_at |
| GATA3 | HDR, MGC2346, MGC5199, MGC5445 | GATA binding protein 3 | 209602_s_at |
| GHR | AA986417, GHBP, GHR/BP, GROWTH HORMONE RECEPTOR, MGC124963, MGC156665 | growth hormone receptor | 205498_at |
| GPC3 | DGSX, Glypican 3, GTR2-2, MGC93606, OCI-5, SDYS, SGB, SGBS, SGBS1 | glypican 3 | 209220_at |
| GREB1 | 5730583K22Rik, 9130004E13, AF180470, AU023194, KIAA0575, mKIAA0575 | GREB1 protein | 205862_at |
| IGF2BP3 | 2610101N11Rik, AA522010, Ab2-255, AL022933, AU045931, DKFZp686F1078, IMP-3, KOC1, Koc13, mimp3, Neilsen, RGD1306512, VICKZ3 | insulin-like growth factor 2 mRNA binding protein 3 | 203820_s_at |
| IL6ST | 5133400A03Rik, AA389424, Ac1055, BB405851, CD130, CDw130, D13Ertd699e, GP130, GP130-RAPS, Il6 transd, IL6R-beta | interleukin 6 signal transducer (gp130, oncostatin M receptor) | 212195_at |
| IRS1 | ENSMUSG00000022591, G972R, HIRS-1, IRS1IRM | insulin receptor substrate 1 | 204686_at |
| ITGB4 | AA407042, C230078O20, CD104, INTEGRIN-BETA 4 | integrin, beta 4 | 204990_s_at |
| ITPR1 | D6Pas2, I145TR, InsP3R, INSP3R1, IP3 RECEPTOR, Ip3 Receptor Type 1, IP3R, IP3R1, opt, P400, Pcd6, Pcp-1, SCA15, SCA16 | inositol 1,4,5-triphosphate receptor, type 1 | 211323_s_at |
| LMO4 | A730077C12Rik, Crp3, Etohi4, MGC105593 | LIM domain only 4 | 209205_s_at |
| MFAP3L | 4933428A15RIK, 5430405D20Rik, AI461995, AW125052, KIAA0626, mKIAA0626, NYD-sp9 | microfibrillar-associated protein 3-like | 210493_s_at |
| MGMT | AGAT, AGT, AI267024, ATase, MGC107020, O6-ALKYLGUANINE DNA ALKYLTRANSFERASE | O-6-methylguanine-DNA methyltransferase | 204880_at |
| PCNA | MGC8367, Pcna/cyclin, PCNAR | proliferating cell nuclear antigen | 217400_at |
| PGR | 9930019P03Rik, BB114106, ENSMUSG00000074510, LOC360433, NR3C3, Pgrb, PR, PR BETA, PR-A, PR-B, PROGESTERONE RECEPTOR, Progesterone receptor A | progesterone receptor | 208305_at |
| PLD1 | AA536939, C85393, Pc-Plc, Phospholipase D1, PLD1A, PLD1B, Plda, Pldb | phospholipase D1, phosphatidylcholine-specific | 215723_s_at |
| PML | 1200009E24Rik, AI661194, MYL, PP8675, Retinoic acid receptor, RGD1562602, RNF71, TRIM19 | promyelocytic leukemia | 211013_x_at |
| PVR | 3830421F03Rik, CD155, D7ERTD458E, FLJ25946, HVED, mE4, NECL5, NECTIN 2 ALPHA, NECTIN-2, Pe4, PVS, TAA1, TAGE4 | poliovirus receptor | 212662_at |
| RARRES1 | 5430417P09RIK, AI662122, TIG1, Tig1/retinoic acid receptor responder 1 | retinoic acid receptor responder (tazarotene induced) 1 | 206391_at |
| SCGB1D2 | BU101, LIPB, LIPOPHILIN B, LPHB | secretoglobin, family 1D, member 2 | 206799_at |
| SEL1L | AW493766, IBD2, KIAA4137, mKIAA4137, PRO1063, SEL1, SEL1-LIKE, SEL1H | sel-1 suppressor of lin-12-like (C. elegans) | 202063_s_at |
| SERPINA5 | 4933415L04, MGC93420, PAI-3, PCI, Pi5 alpha1, PLANH3, PROCI | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 5 | 209443_at |
| SERPINA6 | AI265318, AV104445, CBG, MGC112780 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6 | 206325_at |
| SLC7A5 | 4f2 light chain, 4F2LC, CD98, CD98 LIGHT CHAIN, CD98LC, D0H16S474E, D16S469E, E16, E16/TA1, hLAT1, LAT1, MPE16, TA1 | solute carrier family 7 (cationic amino acid transporter, y+ system), member 5 | 201195_s_at |
| SPAM1 (includes EG:6677) | 4933439A12Rik, HYA1, HYAL1, HYAL3, HYAL5, MGC108951, MGC26532, PH-20, SPAG15, SPAM1, TESTICULAR HYALURONIDASE | sperm adhesion molecule 1 (PH-20 hyaluronidase, zona pellucida binding) | 210536_s_at |
| TBC1D9 | 4933431N12RIK, AI847101, AW490653, C76116, KIAA0882, MDR1, RGD1308221 | TBC1 domain family, member 9 (with GRAM domain) | 212956_at |
| TFRC | 2610028K12Rik, AI195355, AI426448, AU015758, CD71, E430033M20Rik, Mtvr-1, p90, TFNR, TFR, TFR1, TRANSFERRIN RECEPTOR, TRFR | transferrin receptor (p90, CD71) | 208691_at |
| TRPV6 | ABP/ZF, Cac, CAT, CAT-L, CAT1, Crac, ECAC2, HSA277909, LP6728, Otrpc3, ZFAB | transient receptor potential cation channel, subfamily V, member 6 | 206827_s_at |
Figure 1 shows the testing accuracy, MCC values, and AUC errors for the prediction of erpos using the two RFA methods, NBC-MSC and NMSC-MSC, as well as GLGS, LOOCSFS, and SVMRFE, with the four learning classifiers. In terms of testing accuracy and MCC values (left and middle columns, Fig. 1), the two RFA methods on average outperform the other gene selection methods compared. The advantage of RFA is especially noticeable with the NMSC and UDC classifiers.
Figure 1

Comparison of different gene selection methods for prediction of erpos status of MAQC-II breast cancer dataset with different learning classifiers.

The X-axis shows the number of features used, and the Y-axis shows the average testing accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

We notice that the performance evaluations using testing accuracy (left column) and MCC (middle column) are consistent with each other, but the evaluation using AUC is not always consistent with them. For instance, when UDC is applied to the feature sets, the RFA methods have the best prediction performance in terms of testing accuracy and MCC values; with respect to AUC errors, however, RFA is not the best. In terms of AUC errors, the best results are obtained by using RFA with the NMSC classifier. The AUC errors are as low as 0.08, which is much better than the results obtained with the other gene selection methods. The prediction results on the pCR endpoint, shown in Figure 2, also demonstrate the advantage of the RFA methods over the other compared methods. All the best prediction results, evaluated by AUC, MCC, and testing accuracy, are obtained by using RFA methods.
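The disagreement between the threshold-based measures (testing accuracy, MCC) and the ranking-based AUC can be illustrated with a minimal sketch. It assumes binary labels coded 0/1 and takes the AUC error to be 1 − AUC; that reading, the function names, and the toy labels and scores are illustrative assumptions, not the authors' implementation.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def auc_error(y_true, scores):
    """1 - AUC, with AUC as the fraction of correctly ranked
    (positive, negative) pairs; ties count as half (assumed definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return 1.0 - wins / (len(pos) * len(neg))

# Toy example (made-up values): hard predictions and continuous scores.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
print(accuracy(y_true, y_pred), mcc(y_true, y_pred), auc_error(y_true, scores))
```

Accuracy and MCC depend only on the hard class decisions, whereas the AUC error depends on how the classifier's scores rank positives against negatives, so a method can lead on one kind of measure without leading on the other.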
Figure 2

Comparison of different gene selection methods for prediction of pCR status of MAQC-II breast cancer dataset with different learning classifiers.

The X-axis shows the number of features used, and the Y-axis shows the average testing accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

Tables 11 and 12 list the average testing performance under the best training, MS_HR, and the best testing performance under the best training, HS_HR, for the predictions on erpos and pCR, respectively. Figures 3 and 4 show box-plots of the MS_HR values for the predictions on erpos and pCR. The results indicate that prediction performance depends not only on the gene selection method but also on the learning classifier. In other words, gene selection and classifier are coupled in determining prediction performance. On average, the RFA gene selection methods deliver the best performance, followed by GLGS; the NMSC classifier outperforms the others with respect to performance and consistency across the three measurements. The combination of RFA with NMSC has the best overall prediction performance.
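One possible reading of MS_HR and HS_HR is sketched below: for a single experiment, find every feature count at which the training score reaches its maximum, then take the mean (MS_HR) and the maximum (HS_HR) of the testing scores at those feature counts. The function name, the input layout, and the toy numbers are hypothetical; the precise definitions are those given in the paper's methods, which may differ in detail.

```python
def ms_hr_and_hs_hr(train_scores, test_scores):
    """train_scores[i] and test_scores[i] are the scores obtained with the
    first i + 1 selected features (higher is better).

    Returns (MS_HR, HS_HR) under the assumed reading:
      MS_HR = mean testing score over feature counts with best training score
      HS_HR = best testing score over those same feature counts
    """
    best_train = max(train_scores)
    tests_at_best = [te for tr, te in zip(train_scores, test_scores)
                     if tr == best_train]
    ms_hr = sum(tests_at_best) / len(tests_at_best)
    hs_hr = max(tests_at_best)
    return ms_hr, hs_hr

# Toy curves (made-up values): training peaks at 0.95 for 2, 3, and 5 features.
train = [0.80, 0.95, 0.95, 0.90, 0.95]
test = [0.70, 0.85, 0.90, 0.88, 0.80]
ms_hr, hs_hr = ms_hr_and_hs_hr(train, test)
print(ms_hr, hs_hr)
```

For an error-type measure such as the AUC error, where lower is better, the same sketch could be applied to negated scores. The tables then report the mean and standard deviation of these per-experiment values over the repeated experiments.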
Table 11

Mean values and standard errors of HS_HR and MS_HR on ERPOS prediction in the study of MAQC-II breast cancer dataset. In applying each classifier with each measurement, the best prediction value is in bold; the best prediction value of the results using the four learning classifiers is in bold and italic.

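Values are MEAN(HS_HR)±STD(HS_HR) and MEAN(MS_HR)±STD(MS_HR), in %.

| Measurement | Gene selection method | HS_HR NMSC | HS_HR SVM | HS_HR NBC | HS_HR UDC | MS_HR NMSC | MS_HR SVM | MS_HR NBC | MS_HR UDC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Testing Accuracy | NBC-MSC | 89.5±2.2 | 79.3±2.2 | 82.9±1.1 | 91.6±1.7 | 88.5±1.4 | 76.5±0.5 | 81.7±0.3 | 91.0±1.6 |
| Testing Accuracy | NMSC-MSC | 90.6±2.8 | 73.3±0 | 85.4±2.8 | 93.0±0.8 | 86.9±1.5 | 68.3±1.6 | 83.7±1.9 | 91.6±1.1 |
| Testing Accuracy | GLGS | 89.7±2.1 | 83.8±2.1 | 86.8±1.8 | 90.5±1.4 | 89.2±1.7 | 80.3±0.9 | 85.7±1.7 | 90.2±1.3 |
| Testing Accuracy | LOOCSFS | 86.4±0.8 | 66.7±1.9 | 86.9±1.4 | 86.3±1.6 | 86.1±0.8 | 65.1±0.9 | 86.2±1.0 | 85.8±1.4 |
| Testing Accuracy | SVMRFE | 77.3±7.2 | 84.4±0 | 77.6±6.6 | 77.5±8.9 | 75.6±6.7 | 78.4±0.1 | 76.7±6.1 | 75.8±7.6 |
| MCC values | NBC-MSC | 76.7±4.5 | 56.1±4.7 | 63.1±2.3 | 81.9±3.7 | 75.3±3.3 | 50.5±1.0 | 60.4±0.7 | 81.0±3.4 |
| MCC values | NMSC-MSC | 79.1±5.8 | 44.8±0 | 66.6±5.6 | 84.5±2.4 | 72.2±4.1 | 34.4±3.5 | 64.4±4.2 | 82.7±2.5 |
| MCC values | GLGS | 77.4±3.6 | 67.1±4.9 | 71.3±3.9 | 79.5±3.2 | 76.9±3.4 | 59.2±2.3 | 69.7±3.8 | 79.1±2.7 |
| MCC values | LOOCSFS | 70.8±1.7 | 27.0±5.6 | 70.9±3.3 | 69.7±3.3 | 70.5±1.8 | 24.4±3.0 | 70.2±2.9 | 69.6±3.3 |
| MCC values | SVMRFE | 51.9±14.5 | 66.6±0 | 53.9±11.9 | 53.8±16.6 | 51.0±13.9 | 53.8±0.3 | 53.5±11.5 | 52.3±15.0 |
| AUC errors | NBC-MSC | 8.5±0.4 | 19.5±2.2 | 22.2±0.5 | 25.7±4.4 | 8.5±0.3 | 22.2±0.7 | 22.2±0.5 | 25.7±4.4 |
| AUC errors | NMSC-MSC | 8.4±0.7 | 23.5±0.4 | 23.1±1.3 | 17.1±3.6 | 8.4±0.7 | 28.2±2.3 | 23.1±1.3 | 17.1±3.6 |
| AUC errors | GLGS | 8.0±0.4 | 10.3±1.8 | 16.7±1.0 | 15.1±1.5 | 8.0±0.4 | 13.4±1.1 | 16.7±1.0 | 15.1±1.5 |
| AUC errors | LOOCSFS | 11.1±2.2 | 32.3±2.7 | 15.5±0.5 | 17.5±2.6 | 11.1±2.2 | 32.9±1.8 | 15.5±0.5 | 17.5±2.6 |
| AUC errors | SVMRFE | 17.3±6.1 | 14.1±0 | 15.3±0.6 | 17.3±2.9 | 17.3±6.1 | 19.1±0.1 | 15.3±0.6 | 17.3±2.9 |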
Table 12

Mean values and standard errors of HS_HR and MS_HR on pCR prediction in the study of MAQC-II breast cancer dataset. In applying each classifier with each measurement, the best prediction value is in bold; the best prediction value of the results using the four learning classifiers is in bold and italic.

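Values are MEAN(HS_HR)±STD(HS_HR) and MEAN(MS_HR)±STD(MS_HR), in %.

| Measurement | Gene selection method | HS_HR NMSC | HS_HR SVM | HS_HR NBC | HS_HR UDC | MS_HR NMSC | MS_HR SVM | MS_HR NBC | MS_HR UDC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Testing Accuracy | NBC-MSC | 88.7±2.3 | 88.9±0 | 90.9±1.0 | 88.6±0.8 | 87.5±1.9 | 83.1±0.3 | 86.6±0.6 | 87.7±1.6 |
| Testing Accuracy | NMSC-MSC | 90.4±2.1 | 86.7±0.0 | 85.3±2.2 | 79.6±7.7 | 88.6±2.7 | 79.8±0.7 | 82.9±2.0 | 78.9±7.4 |
| Testing Accuracy | GLGS | 81.2±4.1 | 81.9±0.8 | 79.4±1.0 | 77.0±5.2 | 80.8±4.3 | 76.2±0.6 | 79.1±1.0 | 76.5±5.2 |
| Testing Accuracy | LOOCSFS | 77.0±1.1 | 82.1±0.5 | 72.1±1.7 | 74.1±3.1 | 76.6±1.2 | 78.1±2.5 | 71.4±1.8 | 73.2±2.7 |
| Testing Accuracy | SVMRFE | 81.9±2.9 | 88.9±0 | 79.1±2.6 | 80.7±1.9 | 80.9±2.7 | 85.3±0.2 | 77.9±2.3 | 80.3±1.8 |
| MCC values | NBC-MSC | 62.5±6.7 | 63.9±0 | 69.7±3.4 | 60.7±2.8 | 61.3±5.9 | 48.8±0.6 | 54.7±2.1 | 59.2±5.0 |
| MCC values | NMSC-MSC | 68.4±6.5 | 55.2±0 | 47.7±7.4 | 43.9±16.0 | 63.3±7.9 | 37.8±1.9 | 40.9±5.0 | 42.2±14.9 |
| MCC values | GLGS | 36.2±9.5 | 43.0±3.5 | 29.3±3.5 | 30.3±10.2 | 35.4±9.8 | 29.2±1.2 | 28.4±2.8 | 30.0±10.3 |
| MCC values | LOOCSFS | 23.4±4.9 | 38.5±3.5 | 18.7±4.7 | 21.6±11.0 | 23.0±5.1 | 26.3±5.1 | 18.2±5.0 | 21.1±10.7 |
| MCC values | SVMRFE | 40.9±7.9 | 62.5±0.0 | 32.6±6.9 | 41.1±4.7 | 39.4±7.4 | 49.1±0.7 | 30.5±5.5 | 40.7±4.9 |
| AUC errors | NBC-MSC | 17.2±2.2 | 23.2±1.1 | 40.0±5.6 | 34.8±6.3 | 17.2±2.2 | 27.5±0.3 | 40.0±5.6 | 34.8±6.3 |
| AUC errors | NMSC-MSC | 14.4±3.0 | 19.1±0 | 47.9±3.7 | 46.0±5.1 | 14.4±3.0 | 28.2±1.1 | 47.9±3.7 | 46.0±5.1 |
| AUC errors | GLGS | 22.6±1.3 | 29.3±0 | 37.4±1.8 | 36.7±1.8 | 22.6±1.3 | 33.5±0.3 | 37.4±1.8 | 36.7±1.8 |
| AUC errors | LOOCSFS | 27.5±1.3 | 22.8±1.4 | 45.8±3.8 | 41.6±3.1 | 27.5±1.3 | 27.1±2.2 | 45.8±3.8 | 41.6±3.1 |
| AUC errors | SVMRFE | 20.0±4.4 | 18.8±5.7 | 38.7±3.4 | 38.4±2.0 | 20.0±4.3 | 25.6±0.2 | 38.7±3.4 | 38.4±2.0 |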
Figure 3

Average erpos prediction performance by using MAQC-II breast cancer dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection and classifier is highlighted by a red dashed circle. If there are multiple best combinations, or the difference between these combinations is not conspicuous, multiple circles are placed.

Figure 4

Average pCR prediction performance by using MAQC-II breast cancer dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection and classifier is highlighted by a dashed circle.

Results on MAQC-II Multiple Myeloma Dataset

Figures 5 and 6 show box-plots of the average testing values for EFSMO and OSMO; the classification models are based on the best training. We did not apply GLGS because identifying the feature sets on the multiple myeloma dataset would take an excessive amount of time. The experimental results again show that gene selection is strictly coupled with the learning classifier in performance measurement. On average, the RFA methods and LOOCSFS are superior to SVMRFE.
Figure 5

Average EFSMO prediction performance on the MAQC-II multiple myeloma dataset, measured by testing accuracy (left column), MCC values (middle column), and AUC errors (right column).

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a dashed circle.

Figure 6

Average OSMO prediction performance on the MAQC-II multiple myeloma dataset, measured by testing accuracy (left column), MCC values (middle column), and AUC errors (right column).

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a dashed circle. If there are multiple best combinations, or the differences among them are not conspicuous, multiple circles are placed.

Discussion

Because microarray data combine a very large number of variables with a small sample size, they contain complicated interactions and relations among genes as well as highly redundant information. The selection of predictive models, which depend on both the selected features and the classifiers employed, is extremely important for the classification of microarray data and for further biological function analysis and validation. Machine learning and data mining techniques provide a powerful approach to studying the relationships among genes. Based on supervised learning and similarity measurements, we propose Recursive Feature Addition (RFA), which recursively employs supervised learning to obtain the highest training accuracy and adds each subsequent gene based on the similarity between the chosen features and the candidates, so as to minimize redundancy within the feature set. We believe that RFA captures more informative and differentially expressed genes than the other methods. Experimental comparisons were performed on two MAQC-II microarray datasets, breast cancer and multiple myeloma. Our studies show that the method of gene selection is strictly paired with the learning classifier, which determines the final predictive model by using training data. In other words, the best classification models under different learning classifiers are associated with different methods of gene selection. Across several popular learning classifiers, including NMSC, NBC, SVM, and UDC, the best method of gene selection on average is RFA, followed by GLGS, LOOCSFS, and SVMRFE. Among the compared learning classifiers, NMSC outperforms the others with respect to testing performance, stability, and consistency. Biological function analysis based on the MAQC-II breast cancer dataset finds that the feature selection methods we applied, including RFA, GLGS, LOOCSFS, and SVMRFE, can generate features containing a significant portion of known cancer-related genes for both the pCR and erpos endpoints (Tables 1–10).
Although the number of cancer-related genes is not strictly correlated with the prediction performance of the various feature selection methods, the notable cancer-related genes among the selected features indicate that RFA, GLGS, LOOCSFS, and SVMRFE can produce biologically meaningful features, which supports their use for phenotype prediction. Across all results of the five gene selection methods with the four learning classifiers, on average, the gene selection methods NMSC-MSC and NBC-MSC combined with the learning classifier NMSC performed best with respect to the comprehensive evaluation criteria of testing accuracy, MCC values, and AUC errors. It should be noted that NMSC-MSC and NBC-MSC are not always the best across the four learning classifiers; in other words, the best models under different learning classifiers are associated with different gene selection methods. The selection of the best model for a specific learning classifier is normally based on the training results and the evaluation criteria. Under a given evaluation criterion and learning classifier, the best training model among the five gene selection methods is selected as the best model for that classifier. To select the best model overall, the best models of the four learning classifiers are compared, and the model with the highest evaluation score is generally selected as the best among the five gene selection methods and the four learning classifiers. For instance, Figure 7 shows the average training performance on the MAQC-II breast cancer dataset over twenty runs, measured by training accuracy, MCC values, and AUC errors, for the classification of pCR endpoint status.
Regarding the comprehensive evaluation of the three criteria, the best classification models are obtained by using gene selection method NMSC-MSC for the NMSC classifier, NBC-MSC for the NBC classifier, and NBC-MSC and SVMRFE for the UDC classifier. With the SVM classifier, although SVMRFE is the first gene selection method to reach the best training performance as the feature dimension increases, almost all gene selection methods reach the best before the number of features increases to 100; in such a case, it is hard to determine the best model for the SVM classifier. When the number of features is limited, we can say that SVMRFE is the best for the SVM classifier. The comparison between the training results (Figure 7) and the testing results (Figure 2) shows that selecting the best model in this way among the various gene selection methods and learning classifiers is reasonable.

Figure 7

Comparison of different gene selection methods for the training of pCR endpoint of MAQC-II breast cancer dataset using the four classifiers.

The X-axis shows the number of features used, and the Y-axis shows the average training accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

The gene selection method associated with the best model under a learning classifier may differ with the evaluation criterion, as shown in Figure 7: with the NMSC classifier, the best model is obtained with the gene selection method NMSC-MSC when evaluated by training accuracy and MCC values, whereas the best model is associated with SVMRFE when evaluated by AUC errors. Overall, NMSC-MSC is the best for the NMSC classifier. Figure 7 also shows that, with the SVM classifier, dozens of best training models are obtained by different gene selection methods. One possible solution for selecting the best model under the SVM classifier is to divide all data points into a training set, a validation set, and a testing set. The training set is used to construct the training models; the validation set is used to select the best model, by applying the validation data to the best training models and choosing the one that produces the best validation result; and the testing set is used for prediction or testing. Selecting the best model with the SVM classifier is interesting and challenging, especially with few data points and a huge number of features. Although the topic is beyond the scope of this paper, it is worth exploring in future work.
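The three-way split suggested above can be illustrated with a minimal Python sketch. It is not part of the study: the 60/20/20 proportions, the function names `three_way_split` and `select_by_validation`, and the scores in the example are all hypothetical and chosen only to show how a validation set can break ties among models that train equally well.

```python
import numpy as np


def three_way_split(n_samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Return disjoint index arrays for training, validation, and testing.

    The 60/20/20 proportions are an assumption for illustration only.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(frac_train * n_samples))
    n_val = int(round(frac_val * n_samples))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def select_by_validation(train_scores, val_scores, tol=1e-12):
    """Among models tied for the best training score, pick the best on validation."""
    best_train = max(train_scores.values())
    tied = [name for name, s in train_scores.items() if s >= best_train - tol]
    return max(tied, key=lambda name: val_scores[name])


# Hypothetical scores: two models reach the same best training accuracy.
train_scores = {"model_a": 1.00, "model_b": 1.00, "model_c": 0.95}
val_scores = {"model_a": 0.80, "model_b": 0.85, "model_c": 0.90}

train_idx, val_idx, test_idx = three_way_split(100)
chosen = select_by_validation(train_scores, val_scores)
print(len(train_idx), len(val_idx), len(test_idx), chosen)
```

Note that `model_c` is never considered even though it has the highest validation score, because the selection is restricted to the models that tie for the best training result, as in the procedure described in the text.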

Materials and Methods

MAQC-II Breast Cancer Dataset

The breast cancer dataset from the MAQC-II project is used to predict the pre-operative treatment response (pCR) and estrogen receptor status (erpos) [28]. The normalization was provided by the MAQC-II project using a standard procedure (i.e., MAS 5.0 for the Affymetrix platform). The dataset was originally divided into two groups: a training set containing 130 samples (33 positives and 97 negatives for pCR, 80 positives and 50 negatives for erpos), and a validation set containing 100 samples (15 positives and 85 negatives for pCR, 61 positives and 39 negatives for erpos).

MAQC-II Multiple Myeloma Dataset

We use the MAQC-II multiple myeloma dataset to predict the overall survival milestone outcome (OSMO, 730-day cutoff) and the event-free survival milestone outcome (EFSMO, 730-day cutoff). For OSMO, there are 51 positives and 289 negatives in the original training set and 27 positives and 187 negatives in the original validation set; for EFSMO, there are 84 positives and 256 negatives in the original training set and 34 positives and 180 negatives in the original validation set [28]. The normalization was provided by the MAQC-II project research group.

Feature Selection

Supervised recursive learning

Our method of recursive feature addition (RFA) employs supervised learning to achieve the best training accuracy and uses statistical similarity measures to choose the next variable with the least dependence on, or correlation to, the already identified variables, as follows:

1. Insignificant genes are removed according to their statistical insignificance. Specifically, a gene with a high p-value is usually not differentially expressed and therefore contributes little to the classification of microarray data. To reduce the computational load, those insignificant genes are removed.

2. Each individual gene is evaluated by supervised learning. The gene with the highest classification accuracy is chosen as the most important feature, that is, the first element of the feature set. If multiple genes achieve the same highest classification accuracy, the one with the lowest p-value measured by test statistics (e.g., score test) is identified as the first element. At this point the chosen feature set, $G_1$, contains just one element, $g_1$, corresponding to feature dimension one.

3. The $(N+1)$-dimensional feature set, $G_{N+1} = \{g_1, g_2, \ldots, g_N, g_{N+1}\}$, is obtained by adding $g_{N+1}$ to the $N$-dimensional feature set, $G_N = \{g_1, g_2, \ldots, g_N\}$. The choice of $g_{N+1}$ proceeds as follows: add each gene $g_i$ ($g_i \notin G_N$) to $G_N$ and compute the classification accuracy of the feature set $G_N \cup \{g_i\}$. The $g_i$ ($g_i \notin G_N$) associated with the group $G_N \cup \{g_i\}$ that obtains the highest classification accuracy is a candidate for $g_{N+1}$ (not yet $g_{N+1}$). Considering the large number of variables, it is very likely that multiple features correspond to the same highest classification accuracy. These candidates are placed into the set $C$, but only one candidate in $C$ will be identified as $g_{N+1}$. How the selection is made is described next.

Candidate feature addition

To find the most informative (or least redundant) candidate for $g_{N+1}$, we measure the statistical similarity between the chosen features and each candidate. We design a similarity measurement based on the widely used Pearson's correlation coefficient [30]. Suppose $g_n$ ($g_n \in G_N$, $n = 1, 2, \ldots, N$) is a chosen gene, $g_c$ ($g_c \in C$) is a candidate gene, and $\mathrm{cor}$ stands for Pearson's correlation coefficient. The sum of the squares of the correlations, SC, is calculated as follows:

$$SC(g_c) = \sum_{n=1}^{N} \mathrm{cor}^2(g_c, g_n), \quad g_n \in G_N,\ g_c \in C.$$

The selection of $g_{N+1}$ is then based on the Minimal value of the Square of the Correlation (MSC), that is,

$$g_{N+1} = \arg\min_{g_c \in C} SC(g_c).$$

In the methods described above, a feature is recursively added to the chosen feature set based on supervised learning and the similarity measurement. In our experiments we chose the naive Bayes classifier (NBC) and the nearest mean scale classifier (NMSC) [29] for supervised learning; the NBC-based and NMSC-based RFA feature selection methods are denoted NBC-MSC and NMSC-MSC, respectively.
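The recursive addition and the MSC tie-break described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: a plain nearest-class-mean classifier stands in for NMSC, the p-value pre-filter is omitted, ties for the first gene are broken by column order rather than by lowest p-value, and the function names `nearest_mean_accuracy` and `rfa_msc` are hypothetical.

```python
import numpy as np


def nearest_mean_accuracy(X, y):
    """Training accuracy of a nearest-class-mean classifier (a stand-in for NMSC)."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every sample to every class mean.
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y).mean())


def rfa_msc(X, y, n_features, tol=1e-12):
    """Recursive feature addition with MSC tie-breaking; returns chosen column indices."""
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)  # gene-by-gene Pearson correlation
    chosen = []
    while len(chosen) < n_features:
        remaining = [g for g in range(n_genes) if g not in chosen]
        acc = np.array([nearest_mean_accuracy(X[:, chosen + [g]], y) for g in remaining])
        # Candidate set C: every gene whose addition ties for the best training accuracy.
        candidates = [g for g, a in zip(remaining, acc) if a >= acc.max() - tol]
        if chosen:
            # SC(g_c): sum of squared correlations between candidate and chosen genes.
            sc = [float((corr[g, chosen] ** 2).sum()) for g in candidates]
            best = candidates[int(np.argmin(sc))]
        else:
            # Simplification: the paper breaks first-gene ties by lowest p-value.
            best = candidates[0]
        chosen.append(best)
    return chosen


# Toy data: gene 1 duplicates gene 0; gene 2 is uncorrelated with gene 0.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])
selected = rfa_msc(X, y, n_features=2)
print(selected)
```

In the toy data, adding either gene 1 or gene 2 to gene 0 gives the same training accuracy, so both enter the candidate set; the MSC rule then prefers gene 2 because its squared correlation with gene 0 is zero, whereas gene 1 is fully redundant.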

Model Implementation and Comparison

Cross-validation is a technique for estimating how accurately a predictive model will perform in practice. Generally, the data are partitioned into complementary subsets: one subset (called the training set) is used for constructing the predictive model, and the other subset (called the validation set or testing set) is used for validation. To reduce variability, multiple rounds are performed using different partitions, and the validation results are averaged over all rounds. There are three common types of cross-validation:

1. Repeated random sub-sampling validation. This technique randomly splits the dataset into training and testing data. The results are then averaged over the splits. The advantage over K-fold cross-validation (described below) is that the proportion of the training/testing split does not depend on the number of iterations (folds).

2. K-fold cross-validation. The original sample is partitioned into K subsamples. A single subsample is retained as the validation data for testing the model, and the remaining K-1 subsamples are used as training data. The cross-validation process is repeated K times, with each of the K subsamples used exactly once as the validation data.

3. Leave-one-out cross-validation. A single observation from the original sample is used as the validation data and the remaining observations as the training data. This is the same as K-fold cross-validation with K equal to the number of observations in the original sample. Leave-one-out cross-validation is often computationally expensive.

Considering the high computational requirement of leave-one-out cross-validation and the insufficiency of a single K-fold cross-validation, we adopted repeated random sub-sampling validation. In the model implementation, we pooled all the training and validation data points. In each experiment, we randomly chose 80% of the samples for training and the remaining 20% for testing. Twenty experiments were performed (this strategy is approximately equivalent to performing 5-fold cross-validation four times). The average testing performances, evaluated in terms of testing accuracy, MCC values, and AUC errors, were compared. The learning classifiers UDC, NBC, NMSC, and SVM were employed for training and testing.
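The evaluation protocol described above (twenty random 80/20 splits, with results averaged) might be sketched as follows. This is a minimal illustration rather than the authors' code: the splits are not stratified, a nearest-class-mean classifier stands in for the classifiers used in the study, the synthetic data are generated only for the example, AUC is omitted for brevity, and the names `mcc`, `nearest_mean_fit_predict`, and `repeated_subsampling` are hypothetical. The MCC function uses the standard definition for binary labels and returns 0 when the denominator is zero.

```python
import numpy as np


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def nearest_mean_fit_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closest training mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]


def repeated_subsampling(X, y, fit_predict, n_rounds=20, train_frac=0.8, seed=0):
    """Average test accuracy and MCC over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accs, mccs = [], []
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        pred = fit_predict(X[tr], y[tr], X[te])
        accs.append(float(np.mean(pred == y[te])))
        mccs.append(mcc(y[te], pred))
    return float(np.mean(accs)), float(np.mean(mccs))


# Synthetic, well-separated two-class data (illustration only).
data_rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = data_rng.normal(scale=0.1, size=(40, 5)) + 5.0 * y[:, None]

avg_acc, avg_mcc = repeated_subsampling(X, y, nearest_mean_fit_predict)
print(avg_acc, avg_mcc)
```

Unlike K-fold cross-validation, the test sets drawn in different rounds may overlap, which is why the text describes the procedure as only approximately equivalent to four repetitions of 5-fold cross-validation.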
  21 in total

Review 1.  Assessing the accuracy of prediction algorithms for classification: an overview.

Authors:  P Baldi; S Brunak; Y Chauvin; C A Andersen; H Nielsen
Journal:  Bioinformatics       Date:  2000-05       Impact factor: 6.937

Review 2.  Computational analysis of microarray data.

Authors:  J Quackenbush
Journal:  Nat Rev Genet       Date:  2001-06       Impact factor: 53.242

3.  LS Bound based gene selection for DNA microarray data.

Authors:  Xin Zhou; K Z Mao
Journal:  Bioinformatics       Date:  2004-12-14       Impact factor: 6.937

Review 4.  From signatures to models: understanding cancer using microarrays.

Authors:  Eran Segal; Nir Friedman; Naftali Kaminski; Aviv Regev; Daphne Koller
Journal:  Nat Genet       Date:  2005-06       Impact factor: 38.330

5.  Clustering microarray gene expression data using weighted Chinese restaurant process.

Authors:  Zhaohui S Qin
Journal:  Bioinformatics       Date:  2006-06-09       Impact factor: 6.937

Review 6.  Standards for systems biology.

Authors:  Alvis Brazma; Maria Krestyaninova; Ugis Sarkans
Journal:  Nat Rev Genet       Date:  2006-08       Impact factor: 53.242

7.  The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors:  Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie 
Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal:  Nat Biotechnol       Date:  2010-07-30       Impact factor: 54.908

8.  Finding groups in gene expression data.

Authors:  David J Hand; Nicholas A Heard
Journal:  J Biomed Biotechnol       Date:  2005-06-30

9.  New feature subset selection procedures for classification of expression profiles.

Authors:  Trond Bø; Inge Jonassen
Journal:  Genome Biol       Date:  2002-03-14       Impact factor: 13.583

10.  Analysis of strain and regional variation in gene expression in mouse brain.

Authors:  P Pavlidis; W S Noble
Journal:  Genome Biol       Date:  2001-09-27       Impact factor: 13.583
