Xiaoxing Liu (1), Arun Krishnan, Adrian Mondry. (1) Bioinformatics Institute, 30 Biopolis Street, #07-01, Singapore 138671. xiaoxing@bii.a-star.edu.sg
Abstract
BACKGROUND: Accurate diagnosis of cancer subtypes remains a challenging problem. Building classifiers from gene expression data is a promising approach, yet selecting genes that are relevant but non-redundant is difficult. The selected gene set should be small enough to allow diagnosis even in regular clinical laboratories and should ideally identify genes involved in cancer-specific regulatory pathways. Here, an entropy-based method is proposed that selects genes related to the different cancer classes while reducing the redundancy among them.

RESULTS: The present study identifies a subset of features by maximizing the relevance and minimizing the redundancy of the selected genes. A measure called normalized mutual information is used to quantify both the relevance and the redundancy of the genes. To find a more representative subset of features, an iterative procedure is adopted: an initial clustering is followed by data partitioning, and the algorithm is applied to each of the partitions. A leave-one-out approach then picks the genes selected most often across all runs, and the gene selection algorithm is applied again to pare down the list until a minimal subset is obtained that gives satisfactory classification accuracy. The algorithm was applied to three different data sets, and the results were compared with those reported by others for the same data sets.

CONCLUSION: This study presents an entropy-based iterative algorithm for selecting genes from microarray data that can classify various cancer subtypes with high accuracy. In addition, the selected feature set is very compact; that is, the redundancy between genes is reduced to a large extent. This implies that classifiers can be built with a smaller subset of genes.
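The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes expression values have already been discretized, normalizes mutual information by the smaller of the two entropies (one of several conventions; the abstract does not say which is used), scores each candidate gene as relevance minus mean redundancy, and omits the clustering and partitioning step. The function names `entropy`, `nmi`, `select_genes`, and `consensus_genes`, and the `min_fraction` parameter, are hypothetical.

```python
from collections import Counter
from math import log


def entropy(xs):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())


def nmi(xs, ys):
    """Mutual information divided by min(H(X), H(Y)).

    The choice of normalizer is an assumption made for this sketch.
    """
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant variable carries no information
    hxy = entropy(list(zip(xs, ys)))
    return (hx + hy - hxy) / min(hx, hy)


def select_genes(genes, labels, k):
    """Greedily pick up to k genes, each time maximizing
    relevance to the labels minus mean redundancy with the genes already picked."""
    selected = []
    remaining = set(genes)

    def score(g):
        relevance = nmi(genes[g], labels)
        if not selected:
            return relevance
        redundancy = sum(nmi(genes[g], genes[s]) for s in selected) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def consensus_genes(genes, labels, k, min_fraction=0.5):
    """Leave one sample out at a time, rerun the selection, and keep the genes
    chosen in at least min_fraction of the runs (threshold is an assumption)."""
    n = len(labels)
    counts = Counter()
    for i in range(n):
        sub = {g: v[:i] + v[i + 1:] for g, v in genes.items()}
        counts.update(select_genes(sub, labels[:i] + labels[i + 1:], k))
    return [g for g, c in counts.most_common() if c / n >= min_fraction]


# Toy data: g2 duplicates g1, g3 is informative in a different way.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_genes = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 0, 0, 0, 1, 1, 1],
}
print(select_genes(toy_genes, toy_labels, 2))
print(consensus_genes(toy_genes, toy_labels, 2))
```

In the toy data, `g2` is an exact copy of `g1`, so its redundancy with `g1` is maximal and the greedy step prefers `g3` as the second gene. This is the sense in which a compact, low-redundancy gene set can still carry the class information.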