
Higher criticism thresholding: Optimal feature selection when useful features are rare and weak.

David Donoho, Jiashun Jin

Abstract

In important application fields today (genomics and proteomics are examples), selecting a small subset of useful features is crucial for the success of Linear Classification Analysis. We study feature selection by thresholding of feature Z-scores and introduce a principle of threshold selection based on the notion of higher criticism (HC). For $i = 1, 2, \ldots, p$, let $\pi_i$ denote the two-sided P-value associated with the $i$th feature Z-score and $\pi_{(i)}$ denote the $i$th order statistic of the collection of P-values. The HC threshold is the absolute Z-score corresponding to the P-value maximizing the HC objective $(i/p - \pi_{(i)}) / \sqrt{(i/p)(1 - i/p)}$. We consider a rare/weak (RW) feature model, where the fraction of useful features is small and the useful features are each too weak to be of much use on their own. HC thresholding (HCT) has interesting behavior in this setting, with an intimate link between maximizing the HC objective and minimizing the error rate of the designed classifier, and very different behavior from popular threshold selection procedures such as false discovery rate thresholding (FDRT). In the most challenging RW settings, HCT uses an unconventionally low threshold; this keeps the missed-feature detection rate under better control than FDRT and yields a classifier with improved misclassification performance. Replacing cross-validated threshold selection in the popular Shrunken Centroid classifier with the computationally less expensive and simpler HCT reduces the variance of the selected threshold and the error rate of the constructed classifier. Results on standard real datasets and in asymptotic theory confirm the advantages of HCT.
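The threshold rule described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `two_sided_p_value` and `hc_threshold` are hypothetical, the Z-scores are assumed to be standard normal under the null, and the `alpha0` parameter (which limits the search to the smallest fraction of P-values so the denominator never vanishes at i = p) is an assumption of this sketch that the abstract does not state.

```python
import math


def two_sided_p_value(z):
    """Two-sided P-value of a Z-score, assuming a standard normal null."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def hc_threshold(z_scores, alpha0=0.10):
    """Return (threshold, selected_indices) chosen by the HC objective.

    alpha0 is an assumption of this sketch: only the first
    max(1, int(alpha0 * p)) order statistics are searched.
    """
    p = len(z_scores)
    if p < 2:
        raise ValueError("need at least two features")

    # Sorting by |Z| descending is the same as sorting P-values ascending.
    order = sorted(range(p), key=lambda j: abs(z_scores[j]), reverse=True)

    # Keep i <= p - 1 so that sqrt((i/p) * (1 - i/p)) stays positive.
    i_max = max(1, min(p - 1, int(alpha0 * p)))

    best_hc, best_i = -math.inf, 1
    for i in range(1, i_max + 1):
        frac = i / p
        pi_i = two_sided_p_value(z_scores[order[i - 1]])  # i-th smallest P-value
        hc = (frac - pi_i) / math.sqrt(frac * (1.0 - frac))
        if hc > best_hc:
            best_hc, best_i = hc, i

    # The threshold is the absolute Z-score at the maximizing P-value.
    threshold = abs(z_scores[order[best_i - 1]])
    selected = [j for j in range(p) if abs(z_scores[j]) >= threshold]
    return threshold, selected
```

For example, calling `hc_threshold(z, alpha0=0.5)` on ten Z-scores of which two are large in absolute value keeps only those two features. Whether a restricted search range is appropriate, and which value to use, depends on the application.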


Year:  2008        PMID: 18815365      PMCID: PMC2553037          DOI: 10.1073/pnas.0807471105

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


