
The self-assessment trap: can we all be better than average?

Raquel Norel, John Jeremy Rice, Gustavo Stolovitzky.


Year: 2011 PMID: 21988833 PMCID: PMC3261704 DOI: 10.1038/msb.2011.70

Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429


Computational systems biology seems to be caught in what we call the ‘self-assessment trap’, in which researchers wishing to publish their analytical methods are required by referees or by editorial policy (e.g., Bioinformatics, BMC Bioinformatics, Nucleic Acids Research) to compare the performance of their own algorithms against other methodologies, thus being forced to be judge, jury and executioner. The result is that the authors' method tends to be the best in an unreasonable majority of cases (Table 1). In many instances, this bias is the result of selective reporting of performance in the niche in which the method is superior. Evidence for this is that most papers reporting best performance choose only one or two metrics of performance, but when the number of performance metrics is larger than two, most methods fail to be the best in all categories assessed (Table 1). Choosing many metrics can dramatically change the determination of best performance (Supplementary Table S1). Selective reporting can be inadvertent, but in some cases biases are more disingenuous, involving hiding information or quietly cutting corners in the performance evaluation (similar problems have been discussed in assessments of the performance of supercomputers, e.g., Bailey (1991)).

Table 1

Breakdown of the 57 surveyed papers in which the authors assess their own methods

Number of performance metrics	Total number of studies surveyed	Authors' method is the best in all metrics and all data sets	Authors' method is the best in most metrics and most data sets
1	25	19	6
2	15	13	2
3	7	4	3
4	4	1	3
5	4	1	3
6	2	1	1

Note that we did not find any self-assessment paper where the presented method was not top ranked in at least one metric or data set. The survey was conducted over a large pool of scientific peer-reviewed papers selected as follows. First, a Google Scholar search using the keywords ‘computational biology method assessment’ was conducted. When papers with comparisons of methods were identified, we further examined (1) papers from the same journal issue and (2) downstream papers that cite the identified paper (as determined by Google Scholar). The 57 papers (see Supplementary information) resulting from the search span 22 journals. Most papers are in the categories of gene regulatory networks/reverse engineering (24/69), structure prediction/assessment (14/69) and DNA–protein interactions/regulatory element identification. An additional nine papers found in the same manner but not shown in the Table reported independently (not-self) assessed methods, of which only four were top performers, whereas five reported methods that ranked high but were not top performers.
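
The pattern described in the text can be checked directly against the counts in Table 1. The following Python sketch tallies how often the authors' method was best in all metrics and all data sets, split by the number of metrics reported. The row values are copied from the table; the function name and the grouping into "one or two" versus "more than two" metrics are illustrative choices, not part of the original analysis.

```python
# Rows of Table 1: (number of metrics, studies surveyed, best in all, best in most)
TABLE_1 = [
    (1, 25, 19, 6),
    (2, 15, 13, 2),
    (3, 7, 4, 3),
    (4, 4, 1, 3),
    (5, 4, 1, 3),
    (6, 2, 1, 1),
]


def fraction_best_in_all(rows, min_metrics, max_metrics):
    """Count studies in a metric-count range and how many were best in everything."""
    selected = [r for r in rows if min_metrics <= r[0] <= max_metrics]
    total = sum(r[1] for r in selected)
    best_all = sum(r[2] for r in selected)
    return best_all, total, best_all / total


few = fraction_best_in_all(TABLE_1, 1, 2)   # papers using one or two metrics
many = fraction_best_in_all(TABLE_1, 3, 6)  # papers using more than two metrics

print("1-2 metrics: %d/%d = %.0f%% best in all" % (few[0], few[1], 100 * few[2]))
print("3-6 metrics: %d/%d = %.0f%% best in all" % (many[0], many[1], 100 * many[2]))
# prints:
# 1-2 metrics: 32/40 = 80% best in all
# 3-6 metrics: 7/17 = 41% best in all
```

On these counts, 40 of the 57 papers report only one or two metrics, and the share of methods that win everywhere drops from 80% to about 41% once more than two metrics are used.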

Even assuming that there is no selective reporting, we would like to argue that papers reporting good-yet-not-the-best methods (of which we found none in our literature survey of self-assessed papers listed in the Supplementary information) can still advance science. For example, a method that is not top ranked can still have value by unearthing biological results that are complementary to the results reported by other better performing methods. Furthermore, the effectiveness of a top-performing algorithm can be boosted when its results are aggregated with second and third best performers (Figure 1, and Supplementary Figures S1 and S2; Marbach et al, 2010; Prill et al, 2010). The discussion above suggests that self-evaluation is suspect and that insistence on publication of only best performing methods can suppress the reporting of good-yet-not-best performing methods that also have scientific value.

Figure 1

The performance metrics Area Under the Precision–Recall curve (AUPR, left panel) and Area Under the ROC curve (AUROC, right panel) for individual teams participating in a DREAM2 challenge. The challenge consisted of predicting transcriptional targets of the transcription factor BCL6. Even as the performance of the individual teams decreases (black line and circles), the integrated prediction of the best performer and runner-up teams (red line and diamonds) outperformed the best individual team.
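
To illustrate how aggregating predictions can outperform each individual predictor, here is a minimal Python sketch on toy data. It assumes rank averaging as the aggregation scheme, which is one simple choice and not necessarily the scheme used for Figure 1. The two score lists, the labels, and all function names are hypothetical. Only AUROC is computed, via the pairwise (Mann–Whitney) formulation.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ranks(scores):
    """Rank 1 goes to the highest score (toy data has no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def rank_average(*score_lists):
    """Aggregate predictors by mean rank; negated so that higher means better."""
    all_ranks = [ranks(s) for s in score_lists]
    n = len(score_lists)
    return [-sum(r[i] for r in all_ranks) / n for i in range(len(score_lists[0]))]


# Hypothetical gold standard: first four candidates are true targets.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Two hypothetical teams that make different mistakes.
team_a = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05]
team_b = [0.3, 0.9, 0.6, 0.8, 0.2, 0.7, 0.1, 0.05]

combined = rank_average(team_a, team_b)
print(auroc(labels, team_a))    # 0.875
print(auroc(labels, team_b))    # 0.875
print(auroc(labels, combined))  # 0.96875
```

In this toy case the two teams misrank different candidates, so averaging their ranks cancels part of each team's error. The gain depends on the predictors being complementary; aggregation of highly correlated predictors would help much less.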

In biosciences, as well as in other natural sciences, we are often faced with situations that have been referred to as uncomfortable science, a term attributed to statistician John Tukey, in which the little available data are used both in the inference model and the confirmatory data analysis. The resulting overoptimistic ‘confirmatory’ results are often referred to as ‘systematic bias’. Similarly, ‘information leak’ from data to methods can occur from improper and repeated cross-validation. In the general case, information leak results from developing or training an algorithm based on the entire available data set so that the test set is not independent. In some cases, the leak can occur subtly and inadvertently, such as when a very similar sample is present in both the training and test sets. A better-known effect is ‘overfitting’, in which a model is developed with superior accuracy on its training data at the cost of reduced generalization of the model to new data sets. A notable example of this effect can be found in the search for biomarker signatures in cancer. For about a decade, scientists have scoured high-throughput data to find collections of genes or proteins that can be used in diagnosis or prognosis of cancer. However, the tools used to find signatures in massive data sets can yield spurious associations with phenotype (Ioannidis, 2005), even when the results appear to be statistically sound in self-assessment. In most cases, unfortunately, these signatures do not generalize; when put to the task of demonstrating their diagnostic or prognostic value, the accuracy of the predictions is much poorer in impartial assessments on previously unseen patients than on the original data. This problem with cancer signatures is of sufficient general interest to have been highlighted recently in the popular media (Kolata, 2011).
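
The information leak described above can be reproduced on purely random data. The Python sketch below is an illustration under stated assumptions, not an analysis from the paper: the data are synthetic Gaussian noise with labels unrelated to the features, the "signature" is the k features with the largest difference in class means, and the classifier is a nearest-centroid rule evaluated by leave-one-out cross-validation. All names and parameter values are hypothetical. The only difference between the two estimates is whether the signature is selected on the entire data set (leaky) or re-selected inside each fold without the held-out sample.

```python
import random


def select_features(X, y, rows, k):
    """Indices of the k features whose class means differ most on the given rows."""
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]

    def gap(j):
        mean_pos = sum(X[r][j] for r in pos) / len(pos)
        mean_neg = sum(X[r][j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def predict(X, y, train_rows, features, test_row):
    """Nearest-centroid prediction for one held-out row."""
    best_label, best_dist = None, None
    for label in (0, 1):
        members = [r for r in train_rows if y[r] == label]
        dist = 0.0
        for j in features:
            centroid = sum(X[r][j] for r in members) / len(members)
            dist += (X[test_row][j] - centroid) ** 2
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k, leaky):
    """Leave-one-out accuracy; leaky=True selects features using all samples."""
    rows = list(range(len(X)))
    leaked = select_features(X, y, rows, k) if leaky else None
    correct = 0
    for test_row in rows:
        train_rows = [r for r in rows if r != test_row]
        features = leaked if leaky else select_features(X, y, train_rows, k)
        correct += predict(X, y, train_rows, features, test_row) == y[test_row]
    return correct / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features, k = 30, 1000, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels carry no signal by construction

leaky = loo_accuracy(X, y, k, leaky=True)
honest = loo_accuracy(X, y, k, leaky=False)
print("signature selected on all data (leaky): %.2f" % leaky)
print("signature selected inside each fold:    %.2f" % honest)
```

Because the labels carry no signal, any accuracy well above chance in the leaky estimate comes entirely from the held-out sample having influenced the feature selection. With many more features than samples, the leaky estimate is typically far above the honest one, which stays near chance level.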
In order to alleviate the overestimation of accuracy from the many bias sources described above, we propose a few guidelines:
(1) use third-party validation to test a model with previously unseen data;
(2) use more than one metric to evaluate the methods;
(3) report well-performing methods even if they are not the best performers on a particular data set;
(4) increase the awareness of editors and reviewers that superior performance in self-assessment is a biased demonstration of the method's value; instead, impartial assessment should be the preferred evaluation;
(5) establish a scientific culture that values timely, well-conducted follow-up studies that confirm or refute previous results.

To a large extent, the remedies suggested above have been addressed in the context of genome-wide association studies (Chanock et al, 2007), and are embodied in existing independent assessments presented to the scientific community in efforts such as CASP (http://predictioncenter.org/), CAPRI (http://www.ebi.ac.uk/msd-srv/capri/) and DREAM (http://www.the-dream-project.org). In contrast to the usual practice of testing methods by ‘post-diction’ (retrospective prediction) of known results, participants in these third-party collaborative competitions (alternatively known as challenges) submit predictions that are evaluated by impartial scorers against an independent data set that is hidden from the participants. The level of performance in these evaluations better tests the generalization ability of the methods, because the predictions are made based on unseen data, thus minimizing many of the above-discussed biases. We envision that a repository of blind challenges and data sets could be created (DREAM, for example, has 20 such data sets and challenges) with data produced on demand by third parties, specifically funded to create verification data and challenges. This repository could be used to test the validity of many of the tasks that we deal with in Systems Biology, Bioinformatics and Computational Biology.

In summary, systematic bias, information leak and overfitting can all be considered facets of the same self-assessment trap. That is, by knowing too much about the desired results, the researcher gets snared into a trap of consciously or unconsciously overestimating performance. Moreover, the researcher is further lured into the trap by the common assumption that top performance is required for scientific value and publication. By exposing the self-assessment trap, we hope to lessen its effect with the ultimate goal of advancing predictive biology and improving human healthcare.

The self-assessment trap – Supplementary Materials

This supplement provides further support for the claims made in the main text.

4 in total

1. Revealing strengths and weaknesses of methods for gene network inference.

Authors: Daniel Marbach; Robert J Prill; Thomas Schaffter; Claudio Mattiussi; Dario Floreano; Gustavo Stolovitzky
Journal: Proc Natl Acad Sci U S A Date: 2010-03-22 Impact factor: 11.205

2. Microarrays and molecular research: noise discovery?

Authors: John P A Ioannidis
Journal: Lancet Date: 2005 Feb 5-11 Impact factor: 79.321

3. Replicating genotype-phenotype associations.

Authors: Stephen J Chanock; Teri Manolio; Michael Boehnke; Eric Boerwinkle; David J Hunter; Gilles Thomas; Joel N Hirschhorn; Goncalo Abecasis; David Altshuler; Joan E Bailey-Wilson; Lisa D Brooks; Lon R Cardon; Mark Daly; Peter Donnelly; Joseph F Fraumeni; Nelson B Freimer; Daniela S Gerhard; Chris Gunter; Alan E Guttmacher; Mark S Guyer; Emily L Harris; Josephine Hoh; Robert Hoover; C Augustine Kong; Kathleen R Merikangas; Cynthia C Morton; Lyle J Palmer; Elizabeth G Phimister; John P Rice; Jerry Roberts; Charles Rotimi; Margaret A Tucker; Kyle J Vogan; Sholom Wacholder; Ellen M Wijsman; Deborah M Winn; Francis S Collins
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

4. Towards a rigorous assessment of systems biology models: the DREAM3 challenges.

Authors: Robert J Prill; Daniel Marbach; Julio Saez-Rodriguez; Peter K Sorger; Leonidas G Alexopoulos; Xiaowei Xue; Neil D Clarke; Gregoire Altan-Bonnet; Gustavo Stolovitzky
Journal: PLoS One Date: 2010-02-23 Impact factor: 3.240

