Lukas M Weber, Wouter Saelens, Robrecht Cannoodt, Charlotte Soneson, Alexander Hapfelmeier, Paul P Gardner, Anne-Laure Boulesteix, Yvan Saeys, Mark D Robinson.
Abstract
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.
Year: 2019 PMID: 31221194 PMCID: PMC6584985 DOI: 10.1186/s13059-019-1738-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Summary of guidelines
Summary of our views regarding ‘how essential’ each principle is for a truly excellent benchmark, along with examples of key tradeoffs and potential pitfalls relating to each principle
| Principle (see Fig. | How essential?ᵃ | Tradeoffs | Potential pitfalls |
|---|---|---|---|
| 1. Defining the purpose and scope | +++ | How comprehensive the benchmark should be | Scope too broad: too much work given available resources; Scope too narrow: unrepresentative and possibly misleading results |
| 2. Selection of methods | +++ | Number of methods to include | Excluding key methods |
| 3. Selection (or design) of datasets | +++ | Number and types of datasets to include | Subjectivity in the choice of datasets: e.g., selecting datasets that are unrepresentative of real-world applications; Too few datasets or simulation scenarios; Overly simplistic simulations |
| 4. Parameter and software versions | ++ | Amount of parameter tuning | Extensive parameter tuning for some methods while using default parameters for others (e.g., competing methods) |
| 5. Evaluation criteria: key quantitative performance metrics | +++ | Number and types of performance metrics | Subjectivity in the choice of metrics: e.g., selecting metrics that do not translate to real-world performance; Metrics that give over-optimistic estimates of performance; Methods may not be directly comparable according to individual metrics (e.g., if methods are designed for different tasks) |
| 6. Evaluation criteria: secondary measures | ++ | Number and types of performance metrics | Subjectivity of qualitative measures such as user-friendliness, installation procedures, and documentation quality; Subjectivity in relative weighting between multiple metrics; Measures such as runtime and scalability depend on processor speed and memory |
| 7. Interpretation, guidelines, and recommendations | ++ | Generality versus specificity of recommendations | Performance differences between top-ranked methods may be minor; Different readers may be interested in different aspects of performance |
| 8. Publication and reporting of results | + | Amount of resources to dedicate to building online resources | Online resources may not be accessible (or may no longer run) several years later |
| 9. Enabling future extensions | ++ | Amount of resources to dedicate to ensuring extensibility | Selection of methods or datasets for future extensions may be unrepresentative (e.g., due to requests from method authors) |
| 10. Reproducible research best practices | ++ | Amount of resources to dedicate to reproducibility | Some tools may not be compatible or accessible several years later |
ᵃ The higher the number of plus signs, the more central the principle is to the evaluation.
Fig. 2 Summary and examples of performance metrics. (a) Schematic overview of classes of frequently used performance metrics, including examples (boxes outlined in gray). (b) Examples of popular visualizations of quantitative performance metrics for classification methods, using reference datasets with a ground truth: ROC curves (left); TPR versus FDR curves (center), where circles represent observed TPR and FDR at typical FDR thresholds of 1, 5, and 10%, with filled circles indicating observed FDR lower than or equal to the imposed threshold; and PR curves (right). Visualizations in (b) were generated using the iCOBRA R/Bioconductor package [56]. Abbreviations: FDR, false discovery rate; FPR, false positive rate; PR, precision–recall; ROC, receiver operating characteristic; TPR, true positive rate.
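The points in the TPR versus FDR panel of Fig. 2 can be illustrated with a minimal sketch. This is not the iCOBRA implementation. The function name `tpr_fdr_at_thresholds`, the toy adjusted p-values, and the toy ground-truth labels are hypothetical and chosen only for illustration. The sketch assumes that a feature is called positive when its adjusted p-value is at or below the imposed FDR threshold, and that the observed FDR is taken as 0 when nothing is called.

```python
from typing import List, Sequence


def tpr_fdr_at_thresholds(
    adj_pvalues: Sequence[float],
    truth: Sequence[bool],
    thresholds: Sequence[float] = (0.01, 0.05, 0.10),
) -> List[dict]:
    """Observed TPR and FDR at each imposed FDR threshold (hypothetical helper)."""
    if len(adj_pvalues) != len(truth):
        raise ValueError("adj_pvalues and truth must have the same length")

    n_true = sum(1 for y in truth if y)  # number of truly positive features
    results = []
    for t in thresholds:
        # Call a feature positive when its adjusted p-value is <= threshold.
        called = [p <= t for p in adj_pvalues]
        tp = sum(1 for c, y in zip(called, truth) if c and y)
        fp = sum(1 for c, y in zip(called, truth) if c and not y)
        n_called = tp + fp

        tpr = tp / n_true if n_true else float("nan")
        # Assumed convention: observed FDR is 0 when no features are called.
        fdr = fp / n_called if n_called else 0.0

        results.append(
            {
                "threshold": t,
                "tpr": tpr,
                "fdr": fdr,
                # Corresponds to a filled circle: observed FDR <= imposed threshold.
                "controlled": fdr <= t,
            }
        )
    return results


# Toy example data (invented for illustration, not taken from any benchmark).
toy_adj_pvalues = [0.001, 0.004, 0.02, 0.03, 0.04, 0.08, 0.20, 0.50]
toy_truth = [True, True, True, False, True, False, False, False]

for row in tpr_fdr_at_thresholds(toy_adj_pvalues, toy_truth):
    print(row)
```

In this toy example, the 1% threshold gives an observed FDR of 0 (controlled), whereas the 5% threshold gives an observed FDR of 0.2, which exceeds the imposed threshold and would be drawn as an open circle.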
Fig. 3 Example of an interactive website allowing users to explore the results of one of our benchmarking studies [27]. This website was created using the Shiny framework in R.