| Literature DB >> 30918265 |
Serghei Mangul, Lana S Martin, Brian L Hill, Angela Ka-Mei Lam, Margaret G Distler, Alex Zelikovsky, Eleazar Eskin, Jonathan Flint.
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results.
Year: 2019 PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Study design for benchmarking omics computational tools. To evaluate the accuracy of benchmarked computational tools, results obtained by running the tools are compared against gold standard data (ground truth). First, biological samples are probed by regular measurement protocols (processes that generate omics data) (a). Raw omics data generated by these protocols serve as the input for the examined computational tools (b, c). Results obtained by running the computational tools are the final output of the omics pipeline (d). Gold standard data are produced by the benchmarking procedure and are based on a technological protocol, expert manual evaluation, a synthetic mock community, curated databases, or computational simulation (e). (Types of technologies available for use in the preparation of gold standard data are described in the section Preparation of Gold Standard Data.) Some of the techniques used to generate gold standard data produce raw data that must first be analyzed (f); other techniques produce the gold standard data directly (g). Comparing the gold standard data against the results obtained from the raw omics data generated by regular measurement protocols enables researchers to compute statistical metrics, as well as performance metrics capturing the computational cost and speed of the benchmarked tools (h), allowing explicit, standardized comparison of existing computational algorithms. Methods with the best performance lie on the Pareto frontier and are identified as Pareto-efficient methods (i). A method is considered Pareto efficient if no other benchmarked method improves the score of one evaluation metric without degrading the score of another. (Evaluation methods and criteria for selecting the methods with the best performance are described in the section Selecting a Method with the Best Performance.)
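To make the Pareto-efficiency criterion in the caption concrete, the following minimal sketch (not taken from the study; the tool names and scores are invented) shows one way Pareto-efficient methods could be identified when every metric is oriented so that higher values are better:

```python
# Hypothetical illustration of Pareto-efficient method selection.
# Each benchmarked tool is scored on several evaluation metrics,
# oriented here so that higher is always better (e.g., use 1/runtime
# for speed). Tool names and scores below are invented examples.

def dominates(a, b):
    """True if score vector `a` is at least as good as `b` on every
    metric and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores):
    """Return the names of methods not dominated by any other method."""
    return [
        name for name, vec in scores.items()
        if not any(dominates(other, vec)
                   for other_name, other in scores.items() if other_name != name)
    ]

# Invented example: (accuracy, speed) pairs, both higher-is-better.
scores = {
    "tool_A": (0.95, 0.20),
    "tool_B": (0.90, 0.80),
    "tool_C": (0.85, 0.60),   # dominated by tool_B
    "tool_D": (0.97, 0.10),
}

print(pareto_frontier(scores))  # ['tool_A', 'tool_B', 'tool_D']
```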
Advantages and limitations of various techniques used to prepare gold standard data
| Technique | Advantages | Limitations |
|---|---|---|
| Trusted technology | High accuracy | Carries high cost |
| Alternative technology | Direct; usually no computational inference is required | Not necessarily more accurate |
| Multiple ordinary technologies | Using a consensus between technologies reduces the number of false positives compared with any individual technology | Disagreement between the technologies leaves the gold standard incomplete |
| Mock community | Ground truth is fully known, because the raw data are generated from the prepared gold standard | Contains a small number of items (e.g., microbial species) compared with real samples |
| Expert manual evaluation | Leverages specialist understanding | Does not scale |
| Curated database | Allows sensitivity to be estimated by comparing the elements detected in the sample against the database entries (see the sketch after this table) | Incompleteness of curated databases limits the ability to define true positives and false negatives |
| Curated software input | Ground truth is fully known, because the raw data are generated from the prepared gold standard | Does not validate tools on real inputs, which usually contain errors |
| Computational simulation | Ground truth is fully known, because the raw data are generated from the prepared gold standard | Simulated data cannot capture true experimental variability and are always less complex than real data |
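As a concrete illustration of the curated-database row above, here is a minimal sketch (not from the study; the identifiers are invented) of how sensitivity could be estimated by comparing a tool's detections against curated entries, and why database incompleteness limits the definition of true and false positives:

```python
# Hypothetical sketch: estimating sensitivity against a curated database.
# Identifiers are invented; we assume every curated entry is truly
# present in the sample.
database = {"geneA", "geneB", "geneC", "geneD"}   # curated reference entries
detected = {"geneA", "geneC", "geneX"}            # elements reported by a tool

true_positives = detected & database              # detected and curated
false_negatives = database - detected             # curated but missed
sensitivity = len(true_positives) / len(database)

print(f"sensitivity = {sensitivity:.2f}")         # 0.50
# "geneX" is absent from the database, so it cannot be classified as a
# true or false positive -- the incompleteness limitation noted above.
```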
Summary of benchmarking study design and methods
| Benchmarking study | Application | No. of tools | Model of study | Raw input data type | Gold standard data preparation method | Parameter optimization |
|---|---|---|---|---|---|---|
| Yang et al. 2013 | Error correction | 7 | I | R | SIMUL | N |
| Aghaeepour et al. 2013 | Flow cytometry analysis | 14 | C | R | EXPERT | N |
| Bradnam et al. 2013 | Genome assembly | 21 | C | R | ALTECH | n/a |
| Hunt et al. 2014 | Genome assembly | 10 | I | R, S | SOFTWARE | N |
| Lindgreen et al. 2016 | Microbiome analysis | 14 | I | S | SIMUL | N |
| McIntyre et al. 2017 | Microbiome analysis | 11 | I | R, S | MOCK | N |
| Sczyrba et al. 2017 | Microbiome analysis | 25 | C | S | SIMUL | n/a |
| Altenhoff et al. 2016 | Ortholog prediction | 15 | I | DB | DB | Y |
| Jiang et al. 2016 | Protein function prediction | 121 | C | R | DB | n/a |
| Radivojac et al. 2013 | Protein function prediction | 54 | C | R | DB | n/a |
| Baruzzo et al. 2017 | Read alignment | 14 | I | S | SIMUL | Y |
| Earl et al. 2014 | Read alignment | 12 | C | R, S | SIMUL | n/a |
| Hatem et al. 2013 | Read alignment | 9 | I | R, S | SIMUL | Y |
| Hayer et al. 2015 | RNA-Seq analysis | 7 | I | R, S | ALTECH | N |
| Kanitz et al. 2015 | RNA-Seq analysis | 11 | I | R, S | ALTECH | N |
| Łabaj et al. 2016 | RNA-Seq analysis | 7 | I | R | ALTECH | N |
| Łabaj et al. 2016 | RNA-Seq analysis | 4 | I | R | DB | N |
| Li et al. 2014 | RNA-Seq analysis | 5 | I | R | ALTECH | Y |
| Steijger et al. 2013 | RNA-Seq analysis | 14 | C, I | R | ALTECH | n/a |
| Su et al. 2014 | RNA-Seq analysis | 6 | I | R | ALTECH | Y |
| Zhang et al. 2014 | RNA-Seq analysis | 3 | I | R | ALTECH | Y |
| Thompson et al. 2011 | Sequence alignment | 8 | I | DB | DB | N |
| Bohnert et al. 2017 | Variant analysis | 19 | I | R, S | I&A | Y |
| Ewing et al. 2015 | Variant analysis | 14 | C | S | SIMUL | n/a |
| Pabinger et al. 2014 | Variant analysis | 32 | I | R, S | SIMUL | N |
Surveyed benchmarking studies published from 2011 to 2017 are grouped according to their area of application (column “Application”). We also recorded the number of tools benchmarked by each study (“No. of tools”). We documented the coordinating model used to conduct each benchmarking study (“Model of study”): studies performed independently by a single group (“I”), competition-based studies (“C”), and hybrid studies combining elements of both (“C, I”). Types of raw omics data (“Raw input data type”) and gold standard data (“Gold standard data preparation method”) were documented across the benchmarking studies. When a benchmarking study used computationally simulated data, we marked the study as “S”; when real raw data were experimentally generated in the wet lab, we marked the study as “R”; when the study used both simulated and real data, we marked it as “R, S”. Gold standard data types include data that were computationally simulated (marked as “SIMUL”), manually evaluated by experts (“EXPERT”), prepared by an alternative technology (“ALTECH”), prepared as curated software input (“SOFTWARE”), prepared as a mock community (“MOCK”), prepared from curated databases (“DB”), and prepared using an integration and arbitration approach (“I&A”). In competition-based benchmarking studies, parameter optimization (“Parameter optimization”) is performed by each team and is not mandatory (marked here as “n/a”). More details about the characteristics of techniques used to prepare gold standard data sets are provided in Table 1.
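To illustrate the “SIMUL” approach described above, the following minimal sketch (not from any surveyed study; the reference sequence, read length, and error rate are invented) generates raw reads from a known reference while recording their true origins, so the gold standard is known by construction:

```python
# Hypothetical sketch of the "SIMUL" approach: raw reads are generated
# from a known reference, so the true origin of every read (the gold
# standard) is known by construction. All parameters are invented.
import random

random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(1000))
read_len, n_reads, error_rate = 100, 50, 0.01

reads, truth = [], {}
for i in range(n_reads):
    start = random.randrange(len(reference) - read_len)
    read = list(reference[start:start + read_len])
    for j in range(read_len):                     # inject sequencing errors
        if random.random() < error_rate:
            read[j] = random.choice("ACGT".replace(read[j], ""))
    reads.append("".join(read))
    truth[f"read_{i}"] = start                    # gold standard position

# An aligner's reported positions can now be compared against `truth`
# to count correctly and incorrectly placed reads.
```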
Summary of information types provided by benchmarking studies
| Benchmarking study | Application | Summary provided | Computational costs reported | Supporting documentation | Data provided |
|---|---|---|---|---|---|
| Yang et al. 2013 | Error correction | Y | ExTIME, RAM | N | P |
| Aghaeepour et al. 2013 | Flow cytometry analysis | Y | None | Y | Y |
| Bradnam et al. 2013 | Genome assembly | Y | None | Y | Y |
| Hunt et al. 2014 | Genome assembly | Y | CPU, RAM | Y | P |
| Lindgreen et al. 2016 | Microbiome analysis | Y | ExTIME | Y | N |
| McIntyre et al. 2017 | Microbiome analysis | Y | ExTIME, RAM | Y | P |
| Sczyrba et al. 2017 | Microbiome analysis | Y | None | Y | Y |
| Altenhoff et al. 2016 | Ortholog prediction | Y | None | N | P |
| Jiang et al. 2016 | Protein function prediction | N | None | N | P |
| Radivojac et al. 2013 | Protein function prediction | Y | None | N | P |
| Baruzzo et al. 2017 | Read alignment | Y | ExTIME, CPU, RAM | Y | P |
| Earl et al. 2014 | Read alignment | N | None | Y | Y |
| Hatem et al. 2013 | Read alignment | Y | ExTIME, CPU, RAM | Y | Y |
| Hayer et al. 2015 | RNA-Seq analysis | N | None | N | P |
| Kanitz et al. 2015 | RNA-Seq analysis | Y | ExTIME, CPU, RAM | Y | Y |
| Łabaj et al. 2016 | RNA-Seq analysis | Y | None | P | Y |
| Łabaj et al. 2016 | RNA-Seq analysis | Y | None | P | Y |
| Li et al. 2014 | RNA-Seq analysis | Y | None | P | Y |
| Steijger et al. 2013 | RNA-Seq analysis | Y | None | P | P |
| Su et al. 2014 | RNA-Seq analysis | N | None | Y | Y |
| Zhang et al. 2014 | RNA-Seq analysis | Y | None | Y | P |
| Thompson et al. 2011 | Sequence alignment | N | None | N | P |
| Bohnert et al. 2017 | Variant analysis | Y | None | Y | P |
| Ewing et al. 2015 | Variant analysis | N | None | N | P |
| Pabinger et al. 2014 | Variant analysis | Y | None | N | N |
Surveyed benchmarking studies published from 2011 to 2017 are grouped according to their area of application (column “Application”). We documented whether each benchmarking study summarized the benchmarked algorithms’ features (“Summary provided”). We recorded whether commands to install and run the benchmarked tools were shared (“Supporting documentation”). We documented whether the benchmarking data were shared publicly (“Data provided”): we consider the benchmarking data fully shared (“Y”) if the gold standard data, raw omics data, and raw output of each benchmarked tool are shared; when any one or more of these data sets is not shared publicly, we recorded the study as partially shared (“P”). We recorded the computational resources required to run the benchmarked tools (“Computational costs reported”) using three measures: execution time (“ExTIME”), CPU time (“CPU”), and the maximum amount of RAM required to run the tool (“RAM”); when a study reported none of these measures, it was marked as “None”.
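As an illustration of the computational cost measures listed above, here is a minimal sketch (not from the study; the command is a placeholder) of how execution time, CPU time, and peak RAM could be recorded for a single tool run on a Unix-like system:

```python
# Hypothetical sketch: recording execution time (ExTIME), CPU time, and
# peak RAM for a benchmarked tool on a Unix-like system. The command
# below is a placeholder, not one of the surveyed tools.
import resource
import subprocess
import time

cmd = ["sleep", "1"]                               # placeholder command

wall_start = time.perf_counter()
subprocess.run(cmd, check=True)
wall_time = time.perf_counter() - wall_start       # ExTIME (seconds)

usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu_time = usage.ru_utime + usage.ru_stime         # CPU time (seconds)
peak_ram = usage.ru_maxrss                         # KiB on Linux, bytes on macOS

print(f"ExTIME: {wall_time:.2f} s, CPU: {cpu_time:.2f} s, RAM: {peak_ram}")
```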