
PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R.

Jan Grau, Ivo Grosse, Jens Keilwagen

Abstract

Precision-recall (PR) and receiver operating characteristic (ROC) curves are valuable measures of classifier performance. Here, we present the R-package PRROC, which allows for computing and visualizing both PR and ROC curves. In contrast to available R-packages, PRROC allows for computing PR and ROC curves and areas under these curves for soft-labeled data using a continuous interpolation between the points of PR curves. In addition, PRROC provides a generic plot function for generating publication-quality graphics of PR and ROC curves.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25810428      PMCID: PMC4514923          DOI: 10.1093/bioinformatics/btv153

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The assessment of classifier performance is a recurring task in machine learning and data mining, and in particular in bioinformatics applications. It assists researchers in identifying the most promising approach for the classification problem at hand. For binary classification tasks, the receiver operating characteristic (ROC) curve and the area under this curve (AUC-ROC) are widely accepted as a general measure of classifier performance. In many bioinformatics applications, however, positive examples are substantially less abundant than negative examples, resulting in a highly imbalanced class ratio. For instance, the number of target genes of a microRNA is substantially smaller than the number of non-target genes. In such cases, the precision-recall (PR) curve and the area under it (AUC-PR) are better suited for comparing the performance of individual classifiers than the ROC curve and AUC-ROC (Davis and Goadrich, 2006).

Often, the decision about the true class labels of a given data point is arguable, for instance because it is based on an arbitrary threshold for some continuous measurement or on multiple, possibly contradictory, expert labelings. The choice of this threshold, however, decisively influences classifier training and assessment. One solution to this problem is the transition from hard-labeling to soft-labeling, where each data point is assigned to both classes with a certain probability that reflects confidence in the labeling (Grau et al., 2013; Mihaljević et al., 2014). Although soft-labeling has been used extensively for classifier training in the past, it has been neglected for classifier assessment (Keilwagen et al., 2014).

Computing empirical AUC-PR and AUC-ROC values from a limited set of test data points requires interpolation between discrete supporting points corresponding to a series of classification thresholds affecting the classification result. AUC-ROC can be computed by linear interpolation between the supporting points of the curve for both hard-labeled and soft-labeled data.
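For hard-labeled data, linear interpolation between the supporting points of the ROC curve amounts to the trapezoidal rule. A minimal Python sketch of this computation (an illustration of the idea, independent of the R package):

```python
def auc_roc(pos_scores, neg_scores):
    """AUC-ROC by linear (trapezoidal) interpolation between the
    supporting points induced by all classification thresholds."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    P, N = len(pos_scores), len(neg_scores)
    points = [(0.0, 0.0)]  # (FPR, TPR) at a threshold above all scores
    for t in thresholds:
        tp = sum(1 for s in pos_scores if s >= t)  # true positives at t
        fp = sum(1 for s in neg_scores if s >= t)  # false positives at t
        points.append((fp / N, tp / P))
    # trapezoidal rule over consecutive supporting points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

For a perfectly separating classifier this yields 1.0; interleaved scores such as positives (0.8, 0.4) versus negatives (0.6, 0.2) yield 0.75.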
In contrast, Davis and Goadrich (2006) show that for AUC-PR an interpolation along the true positives is more accurate than linear interpolation for hard-labeled data, while Boyd et al. (2013) and Keilwagen et al. (2014) propose a more fine-grained, continuous interpolation between the supporting points of the PR curve. Only the latter can also be used for soft-labeled data (Keilwagen et al., 2014). In Table 1, we list several common R-packages for computing PR or ROC curves or their AUCs, which in some cases provide further performance measures. For PR curves, however, none of the previous packages uses the more accurate interpolation of Davis and Goadrich (2006) or the continuous interpolation of Boyd et al. (2013) and Keilwagen et al. (2014). Hence, none of these packages is applicable to soft-labeled data.
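The interpolation of Davis and Goadrich (2006) works in (TP, FP) space: between two supporting points, FP is assumed to grow linearly with TP, and precision is re-evaluated at each intermediate TP count rather than interpolated linearly in PR space. A Python sketch of this interpolation between two adjacent supporting points (an illustration of the idea, not PRROC's implementation):

```python
def dg_interpolate(tp_a, fp_a, tp_b, fp_b, num_pos):
    """PR points between two adjacent supporting points (tp_a, fp_a) and
    (tp_b, fp_b), interpolated along the true positives as proposed by
    Davis and Goadrich (2006). Returns (recall, precision) pairs for
    every integer TP count in [tp_a, tp_b]."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)  # FP assumed linear in TP
    points = []
    for tp in range(tp_a, tp_b + 1):
        fp = fp_a + slope * (tp - tp_a)
        points.append((tp / num_pos, tp / (tp + fp)))
    return points
```

Between (TP=1, FP=0) and (TP=3, FP=2) with four positives, this yields precisions 1, 2/3 and 3/5 at recalls 0.25, 0.5 and 0.75; linear interpolation in PR space would instead overestimate precision between the endpoints.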
Table 1.

R-packages for computing PR and ROC curves, and their AUCs. "Both": AUC and curve can be computed; "linear": linear interpolation; "DG": interpolation of Davis and Goadrich (2006); "con.": continuous interpolation of Boyd et al. (2013) and Keilwagen et al. (2014).

Package         AUC     PerfMeas  pROC^a  ROCR^b  PRROC
Version         0.3.0   1.2.1     1.7.3   1.0-5   1.1
PR curve
  Hard-labeled  No      Both      No      Curve   Both
  Interpolation N/A     Linear    N/A     Linear  DG/con.
  Soft-labeled  No      No        No      No      Both
ROC curve
  Hard-labeled  Both    AUC       Both    Both    Both
  Soft-labeled  No      No        No      No      Both
Plotting        Yes     Std. R    Yes     Yes     Yes

^a Robin et al. (2011); ^b Sing et al. (2005).

In this article, we present the R-package PRROC, which closes both gaps by (i) using the continuous interpolation of Keilwagen et al. (2014) for computing and drawing PR curves and, by this means, (ii) enabling the computation of PR and ROC curves, as well as AUC-PR and AUC-ROC, for soft-labeled and hard-labeled data. In addition, PRROC optionally computes curves and AUC values for the optimal, the worst and the random classifier as a reference. These references are particularly useful for (a) PR curves and (b) ROC and PR curves in the case of soft-labeled data, where, for instance, the minimum and maximum AUC might differ from 0 and 1, respectively. Finally, PRROC provides a plotting function for generating publication-quality plots of PR and ROC curves.

2 Use cases

In this section, we present typical applications of the PRROC R-package. Complete listings of the corresponding R-code and further examples are available in the R vignette of PRROC. First, we consider the scenario in which we have developed a novel approach for a classification problem with 'hard class labels' and now want to assess its performance on an independent test dataset. Further assume that the classification scores of the data points belonging to the positive class are stored in a vector fg, and those of the negative class in bg. Using PRROC, we can compute the ROC and PR curve, respectively, by roc<-roc.curve(fg,bg,curve=T); pr<-pr.curve(fg,bg,curve=T); obtain the AUC values with print(roc); print(pr); and plot the curves with plot(roc); plot(pr). An ROC curve obtained by this procedure is shown in the left panel of Figure 1.
Fig. 1.

Plots of ROC (left) and PR (right) curves generated by PRROC. For the ROC curve, we consider hard-labeled data and show the plotting variant with a color scale that indicates classification thresholds yielding the points on the curve. For the PR curve, we consider soft-labeled data and show a comparative plot for two classifiers as solid and dashed lines. We also include the maximal and minimal possible curves and the curve of a random classifier for the given soft-labels

Alternatively, classification scores for both classes may be stored in one joint vector (x) and the corresponding class labels (1/0 for positive/negative class) in another vector (lab). In this case, we can compute the ROC and PR curves by roc<-roc.curve(x,weights.class0=lab,curve=T); pr<-pr.curve(x,weights.class0=lab,curve=T). Second, we consider a scenario for a classification problem with 'soft class labels', where each data point belongs to the positive class with probability P and to the negative class with probability (1−P). We assume that the classification scores are again stored in one joint vector (x) and the soft-labels, i.e., the probability of belonging to the positive class for each data point, in another vector (w). Using PRROC, we can compute the PR curve as well as the minimum and maximum curve, and the curve for the random classifier by pr.1<-pr.curve(x,weights.class0=w,curve=T,max.compute=T,min.compute=T,rand.compute=T). Analogously, we compute the PR curve pr.2 for another classifier and plot both curves together with the maximum and minimum curve, and the curve of the random classifier by plot(pr.1,col=2,max.plot=T,min.plot=T,rand.plot=T,fill.area=T,auc.main=F); plot(pr.2,col=3,add=T). A plot obtained by this procedure is shown in the right panel of Figure 1. We clearly see the difference in performance of the two classifiers and may conclude that the ranking implied by the classification scores behind the solid curve reconstructs the soft-labels with greater accuracy.
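Under the weighting scheme of Keilwagen et al. (2014), each data point counts toward the positive class with its weight w and toward the negative class with weight 1−w, so TP and FP become weighted sums. A Python sketch of weighted AUC-ROC along these lines (an illustration of the weighting idea, not PRROC's code; it assumes distinct scores for simplicity):

```python
def weighted_auc_roc(scores, weights):
    """AUC-ROC for soft-labeled data: data point i contributes weights[i]
    to the positive class and 1 - weights[i] to the negative class; the
    curve is linearly interpolated between consecutive thresholds."""
    P = sum(weights)               # total positive weight
    N = sum(1 - w for w in weights)  # total negative weight
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0.0
    points = [(0.0, 0.0)]  # (FPR, TPR)
    for i in order:  # lower the threshold one data point at a time
        tp += weights[i]
        fp += 1 - weights[i]
        points.append((fp / N, tp / P))
    # trapezoidal rule over consecutive supporting points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

With hard labels (weights 0/1) this reduces to the usual AUC-ROC. With soft labels, e.g., scores (1.0, 0.0) and weights (0.9, 0.1), even the perfect ranking yields an AUC of 0.9 rather than 1, illustrating why the maximum and minimum curves computed by PRROC are useful references for soft-labeled data.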

3 Discussion

We present PRROC, an R-package for computing PR and ROC curves as well as their AUCs for soft-labeled and hard-labeled data, which may be beneficial for typical bioinformatics applications. Additionally, PRROC provides a function for plotting PR and ROC curves within R. The PRROC package provides R documentation files and a vignette.

Conflict of Interest: none declared.
References (5 in total)

1.  ROCR: visualizing classifier performance in R.

Authors:  Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal:  Bioinformatics       Date:  2005-08-11       Impact factor: 6.937

2.  pROC: an open-source package for R and S+ to analyze and compare ROC curves.

Authors:  Xavier Robin; Natacha Turck; Alexandre Hainard; Natalia Tiberti; Frédérique Lisacek; Jean-Charles Sanchez; Markus Müller
Journal:  BMC Bioinformatics       Date:  2011-03-17       Impact factor: 3.307

3.  A general approach for discriminative de novo motif discovery from high-throughput data.

Authors:  Jan Grau; Stefan Posch; Ivo Grosse; Jens Keilwagen
Journal:  Nucleic Acids Res       Date:  2013-09-20       Impact factor: 16.971

4.  Multi-dimensional classification of GABAergic interneurons with Bayesian network-modeled label uncertainty.

Authors:  Bojan Mihaljević; Concha Bielza; Ruth Benavides-Piccione; Javier DeFelipe; Pedro Larrañaga
Journal:  Front Comput Neurosci       Date:  2014-11-25       Impact factor: 2.380

5.  Area under precision-recall curves for weighted and unweighted data.

Authors:  Jens Keilwagen; Ivo Grosse; Jan Grau
Journal:  PLoS One       Date:  2014-03-20       Impact factor: 3.240
