
PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R.

Jan Grau, Ivo Grosse, Jens Keilwagen

Abstract

Precision-recall (PR) and receiver operating characteristic (ROC) curves are valuable measures of classifier performance. Here, we present the R-package PRROC, which allows for computing and visualizing both PR and ROC curves. In contrast to available R-packages, PRROC allows for computing PR and ROC curves and areas under these curves for soft-labeled data using a continuous interpolation between the points of PR curves. In addition, PRROC provides a generic plot function for generating publication-quality graphics of PR and ROC curves.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25810428      PMCID: PMC4514923          DOI: 10.1093/bioinformatics/btv153

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The assessment of classifier performance is a recurring task in machine learning and data mining, and in particular in bioinformatics applications. It assists researchers in identifying the most promising approach for the classification problem at hand. For binary classification tasks, the receiver operating characteristic (ROC) curve and the area under this curve (AUC-ROC) are widely accepted as a general measure of classifier performance. In many bioinformatics applications, however, positive examples are substantially less abundant than negative examples, resulting in a highly imbalanced class ratio. For instance, the number of target genes of a microRNA is substantially smaller than the number of non-target genes. In such cases, the precision-recall (PR) curve and the area under it (AUC-PR) are better suited for comparing the performance of individual classifiers than the ROC curve and AUC-ROC (Davis and Goadrich, 2006).

Often, the decision about the true class labels of a given data point is arguable, for instance because it is based on an arbitrary threshold for some continuous measurement or on multiple, possibly contradictory, expert labelings. The choice of this threshold, however, decisively influences classifier training and assessment. One solution to this problem is the transition from hard-labeling to soft-labeling, where each data point is assigned to both classes with a certain probability that reflects confidence in the labeling (Grau et al., 2013; Mihaljević et al., 2014). Although soft-labeling has been used extensively for classifier training in the past, it has been neglected for classifier assessment (Keilwagen et al., 2014).

Computing empirical AUC-PR and AUC-ROC values from a limited set of test data points requires interpolation between discrete supporting points corresponding to a series of classification thresholds affecting the classification result. AUC-ROC can be computed by linear interpolation between the supporting points of the curve for both hard-labeled and soft-labeled data.
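For hard-labeled data, linear interpolation between the supporting points of the ROC curve amounts to the trapezoidal rule. A minimal Python sketch of this computation (an illustration of the idea, independent of the R package):

```python
def auc_roc(pos_scores, neg_scores):
    """AUC-ROC by linear (trapezoidal) interpolation between the
    supporting points induced by all classification thresholds."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    P, N = len(pos_scores), len(neg_scores)
    points = [(0.0, 0.0)]  # (FPR, TPR) at a threshold above all scores
    for t in thresholds:
        tp = sum(1 for s in pos_scores if s >= t)  # true positives at t
        fp = sum(1 for s in neg_scores if s >= t)  # false positives at t
        points.append((fp / N, tp / P))
    # trapezoidal rule over consecutive supporting points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

For a perfectly separating classifier this yields 1.0; interleaved scores such as positives (0.8, 0.4) versus negatives (0.6, 0.2) yield 0.75.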
In contrast, Davis and Goadrich (2006) show that for AUC-PR an interpolation along the true positives is more accurate than linear interpolation for hard-labeled data, while Boyd et al. (2013) and Keilwagen et al. (2014) propose a more fine-grained, continuous interpolation between the supporting points of the PR curve. Only the latter can also be used for soft-labeled data (Keilwagen et al., 2014). In Table 1, we list several common R-packages for computing PR or ROC curves or their AUCs, which in some cases provide further performance measures. For PR curves, however, none of the previous packages uses the more accurate interpolation of Davis and Goadrich (2006) or the continuous interpolation of Boyd et al. (2013) and Keilwagen et al. (2014). Hence, none of these packages is applicable to soft-labeled data.
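The interpolation of Davis and Goadrich (2006) works in (TP, FP) space: between two supporting points, FP is assumed to grow linearly with TP, and precision is re-evaluated at each intermediate TP count rather than interpolated linearly in PR space. A Python sketch of this interpolation between two adjacent supporting points (an illustration of the idea, not PRROC's implementation):

```python
def dg_interpolate(tp_a, fp_a, tp_b, fp_b, num_pos):
    """PR points between two adjacent supporting points (tp_a, fp_a) and
    (tp_b, fp_b), interpolated along the true positives as proposed by
    Davis and Goadrich (2006). Returns (recall, precision) pairs for
    every integer TP count in [tp_a, tp_b]."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)  # FP assumed linear in TP
    points = []
    for tp in range(tp_a, tp_b + 1):
        fp = fp_a + slope * (tp - tp_a)
        points.append((tp / num_pos, tp / (tp + fp)))
    return points
```

Between (TP=1, FP=0) and (TP=3, FP=2) with four positives, this yields precisions 1, 2/3 and 3/5 at recalls 0.25, 0.5 and 0.75; linear interpolation in PR space would instead overestimate precision between the endpoints.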
Table 1.

R-packages for computing PR and ROC curves, and their AUCs. "Both": AUC and curve can be computed; "linear": linear interpolation; "DG": interpolation of Davis and Goadrich (2006); "con.": continuous interpolation of Boyd et al. (2013) and Keilwagen et al. (2014).

Package         AUC     PerfMeas  pROC^a  ROCR^b  PRROC
Version         0.3.0   1.2.1     1.7.3   1.0-5   1.1
PR curve
  Hard-labeled  No      Both      No      Curve   Both
  Interpolation N/A     Linear    N/A     Linear  DG/con.
  Soft-labeled  No      No        No      No      Both
ROC curve
  Hard-labeled  Both    AUC       Both    Both    Both
  Soft-labeled  No      No        No      No      Both
Plotting        Yes     Std. R    Yes     Yes     Yes

^a Robin et al. (2011); ^b Sing et al. (2005).

In this article, we present the R-package PRROC, which closes both gaps by (i) using the continuous interpolation of Keilwagen et al. (2014) for computing and drawing PR curves and, by this means, (ii) enabling the computation of PR and ROC curves, as well as AUC-PR and AUC-ROC, for soft-labeled and hard-labeled data. In addition, PRROC optionally computes curves and AUC values for the optimal, the worst and the random classifier as a reference. These references are particularly useful for (a) PR curves and (b) ROC and PR curves in the case of soft-labeled data, where, for instance, the minimum and maximum AUC might differ from 0 and 1, respectively. Finally, PRROC provides a plotting function for generating publication-quality plots of PR and ROC curves.

2 Use cases

In this section, we present typical applications of the PRROC R-package. Complete listings of the corresponding R-code and further examples are available in the R vignette of PRROC. First, we consider the scenario in which we have developed a novel approach for a classification problem with 'hard class labels' and now want to assess its performance on an independent test dataset. Further assume that the classification scores of the data points belonging to the positive class are stored in a vector fg, and those of the negative class in bg. Using PRROC, we can compute the ROC and PR curve, respectively, by roc<-roc.curve(fg,bg,curve=T); pr<-pr.curve(fg,bg,curve=T); obtain the AUC values with print(roc); print(pr); and plot the curves with plot(roc); plot(pr). An ROC curve obtained by this procedure is shown in the left panel of Figure 1.
Fig. 1.

Plots of ROC (left) and PR (right) curves generated by PRROC. For the ROC curve, we consider hard-labeled data and show the plotting variant with a color scale that indicates classification thresholds yielding the points on the curve. For the PR curve, we consider soft-labeled data and show a comparative plot for two classifiers as solid and dashed lines. We also include the maximal and minimal possible curves and the curve of a random classifier for the given soft-labels

Alternatively, classification scores for both classes may be stored in one joint vector (x) and the corresponding class labels (1/0 for positive/negative class) in another vector (lab). In this case, we can compute the ROC and PR curves by roc<-roc.curve(x,weights.class0=lab,curve=T); pr<-pr.curve(x,weights.class0=lab,curve=T). Second, we consider a scenario for a classification problem with 'soft class labels', where each data point belongs to the positive class with probability P and to the negative class with probability (1−P). We assume that the classification scores are again stored in one joint vector (x) and the soft-labels, i.e., the probability of belonging to the positive class for each data point, in another vector (w). Using PRROC, we can compute the PR curve as well as the minimum and maximum curve, and the curve for the random classifier by pr.1<-pr.curve(x,weights.class0=w,curve=T,max.compute=T,min.compute=T,rand.compute=T). Analogously, we compute the PR curve pr.2 for another classifier and plot both curves together with the maximum and minimum curve, and the curve of the random classifier by plot(pr.1,col=2,max.plot=T,min.plot=T,rand.plot=T,fill.area=T,auc.main=F); plot(pr.2,col=3,add=T). A plot obtained by this procedure is shown in the right panel of Figure 1. We clearly see the difference in performance of the two classifiers and may conclude that the ranking implied by the classification scores behind the solid curve reconstructs the soft-labels with greater accuracy.
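Under the weighting scheme of Keilwagen et al. (2014), each data point counts toward the positive class with its weight w and toward the negative class with weight 1−w, so TP and FP become weighted sums. A Python sketch of weighted AUC-ROC along these lines (an illustration of the weighting idea, not PRROC's code; it assumes distinct scores for simplicity):

```python
def weighted_auc_roc(scores, weights):
    """AUC-ROC for soft-labeled data: data point i contributes weights[i]
    to the positive class and 1 - weights[i] to the negative class; the
    curve is linearly interpolated between consecutive thresholds."""
    P = sum(weights)               # total positive weight
    N = sum(1 - w for w in weights)  # total negative weight
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0.0
    points = [(0.0, 0.0)]  # (FPR, TPR)
    for i in order:  # lower the threshold one data point at a time
        tp += weights[i]
        fp += 1 - weights[i]
        points.append((fp / N, tp / P))
    # trapezoidal rule over consecutive supporting points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

With hard labels (weights 0/1) this reduces to the usual AUC-ROC. With soft labels, e.g., scores (1.0, 0.0) and weights (0.9, 0.1), even the perfect ranking yields an AUC of 0.9 rather than 1, illustrating why the maximum and minimum curves computed by PRROC are useful references for soft-labeled data.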

3 Discussion

We present PRROC, an R-package for computing PR and ROC curves as well as their AUCs for soft-labeled and hard-labeled data, which may be beneficial for typical bioinformatics applications. Additionally, PRROC provides a function for plotting PR and ROC curves within R. The PRROC package provides R documentation files and a vignette.

Conflict of Interest: none declared.
References (5 in total)

1.  ROCR: visualizing classifier performance in R.

Authors:  Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal:  Bioinformatics       Date:  2005-08-11       Impact factor: 6.937

2.  pROC: an open-source package for R and S+ to analyze and compare ROC curves.

Authors:  Xavier Robin; Natacha Turck; Alexandre Hainard; Natalia Tiberti; Frédérique Lisacek; Jean-Charles Sanchez; Markus Müller
Journal:  BMC Bioinformatics       Date:  2011-03-17       Impact factor: 3.307

3.  A general approach for discriminative de novo motif discovery from high-throughput data.

Authors:  Jan Grau; Stefan Posch; Ivo Grosse; Jens Keilwagen
Journal:  Nucleic Acids Res       Date:  2013-09-20       Impact factor: 16.971

4.  Multi-dimensional classification of GABAergic interneurons with Bayesian network-modeled label uncertainty.

Authors:  Bojan Mihaljević; Concha Bielza; Ruth Benavides-Piccione; Javier DeFelipe; Pedro Larrañaga
Journal:  Front Comput Neurosci       Date:  2014-11-25       Impact factor: 2.380

5.  Area under precision-recall curves for weighted and unweighted data.

Authors:  Jens Keilwagen; Ivo Grosse; Jan Grau
Journal:  PLoS One       Date:  2014-03-20       Impact factor: 3.240
