INTRODUCTION: Evaluation of computer programs that generate multiple diagnoses can be hampered by a lack of effective, well-recognized performance metrics. We have developed a method to calculate mean sensitivity and specificity for multiple diagnoses and to generate ROC curves. METHODS: Data came from a clinical evaluation of the Heart Disease Program (HDP). Sensitivity, specificity, and positive and negative predictive value (PPV, NPV) were calculated for each diagnosis type in the study. A weighted mean of overall sensitivity and specificity was derived and used to create an ROC curve. The alternative metrics Comprehensiveness and Relevance were calculated for each case and compared with the other measures. RESULTS: Weighted mean sensitivity closely matched Comprehensiveness, and mean PPV matched Relevance. Plotting the physicians' sensitivity and specificity on the ROC curve showed that their discrimination was similar to the HDP's, but their sensitivity was significantly lower. CONCLUSIONS: These metrics give a clear picture of a program's diagnostic performance and allow straightforward comparison between different programs and different studies.
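The weighted-mean calculation described in METHODS can be sketched as follows. This is a minimal illustration of one plausible reading of the abstract, assuming each diagnosis type contributes a 2x2 confusion table and is weighted by how often that diagnosis was actually present (TP + FN); the diagnosis names and counts are hypothetical, not data from the HDP study.

```python
def rates(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 confusion table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical per-diagnosis counts: (TP, FP, FN, TN).
counts = {
    "mitral stenosis": (40, 10, 5, 145),
    "heart failure":   (60, 20, 15, 105),
    "aortic stenosis": (25, 5, 10, 160),
}

# Assumed weighting scheme: each diagnosis weighted by the number of
# cases in which it was actually present (TP + FN).
weights = {d: tp + fn for d, (tp, fp, fn, tn) in counts.items()}
total = sum(weights.values())

mean_sens = sum(weights[d] * rates(*counts[d])[0] for d in counts) / total
mean_spec = sum(weights[d] * rates(*counts[d])[1] for d in counts) / total

print(f"weighted mean sensitivity = {mean_sens:.3f}")  # 0.806 for these counts
print(f"weighted mean specificity = {mean_spec:.3f}")  # 0.897 for these counts
```

Repeating this calculation at different diagnostic-score thresholds yields the (sensitivity, 1 - specificity) pairs that trace the ROC curve.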