A systematic review of the diagnostic accuracy of artificial intelligence-based computer programs to analyze chest x-rays for pulmonary tuberculosis.

Miriam Harris1,2,3, Amy Qi2,4, Luke Jeagal4, Nazi Torabi5, Dick Menzies1,4,6, Alexei Korobitsyn7, Madhukar Pai1,4,6, Ruvandhi R Nathavitharana8, Faiz Ahmad Khan1,4,6.   

Abstract

We undertook a systematic review of the diagnostic accuracy of artificial intelligence-based software for identification of radiologic abnormalities (computer-aided detection, or CAD) compatible with pulmonary tuberculosis on chest x-rays (CXRs). We searched four databases for articles published between January 2005 and February 2019. We summarized data on CAD type, study design, and diagnostic accuracy. We assessed risk of bias with QUADAS-2. We included 53 of the 4712 articles reviewed: 40 focused on CAD design methods ("Development" studies) and 13 focused on evaluation of CAD ("Clinical" studies). Meta-analyses were not performed due to methodological differences. Development studies were more likely than Clinical studies to use CXR databases with greater potential for bias. Areas under the receiver operating characteristic curve (median AUC [IQR]) were significantly higher in Development studies (0.88 [0.82-0.90]) than in Clinical studies (0.75 [0.66-0.87]; p = 0.004), and with deep-learning (0.91 [0.88-0.99]) than with machine-learning (0.82 [0.75-0.89]; p = 0.001). We conclude that CAD programs are promising, but the majority of work thus far has been on development rather than clinical evaluation. We provide concrete suggestions on what study design elements should be improved.

Year:  2019        PMID: 31479448      PMCID: PMC6719854          DOI: 10.1371/journal.pone.0221339

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The need to improve tuberculosis (TB) diagnostic and screening services in high-burden countries is clear: in 2016, active TB was the leading cause of death due to an infectious agent, and only 69% of the 10.4 million people that developed this disease were detected by or notified to national TB programmes [1]. In developed countries, chest x-rays (CXRs) have been used for several decades for the evaluation of persons presenting with symptoms of possible active pulmonary TB (PTB), and for screening of individuals in high-risk groups [2]. However, uptake of CXR in high TB burden countries, particularly in resource-constrained settings, has been limited [3, 4]. In recent years, there has been increasing interest in expanding access to chest radiography in order to improve TB case detection in high-burden areas [5]. One of the challenges is the paucity of professionals to interpret radiographic images in resource-constrained settings [6]. Over the same period, advances in artificial intelligence (AI) technology and methods have led to major progress in automated image recognition by computers. AI has been applied to the analysis of radiologic images to identify abnormalities, an application referred to as computer-aided detection (CAD), and represents one potential solution to overcome the personnel shortage. Two AI approaches commonly used to create CAD programs capable of reading CXRs are machine learning (ML) and deep learning (DL). ML is a type of AI analysis that relies less on human specification (i.e. defining a set of variables to be included) and instead allows algorithms to decide what variables are important [7, 8]. DL is a subset of ML which attempts to model brain architecture [7]. It uses neural networks, or overlaying models, that emphasize learning increasingly meaningful representations of the data [7].
The World Health Organization (WHO) has called for greater evidence before endorsing the use of CAD in PTB diagnostic and screening pathways [5]. To date, there has been only one systematic review of CAD use for PTB detection [9], and it was limited to reviewing the only commercially available software at the time of publication. Amongst the 5 studies included, the reviewers identified methodological limitations that prevented the pooling of results. Because the prior review was limited to studies of the single commercially available software, it excluded the vast majority of studies of CAD for detecting PTB. Hence, in order to provide a more comprehensive summary of the CAD literature, we undertook an updated systematic review which included studies of non-commercially available CAD. Our primary objectives were to evaluate the evidence base with regard to the estimation of the diagnostic accuracy of CAD, including assessing potential for bias, and, if appropriate, to calculate pooled estimates of area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Secondary objectives were to evaluate study-level factors associated with diagnostic accuracy, including those related to the design of the study and the type of software used (ML versus DL).

Methods

Design

This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines [10]. The International Prospective Register of Systematic Reviews (PROSPERO) registration number of this protocol is CRD42018073016.

Data sources and search strategy

A search strategy was developed in consultation with an academic librarian (NT) to identify published articles in MEDLINE (Ovid), EMBASE (Ovid), PubMed, and Scopus (S1 Appendix). The search strategies included subject headings (where applicable) and text words for the concepts of pulmonary tuberculosis, computer aided diagnosis, and diagnostic accuracy. The search period was limited to papers published after January 1, 2005, and included articles published up to February 13, 2019. Studies were limited to English and French.

Study selection

We included all published studies that used any form of computer software to analyze CXR in place of human readers, for PTB detection purposes. Studies were excluded if they reported CAD for diagnostic imaging other than CXR, or if CAD was used for diseases other than PTB. Studies reported only in conference abstracts were excluded. Four independent reviewers selected studies for inclusion (MH, AQ, LJ, FAK). Conflicts were reviewed by a third reviewer (FAK).

Data extraction

Data were extracted using a standardized extraction form (S2 Appendix). Three reviewers performed the extraction, with one reviewer (MH) verifying all data forms completed by the other two (AQ and LJ). Data collected included year of enrollment; funding sources and conflicts of interest; software name and version number; country where study was completed; CXR site and number on which the software was trained; model of CXR machine, and digitization methods; study design and patient selection methods; inclusion and exclusion criteria; microbiologic tests collected; scoring of software tools and methods of scoring selection; patient characteristics including HIV status, age, and history of TB; and diagnostic accuracy measures, including sensitivity, specificity, and AUC, for microbiologic and radiologic references.

Descriptive analysis

We classified studies as either Development or Clinical. Development studies primarily focused on reporting methods for creating a CAD program for PTB, and some included an assessment of diagnostic accuracy—the latter being the focus of our systematic review. Development studies were often published in engineering, computer science, medical imaging journals, or proceedings from engineering or medical imaging conferences. The development studies were further subdivided based on the type of AI technology used (ML versus DL). Clinical studies primarily focused on the assessment of the accuracy of an already-developed CAD software. We further classified Clinical studies based on the context in which the CXR was used, using WHO terminology for categorizing usage of x-ray as either for Triage or for Screening [5]. In Triage studies, CXRs were used in a healthcare setting—hospital, or clinic—as part of the diagnostic pathway of someone with PTB symptoms. In Screening studies, CXRs were used for active case finding or prevalence surveys, where populations are screened to identify those with active TB often regardless of symptoms. The distinction was made because the prevalence of more advanced or extensive disease will be higher in the Triage setting, thereby affecting the sensitivity of CXR and hence the accuracy of CAD.

Quality assessment with respect to the evaluation of diagnostic accuracy

The data sources used for evaluating diagnostic accuracy of CAD were databases consisting of CXRs, with each image linked to a reference standard result classifying PTB as present or absent. Some of these data sources had been used by more than one Development study. We evaluated these data sources for potential risk of bias by applying a modified Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2 approach [11]. As our interest was to assess the composition of the database itself including how PTB cases were defined, we restricted our approach to the domains of patient selection and the reference test. Because Development studies often did not provide sampling or reference details about the data sources, we sought additional information from citations that described the data sources [12-15]. We applied QUADAS-2 to all the CAD studies, assessing each study across the four domains (patient selection, the performance of the index test, performance of the reference test, and flow and timing). In all quality assessments, when the reference standard used for determining a CAD program’s diagnostic accuracy was image interpretation by a human reader instead of microbiologic testing of sputum, we judged this as a potential source of bias. This is because human interpretation of CXR is moderately specific for PTB, has variable sensitivity, is marked by limited inter-reader reliability, and the reproducibility is limited [5, 16].

Statistical analysis

Diagnostic accuracy measures (sensitivity, specificity, AUC) were reported when available. For the studies that reported sensitivities and specificities, if two-by-two tables were not available, we back-calculated counts from the reported accuracy measures to build forest plots. A meta-analysis was not undertaken given that different software programs were used, and for most studies the raw data necessary to meta-analyze diagnostic accuracy measures were unavailable. For studies of the most commonly reported software, CAD4TB, a meta-analysis was also not pursued due to the variability of the methods and software versions tested. The following study-level factors were evaluated as potential determinants of the reported AUC: type of CAD study (Development vs Clinical); the method of AI software (ML versus DL); whether the CXRs used for evaluating diagnostic accuracy were the same CXRs that had been used to train the software; the type of reference standard for PTB (microbiologically confirmed vs human interpretation of CXR image); and the degree of patient selection, index test, and reference standard bias. Although the data were insufficient for a traditional meta-analysis, we sought to identify associations between these factors and reported AUC by comparing the pooled distribution of the reported AUCs between groups defined by these study-level factors using Kruskal-Wallis tests. When studies reported more than one AUC, a mean AUC was calculated and used for this analysis. This assessment was done for the AUC but not for sensitivity or specificity, as the latter two were reported in too few studies to undertake a meaningful comparison of distributions. For all Clinical studies and Development studies which reported sensitivity, specificity, and true positives, forest plots were used to visually assess heterogeneity of diagnostic accuracies.
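The back-calculation of two-by-two counts described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `back_calculate_counts` and the example numbers are hypothetical, and it assumes the numbers of reference-positive and reference-negative CXRs are known for the study.

```python
def back_calculate_counts(sensitivity, specificity, n_positive, n_negative):
    """Approximate 2x2 cell counts from reported sensitivity and specificity.

    Assumption (not from the review): the number of reference-positive and
    reference-negative CXRs is known. Counts are rounded to whole images, so
    they only approximate the original table.
    """
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    tp = round(sensitivity * n_positive)   # true positives
    fn = n_positive - tp                   # false negatives
    tn = round(specificity * n_negative)   # true negatives
    fp = n_negative - tn                   # false positives
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical study: 50 TB-positive and 100 TB-negative CXRs.
print(back_calculate_counts(0.80, 0.90, 50, 100))
```

Because of rounding, more than one table can be compatible with the same published sensitivity and specificity, which is one reason back-calculated counts are better suited to forest plots than to pooled estimates.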

Results

We identified 4712 unique citations (Fig 1), of which 4321 studies were excluded at the title and abstract phase. Of the remaining 391, 338 were excluded after full-text review. Amongst the 53 included articles, 40 were classified as Development studies and 13 were classified as Clinical (Table 1). The software developers were either authors or funded the research in 9/13 (69%) of the Clinical studies [17-25], and in 40/40 (100%) of the Development studies.
Fig 1

Study flow diagram.

Computer aided detection (CAD).

Table 1

Methods of studies included in the descriptive analysis.

Author and year | Country where CXR completed | Databases used | Computer software | Reference standard | Accuracy measures

Development studies

Deep learning
Heo et al, 2019 | South Korea | YU AWHE | Not named | Human reader | AUC
Hwang et al, 2018 | South Korea, USA, China | SNUH, BMC, KUHG, DEMC, MC, CH | DLAD | Liquid culture, NAAT, and/or TB treatment | AUC
Lakhani et al, 2017 | USA, China | MC, CH, TJH, Belarus | AlexNet and GoogLeNet | Human reader | AUC, Sn, Sp
Santosh et al, 2017 | USA, China, India | MC, CH, IN | Not named | Human reader | AUC, Sn, Sp
Lopes et al, 2017 | USA, China | MC, CH | Not named | Human reader | AUC
Santosh et al, 2016 | USA, China | MC, CH | Not named | Human reader | AUC
Hwang et al, 2016 | South Korea, USA, China | KIT, MC, CH | AlexNet | Human reader | AUC

Machine learning
Ilena et al, 2018 | China | CH | Matlab | Human reader | Sn, Sp, TP, TN, FP, FN
Rajaraman et al, 2018 | China, USA, Kenya, India | CH, MC, Kenya, IN | Not named | Human reader | AUC
Sivaramakrishnan et al, 2018 | China, USA, Kenya, India | CH, MC, Kenya, IN | Custom 12-layer CNN | Human reader | AUC
Vajda et al, 2018 | USA, China | MC, CH | Matlab | Human reader | AUC
Alfadhli et al, 2017 | USA | MC | Not named | Human reader | AUC, Sn, TP
Fatima et al, 2017 | USA | MC | Not named | Human reader | Sn, Sp
Ding et al, 2017 | China, India, Kenya | Kenya, IN, CH | Not named | Human reader | NR
Hogeweg et al, 2017 | Japan, Sub-Saharan Africa | JSRT, Sub-Saharan Africa | Not named | Human reader | AUC
Udayakumar et al, 2017 | USA, China | MC, CH | SVM and CBC techniques | Human reader | AUC
Maduskar et al, 2016 | Zambia | Large Zambian | Not named | Human reader | AUC
Poornimadevi et al, 2016 | Japan, USA | JSRT, MC | Not named | Human reader | Sn, Sp
Karargyris et al, 2016 | China, Japan | JSRT, CH | Not named | Human reader | AUC
Melendez et al, 2016 | Zambia | Zambian | Not named | Human reader | AUC
Melendez et al, 2015 | Zambia, Tanzania, Gambia | Zambian, Tanzania, Gambian | Not named | Human reader | NR
Hogeweg et al, 2015 | UK, South Africa | F&T, TB-NEAT | Not named | Human reader, liquid culture, composite reference standard** | AUC, Sn, Sp
Giacomini et al, 2015 | Brazil | Prospective, study-specific | Not named | Liquid culture+ | NR
Jaeger et al, 2015 | China | CH | Not named | Human reader | NR
Requena-Mendez et al, 2015 | Peru | CXR from DOT study in Peru | Not named | Human reader | NR
Jaeger et al, 2014 | China, USA, Japan | JSRT, MC, CH | Not named | Human reader | AUC, Sn, Sp
Melendez et al, 2014 | Zambia, South Africa | Zambian | TB-Xpredict | Human reader | AUC
Chauhan et al, 2014 | India | IN | Not named | Human reader | NR
Seixas et al, 2013 | Brazil | Clinical data set from another study* | Artificial Neural Network | Composite reference** | NR
Sundaram et al, 2013 | Not specified | Not specified | Not named | Human reader | NR
Jaeger et al, 2012 | USA, Japan | JSRT, MC | Not named | Human reader | AUC
Xu et al, 2011 | Japan, Canada | JSRT, Calgary dataset | Andrews' curve | Human reader | TP, FP, FPR
Noor et al, 2011 | Malaysia | Retrospective non-clinical study specific radiological | Not named | Human reader | Sn, Sp
Shen et al, 2010 | Canada | JSRT, Calgary | Not named | Human reader | TP, FPR
Mouton et al, 2010 | South Africa | Clinical dataset from previous study not specific to PTB | Not named | Human reader | AUC
Hogeweg et al, 2010 | Sub-Saharan Africa | Sub-Saharan Africa | CAD with rib suppression | Human reader | AUC
Hogeweg et al, 2010 | Not specified | Not specified | Not named | Human reader | NR
Lieberman et al, 2009 | China | Prospective, study-specific | Not named | Human reader | NR
Arzhaeva et al, 2009 | Netherlands | F&T | Not named | Human reader | AUC
Noor et al, 2005 | China, USA | MC, CH | Andrews' curve | Composite reference** | NR

Clinical studies

Machine learning
Koesoemadinata et al, 2018 | Indonesia | Prospective, study-specific | CAD4TB (v 5) | Liquid culture/NAAT | AUC, Sn, Sp
Melendez et al, 2018 | United Kingdom | Find & Treat | CAD4TB (v 5) | Human reader, TB treatment | AUC, Sn, Sp, TP, FP, TN, FN
Zaidi et al, 2018 | Pakistan | Sehatmand Zindagi (Healthy Life) | CAD4TB (v 3.07) | NAAT | AUC, Sn, Sp
Rahman et al, 2017 | Bangladesh | Prospective, study-specific | CAD4TB (v 3.07) | NAAT | AUC, Sn, Sp
Melendez et al, 2017 | Zambia | Zambia National TB Prevalence Survey | CAD4TB (v 5) | Human reader for CXR-, liquid culture/NAAT for CXR+ | AUC, Sn, Sp
Muyoyeta et al, 2017 | Zambia | Prospective, study-specific | CAD4TB (v 1.08) | NAAT for CXR+, AFB smear for CXR- | NR
Melendez et al, 2016 | South Africa | TB-NEAT collaborative study | CAD4TB (v 3.07) | Liquid culture | AUC, Sn, Sp
Philipsen et al, 2015 | South Africa | TB-NEAT collaborative study | CAD4TB (v 3.07) | NAAT, liquid culture | AUC, Sn, Sp
Steiner et al, 2015 | Tanzania | TB REACH project | CAD4TB (v 3.07) | Human reader | AUC, Sn, Sp
Muyoyeta et al, 2015 | Zambia | Prospective, study-specific | CAD4TB (v 1.08) | NAAT, AFB smear for CXR- | AUC, Sn, Sp
Breuninger et al, 2014 | Tanzania | TB Cohort and TB CHILD study | CAD4TB (v 3.07) | Liquid culture, AFB smear | AUC, Sn, Sp
Muyoyeta et al, 2014 | Zambia | Prospective, study-specific | CAD4TB (v 1.08) | NAAT | AUC, Sn, Sp
Maduskar et al, 2013 | Zambia | Prospective, study-specific | CAD4TB (v 1.08) | Liquid culture, AFB smear | AUC, Sn, Sp

CXR, chest x-ray; USA, United States of America; UK, United Kingdom; AI, artificial intelligence; YU AWHE, Yonsei University annual worker's health examination; SNUH, Seoul National University Hospital; BMC, Boramae Medical Center; KUHG, Kyunghee University Hospital at Gangdong; DEMC, Daejeon Eulji Medical Center; MC, Montgomery County; CH, Shenzhen Hospital, China; IN, Indian collection New Delhi; TJH, Thomas Jefferson Hospital dataset; JSRT, Japanese Society of Radiology; KIT, Korean Institute of Tuberculosis; F&T, Find and Treat; DLAD, deep learning automatic detection; SVM, Support vector machines; CBC, clustering based classification; CAD, computer aided detection; NAAT, nucleic acid amplification test; AFB, acid fast bacilli; ‘+’, positive; ‘-‘, negative; AUC, area under the receiver operating curve; Sn, sensitivity; Sp, specificity; NR, not reported; TP, true positives; FP, false positives; FPR, false positive rate; TN, true negatives, FN, false negatives; ACC, accuracy

* Trajman et al. Pleural fluid ADA, IgA-ELISA and NAAT sensitivities for the diagnosis of pleural tuberculosis Study

**Composite reference: positive culture/NAAT and/or initiation of TB treatment

†In these studies the study database was developed prospectively for the specific study


Overview of studies

Within the Development studies, 7/40 (17%) employed DL methods while the remaining 33/40 (83%) used ML approaches (Table 1) [26-65]. An important consideration when evaluating the accuracy of CAD software is that it should be tested using a set of CXR images that is separate from the training set (i.e. avoid testing accuracy with CXRs that were used for training, or with CXRs that were not used for training but that originate from the same subset/study as those with which the program was trained). Otherwise, the evaluation is likely to overestimate the diagnostic accuracy, and will also have limited generalizability [66]. Within the Development studies that reported accuracy measures, 3/32 (12%) did not report the database used to train and test their software. Overall, the majority of studies (32/40, 80%) either used the same databases to train and test their software, or did not comment on this (Table 2). For the majority of Development studies, demographic data of the study population whose CXRs were used to train and evaluate CAD were not reported in detail.
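One way to keep test images separate from the training set, including images that come from the same source study, is to assign whole sources to either training or testing. The sketch below is illustrative only: `split_by_source`, the `(image_id, source)` tuples, and the site names are hypothetical and do not describe how any included study partitioned its data.

```python
import random


def split_by_source(images, test_fraction=0.3, seed=0):
    """Split (image_id, source) pairs so that no source appears in both sets.

    Whole sources (for example, a database or study site) are assigned to
    either the training or the test set, so test images never originate
    from a source the program was trained on.
    """
    sources = sorted({source for _, source in images})
    if len(sources) < 2:
        raise ValueError("need at least two sources for a source-level split")
    rng = random.Random(seed)
    rng.shuffle(sources)
    # Hold out at least one source, but always leave one for training.
    n_test = min(len(sources) - 1, max(1, round(len(sources) * test_fraction)))
    test_sources = set(sources[:n_test])
    train = [item for item in images if item[1] not in test_sources]
    test = [item for item in images if item[1] in test_sources]
    return train, test


# Hypothetical image list drawn from four made-up sites.
example = [("img1", "site_A"), ("img2", "site_A"), ("img3", "site_B"),
           ("img4", "site_C"), ("img5", "site_C"), ("img6", "site_D")]
train_set, test_set = split_by_source(example, test_fraction=0.25, seed=1)
print(len(train_set), len(test_set))
```

A purely random image-level split would satisfy the first condition in the paragraph above (no shared images) but not the second (no shared source), which is why the split here operates on sources rather than on individual images.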
Table 2

Accuracy measures reported by development studies.

Author and year | Database(s) used for training of CAD | Number of CXRs used for training | Database(s) used for testing CAD | Number of CXRs used for testing | Number of TB positive CXR | AUC (95% CI) | Threshold score | Sn (95% CI) | Sp (95% CI)

Deep learning
Heo et al, 2019 | YU AWHE | 2000 | YU AWHE | 37475 | 1202 | 0.91 (NR), 0.92 (NR) | NR | NR | NR
Hwang et al, 2018 | SNUH | 60989 | SNUH, BMC, KUHG, DEMC, MC, CH | NR | 6768 | 0.988 (0.976–0.999) | NR | 0.95 (SNUH), 0.94 (BMC), 1.0 (KUHG), 1.0 (DEMC), 1.0 (MC), 0.95 (CH)* | 1.0 (SNUH), 0.96 (BMC), 0.91 (KUHG), 0.98 (DEMC), 0.94 (MC), 0.91 (CH)*
Lakhani et al, 2017 | MC, CH, TJH, Belarus | 857 | MC, CH, TJH, Belarus | 150 | 75 | 0.99 (0.96–1.00) | NR | 0.97 (0.90–1.0) | 0.95 (0.87–0.98)
Santosh et al, 2017 | MC, CH, IN | 976 | MC, CH, IN | 976 | 478 | 0.92 (MC), 0.82 (CH), 0.96 (IN)* | NR | 0.88 (MC), 0.78 (CH), 0.92 (IN)* | 0.81 (MC), 0.76 (CH), 0.86 (IN)*
Lopes et al, 2017 | NR | NR | CH, MC, CI, NR | 1031 | 550 | 0.834 (CH), 0.926 (MC)* | NR | NR | NR
Santosh et al, 2016 | NR | NR | CH, MC, CI | 878 | 400 | 0.93 (CH) & 0.88 (MC)* | NR | NR | NR
Hwang et al, 2016 | KIT | 9221 | KIT, MC, CH | 2427 | NR | 0.96*+ | NR | NR | NR

Machine learning
Ilena et al, 2018 | CH | 20 | CH | 30 | 15 | NR | NR | 0.67 (NR)* | 0.86 (NR)*
Rajaraman et al, 2018 | CH, MC, AMPATH, Kenya, IN | 2073 | CH, MC, Kenya, IN | 2073 | 785 | 0.991 (CH), 0.962 (MC), 0.826 (Kenya), 0.965 (IN)* | NR | NR | NR
Sivaramakrishnan et al, 2018 | CH, MC, Kenya, IN | 1659 | CH, MC, Kenya, IN | 1228 | 785 | 0.926 (CH), 0.833 (MC), 0.775 (Kenya), 0.956 (IN)* | NR | NR | NR
Vajda et al, 2018 | MC, CH | NR | MC, CH | 814 | 392 | 0.91 (MC), 0.99 (CH)* | NR | NR | NR
Alfadhli et al, 2017 | MC | 97 | MC | 41 | 58 | 0.89* | NR | 0.79* | NR
Fatima et al, 2017 | MC | 138 | MC | 138 | 58 | NR | NR | 0.83* | 0.78*
Udayakumar et al. | MC, CH | NR | MC, CH | NR | NR | 0.87* | NR | 0.81* | 0.74*
Hogeweg, et al, 2017 | JSRT, Sub-Saharan Africa | NR | Sub-Saharan Africa | 348 | 174 | 0.891* | NR | NR | NR
Ding et al, 2017 | NR | NR | Kenya, IN, CH | NR | NR | 0.949 (CH), 0.982 (IN), 0.76 (Kenya)* | NR | NR | NR
Maduskar et al, 2016 | Large Zambian | 629 | Large Zambian | 638 | NR | 0.9* | NR | 0.83* | 0.70*
Poornimadevi et al, 2016 | JSRT | 247 | JSRT | 247 | NA | NR | NR | 0.56* | 0.36*
Karargyris et al, 2016 | CH | 43 | JSRT, CH | NR | NR | 0.93* | NR | NR | NR
Melendez et al, 2016 | Zambian | 461 | Zambian | 456 | 248 | 0.87* | 0.45 | NR | NR
Melendez et al, 2015 | Zambian, Tanzania, Gambian | 1323 | Zambian, Tanzania, Gambian | 1313 | 671 | 0.86 (Zambia), 0.88 (Tanzania), 0.91 (Gambia)* | NR | NR | NR
Hogeweg et al, 2015 | F&T, TB-NEAT | 400 | F&T, TB-NEAT | 400 | 153 | 0.87 (0.81–0.92) (F&T), 0.74 (0.69–0.83) (TB-NEAT)# | NR | NR | NR
Jaeger et al, 2014 | MC, CH, JSRT | 1000 | MC, CH | 753 | 333 | 0.87* | NR | 0.78 (0.70–0.85) | 0.81 (0.71–0.89)
Melendez et al, 2014 | Zambian | 461 | Zambian | 456 | NR | 0.88* | NR | NR | NR
Chauhan et al, 2014 | IN | 204 | IN | 102 | 153 | 0.96 (0.86–0.99) (DA), 0.89 (0.77–0.96) (DB)## | NR | 0.96 (DA), 0.88 (DB)* | 0.92 (DA), 0.84 (DB)*&
Sundaram et al, 2013 | NR | 95 | NR | 95 | 52 | NR | NR | 0.75* | 0.90*
Jaeger et al, 2012 | JSRT | 247 | MC | 138 | NR | 0.83* | NR | NR | NR
Xu et al, 2011 | JSRT, Calgary | 60 | JSRT, Calgary | 60 | NR | NR | NR | 0.68* | 0.68*
Noor et al, 2011 | Retrospective non-clinical | 90 | Retrospective non-clinical | 213 | 208 | NR | NR | 0.88* | 0.84*
Shen et al, 2010 | JSRT, Calgary | 18 | JSRT, Calgary | 131 | 19 | NR | NR | 0.82* | NR
Mouton et al, 2010 | Clinical non-TB specific | 119 | Clinical non-TB specific | 119 | NR | NR | 0.78* | NR | NR
Hogeweg, et al, 2017 | CRASS | 348 | CRASS, JSRT | 498 | NR | 0.75* | NR | NR | NR
Arzhaeva et al, 2009 | F&T | 217 | F&T | 217*++ | 37 | NR | 0.83 TB-sus, 0.74 micro* | NR | NR

CAD, computer aided detection; YU AWHE, Yonsei University annual worker's health examination; SNUH, Seoul National University Hospital; BMC, Boramae Medical Center; KUHG, Kyunghee University Hospital at Gangdong; DEMC, Daejeon Eulji Medical Center; MC, Montgomery County; CH, Shenzhen Hospital, China; IN, Indian collection New Delhi; TJH, Thomas Jefferson Hospital dataset; AMPATH, Academic Model Providing Access to Healthcare; JSRT, Japanese Society of Radiology; KIT, Korean Institute of Tuberculosis; F&T, Find and Treat; AUC, area under the receiver operating curve; 95% CI, 95 percent confidence interval; NR, not reported; DA, dataset A; DB, dataset B; Sn, sensitivity; Sp, specificity; TP, true positives; FP, false positives; FPR, false positive rate; TB-sus, TB suspect

* No 95% CI reported

+ Average AUC from KIT, MC, Shenzhen

++ 128 of the normal images were the same CXRs used in the training

# An external and a radiological reference standard were used. The external reference for tuberculosis was set by an independent test not associated with the CXR: the result of sputum culture testing for the TB-NEAT database, and a combination of sputum culture testing and clinical diagnosis for the Find & Treat database

## Two CXR digital image datasets, dataset A and B, were obtained from two different X-ray machines available at the National Institute of Tuberculosis and Respiratory Diseases, New Delhi

†The database was split between TB suspect cases were re-read by a third radiologist, and if classified differently were excluded. The database contained 256 normal radiographs, 178 TB suspect radiographs, and 37 microbiologically diagnosed TB CXRs.

All Clinical studies used ML-based versions of CAD4TB. Within the triage use-case studies, 6/8 (75%) used a microbiologic reference standard on all participants [18, 19, 22, 25, 67, 68]. Within the screening studies, 4/5 (80%) used a microbiologic reference [20, 24, 69, 70].
In two Clinical studies, the CADscore was used to select which participants underwent microbiologic testing, hence the software’s diagnostic accuracy could not be assessed [17, 69]. The study populations of all the triage studies with microbiologic references were quite similar (S1 and S2 Tables). Notably, the estimated HIV and TB prevalence in the triage studies were quite high, ranging from 15% to 33%. The screening studies had lower TB prevalence compared to triage studies (S1 and S2 Tables).

Quality assessment of development studies

We first assessed the databases that were used as sources of CXR images and reference standards for the Development studies (S3 Table). Risk of selection bias was high in 2/18 (11%) of the databases. One dataset did not include PTB cases, and the other only included patients with “typical TB” images [13, 51]. Selection bias was unclear in 6/16 (38%), and low in 8/16 (50%) where consecutive enrollment, either prospective or retrospective, was used. The reference standard risk of bias was high in 10/18 (56%) databases as a human reader was used, unclear in 3/18 (17%), and low in 4/18 (22%) where a microbiologic reference was used. The quality of the Development studies with respect to the assessment of diagnostic accuracy is reported in Fig 2. Selection bias was largely determined by which databases were used (S3 Table). The potential for selection bias was high in 13/33 (39%) studies, unclear in 13/33 (39%), and low in 7/33 (21%). One study [62] had a pre-specified threshold score and therefore had a low risk of bias in the assessment of the index test, but the other 97% had a high risk of bias as the threshold scores were set after the analysis. Additionally, 29/33 (88%) of the studies were considered to have a high degree of bias and a low degree of applicability with regard to the reference test, due to use of a human reader’s interpretation of CXRs. The flow and timing had low bias in 15/33 (45%) studies, unclear bias in 17/33 (52%), and high bias in 1/33 (3%).
Fig 2

Quality assessment (QUADAS 2) graph of development studies.

Quality assessment of clinical studies

All triage studies used a consecutive enrollment strategy, with 3/8 (38%) being prospective, 5/8 (63%) retrospective. Additional details about selection are provided in the Appendix (S2 Table). Fig 3 summarizes the QUADAS-2 assessment of the Clinical studies. There were methodological concerns that likely resulted in a high degree of selection bias in 4/13 (31%) of the studies [18, 21, 23, 68]. This was secondary to case-control design [21], and inappropriate exclusion of patients in the analysis [18, 23, 68]. The threshold score was pre-specified in only 5/13 (38%) of the studies [17, 19, 22, 25, 71]. The remainder of the studies reported threshold scores post-analysis and were therefore determined to have a high risk of bias [18, 20, 21, 23, 24, 68, 70, 72]. The majority of studies, 10/13 (77%) had low potential for bias with regards to the use and performance of the reference standard [18–20, 22–25, 70, 72]. In two studies, the CAD software was used to select patients to undergo microbiologic testing for PTB, and therefore were determined to have a high risk of bias for estimating diagnostic accuracy of CAD [17, 71]. In another study, the reference standard was human reading of the CXR which was deemed to have a high risk of bias [21]. The flow and timing had a high risk of bias in 2/10 (20%) of the studies due to CAD4TB selection of the reference standard [17, 71], was unclear in 3/10 (30%), and low in 5/10 (50%).
Fig 3

Quality assessment (QUADAS 2) graph of clinical studies.

Diagnostic accuracy reported in development studies

We found 33/40 (83%) of the Development studies reported measures of accuracy for index tests. Of the 33 references that did include accuracy assessments, the AUC ranged from 0.78 to 0.99, sensitivity from 0.56 to 0.97, and specificity from 0.36 to 0.95 (Table 2). The forest plots graphically display the diagnostic heterogeneity of the sensitivity and specificity of the Development studies that published sensitivity, specificity, and the number of true positive TB cases (Fig 4).
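Each row of a forest plot needs a point estimate and an interval computed from the study's counts. The review does not state which interval method was used; the sketch below assumes a Wilson score interval, and the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion such as sensitivity (TP / (TP + FN)).

    Assumption: a Wilson interval with z = 1.96 (about 95% coverage); the
    review itself does not say how its forest-plot intervals were derived.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical study: 40 of 50 TB-positive CXRs flagged by CAD (sensitivity 0.80).
low, high = wilson_interval(40, 50)
print(f"sensitivity 0.80, 95% CI {low:.2f} to {high:.2f}")
```

Unlike the simple normal approximation, this interval stays inside 0 to 1 even when a small study reports a sensitivity of 1.0, which is relevant for several of the smaller test sets in Table 2.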
Fig 4

Forest plots of accuracy measures of development and CAD4TB studies.

TP, true positive; FP, false positive; FN, false negative; TN, true negative; AI, artificial intelligence; CXRs, chest x-rays; ML, machine learning; DL, deep learning; CI, confidence interval; NAAT, nucleic acid amplification test.


Diagnostic accuracy reported in clinical studies

The forest plots graphically display the diagnostic heterogeneity of the sensitivity and specificity of the triage studies that used a microbiologic reference (Fig 4). In these studies, the sensitivity ranged from 0.86 to 1.00, and specificity ranged from 0.23 to 0.69. In the screening studies, sensitivity ranged from 0.53 to 0.89 and specificity ranged from 0.56 to 0.98. In one screening study [21], investigators used a human reader as the reference standard and reported that the sensitivity and specificity of CAD were 0.59 and 0.78, respectively. The sensitivity of CAD was higher when using NAAT as the microbiologic reference standard compared to culture. Given the methodological heterogeneity, the lack of standardized threshold scores, and the variability of software versions used, a meta-analysis was not undertaken.
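As an illustration of how the sensitivity and specificity values shown in such forest plots can be derived from a study's 2x2 table, the following minimal Python sketch computes both measures with approximate 95% confidence intervals. The Wilson score interval is an assumption chosen for illustration, since this section does not state which interval method was used, and the counts are hypothetical rather than taken from any included study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def accuracy_from_counts(tp, fp, fn, tn):
    """Sensitivity and specificity, each with a Wilson interval, from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts, for illustration only (not data from the review).
result = accuracy_from_counts(tp=90, fp=60, fn=10, tn=140)
print(result)
```

With few true positive cases the interval around sensitivity widens considerably, which is one reason point estimates from small studies should be read together with their intervals.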

Assessment of study-level factors associated with reported AUC

Fig 5 shows the distribution of reported AUCs stratified by study-level characteristics. Reported AUCs were higher in Development studies (median [IQR] AUC: 0.88 [0.82–0.90]) versus Clinical studies (0.75 [0.66–0.87]; p = 0.004), and with DL (0.91 [0.88–0.99]) versus ML (0.82 [0.75–0.89]; p = 0.001). Although the difference was not statistically significant, the median AUC of studies using a human reader as the reference standard was higher than that of studies using a microbiologic reference standard (0.88 [0.81–0.90] versus 0.77 [0.67–0.89]; p = 0.16). There was no significant difference in AUCs between studies that did and did not use the same CXRs as the source for software development and evaluation of diagnostic accuracy, nor in AUCs by the degree of patient selection, index test, or reference standard bias (Fig 5).
Fig 5

Boxplots of the AUC of studies stratified by software design, CXR usage, reference standard, and degree of patient selection, index test, and reference standard bias.

AUC, area under the curve; Vs, versus; CXR, chest x-ray.
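The median [IQR] summaries used to compare groups of studies can be reproduced from a list of study-level AUCs. The sketch below is illustrative only: the AUC values are hypothetical, and the "inclusive" quartile method from Python's standard library is an assumption, since statistical packages differ in how they interpolate quartiles. The p-values quoted in the text would additionally require a two-group test, which is not shown here.

```python
import statistics


def median_iqr(values):
    """Return (median, (q1, q3)) using inclusive quartile interpolation."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical study-level AUCs, for illustration only.
example_aucs = [0.78, 0.82, 0.88, 0.90, 0.99]
median, (q1, q3) = median_iqr(example_aucs)
print(f"median [IQR] AUC: {median:.2f} [{q1:.2f}-{q3:.2f}]")
```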

Discussion

In this systematic review, we sought to determine the diagnostic accuracy of CAD software programs for detecting PTB on CXRs. Due to study heterogeneity, we did not meta-analyze the data. We identified a number of methodological limitations in the existing evidence base. Moreover, we identified a number of study-level factors associated with the reported accuracy, which should be taken into consideration when evaluating future CAD studies. The majority of the CAD evidence base for PTB detection consists of Development studies. While many of these reported some measure of diagnostic accuracy, this was done without assessing the potential risks of bias arising from the databases that were used. Applying QUADAS-2, a widely accepted standardized tool for evaluating the quality of diagnostic studies, we found that potential risk of bias was common in the databases used to evaluate CAD in Development studies. We suggest that future Development studies apply the QUADAS-2 tool to assess the databases for bias (Box 1).

Box 1. Recommendations for CAD accuracy study design elements

All Clinical studies evaluated the same commercially available software, CAD4TB. As noted above, meta-analysis was not completed due to the methodological heterogeneity, the lack of standardized threshold scores, and the variability of software versions used. While the software achieved high sensitivities (0.85 to 1.0), there was a large degree of variability in the reported specificities (0.23–0.69). Furthermore, the analysis in some studies was performed on CXRs from datasets or sites that may have also contributed to training the software, potentially resulting in an overestimation of the predictive power. Lastly, because the populations studied had very high HIV and TB prevalence, the results may have limited generalizability to other populations. We identified a number of study-level factors that were associated with the reported AUC. These included the type of technology used to classify images, and whether it was a Development or Clinical study. The median AUC of DL studies was higher than that of ML studies (p = 0.001), suggesting superior diagnostic accuracy of DL technology. The median AUC of Development studies was higher than that of Clinical studies (p = 0.004). This is likely because of the greater risk of bias in Development studies due to the lack of pre-specified threshold scores, the use of the same databases for training and testing, and the use of a human reader as the reference standard. Our findings also suggested that studies using a human reader reference standard may have systematically overestimated the diagnostic accuracy of CAD, as the median AUC of these studies was higher than that of studies using a microbiologic reference; the difference was not statistically significant, however. We did not find a significant difference in AUCs from studies that used the same CXRs for training and testing.
However, we can extrapolate from other studies that using the same databases for training and testing will result in systematic overestimation of the reported predictive value [73]. We suggest some elements that could improve the clinical applicability of future studies of CAD. Studies should include a description of how CXRs were selected for training and testing. Furthermore, CXRs from distinct databases should be used for training and testing. Ideally, the accuracy of CAD should be evaluated against a microbiologic reference standard. Lastly, if the software has a continuous output, the threshold score used to differentiate between a positive and a negative CXR should be reported, along with how this was determined (Box 1). The US Food and Drug Administration (FDA) requires that all of these standards be met, and additionally requires clear instructions for clinical use, in its guidelines on CAD applied to radiology devices (17). One potential weakness of this review is that we only included studies from the published literature, which could increase the risk that publication bias affected our reported results. Additionally, we restricted our search to English and French studies only. Furthermore, we were unable to complete a meta-analysis of the Clinical studies and hence are unable to comment on the pooled accuracy of CAD. This systematic review highlights the need for additional research on CAD for PTB on CXR. To our knowledge, this is the first study to analyze the quality of current CXR databases that have been used to train and test multiple CAD software tools. We conclude that AI-based CAD programs are promising, but more clinical studies are needed that minimize sources of potential bias to ensure validity of the findings outside of the study setting.

Search strategies.

(PDF)

Extraction form.

(PDF)

Prisma (Preferred reporting items for systematic reviews and meta-analyses) checklist.

(DOCX)

Demographics of CAD4TB studies with microbiologic reference standard.

CAD, computer aided diagnosis; yrs, years; NR, not reported; TB, tuberculosis; HIV, human immunodeficiency virus *This is the median, the mean age was not reported. (PDF)

Selection, enrolment of CAD4TB studies with microbiologic reference standard.

NR, not reported; CAD, computer aided diagnosis; NAAT, nucleic acid amplification test * Patients with an abnormal CXR as per radiologist reading, or presumptive TB based on TB symptoms received culture ** Patients with a normal CXR by CAD received an AFB smear, while patients with an abnormal CXR as per CAD received NAAT. (PDF)

Quality assessment of datasets used to test and train CAD software of development studies: Risk of bias and applicability concerns.

AMPATH, Academic Model Providing Access to Healthcare; CH, Shenzhen Hospital, China; F&T, Find and Treat; IN, Indian collection New Delhi; JSRT, Japanese Society of Radiology; KIT, Korean Institute of Tuberculosis; MC, Montgomery County; YU AWHE, Yonsei University Annual Worker's health examination; SNUH, Seoul National University Hospital; TJH, Thomas Jefferson Hospital dataset; U, unclear; H, high; NA, not applicable; L, low * Calgary dataset included preselected “typical PTB” images ** JSRT data set does not include PTB cases, but rather comprises images with single pulmonary nodules, confirmed by computed tomography and histology as either benign or pathologic. (PDF)

Quality assessment (QUADAS 2) summary of development studies: Risk of bias & applicability concerns.

(PDF)

Quality assessment (QUADAS 2) summary of clinical studies: Risk of bias and applicability concerns.

(PDF)
Recommendations for studies assessing CAD accuracy
    • For the databases used to assess CAD accuracy, describe whether CXR had been used for triage or screening purposes.
    • State whether results of the evaluation being reported are applicable to Triage or Screening CXR use-cases
    • Apply QUADAS-2 to assess the risk of bias in the databases used to evaluate CAD’s diagnostic accuracy
    • Describe how CXRs were selected for training and testing
    • Use different CXRs from separate databases for training and testing
    • Clearly define true positive PTB
    • Use a microbiologic reference standard of culture (preferred) or NAAT
    • For CAD that output a continuous score, preferably pre-specify the threshold used to differentiate between a positive and negative CAD result.
    • For CAD that output a continuous score, report how the threshold score was determined
    • State whether pre-training/verification of CAD with local CXRs is required prior to use in each setting
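The threshold-related recommendations can be illustrated with a minimal Python sketch. It shows how sensitivity and specificity shift with the chosen cut-off on a continuous CAD score, which is why a threshold selected after seeing the results can flatter the reported accuracy, whereas the AUC does not depend on any single threshold. The scores, labels, and cut-offs below are hypothetical, and the 0 to 100 score scale and the function names are assumptions made for illustration.

```python
def sens_spec_at_threshold(scores, labels, threshold):
    """Classify score >= threshold as CAD-positive; labels are 1 (PTB) or 0 (no PTB)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical CAD scores and reference-standard labels, for illustration only.
scores = [20, 35, 45, 55, 60, 70, 80, 90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

low_cutoff = sens_spec_at_threshold(scores, labels, threshold=50)
high_cutoff = sens_spec_at_threshold(scores, labels, threshold=65)
overall_auc = auc(scores, labels)
print(low_cutoff, high_cutoff, overall_auc)
```

In this toy example the lower cut-off favours sensitivity and the higher cut-off favours specificity, so reporting the threshold and how it was chosen is needed to interpret either figure.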