Kanan T Desai1, Brian Befano2,3, Zhiyun Xue4, Helen Kelly1, Nicole G Campos5, Didem Egemen1, Julia C Gage1, Ana-Cecilia Rodriguez1, Vikrant Sahasrabuddhe6, David Levitz1, Paul Pearlman7, Jose Jeronimo1, Sameer Antani4, Mark Schiffman1, Silvia de Sanjosé1,8. 1. Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA. 2. Information Management Services Inc., Calverton, Maryland, USA. 3. Department of Epidemiology, University of Washington School of Public Health, Seattle, Washington, USA. 4. US National Library of Medicine, Bethesda, Maryland, USA. 5. Center for Health Decision Science, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA. 6. Division of Cancer Prevention, National Cancer Institute, Rockville, Maryland, USA. 7. Center for Global Health, National Cancer Institute, Rockville, Maryland, USA. 8. ISGlobal, Barcelona, Spain.
Abstract
There is limited access to effective cervical cancer screening programs in many resource-limited settings, resulting in continued high cervical cancer burden. Human papillomavirus (HPV) testing is increasingly recognized to be the preferable primary screening approach if affordable due to superior long-term reassurance when negative and adaptability to self-sampling. Visual inspection with acetic acid (VIA) is an inexpensive but subjective and inaccurate method widely used in resource-limited settings, either for primary screening or for triage of HPV-positive individuals. A deep learning (DL)-based automated visual evaluation (AVE) of cervical images has been developed to help improve the accuracy and reproducibility of VIA as assistive technology. However, like any new clinical technology, rigorous evaluation and proof of clinical effectiveness are required before AVE is implemented widely. In the current article, we outline essential clinical and technical considerations involved in building a validated DL-based AVE tool for broad use as a clinical test. Published 2021. This article is a U.S. Government work and is in the public domain in the USA. International Journal of Cancer published by John Wiley & Sons Ltd on behalf of UICC.
artificial intelligence; adenocarcinoma in situ; ASCUS‐LSIL Triage Study; atypical squamous cells of undetermined significance; automated visual evaluation; cervical intraepithelial neoplasia; convolutional neural network; deep learning; deoxyribose nucleic acid; diabetic retinopathy; digital single lens reflex; endocervical curettage; female genital schistosomiasis; human immunodeficiency virus; human papillomavirus; high‐risk; losses‐to‐follow‐up; large loop excision of the transformation zone; low‐grade squamous intraepithelial lesion; machine learning; Natural History Study; region of interest; squamocolumnar junction; squamous intraepithelial lesion; transformation zone; visual assessment for treatability; visual inspection with acetic acid; World Health Organization; women living with HIV
INTRODUCTION
Cervical cancer remains a leading cause of women's morbidity and mortality in resource‐limited settings.
The World Health Organization's (WHO) global call to eliminate cervical cancer relies on high coverage of human papillomavirus (HPV) vaccination and screening with accurate and practical technologies to detect and treat precancers.
Existing cervical cancer screening and triage technologies fall into three categories: visual, microscopic (eg, cytology) and molecular (eg, HPV testing).
Visual inspection of the cervix after applying acetic acid (VIA), though widely used in low‐resource settings for primary screening or triage, is poorly reproducible across settings and not reliable in discriminating precancers from benign HPV‐related and “look‐alike” changes.
Cervical cytology as performed in most low‐resource settings has had poor historical impact due to lack of infrastructure, poor quality assurance, need for repeated screening and poor follow‐up of screen positives.
HPV testing is the most sensitive primary screening method for detecting precancers, thus providing long‐term reassurance for HPV‐negative women.
Moreover, HPV testing is compatible with self‐collected vaginal specimens.
However, to avoid overtreatment, HPV positivity is best followed by triage testing to identify the minority of HPV infections linked to precancer.
Deep learning (DL)‐based automated visual evaluation (AVE) of cervical images is emerging as a novel, low‐cost alternative for screening and triage. Machine learning (ML) is a type of artificial intelligence (AI) that uses computers to detect patterns in data without being explicitly programmed to do so. DL, inspired by the network of neurons in the human brain, is a kind of ML method that uses many layers of arithmetic operations to arrive at a model that mimics the pattern identification for which it has been trained. DL has numerous applications in medicine (eg, image recognition algorithms like AVE, automated dual‐stain cytology, diagnostic radiology and automated diabetic retinopathy [DR] screening).
In a DL model for image recognition, information on different characteristics (eg, texture, edges and curves) associated with the target of interest is gathered from individual pixels in an image through successive layers. Through big data and advanced computational resources, these elements, combined in what we call an algorithm, are analyzed to provide an accurate diagnosis for previously unseen images.
AVE as an assistive technology to VIA offers an opportunity to improve VIA to create a screening process that supports accelerated control of cervical cancer.
General reporting guidelines for clinical trials with AI interventions have been published previously. This article, however, outlines our collective view of the considerations required specifically for developing and adopting a DL‐based AVE algorithm for cervical precancer detection. We aim specifically to ensure its applicability as a well‐validated clinical test in cervical cancer screening programs globally, although most principles are likely to be applicable to any AI‐based clinical test. The text in this article elaborates on an accompanying checklist to guide the development and validation of an effective and clinically relevant DL‐based AVE algorithm. In particular, we wish to caution clinicians and policymakers about the need to evaluate the clinical effectiveness and applicability of these tools when they are applied in cervical cancer screening programs, to avoid premature introduction (Table S1).
STEP‐WISE CONSIDERATIONS FOR AI‐BASED AVE
Before training the algorithm
The indicated use of AVE
Detecting and treating precancer is the main aim of cervical screening.
However, the point‐prevalence of precancer, even in previously unscreened populations, is only ~1% in the general population, and ~2.5% among women living with HIV (WLWH).
Therefore, as a general screening tool, AVE needs to detect precancers sensitively, but with the perspective that almost all screened women (>95%) will never develop cervical cancer.
In contrast, among HPV‐positive women, the prevalence of precancer increases considerably, from ~1% to >5%.
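The effect of this prevalence shift on a positive result can be worked through with Bayes' rule. The sketch below is illustrative only: the sensitivity and specificity values are hypothetical placeholders, not measured AVE performance, and the function name is ours.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(precancer | positive test) using Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, chosen only for illustration.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Prevalence values follow the approximate figures quoted in the text.
for setting, prevalence in [("general screening", 0.01),
                            ("HPV-positive triage", 0.05)]:
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"{setting}: prevalence {prevalence:.0%}, PPV {ppv:.1%}")
```

Under these assumed values, the same test yields a positive predictive value of roughly 8% at 1% prevalence and roughly 32% at 5% prevalence, which is one way to see why the indicated use matters.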
Based on the well‐established role of HPV as a necessary cause in cervical carcinogenesis, together with the evidence of long‐term negative predictive value of HPV tests (virtually zero risk over 5 years), an ideal use‐case of AVE is for triage of HPV‐positive women (Box 1).
HPV testing for carcinogenic HPV types is the most sensitive method for cervical cancer screening, providing many years of reassurance (negative predictive value). Therefore, HPV testing is a desirable primary screening test, mainly when few screening rounds are possible.
Currently, cost is the prohibitive factor in adopting HPV testing as a primary screening test in many low‐resource settings. However, based on available tests that cost <5 US dollars, take <1 hour to perform and offer partial HPV genotyping, even lower‐cost, point‐of‐care HPV tests will likely be widely available in a few years.
HPV infection is too common to treat all infected women, most of whom do not need treatment, particularly given possible iatrogenic harms. Relying on negative HPV testing to reassure most women against cervical cancer risk permits public health efforts to focus on the triage of HPV‐positive women with newer technologies like HPV typing and AVE.
Risk‐informed hierarchical partial genotyping of HPV, if incorporated with minimal additional cost into HPV testing, provides important risk stratification useful for triage of HPV‐positive women.
Even among the types of HPV defined as carcinogens, there are at least four distinguished categories based on the risk of invasive cancers. HPV16 (species alpha‐9) is uniquely carcinogenic with the highest risk of cervical precancer and cancer, causing ~60% of squamous cancers. HPV18 and HPV45 (species alpha‐7) cause ~15% of squamous cancers and with HPV16 also account for >90% of adenocarcinomas.
The types of HPV closely related genetically to HPV16, namely, HPV31, HPV33, HPV35, HPV52 and HPV58, account for another ~15% of squamous cancers and are conceptually worth distinguishing from the lower risk, minimally carcinogenic types (HPV39, HPV51, HPV56, HPV59 and HPV68), accounting for ~5% of squamous cancers.
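As a minimal sketch of how such hierarchical genotyping could be encoded, the following assigns a woman to the highest‐risk group among the HPV types detected. The grouping follows the four categories listed above; the group names, function name and data layout are illustrative assumptions, not part of any published assay.

```python
# Risk groups in descending order of risk, following the four categories
# described in the text. Group names are illustrative labels only.
HPV_RISK_GROUPS = [
    ("HPV16", {"16"}),
    ("HPV18/45", {"18", "45"}),
    ("HPV16-related", {"31", "33", "35", "52", "58"}),
    ("other carcinogenic", {"39", "51", "56", "59", "68"}),
]


def hierarchical_risk_group(detected_types):
    """Return the highest-risk group containing any detected HPV type."""
    detected = {str(t) for t in detected_types}
    for name, members in HPV_RISK_GROUPS:
        if detected & members:
            return name
    return "no listed type detected"


# A co-infection is assigned to the highest-risk group present.
print(hierarchical_risk_group([51, 16]))
print(hierarchical_risk_group(["52", "39"]))
```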
Of note, HPV35 is particularly pernicious for women of African origin.
It is pertinent to note that if AVE is used alone for primary screening, “look‐alike” confounding conditions like severe cervicitis could lead to overtreatment of many women with benign conditions unrelated to cervical cancer. Hence, AVE is best used as a triage test for the relevant set of HPV‐positives. Cervical sampling for HPV testing abrades the cervix's critical transformation zone (TZ; where most cancers arise), complicating the use of AVE for triage. Fortunately, vaginal sampling, either by the woman herself or a clinician, has now been convincingly shown to be almost equivalent to clinician sampling of the cervix when a sensitive HPV DNA test is used.
In addition, self‐sampling has also been demonstrated to permit very high‐throughput cervical screening in a COVID‐safe manner.
Recognizing the eventual importance of vaginal HPV testing, we aim to develop a screening strategy using HPV self‐sampling, with risk‐informed partial HPV typing and AVE. When used sequentially in combination, these tests will classify each woman into risk strata (from highest to lowest probability of precancer) to guide treatment and limit overtreatment.
If found to be effective, this screening strategy is envisioned by our group to be scaled up in a community‐based campaign combining: (a) screen‐and‐treat screening of mid‐adult women (ie, 25 or 30 to around 45 or 50 years of age), and (b) single‐dose vaccination of multiple birth cohorts of girls and younger women to induce herd protection. Such a conjoined primary and secondary prevention effort is likely to lead to accelerated cervical cancer control in low‐resource settings.
The prevalence of visual abnormalities and precancer increases further in a colposcopy clinic, where most women have been referred for equivocal or minor cytologic abnormalities such as HPV‐positive atypical squamous cells of undetermined significance (ASC‐US) or squamous intraepithelial lesion (SIL), respectively.
Within this context, women referred for colposcopy also have an increased prevalence of cervical visual abnormalities regardless of final diagnosis. Therefore, an AVE algorithm trained for indicated use in general screening should not be assumed to be suitable for use as a tool for triage in a colposcopy setting and vice versa unless the accuracy of both approaches is explicitly demonstrated in a formal evaluation (Figures 1 and S1).
FIGURE 1
Scatterplot of algorithm scores and disease status provided in two different settings: (A) general screening and (B) triage. In (A) scores cluster in each disease state and differentiate precancer from the rest. In (B) data across disease status are sparser, and the distinction between the three disease strata is less apparent.
[Color figure can be viewed at wileyonlinelibrary.com]
In addition, until sufficient supportive evidence accumulates regarding accuracy, reliability and portability of the method to different settings, AVE is best used as an ancillary technology to aid health workers performing VIA to improve their accuracy, rather than as a standalone tool.
Clarifying target population for using AVE
Any visual cervical screening method, including AVE, works best when applied within an appropriate age range (eg, 25‐49 years). Within this age range, HPV infections are more likely to be clinically meaningful than at younger ages, when transiently detectable HPV infections are extremely prevalent but cancer is very rare. Moreover, prominent glandular epithelium (“ectopy” or “ectropion”), common at younger ages, may lead to false‐positive AVE predictions. Also, in mid‐adulthood compared to older ages, the squamocolumnar junction (SCJ), at which most cancers arise, is frequently still fully visible, and lesions, if detected, can still be treated safely without disproportionate risk of damaging atrophic pelvic structures. Using an AVE algorithm on cervical images when the SCJ, the main site of cervical cancer, is no longer visible, as occurs with aging, may lead to false‐negative AVE predictions (especially when the visible epithelium covering the cervix appears completely “pink” and normal); hence, negative results in such women should be reported with caution (Figure 2A).
FIGURE 2
Factors affecting image interpretations by AVE. (A) The impact of age: With increasing age, the SCJ moves inside the endocervical canal, leaving a pink appearance of mature squamous epithelium that is likely to provide false reassurance through a negative AVE prediction. On the other hand, ectopy at younger ages is likely to lead to false‐positive results on AVE. Examples: (A1) Type III TZ (AVE prediction = negative, final diagnosis on endocervical curettage [ECC] histopathology = precancer), (A2) Ectopy (AVE prediction = precancer, final diagnosis on ECC histopathology = normal); (B) The impact of quality: AI will give a prediction on any input, including images in which even humans would fail to identify the region of interest, due either to technical factors making the image not even recognizable as a cervix or to anatomic factors obstructing the SCJ. Examples: (B1) Excessive blur (AVE prediction = negative), (B2) Bad angle with an undetectable cervical os (AVE prediction = negative), (B3) Speculum reflection and glare, (B4) Non‐cervix image (AVE prediction = negative); (C) The impact of obstruction of the SCJ: Examples: (C1) Cervical polyp (AVE prediction = negative), (C2) Uterine fibroid, (C3) Menstrual blood plugging the os.
[Color figure can be viewed at wileyonlinelibrary.com]
Aligning the AVE classification categories with the natural history of HPV and cervical carcinogenesis
Detection of precancer is the main objective of cervical screening. However, based on the natural history, there are four biologically distinct stages in the pathogenesis of cervical cancer, which are: (a) normal cervix; (b) infection with high‐risk (HR)‐HPV (very common); (c) precancer, defined as transforming HR‐HPV infection associated with lesions with a high‐likelihood of invasion if left untreated (uncommon) and (d) invasive cervical cancer (comprising a small minority of cases compared to precancers).
Each stage in the carcinogenic process can be linked to a distinct clinical management action in screening programs: (a) reassurance for women with a normal cervix; (b) triage of HR‐HPV infections; (c) treatment of precancer and (d) advanced treatment of cervical cancer.
Therefore, the success of AVE can be related to its assignment of a screened individual to the proper stage linked to distinct management actions.
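The correspondence between stage and management action can be written down as a simple lookup, as in the sketch below. The class and function names are illustrative assumptions; the stages and actions are those listed above.

```python
from enum import Enum


class Stage(Enum):
    """The four natural history stages described in the text."""
    NORMAL = "normal cervix"
    HR_HPV_INFECTION = "HR-HPV infection"
    PRECANCER = "precancer"
    CANCER = "invasive cervical cancer"


# Management action linked to each stage, as listed in the text.
MANAGEMENT = {
    Stage.NORMAL: "reassurance",
    Stage.HR_HPV_INFECTION: "triage",
    Stage.PRECANCER: "treatment of precancer",
    Stage.CANCER: "advanced treatment of cervical cancer",
}


def management_for(stage):
    """Return the management action for a predicted stage."""
    return MANAGEMENT[stage]


def changes_management(predicted, true):
    """True if a misassigned stage would lead to a different action."""
    return management_for(predicted) != management_for(true)


print(management_for(Stage.PRECANCER))
print(changes_management(Stage.NORMAL, Stage.PRECANCER))
```

Framed this way, a classification error matters clinically whenever it moves a woman to a stage with a different management action.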
Reference standard for defining the AVE classification categories
AVE needs to be trained on representative cervical images of each of the four natural history categories shown in Figure 3, which must be defined clearly to avoid misclassification by teaching (“training”) the AVE on incorrect labels. In this regard, defining the cervical carcinogenesis stages based on nonreproducible historical grading systems (eg, dysplasia or cervical intraepithelial neoplasia [CIN] stages) is no longer optimal. Rather, the four stages can be defined as follows.
FIGURE 3
The AVE classification categories are expected to be consonant with the four biologically distinct stages in the natural history and pathogenesis of cervical cancer. Reprinted with permission from Schiffman et al; histopathology image source: Desai et al. [Color figure can be viewed at wileyonlinelibrary.com]
Invasive cervical cancer is defined histologically unless the clinical picture is so severe that surgical pathology is not obtained.
Precancer is defined stringently as histopathologic CIN3/AIS (adenocarcinoma in situ), since most histopathologic CIN3/AIS cases contain the same HPV types found in invasive cancers.
Moreover, CIN3/AIS histopathologic diagnosis of precancer is reasonably reproducible without resorting to expensive molecular markers of cellular transformation (eg, viral methylation and viral DNA integration). Additionally, selected high‐risk histopathologic CIN2, if the diagnosis is corroborated by expert gynecologic pathologist review and accompanied by positivity for the highest‐risk HPV types, is likely to represent precancer. However, one needs to be cautious in including all CIN2 as a precancer target because CIN2 is a poorly reproduced diagnosis comprising a mixture of high‐grade lesions and regressive low‐grade lesions (associated with noncarcinogenic HPV types such as HPV53), the latter creating a phenocopy of early precancer. For colposcopic biopsy to be sensitive, multiple biopsies of all visible lesions (identified by their turning white after application of vinegar, called acetowhitening) are necessary, rather than targeting of only the most severe‐appearing lesion. Clinician colposcopic impressions, even when performed by experienced gynecologists, are subjective and variable in distinguishing precancer from benign HPV‐related changes and “look‐alike” conditions. An algorithm trained on target class definitions based on human interpretation of cervical images instead of histopathologic diagnosis, particularly for the “precancer” target, will be restricted by the same limitations in accuracy and intraobserver and interobserver variability as other visual methods (eg, VIA). Thus, multiple biopsies and a histopathologic definition of precancer are preferable to a high‐grade colposcopic impression.
However, histopathology cannot define the normal cervix, as most normal women are never thoroughly biopsied. Since the negative predictive value of the HPV test is very high, the ideal definition of “normal” (in the sense of virtually no imminent risk of cancer) is images from confirmed HR‐HPV‐negative women.
Alternatively, in the absence of HPV results, the absence of any acetowhitening (ie, an entirely “pink” cervix) on expert review of images from women at a general screening clinic can be used to define normal, because acetowhitening is a sensitive measure of the risk of precancer, and the chance of finding CIN3/AIS in women at general screening with no cervical acetowhitening is very low.
Once cancer and precancer are defined histologically (and ideally virologically as well), and the normal cervix is defined visually, the remaining category can be conceived of as “HPV‐related and other equivocal changes.” Histopathology has limitations in defining this category due to subjectivity in microscopic diagnosis and biopsy placement errors (eg, targeting only the worst appearing lesions). In our experience, an algorithm not trained explicitly to recognize these “equivocal” images tends to give extremely erratic predictions on them (Figure S2). Since it is in this “equivocal” zone where the experts also struggle the most and since the associated risk of cervical cancer is likely to be intermediate (ie, nonzero but much lower than precancer), it is desirable to train on cervical images with acetowhite changes as a separate target class interposed between the “normal” and “precancer” targets. Ongoing work by our group is addressing how best to include this equivocal class in training (ie, training a multiclass ordinal classifier).
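One common way to represent ordered classes for training is cumulative binary encoding, in which each target answers "is the risk rank above threshold k?". The sketch below illustrates that generic encoding and its decoding; it is not a description of the approach our group is evaluating, and the class names, function names and cutoff are assumptions made for illustration.

```python
# Classes ordered by increasing risk; names are illustrative.
CLASSES = ["normal", "equivocal", "precancer"]


def ordinal_targets(label):
    """Encode a label as cumulative binary targets, one per threshold."""
    rank = CLASSES.index(label)
    return [1 if rank > k else 0 for k in range(len(CLASSES) - 1)]


def decode(threshold_probs, cutoff=0.5):
    """Map per-threshold probabilities back to a class.

    The predicted rank is the number of consecutive thresholds whose
    probability reaches the (hypothetical) cutoff.
    """
    rank = 0
    for prob in threshold_probs:
        if prob < cutoff:
            break
        rank += 1
    return CLASSES[rank]


for name in CLASSES:
    print(name, ordinal_targets(name))
print(decode([0.9, 0.2]))
```

With this encoding, an error between adjacent classes (eg, normal vs equivocal) changes fewer targets than an error between the extremes, which reflects the ordering of risk.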
Choosing images and metadata for training the algorithm
Size, source and representativeness of the dataset
The number of learned parameters in a DL algorithm typically ranges up to millions, compared to tens in a traditional multivariate model. Although it is difficult to predict the exact number, the number of representative images with truth labels required to build an accurate yet generalizable AVE algorithm via a DL approach can be assumed to be in the hundreds or greater for each target class to achieve satisfactory disease discrimination. It is worth recalling that, even in high‐burden settings, cervical precancer is relatively uncommon; thus, ethical acquisition of accurately labeled, representative case images is challenging.
Image quality evaluation and pre‐exclusion
The provider's training to capture good quality images is a first step for AVE's successful application. However, when an AI‐based image recognition tool is applied in real‐world clinics, variation in the quality of images is inevitable. In addition to user training, image quality is affected by the lighting (eg, external ring light vs built‐in camera flashlight, shade of white light), the image capture device and postcapture processing of images by device‐specific software, anatomic variation, speculums (eg, metallic vs transparent plastic) and so on.
Without a quality check, AVE will provide a prediction for any image given to it as an input, including images not even recognizable as cervix and images with a completely obstructed region of interest (ROI) (ie, SCJ) (Figure 2B,C). Therefore, a manual or automated gatekeeping mechanism should be in place to exclude poor‐quality images from training and evaluation to minimize false predictions.
Various parameters define the image quality, such as blur, Gaussian noise, resolution, color, angle and glare/reflections; not all affect the AVE's performance equally. The composite minimal image quality standards needed to obtain good performance from AVE are a topic of ongoing research.
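As a rough illustration of automated gatekeeping on a single parameter (blur), the sketch below scores sharpness as the variance of a 4‐neighbor Laplacian response, a widely used focus heuristic. It is not the quality classifier used for AVE; the threshold is a hypothetical placeholder that would need calibration per device, and the pure‐Python image representation is chosen only to keep the example self‐contained.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over interior pixels.

    gray: list of equal-length rows of grayscale intensities, at least
    3 x 3. Low values suggest few sharp edges (possible blur).
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def passes_quality_gate(gray, min_sharpness=100.0):
    """Accept an image only if its sharpness score meets the threshold.

    min_sharpness is a hypothetical value, not a validated standard.
    """
    return laplacian_variance(gray) >= min_sharpness


# Toy inputs: a featureless image and a high-contrast checkerboard.
flat = [[128] * 6 for _ in range(6)]
checker = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]
print(passes_quality_gate(flat), passes_quality_gate(checker))
```

A practical gate would combine several such checks (eg, glare, framing, presence of a cervix) and route rejected images back to the provider for recapture rather than to the classifier.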
Choosing DL methods for training AVE
Training a DL algorithm is more complex than the simple explanation given previously. Multiple technical choices need to be made while training the algorithm (Box S1), which may have implications for interpreting the output (Figure S3). The aim is: (a) to achieve accurate and reliable prediction on hold‐back images from the same database as the training set (called “internal validation”), and (b) not to lose generalizability on new images from different databases (“external validation”). Ongoing work from our group is exploring the optimal DL approach to train an AVE algorithm to achieve maximum risk discrimination that has external validity.
In addition, the choice of methods has implications for the time and computational requirements of running the algorithm. Ideally, a scalable AVE algorithm should be able to run as a standalone app (without internet) on the image‐capture device itself, providing quick (within a few seconds), real‐time predictions for on‐site patient management to minimize losses to follow‐up.
Validation of the output of the algorithm
Reproducibility of AVE
The essential first parameter in assessing AVE's validity, as for any medical test, is reproducibility. Like a thermometer that gives a consistent reading of body temperature on repeated measurement of the same person, an AVE algorithm should give virtually identical outputs when asked to predict the same image repeatedly. However, in the case of near‐duplicate images (ie, images collected consecutively from a woman under the same image capture protocol), subtle changes in the numerical pixel values of the image due to changes in body or camera position may alter the AVE predictions, especially for equivocal images, despite the visual similarity of the images to the human eye. Clinically, it would be confusing to the user if an AVE algorithm were to label one image as precancer and a near‐duplicate image (or the same image in a different run) as normal (Figure S4). Therefore, before its use for clinical decision‐making, AVE's robustness for near‐duplicate pairs of images should be measured and reported.
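A minimal sketch of how repeatability on near‐duplicate pairs might be summarized is given below. It assumes each image receives a continuous score between 0 and 1 and that a single cutoff defines a positive call; the scores, cutoff and function name are invented for illustration.

```python
def repeatability_summary(pairs, cutoff=0.5):
    """Summarize agreement between scores of near-duplicate image pairs.

    pairs: list of (score_a, score_b) tuples, one per woman.
    cutoff: hypothetical score threshold for a positive call.
    """
    diffs = [abs(a - b) for a, b in pairs]
    discordant = sum((a >= cutoff) != (b >= cutoff) for a, b in pairs)
    return {
        "n_pairs": len(pairs),
        "mean_abs_diff": sum(diffs) / len(pairs),
        "max_abs_diff": max(diffs),
        "discordant_fraction": discordant / len(pairs),
    }


# Made-up scores: the middle pair straddles the cutoff.
example_pairs = [(0.10, 0.12), (0.48, 0.55), (0.90, 0.88)]
print(repeatability_summary(example_pairs))
```

Reporting the discordant fraction alongside the score differences separates small numeric jitter from disagreements that would change the clinical call.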
Internal validity of AVE
To “teach” the algorithm to recognize the target of interest, we provide it with sets of “labeled” cervical images in each target class as a “training” set (to learn the features associated with the outcome of interest) and a “validation” set (to iteratively check on and optimize the algorithm's performance as part of training). It is important to note that the validation set is not a true blinded test set. Performance achieved by the algorithm on the validation set is likely to be misleading and over‐optimistic. When the “training‐validation” set is limited, an algorithm is prone to overfitting to the image features in the “training‐validation” set and may completely fail on a third, independent “hold‐back” set of previously unseen (ie, blinded) images from the same database as the training and validation sets. Therefore, it is essential to assess AVE's performance on an independent, completely blinded “test” set of images not included in the “training‐validation” process (Figure 4A) to obtain a realistic estimate of the internal validity of AVE on a dataset.
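One way to keep the hold‐back set independent is to assign images to splits by woman rather than by image, so that near‐duplicate images of the same cervix cannot appear on both sides. Splitting at the patient level is our assumption for this sketch, as are the split fractions, identifiers and function names.

```python
import hashlib


def assign_split(patient_id, train=0.70, validation=0.15):
    """Deterministically map a patient ID to train/validation/test.

    Hashing the ID gives a stable pseudo-random value in [0, 1), so the
    same woman always lands in the same split. Fractions are examples.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


def split_images(images):
    """Group image IDs by split. images: iterable of (image_id, patient_id)."""
    splits = {"train": [], "validation": [], "test": []}
    for image_id, patient_id in images:
        splits[assign_split(patient_id)].append(image_id)
    return splits


# Hypothetical records: two images per woman.
records = [(f"img{i}", f"patient{i // 2}") for i in range(20)]
for name, ids in split_images(records).items():
    print(name, len(ids))
```

The test split would then be set aside and scored only once, after all training and tuning decisions have been fixed on the training and validation splits.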
FIGURE 4
(A) AUC results for the discrimination of disease vs no disease in a validation set and a test set. Notice that the AUC value for images from the same study decreases from 0.94 to 0.86 when the algorithm is tested on a hold‐back test set of images that were not used at all during the training and validation of the algorithm. (B) Score values obtained in a binary classification algorithm trained on cervigram images. AVE prediction scores are presented per definite case, definite control, equivocal case and equivocal control. When using a selective set of clearly defined cases (precancers) and controls (normal), the algorithm easily discriminated between disease strata, but when equivocal images were added, as would occur in a real‐life scenario, the score distribution tended to be wider and less discriminative of disease status.
[Color figure can be viewed at wileyonlinelibrary.com]
In addition, it is important to include a realistic set of images in the “test” set on which the performance is finally evaluated. For example, we may observe good case‐control discrimination by AVE on a “restricted” test set, including only the clearest examples of high‐grade cases (CIN3) and HPV‐negative controls. However, it is important to realize that many cervix images will fall into an equivocal intermediate zone, including HPV infection, cervicitis and low‐grade changes. Without examining the discrimination and reproducibility achieved by AVE in this intermediate zone of “not so clear” case or control assignment (ie, due to noisy data), where even expert colposcopists struggle the most, promising claims about the algorithm's capacity could be misleading for real‐life implementation (Figure 4B).
A critical aspect of evaluating the algorithm's performance is choosing the appropriate statistical approach. First, AVE class predictions can be compared against the reference standard (eg, histopathology) in a comprehensive independent test set. Second, the AVE, to be worth adopting, would ideally demonstrate consistently superior performance to the existing standard of care (eg, unaided VIA as practiced in the setting), and at least noninferior performance to expert clinicians (eg, colposcopists). Of note, AVE is not limited in performance by human factors such as fatigue, mental stress and so on; hence, it is theorized to have lower intraobserver variability in addition to lower interobserver variability than VIA and colposcopy, leading to higher consistency.
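The contrast between a restricted and a realistic test set can be illustrated with a small area under the curve (AUC) calculation. The sketch below computes AUC as the probability that a randomly chosen case scores higher than a randomly chosen control (ties counted as half); all scores are invented for illustration and do not come from any AVE study.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented scores for illustration only.
definite_cases = [0.90, 0.80, 0.85]
definite_controls = [0.10, 0.20, 0.15]
equivocal_cases = [0.55, 0.40]
equivocal_controls = [0.50, 0.60]

restricted = auc(definite_cases, definite_controls)
realistic = auc(definite_cases + equivocal_cases,
                definite_controls + equivocal_controls)
print(f"restricted test set AUC: {restricted:.2f}")
print(f"realistic test set AUC:  {realistic:.2f}")
```

In this toy example, discrimination is perfect on the restricted set and lower once the equivocal images are included, mirroring the pattern described for Figure 4B.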
External validity (generalizability) of AVE and avoiding overfitting
Verifying an AVE algorithm's performance is a two‐step process. Testing the algorithm's performance (achieving accurate predictions without overfitting) on an independent “test” set of images derived from the same source as the training set (called “internal validation”) is a crucial first step, but not a final benchmark. This test set will be limited by the same “finite” representation and idiosyncratic random variations as the “training” set. Thus, the process does not reflect true validation of how an algorithm will perform in actual clinical practice, with “infinite” variations in patient characteristics, user training and image capture protocols. For example, an AVE algorithm that is overfitted to a particular set of images from a clinic will learn to recognize random (ie, nonrelevant) variations in that training set that distinguish precancer from normal, but these distinctions do not necessarily generalize to identifying patterns associated with precancer in other settings (eg, images from different clinics captured under different light sources by different providers) (Figure S5). Therefore, to assess true generalizability, one needs to evaluate the AVE algorithm's performance on a diverse set of images from various clinical settings worldwide. In addition, ideally, multiple independent formal efficacy assessments should demonstrate replicability of the results.
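One simple way to look for a gap between internal and external validation is to report performance separately by image source. The sketch below is illustrative only: the site names, records and the helper `accuracy_by_site` are assumptions, and plain accuracy is used for brevity where a real assessment would use the clinically relevant metrics against the reference standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical records: (site, ave_prediction, reference), 1 = precancer, 0 = normal.
# "clinic_A" stands for the source of the training images (internal validation);
# "clinic_B" stands for an external setting. All values are invented.
RECORDS = [
    ("clinic_A", 1, 1), ("clinic_A", 0, 0), ("clinic_A", 1, 1), ("clinic_A", 0, 0),
    ("clinic_B", 1, 1), ("clinic_B", 1, 0), ("clinic_B", 0, 1), ("clinic_B", 0, 0),
]


def accuracy_by_site(records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each site."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for site, prediction, reference in records:
        total[site] += 1
        correct[site] += int(prediction == reference)
    return {site: correct[site] / total[site] for site in total}


per_site = accuracy_by_site(RECORDS)
# A large drop from the internal to the external site suggests overfitting
# to source-specific variation rather than to disease-related patterns.
generalization_gap = per_site["clinic_A"] - per_site["clinic_B"]

print(per_site)
print("gap:", generalization_gap)
```

The same stratified reporting could be applied to other sources of variation named in the text, such as light source or provider, provided those attributes are recorded with each image.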
Device portability of AVE
The AVE algorithm works at the pixel level (ie, it compares and contrasts differences in the pixels of the images and how these differences affect classification). Therefore, an AVE algorithm trained on images from one type of image capture device tends to be overfitted to the features (ie, pixel patterns) of that particular device and works well only on that device. A device‐agnostic AVE algorithm (comparable to the current advanced state of facial recognition, likely achieved because millions of images are available for training) that can read accurately across different image capture devices (Figure S6) with minimal adaptation is a critical subcomponent of AVE's generalizability. Such a device‐agnostic algorithm does not yet exist for evaluation of cervical images and is a subject of active research. Unless efforts to develop a device‐agnostic algorithm are successful, a dedicated image capture device or devices, with algorithms trained on their image types, will need to be used to ensure accurate and time‐stable AVE performance.
Anatomical and biologic confounding factors and effect modifiers for AVE
Several patient characteristics may contribute to the erroneous classification of a given image by AVE (Figure S7). For example, ectopy among young women may result in a high AVE severity class prediction due to the ruddy glandular epithelium extending onto the ectocervix. Similarly, severe cervicitis and female genital schistosomiasis (FGS) can be misclassified as precancer, and certain noncarcinogenic HPV types (eg, HPV71) with no relationship to cervical cancer may cause warty cervical lesions, resulting in an erroneously high AVE severity class prediction.
The co‐existence of human immunodeficiency virus (HIV) infection with HPV infection is probably the most important known difficulty in AVE classification. Owing to shared risk factors for acquisition of infection, women living with HIV (WLWH) have a 2‐fold increased likelihood of acquiring HPV. Due to HIV‐associated immunosuppression, they have a greatly elevated risk of persistent HPV and precancer, leading to a 6‐fold increased risk of cervical cancer compared to HIV‐negative women. Precancerous lesions tend to be more severe in WLWH, affecting the anogenital epithelium more widely. WLWH also have a high risk of co‐infection with other cervical sexually transmitted infections, which can cause cervicitis that may affect the visual appearance of the cervix.
HIV, FGS and cervicitis are also highly prevalent in areas with a high burden of cervical cancer (eg, schistosomiasis and HIV in sub‐Saharan Africa, cervicitis in India). Therefore, it is critical to evaluate the need to train the AVE to control for these factors as much as possible, and whether a subgroup‐specific AVE is needed. Although beyond the scope of this discussion, the use of DL algorithms to diagnose FGS is under consideration.
Risk prediction: “calibration” of AVE
Before adopting any DL‐based diagnostic tool in clinical practice, the clinician should ask what the tool is measuring (ie, its output) and what the limitations of its interpretation are for clinical decision‐making. The major goal of AVE, ideally, is to directly predict the risk (conceptually a continuous probability from 0 to 1) of a woman having a precancer today while providing some reassurance (ie, negative prediction) for the future. However, the current classifier AVE algorithm approach is trained to predict discrete target classes (eg, histopathologic cancer or precancer, low‐grade, normal). Such a classifier AVE provides a score associated with each target class it is trained to predict. It is important to understand that these scores themselves are not true risk estimates (ie, a woman with a “raw” score of 0.9 associated with the “precancer” class does not necessarily have a 90% probability of precancer) and are not reliably portable. In order to obtain a clinically meaningful, reliable and portable estimate of the true risk of precancer from a classification network, the final AVE class label prediction needs to be translated into a risk value, taking into account other co‐factors (eg, age, HIV status, HPV status, HPV types), if available (Figure 5), to accurately discriminate low‐risk from high‐risk individuals for risk‐based clinical management. In a well‐calibrated model, the observed number of women with precancer among all women with a given class prediction matches the number expected from the absolute risk prediction for that class (for example, 90 women with precancer observed out of 100 women with a precancer class prediction carrying a predicted risk of 90%). Such multivariate models are not yet validated.
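The observed‐versus‐expected comparison that defines calibration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the class labels, counts and expected risks are invented (the 90‐of‐100 figure mirrors the example in the text), co‐factors such as HPV type are omitted, and `observed_risk_by_class` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical validation records: (ave_class_label, has_precancer_on_histopathology).
RECORDS = (
    [("precancer", True)] * 90 + [("precancer", False)] * 10
    + [("normal", True)] * 2 + [("normal", False)] * 98
)

# Hypothetical absolute risks that the tool reports for each class label.
EXPECTED_RISK = {"precancer": 0.90, "normal": 0.05}


def observed_risk_by_class(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the observed proportion of women with precancer per class label."""
    n_total: Counter = Counter()
    n_cases: Counter = Counter()
    for label, has_precancer in records:
        n_total[label] += 1
        n_cases[label] += int(has_precancer)
    return {label: n_cases[label] / n_total[label] for label in n_total}


observed = observed_risk_by_class(RECORDS)
# Calibration gap: observed minus expected risk; values near 0 indicate good calibration.
calibration_gap = {label: observed[label] - EXPECTED_RISK[label] for label in observed}

for label in observed:
    print(label, "observed:", observed[label], "expected:", EXPECTED_RISK[label])
```

In this toy data the “precancer” class is well calibrated, whereas the “normal” class overstates risk (5% expected, 2% observed). In practice each stratum would need enough women for a stable estimate, and the check would be repeated in every new setting, because calibration established in one population does not necessarily carry over to another.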
FIGURE 5
A recommended approach for cervical cancer screening based on HPV genotyping and AVE. HPV extended genotyping provides a risk stratification that, when added to the AVE class label prediction, provides 17 risk strata. When each stratum is calibrated to represent the absolute probability of a woman having a precancer (ie, risk), a direct risk‐based clinical management decision can be made, tailored to resource availability. Reprinted with permission from Wentzensen et al.
Predicting immediate vs future risk
AVE algorithms have commonly been trained cross‐sectionally on the woman's present status and images, rather than on a longitudinal set of images per woman, and are therefore likely to predict only a prevalent risk of precancer. At present, it is not known for how long a negative AVE test confers reassurance. It is unlikely to last as long as the reassurance from a negative HPV test.
Field implementation
The considerations described here are mainly focused on the technical efficacy of AVE. When scaled up for implementation outside research settings, even well‐validated algorithms will face many challenges (eg, data privacy, patient acceptability and provider training), as observed in other medical fields. For example, even a diabetic retinopathy (DR) screening algorithm that was highly accurate inside a computer lab has been documented to fail in field clinics due to practical challenges. Some of these challenges present valuable parallels to the AVE implementation work. For example, the DR algorithm could not read a high proportion of images due to poor quality attributed to variation in lighting conditions across the field clinics, differentially affecting retinal dilation. Also, it was sometimes impossible to take a single image capturing the entire field of view (eg, retina), leading to a failed prediction by the algorithm. These image quality issues lead to a dilemma similar to that in AVE: balancing the risk of predictions based on imperfect data against the risk of inaction from losses‐to‐follow‐up (LFU) during referrals. A balance must be struck between users' efforts to improve image quality and making the algorithm robust enough to tolerate “less than perfect” images. The delayed or failed image analysis on a cloud‐based DR algorithm due to poor internet connectivity at the field clinics is another parallel with AVE, reinforcing our group's insistence that the AVE algorithm must run on local hardware with sufficient processing power, without internet connectivity.
The challenges encountered in developing a robust and reliable DR algorithm also have many analogs for the AVE development effort. Some of these challenges are: generating reproducible ground truth labels with high interobserver agreement among the experts for training the algorithm, particularly for classes with high interclass similarities (eg, hard vs soft exudates); difficulties in detecting lesions in the presence of noise (eg, optical reflections) and commonly encountered nonlesion structures (nerve fiber reflections, vessel reflections and drusen); and developing a generalizable algorithm that could work accurately across inevitable common variations in the clinical environment (eg, images collected from multiple centers on machines ranging from smartphone cameras to high‐end fundoscopes).
The main considerations specific to implementation of AVE for cervical cancer screening are human resource capacity‐building to manage screen‐positive women detected by AVE, developing data management systems to support tracking of women needing referral, and cost‐effectiveness analysis to evaluate AVE's impact in real‐life programs.
It is important to emphasize that to prevent cancer we need to detect precancerous lesions and treat them adequately. Absence of treatment is a major and unfortunately very common reason for the failure of screening programs. For women requiring treatment, thermal ablation using a battery‐operated mobile device is currently the most portable option, given that it is safe, effective, affordable and does not require sophisticated equipment. However, because not all women are eligible for thermal ablation due to abnormalities or benign changes on the cervix, local providers will need to identify which women require referral for further evaluation and more invasive treatments (eg, conization, Large Loop Electrosurgical Excision of the Transformation Zone [LLETZ]), which are unavailable in many resource‐limited settings. For providers, this assessment is prone to variability and challenges. DL‐based AVE based on expert reviews of cervical images is under development to predict a woman's eligibility for treatment with ablation; an initial pilot suggests good performance.
CONCLUSIONS
DL‐based AVE of cervical images is a promising but still evolving clinical test. Even though the inner workings of DL remain obscure, DL‐based AVE is, in the end, no different from any other clinical diagnostic test. Since the limitations of DL described here might not be fully appreciated by end users, the onus lies on the developer of an AI‐based device to make the subtle issues explicit, particularly in less regulated markets. Raising awareness and knowledge of the goodness‐of‐fit and limitations of DL‐based AVE among end users is critical to improving clinical practice. Nonetheless, some AVE‐type products are already being marketed without substantial documentation of effectiveness. Thus, in line with the WHO guidance, we maintain that premature introduction of AI‐based methods, without transparency and accountability, threatens their eventual acceptance and best use.
CONFLICT OF INTEREST
David Levitz (DL) is the co‐founder of MobileODT. In the last 3 years, he was an executive in the company and sat on the board of directors. He is no longer with the company, but still owns some stock. He is currently the owner of Imaging and Analytics Consulting, Ltd., a small consulting company based in Israel. His wife is the owner of DL Analytics, LLC, a small business based in California. The other authors have nothing to declare.
AUTHOR CONTRIBUTIONS
Kanan T. Desai, Silvia de Sanjosé and Mark Schiffman contributed substantially to the conception and design of the study. Ana‐Cecilia Rodriguez, Brian Befano, David Levitz, Didem Egemen, Helen Kelly, Jose Jeronimo, Julia Gage, Nicole Campos and Paul Pearlman contributed to the interpretation and critical thinking. Sameer Antani and Zhiyun Xue contributed to the DL algorithm development. Kanan T. Desai drafted the manuscript. All authors provided critical revision of the article and provided final approval of the version to publish.
Appendix S1 Supporting Information.