Analytical Validation of a Deep Neural Network Algorithm for the Detection of Ovarian Cancer.

Gerard Reilly¹, Rowan G Bullock², Jessica Greenwood², Daniel R Ure², Erin Stewart², Pierre Davidoff², Justin DeGrazia², Herbert Fritsche², Charles J Dunton², Nitin Bhardwaj², Lesley E Northrop².

Abstract

PURPOSE: Early detection of ovarian cancer, the deadliest gynecologic cancer, is crucial for reducing mortality. Current noninvasive risk assessment measures include protein biomarkers in combination with other clinical factors, which vary in their accuracy. Machine learning can be applied to optimizing the combination of these features, leading to more accurate assessment of malignancy. However, the low prevalence of the disease can make rigorous validation of these tests challenging and can result in unbalanced performance.
METHODS: MIA3G is a deep feedforward neural network for ovarian cancer risk assessment, using seven protein biomarkers along with age and menopausal status as input features. The algorithm was developed on a heterogeneous data set of 1,067 serum specimens from women with adnexal masses (prevalence = 31.8%). It was subsequently validated on a cohort almost twice that size (N = 2,000).
RESULTS: In the analytical validation data set (prevalence = 4.9%), MIA3G demonstrated a sensitivity of 89.8% and a specificity of 84.02%. The positive predictive value was 22.45%, and the negative predictive value was 99.38%. When stratified by cancer type and stage, MIA3G achieved sensitivities of 94.94% for epithelial ovarian cancer, 76.92% for early-stage cancer, and 98.04% for late-stage cancer.
CONCLUSION: The balanced performance of MIA3G leads to a high sensitivity and high specificity, a combination that may be clinically useful for providers in evaluating the appropriate management strategy for their patients. Limitations of this work include the largely retrospective nature of the data set and the unequal, albeit random, assignment of histologic subtypes between the training and validation data sets. Future directions may include the addition of new biomarkers or other modalities to strengthen the performance of the algorithm.

Year: 2022 PMID: 35671415 PMCID: PMC9225600 DOI: 10.1200/CCI.21.00192

Source DB: PubMed Journal: JCO Clin Cancer Inform ISSN: 2473-4276

CONTEXT

Key Objective
Our objective was to examine the potential of a noninvasive machine learning tool to accurately assess the risk of ovarian malignancy in patients with pelvic masses.

Knowledge Generated
The deep neural network was trained on a large heterogeneous data set obtained from patients who had presented with adnexal masses and used seven serum proteins, age, and menopausal status as inputs. In the analytical validity data set, which simulated real-world prevalence for ovarian malignancy (4.9%), the algorithm demonstrated a sensitivity of 89.8%, a specificity of 84.0%, a positive predictive value of 22.5%, and a negative predictive value of 99.4%.

Relevance
Ovarian cancer is the deadliest gynecologic cancer, and most cases are diagnosed at a late stage, which has low survival rates. Current noninvasive risk assessment measures vary in their accuracy, so the balanced sensitivity and specificity of this algorithm may be a clinically useful combination for providers evaluating appropriate care strategies for patients presenting with a pelvic mass.

INTRODUCTION

Adnexal masses are a common gynecologic condition. With approximately 10% of women undergoing surgery for an adnexal mass during their lifetime, the research efforts to date have focused on tools designed to identify which of these masses are cancerous.[1,2] Ovarian cancer is the deadliest gynecologic cancer, and therefore, prompt and correct identification of malignancies is crucial. However, the incidence of ovarian cancer is still relatively low.[3] Approximately 85% of masses in premenopausal women will be benign, so testing that can accurately differentiate malignant masses from those that require less extensive intervention and treatment is of clinical value.[1]

Identification of a pelvic mass may occur during physical examination but more likely occurs via imaging, typically with transvaginal ultrasonography. Biopsy is usually avoided to reduce the risk of disrupting the cyst wall and allowing any potential malignant cells to disseminate.[4] When a mass shows clear indications of malignancy, the patient benefits from appropriate referral to a gynecologic oncologist for surgery, staging, and any further treatment.[5] Beyond imaging, additional methods of assessing adnexal masses include the use of biomarker-based blood tests, such as cancer antigen 125 (CA125) and human epididymis protein 4 (HE4). Relying on these traditional methods to stratify the oncologic risk of adnexal masses has several challenges. First, a small set of biomarkers may not be able to ascertain the physiology of certain ovarian cancers because different histologic subtypes are known to present with different biomarker patterns.[6-8] Second, the process of using a set threshold for each biomarker can become cumbersome when multiple markers are added to the analysis. Third, this process may be further complicated by the age and menopausal status of the patient, which can affect the baseline or so-called normal level of these proteins.

Machine learning–based classification models can address these limitations, which is why their use in early cancer detection and risk stratification is increasing.[9] These models are capable of incorporating a long list of protein biomarkers along with clinical/health features as inputs to generate a unified score for risk assessment. However, building these models can be challenging because of the low incidence of ovarian cancer. Having a small set of positive samples for training can result in a skewed model with a high specificity but a low sensitivity. Developing a balanced classification model with high sensitivity and specificity is crucial, especially given the mortality implications of false negatives (FNs) and the burden on the health care system and the patient of false positives (FPs).

This study describes the development and validation process used to establish test performance metrics for MIA3G, a new machine learning algorithm to assess ovarian cancer risk in patients with an adnexal mass. Powered by a robust data set inclusive of a large number of malignancies for training and testing, this algorithm has demonstrated balanced performance in a large analytical validation set.

METHODS

Algorithm Description

The MIA3G assay is an algorithm developed with a proprietary application of machine learning methods whose purpose is to stratify women with an ovarian mass into two categories: low and elevated risk of malignancy. The algorithm uses supervised learning with known histopathology diagnoses (malignant and nonmalignant) as the labels for algorithm training. MIA3G is a classification deep feedforward neural network that uses the following features as inputs: age, menopausal status, and seven protein biomarker values for each patient. The neural network has multiple hidden layers, each with its own weighted nodes and activation functions. To reduce overfitting, the neural network is regularized using node dropout, in which a percentage of the nodes are randomly omitted from each hidden layer during training.[10] The final layer of the neural network has two nodes and uses the softmax function to assign a binary classification: low or elevated risk of malignancy. Additional details of the methods used to reduce overfitting and to perform oversampling are provided in section S1.1 of Appendix 1.
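The structure described above (nine inputs, several hidden layers with dropout during training, and a two-node softmax output) can be illustrated with a minimal sketch. This is not the MIA3G implementation, which is proprietary: the hidden-layer sizes, the dropout rate, the ReLU activation, and the random weights below are all assumptions chosen only to make the example run.

```python
import math
import random

N_INPUTS = 9            # age, menopausal status, seven biomarkers (from the text)
HIDDEN_SIZES = [16, 8]  # hypothetical: the real architecture is not disclosed
DROPOUT_RATE = 0.2      # hypothetical dropout percentage


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Weighted sum plus bias for every node in the layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed activation function; the source does not name one."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout: zero each node with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def softmax(values):
    """Numerically stable softmax over the two output nodes."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def build_network(rng):
    sizes = [N_INPUTS] + HIDDEN_SIZES + [2]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(x, network, rng=None, training=False):
    """Forward pass; dropout is applied to hidden layers only while training."""
    h = x
    for layer in network[:-1]:
        h = relu(dense(h, layer))
        if training:
            h = dropout(h, DROPOUT_RATE, rng)
    return softmax(dense(h, network[-1]))


def classify(probs):
    """Index 0 = low risk, index 1 = elevated risk (an arbitrary convention here)."""
    return "elevated" if probs[1] > probs[0] else "low"


rng = random.Random(0)
network = build_network(rng)
example = [0.5] * N_INPUTS          # placeholder for one patient's scaled inputs
probs = forward(example, network)   # inference: dropout disabled
print(classify(probs), probs)
```

Because the weights are random, the printed class carries no meaning; the sketch only shows how dropout is switched off at inference time and how the softmax output maps to the two risk categories.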

Protein Biomarkers and Input Features

Seven biomarkers are used in the MIA3G algorithm: CA125, HE4, beta-2 microglobulin, apolipoprotein A-1, transferrin, transthyretin, and follicle-stimulating hormone. CA125 and HE4 were chosen for their overexpression in many types of ovarian cancers.[11,12] The remaining biomarkers have demonstrated the ability to detect malignancy in patients with low serum CA125 and/or HE4, such as those with early-stage malignancies, and to reduce FPs in benign cases for which serum CA125 and/or HE4 were elevated for other reasons.[13-16] These features have been examined for their correlation with each other and for their contribution to the algorithm (Appendix Fig A1). Biomarker assays are performed using the Roche cobas 6000 analyzer, according to the manufacturer's instructions for use (Roche Corporation, Pleasanton, CA). In addition to these biomarkers, the patient's age and menopausal status are used as categorical input features. Menopause is defined as the absence of menses for ≥ 12 months.

FIG A1.

(A) Correlation matrix of the features used in the MIA3G algorithm. (B) Variable importance analysis of the features used in the MIA3G algorithm. ApoA1, apolipoprotein A1; B2M, beta-2 microglobulin; CA125, cancer antigen 125; FSH, follicle-stimulating hormone; HE4, human epididymis protein 4; Meno, menopausal status; TRF, transferrin; TT, transthyretin.

Studies and Sample Sets

To create a highly generalizable classification algorithm, it is essential to train it on a diverse set of specimens with a wide reference range of biomarkers and other features. To this end, a heterogeneous set of specimens was first created by combining samples from several prospective and retrospective studies, all of which underwent Institutional Review Board approval and were conducted in accordance with appropriate regulatory and ethical guidelines (Table 1).

TABLE 1.

Sample Set Composition

Broadly, the inclusion criteria for these studies were as follows: patient age ≥ 18 years; informed consent provided by the patient to participate in research; patient agreeable to phlebotomy; and a documented pelvic mass that was planned for surgical intervention within 3 months of imaging. The pelvic mass was confirmed by imaging (computed tomography, ultrasonography, or magnetic resonance imaging) before enrollment. Exclusion criteria included a diagnosis of malignancy in the previous 5 years (except nonmelanoma skin cancers) and pelvic surgery within 6 weeks before enrollment in the study.

This heterogeneous set comprised a total of 3,067 samples (Fig 1). The composite set was randomly divided into two nonoverlapping sets such that 1,067 samples were used for development of the algorithm and formed the training and testing set, and the remaining 2,000 samples were used for analytical validation. Each set received samples from every study roughly in proportion to the size of the study. The validation set had a prevalence of approximately 5% (98 malignant and 1,902 benign samples).

FIG 1.

Workflow of the development and validation of the algorithm. B, benign samples; M, malignant samples.

Although the sample size and prevalence of malignancy were fixed, the assignment of samples to each set was completely random and was performed using a random number generator to remove any potential bias. This division into development and validation sets not only keeps the assignment of samples fair and random but also allows the algorithm to be trained/tested and then validated on sets that have an optimal level of similarities (and differences). Table 2 details the clinicopathologic makeup of each set including age, pathology, histologic subtypes, and stages.
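A random assignment with a fixed validation size and prevalence, as described above, can be sketched as follows. The class totals are obtained by summing the counts reported for the two sets (339 + 98 malignant, 728 + 1,902 benign). The function name, the seed, and the use of Python's `random` module are assumptions; the sketch also ignores the per-study proportionality that the authors describe.

```python
import random

N_MALIGNANT = 339 + 98    # 437 malignant samples across both sets
N_BENIGN = 728 + 1902     # 2,630 benign samples across both sets


def assign_sets(labels, n_val_malignant, n_val_benign, seed):
    """Randomly draw a validation set with fixed class counts.

    Returns (development_indices, validation_indices); the two never overlap.
    """
    rng = random.Random(seed)
    malignant = [i for i, y in enumerate(labels) if y == 1]
    benign = [i for i, y in enumerate(labels) if y == 0]
    validation = set(rng.sample(malignant, n_val_malignant))
    validation |= set(rng.sample(benign, n_val_benign))
    development = [i for i in range(len(labels)) if i not in validation]
    return development, sorted(validation)


# 1 = malignant, 0 = benign; the ordering is irrelevant because sampling is random.
labels = [1] * N_MALIGNANT + [0] * N_BENIGN
development, validation = assign_sets(labels, 98, 1902, seed=2022)

val_prevalence = sum(labels[i] for i in validation) / len(validation)
dev_prevalence = sum(labels[i] for i in development) / len(development)
print(len(development), len(validation))
print(f"validation prevalence: {val_prevalence:.1%}")
print(f"development prevalence: {dev_prevalence:.1%}")
```

Fixing the class counts before sampling is what keeps the validation prevalence at 4.9% regardless of the seed, while the identity of the samples in each set remains random.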

TABLE 2.

Clinicopathologic Breakdown of Training, Test, and Validation Data Sets

Data and Ethics

All data were obtained from Institutional Review Board–approved trials, from adult patients who provided informed consent to participate in research. Data obtained in this analysis are proprietary to Aspira Women's Health Inc.

Training and Testing

The MIA3G algorithm was developed on 1,067 specimens composed of proportionate samples from every study (Fig 1), with 339 malignant and 728 benign samples resulting in a prevalence of 31.8%. This set was randomly divided into a training set (n = 853) and a nonoverlapping testing set (n = 214), representing 80% and 20% of the available samples, respectively. The algorithm was built on the training set and tested on the testing set to obtain an initial assessment of its performance. The performance metrics for the test data set are provided in Appendix Table A1.
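The 80/20 division can be illustrated with a short sketch. The shuffle-and-slice approach, the function name, and the seed are assumptions; the source states only that the division was random and nonoverlapping.

```python
import random


def train_test_split(n_samples, n_train, seed):
    """Shuffle sample indices and cut them into two nonoverlapping sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split(n_samples=1067, n_train=853, seed=0)

print(len(train_idx), len(test_idx))               # 853 214
print(f"train fraction: {853 / 1067:.1%}")         # 79.9%
print(f"development prevalence: {339 / 1067:.1%}") # 31.8%
```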

TABLE A1.

Performance of MIA3G in the Test Data Set

The numbers of malignant and benign specimens were further balanced for algorithm training using Borderline-SMOTE, an adaptation of the synthetic minority oversampling technique (SMOTE) that balances the minority and majority classes by creating synthetic observations near the decision boundary.[23] The resulting data set has an equivalent number of malignant and benign specimens. In the case of MIA3G, the synthetic observations improve the algorithm's ability to discern between malignant and benign specimens. To improve malignancy detection, a modestly higher weight was attached to the positive class during algorithm training in MIA3G. Weighting the malignant samples during training improved positive detection beyond the gains from balancing with Borderline-SMOTE, while having a negligible impact on benign discernment.

Several algorithms and software libraries were used to explore which technique would return the best risk classification for ovarian cancer. The caret library in R was used to screen 190 classification algorithms on the data.[24] Most algorithms in the caret library did not successfully classify ovarian cancer with a high level of sensitivity. Deep feedforward neural networks demonstrated a high and balanced sensitivity, negative predictive value (NPV), and specificity, leading to the selection of this algorithm for the development of MIA3G.

Network hyperparameters evaluated during algorithm training and testing included the following: network architectures, activation functions, loss functions, node dropout for algorithm regularization, and learning rates. The final MIA3G algorithm is a network with these hyperparameters optimized to stratify malignancy risk. This algorithm was locked and used for subsequent analytical validation.
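The idea behind Borderline-SMOTE can be shown on toy data. The sketch below is a simplified version of the published technique, not the adaptation used for MIA3G: it flags a minority sample as borderline when at least half, but not all, of its `m` nearest neighbors belong to the majority class, and then interpolates between borderline samples and their nearest minority neighbors. The function names, the parameters `m` and `k`, and the one-dimensional toy points are all hypothetical.

```python
import math
import random


def k_nearest(idx, points, pool, k):
    """Indices from `pool` of the k points closest to points[idx], excluding idx."""
    others = [j for j in pool if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def borderline_smote(points, labels, n_new, m=5, k=5, seed=0):
    """Create `n_new` synthetic minority (label 1) points near the class boundary."""
    rng = random.Random(seed)
    everyone = list(range(len(points)))
    minority = [i for i in everyone if labels[i] == 1]
    if len(minority) < 2:
        return []

    # A minority point is "in danger" when most of its neighbors are majority.
    danger = []
    for i in minority:
        n_major = sum(labels[j] == 0 for j in k_nearest(i, points, everyone, m))
        if m / 2 <= n_major < m:
            danger.append(i)

    # Interpolate between a borderline point and one of its minority neighbors.
    synthetic = []
    while danger and len(synthetic) < n_new:
        i = rng.choice(danger)
        j = rng.choice(k_nearest(i, points, minority, k))
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(points[i], points[j])])
    return synthetic


# Toy data on a line: 7 benign points (label 0) and 5 malignant points (label 1).
benign_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
malignant_x = [3.5, 4.5, 6.0, 7.0, 8.0]
points = [[x, 0.0] for x in benign_x + malignant_x]
labels = [0] * len(benign_x) + [1] * len(malignant_x)

n_needed = len(benign_x) - len(malignant_x)   # 2 synthetic points balance the classes
synthetic = borderline_smote(points, labels, n_needed, m=3, k=2)
print(synthetic)
```

In this toy layout only the malignant point at 3.5 sits next to the benign cluster, so every synthetic point is generated between it and its two nearest malignant neighbors (4.5 and 6.0). The additional class weighting described above is a separate step applied in the loss function and is not shown here.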

Analytical Performance Validation

Analytical validation was performed on 2,000 samples with 98 malignant and 1,902 benign specimens, resulting in a prevalence of 4.9%. Once the algorithm was developed and locked in a cloud-based Health Insurance Portability and Accountability Act–compliant infrastructure, it was then run on the analytical validation samples in a blinded manner so that the person running the algorithm was blinded to the sample identities and their pathology results. Two honest brokers (HB1 and HB2) were used to deidentify the samples, run the algorithm blinded, compare the classification of samples with the histology results, and then issue an independent report containing performance metrics on the basis of their findings (Fig 2).

FIG 2.

Workflow of the analytical validation exercise. HB, honest broker.

RESULTS

Performance metrics along with counts of true positives, true negatives, FPs, and FNs from analytical validation are provided in Table 3. Receiver operating characteristic and precision-recall curves are also plotted (Fig 3). Overall, a sensitivity of 89.8% and a specificity of 84.02% were achieved, with an area under the curve value of 0.938. MIA3G demonstrated an NPV of 99.38%. The positive predictive value was lower at 22.45% because of the low prevalence of disease (approximately 5%) in this data set. Metrics have also been provided for specimens stratified by menopausal status, cancer stage, cancer type, and malignancy potential. MIA3G was able to detect 20 of 26 early-stage cancers (76.92% sensitivity) and misclassified only one late-stage malignancy (98.04% sensitivity). The algorithm also correctly classified nine of the 10 metastatic ovarian cancer cases (90% sensitivity) and 75 of 79 instances of epithelial ovarian cancer, the most common type of ovarian cancer (94.94% sensitivity).
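The headline metrics follow directly from the confusion-matrix counts. Table 3 is not reproduced in this text, so the counts below (88 true positives, 10 FNs, 1,598 true negatives, 304 FPs) are inferred from the reported percentages and the 98/1,902 class sizes; they should be read as a reconstruction rather than as the published table. The second function illustrates, under the same assumption, why the positive predictive value depends so strongly on prevalence.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic accuracy metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / (tp + fn + tn + fp),
    }


def ppv_from_rates(sensitivity, specificity, prevalence):
    """Bayes' rule: PPV as a function of sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Counts inferred from the reported percentages (assumption, see lead-in).
metrics = diagnostic_metrics(tp=88, fn=10, tn=1598, fp=304)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
# sensitivity: 89.80%
# specificity: 84.02%
# ppv: 22.45%
# npv: 99.38%
# prevalence: 4.90%

# Same sensitivity and specificity at the development-set prevalence (31.8%).
ppv_high = ppv_from_rates(metrics["sensitivity"], metrics["specificity"], 0.318)
print(f"PPV at 31.8% prevalence: {100 * ppv_high:.0f}%")
```

With sensitivity and specificity held fixed, moving from a prevalence of 4.9% to 31.8% raises the positive predictive value from about 22% to roughly 72%, which is the arithmetic behind the remark that the lower PPV reflects the low prevalence of disease in the validation set.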

TABLE 3.

Performance of MIA3G in the Validation Data Set

FIG 3.

ROC and precision-recall curves for the algorithm. Area under the receiver operating characteristic curve: 0.938; area under the precision-recall curve: 0.700. n, negative; P, positive; ROC, receiver operating characteristic.
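The area under the receiver operating characteristic curve can be read as the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison on made-up scores; it is not the authors' code, and the six scores and labels are purely illustrative.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels: 1 = positive, 0 = negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative scores only: three positives and three negatives.
scores = [0.90, 0.80, 0.35, 0.60, 0.30, 0.20]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs are ranked correctly: 0.889
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few thousand specimens; rank-based formulas give the same value more efficiently.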

DISCUSSION

A thorough and rigorous development process combined with comprehensive analytical validation is the cornerstone of any clinical laboratory–developed test. It is the foundation for setting quality standards and illustrates the performance and reliability of the underlying machinery. MIA3G has undergone a rigorous and blinded analytical validation process that meets the highest regulatory standards in evaluating all aspects of the test. After several classification algorithms were assessed, neural networks, which showed the most balanced performance, were selected to train MIA3G, which was then tested on a heterogeneous cohort. The model was optimized to reduce overfitting, and an oversampling technique was used to achieve a balanced performance, which was higher than that of all other methods that were explored (Appendix Table A2). The training and testing stage used > 1,050 specimens with > 30% positive specimens, indicative of a high-risk ovarian cancer population. This development was followed by a detailed validation process on 2,000 specimens that shows performance in a low-prevalence population (approximately 5%), supporting the generalizability of the algorithm. MIA3G has also been meticulously validated for its repeatability and reproducibility (Appendix Table A3).

TABLE A2.

Performance of Other Methods in Comparison With Neural Networks, Which Demonstrated Highest Sensitivity and NPV

TABLE A3.

%CV Measurement of the MIA3G for Runs, Days, and Operators by Sample (pooled serum)

The potential clinical utility of MIA3G in the evaluation of adnexal masses comes from its balanced performance, which is facilitated by three development features: a large malignant set used in training and testing (n = 339), the SMOTE technique applied to further boost the positive set, and a higher weight attached to the positive class. These features lead to an algorithm with a high sensitivity, a vital feature given the high mortality of ovarian cancer, while retaining a high specificity. The high sensitivity drives a high NPV in a population with a lower disease prevalence where clinical management options may include conservative management and at the same time minimizes the potentially lethal implications of FNs in the context of cancer detection.

Limitations of this study include the nature of the development data set. Although MIA3G was developed and validated on a highly diverse cohort obtained by merging several studies, most of these studies were retrospective in design with data collected from patients who were confirmed to have an adnexal mass and scheduled for surgery at the time of diagnosis. To address this, prospective trials are currently underway to validate the algorithm's performance in patients with a variety of clinical presentations. In addition, because of the random assignment of samples to the training and validation data sets, there was no way to match the distribution of cancer types between sets (Table 2). For example, by happenstance, five of the tumors in the validation set were stromal tumors and one was a germ cell tumor, subtypes known to have a different biomarker presentation compared with the more common epithelial types. In the test set, however, MIA3G demonstrated 100% sensitivity in nonepithelial malignancies, as sarcomas and carcinosarcomas comprised 4 of 5 nonepithelial malignancies in that set (Appendix Table A1). These cancer types present more similarly to epithelial ovarian cancer in terms of biomarker distribution. Nonepithelial subtypes are rare presentations of ovarian cancer, comprising approximately 10% of all ovarian malignancies,[25] so their particularly low incidence presents a challenge with regard to generating sufficient data for training and validating machine learning algorithms.

Future directions include evaluating how to train an algorithm on multiple subtypes that express different biomarker patterns and achieve consistent test performance across these subtypes. The application of a deep neural network algorithm to biomarker testing opens significant areas for future study. Understanding where the algorithm fails provides an opportunity for deeper exploration into alternate biologic explanations for FP and FN results. For example, there is a possibility that some combination of biomarkers may be identifying cancers outside of the ovaries and therefore correctly suggesting malignancy, albeit not of ovarian origin. As a step for improvement, expanding the number and types of features that feed into the algorithm may help further enhance the sensitivity and specificity of the test. Preliminary efforts are underway to evaluate the addition of novel biomarkers and other modalities such as microRNA, cell tumor DNA, and other genomic identifiers that may strengthen the algorithm's ability to both detect and rule out malignancy and advance the diagnostic ability of noninvasive testing.

15 in total

1. Conservative management of adnexal masses.

Authors: Taymaa May; Amit Oza
Journal: Lancet Oncol Date: 2019-02-05 Impact factor: 41.316

2. Adnexal mass risk assessment: a multivariate index assay for malignancy risk stratification.

Authors: Zhen Zhang; Rowan G Bullock; Herbert Fritsche
Journal: Future Oncol Date: 2019-10-01 Impact factor: 3.404

Review 3. [Sex cord-stromal tumors of the ovary : Current aspects with a focus on granulosa cell tumors, Sertoli-Leydig cell tumors, and gynandroblastomas].

Authors: F Kommoss; H-A Lehr
Journal: Pathologe Date: 2019-02 Impact factor: 1.011

4. Human epididymis protein 4 (HE4) is a secreted glycoprotein that is overexpressed by serous and endometrioid ovarian carcinomas.

Authors: Ronny Drapkin; Hans Henning von Horsten; Yafang Lin; Samuel C Mok; Christopher P Crum; William R Welch; Jonathan L Hecht
Journal: Cancer Res Date: 2005-03-15 Impact factor: 12.701

5. Characterization of serum biomarkers for detection of early stage ovarian cancer.

Authors: Katherine R Kozak; Feng Su; Julian P Whitelegge; Kym Faull; Srinivasa Reddy; Robin Farias-Eisner
Journal: Proteomics Date: 2005-11 Impact factor: 3.984

6. Three biomarkers identified from serum proteomic analysis for the detection of early stage ovarian cancer.

Authors: Zhen Zhang; Robert C Bast; Yinhua Yu; Jinong Li; Lori J Sokoll; Alex J Rai; Jason M Rosenzweig; Bonnie Cameron; Young Y Wang; Xiao-Ying Meng; Andrew Berchuck; Carolien Van Haaften-Day; Neville F Hacker; Henk W A de Bruijn; Ate G J van der Zee; Ian J Jacobs; Eric T Fung; Daniel W Chan
Journal: Cancer Res Date: 2004-08-15 Impact factor: 12.701

Review 10. Early detection of ovarian cancer.

Authors: Donna Badgwell; Robert C Bast
Journal: Dis Markers Date: 2007 Impact factor: 3.434