Ovarian cancer has the sixth-largest fatality rate in the United States among all cancers. A non-surgical assay capable of detecting ovarian cancer with acceptable sensitivity and specificity has yet to be developed. However, such a discovery would profoundly impact the pace of the treatment and improvement to patients' quality of life. Achieving such a solution requires high-quality imaging, image processing, and machine learning to support an acceptably robust automated diagnosis. In this work, we propose an automated framework that learns to identify ovarian cancer in transgenic mice from optical coherence tomography (OCT) recordings. Classification is accomplished using a neural network that perceives spatially ordered sequences of tomograms. We present three neural network-based approaches, namely a VGG-supported feed-forward network, a 3D convolutional neural network, and a convolutional LSTM (Long Short-Term Memory) network. Our experimental results show that our models achieve a favorable performance with no manual tuning or feature crafting, despite the challenging noise inherent in OCT images. Specifically, our best performing model, the convolutional LSTM-based neural network, achieves a mean AUC (± standard error) of 0.81 ± 0.037. To the best of the authors' knowledge, no application of machine learning to analyze depth-resolved OCT images of whole ovaries has been documented in the literature. A significant broader impact of this research is the potential transferability of the proposed diagnostic system from transgenic mice to human organs, which would enable medical intervention from early detection of an extremely deadly affliction.
Cancer is currently the second leading cause of death in the United States, and the number and percentage of people who develop cancer in their lifetime have increased over the past 15 years. Ovarian cancer was found to be the most frequent cause of death due to cancer in the U.S. [1]. Ovarian cancer is particularly devastating due to its non-specific symptoms, many of which are considered idiopathically harmless when assessed in isolation. The impact of cancer is compounded by the lack of a useful early screening tool, leading to a late-diagnosis rate of [2]. If ovarian cancer is found and can be treated before metastasis, the five-year survival rate is (compared to a baseline of for metastatic cases) [2]. This provides clear evidence for the need for an effective early detection technique.
A non-surgical and high-throughput ovarian cancer screening method would provide a tremendous improvement in quality of life and prognosis. Several imaging techniques have been investigated toward this end. One technique that has shown tremendous promise is optical coherence tomography (OCT). OCT is an interferometric imaging technique that yields depth-resolved, high-resolution images that carry information about the imaged tissue’s microstructure. Historically, OCT has been applied with much success to biological imaging in the human eye [3-5], the lung [6, 7], the esophagus [8], the coronary artery [9, 10], and a number of other organs including the ovaries [11-13]. The physical principle of OCT systems is similar to that of ultrasound, except that OCT systems measure time-resolved backscattered light instead of sound waves [14]. 
In particular, OCT images have a wealth of microstructural features in the ovaries, including the stroma, epithelium, and collagen, which show great potential for disease diagnostics and tissue classification [11, 12, 15–18].
One factor stymieing efforts to use OCT for ovarian cancer screening is the optical noise produced by tomographic imaging of the ovaries [19]. Additionally, the data are three-dimensional, subject to scaling challenges, and yield depth-dependent imaging performance. In addition to the characteristic speckle noise, these factors render tomograms extremely challenging for human radiologists and oncologists to diagnose reliably. As a result, advanced computational techniques such as machine learning methods could provide the means to extract quantitative diagnostic information for cancer screening. This manuscript presents our assessment of state-of-the-art neural network-based classification algorithms that solve this task. The results show tremendous promise in machine learning for detecting early tissue changes near the onset of cancer. Our experiments demonstrate that deep VGG-like 3D convolutional neural networks, as well as convolutional LSTMs (Long Short-Term Memory; similar in architecture to those employed previously), achieve high diagnostic accuracy when evaluated on a dataset collected by acquiring optical coherence tomography (OCT) recordings of mouse ovaries in a mouse model of the development of ovarian cancer, introduced in Sect. 4.1 [20, 21].
This manuscript is organized as follows: Section 2 defines our nomenclature. Section 3 summarizes related work. In Sect. 4.1, data acquisition processes are outlined. Section 4.3 describes our data preprocessing routine. Section 4.4 contains information on the neural network models investigated in this work. In Sect. 5, we present an evaluation of the diagnostic efficacy of these neural networks. Results are interpreted in Sect. 6.
Nomenclature
In this section, we introduce variables, parameters and general notation used throughout this work. First, we define a sample, $X_t$, to be a sequence of OCT images as $X_t = (x_t^{1}, \ldots, x_t^{N})$, where t indexes the animal sample, N is the number of slices in a sequence, and $x_t^{n}$ represents the $n$th image corresponding to the animal indexed by t. This sequence of images is formed by concatenating the subsequences of tomograms of the left and right ovaries. For example, consider the case where $N = 50$. Then each $X_t$ is represented as 50 sequential images (i.e., 25 images selected from approximately the same depth on each ovary), progressing from the most superficial to the deepest slices. Each animal is assigned a label, $y_t \in \{0, 1\}$, where $y_t = 1$ indicates animal t is predisposed to developing ovarian cancer by 8 weeks of age (i.e., transgenic), while $y_t = 0$ corresponds to wild type (WT) mice. $\hat{y}_t$ denotes predictions of these labels computed by the neural network. In this work, a dataset is a collection of tuples $(X_t, y_t)$.
Related work
OCT provides an abundance of information about tissue health. However, quantitatively analyzing three-dimensional OCT data of the ovaries is challenging due to the dimensionality of the data, the depth-dependence, the presence of speckle noise, and the sizeable biological variation inherent to the ovaries. To date, quantitative analyses of OCT images of whole ovaries have been limited to first- and second-order statistical techniques such as texture, shape, and frequency analysis [22-24]. These OCT-based approaches have shown promise for quantifying tissue changes with the onset of different types of cancer. Despite the success of these quantitative techniques, the sensitivity and specificity of disease detection could be greatly improved by coupling them with more sophisticated machine learning techniques.
Neural networks and related approaches have shown remarkable promise in the context of biological OCT imaging. For example, machine learning has demonstrated utility in assessment for glaucoma [25, 26], retinal diseases [27-30], pulmonary cancer [20], and neurodegenerative and dermatological disease [31, 32]. Other applications of machine learning that have demonstrated success include the modality of intravascular OCT imaging, where neural networks and random forests have been used for atherosclerotic plaque identification [33-35]. Neural networks were also able to detect COVID-19 in pulmonary x-ray imaging [36]. The success of machine learning in cancer diagnosis suggests that applying these techniques to ovarian tissue imaging could greatly advance the technology toward clinical application. To the best of our knowledge, deep neural networks have not been used to analyze depth-resolved OCT images of whole ovaries.
Machine learning applied to cancer identification has meaningfully benefited from transfer learning [37, 38]. 
In transfer learning, a model (e.g., neural network) is trained on a task that is not necessarily related to the task on which the model will be evaluated. For example, training a neural network to classify ImageNet data solves a problem quite distinct from that of cancer detection [39]. However, the neural network trained on ImageNet may have learned some useful features from the real scenery that can transfer to cancer classification from OCT imagery. Transfer learning has been successfully used in many tasks and we use transfer learning in this work to boost the performance of our predictive model [38].
Methodology
Data acquisition and imaging
OCT images were collected from a swept source OCT system (OCS1050SS, Thorlabs). The system was set to operate in non-contact mode with a central wavelength of 1040 nm and spectral bandwidth of 80 nm. The axial scan rate was 16 kHz and the power on the sample was measured as 0.36 mW. The system was set to average 4 axial scans, with transverse resolution and μm axial resolution in tissue. The total imaging volume was 4 mm × 4 mm laterally and 2 mm deep. The digital images are pixels (pixel size of approximately 5 μm × 5 μm). The image volume was exported as a series of 2D images (or slices).
The OCT data in this work were initially curated for automated segmentation algorithms [19] and 3D texture analysis [15]. Ideally, the sequence of OCT images can be concatenated in a third dimension to visualize the 3D structure of an organ. Unfortunately, a major challenge with OCT data is that the noise statistics associated with optical backscattering vary with organ depth and presumably tissue health. For example, common irregularities attributed to variations in tissue density, optical absorption characteristics, and concentration of scatterers impeded early attempts at quantitative analysis of optical coherence tomograms [19]. To ameliorate the impacts of these inconsistencies, we propose a Gaussian blur during preprocessing to smooth the images. All remaining computation to counteract any deleterious effects of optical noise resides in the neural network classifiers described in Sect. 4.4.
Mouse model
The image data were collected from a transgenic mouse model (TgMISIIR-TAg) in which females spontaneously develop bilateral epithelial ovarian cancer [40, 41]. All TAg positive (TAg+) TgMISIIR-TAg female mice develop bilateral epithelial ovarian cancer, with invasive tumors in the ovaries evident in nearly all mice by eight weeks of age. Sixteen mice were sacrificed at eight weeks for imaging (eight TAg+, eight wild type) and explanted organs were imaged using the OCT system. Details on mouse breeding protocol, and surgical explantation can be found in previous publications [15, 19, 42]. All imaged tissue was analyzed via immunohistochemistry and evaluated by a pathologist for the presence and extent of tumors, which is determined via cell morphology and presence of the TAg protein. This process provides a thorough validation that the TAg+ mice exhibit ovarian cancer by eight weeks of age. Further details on the histological analysis can be found in a previous work by Sawyer et al. [43].
Data preprocessing
The tomography imagery consists of pixel images (see Fig. 1a), with pixel intensities in [0, 255]. We perform a sequence of preprocessing transformations to render these images useful to neural networks. Figure 1 highlights the pipeline of preprocessing operations. First, we rescale pixel intensities linearly to the interval, . A prior study of this OCT dataset revealed that speckle noise inherent in the medium significantly confounds automatic segmentation systems’ efforts to isolate ovarian tissue [19]. In order to reduce this noise and improve perceptibility of the images we pass each image through a Gaussian filter (with a standard deviation of 1) to produce Fig. 1b. The Gaussian filter was empirically shown to mitigate the effects of the noise (e.g., compared to median, low-pass, or anisotropic filters) for the segmentation task studied in [19]. At the final stage of preprocessing, we standardize each image (i.e., normalize by calculating the pixel-wise mean and standard deviation of intensity from selected training data, then subtracting this mean from each pixel value and dividing by the empirical standard deviation). After preprocessing each image, the next phase in our cancer detection framework is to train a deep neural network to perform the task of classification of sequences of OCT images.
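The three preprocessing stages described above (intensity rescaling, Gaussian smoothing with a standard deviation of 1, and pixel-wise standardization against training statistics) can be sketched as follows. This is a minimal illustration rather than the authors' code: the target interval [0, 1], the reflected image borders, the kernel truncation at four standard deviations, the small `eps` guard, and all function names are assumptions introduced here.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float, truncate: float = 4.0) -> np.ndarray:
    """Normalized 1D Gaussian kernel spanning +/- truncate * sigma (assumed cutoff)."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur; borders are reflected (an assumption)."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image, radius, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def smooth(slices: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Rescale 8-bit intensities to [0, 1] (assumed interval) and blur each slice."""
    scaled = slices.astype(np.float64) / 255.0
    return np.stack([gaussian_blur(s, sigma) for s in scaled])


def fit_statistics(train_slices: np.ndarray, sigma: float = 1.0):
    """Pixel-wise mean and standard deviation estimated from training slices only."""
    smoothed = smooth(train_slices, sigma)
    return smoothed.mean(axis=0), smoothed.std(axis=0)


def preprocess(slices, mean, std, sigma: float = 1.0, eps: float = 1e-8):
    """Apply rescaling, blurring, and standardization to a stack of slices."""
    return (smooth(slices, sigma) - mean) / (std + eps)
```

Estimating `mean` and `std` from training slices alone, and reusing them for held-out slices, keeps information from the test animal out of the normalization.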
Fig. 1
An illustration of the sequence of preprocessing steps under consideration: a depicts the raw image of slice 100 of the left ovary of animal 3767, b shows the result of convolving this with a 2-dimensional Gaussian kernel
Classification model
Convolutional neural networks (CNNs) are a class of artificial neural network consisting of banks of neurons whose output states (which can be thought of as pixels in a visual analogy) are computed as the convolution of an input signal with the filter learned by the given bank of neurons. CNNs were introduced as a solution to the problem of handwritten digit classification and have since been applied nearly ubiquitously in computer vision tasks [20, 44, 45]. VGG is a remarkably deep CNN that achieves near state-of-the-art performance on challenging image classification tasks, including medical image analysis [45, 46]. As shown in Fig. 2, the VGG-based model consists of sequential convolutional layers, represented as yellow volumes, and pooling operations (i.e., down-sampling via pixel aggregation), represented as orange volumes. The primary convolutional block is composed of two two-dimensional serially connected spatial convolution layers outputting 64 channels into a max-pooling layer. This pattern is repeated in subsequent blocks as illustrated, successively doubling the number of channels in a convolutional layer’s output until the final two convolutional blocks (each of which outputs 512 channels). Blocks 3-5 all have three serially connected convolutional layers. Block 5 feeds into a fully connected neural network (i.e., the re-encoding layer), which feeds into a second fully connected layer (i.e., the decoding layer) with a single sigmoidally activated neuron (as opposed to the rectified exponential nonlinearity defined in Eq. 2), which ensures that the final layer solves the classification problem by effectively performing logistic regression on the penultimate layer’s encoding of the OCT imagery. The output of the decoding layer indicates an estimated likelihood of each class (WT vs transgenic) for the given sample. 
Unlike the original implementation of the VGG network, which uses ReLU nonlinearities, all convolutional layers in our model signal with rectified exponential activation functions.
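For readers unfamiliar with the two nonlinearities mentioned here, the sketch below contrasts them. It assumes that the "rectified exponential" activation is the standard exponential linear unit (ELU) and that its scale parameter `alpha` equals 1.0; neither the value of `alpha` nor the function names come from the text.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: identity for positive inputs,
    smooth saturation toward -alpha for negative inputs (alpha is assumed)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sigmoid(x: float) -> float:
    """Logistic function used by the single output neuron of the decoding layer."""
    return 1.0 / (1.0 + math.exp(-x))
```

Unlike ReLU, which outputs exactly zero for every negative input, the ELU keeps a nonzero gradient there, while the sigmoid maps the decoding layer's pre-activation to a value in (0, 1) that can be read as a class likelihood.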
Fig. 2
A graphical depiction of the VGG architecture investigated in this work. The primary convolutional block is composed of two 2D serially connected spatial convolution layers (represented in yellow) outputting 64 channels. These feed into max-pooling layers represented in orange. This pattern is repeated in subsequent blocks as illustrated, successively doubling the number of channels in a convolutional layer’s output until the final two convolutional blocks (each of which outputs 512 channels). Blocks 3-5 all have three serially connected convolutional layers. Block 5 feeds into a fully connected neural network (the re-encoding layer), which feeds into a second fully connected layer (the decoding layer) with a single sigmoid neuron (as opposed to the rectified exponential nonlinearity). The output of the decoding layer indicates an estimated likelihood of each class (WT vs transgenic) for the given sample (Color figure online)
We initialize the VGG sub-network with weights learned on ImageNet, which contains photographic images, to leverage transfer knowledge [47]. Transfer learning is the approach taken in many computer vision applications where a pre-trained deep neural network is first optimized on an unrelated dataset, in which there is an abundance of labeled data [38, 48, 49], before being fine-tuned on the task-relevant dataset. The pre-trained network provides feature maps (i.e., nonlinear feature extractors) that are learned from ImageNet. Once the network is pre-trained, we fine-tune the network on the OCT data described in the previous section. Despite any suspected disparity between the generation of imagery of natural scenes compared with that of biological tissues (e.g., melanoma dermoscopy compared with ImageNet), transfer learning has been shown to be beneficial in other neural network-based medical image tasks [45] and applications [50-52]. VGG minimizes the cross-entropy between the probability distribution underlying ground truth ($y$) and the distribution of decisions decoded from the output of the model, denoted as $\hat{y}$ (each implicitly conditioned on the data, $X$):
$$\mathcal{L}(y, \hat{y}) = -\frac{1}{M} \sum_{t=1}^{M} \left[ y_t \log \hat{y}_t + (1 - y_t) \log (1 - \hat{y}_t) \right] \tag{1}$$
where M is the number of images in the training dataset. $\mathcal{L}$ is the cross-entropy loss function and can be thought of as an empirically estimated divergence between the distributions of ground truth and predictions.
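As a worked illustration of the cross-entropy loss for the binary case considered here (WT versus transgenic), the sketch below averages the per-sample log-loss over a set of predictions. Averaging over the number of samples, the clipping constant `eps`, and the function name are assumptions made for this example.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int],
                         predictions: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clip to avoid log(0) when a prediction saturates at exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A classifier that outputs 0.5 for every animal incurs a loss of log 2 (about 0.693); confident correct predictions push the loss toward zero, and confident wrong ones are penalized heavily.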
Minimizing this loss function tends to drive a model toward maximizing the information it encodes about its training dataset [53, 54].
Long short-term memories (LSTMs) are a class of recurrent neural networks that learn temporal dependencies in data in recurrent connections gated by their constituent LSTM cells [55]. 
Recently, a new class of convolutional neural network equipped with the feedback connections and gates that distinguish LSTMs from earlier recurrent architectures has demonstrated favorable performance in precipitation forecasting [21] and anomaly detection in video [56]. Inspired by these results, we also use a convolutional LSTM that learns spatial correlations inherent in 3D tomography data as temporal relationships in its training data [57]. A convolutional LSTM is depicted graphically in Fig. 3. The input and convolutional blocks consist of 2-channel, 2-dimensional spatial convolutions that feed into max-pooling layers. As with conventional CNNs, the max-pooling layers downsample each channel in their input to half resolution [45]. Two 16-channel convolutional LSTMs comprise the next layers (shown in green), g and h, with feedback connections represented by dashed arrows. The convolutional LSTM layers instantiate architectures described by Shi et al. and initialize intermediate states, g and h, as zeros [21]. The second convolutional LSTM layer’s output, h, is relayed to a sequence of fully connected feed-forward layers that re-encode h for classification by the decoding layer. Apart from the model’s output neuron, which uses a sigmoid activation function, every neuron in this model uses the rectified version of the exponential linear activation [58]. For completeness, the rectified exponential linear activation function is explicitly defined as
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases} \tag{2}$$
where $\alpha > 0$ sets the saturation value for negative inputs.
Fig. 3
A graphical depiction of the convolutional LSTM architecture investigated in this work. The input and convolutional blocks are feed-forward layers consisting of a pair of 2-channel, 2-dimensional spatial convolutions (yellow) which feed into max-pooling aggregation layers (orange). 16-channel convolutional LSTMs comprise the central layers (green), g and h, with feedback connections represented by dashed arrows connecting each iteration’s output (e.g., g) to the subsequent iteration’s input. For each slice, t, the output of the second convolutional LSTM layer, h, is relayed to a sequence of fully connected feed-forward layers (turquoise) that re-encode it for classification in the decoding layer (purple) (Color figure online)
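The recurrence that a convolutional LSTM layer applies as it steps through an ordered stack of slices can be sketched as below. This is a deliberately reduced illustration, not the architecture of Fig. 3: it assumes a single channel, 3 × 3 kernels, zero-padded convolutions, and no peephole connections, and the names `init_params`, `conv_lstm_step`, and `run_sequence` are hypothetical. States are initialized to zeros, as described in the text.

```python
import numpy as np

GATES = ("i", "f", "o", "z")  # input, forget, output gates and candidate state


def conv2d_same(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Zero-padded 2D cross-correlation that preserves the input shape
    (odd kernel sizes assumed)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, ksize: int = 3, scale: float = 0.1) -> dict:
    """Random input-to-state and state-to-state kernels for every gate (illustrative)."""
    params = {}
    for gate in GATES:
        params["W_x" + gate] = scale * rng.normal(size=(ksize, ksize))
        params["W_h" + gate] = scale * rng.normal(size=(ksize, ksize))
        params["b_" + gate] = 0.0
    return params


def conv_lstm_step(x, h_prev, c_prev, params):
    """One recurrence step: gates are convolutions of the input and previous state."""
    pre = {
        gate: conv2d_same(x, params["W_x" + gate])
        + conv2d_same(h_prev, params["W_h" + gate])
        + params["b_" + gate]
        for gate in GATES
    }
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c = f * c_prev + i * np.tanh(pre["z"])   # update the cell state
    h = o * np.tanh(c)                       # expose a gated view of the cell
    return h, c


def run_sequence(slices: np.ndarray, params: dict) -> np.ndarray:
    """Feed slices in order (superficial to deep); hidden and cell states start at zero."""
    h = np.zeros(slices.shape[1:])
    c = np.zeros(slices.shape[1:])
    for x in slices:
        h, c = conv_lstm_step(x, h, c, params)
    return h
```

Because every gate is a convolution, the hidden state keeps the spatial layout of the slices, which is what lets the layer treat depth in the tomogram as the "time" axis.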
We also experiment with another neural network model, namely 3D CNNs. 3D CNNs extend 2D convolutional layers to model the additional spatial dimension along which the imaging data are arranged. 3D CNNs have found success in applications ranging from human pose estimation [59] to medical image analysis [20, 60]. A 3D CNN is implemented nearly identically to 2D CNNs, differing only in the number of dimensions over which convolutions are evaluated. A 2D CNN filter evaluates a single 2D convolution of an image with a 2D kernel (i.e., an image). In contrast, a 3D CNN filter convolves sequences of images with 3D kernels (i.e., volumes). These 3D-CNN architectures consist of three feed-forward subnetworks: (a) a sequentially distributed (i.e., “TimeDistributed” in the nomenclature of TensorFlow) 2D convolutional neural network, (b) a 3D-CNN, in which the third dimension is formed by ordering the elements of each sequence, and (c) a multilayer perceptron responsible for estimating the likelihood that each sequence belongs to a transgenic animal. As shown in Fig. 4, the primary convolutional block contains feed-forward layers consisting of 2D spatial convolutions (yellow) which feed into max-pooling aggregation layers (orange). These are organized with 2, 4, and 8 channels (i.e., 2D filters) in the first, second, and third convolutional layers, respectively. 
A 64 channel 3D CNN connected to a four channel 3D CNN makes up the central layers (green). For each slice, t, the outputs of the 3D convolutional block are connected to a sequence of fully connected feed-forward layers (turquoise) that re-encode them for classification in the decoding layer (purple).
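To make the difference between 2D and 3D filtering concrete, the following sketch evaluates a single 3D filter over a stack of slices with explicit loops. It assumes "valid" borders and computes cross-correlation (the operation deep learning frameworks conventionally call convolution); the function name is hypothetical, and a practical model would rely on an optimized library layer instead.

```python
import numpy as np


def conv3d_valid(volume: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a 3D kernel over a (depth, height, width) volume without padding."""
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Each output voxel mixes information from kd adjacent slices.
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out
```

With a kernel of depth 1 the operation reduces to filtering every slice independently, which is the 2D case; a depth greater than 1 lets one filter respond to structure that spans neighboring slices.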
Fig. 4
A graphical depiction of the 3D CNNs investigated in this work. The primary convolutional block contains feed-forward layers consisting of 2D spatial convolutions (yellow) which feed into max-pooling aggregation layers (orange). These are organized with 2, 4, and 8 channels (i.e., 2D filters) in the first, second, and third convolutional layers, respectively. A 64 channel 3D CNN connected to a 4 channel 3D CNN makes up the central layers (green). For each slice, t, the output of the 3D convolutional block is relayed to a sequence of fully connected feed-forward layers (turquoise) that re-encode it for classification in the decoding layer (purple) (Color figure online)
Results
In this section, we present an empirical analysis of the VGG, Convolutional LSTM and 3D-CNN on the OCT dataset described in Sect. 4.1. These comparisons also allow us to determine each algorithm’s strengths and weaknesses for the task of cancer detection. We begin our discussion of the results and findings by describing the experimental paradigm and model parameterizations studied.
Model parameterizations
This subsection offers an in-depth description of model parameterizations and configurations. We use dropout at the connections from VGG to the re-encoding layer [61], through which the outputs of VGG’s penultimate layer are randomly and dynamically zeroed out during training. Stochastically setting neurons’ outputs to zero during training typically reduces training time while guiding optimization away from deep local minima in the loss function. Weights and biases in the first two layers (i.e., those belonging to the input block shown in Fig. 2) are held constant throughout the learning routine. Fixing these parameters to the values optimized on ImageNet reduces training time (by decreasing the number of variable parameters) and has no significant effect on average and peak AUC. A marginal (but insignificant) enhancement in mean (and peak) AUC can be seen by comparing the ROCs summarized in Fig. 6a with those in Fig. 5, for which parameters of all layers are variables learned in optimization. Optimization is regularized by augmenting the loss (in Eq. 1) with a penalization (weighted by a factor of 0.0005) of the norm of the weights learned in the re-encoding layer. The weights and biases of this model are optimized by the “Nadam” routine [62], an extension of the popular Adam optimization algorithm that incorporates Nesterov momentum to increase the rate of convergence of the optimization process. The learning rate is initially set to 0.001, and is adapted as a function of gradients of the loss. In contrast to the 3D CNNs and convolutional LSTMs, for which Batch Normalization (BN) [63] (the process of normalizing the outputs of intermediate layers of a neural network) was necessary in order to stabilize learning, incorporating BN between the intermediate layers of our implementation of VGG did not significantly affect the performance metrics assessed here.
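The weight penalty described above can be written out as a short calculation. The sketch assumes the penalized quantity is the squared L2 norm of the re-encoding layer's weights; the type of norm is not recoverable from the text, so that choice and the function name are assumptions, while the factor 0.0005 is the value stated above.

```python
from typing import Sequence


def regularized_loss(data_loss: float,
                     weights: Sequence[Sequence[float]],
                     penalty: float = 0.0005) -> float:
    """Add a weighted squared-L2 penalty (assumed norm) on one layer's weights
    to the data-fit term."""
    squared_norm = sum(w * w for row in weights for w in row)
    return data_loss + penalty * squared_norm
```

For example, a data loss of 0.5 and a weight matrix `[[1.0, -2.0], [0.0, 3.0]]` (squared norm 14) give 0.5 + 0.0005 × 14 = 0.507, so the penalty only dominates when the weights grow large.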
Fig. 6
Receiver operating characteristic (ROC) curves computed by interpolating the functional mean ROC from recordings of replications of the aforementioned CV experiment for a VGG, b a convolutional LSTM, and c a 3D-CNN corresponding to the parameterizations outlined in Sect. 5.1. The shaded error region shown is within one standard error of the mean ROC curve. The dashed red curve (for which true positive rate is equal to false positive rate) is an idealized ROC corresponding to classifying by random chance (i.e., uniformly random guessing). The dashed black curve is the ROC that achieved the maximum area enclosed below among all replications of the CV experiment
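The summary statistics reported alongside these curves (area under the ROC curve, and its mean and standard error across replications) can be computed as in the sketch below. It uses the rank-based (Mann-Whitney) formulation of AUC rather than integrating an interpolated curve, which is an assumption about method; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean over replications."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

An AUC of 0.5 corresponds to the chance diagonal described in the caption, and 1.0 to a classifier that ranks every transgenic animal above every WT animal.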
Fig. 5
A comparison of summaries of ROCs achieved by training a an instance of VGG in which the weights and biases of the two primary layers, which were optimized on Imagenet, are fixed during learning with b an instance of VGG in which weights and biases are initialized randomly and remain variable throughout optimization. A marginal but likely insignificant improvement due to transfer learning is evident in the differences in geometries of the peak ROCs. However, the improvement in area under ROC is a small fraction of the standard error of the mean (i.e., the shaded region)
Distinct from 3D CNN- and VGG-based models, the proposed convolutional LSTM model perceives OCT imagery that has not been normalized as described in Sect. 4.3. Also, unlike the VGG- and 3D CNN-based models, dropout is not used while training our LSTM-based models. Cross-entropy loss is optimized and regularized by the norm of each layer’s weights. In contrast to the VGG-based and 3D CNN models considered in this work, convolutional LSTMs are optimized using the Adadelta algorithm [64], which is empirically a more stable (and computationally parsimonious) choice for this architecture and dataset. Batch normalization was applied to the inputs of the intermediate layers of the convolutional LSTM block during training to stabilize the estimation of gradients of the loss with respect to the parameters belonging to these layers [63]. Batch normalization was found to accelerate learning and improve generalization performance. The learning rate is initially set to 0.001, and is adapted as a function of gradients of the loss as proposed by Zeiler [64].
The proposed 3D CNN model is the only model whose performance was experimentally shown to benefit from the normalization of OCT imagery described in Sect. 4.3. We train 3D CNNs with dropout while minimizing, as in our implementations of the VGG- and LSTM-based models, cross-entropy loss regularized with the norm of the weights. 
The weights of this architecture are optimized using the Nadam algorithm [65]. Without BN applied to the inputs of the intermediate layers of the 3D convolutional block, the performance of the 3D CNN models assessed here is critically impaired. Exactly as with the VGG-based models, the learning rate is initially set to 0.001 and is adapted as a function of gradients of the loss.
Cross-validation experiment
Generalization performance is the statistic we are most interested in understanding: the experiments seek to measure performance on data never observed during training. We devised a leave-one-out cross-validation (CV) experiment to test our models' ability to generalize to unseen data. Newly initialized models are trained and validated on a subset of the complete set of tomography sequences before being evaluated on the held-out animal's sequence. Specifically, we perform one fold of CV for each animal in our dataset. Within each fold, the remaining 15 animals are divided among a singleton containing an animal whose label equals that of the test animal and seven disjoint sets containing two animals each (one transgenic and one WT). Our models are trained on sequential mini-batches corresponding to these seven stratified subsets. To cope with physical memory constraints and to maximize the number of mini-batches whose training is validated by an as-yet-unseen subset of samples, the validation subset for each mini-batch is the next mini-batch that the model will train on. In the final mini-batch, which must be validated on already-seen data, the validation set is chosen to be the training data exposed in the first mini-batch. Validating models on unseen data during training serves an additional role in mitigating catastrophic forgetting, a phenomenon in which a neural network forgets previously learned knowledge as it is exposed to new information [66]. We also use early stopping during training to reduce the risk of over-training. Early stopping is implemented by halting training on a batch once further training no longer improves the loss on the validation subset. Our training routine ensures that each model learns from at most one positive and one negative sample in each mini-batch. After training is complete, the model in question is evaluated on the held-out test sample.
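The rotating validation scheme described above (each mini-batch is validated on the next one, and the final mini-batch wraps around to the first) can be illustrated with a minimal Python sketch. The function name, the animal identifiers, and the representation of a mini-batch as a tuple of IDs are assumptions made for illustration only; the sketch does not cover the singleton subset, early stopping, or the training itself.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rotating_validation_schedule(batches: Sequence[T]) -> List[Tuple[T, T]]:
    """Pair each training mini-batch with the mini-batch used to validate it.

    Mini-batch i is validated on mini-batch i + 1 (not yet trained on);
    the last mini-batch wraps around and is validated on the first.
    """
    n = len(batches)
    if n < 2:
        raise ValueError("need at least two mini-batches")
    return [(batches[i], batches[(i + 1) % n]) for i in range(n)]


# Hypothetical animal IDs: seven stratified pairs (one transgenic, one WT).
pairs = [(f"tg{i}", f"wt{i}") for i in range(7)]
schedule = rotating_validation_schedule(pairs)

for train_batch, val_batch in schedule:
    print("train on", train_batch, "validate on", val_batch)
```

Only the final entry of the schedule validates on data the model has already trained on, which matches the exception noted in the text.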
Performance results
Figure 6 compares our models' diagnostic efficacies (i.e., their ability to predict the occurrence of ovarian cancer from OCT images in the transgenic mouse model described in Sect. 4.1). Efficacy is assessed by the Receiver Operating Characteristic (ROC) curve, which plots true positive rate (i.e., TP / (TP + FN)) against false positive rate (i.e., FP / (FP + TN)) [67], where TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives, respectively. The red dashed line shows the result of random prediction (i.e., uniformly random guessing), which is the worst performance that a classifier can achieve. We also report the area under the ROC (AUC), which approximates the probability that a given model will rank a randomly chosen positive sample higher than a randomly chosen negative sample. These statistics are summarized in Table 2. The mean ROC curves shown are interpolated from the CV experiment described in Sect. 5.2. The convolutional LSTM achieved the maximum peak AUC, a marginal improvement of only 0.06 over the 3D CNN (see Table 2 for the peak and average AUCs with the standard error). In contrast, the VGG-based model significantly underperformed, achieving a maximum AUC of only 0.86. Interestingly, the VGG model achieved the worst AUC despite requiring the greatest amount of time to train. The 3D CNN and convolutional LSTM incur similar time costs, and each completes a single training epoch in less than half the time required by the VGG-based model. The empirically most powerful classifiers evaluated in this work, the convolutional LSTMs that achieved a peak AUC of 0.98, committed only a single false positive (and no other errors). Based on these results, the convolutional LSTM shows considerable promise for ovarian cancer detection from OCT imagery.
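The ranking interpretation of the area under the ROC can be made concrete with a short Python sketch that counts, over all positive/negative pairs, how often the positive sample receives the higher score. This is an illustration only: the function name, scores, and labels below are invented and are not the paper's data, and this pairwise estimate is not necessarily how the reported AUCs were computed.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: one positive (0.4) is ranked below one negative (0.7).
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = pairwise_auc(scores, labels)
print(f"AUC = {auc:.3f}")  # 8 of the 9 pairs are ranked correctly
```

With few test animals, a single misranked pair shifts this estimate noticeably, which is one reason the text reports both peak and mean AUCs with a standard error.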
Table 2
Peak and average AUCs achieved over ten replications of the leave-one-out cross-validation experiment described in Sect. 5.2, summarized from the results shown in Fig. 6
Areas under ROC
Model         Peak AUC    Mean AUC ± SE
VGG           0.86        0.59 ± 0.068
Conv. LSTM    0.98        0.81 ± 0.037
3D-CNN        0.92        0.69 ± 0.029
The standard error is measured on a 90% confidence interval
Receiver operating characteristic (ROC) curves computed by interpolating the functional mean ROC from recordings of replications of the aforementioned CV experiment for (a) a VGG, (b) a convolutional LSTM, and (c) a 3D-CNN corresponding to the parameterizations outlined in Sect. 5.1. The shaded error region shown is within one standard error of the mean ROC curve. The dashed red curve (for which true positive rate is equal to false positive rate) is an idealized ROC corresponding to classifying by random chance (i.e., uniformly random guessing). The dashed black curve is the ROC that achieved the maximum area enclosed below among all replications of the CV experiment
Discussion
Conclusions
This work's contributions form a critical first step toward an automatic OCT-based human ovarian cancer diagnostic system. The proposed classifiers learn and adapt abstract representations of tomograms conducive to detecting radiographic signatures of ovarian cancer in OCT imagery without manual feature selection. The results presented here show that, to the extent of the limits imposed by the dataset, highly discriminatory classifiers that can be expected to generalize to unseen data can be trained. Moreover, their low misclassification counts are replicable across multiple runs of the leave-one-out cross-validation program.
To the best of the authors' knowledge, this is the first demonstration of a proof-of-concept model for cancer detection using depth-resolved OCT recordings of ovaries, which have been shown to be a challenging medium on which to base inferences of genotype in both OCT and widefield fluorescence [19, 68–70]. A recent approach to OCT-based ovarian cancer detection using a generalized linear model classifier showed promising results for detecting malignant (vs. normal) ovarian tissue [71]. However, that effort is distinct from ours in that its authors imaged biopsies of ovarian tissue using full-field OCT and performed classification on hand-crafted features developed by human analysis of ovarian OCT data. In contrast, our methodology learns feature maps from the training data, and our proposed neural networks are benchmarked on depth-resolved OCT recordings of intact ovaries.
Future work
Because the dataset is admittedly small, consisting of only 16 animals in total, future experimentation with the proposed classifiers must involve validation on a larger dataset, which would enable larger cross-validation experiments in which many animals are held out for testing on each fold. We emphasize that, to the extent of the limits imposed by the dataset analyzed in this work, the cross-validation results presented reflect generalization performance exclusively (i.e., all testing is performed on samples that do not appear in the training subset). However, this procedure suffers from the limitation of assessing only a single animal in the test phase of each fold. A larger dataset enabling a larger cross-validation experiment would allow us to draw stronger conclusions on diagnostic efficacy with reduced uncertainty. Additionally, a larger collection of mouse OCT imagery may provide valuable information to be leveraged in a transfer learning experiment when eventually adapting the models for human subjects.
A particularly useful extension of the models presented here is a quantitative method to identify the features and regions in the OCT imagery that lead to a neural network's decision (i.e., the specific region in the OCT image where the tumor is present). These regions could give medical practitioners insight into the uncertainty of a neural network's prediction. For example, consider the case in which an artificial occlusion (e.g., an implanted medical device) or an interferometric artifact partially obscures (or mimics) a radiographic signature of cancer. Unless such occlusions are sufficiently common throughout the classifier's training dataset, it is unlikely that the neural network has learned to accurately identify the signatures of ovarian cancer in the presence of the occlusion. Therefore, health care providers may decide that the result warrants further consideration, perhaps in conjunction with other assays (e.g., collecting serum to identify or exclude the possible presence of biomarkers that indicate the progression of ovarian cancer [72]).
Broader impacts
Perhaps this work's most profound broader impact lies in its potential to dramatically improve the likelihood of detecting ovarian cancer in patients before metastasis throughout the peritoneal cavity, which would radically improve treatment outcomes. That our models were trained and evaluated on a transgenic mouse model of ovarian cancer development raises a central question: to what extent does a neural network from our work transfer to OCT data collected from humans? Given the difficulty of collecting such data, developing an even larger dataset of mouse ovary tomograms may prove advantageous if the knowledge learned from the mouse model is relevant for analyzing human ovarian OCT data.