| Literature DB >> 33909264 |
Jorge F Lazo1,2, Aldo Marzullo3, Sara Moccia4,5, Michele Catellani6, Benoit Rosa7, Michel de Mathelin7, Elena De Momi8.
Abstract
PURPOSE: Ureteroscopy is an efficient endoscopic minimally invasive technique for the diagnosis and treatment of upper tract urothelial carcinoma. During ureteroscopy, the automatic segmentation of the hollow lumen is of primary importance, since it indicates the path that the endoscope should follow. In order to obtain an accurate segmentation of the hollow lumen, this paper presents an automatic method based on convolutional neural networks (CNNs).
Keywords: Convolutional neural networks; Deep learning; Image segmentation; Upper tract urothelial carcinoma (UTUC); Ureteroscopy
Year: 2021 PMID: 33909264 PMCID: PMC8166718 DOI: 10.1007/s11548-021-02376-3
Source DB: PubMed Journal: Int J Comput Assist Radiol Surg ISSN: 1861-6410 Impact factor: 2.924
Fig. 1 Sample images in our dataset showing: a the hue variability of the surrounding tissue as well as the shape and location of the lumen (the hollow lumen is highlighted in green to clearly show the variety of shapes it can take). b–e Samples of artifacts (the lumen is not highlighted, to give a clear view of the image artifacts)
Fig. 2 Workflow of the proposed ensemble for lumen segmentation in ureteroscopic videos. Blocks of 3 consecutive frames of size p × q × ch (where p and q refer to the spatial dimensions and ch to the number of channels of each individual frame) are fed into the ensemble. The spatial-temporal models (orange line) take these blocks as input, whereas the spatial models (red line) take only the central frame. The predictions made by the individual models are ensembled with the function defined in Eq. 1 to produce the final output
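To make the workflow in Fig. 2 concrete, the following is a minimal sketch of the ensemble forward pass. Eq. 1 itself is not reproduced in this record, so pixel-wise averaging of the models' probability maps followed by thresholding is assumed here; the model callables, shapes, and threshold are illustrative, not the authors' exact implementation.

```python
import numpy as np

def ensemble_predict(spatial_models, temporal_models, frame_block, threshold=0.5):
    """frame_block: array of shape (3, p, q, ch) holding 3 consecutive frames."""
    central_frame = frame_block[1]  # spatial models see only the middle frame
    prob_maps = [m(central_frame) for m in spatial_models]   # each returns a (p, q) probability map
    prob_maps += [m(frame_block) for m in temporal_models]   # spatial-temporal models see the full block
    fused = np.mean(prob_maps, axis=0)                       # assumed fusion, standing in for Eq. 1
    return (fused >= threshold).astype(np.uint8)             # binary lumen mask
```

With dummy callables that return (p, q) probability maps, the function yields a binary mask of the same spatial size as one frame.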
Fig. 3 The initial stage of the models M. The blocks of consecutive frames of size p × q × ch (where p and q refer to the spatial dimensions and ch to the number of channels of each individual frame) pass through an initial 3D convolution with a given number of kernels. The output of this step is zero-padded in the second and third dimensions and then reshaped to fit as input for the m core models
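A hedged PyTorch sketch of this initial stage follows. The kernel count, the padding target, and the reshape that folds the temporal depth into channels are assumptions, since the record omits the exact sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Initial3DStage(nn.Module):
    def __init__(self, ch=3, n_kernels=16):  # kernel count is an assumption
        super().__init__()
        # input: (batch, ch, depth=3 frames, p, q)
        self.conv3d = nn.Conv3d(ch, n_kernels, kernel_size=3, padding=1)

    def forward(self, x, target_hw=(256, 256)):
        y = self.conv3d(x)                      # (batch, n_kernels, 3, p, q)
        b, k, d, p, q = y.shape
        pad_h, pad_w = target_hw[0] - p, target_hw[1] - q
        y = F.pad(y, (0, pad_w, 0, pad_h))      # zero-pad the two spatial dims
        return y.reshape(b, k * d, *target_hw)  # fold depth into channels for the 2D core models

# e.g. Initial3DStage()(torch.randn(1, 3, 3, 250, 250)).shape -> (1, 48, 256, 256)
```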
Information about the dataset collected
| Patient no. | Video no. | No. of annotated frames | Image size (pixels) |
|---|---|---|---|
| 1 | Video 1 | 21 | 356 |
| 1 | Video 2 | 240 | 256 |
| 2 | Video 3 | 462 | 296 |
| 2 | Video 4 | 234 | 296 |
| 3 | Video 5 | 51 | 296 |
| 4 | Video 6 | 201 | 296 |
| 6 | Video 8 | 387 | 256 |
| 6 | Video 9 | 234 | 256 |
| 6 | Video 10 | 117 | 256 |
| 6 | Video 11 | 360 | 256 |
| Total | – | 2673 | – |
The videos marked in bold indicate the patient case that was used for testing
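Note that the split implied by the table is patient-wise rather than frame-wise: every video of the held-out patient goes to the test set. A minimal sketch, assuming a simple list-of-dicts representation of the table (the actual data-loading code is not part of this record):

```python
# Hypothetical records mirroring the table rows above
records = [
    {"patient": 1, "video": 1, "frames": 21},
    {"patient": 1, "video": 2, "frames": 240},
    {"patient": 2, "video": 3, "frames": 462},
    # ... remaining rows of the table ...
]

def split_by_patient(records, test_patient):
    """Hold out all videos of one patient, train on the rest."""
    train = [r for r in records if r["patient"] != test_patient]
    test = [r for r in records if r["patient"] == test_patient]
    return train, test
```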
Fig. 4 Box plots of the precision (Prec), recall (Rec) and the Dice similarity coefficient (DSC) for the models tested: ResUNet with single image frames (yellow), ResUNet using consecutive temporal frames (green), Mask-RCNN with single image frames (brown), Mask-RCNN using consecutive temporal frames (pink), and the proposed ensemble method (blue) formed by all the previous models. The asterisks represent statistically significant differences between the architectures according to the Kruskal–Wallis test (*, **, ***)
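The three reported metrics follow the standard per-image definitions; the sketch below computes them from binary masks, assuming numpy arrays (the variable names are illustrative). The per-architecture score distributions in Fig. 4 can then be compared with scipy.stats.kruskal.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """Precision, recall and Dice similarity coefficient for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = (pred & gt).sum()    # lumen pixels correctly predicted
    fp = (pred & ~gt).sum()   # background predicted as lumen
    fn = (~pred & gt).sum()   # lumen pixels missed
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    return prec, rec, dsc
```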
Average Dice similarity coefficient (DSC), precision (Prec) and recall (Rec) in the cases in which the ensembles were formed only by: (1) the spatial models; (2) the spatial-temporal models; (3) ResUNet with both spatial and temporal inputs; and (4) Mask-RCNN with the same setup
| Ensemble components | DSC | Prec | Rec |
|---|---|---|---|
| (1) Spatial models | 0.78 | 0.65 | 0.71 |
| (2) Spatial-temporal models | 0.71 | 0.55 | 0.57 |
| (3) ResUNet (spatial + temporal) | 0.72 | 0.56 | 0.66 |
| (4) Mask-RCNN (spatial + temporal) | 0.68 | 0.51 | 0.63 |
The ensemble function used in every case is the one defined in Eq. 1; the components forming each ensemble are listed in the first column
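As a rough illustration of how such sub-ensembles can be scored, here is a self-contained sketch; the averaging fusion is an assumption standing in for Eq. 1, and the probability-map variables are hypothetical.

```python
import numpy as np

def fuse(prob_maps, threshold=0.5):
    """Fuse a subset of model probability maps (assumed averaging, cf. Eq. 1)."""
    return (np.mean(prob_maps, axis=0) >= threshold).astype(np.uint8)

def dsc(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2 * (pred & gt).sum() / (pred.sum() + gt.sum() + eps)

# e.g. compare a spatial-only sub-ensemble against the full ensemble:
# dsc(fuse([p_resunet_spatial, p_maskrcnn_spatial]), gt_mask)
# dsc(fuse(all_prob_maps), gt_mask)
```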
Fig. 5 Samples of segmentation with the different models tested. The colors in the overlay images represent the following for each pixel: true positive (TP) yellow, false positive (FP) pink, false negative (FN) blue, true negative (TN) black. The first two rows depict images where the lumen is clear, with the respective segmentation from each model. Rows 3–4 show cases in which some kind of occlusion appears. Finally, rows 5–6 depict cases in which the lumen is contracted and/or there is debris crossing the FOV
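The pixel-wise color coding of Fig. 5 can be reproduced with a few lines of numpy. The RGB values below are plausible stand-ins for the caption's yellow/pink/blue/black, not the authors' exact palette.

```python
import numpy as np

def error_overlay(pred, gt):
    """Color-code each pixel of a binary prediction against the ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    rgb = np.zeros((*pred.shape, 3), dtype=np.uint8)  # TN pixels stay black
    rgb[pred & gt] = (255, 255, 0)      # TP: yellow
    rgb[pred & ~gt] = (255, 105, 180)   # FP: pink
    rgb[~pred & gt] = (0, 0, 255)       # FN: blue
    return rgb
```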