Muhammad Adeel Azam1,2, Claudio Sampieri3,4, Alessandro Ioppi3,4, Stefano Africano3,4, Alberto Vallin3,4, Davide Mocellin3,4, Marco Fragale3,4, Luca Guastini3,4, Sara Moccia5, Cesare Piazza6,7, Leonardo S Mattos1,2, Giorgio Peretti3,4.
Abstract
OBJECTIVES: To assess a new application of artificial intelligence for real-time detection of laryngeal squamous cell carcinoma (LSCC) in both white light (WL) and narrow-band imaging (NBI) videolaryngoscopies based on the You-Only-Look-Once (YOLO) deep learning convolutional neural network (CNN).
Keywords: Larynx cancer; computer-assisted image interpretation; deep learning; narrow band imaging; videolaryngoscopy
Year: 2021 PMID: 34821396 PMCID: PMC9544863 DOI: 10.1002/lary.29960
Source DB: PubMed Journal: Laryngoscope ISSN: 0023-852X Impact factor: 2.970
Fig 1. Laryngeal cancer dataset sample images. The left column contains the original narrow-band imaging (NBI) and white light (WL) laryngoscopy images, while the right column shows the same frames after contrast-limited adaptive histogram equalization. Case A is an in-office WL view of an infrahyoid epiglottic cancer. Case B is an in-office NBI videoframe of a left vocal fold cancer extending to the anterior commissure. Case C is an intraoperative WL videoframe of a right vocal fold cancer. Case D is an intraoperative NBI view of a left vocal fold cancer extending to the bottom of the ventricle.
Fig 2. YOLOv5 architecture representation.
Fig 3. Graphical representation of the Intersection over Union (IoU) calculation on a narrow-band imaging videoframe. The light blue rectangle represents the ground truth bounding box, while the red rectangle represents the model prediction. The IoU is calculated by dividing the overlap area by the total area of union.
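The IoU calculation described in the Fig 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Box` type, the corner-coordinate convention, and the example pixel values are assumptions made for the sketch.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    # Clamp to zero so that an empty or inverted box contributes no area.
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Overlap area divided by the area of the union of the two boxes."""
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    intersection = area(overlap)
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Illustrative values only: a ground truth box and a shifted prediction.
ground_truth = Box(100, 100, 300, 300)
prediction = Box(150, 150, 350, 350)
score = iou(ground_truth, prediction)
print(f"IoU = {score:.3f}, counts as a match at the 0.5 threshold: {score >= 0.5}")
```

With these example boxes the overlap is 150 × 150 pixels and the IoU is about 0.39, so the prediction would not count as a match at the 0.5 threshold used for mAP@.5.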
Performance Evaluation of Various YOLO Models During Training‐Validation and the Testing Phase.
Training and Validation

| Model | Batch Size (No. of Samples) | Image Resolution (Pixels) | Model Parameters (Millions) | Precision | Recall | mAP@.5 |
|---|---|---|---|---|---|---|
| YOLOv5s | 64 | 640 × 640 | 7.06 | 0.712 | 0.538 | 0.576 |
| YOLOv5m | 32 | 640 × 640 | 21.0 | 0.561 | 0.590 | 0.576 |
| YOLOv5l | 16 | 640 × 640 | 46.6 | 0.585 | 0.615 | 0.545 |
| YOLOv5x | 16 | 640 × 640 | 87.3 | 0.55 | 0.628 | 0.571 |
| YOLOv5s6 | 16 | 1280 × 1280 | 12.4 | 0.697 | 0.474 | 0.492 |
| YOLOv5m6 | 8 | 1280 × 1280 | 35.5 | 0.66 | 0.474 | 0.506 |
Values in bold denote the best results during testing.
mAP@.5 = mean Average Precision with an Intersection over Union threshold of 0.5; TP% = rate of true positive predicted bounding boxes among the total number of bounding boxes predicted; FP% = rate of false‐positive predicted bounding boxes among the total number of bounding boxes predicted; TTA = test time augmentation.
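The relationship between these box-level rates can be made concrete with a short Python sketch. It assumes the usual definitions of precision and recall together with the TP% and FP% definitions in the footnote above. The function name and the counts are hypothetical and are not taken from the study.

```python
def detection_rates(tp: int, fp: int, fn: int) -> dict:
    """Box-level rates from counts of true-positive, false-positive, and missed boxes.

    A predicted box is assumed to count as a true positive when its IoU with a
    ground truth box is at least 0.5, as in mAP@.5.
    """
    predicted = tp + fp  # total number of bounding boxes predicted
    actual = tp + fn     # total number of ground truth bounding boxes
    return {
        "precision": tp / predicted if predicted else 0.0,
        "recall": tp / actual if actual else 0.0,
        "tp_pct": 100.0 * tp / predicted if predicted else 0.0,
        "fp_pct": 100.0 * fp / predicted if predicted else 0.0,
    }


# Hypothetical counts, for illustration only.
rates = detection_rates(tp=66, fp=34, fn=22)
print(rates)
```

Under these definitions TP% and FP% share the same denominator, so they sum to 100 whenever at least one box is predicted.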
Fig 4. Performance metrics of the YOLO models during training and validation. Panels (A), (B), and (C) show the Recall, Precision, and mAP@0.5 (mean Average Precision at 0.5 intersection over union) curves, respectively, for training up to 100 epochs.
Fig 5. Examples of automatic laryngeal cancer prediction provided by the ensemble model (YOLOv5s with YOLOv5m—TTA). The first two columns on the left contain images with ground truth bounding boxes, while the two columns on the right contain the same images with YOLO-predicted bounding boxes. Case A is a carcinoma of the infrahyoid epiglottis; case B is a carcinoma of the infrahyoid epiglottis; case C is a carcinoma of the left vocal fold; case D is a carcinoma of the right vocal fold involving the anterior commissure.
Characteristics and Computation Times of the Testing Videos After Applying the Ensemble Model (YOLOv5s with YOLOv5m—TTA) for LSCC Detection.
| Video ID | Size (Mb) | Video Format | Video Resolution | Video Frame Rate (fps) | Total Frame Count | LSCC | Average Computation Time Per Frame (s) |
|---|---|---|---|---|---|---|---|
| 1 | 23.1 | avi | 768 × 576 | 30 | 1321 | Yes | 0.027 |
| 2 | 25.3 | mp4 | 778 × 480 | 25 | 1448 | Yes | 0.034 |
| 3 | 34.9 | avi | 768 × 576 | 30 | 1529 | Yes | 0.025 |
| 4 | 20.6 | mp4 | 778 × 480 | 25 | 1421 | Yes | 0.023 |
| 5 | 27.3 | mp4 | 860 × 480 | 25 | 1519 | Yes | 0.024 |
| 6 | 39.1 | mp4 | 1280 × 720 | 30 | 946 | No | 0.028 |
| Average computation time | | | | | | | 0.026 |
fps = frames per second; LSCC = laryngeal squamous cell carcinoma; Mb = megabytes; TTA = test time augmentation.
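One way to read the computation times in the table is to compare each per-frame time with the time budget implied by the video's own frame rate. The Python sketch below does this arithmetic using the frame rates and times copied from the table. The helper names are hypothetical, and the check is a simplified reading of "real-time" that ignores decoding and display overhead.

```python
from statistics import mean

# (video ID, source frame rate in fps, average computation time per frame in s),
# copied from the table above.
videos = [
    (1, 30, 0.027),
    (2, 25, 0.034),
    (3, 30, 0.025),
    (4, 25, 0.023),
    (5, 25, 0.024),
    (6, 30, 0.028),
]


def processing_fps(seconds_per_frame: float) -> float:
    """Frames the model could process per second at the given per-frame time."""
    return 1.0 / seconds_per_frame


def keeps_up(source_fps: float, seconds_per_frame: float) -> bool:
    """True if each frame is processed before the next source frame arrives."""
    return seconds_per_frame <= 1.0 / source_fps


for video_id, fps, t in videos:
    print(f"video {video_id}: {processing_fps(t):.1f} fps processed, "
          f"source {fps} fps, keeps up: {keeps_up(fps, t)}")

mean_time = mean(t for _, _, t in videos)
print(f"mean of the listed per-frame times: {mean_time:.4f} s")
print(f"throughput at the reported 0.026 s per frame: {processing_fps(0.026):.1f} fps")
```

Under this simplified reading, every listed per-frame time falls within the frame interval of its source video. The reported average of 0.026 s per frame corresponds to roughly 38 frames per second.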
Fig 6. Panel of testing videoframes extracted from six videolaryngoscopies. Each row represents a different video: the first four pictures of every row are extracted from the original video, while the last four images are the same frames extracted after the prediction of the ensemble model (YOLOv5s with YOLOv5m—TTA). The first video (V1) shows a carcinoma of the left vocal fold; the second video (V2) shows a cancer of the right vocal fold; images extracted from the third video (V3) show a carcinoma affecting the laryngeal surface of the suprahyoid epiglottis and a severe dysplasia of the right vocal fold. Lastly, V4 shows a carcinoma of the right vocal fold extending to the anterior commissure, V5 shows a tumor of the left vocal fold, while V6 shows frames extracted from a healthy larynx.