Pablo Gómez, Andreas M Kist, Patrick Schlegel, David A Berry, Dinesh K Chhetri, Stephan Dürr, Matthias Echternach, Aaron M Johnson, Stefan Kniesburges, Melda Kunduk, Youri Maryn, Anne Schützenberger, Monique Verguts, Michael Döllinger.
Abstract
Laryngeal videoendoscopy is one of the main tools in clinical examinations for voice disorders and voice research. Using high-speed videoendoscopy, it is possible to fully capture the vocal fold oscillations; however, processing the recordings typically involves a time-consuming segmentation of the glottal area by trained experts. Even though automatic methods have been proposed and the task is particularly suited for deep learning, there are no public datasets and benchmarks available to compare methods and to allow training of generalizing deep learning models. In an international collaboration of researchers from seven institutions in the EU and USA, we have created BAGLS, a large, multihospital dataset of 59,250 high-speed videoendoscopy frames with individually annotated segmentation masks. The frames are based on 640 recordings of healthy and disordered subjects that were recorded with varying technical equipment by numerous clinicians. The BAGLS dataset will allow an objective comparison of glottis segmentation methods and will enable interested researchers to train their own models and compare them against existing methods.
Year: 2020 PMID: 32561845 PMCID: PMC7305104 DOI: 10.1038/s41597-020-0526-3
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Workflow for creating the BAGLS dataset. Subjects of varying age, gender and health status were examined at different hospitals with differing equipment (camera, light source, endoscope type). The recorded image data are diverse in terms of resolution and quality. Next, the glottis was segmented using manual or semi-automatic techniques and the segmentation was cross-checked. The segmented videos were split into a training and a test set. The test set features equal numbers of frames from each hospital. We validated BAGLS by training a deep neural network and found that it provides segmentations closely matching the manual expert segmentations.
Composition of the dataset by institution of origin. The training data feature 50 or 100 frames per video, depending on video length; the test data feature 50 frames per video.
| Institution | # in training | # in test |
|---|---|---|
| Boston University | 10 | 10 |
| Louisiana State University | 15 | 10 |
| New York University | 14 | 10 |
| Sint-Augustinus Hospital, Wilrijk | 30 | 10 |
| University of California, Los Angeles | 20 | 10 |
| University Hospital Erlangen | 458 | 10 |
| University Hospital of Munich (LMU) | 23 | 10 |
| Total Number of Videos | 570 | 70 |
| Total Number of Frames | 55,750 | 3,500 |
Fig. 2 Age distribution of subjects in the BAGLS dataset.
Overview of voice disorders represented in the BAGLS dataset; multiple disorders per video are possible.
| Disorder Status | # of videos | Disorder Status | # of videos |
|---|---|---|---|
| Healthy | 380 | Contact granuloma | 5 |
| Muscle tension dysphonia | 139 | Paresis | 4 |
| Muscle thyroarytaenoideus atrophy | 25 | Laryngitis | 4 |
| Vocal insufficiency | 18 | Papilloma | 1 |
| Edema | 14 | Leukoplakia | 1 |
| Insufficient glottis closure | 14 | Carcinoma | 1 |
| Nodules | 13 | Other | 8 |
| Polyp | 9 | Unknown status | 50 |
| Cyst | 6 | | |
Overview of the sampling rates and resolutions of the recorded HSV data in the dataset.
| Sampling rate [Hz] | # of videos | Resolution | # of videos |
|---|---|---|---|
| 1000 | 21 | 256 × 120 | 15 |
| 2000 | 17 | 256 × 256 | 88 |
| 3000 | 30 | 288 × 128 | 7 |
| 4000 | 542 | 320 × 256 | 33 |
| 5000 | 1 | 352 × 208 | 30 |
| 6000 | 2 | 352 × 256 | 11 |
| 8000 | 26 | 512 × 96 | 1 |
| 10000 | 1 | 512 × 128 | 22 |
| | | 512 × 256 | 431 |
| | | 512 × 512 | 2 |
Fig. 3 The BAGLS dataset. (a) The folder structure of the dataset. We provide two folders for training and test, respectively, containing image/segmentation pairs, and one folder that contains the raw data (folder “raw”). (b) Exemplary images from the dataset next to the binary segmentation mask as present in the folders shown on the left. Note differing aspect ratios and image properties such as black image borders. (c) An exemplary subset from raw video 0.mp4 where two open/close cycles are visible. For illustration purposes, we cropped the image (blue frame). Consecutive frame ids (1…30) are in the lower right corner.
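Because the training and test folders pair each frame image with its binary segmentation mask, enumerating the dataset reduces to matching filenames. A minimal sketch, assuming a `<id>.png` / `<id>_seg.png` naming convention (the exact scheme should be checked against the released files):

```python
from pathlib import Path

def pair_paths(folder: str):
    """Yield (frame, mask) path pairs from a BAGLS-style folder,
    assuming '<id>.png' frames and '<id>_seg.png' masks."""
    root = Path(folder)
    for frame in sorted(root.glob("[0-9]*.png")):
        if frame.stem.endswith("_seg"):
            continue  # this file is itself a mask
        mask = root / f"{frame.stem}_seg.png"
        if mask.exists():
            yield frame, mask
```

Applied to the test folder, this would yield its frame/mask pairs, which can then be loaded with any image library.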
Overview of cameras used to record the HSV.
| Camera | # of videos |
|---|---|
| KayPentax HSV 9700 (Photron) | 16 |
| KayPentax HSV 9710 (Photron) | 495 |
| HERS 5562 Endocam Wolf | 79 |
| Phantom v210 | 30 |
| FASTCAM Mini AX100 540K-C-16GB | 20 |
Overview of the light sources and endoscopes used to record the HSV data.
| Endoscope type | # of videos | Light source | # of videos |
|---|---|---|---|
| Oral 70° | 543 | Kay Pentax Model 7152B Xenon Light | 491 |
| Oral 90° | 46 | Wolf 300 W Xenon | 79 |
| Nasal 2.4 mm | 9 | CUDA Surgical E300 Xenon | 40 |
| Nasal 3.5 mm | 12 | N/A | 30 |
| N/A | 30 | | |
Example metadata for a given frame (BAGLS test/0.meta).
| Key | Value |
|---|---|
| Video Id | 27 |
| Camera | HERS 5562 Endocam Wolf |
| Sampling rate (Hz) | 4000 |
| Video resolution (px, HxW) | [256, 256] |
| Color | false |
| Endoscope orientation | 70° |
| Endoscope application | oral |
| Age range (yrs) | 90–100 |
| Subject sex | f |
| Subject disorder status | spasmodic dysphonia |
| Segmenter | 1 |
| Post-processed | 2 |
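Each frame's `.meta` file is a small per-frame JSON record. The snippet below rebuilds the example above as a JSON string and reads it back; the key names mirror the table, though the exact serialization in the released files should be verified:

```python
import json

# Hypothetical reconstruction of test/0.meta; keys follow the example table.
meta_text = """{
  "Video Id": 27,
  "Camera": "HERS 5562 Endocam Wolf",
  "Sampling rate (Hz)": 4000,
  "Video resolution (px, HxW)": [256, 256],
  "Color": false,
  "Endoscope orientation": "70\\u00b0",
  "Endoscope application": "oral",
  "Age range (yrs)": "90-100",
  "Subject sex": "f",
  "Subject disorder status": "spasmodic dysphonia",
  "Segmenter": 1,
  "Post-processed": 2
}"""

meta = json.loads(meta_text)
height, width = meta["Video resolution (px, HxW)"]  # H x W in pixels
print(meta["Camera"], height, width)
```

In practice one would replace `meta_text` with the contents of the `.meta` file, e.g. `json.loads(Path("test/0.meta").read_text())`.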
Technical Validation of External Expert Segmentations.
| Additional Expert Id | IoU |
|---|---|
| E1 | 0.745 |
| E2 | 0.749 |
| E3 | 0.798 |
| E4 | 0.796 |
| Average | 0.772 |
Fig. 4 Evaluation of model performance using the Intersection over Union (IoU). (a) Cumulative distribution of the IoU across the validation (magenta) and test (blue) sets. Shaded error shows the 95% confidence interval of bootstrapped distributions. (b) Distribution of IoUs against segmented area in the ground truth. Left: validation set; right: test set. (c) Example images and segmentations for IoUs close to 0.25, 0.5 and 0.75. Intersection of segmented pixels in the ground truth and prediction in green. Blue and red pixels were classified as glottis only in the ground truth and only in the prediction, respectively. (d) An example video from the BAGLS dataset (0.mp4, same subset as in Fig. 3c) segmented (orange overlay) using the trained model, with the respective glottal area waveform (sum of segmented pixels over time).
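The two quantities in Fig. 4, the IoU and the glottal area waveform, are both straightforward to compute from binary masks. A self-contained sketch with toy data (not the original BAGLS evaluation code):

```python
import numpy as np

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection over Union of two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 1.0  # both masks empty: agreement

def glottal_area_waveform(masks: np.ndarray) -> np.ndarray:
    """Segmented pixel count per frame for a (T, H, W) mask stack."""
    return masks.reshape(masks.shape[0], -1).sum(axis=1)

# Toy example: two 4x4 squares offset by one row.
gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), bool); pred[3:7, 2:6] = True
print(round(iou(gt, pred), 3))  # intersection 12 px, union 20 px -> 0.6
```

The waveform in Fig. 4d is then just `glottal_area_waveform` applied to the stack of per-frame segmentations.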
| Measurement(s) | glottis • Image Segmentation |
|---|---|
| Technology Type(s) | Endoscopic Procedure • neural network model |
| Factor Type(s) | age • sex • healthy versus disordered subjects • recording conditions |
| Sample Characteristic - Organism | Homo sapiens |