Xavier P Burgos-Artizzu, David Coronado-Gutiérrez, Brenda Valenzuela-Alcaraz, Elisenda Bonet-Carne, Elisenda Eixarch, Fatima Crispi, Eduard Gratacós.
Abstract
The goal of this study was to evaluate the maturity of current Deep Learning classification techniques for their application in a real maternal-fetal clinical environment. A large dataset of routinely acquired maternal-fetal screening ultrasound images (which will be made publicly available) was collected from two different hospitals by several operators and ultrasound machines. All images were manually labeled by an expert maternal-fetal clinician. Images were divided into 6 classes: four of the most widely used fetal anatomical planes (Abdomen, Brain, Femur and Thorax), the mother's cervix (widely used for prematurity screening) and a general category to include any other less common image plane. Fetal brain images were further categorized into the 3 most common fetal brain planes (Trans-thalamic, Trans-cerebellum, Trans-ventricular) to judge fine-grained categorization performance. The final dataset comprises over 12,400 images from 1,792 patients, making it the largest ultrasound dataset to date. We then evaluated a wide variety of state-of-the-art deep Convolutional Neural Networks on this dataset and analyzed the results in depth, comparing the computational models to research technicians, who currently perform this task daily. Results indicate for the first time that computational models perform similarly to humans when classifying common planes in human fetal examination. However, the dataset leaves the door open for future research to further improve results, especially on fine-grained plane categorization.
Year: 2020 PMID: 32576905 PMCID: PMC7311420 DOI: 10.1038/s41598-020-67076-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Maternal-fetal US categories from our dataset: image examples, total number of images and frequency of occurrence (p). Fetus drawing was downloaded from openclipart.org, a creative commons repository of images for public domain use.
Figure 2. Fine-grained brain labels: image examples.
Maternal-fetal US dataset statistics: anatomical planes labeled, number of patients, number of images, US machines and Operators.
| Anatomical plane | Clinical use | N. patients | N. images |
|---|---|---|---|
| Fetal abdomen | Morphology, fetal weight | 595 | 711 |
| Fetal brain | Neuro-development, fetal weight | 1,082 | 3,092 |
| Fetal brain: Trans-thalamic | | 909 | 1,638 |
| Fetal brain: Trans-cerebellum | | 575 | 714 |
| Fetal brain: Trans-ventricular | | 446 | 597 |
| Fetal femur | Fetal weight | 754 | 1,040 |
| Fetal thorax | Heart and lung development | 755 | 1,718 |
| Maternal cervix | Prematurity | 917 | 1,626 |
| Other | Several | 734 | 4,213 |
| TOTAL | | 1,792 | 12,400 |

| US machine | N. patients | N. images | Operator | N. patients | N. images |
|---|---|---|---|---|---|
| Voluson E6 | 807 | 5,862 | Op. 1 | 407 | 2,792 |
| Voluson S10 | 91 | 1,082 | Op. 2 | 344 | 2,435 |
| Aloka | 270 | 3,560 | Op. 3 | 270 | 3,560 |
| Others | 631 | 1,896 | Others | 803 | 3,613 |
Results of the wide variety of classification CNNs tested for maternal-fetal common plane recognition.
| Net | Params | Layers | Speed (Hz) | top-1 err (%) | top-3 err (%) | class-acc (%) |
|---|---|---|---|---|---|---|
| VGG-M | 99 M | 25 | 110 | 12.9 | 0.66 | 84.8 ± 9.3 |
| VGG-16 | 134 M | 54 | 40 | 7.9 | 0.55 | 92.1 ± 6.1 |
| MobileNet | 2 M | 157 | 20 | 10.7 | 0.82 | 87.5 ± 9.6 |
| Inception-v3 | 22 M | 318 | 12 | 6.5 | 0.34 | 93.5 ± 5.0 |
| ResNet-18 | 11 M | 73 | 66 | 7.6 | 0.47 | 92.5 ± 5.9 |
| ResNet-34 | 21 M | 129 | 38 | 7.6 | 0.76 | 92.5 ± 5.8 |
| ResNet-50 | 24 M | 177 | 25 | 6.8 | 0.32 | 93.1 ± 5.4 |
| ResNet-101 | 43 M | 347 | 10 | 6.7 | 0.23 | 93.4 ± 5.2 |
| ResNet-152 | 58 M | 517 | 5 | 6.5 | 0.21 | 92.8 ± 5.5 |
| ResNeXt-50 | 23 M | 179 | 25 | 7.3 | 0.34 | 92.7 ± 5.9 |
| ResNeXt-101 | 42 M | 348 | 13 | 6.5 | 0.55 | 94.0 ± 4.8 |
| SENet | 113 M | 773 | 2 | 7.6 | 0.27 | 92.9 ± 5.9 |
| SE-ResNet-50 | 26 M | 256 | 14 | 7.6 | 0.76 | 93.3 ± 6.1 |
| SE-ResNet-101 | 47 M | 511 | 6 | 7.0 | 0.36 | 93.3 ± 6.1 |
| SE-ResNet-152 | 47 M | 511 | 3 | 7.5 | 0.32 | 92.7 ± 6.0 |
| SE-ResNeXt-50 | 25 M | 256 | 15 | 7.1 | 0.21 | 92.7 ± 5.7 |
| SE-ResNeXt-101 | 47 M | 511 | 6 | 7.1 | 0.27 | 92.7 ± 5.8 |
| DenseNet-121 | 8 M | 428 | 11 | 7.1 | 0.32 | 92.9 ± 5.8 |
| DenseNet-169 | 14 M | 596 | 7 | 6.2 | 0.27 | 93.6 ± 5.1 |
| Baseline1 (PCA + Boosting) | n/a | n/a | 41 | 39.6 | 10.4 | 54.7 ± 37.6 |
| Baseline2 (HOG + Boosting) | n/a | n/a | 28 | 25.5 | 5.2 | 68.6 ± 28.8 |
DenseNet-169 is the best-performing model in terms of top-1 error. Inception-v3 and ResNeXt-101 are the best-performing models when the trade-off between speed and performance is taken into account.
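The top-1 and top-3 error columns can be read as the percentage of test images whose true class is not among the 1 or 3 highest-scoring classes. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `topk_error` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def topk_error(scores: Sequence[Sequence[float]],
               labels: Sequence[int],
               k: int = 1) -> float:
    """Percentage of samples whose true label is NOT among the k top-scoring classes.

    scores[i][c] is the model score for class c on sample i (hypothetical layout).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    misses = 0
    for row, true_label in zip(scores, labels):
        # Class indices ordered by descending score; sorted() is stable,
        # so tied classes keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: -row[c])
        if true_label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)


# Toy example with 3 classes and 4 samples (values are made up).
toy_scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
    [0.20, 0.50, 0.30],
]
toy_labels = [0, 1, 2, 1]
print(topk_error(toy_scores, toy_labels, k=1))
print(topk_error(toy_scores, toy_labels, k=3))
```

With six classes, as in the common-planes task, top-3 error only counts images where the true class falls in the bottom half of the ranking, which is why it is far lower than top-1 error for every network in the table.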
Figure 3. Results on common planes classification. Confusion matrices on the 896 test patients (5,271 images) are shown. Matrix rows show the true class, labeled by our expert maternal-fetal clinician. Top-1 error and mean ± std of the diagonal (class-acc) are shown. Matrix columns are the predictions from (a,b) our two research technicians and (c) the DenseNet-169 model.
Figure 4. Results on fine-grained brain categorization. Confusion matrices on the 536 test patients (1,406 images). Matrix rows show the true class, labeled by our expert maternal-fetal clinician. Top-1 error and mean ± std of the diagonal (class-acc) are shown. Matrix columns are the predictions from (a,b) our two research technicians and (c) the DenseNet-169 model. All numbers are shown as percentages. NOT-A-BRAIN: mistaken for something other than a brain.
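The two summary numbers quoted in these captions can both be derived from a confusion matrix of raw counts. The sketch below is an illustration under stated assumptions: rows are true classes, the diagonal is row-normalized to percentages, and the spread is the population standard deviation (the source does not say whether population or sample std was used). The name `confusion_metrics` and the toy matrix are hypothetical.

```python
from statistics import mean, pstdev


def confusion_metrics(cm):
    """Return (top-1 error %, mean class accuracy %, std of class accuracy %).

    cm[i][j] = number of images of true class i predicted as class j.
    Every true class must have at least one image.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    top1_err = 100.0 * (total - correct) / total
    # Row-normalized diagonal: accuracy within each true class.
    per_class = [100.0 * cm[i][i] / sum(cm[i]) for i in range(n_classes)]
    # Assumption: population std (pstdev); swap in stdev for the sample version.
    return top1_err, mean(per_class), pstdev(per_class)


# Toy 2-class matrix (made-up counts).
toy_cm = [
    [90, 10],
    [20, 80],
]
err, acc_mean, acc_std = confusion_metrics(toy_cm)
print(err, acc_mean, acc_std)
```

Because class-acc averages the classes without weighting by their size, it does not have to equal 100 minus the top-1 error when classes are imbalanced.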
Ablation study results using Inception.
| Experiment | top-1 err (%) | top-3 err (%) | class-acc (%) |
|---|---|---|---|
| No ImageNet pre-training | 14.1 | 0.89 | 84.7 ± 11.6 |
| Last layer training only | 10.7 | 0.57 | 87.6 ± 9.7 |
| No data augmentation | 8.1 | 0.66 | 92.2 ± 6.6 |
| Data reweighting | 7.4 | 0.36 | 92.4 ± 6.0 |
| Baseline | 6.5 | 0.34 | 93.5 ± 5.0 |
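The "data reweighting" row refers to compensating for class imbalance during training. The exact scheme is not given here, so the sketch below uses inverse-frequency weighting purely as one common choice, which is an assumption and not necessarily what the authors did. The counts are the per-class image totals from the dataset statistics table (whole dataset, not the training split), and the function name is hypothetical.

```python
# Per-class image counts from the dataset statistics table (whole dataset).
class_counts = {
    "Fetal abdomen": 711,
    "Fetal brain": 3092,
    "Fetal femur": 1040,
    "Fetal thorax": 1718,
    "Maternal cervix": 1626,
    "Other": 4213,
}


def inverse_frequency_weights(counts):
    """Assumed scheme: weight_c = total / (n_classes * count_c).

    After weighting, every class contributes the same effective mass
    (total / n_classes) to the training loss.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for name, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name}: {w:.2f}")
```

In the ablation table, reweighting gives a higher top-1 error than the unweighted baseline (7.4% vs 6.5%), so a scheme like this is not guaranteed to help on this dataset.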
Figure 5. Performance vs number of training patients/images. Performance of the Inception CNN on the test set as a function of the number of training patients used.
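A curve like this requires training subsets of increasing size drawn at the patient level, so that all images of a selected patient enter the subset together. The sketch below shows one way to draw such a subset; the record layout `(patient_id, image_id)`, the function name and the seed handling are assumptions for illustration, not the authors' protocol.

```python
import random
from collections import defaultdict


def subsample_by_patient(records, n_patients, seed=0):
    """Pick n_patients at random and return all of their images.

    records: iterable of (patient_id, image_id) pairs (hypothetical layout).
    Returns a dict mapping each chosen patient_id to its list of image_ids.
    """
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)
    if n_patients > len(by_patient):
        raise ValueError("not enough patients in records")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = rng.sample(sorted(by_patient), n_patients)
    return {pid: by_patient[pid] for pid in chosen}


# Toy records: 3 patients, 5 images (made-up identifiers).
toy_records = [("p1", "a"), ("p1", "b"), ("p2", "c"), ("p3", "d"), ("p3", "e")]
subset = subsample_by_patient(toy_records, n_patients=2, seed=1)
print(sum(len(images) for images in subset.values()), "images selected")
```

Sampling patients instead of individual images keeps the x-axis of the curve meaningful in both units, since the image count of each subset follows from the patients it contains.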
Model transferability between US machines and operators.
| Training set (N images) | Test: Operator 1 (874) | Test: Operator 2 (946) | Test: Operator 3 (1,460) | Test: ALL (5,271) |
|---|---|---|---|---|
| Operator 1 (1,918) | 9.7 | 13.0 | 33.0 | 18.9 |
| Operator 2 (1,489) | 17.5 | 10.6 | 32.7 | 23.0 |
| Operator 3 (2,100) | 24.8 | 24.4 | 4.3 | 19.6 |
| ALL (7,129) | 7.2 | 7.5 | 3.7 | 6.5 |
Images were divided into categories according to US machine and operator, keeping those categories with at least 2,000 images in total. Each category was then used to train an Inception CNN (table rows), which was tested on sets built using images from all categories (columns). Numbers are top-1 error on the test set (%). Operators 1 and 2 always used a Voluson E6 US machine, while Operator 3 exclusively used an Aloka US machine.
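The procedure described above (keep the categories with enough images, train on each, test on every category) can be sketched as a small grid evaluation. This is only an outline: `train_fn` and `error_fn` are hypothetical placeholders for the actual CNN training and top-1 error computation, and the function names are illustrative.

```python
from collections import Counter


def eligible_groups(group_of_image, min_images=2000):
    """Names of the groups (operator or machine) with at least min_images images.

    group_of_image: one group name per image (hypothetical layout).
    """
    counts = Counter(group_of_image)
    return sorted(g for g, n in counts.items() if n >= min_images)


def transfer_matrix(train_sets, test_sets, train_fn, error_fn):
    """Train one model per training set and evaluate it on every test set.

    train_fn(train_data) -> model and error_fn(model, test_data) -> error (%)
    are caller-supplied placeholders.
    Returns {(train_name, test_name): error}.
    """
    results = {}
    for train_name, train_data in train_sets.items():
        model = train_fn(train_data)  # one model per table row
        for test_name, test_data in test_sets.items():
            results[(train_name, test_name)] = error_fn(model, test_data)
    return results
```

In the table, the diagonal entries (train and test on the same operator) are much lower than the off-diagonal ones, and the row trained on all images is lowest in every column.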