Vito Paolo Pastore, Matteo Moro, Francesca Odone.
Abstract
VisionTool is an open-source Python toolbox for semantic feature extraction, capable of providing accurate feature detectors for different applications, including motion analysis, markerless pose estimation, face recognition and biological cell tracking. VisionTool leverages transfer learning with a large variety of deep neural networks, allowing high-accuracy feature detection with little training data. The toolbox offers a friendly graphical user interface that efficiently guides the user through the entire process of feature extraction. To facilitate broad usage and scientific community contribution, the code and a user guide are available at https://github.com/Malga-Vision/VisionTool.git.
Year: 2022 PMID: 35831385 PMCID: PMC9279291 DOI: 10.1038/s41598-022-16014-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Example of VisionTool’s annotation GUI. The user can annotate keypoints of interest with the mouse and visualize images with the predictions overlaid on them.
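Segmentation networks like those used in this kind of pipeline are commonly trained to regress per-keypoint confidence maps rather than raw coordinates; a minimal sketch of encoding annotated keypoints as Gaussian heat-maps (the image size and σ are illustrative assumptions, not VisionTool’s settings):

```python
import numpy as np

def keypoint_heatmap(shape, center, sigma=4.0):
    """Render one keypoint as a 2-D Gaussian confidence map."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# One map per annotated keypoint, stacked into the training target.
keypoints = [(20, 30), (50, 10)]          # (row, col) annotations
target = np.stack([keypoint_heatmap((64, 64), kp) for kp in keypoints])
print(target.shape)        # (2, 64, 64)
print(target[0, 20, 30])   # peak value 1.0 at the annotation
```

The network then learns to reproduce these maps, and the annotation GUI only needs to store one pixel coordinate per keypoint.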
Figure 2. Examples of challenges in the datasets included in this work. (a, b) MOCA keypoint occlusion in the egocentric (a) and frontal (b) points of view for the action pouring-multi. In these views the little finger is commonly occluded by the index finger, and the wrist by the hand. (c, d) Mouth keypoints can be correctly annotated with a difference of several pixels: (c) ground truth; (d) example of a different manual annotation. (e–g) Stentor coeruleus contracting and relaxing during different stages of swimming, with significant shape changes.
VisionTool’s detection accuracy as a function of the number of annotated frames on the MOCA dataset. A LinkNet with EfficientNetb1 backbone is trained on (i) 10, (ii) 25 and (iii) 50 frames and used to predict the remaining ones, for each of the 10 lateral-view videos included in the evaluation subset. The results reported in this table correspond to the average mAP computed across the whole subset of videos.
| # Frames | mAP | mAP | mAP | mAP | mAP | mAP | mAP | mAP |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.888 | 0.869 | 0.818 | 0.855 | 0.866 | 0.882 | 0.890 | 0.862 |
| 25 | 0.979 | 0.966 | 0.852 | 0.862 | 0.967 | 0.888 | 0.890 | 0.892 |
| 50 | 0.984 | 0.977 | 0.949 | 0.972 | 0.975 | 0.986 | 0.989 |
Best results are in bold.
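A keypoint-detection mAP of this kind typically counts a prediction as correct when it lies within a pixel threshold of the ground-truth location; a simplified sketch of such an accuracy measure (the threshold value and the synthetic data are illustrative assumptions, not the paper’s exact protocol):

```python
import numpy as np

def detection_accuracy(pred, gt, thresh=5.0):
    """Fraction of keypoints predicted within `thresh` pixels of ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)   # per-keypoint error in pixels
    return float(np.mean(dist <= thresh))

gt   = [(20, 30), (50, 10), (40, 40), (12, 60)]
pred = [(21, 31), (50, 14), (60, 40), (12, 59)]  # third point is far off
print(detection_accuracy(pred, gt))   # 0.75: 3 of 4 within 5 px
```

Averaging such per-frame scores over all predicted frames of a video, and then over videos, yields a single number per configuration like those in the table.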
VisionTool’s detection accuracy on the MOCA dataset with respect to neural networks and backbones. The four neural networks (FPN, LinkNet, PSPNet and Unet) are combined with the EfficientNetb1 and ResNet50 backbones. Each model is trained on the 50 annotated frames and used to predict the remaining ones, for each of the 10 lateral-view videos included in the evaluation subset. The results reported in this table correspond to the average mAP computed across the whole subset of videos.
| Net/Backbone | mAP | mAP | mAP | mAP | mAP | mAP | mAP | mAP |
|---|---|---|---|---|---|---|---|---|
| FPN/Efficientb1 | 0.991 | 0.984 | 0.957 | 0.973 | 0.981 | 0.987 | 0.991 | |
| FPN/ResNet50 | 0.975 | 0.944 | 0.874 | 0.949 | 0.923 | 0.978 | 0.985 | 0.942 |
| LinkNet/Efficientb1 | 0.992 | 0.974 | 0.945 | 0.985 | 0.971 | 0.976 | 0.969 | |
| LinkNet/ResNet50 | 0.858 | 0.849 | 0.769 | 0.786 | 0.859 | 0.788 | 0.890 | 0.819 |
| PSPNet/Efficientb1 | 0.987 | 0.957 | 0.894 | 0.929 | 0.928 | 0.957 | 0.949 | 0.931 |
| PSPNet/ResNet50 | 0.983 | 0.914 | 0.803 | 0.867 | 0.850 | 0.927 | 0.935 | 0.876 |
| Unet/Efficientb1 | 0.981 | 0.962 | 0.976 | 0.978 | 0.984 | 0.976 | 0.970 | |
| Unet/ResNet50 | 0.952 | 0.945 | 0.848 | 0.875 | 0.878 | 0.887 | 0.975 | 0.893 |
Best results are in bold.
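Each of the networks above outputs one confidence map per keypoint, so turning a prediction into a coordinate reduces to locating the peak response in each map; a minimal sketch with synthetic maps standing in for network output:

```python
import numpy as np

def decode_keypoints(maps):
    """Return the (row, col) of the maximum response in each confidence map."""
    coords = []
    for m in maps:
        idx = np.argmax(m)                          # flat index of the peak
        coords.append(tuple(int(v) for v in np.unravel_index(idx, m.shape)))
    return coords

maps = np.zeros((2, 64, 64))
maps[0, 20, 30] = 0.9     # synthetic peaks in place of real network output
maps[1, 50, 10] = 0.8
print(decode_keypoints(maps))   # [(20, 30), (50, 10)]
```

This decoding step is identical regardless of which backbone produced the maps, which is what makes the network/backbone comparison above an apples-to-apples one.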
Neural network and backbone complexity in terms of floating-point operations per second (FLOPS), number of parameters and number of layers.
| Net/Backbone | FLOPS (billions) | # Params (millions) | # Layers |
|---|---|---|---|
| FPN/Efficientb1 | 12.80 | 0.96 | 379 |
| FPN/ResNet50 | 6.43 | 2.69 | 237 |
| LinkNet/Efficientb1 | 8.04 | 0.86 | 388 |
| LinkNet/ResNet50 | 2.09 | 2.88 | 246 |
| PSPNet/Efficientb1 | 1.76 | 0.18 | 142 |
| PSPNet/ResNet50 | 0.90 | 0.39 | 116 |
| Unet/Efficientb1 | 8.72 | 1.26 | 373 |
| Unet/ResNet50 | 2.58 | 3.26 | 231 |
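Counts like those in the table can be approximated analytically, layer by layer; a back-of-envelope sketch for a single 2-D convolution, using the standard textbook formulas rather than numbers extracted from these specific models (the layer dimensions are illustrative):

```python
def conv2d_cost(h, w, c_in, c_out, k=3):
    """Standard per-layer estimates: params include one bias per filter;
    FLOPs count one multiply + one add per weight per output position."""
    params = (k * k * c_in + 1) * c_out
    flops = 2 * k * k * c_in * c_out * h * w
    return params, flops

p, f = conv2d_cost(h=112, w=112, c_in=64, c_out=128, k=3)
print(f"{p/1e6:.3f} M params, {f/1e9:.2f} G FLOPs")
# 0.074 M params, 1.85 G FLOPs
```

Summing such estimates over every layer of a network gives totals comparable (up to pooling, activation and normalization overhead) to the table entries.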
VisionTool’s detection accuracy on the MOCA dataset, when used as an annotator. A Unet with EfficientNetb1 backbone is trained on 50 frames and used to predict the remaining ones, for each of the 60 videos included in the dataset. The results reported in this table correspond to the average mAP computed across the whole set of videos.
| View point | mAP | mAP | mAP | mAP | mAP | mAP | mAP | mAP |
|---|---|---|---|---|---|---|---|---|
| All together | 0.992 | 0.987 | 0.974 | 0.945 | 0.985 | 0.971 | 0.976 | 0.970 |
VisionTool’s detection accuracy on the MOCA dataset. A k-fold (k = 5) approach is used for each view point (i.e., the detectors are trained on four folds and the remaining one is predicted). The results reported in the table correspond to the average mAP computed across the different folds.
| View point | mAP | mAP | mAP | mAP | mAP | mAP | mAP | mAP |
|---|---|---|---|---|---|---|---|---|
| Lateral | 0.969 | 0.905 | 0.865 | 0.845 | 0.889 | 0.958 | 0.988 | 0.909 |
| Egocentric | 0.962 | 0.929 | 0.925 | 0.789 | 0.963 | 0.922 | 0.978 | 0.915 |
| Frontal | 0.957 | 0.858 | 0.861 | 0.907 | 0.836 | 0.930 | 0.992 | 0.905 |
| All together | 0.954 | 0.904 | 0.880 | 0.821 | 0.912 | 0.949 | 0.980 | 0.908 |
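The five-fold protocol above trains on four folds of frames and predicts the held-out one, rotating the test fold; a minimal sketch of such a split (the shuffling scheme and frame count are assumptions):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n frame indices and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    return np.array_split(order, k)

folds = kfold_indices(n=100, k=5)
for i, test_fold in enumerate(folds):
    # Train on the other four folds, predict the held-out one.
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(train) == 80 and len(test_fold) == 20
    assert set(train).isdisjoint(test_fold)
print([len(f) for f in folds])   # [20, 20, 20, 20, 20]
```

Averaging the per-fold mAP then gives one value per view point, as in the table.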
Facial keypoint detection accuracy in terms of mAP. EfficientNetb1 is used as the backbone for the four neural networks implemented in VisionTool. The 15 detected keypoints are divided into 4 semantic groups, as explained in the “Facial keypoints detection dataset” section.
| Net/Backbone | mAP | mAP | mAP | mAP | mAP | mAP | mAP |
|---|---|---|---|---|---|---|---|
| FPN/Efficientb1 | 0.998 | 0.958 | 0.791 | 0.939 | 0.739 | 0.926 | |
| LinkNet/Efficientb1 | 0.998 | 0.950 | 0.771 | 0.920 | 0.724 | 0.908 | 0.838 |
| PSPNet/Efficientb1 | 0.992 | 0.896 | 0.742 | 0.915 | 0.636 | 0.878 | 0.803 |
| Unet/Efficientb1 | 0.994 | 0.934 | 0.749 | 0.905 | 0.708 | 0.896 | 0.824 |
Best results are in bold.
Plankton cell-center detection accuracy in terms of mAP. EfficientNetb1 is used as the backbone for the four neural networks implemented in VisionTool.
| Net/Backbone | mAP | mAP | mAP |
|---|---|---|---|
| FPN/Efficientb1 | 0.980 | 0.919 | |
| LinkNet/Efficientb1 | 0.951 | 0.837 | 0.839 |
| PSPNet/Efficientb1 | 0.942 | 0.776 | 0.784 |
| Unet/Efficientb1 | 0.976 | 0.919 | 0.907 |
Best results are in bold.
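For cell-center detection, the predicted confidence map must be reduced to a single coordinate per cell; one common choice is the intensity-weighted centroid of the above-threshold response, sketched below (the threshold value and blob are illustrative assumptions):

```python
import numpy as np

def cell_center(conf_map, thresh=0.5):
    """Intensity-weighted centroid of the above-threshold response."""
    w = conf_map * (conf_map >= thresh)
    ys, xs = np.mgrid[0:conf_map.shape[0], 0:conf_map.shape[1]]
    total = w.sum()
    return (float((ys * w).sum() / total), float((xs * w).sum() / total))

m = np.zeros((32, 32))
m[10:13, 20:23] = 1.0     # synthetic blob standing in for a predicted cell
print(cell_center(m))     # (11.0, 21.0)
```

Unlike the single-pixel argmax, the centroid is robust to a flat or slightly noisy peak, which suits deformable organisms such as the contracting Stentor shown in Figure 2.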
Figure 3. VisionTool’s workflow description.