| Literature DB >> 36110895 |
Tim Johannes Hartmann¹,², Julien Ben Joachim Hartmann³, Ulrike Friebe-Hoffmann², Christiane Lato², Wolfgang Janni², Krisztian Lato².
Abstract
Introduction: To date, most approaches to facial expression recognition rely on two-dimensional images, although advanced approaches using three-dimensional data exist. The latter, however, demand stationary apparatus and thus lack portability and scalability of deployment. As human emotions, intent, and even diseases may manifest in distinct facial expressions or changes therein, there is a clear need for a portable yet capable solution. Given the superior informative value of three-dimensional data on facial morphology, and because certain syndromes are expressed in specific facial dysmorphisms, such a solution should allow portable acquisition of true three-dimensional facial scans in real time. In this study, we present a novel solution for the three-dimensional acquisition of facial geometry data and the recognition of facial expressions from it. The technology presented here requires only a smartphone or tablet with an integrated TrueDepth camera and enables real-time acquisition of the facial geometry and its categorization into distinct facial expressions.

Material and Methods: Our approach consisted of two parts. First, training data were acquired by asking a collective of 226 medical students to adopt defined facial expressions while their current facial morphology was captured by our specially developed app running on iPads placed in front of them. The facial expressions to be shown by the participants were "disappointed", "stressed", "happy", "sad", and "surprised". Second, the data were used to train a self-normalizing neural network. A set of all factors describing the facial expression at a given moment is referred to as a "snapshot".

Results: In total, over half a million snapshots were recorded in the study. The network achieved an overall accuracy of 80.54% after 400 epochs of training. On the test set, an overall accuracy of 81.15% was determined. Recall values differed by snapshot category and ranged from 74.79% for "stressed" to 87.61% for "happy". Precision showed similar results, with "sad" achieving the lowest value at 77.48% and "surprised" the highest at 86.87%.

Conclusions: The present work demonstrates that respectable results can be achieved even with data sets that pose some challenges. Through various measures already incorporated into an optimized version of our app, the training results are expected to improve significantly and become more precise in the future. A follow-up study using the new version of our app, which encompasses the suggested alterations and adaptations, is currently being conducted. We aim to build a large and open database of facial scans, not only for facial expression recognition but also for disease recognition and for monitoring treatment progress.
Keywords: disease recognition; facial expression recognition; facial geometry; self-normalizing neural networks
Year: 2022 PMID: 36110895 PMCID: PMC9470291 DOI: 10.1055/a-1866-2943
Source DB: PubMed Journal: Geburtshilfe Frauenheilkd ISSN: 0016-5751 Impact factor: 2.754
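As a concrete illustration of the data representation described in the abstract, the following Python sketch shows how a "snapshot" and its category label could be encoded for training. The number of factors per snapshot (52, the number of ARKit blendshape coefficients) and the encoding details are assumptions for illustration, not values taken from the paper.

```python
# A minimal sketch of a "snapshot" and its one-hot label.
# Assumption: each snapshot is a fixed-length vector of facial-geometry
# factors (here 52, the ARKit blendshape coefficient count).
import numpy as np

CATEGORIES = ["disappointed", "stressed", "happy", "sad", "surprised"]
N_FACTORS = 52  # assumption: one value per ARKit blendshape coefficient

def encode_label(category: str) -> np.ndarray:
    """One-hot encode a facial-expression category."""
    vec = np.zeros(len(CATEGORIES), dtype=np.float32)
    vec[CATEGORIES.index(category)] = 1.0
    return vec

# Example: one (random) snapshot together with its label.
snapshot = np.random.rand(N_FACTORS).astype(np.float32)
label = encode_label("happy")
```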
Fig. 1 The screen shown to the participants during the examination: it displayed the instruction for the facial expression to adopt ("Bitte schauen Sie nun traurig" – "Please look sad now") and an "X" marking the preferred position on the screen to look at, together with the corresponding instruction ("Bitte die ganze Zeit auf das X schauen" – "Please look at the X the whole time").
Fig. 2 Schematic representation of the network architecture, based on the plot_model function of the TensorFlow/Keras environment [78, 80]. The input layer is shown in the first box, followed by several levels of "Dense" and "AlphaDropout" layers and, finally, the output layer. A curly bracket next to a box indicates that the corresponding section is repeated the given number of times.
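To make the depicted architecture concrete, here is a minimal Keras sketch of a self-normalizing network of the kind Fig. 2 shows: repeated Dense/AlphaDropout blocks with SELU activations and a softmax output. All hyperparameters (block count, layer width, dropout rate, optimizer) are illustrative assumptions, not the paper's actual values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FACTORS = 52       # assumption: factors per snapshot
N_CLASSES = 5        # the five facial-expression categories
N_BLOCKS = 4         # assumption: repeated Dense/AlphaDropout blocks
WIDTH = 128          # assumption: units per Dense layer
DROPOUT_RATE = 0.05  # assumption

def build_snn() -> tf.keras.Model:
    """Build a self-normalizing classifier (SELU + AlphaDropout)."""
    inputs = layers.Input(shape=(N_FACTORS,))
    x = inputs
    for _ in range(N_BLOCKS):
        # SELU with lecun_normal initialization is what makes the
        # network self-normalizing; AlphaDropout preserves that property.
        x = layers.Dense(WIDTH, activation="selu",
                         kernel_initializer="lecun_normal")(x)
        x = layers.AlphaDropout(DROPOUT_RATE)(x)
    outputs = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```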
Table 1 Composition of the training set across the categories, together with a χ² goodness-of-fit test for equal distribution.
| – | Disappointed | Stressed | Happy | Sad | Surprised | Total number |
| Number of snapshots in training | 95402 | 95585 | 95734 | 96270 | 95750 | 478741 |
| Frequency | 19.93% | 19.97% | 20.00% | 20.11% | 20.00% | – |
| χ² test (p-value) | 0.357537338 |||||
Table 2 Composition of the test set across the categories, together with a χ² goodness-of-fit test for equal distribution.
| – | Disappointed | Stressed | Happy | Sad | Surprised | Total number |
| Number of snapshots in test | 17086 | 16876 | 16781 | 16972 | 16768 | 84483 |
| Frequency | 20.22% | 19.98% | 19.86% | 20.09% | 19.85% | – |
| χ² test (p-value) | 0.372682717 |||||
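The "χ² test" rows of Tables 1 and 2 can be reproduced from the listed counts; the following sketch uses scipy.stats.chisquare to compute the goodness-of-fit p-values under the null hypothesis of an equal distribution over the five categories.

```python
# Reproduce the chi-square p-values reported in Tables 1 and 2.
from scipy.stats import chisquare

train_counts = [95402, 95585, 95734, 96270, 95750]
test_counts = [17086, 16876, 16781, 16972, 16768]

for name, counts in [("training", train_counts), ("test", test_counts)]:
    stat, p = chisquare(counts)  # H0: uniform distribution over categories
    print(f"{name}: chi2={stat:.3f}, p={p:.9f}")
# Expected output: p ≈ 0.357537 (training) and p ≈ 0.372683 (test),
# i.e. no significant deviation from an equal distribution.
```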
Fig. 3 Accuracy achieved during training as a function of the number of passes (epochs).
Fig. 4 Values of the loss function achieved during training as a function of the number of passes (epochs).
Table 3 Per-category performance metrics for a run over all data of the training set.
| By category | Disappointed | Stressed | Happy | Sad | Surprised |
| Recall/Sensitivity | 80.917% | 76.911% | 88.245% | 83.006% | 81.381% |
| Precision/Positive predictive value | 81.774% | 79.520% | 83.582% | 78.487% | 87.549% |
| F1-Score | 0.81342859 | 0.78193721 | 0.85850169 | 0.8068335 | 0.84352166 |
| Specificity | 95.512% | 95.059% | 95.667% | 94.273% | 97.106% |
| Negative cases | 383339 | 383156 | 383007 | 382471 | 382991 |
| Total | | | | | |
| Recall/Sensitivity | 82.095% | | | | |
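The per-category values in Tables 3 and 4 follow from a standard confusion-matrix analysis. The sketch below uses scikit-learn purely for illustration (the paper does not state which tooling was used); the "Negative cases" row corresponds to all snapshots not belonging to the respective category.

```python
# Derive recall, precision, F1-score and specificity per class
# from a confusion matrix, as listed in Tables 3 and 4.
import numpy as np
from sklearn.metrics import confusion_matrix

CATEGORIES = ["disappointed", "stressed", "happy", "sad", "surprised"]

def per_category_metrics(y_true, y_pred):
    """Print recall, precision, F1-score and specificity per class."""
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CATEGORIES)))
    total = cm.sum()
    for c, name in enumerate(CATEGORIES):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp  # missed snapshots of this class
        fp = cm[:, c].sum() - tp  # other classes predicted as this one
        tn = total - tp - fn - fp
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        f1 = 2 * precision * recall / (precision + recall)
        specificity = tn / (tn + fp)
        # "Negative cases" in Tables 3/4 corresponds to tn + fp,
        # i.e. all snapshots not belonging to this category.
        print(f"{name}: recall={recall:.3%} precision={precision:.3%} "
              f"F1={f1:.4f} specificity={specificity:.3%} "
              f"negatives={tn + fp}")
```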
Table 4 Per-category performance metrics for a run over all data of the test set.
| By category | Disappointed | Stressed | Happy | Sad | Surprised |
| Recall/Sensitivity | 80.118% | 74.793% | 87.605% | 82.842% | 80.439% |
| Precision/Positive predictive value | 80.986% | 78.252% | 82.697% | 77.483% | 86.868% |
| F1-Score | 0.805495896 | 0.764830637 | 0.850801551 | 0.800728971 | 0.835299582 |
| Specificity | 95.231% | 94.811% | 95.457% | 93.948% | 96.989% |
| Negative cases | 67397 | 67607 | 67702 | 67511 | 67715 |
| Total | | | | | |
| Recall/Sensitivity | 81.152% | | | | |
Fig. 5 Example of the time course of a participant's raw data for the category "Happy" during recording of the data for AI training. The names of the individual factors listed in the legend follow [74].
Table 5 Inference times needed for a single prediction cycle using the final trained artificial neural network on different devices. Input data were supplied in the form of the test and training sets as well as random numbers. The listed values denote inference times in microseconds (µs).
| Device / input data | Number N of supplied snapshots | Range of values in µs | Minimum inference time in µs | Maximum inference time in µs | Mean inference time in µs | Std. deviation in µs |
| MacBook Pro M1 Max, random | 500000 | 128.46 | 9.71 | 138.17 | 11.16 | 0.75 |
| MacBook Pro M1 Max, test set | 84483 | 106.42 | 10.08 | 116.50 | 11.11 | 0.71 |
| MacBook Pro M1 Max, training set | 478741 | 114.12 | 10.00 | 124.12 | 11.10 | 0.76 |
| iPhone 12 Pro, test set | 84483 | 618.62 | 11.50 | 630.12 | 20.20 | 12.25 |
| iPhone 12 Pro, training set | 478741 | 408.12 | 11.54 | 419.67 | 22.93 | 14.43 |
| iPhone 12 Pro, random | 500000 | 744.54 | 13.67 | 758.21 | 23.65 | 15.36 |
| iPad Pro 12.9″ 3rd gen, test set | 84483 | 1323.63 | 15.04 | 1338.67 | 18.41 | 5.36 |
| iPad Pro 12.9″ 3rd gen, training set | 478741 | 1225.54 | 14.63 | 1240.17 | 18.42 | 3.34 |
| iPad Pro 12.9″ 3rd gen, random | 500000 | 762.33 | 14.83 | 777.17 | 18.92 | 3.19 |
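A timing harness of the following kind could produce measurements like those in Table 5: snapshots are fed one at a time and the latency of each prediction cycle is recorded in microseconds. This is a Python/Keras sketch under stated assumptions; on the iOS devices, the paper's measurements would presumably run through the on-device inference stack (e.g., Core ML) rather than this code.

```python
# Per-snapshot inference timing for a trained Keras model.
import time
import numpy as np

def benchmark(model, snapshots):
    """Return per-snapshot inference times in microseconds (µs)."""
    times_us = []
    for s in snapshots:
        batch = s[np.newaxis, :]      # a batch of one snapshot
        t0 = time.perf_counter_ns()
        model(batch, training=False)  # one prediction cycle
        times_us.append((time.perf_counter_ns() - t0) / 1_000)
    t = np.asarray(times_us)
    print(f"N={len(t)} min={t.min():.2f} max={t.max():.2f} "
          f"mean={t.mean():.2f} std={t.std(ddof=1):.2f} "
          f"range={t.max() - t.min():.2f} (all in µs)")
    return t
```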