Abdulrahman Takiddin1, Mohammad Shaqfeh2, Osman Boyaci1, Erchin Serpedin1, Mitchell A Stotland3,4. 1. Electrical and Computer Engineering Department, Texas A&M University, College Station, Tex. 2. Electrical and Computer Engineering Department, Texas A&M University, Doha, Qatar. 3. Division of Plastic, Craniofacial and Hand Surgery, Sidra Medicine, Doha, Qatar. 4. Weill Cornell Medical College, Doha, Qatar.
Abstract
A sensitive, objective, and universally accepted method of measuring facial deformity does not currently exist. Two distinct machine learning methods are described here that produce numerical scores reflecting the level of deformity of a wide variety of facial conditions. METHODS: The first proposed technique utilizes an object detector based on a cascade function of Haar features. The model was trained using a dataset of 200,000 normal faces, as well as a collection of images devoid of faces. With the model trained to detect normal faces, the face detector confidence score was shown to function as a reliable gauge of facial abnormality. The second technique developed is based on a deep learning architecture of a convolutional autoencoder trained with the same rich dataset of normal faces. Because the convolutional autoencoder regenerates images disposed toward their training dataset (ie, normal faces), we utilized its reconstruction error as an indicator of facial abnormality. Scores generated by both methods were compared with human ratings obtained using a survey of 80 subjects evaluating 60 images depicting a range of facial deformities [rating from 1 (abnormal) to 7 (normal)]. RESULTS: The machine scores were highly correlated to the average human score, with overall Pearson's correlation coefficient exceeding 0.96 (P < 0.00001). Both methods were computationally efficient, reporting results within 3 seconds. CONCLUSIONS: These models show promise for adaptation into a clinically accessible handheld tool. It is anticipated that ongoing development of this technology will facilitate multicenter collaboration and comparison of outcomes between conditions, techniques, operators, and institutions.
Question: The field of facial reconstruction is hampered by the absence of an objective and clinically practical means of measuring disease severity and clinical outcome. Findings: We designed two machine learning models that generate a numerical score of normality for any face. The models were trained using images of 200,000 normal faces and tested on 30 images reflecting a range of facial deformity. The machine-generated data closely correlate with scores obtained from a cohort of human raters. Meaning: Development of these systems will allow for a universal and objective means of assessing quality of clinical outcome by facilitating a meaningful comparison between techniques, surgeons, and institutions.
INTRODUCTION
Although individuals with congenital or acquired facial conditions may present with abnormalities across a spectrum of severity, even relatively subtle differences can result in considerable psychosocial impact.[1] However, because a sensitive, objective, and universally accepted method of measuring facial deformity does not currently exist, there is also a lack of reliable means to assess the benefits of reconstructive facial surgery. Most medical practitioners are able to plan and evaluate their treatments based on some combination of laboratory values, functional measures, or radiologic and pathologic findings. Facial reconstructive surgeons, however, are resigned to working almost exclusively with subjective assessments (ie, examining “before and after” photographs) or anthropometric measurements that may not faithfully reflect the complexity of human perception of facial appearance.
For the purposes of clinical evaluation and comparison of outcomes, it would be useful to create a scale of deformity across broad populations against which any face—and any facial disorder—could objectively be measured. It is challenging, however, for human raters to establish a gradient of facial form. Sorting through large numbers of faces requires the recognition and active retention of vast amounts of perceptual information, something better suited to a machine system than to the human mind. In addition, human appraisal is influenced by personal or cultural preferences, as well as cognitive biases based on factors such as age, gender, race, and professional background. Here, we describe two computer models that integrate data from an extensive and diverse population of normal faces and are able to then score any newly encountered facial image in an impartial and predictable manner.
Being able to determine where a particular face falls within the spectrum of normality is an essential task for the facial reconstructive surgeon, and one that until now has relied almost exclusively on intuition.
Previously, our team used a generative adversarial network facial generator[2] to produce authentic, normalized analogues of raw facial images exhibiting deformity. The model we developed was also able to calculate the perceptual distance between the normalized face and its raw abnormal counterpart, yielding scores that correlated closely with the human ratings of deformity.[3] However, because that method proved computationally inefficient, it does not lend itself easily to adaptation into a portable application for clinical use. In the current report, we propose two alternative design approaches that avoid the steps of image normalization and perceptual distance measurement, and thus require less processing power: (1) an object detector model based on an ensemble of boosted Haar Cascade classifiers, relying on the confidence level of the system to discriminate gradations of abnormality, and (2) a convolutional autoencoder model, relying on the reconstruction error of the model as an indicator of deviation from the norm. Similar to our earlier method, both the object detector and the convolutional autoencoder models are shown here to generate facial scoring of a wide variety of facial conditions that correlates closely with human scoring. Moreover, we demonstrate that the object detector approach can be tuned to not only holistically evaluate a face, but to judge discrete aesthetic units within a face, thereby potentially enhancing sensitivity to subtle differences in the orbital, nasal, and oral regions.
Placing this type of technology into the hands of the clinician in the future could usher in a paradigm shift in the way patients, surgeons, researchers, and third-party payers interpret the clinical problem of facial deformity and the potential benefits of corrective surgical intervention.
METHODS
Data Preparation
An estimated 200,000 images of normal faces were used to train both the object detector and the convolutional autoencoder. We produced these images using the StyleGAN facial image generator, which fabricates highly realistic facial images that are demographically well distributed and reflect a range of lighting, pose, and expression (Fig. 1).[2] All images used in this study were in RGB mode and scaled to a common size of 224 × 224 pixels to align with our lower resolution testing dataset.
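The scaling step can be sketched as follows; the nearest-neighbor resampler here is a dependency-free stand-in for illustration (a production pipeline would typically use a library resizer such as cv2.resize):

```python
import numpy as np

def resize_nearest(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbor rescale of an H x W x 3 RGB image to size x size,
    matching the common 224 x 224 input dimension described above."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return image[rows[:, None], cols]
```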
Fig. 1.
Representative sample of normal faces fabricated by the facial generator (StyleGAN) that provided the 200,000-image training database for both study models.
For the training of the object detector, a second group of negative (ie, nonface) images was required. For this, we used the Canadian Institute for Advanced Research 100 dataset, which consists of 100 classes of images (600 images per class).[4] We excluded all facial classes from the dataset, leaving only objects such as nature, fruits, vegetables, electrical devices, and buildings. In total, this negative training group consisted of 47,500 images.
Both measurement techniques that we developed were tested using 30 open-source images of facial deformity (licensed for re-use under the Creative Commons, Mountain View, Calif.), as well as 30 normal faces fabricated with the StyleGAN (distinct from the 200,000 normal faces used in the training phase). The 30 images depicting deformity included 12 women, 12 men, and six infants of indeterminate gender; seven adults and 23 children; and a diversity of ethnic backgrounds as outlined in the Results section. Eighty volunteers aged 18–65 rated the 60 images on a 1–7 Likert scale (1: most deformed, 7: most normal).
Object Detector Method
We implemented the OpenCV version of the Haar Cascade object detector, which has been shown to work efficiently and reliably with resource-constrained devices.[5] The Haar Cascade object detector scans images using a sliding window approach, summing pixel data in adjacent rectangular areas as it progresses sequentially across an image. A given classifier is defined by the measured difference in pixel sums between adjacent areas, relative to a determined threshold value. The Viola-Jones method[6] was applied, computing edge, line, and diagonal (four-rectangle) image features to target known facial properties (ie, orbital region darker than upper cheeks, nasal bridge brighter than nasal sidewalls, etc.) (Fig. 2). As per the original description, our approach involved the use of an adaptive boosting algorithm (AdaBoost)[7] that takes weak classifiers and uses them to incrementally build a much better, stronger classifier by optimizing the weights for, and adding, one weak classifier at a time. To enhance the model’s accuracy and efficiency, we tuned the hyperparameters of image scaling and k-nearest neighbors using a sequential grid-search hyperparameter optimization approach.[8] As the Haar process cascaded forward through stages, the optimal hyperparameters were obtained for each cascade for each image, allowing background regions to be discarded more quickly and more computational attention to be placed on promising areas of the image.
Fig. 2.
Training input and example of Haar feature extraction for object detector method. The whole face, as well as defined orbital, nasal, and oral aesthetic subunits were considered.
Following the training phase, we derived a confidence score for our test images by taking into account the following considerations: (1) each weak classifier is essentially a one-level decision tree; (2) each decision is associated with a threshold value in pixel sums between the adjacent rectangular areas of its Haar feature; (3) the confidence score of a single weak classifier is the difference between its value and its threshold; and (4) during the AdaBoost training stage, each selected weak classifier gets associated with a relative weight. Therefore, all weak classifiers in a single stage of the cascade can be combined in a weighted manner, and the confidence score for each stage can be calculated (Fig. 3). Because final strong classifiers are determined by weighted majority “voting” of all weak classifiers, a higher percentage of weak classifiers in favor of the presence of a face equates to a higher confidence of facial detection [expressed here on a 0 (abnormal) to 10 (normal) scale].[9]
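The weighted-vote scheme can be illustrated with a minimal numerical sketch; the arrays and the mapping onto a 0–10 scale are illustrative, not the exact OpenCV internals:

```python
import numpy as np

def stage_confidence(feature_values, thresholds, polarities, alphas):
    """Confidence of one cascade stage as the weighted combination of its
    weak (one-level decision tree) classifiers. One array entry per weak
    classifier; all values here are illustrative."""
    # Each weak classifier votes +1 (face) or -1 (non-face) by comparing its
    # Haar-feature value against its learned threshold, with a polarity sign.
    votes = np.where(polarities * feature_values < polarities * thresholds,
                     1.0, -1.0)
    # AdaBoost weights (alphas) combine the votes; the normalized margin
    # lies in [-1, 1] and maps linearly onto a 0 (abnormal) to 10 (normal) scale.
    margin = np.dot(alphas, votes) / np.sum(alphas)
    return 5.0 * (margin + 1.0)
```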
Fig. 3.
Algorithm used to derive confidence score of Haar Cascade object detector.
Convolutional Autoencoder Method
We designed an unsupervised anomaly detector based on a connected convolutional neural network and autoencoder [ie, a convolutional autoencoder (CAE)] that was trained on images of normal faces and tested on images of normal and abnormal faces. The architecture of our CAE, which is similar to the VGG16 architecture,[10] is depicted in Figure 4.
Fig. 4.
Schematic illustration of implemented convolutional autoencoder architecture.
As depicted, the convolutional encoder takes the input image and processes it through multiple convolutional encoder layers that reduce the image dimensions from [224 × 224 × 3] to [14 × 14 × 512]. The height and width of the volumes (image input: [224 × 224] pixels) progressively decrease throughout the convolutional layers, while the depth, which represents the number of feature maps, increases as image features are extracted. Following a pooling layer, a one-dimensional vector is reached at the fully connected layer. Within the second half of the construct, the convolutional decoder receives the encoder output and reconstructs it through multiple convolutional layers. The convolutional decoder reconstructs the image by increasing the volume from [14 × 14 × 512] to the original dimensions of [224 × 224 × 3]. Similar to VGG16, all convolutional layers in our model have 3 × 3 filters. During the training process, the CAE learns the forming features of the input normal images, as the autoencoder model learns the parameters required to minimize the reconstruction error of output versus input. The score of the reconstruction error is derived from a calculated cost function (Fig. 5) expressed on a 0 (normal) to 1 (abnormal) scale, where XTR denotes the training set. During testing, the reconstruction error should be small for normal facial images having similar forming representations as the training dataset, while the score should be progressively higher for those images displaying more dissimilar facial features.
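A reduced sketch of such an encoder-decoder, assuming PyTorch and illustrative channel widths (the VGG16-style model described above is considerably larger):

```python
import torch
import torch.nn as nn

class MiniCAE(nn.Module):
    """Reduced sketch of a VGG16-style convolutional autoencoder: 3x3
    convolutions with pooling halve the spatial size on the way down
    (224 -> 14 over four stages here), and transposed convolutions mirror
    the path back up. Channel widths are illustrative."""
    def __init__(self):
        super().__init__()
        def down(cin, cout):  # conv + ReLU + 2x2 max-pool (halves H and W)
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(), nn.MaxPool2d(2))
        def up(cin, cout):    # transposed conv (doubles H and W) + ReLU
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 2, stride=2),
                                 nn.ReLU())
        self.encoder = nn.Sequential(down(3, 16), down(16, 32),
                                     down(32, 64), down(64, 128))  # 224 -> 14
        self.decoder = nn.Sequential(up(128, 64), up(64, 32), up(32, 16),
                                     nn.ConvTranspose2d(16, 3, 2, stride=2),
                                     nn.Sigmoid())                 # 14 -> 224

    def forward(self, x):
        return self.decoder(self.encoder(x))
```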
Fig. 5.
Algorithm used to derive reconstruction error of convolutional autoencoder model.
As for the Haar object detector method, we conducted a sequential grid-search hyperparameter optimization algorithm for the CAE to select the best combination of hyperparameters. After running the algorithm, the selected activation function for the internal layers was ReLU. The training process used 500 epochs with stochastic gradient descent, a learning rate of 0.01, and a momentum of 0.9.
Experiments were repeated in triplicate, and the results were reported as the average over the testing set (XTS) for all experiments. For all 60 images that we tested, a Pearson correlation was calculated between human rating and object detector confidence score, and between human rating and CAE reconstruction error. Significance was set at a P value less than 0.05.
Both machine learning models described in this study were trained and tested using the same machine setup, with an NVIDIA GeForce RTX 2080 hardware accelerator (NVIDIA, Santa Clara, Calif.) and Python 3.7 (Python Software Foundation, www.python.org).
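The anomaly score and the validation statistic can both be sketched in a few lines; a mean-squared-error cost stands in for the cost function of Figure 5, and np.corrcoef replaces a dedicated statistics package:

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Per-image anomaly score: mean squared reconstruction error between
    the input and the autoencoder output, both scaled to [0, 1]. Images
    close to the training distribution (normal faces) score near 0, while
    dissimilar facial features push the score toward 1."""
    return float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))

def pearson_r(machine_scores, human_ratings):
    """Pearson correlation between machine scores and mean human ratings,
    as used to validate both models against the rater survey."""
    return float(np.corrcoef(machine_scores, human_ratings)[0, 1])
```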
RESULTS
Diagnoses, reference numbers, human ratings, Haar Cascade object detector confidence scores, and CAE reconstruction errors for eight representative abnormal testing images (open-source and licensed for re-use through the Creative Commons) are listed in Figure 6.[11-18] Note that for the object detector method, we were also able to obtain individual confidence scores for aesthetic subunits of the face.
Fig. 6.
Subset of test images (with diagnoses and source reference number[11–18]) used to evaluate the object detector and convolutional autoencoder machine learning models. Scoring ranges from abnormal to normal: human rating (1–7); object detector (0–10); CAE (1–0). NF: Neurofibromatosis; TC: Treacher Collins; UC/L: Unilateral cleft lip; MFH: Midface hypoplasia; FP: Facial palsy; VL: Vascular lesion; R BC/L: Repaired bilateral cleft lip.
For all 60 images of facial deformity and normality that we tested, a close correlation was found between human rating and object detector confidence score (r = 0.96, P < 0.00001, Fig. 7), and between human rating and CAE reconstruction error (r = 0.98, P < 0.00001, Fig. 8). The diagnoses and all human and machine scoring for the 30 test images are listed in Table 1.
Fig. 7.
Scatter plot of object detector confidence score relative to human ratings for all 60 test images (30 abnormal and 30 normal). For the object detector the rating scale is 0 (abnormal) to 10 (normal), and for human rating it is 1 (abnormal) to 7 (normal).
Fig. 8.
Scatter plot of convolutional autoencoder reconstruction error relative to human ratings for all 60 test images (30 abnormal and 30 normal). For the convolutional autoencoder the rating scale is 0 (normal) to 1 (abnormal), and for human rating it is 1 (abnormal) to 7 (normal).
Table 1.
Diagnosis, Human Rating, Haar Cascade Object Detector Confidence Score, and Convolutional Autoencoder Reconstruction Error for All 30 Test Images
*Pre- and postoperative images are of the same individual.
Scoring ranges from abnormal to normal: human rating (1–7); object detector (0–10); CAE (1–0). R: right; L: left; B: bilateral; C/L: cleft lip.
For the Haar Cascade, the average time to report the facial rating score for all four facial segments (orbital, nasal, oral, and full face) for one image was 3.2 seconds, whereas the CAE required 1.2 seconds to report the abnormality score for the full face.
DISCUSSION
With recent advancements in artificial intelligence, a rich set of computational methods and platforms are readily available for use and development.[19-21] This provides a great opportunity for end-users in industries such as healthcare who are seeking to match new solutions to longstanding problems.[22-24] Because the sensitive detection and recognition of faces has become almost commonplace today, it is inevitable that modern computer methods be adapted for use in the evaluation of faces within the clinical setting. Our intention in this work was to construct a universal facial rating system that would align with human perception, and offer itself as an objective and clinically accessible modality for gauging any type of facial impairment and assessing surgical outcomes. However, there are some important issues to bear in mind when considering the automated judgment of facial appearance.
First, no ground truth exists that can be applied as a reference standard. The detection and appraisal of perceptual difference within a face is inherently an idiosyncratic task. Yet, although human ratings may be influenced by bias or self-censorship, they also represent highly evolved judgments that integrate vast amounts of perceptual information within the context of cultural predilections.
Existing measurement methods that target anatomic landmarks and focus discretely on factors such as symmetry, proportion, or gender averageness are unable to capture the holistic information instinctively being processed by a human rater.[25-32]
Another issue that should be considered when measuring faces is the concept referred to as lookism, a form of social discrimination that has been widely discussed elsewhere.[33] Despite an extensive body of literature outlining the biological and evolutionary foundations of our attraction to beauty,[34] it is appreciated that a range of injustice is unwittingly committed upon those whose faces are perceived as relatively less appealing.[35,36] The introduction of a mechanized rating of appearance could therefore be seen as posing a hazard if used to expose and disparage individuals who rate poorly. However, precisely because evidence shows that the perception of facial appearance is so fundamentally hardwired into our cognitive processes,[37] it is unreasonable to expect humans to refrain from appraising faces. Ultimately, the ethical implementation of a facial measurement tool should be seen as no different than the expectation for right-minded behavior in response to any of the various and readily perceived differences between people (age, gender, race, habitus, etc.).
In terms of functioning as a meaningful arbiter of the human face, a computer system ideally should demonstrate (1) alignment with human judgment, (2) order-preservation, (3) indifference to extraneous variations, and (4) sensitivity to subtle structural discrepancies. The high correlation between the output of our two machine learning models and the human ratings of the test images plainly confirms the models’ alignment with human judgment.
The order-preservation of the systems is reflected by the gradient of scoring across the test set that matched closely between machine and human (eg, consistently superior scores for repaired versus unrepaired cleft lip deformities), and by the capacity of the Haar object detector to reliably distinguish the abnormal from the normal subunits of the face (Fig. 6, Table 1). Extraneous variations in the diversity of age, gender, and race of our test dataset did not seem to affect the reliability of the computer ratings, although we were not able to specifically test that with our limited sample size of test images. With regard to feature sensitivity, our machine learning models appeared to discern subtle changes in facial appearance as reflected in the gradient of scores ranging from an individual with extensive neurofibromatosis to one with modest jaw asymmetry.
A growing body of work involving the use of artificial intelligence in plastic surgery is now emerging, the vast majority of which has employed supervised learning and classification models to focus on diagnostic and outcome prediction.[38] A recent systematic review references a host of studies predicting burn wound depth and clinical outcome using machine learning models primed with various combinations of patient images, demography, and laboratory parameters. Several groups have described automated systems used to diagnose and assess the severity of craniosynostosis based on shape analysis, whereas only a handful of studies have considered the assessment of facial parameters.[39-43] The current study can be clearly distinguished from any of the preceding work in that our models combine a rich training dataset, a lack of requirement for anatomic landmark recognition or the use of arbitrary human scales, and a holistic consideration of faces that can theoretically be applied to any category of deformity.
We are currently compiling a new database of high-resolution clinical images that are approved for analysis and publication. For the purpose of this project, however, we relied on testing open-source images of lower quality, which compelled us to downgrade the resolution of the training set. This could theoretically limit sensitivity to facial feature details. Another avenue we are exploring to enhance discernment is to re-train our models separately based on gender and adult/child status of the faces, as well as on isolated facial aesthetic subunits. However, acknowledging any possible limitation in the sensitivity of our method, we believe that the initial results we report here are impressive and notable—particularly in comparison with testing we performed with off-the-shelf, state-of-the-art detection tools, including YOLOv3[44] and fast.ai.[45] The main issue we found with these latter tools is that they are designed to classify objects as either “human face or not human face”; therefore, even images depicting clinical deformities were rated with extremely high confidence. Our custom-tailored models provided far better granularity of measurement in a computationally efficient manner. We anticipate that development of this new technology into a clinically accessible, portable platform will usher in new opportunities for multicenter collaboration and objective comparison of outcomes between conditions, techniques, operators, and institutions.