Literature DB >> 34235035

Machine Learning Demonstrates High Accuracy for Disease Diagnosis and Prognosis in Plastic Surgery.

Angelos Mantelakis1, Yannis Assael2, Parviz Sorooshian3, Ankur Khajuria4,1.   

Abstract

INTRODUCTION: Machine learning (ML) is a set of models and methods that can detect patterns in vast amounts of data and use this information to perform various kinds of decision-making under uncertain conditions. This review explores the current role of this technology in plastic surgery by outlining the applications in clinical practice, diagnostic and prognostic accuracies, and proposed future directions for clinical applications and research.
METHODS: EMBASE, MEDLINE, CENTRAL and ClinicalTrials.gov were searched from 1990 to 2020. Any clinical studies (including case reports) which present the diagnostic and prognostic accuracies of machine learning models in the clinical setting of plastic surgery were included. Data collected were clinical indication, model utilised, reported accuracies, and comparison with clinical evaluation.
RESULTS: The database search identified 1181 articles, of which 51 were included in this review. The clinical utility of these algorithms was to assist clinicians in diagnosis prediction (n=22), outcome prediction (n=21) and pre-operative planning (n=8). The mean accuracies were 88.80%, 86.11% and 80.28%, respectively. The most commonly used models were neural networks (n=31), support vector machines (n=13), decision trees/random forests (n=10) and logistic regression (n=9).
CONCLUSIONS: ML has demonstrated high accuracy in the diagnosis and prognostication of burn patients, congenital or acquired facial deformities, and in cosmetic surgery. No studies compared ML with clinicians' performance. Future research can be enhanced by using larger datasets or data augmentation, employing novel deep learning models, and applying these to other subspecialties of plastic surgery.
Copyright © 2021 The Authors. Published by Wolters Kluwer Health, Inc. on behalf of The American Society of Plastic Surgeons.


Year:  2021        PMID: 34235035      PMCID: PMC8225366          DOI: 10.1097/GOX.0000000000003638

Source DB:  PubMed          Journal:  Plast Reconstr Surg Glob Open        ISSN: 2169-7574


INTRODUCTION

An expanding population in the United States has resulted in increasing demand for plastic surgery services, which, coupled with a static number of residents and an increasing number of retiring surgeons, is raising the pressure on the delivery of high-quality care.[1] It is now estimated that there is a workforce shortage of 800 attending physicians in the United States, reducing the availability of care.[1] Artificial intelligence (AI) could have a major impact on addressing the challenges that healthcare systems face. Digital technologies are predicted to affect more than 80% of the healthcare workforce over the next two decades, changing the way physicians practice medicine and helping to meet the increasing demand for services.[2] AI can help drive this change by automating repetitive tasks to free up clinicians' time, improving the diagnostic accuracy of diseases, and predicting patient outcomes.[2]
Machine learning (ML), a subfield of AI, is a set of models able to learn from past cases (data) to make future predictions. A wide variety of such algorithms are in use today, such as the automated, individualized suggestions generated during a Google Search based on one's previous searches. These models can be classified into two broad categories: supervised learning and unsupervised learning. The difference between the two lies in the presence of labeled data. In supervised learning, models are trained on examples with known labels (labeled data) and, after training, aim to predict outcomes for new data.[3,4] This function has been utilized in healthcare to assist both in making a diagnosis and in predicting disease outcomes.
Authors have utilized supervised learning to successfully classify whether a skin lesion is benign (eg, a benign nevus) or malignant (eg, malignant melanoma), outperforming the accuracy of 21 board-certified dermatologists (accuracy 72% versus 66%, P < 0.05).[5] Similarly, supervised learning has been utilized to predict the risk of developing a condition such as breast cancer based on epidemiological data, and the risk of recurrence after treatment.[6,7] In contrast, unsupervised learning models are trained on unlabeled data and, after training, aim to discover underlying groupings or patterns in the data themselves.[3,8] These algorithms can be particularly useful for identifying previously unknown patterns in vast amounts of unprocessed data, which may then be used in clinical practice. Examples include novel classification of diseases into subtypes and identification of subgroups of patients at increased risk of certain conditions based on various characteristics (for example, their genome).[9,10] Beyond meeting demand for plastic surgery services, this technology has the potential to revolutionize how plastic surgery is practiced and to enhance surgeons' diagnosis prediction, preoperative planning, and outcome prediction, leading to improved patient care.
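The supervised/unsupervised distinction can be illustrated with a minimal, self-contained sketch (the toy features, labels, and thresholds below are invented for illustration and are not drawn from any study in this review):

```python
# Supervised learning: labelled examples (feature, label) train a
# 1-nearest-neighbour classifier that predicts the label of a new case.
def knn_predict(train, x):
    # train: list of (feature, label); return label of nearest feature
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

labelled = [(0.2, "benign"), (0.3, "benign"), (0.8, "malignant"), (0.9, "malignant")]
print(knn_predict(labelled, 0.85))  # -> malignant

# Unsupervised learning: the same features WITHOUT labels; a simple
# 2-means clustering discovers the two groupings from the data themselves.
def two_means(xs, iters=10):
    c1, c2 = min(xs), max(xs)  # initialise centroids at the extremes
    for _ in range(iters):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

print(two_means([0.2, 0.3, 0.8, 0.9]))  # -> ([0.2, 0.3], [0.8, 0.9])
```

The supervised model needs the labels at training time; the clustering recovers the same two groups from the unlabeled features alone.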
In burn surgery, even the most experienced surgeons achieve a clinical estimation accuracy of 64%–76% in the diagnosis of burn depth.[11,12] ML models may outperform this, achieving correct burn depth identification from 2D photographs in up to 87% of cases, potentially leading to more appropriate clinical management at presentation.[13] Further, in prognosticating whether a burn injury will heal within 14 days of presentation, ML models have demonstrated an accuracy of 86%, again surpassing the accuracy of prognostication by clinicians.[4] In the field of microsurgery, postoperative monitoring via 2D image analysis achieves 95% accuracy in classifying a flap as normal, showing venous obstruction, or showing arterial occlusion, leading to potential early identification of flap failure and increased salvage rates.[4] However, the evidence on clinical applications of ML remains fragmented, with no systematic review summarizing the clinical accuracy of such models in practice. Such a synthesis could act as a starting point for developing clinical practice guidelines and guide future research.[14-17] The aim of this study was to systematically synthesize and report the current literature on the clinical applications of ML in plastic surgery.

METHODS

Search Strategy

The protocol for this systematic review was registered with the PROSPERO international prospective register of systematic reviews (registration number CRD42019140924). The full protocol was published a priori, and there were no deviations from the original protocol.[18] This systematic review was conducted and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.[19] A systematic literature search was performed in the MEDLINE (OVID SP), EMBASE (OVID SP), CENTRAL, and ClinicalTrials.gov databases to identify relevant studies for review. The reference lists of all included studies were also screened, and relevant studies were included. Lastly, manual searches of bibliographies, citations, and related articles (PubMed function) were performed to identify missed relevant studies. Medical Subject Headings (MeSH) terms were used in combination with free text to construct the search strategy. A sample search strategy used in MEDLINE (OVID SP) is shown in Table 1.[20-70]
Table 1.

Example Search Strategy Used for MEDLINE[20–70]

1. (“deep learning” OR “artificial intelligence” OR “machine learning” OR “decision trees” OR “random forests” OR SVM OR “support vector machine”)
2. exp “NEURAL NETWORKS (COMPUTER)”/ OR exp “DEEP LEARNING”/
3. exp “ARTIFICIAL INTELLIGENCE”/
4. (1 OR 2 OR 3)
5. (microsurgery OR (surgery AND (plastic OR reconstructive OR esthetic OR aesthetic OR burns OR hand OR craniofacial OR “peripheral nerve”)))
6. exp “SURGERY, PLASTIC”/ OR exp “RECONSTRUCTIVE SURGICAL PROCEDURES”/
7. (5 OR 6)
8. (4 AND 7)

Selection Criteria

All eligible studies between January 1990 and June 2020 were included in this review. We included any primary studies (including case reports) that present clinical data on the application of ML in plastic surgery. Only articles in the English language were included. Our exclusion criteria included descriptions of ML in plastic surgery without clinical data, review articles, conference abstracts, animal studies, and articles pertaining to the use of ML outside the remit of the specialty (as defined by the Intercollegiate Surgical Curriculum Program in Plastic Surgery). After the library preparation, two independent reviewers (AM and PS) screened the search results for inclusion based on the title and abstracts. Subsequently, a full-text review was performed independently by the same two researchers (AM and PS) for all included studies. At each step, any discrepancy of opinion was resolved with consensus, and if not resolved, was referred to a third reviewer (AK). If any doubt remained, the article proceeded to the next step of the review. The search results of all included articles, abstracts, full-text articles, and records of the reviewers’ decisions, including reasons for exclusion, were recorded.

Outcome Measures

The primary outcome was the ML algorithm statistical accuracy in performing a prespecified clinical task (eg, prediction of a clinical diagnosis or postoperative outcome). Secondary outcomes include the reported specificity, sensitivity, area under the curve, and technical characteristics of the algorithms.

Data Extraction and Analysis

The data from all full-text articles accepted for the final analysis were independently retrieved by AM and PS, using a standardized data extraction form. Any disagreements were resolved by discussion or referred to the third researcher (AK). The following data (where available) were extracted: (a) study details (year of publication, country), patient demographics, study setting, and clinical condition examined; (b) ML algorithm characteristics (intended function, whether the model was supervised or unsupervised, function via classification or outcome prediction, usage of real or synthetic data, and which type of ML model was used); and (c) primary and secondary outcomes, as above. Statistical meta-analysis could not be performed because of the heterogeneity of the studies in the conditions examined and software models utilized. Instead, a narrative review was performed, with a subgroup analysis of the mean accuracy of the models, calculated by measuring the number of correct predictions over the total predictions made. The subgroup analyses are based on model function (diagnosis prediction, preoperative planning, and outcome prediction) and model type (NNs, SVMs, decision trees/random forests, and logistic regression). This subgroup classification was based on the objectives set for AI models in clinical practice by NHS England.[2]
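The accuracy measure described above (correct predictions over total predictions, then averaged across the models in a subgroup) can be sketched as follows; the per-model prediction counts are invented for illustration:

```python
# Mean subgroup accuracy: each model's accuracy is correct / total
# predictions; the subgroup mean averages these per-model accuracies.
def accuracy(correct: int, total: int) -> float:
    return correct / total

# Hypothetical (correct, total) prediction counts for three models:
subgroup = [(88, 100), (43, 50), (160, 200)]
per_model = [accuracy(c, t) for c, t in subgroup]  # [0.88, 0.86, 0.80]
mean_accuracy = sum(per_model) / len(per_model)
print(f"mean accuracy: {mean_accuracy:.2%}")  # -> mean accuracy: 84.67%
```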

Quality Assessment

The quality of the included studies was assessed with the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool by two independent reviewers (AM and PS).[71] There were no disagreements between the authors. The QUADAS-2 tool allows for risk of bias and applicability concern assessment of primary diagnostic accuracy studies. Risk of bias was assessed based on patient selection, index test (in this review, the ML algorithm), reference standard (comparator), and flow and timing. Concerns regarding applicability were assessed on the first three domains alone.

RESULTS

Literature Search Results

From a total of 1536 studies, after removal of duplicates, 1181 articles were eligible for a title and abstract review. Of these, 1074 articles did not meet the inclusion criteria and were excluded. Following full-text review of the remaining 107 articles, 56 articles were excluded because the inclusion criteria were not met. A total of 51 articles were included and formed the basis of this systematic review (Fig. 1). Details of the included studies are summarized in Table 2.[20-70]
Fig. 1.

The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram.

Table 2.

Primary Outcomes of Accuracy, Sensitivity, and Specificity for Reconstructive and Burns Surgery

Study | Author, Year | Function | Model | Accuracy | Sensitivity | Specificity | AUC
1 | Abubakar et al, 2020[20] | DP | CNN | White: 99.3%; Afro-Caribbean: 97.1% | NR | NR | NR
2 | Chauhan J et al, 2020[21] | DP | BPBSAM (CNN + SVM) | 91.70% | NR | NR | NR
3 | Desbois et al, 2020[22] | DP | DNN with 3 measures | 91.98% | NA | NA | NR
| | | DNN with 4 measures | 92.45% | NA | NA | NR
| | | Boost with 3 measures | 97.89% | NA | NA | NR
| | | Boost with 4 measures | 98.08% | NA | NA | NR
| | | avNN with 3 measures | 97.45% | NA | NA | NR
| | | avNN with 4 measures | 98.30% | NA | NA | NR
4 | Rashidi et al, 2020[23] | OP | DNN | 100% | 92% | 93% | 0.880
| | | LR | 95% | 91% | 90% | 0.940
| | | SVM | 98% | NR | NR | 0.780
| | | RF | 93% | NR | NR | 1.000
| | | k-NN | 98% | 91% | 82% | 0.960
5 | Bhalodia et al, 2020[24] | DP | ShapeWorks software with PCA | NR | NR | NR | NR
6 | Guarin et al, 2020[25] | DP | NR | NR | NR | NR | NR
7 | Formeister et al, 2020[26] | OP | Gradient Boosted Decision Tree | 60.00% | 62.00% | 60.00% | NR
8 | Boczar et al, 2020[27] | Intervention | IBM Watson | 92.30% | NR | NR | NR
9 | O’Neil et al, 2020[28] | OP | Decision Tree | NR | 5.00% | 86.80% | 0.672
10 | Yoo et al, 2020[29] | OP | Deep learning (generative adversarial network, GAN) | NR | NR | NR | NR
| | | Pix2pix | NR | NR | NR | NR
| | | Lightweight CycleGAN | NR | NR | NR | NR
| | DP | Deep learning + no data augmentation | 74.20% | 75.80% | 72.70% | 0.824
| | | Deep learning + standard data augmentation | 83.30% | 78.80% | 87.90% | 0.872
| | | Deep learning + GAN data augmentation | 90.90% | 87.80% | 93.90% | 0.957
11 | Angullia et al, 2020[30] | OP | Least squares radial basis function | NA | NA | NA | NA
12 | Eguia et al, 2020[31] | OP | Decision Tree | NA | NA | NA | 0.690
| | | Stepwise Logistic Regression | NA | NA | NA | 0.800
| | | LR | NA | NA | NA | 0.830
| | | k-NN | NA | NA | NA | 0.840
13 | Ohura et al, 2019[32] | DP | SegNet | 97.60% | 90.90% | 98.20% | 0.994
| | | LinkNet | 97.20% | 98.90% | 98.90% | 0.987
| | | U-Net | 98.80% | 99.30% | 99.30% | 0.997
| | | U-Net VGG16 | 98.90% | 99.20% | 99.20% | 0.998
14 | Porras et al, 2019[33] | DP | SVM | 95.30% | 94.70% | 96% | NR
15 | Knoops et al, 2019[34] | DP | SVM | 95.40% | 95.50% | 95.20% | NR
| | OP | LR/RR/LAR/LASSO | NR | NR | NR | NR
16 | Hallac et al, 2019[35] | DP | Pretrained GoogLeNet | 94.10% | 97.80% | 86% | NR
17 | Levites et al, 2019[36] | DP | Text-based emotion analysis | NR | NR | NR | NR
18 | Shew et al, 2019[37] | OP | 2-class Decision Forest | 64.40% | NR | NR | NR
19 | Dorfman et al, 2019[38] | DP | Neural Nets | NR | NR | NR | NR
20 | Qiu et al, 2019[39] | PP | U-Net CNN | NR | NR | NR | NR
21 | Aghaei et al, 2019[40] | OP | ANN-MLP | 73.30% | 76.20% | 70.20% | 0.762
| | | SVM | 67.20% | 66.10% | 68.40% | 0.731
| | | RF | 67.20% | 61% | 73.70% | 0.751
| | | LR (FS) | 67.20% | 61% | 73.70% | 0.711
| | | LR (BS) | 66.40% | 64.40% | 67.70% | 0.718
22 | Cirillo et al, 2019[41] | DP | VGG-16 | 77.53% | NR | NR | NR
| | | GoogLeNet | 73.80% | NR | NR | NR
| | | ResNet-50 | 77.79% | NR | NR | NR
| | | ResNet-101 without data augmentation | 90.54% | 74.35% | 94.25% | NR
| | | ResNet-101 with data augmentation | 82.72% | NR | NR | NR
23 | Tran et al, 2019[42] | OP | k-NN with k = 1–6 or 8–20 | 100% | NA | NA | NR
24 | Yadav et al, 2019[43] | DP | MDS modeling | 80% | 97.00% | 60.00% | NR
| | | SVM | 82.43% | 87.80% | 83.33% | NR
25 | Jiao et al, 2019[44] | DP | R101A CNN | 82.04% | NA | NA | NR
| | | IV2RA CNN | 83.02% | NA | NA | NR
| | | R101FA CNN | 84.51% | NA | NA | NR
26 | Liu et al, 2018[45] | PP | Least Squares Regression | NR | NR | NR | NR
| | | Decision tree | NR | NR | NR | NR
| | | Sigmoid Neural Nets | NR | NR | NR | NR
| | | Hyperbolic Tangent Neural Net | NR | NR | NR | NR
| | | Combined Model (Tree + NN) | NR | NR | NR | NR
27 | Martinez-Jimenez et al, 2018[46] | OP | Recurrent Partitioning Random Forest | 85.35% | NR | NR | NR
28 | Su et al, 2018[47] | OP | Random Forest | NA | NA | NA | NR
29 | Tang et al, 2018[48] | OP | LR | 80.50% | 84.40% | 77.70% | 0.875
| | | XGBoost | 85.40% | 82.00% | 89.70% | 0.920
30 | Cobb et al, 2018[49] | OP | Random Forest | NA | NA | NA | NR
| | | Stochastic Gradient Boosting | NR | | |
31 | Cho MJ et al, 2018[50] | DP | K-means | 96% | NR | NR | NR
32 | Kuo et al, 2018[51] | OP | MLR | 72.70% | 22.10% | 93.30% | NR
33 | Tan et al, 2017[52] | PP | NR | NR | NR | NR | NR
34 | Huang et al, 2016[53] | OP | SVM | 100% | NA | NA | NR
35 | Park et al, 2015[54] | PP | Feature wrapping | 77.30% | 99% | 74.10% | NR
36 | Serrano et al, 2015[55] | PP | SVM | 79.73% | 97% | 60% | NR
37 | Mukherjee et al, 2014[56] | DP | SVM with 3rd polynomial kernel | 86.13% | NA | NA | NR
| | | Bayesian classifier | 81.15% | NA | NA | NR
38 | Mendoza et al, 2014[57] | DP | LDA | 95.70% | 97.90% | 99.60% | NR
| | DP | Random Forest | 87.90% | NR | NR | NR
| | DP | SVM | 90.80% | NR | NR | NR
39 | Acha et al, 2013[58] | DP | k-NN | 66.2% | NR | NR | NR
| | | SVM | 75.7% | NR | NR | NR
| | PP | k-NN | 83.8% | NR | NR | NR
| | | SVM | 82.4% | NR | NR | NR
40 | Schneider et al, 2012[59] | OP | CART Decision Tree with Gini splitting function | 73.30% | NA | NA | NR
41 | Patil et al, 2009[60] | OP | Bayesian classifier | 97.78% | 100% | 95.50% | 0.978
| | | Decision Tree | 96.12% | 96.60% | 95.51% | 0.961
| | | SVM | 96.12% | 98.60% | 93.26% | 0.961
| | | Back propagation | 95% | 96.71% | 93.26% | 0.949
42 | Yamamura et al, 2008[61] | OP | ANN | 100% | NA | NA | NR
| | | LR | 72% | NA | NA | NR
43 | Correa et al, 2008[62] | DP | SVM | 95.05% | NR | NR | NR
44 | Acha et al, 2005[63] | DP | Fuzzy ARTMAP Neural Network | 82.26% | 83.01% | NA | NR
45 | Yeong et al, 2005[64] | OP | ANN | 86% | 75% | 97% | NR
46 | Serrano et al, 2005[65] | DP | Fuzzy ARTMAP Neural Network | 88.57% | 83.01% | NA | NR
47 | Yamamura et al, 2004[66] | OP | ANN | 100% | 100% | 100% | NR
| | | LR | 80% | 66.70% | 85.70% | NR
| | | ANN with leave-one-out cross-validation | 86.60% | 66.70% | 95.20% | NR
48 | Acha et al, 2003[67] | OP | Fuzzy ARTMAP Neural Network | 82.60% | NR | NR | NR
49 | Estahbanati et al, 2002[68] | OP | ANN | 90% | 80% | NA | NR
50 | Hsu et al, 2000[69] | PP | Shallow Neural Net | NA | NA | NA | NR
51 | Frye et al, 1996[70] | OP | Feed-forward, back-propagation error adjustment model | 98% | NA | NA | NR
| | | | 77% | NA | NA | NR

ADTree, alternating decision tree; AUC, area under the curve; CNN, convolutional neural network; DNN, deep neural network; DP, diagnosis prediction; k-NN, k-nearest neighbor; LASSO, least absolute shrinkage and selection operator; LDA, linear discriminant analysis; MLR, multiple logistic regression; NA, not applicable; NB classifier, naive Bayes classifier; NR, not reported; OP, outcome prediction; PP, preoperative planning; RF, random forest.


Breakdown of the Applications of ML Models in Diagnosis Prediction, Outcome Prediction, and Preoperative Planning

In total, 51 studies were included in the review, which evaluated the accuracy of 103 ML algorithms. Of these, 27 were on burns surgery and 24 on general reconstructive surgery. The publication years ranged from 1996 to 2020, with 25 studies published in 2019–2020 alone. The clinical utility of these algorithms was to assist clinicians in diagnosis prediction (n = 22), outcome prediction (n = 21), and preoperative planning (n = 8). In diagnosis prediction, algorithms were created to assist in automated burn depth diagnosis from 2D photography (n = 9) and total burn surface area (n = 1), automated diagnosis of craniosynostosis (n = 5), wound identification in 2D photography (n = 2), diagnosis and severity assessment of facial palsy (n = 1), diagnosis of congenital auricular deformities (n = 1), identification of emotional responses to plastic surgery on Twitter (n = 1), automated age estimation after rhinoplasty (n = 1), and identifying the correct answer to frequently asked questions (n = 1). In outcome prediction, the ML algorithms predicted mortality in burn patients (n = 5), the occurrence of AKI in burn and trauma patients (n = 4), occurrence of postoperative complications in breast and head and neck free flap reconstruction (n = 3), concentration and response of aminoglycosides in burn patients (n = 2), postoperative faces after oculoplastic and craniosynostosis surgery (n = 2), burn healing time (n = 1), mortality in patients with necrotizing soft tissue infection (n = 1), delay in radiotherapy following cancer excision (n = 1), posttraumatic stress disorder following burns (n = 1), and factors predicting the occurrence of burns in the pediatric population (n = 1).
In preoperative planning, ML was used to predict which wounds will need grafting (n = 2), which patients will need orthognathic or cleft palate operations (n = 2), planning of orthognathic and mandibular resections (n = 2), predicting open wound size (n = 1), and the complexity of reconstruction following head and neck cancer excision (n = 2).

ML Models Demonstrate High Accuracy, Sensitivity, and Specificity That May Enhance Clinical Decision-making

The 51 studies evaluated 103 ML algorithms (Table 2). The pooled mean accuracy of the ML algorithms was 86.84% (range 60.00%–100%). The pooled mean sensitivity and specificity were 81.88% (range 5.00%–99.30%) and 86.38% (range 60.00%–100%), respectively, as reported in 39 models. A subgroup analysis was performed based on the clinical utility of the algorithms. For diagnosis prediction, the pooled accuracy, sensitivity, and specificity of the ML algorithms were 88.80% (range 66.20%–97.60%), 90.62% (range 75.80%–97.90%), and 86.81% (range 60.00%–99.60%), respectively. In outcome prediction, these were 86.11% (range 66.20%–97.60%), 69.67% (range 5.00%–100%), and 85.94% (range 60.00%–100%), respectively. In preoperative planning, two studies reported the accuracy, sensitivity, and specificity, which were 80.28% (range 77.30%–83.80%), 98.00% (range 97.00%–99.00%), and 67.05% (range 60.00%–74.10%). A second subgroup analysis of the reported accuracy was performed based on the type of model utilized. The mean accuracy for NNs was 88.25% (range 73.80%–100%), for SVMs 88.02% (range 67.20%–100%), for decision trees/random forests 78.75% (range 60.00%–96.12%), and for logistic regression 76.85% (range 66.40%–95.00%).

Breakdown and Analysis of the Supervised and Unsupervised ML Models Utilized

Supervised ML was utilized in 50 of the included studies and unsupervised learning in three (two studies employed both supervised and unsupervised learning). The supervised ML algorithms identified are summarized in Table 3. The most commonly used were NNs (n = 34), SVMs (n = 13), decision trees/random forests (DT/RF, n = 10), and LR (n = 9). The unsupervised ML models utilized were K-means clustering and ShapeWorks software with principal component analysis; the algorithm was not reported in one study.
Table 3.

Technical Characteristics of ML Algorithms Utilized in Burns and Reconstructive Surgery

Study No. | Author | Function | Purpose | Input | Output | Supervised or Unsupervised | Modeling (Classification or Regression) | Real or Synthetic Data | Training | Validation | Test
1 | Abubakar et al, 2020[20] | DP | Differentiate healthy versus burned skin in both white and black skin | 2D photographs | Differentiate healthy versus burned skin in both white and black skin | Supervised | Classification | Data augmentation | 80% | NA | 20%
2 | Chauhan J et al, 2020[21] | DP | Diagnose depth of burns | 2D photographs | Differentiate body part + severity of burn | Supervised | Classification | Data augmentation | 80% | 20% | Separate test set
3 | Desbois et al, 2020[22] | DP | Automated assessment of TBSA | Anthropometric measurements | Automated assessment of TBSA | Supervised | Regression | Real data | 80% | NA | 20%
4 | Rashidi et al, 2020[23] | OP | Prediction of AKI in burn and trauma patients | Renal injury biomarkers and urine output | Prediction of AKI in burn and trauma patients | Supervised | Classification | Real data | 59% | NA | 41%
5 | Bhalodia et al, 2020[24] | DP | Measuring severity of craniosynostosis | CT images | Measuring severity of craniosynostosis | Unsupervised | NA | Real data | NR | NR | NR
6 | Guarin et al, 2020[25] | DP | Diagnosis and severity assessment of facial palsy | 2D photographs | Automatic localization of 68 facial features in photographs of healthy subjects and patients | Unsupervised | NA | Real data | 90% | 5% | 5%
7 | Formeister et al, 2020[26] | OP | Predicting any type of complication following free flap reconstruction | 14 patient characteristics | Prediction of complications in microvascular free flaps | Supervised | Classification | Real data | 80% | NA | 20%
8 | Boczar et al, 2020[27] | DP | Answering frequently asked questions | Participant question | Correct answer to FAQs | Supervised | Classification | Real data | NR | NR | NR
9 | O’Neil et al, 2020[28] | OP | Predicting flap failure in microvascular breast free flap reconstruction | 7 patient characteristics | Flap failure (yes/no) | Supervised | Classification | Data augmentation | 50%–70% | NA | 30%–50%
10 | Yoo et al, 2020[29] | OP | Postoperative appearance following oculoplastic surgery for thyroid-associated ophthalmopathy | Preoperative photograph | Postoperative photograph | Supervised | Regression | Data augmentation | NR | NR | NR
11 | Angullia et al, 2020[30] | OP | Prediction of changes in face shape from craniosynostosis surgery | High-resolution CT | Predict changes in face shape from craniosynostosis surgery | Supervised | Regression | Real data | NR | NR | NR
12 | Eguia et al, 2019[31] | OP | Prediction of in-hospital mortality in patients with necrotizing skin and soft tissue infection | Patient demographics, comorbidities, and hospital characteristics (73 parameters in total) | Prediction of in-hospital mortality in patients with necrotizing skin and soft tissue infection | Supervised | Classification | Real data | 80% | NA | 20%
13 | Ohura et al, 2019[32] | DP | Diagnosis of wound ulcer | 2D photographs | Differentiation of healthy tissue from ulcer region | Supervised | Classification | Real data | 90% | NA | 10%
14 | Porras et al, 2019[33] | DP | Diagnosis of craniosynostosis from 3D photographs | 3D photographs | Diagnosis of craniosynostosis from 3D photographs | Supervised | Classification | Real data | NR | NR | NR
15 | Knoops et al, 2019[34] | PP | Orthognathic surgery | CT | Need for orthognathic surgery (yes/no) | Supervised | Classification | Real data | 80% | NA | 20%
16 | Hallac et al, 2019[35] | DP | Diagnosis of congenital auricular deformities | 2D photographs | Identify presence of congenital auricular deformities (yes/no) | Supervised | Classification | Real data | NR | NR | NR
17 | Levites et al, 2019[36] | DP | Identify emotional responses to plastic surgery | Twitter key words | Analyze emotional responses to plastic surgery procedures | Supervised | Classification | Real data | 60% | 20% | 20%
18 | Shew et al, 2019[37] | OP | Prediction of delay in radiotherapy | Variable inpatient data | Prediction of delay of radiotherapy (more or less than 50 days to treatment) | Supervised | Classification | Real data | NR | NR | NR
19 | Dorfman et al, 2019[38] | DP | Identification of age perception following rhinoplasty | 2D photographs | Automated age prediction | Supervised | Classification | Real data | NR | NR | NR
20 | Qiu et al, 2019[39] | PP | Plan mandibular resections | CT | Automated 3D mandibular segmentation preoperatively | Supervised | Regression | Real data | 48% | 7% | 45%
21 | Aghaei et al, 2019[40] | OP | Elaboration of factors predicting pediatric burns | Various health, social, and demographic risk factors | Most important factors in predicting burn occurrence | Supervised | Classification | Real data | 70% | NA | 30%
22 | Cirillo et al, 2019[41] | DP | Diagnose depth of burns | 2D photographs | Classification of burn depth | Supervised | Classification | Data augmentation | NR | NR | NR
23 | Tran et al, 2019[42] | OP | Prediction of AKI in burn and trauma patients | Renal injury biomarkers and urine output | Prediction of AKI in burn and trauma patients | Supervised | Classification | Real data | 80% | NA | 20%
24 | Yadav et al, 2019[43] | DP | Diagnose depth of burns | 2D photographs | Classify burns by depth and surface area | Supervised | Classification | Real data | NR | NR | NR
25 | Jiao et al, 2019[44] | DP | Diagnose depth of burns | 2D photographs | Classify burns by depth and surface area | Supervised | Classification | Real data | 87% | NA | 13%
26 | Liu et al, 2018[45] | PP | Explore whether ML can predict open wound size | Fluid resuscitation volume and other patient factors | Predict open wound size | Supervised | Regression | Real data | 90% | NA | 10%
27 | Martinez-Jimenez et al, 2018[46] | PP | Predicting which wounds need grafting | Infrared thermography | Prediction of treatment modality required for burn wound | Supervised | Classification | Real data | 61% | NA | 39%
28 | Su et al, 2018[47] | OP | Prediction of PTSD & major depressive disorder in burn patients | Burn-related variables, empirically derived risk factors from a previous meta-analysis, & theory-derived cognitive variables | Prediction of PTSD & major depressive disorder in burn patients | NR | NR | NR | NR | NR | NR
29 | Tang et al, 2018[48] | OP | Prediction of AKI in burn patients | Patient risk factors and laboratory measurements | Prediction of AKI in burn patients | Supervised | Classification | Real data | NR | NR | NR
30 | Cobb et al, 2018[49] | OP | Prediction of mortality of burn patients | Patient risk factors and laboratory measurements | Predict whether a patient would (1) live versus (2) die | Supervised | Classification | Real data | 66% | NA | 34%
31 | Cho MJ et al, 2018[50] | DP | Diagnosis of craniosynostosis | CT images | Automated differentiation of craniosynostosis from benign metopic ridge on CT | Unsupervised | Classification | Real data | NR | NR | NR
32 | Kuo et al, 2018[51] | OP | Predicting surgical site infection | Patient risk factors | Prediction of SSI (yes/no) | Supervised | Classification | Real data | 70% | NA | 30%
33 | Tan et al, 2017[52] | PP | Complexity of reconstruction following basal cell cancer excision | Patient risk factors | Prediction of intraoperative surgical complexity | Supervised | Classification | Real data | NR | NR | NR
34 | Huang et al, 2016[53] | OP | Prediction of mortality of burn patients | Patient risk factors and laboratory measurements | Prediction of whether a patient would (1) live versus (2) die | Supervised | Classification | Real data | 21% | 66% | 13%
35 | Park et al, 2015[54] | PP | Prediction of need for surgery in patients with cleft lip/palate | Lateral cephalograms | Prediction of need for surgery in patients with cleft lip/palate | Supervised | Classification | Real data | NR | NR | NR
36 | Serrano et al, 2015[55] | PP | Predicting which wounds need grafting | 2D photographs | Predicting which wounds need grafting (yes/no) | Supervised | Classification | Real data | 21% | NA | 79%
37 | Mukherjee et al, 2014[56] | DP | Wound recognition and classification | 2D photographs | Automated assessment of wound classification | Supervised | Classification | Real data | NR | NR | NR
38 | Mendoza et al, 2014[57] | DP | Diagnosis of craniosynostosis | CT images | Automated craniosynostosis diagnosis from CT | Supervised | Classification | Real data | NR | NR | NR
39 | Acha et al, 2013[58] | DP | Diagnose depth of burns | 2D photographs | Classify burns by depth | Supervised | Classification | Real data | 21% | NA | 79%
| | PP | Predicting which wounds need grafting | 2D photographs | Predict whether a burn will need grafting | Supervised | Classification | Real data | 21% | NA | 79%
40 | Schneider et al, 2012[59] | OP | Prediction of AKI in burn patients | Patient risk factors and laboratory measurements | Prediction of AKI in burn patients | Supervised | Classification | Real data | 71% | NA | 29%
41 | Patil et al, 2009[60] | OP | Prediction of mortality of burn patients | Patient risk factors and laboratory measurements | Prediction of mortality in burn patients | Supervised | Classification | Real data | K-fold cross-validation | K-fold cross-validation | K-fold cross-validation
42 | Yamamura et al, 2008[61] | OP | Prediction of response of aminoglycosides against MRSA infection in burn patients | Patient risk factors and laboratory measurements | Prediction of response of aminoglycosides against MRSA infection in burn patients | Supervised | Classification | Real data | K-fold cross-validation | K-fold cross-validation | K-fold cross-validation
43 | Ruiz-Correa et al, 2008[62] | DP | Diagnosis of craniosynostosis | CT images | Classification of craniosynostosis | Supervised | Classification | Real data | | |
44 | Acha et al, 2005[63] | DP | Diagnose depth of burns | 2D photographs | Automated assessment of burn wound depth | Supervised | Classification | Real data | 56% | NA | 44%
45 | Yeong et al, 2005[64] | OP | Prediction of burn healing time | Reflectance spectrometer measurements | Prediction of burn healing time | Supervised | Classification | Real data | NR | NR | NR
46 | Serrano et al, 2005[65] | DP | Diagnose depth of burns | 2D photographs | Automated assessment of burn wound depth | Supervised | Classification | Real data | NR | NR | NR
47 | Yamamura et al, 2004[66] | OP | Prediction of aminoglycoside/ab × concentration in burn patients | Patient risk factors and laboratory measurements | Prediction of aminoglycoside/ab × concentration in burn patients | Supervised | Classification | Real data | 100% | 100% | 100%
| | | | | | Supervised | Classification | Real data | 80% | 66.70% | 85.70%
48 | Acha et al, 2003[67] | DP | Identify burn tissue from healthy, and classify depth of burn | 2D photographs | Identify burn tissue from healthy, and classify depth of burn | Supervised | Classification | Real data | 80% | NA | 20%
49 | Estahbanati et al, 2002[68] | OP | Prediction of mortality of burn patients | Patient risk factors and laboratory measurements | Prediction of mortality of burn patients | Supervised | Classification | Real data | 75% | NA | 25%
50 | Hsu et al, 2000[69] | PP | Skull reconstruction of areas needing an operation | CT | Skull reconstruction in CT for preoperative planning | Supervised | Regression | Real data | NA | NA | NA
51 | Frye et al, 1996[70] | OP | Prediction of mortality of burn patients | Patient risk factors and laboratory measurements | Prediction of mortality of burn patients | Supervised | Classification | Real data | 90% | NA | 10%
| | | Prediction of hospital stay of burn patients | | Prediction of hospital stay of burn patients | Supervised | Classification | Real data | 90% | NA | 10%

DP, diagnosis prediction; NA, not applicable; NR, not reported; OP, outcome prediction; PP, preoperative planning.

Technical Characteristics of ML Algorithms Utilized in Burns and Reconstructive Surgery.

Lack of Data Augmentation and Validation during Training

Data augmentation is often used with small datasets to artificially create more data samples, increasing the effective dataset size and, as a result, the statistical performance of a model. Data augmentation was used in only six of the 51 included studies; the remaining articles relied only on real data. For diagnostic predictions, the majority of studies utilized 2D photographs (n = 15) and CT scans (n = 4). For clinical outcome prediction, patient risk factors and laboratory measurements on admission were utilized in most models (n = 17). In preoperative planning, CT scans (n = 3) and 2D photographs (n = 2) comprised the majority of inputs utilized.

Training ML models requires splitting the dataset into training, validation, and test sets, where the validation set is used for hyperparameter tuning during training to prevent "overfitting" of the model to the given data. In total, 35 studies report their training and testing splits, with an 80%–20% split between the training and testing sets being the most common methodology presented (n = 9); only 10 of these 35 studies utilized a validation set during training. In terms of output, ML algorithms functioned primarily via classification (45 studies) rather than regression (six studies). Classification was utilized to allocate a new subject to a specific outcome (for example, a burn patient needing grafting versus healing by secondary intention), whereas regression was used in studies aiming to predict a postoperative outcome directly (a postoperative CT scan, a postoperative 2D photograph, or a predicted wound size).
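The splitting procedure described above can be sketched in a few lines of Python; the function name and split fractions are illustrative, not taken from any included study:

```python
import random

def train_val_test_split(samples, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle and partition a dataset into disjoint train/validation/test sets.

    The validation set is used for hyperparameter tuning during training;
    the test set is held back for a single, final estimate of performance.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# An 80%-20% train-test split (the most common methodology above),
# with a 10% validation slice carved out of the training portion:
train, val, test = train_val_test_split(range(100))
```

Carving the validation slice out of the training portion preserves the 80%–20% train-test split while still leaving held-out data for hyperparameter tuning.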

Risk of Bias Assessment

The risk of bias (RoB) was assessed via the QUADAS-2 tool for RoB and concerns over applicability (Fig. 2). The majority of studies had an unclear RoB in the patient selection (n = 20) and index test (n = 24) domains. Most had a low RoB in the reference standard (n = 39) and flow and timing (n = 35) domains. Regarding applicability, more than half of the studies raised low concern in the patient selection, index test, and reference standard domains (n = 32, n = 33, and n = 38, respectively).
Fig. 2.

Summary of the QUADAS-2 (Quality Assessment on Diagnostic Accuracy Studies-2) analysis.

DISCUSSION

This is the first systematic review focusing on the application of ML in plastic surgery, adding to previous reviews on AI in the specialty.[72] After careful selection of studies that demonstrated the clinical application of these algorithms, we identified 51 articles describing the application of 103 ML algorithms. In our review, the mean accuracy for diagnosis prediction, outcome prediction, and preoperative planning was 88.80%, 86.11%, and 80.28%, respectively. The model class with the highest mean accuracy was neural networks (NNs; 88.25%), followed by support vector machines (SVMs; 88.02%), decision trees/random forests (78.75%), and logistic regression (LR; 76.85%). Similar findings have been reported in systematic reviews of other surgical specialties. In orthopedic surgery and neurosurgery, the most commonly utilized models have been NNs, followed by SVMs and LR.[3,73] Outcome prediction of ML models in these specialties ranged from 70% to 97%, in line with the findings of this report.[8,72] Nonsurgical specialties have also utilized NNs and SVMs most frequently, with accuracies approaching 96% depending on the specialty and model intent.[74,75] A potential reason for this preference is that NNs, SVMs, and decision trees most closely resemble the cognition behind clinical judgment, in which clinicians derive outcome classifications from multiple, nonlinear inputs. In plastic surgery, ML demonstrated potentially superior accuracy in diagnosis and outcome prediction when compared with clinician judgment.
In burn surgery, models included in this review were able to classify burn thickness with an accuracy of up to 99.3%, in contrast to the 60%–70% achieved by surgeons.[21,76] Models have also demonstrated the ability to predict mortality rates with an accuracy of 93%, outperforming commonly used predictive models such as the Belgian score, Boston score, and APACHE II, which achieve sensitivities of 72%, 66%, and 81%, respectively.[50] In microsurgery, models produced high accuracy in the prognosis of free flap failure (66%), whereas commonly used prognostic surgical risk calculators have been deemed unreliable for head and neck and breast microsurgical reconstruction (Brier scores <0.01 and 0.09–0.44, respectively).[77,78] In addition, ML models demonstrated predictive capacity for outcomes for which predictive models have not yet been developed but may assist the surgeon in the clinical workplace; examples include prediction of acute kidney injury (AKI) in burn patients, mortality from necrotizing infections, and postoperative outcomes in craniosynostosis surgery and reconstructive surgery following craniosynostosis correction.[29,31,48,59] ML in plastic surgery has incredible potential to advance patient care, but it is still in its infancy. This review has highlighted several patterns in successful applications. Whenever a diagnosis relies solely on a visual stimulus, for example 2D photography or CT, ML has consistently and reliably outperformed surgeons' diagnostic accuracy.[18,37,39,40,46,51,53,59,63] Further, in conditions with well-established correlations between certain risk markers and an outcome of interest, such as deranged blood tests on admission and AKI in burn patients, ML yielded highly accurate predictive algorithms.[38,44,55] However, attempts to include weakly related risk markers resulted in algorithms with lower overall predictive accuracy, rendering them unsafe for clinical practice.
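The accuracy, sensitivity, and specificity figures compared throughout this review derive from the standard 2 × 2 confusion table; a minimal sketch, with invented counts for a hypothetical burn-depth classifier:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from a 2x2 confusion table."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # all correct calls / all cases
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    return accuracy, sensitivity, specificity

# Invented counts for a hypothetical classifier evaluated on 200 wounds:
acc, sens, spec = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)  # 0.85, 0.80, 0.90
```

The example makes the review's point concrete: a single pooled accuracy can mask very different sensitivity/specificity trade-offs, which is why each algorithm should be examined in isolation.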
This review further identified that some plastic surgery subspecialties, such as hand surgery, have yet to incorporate this technology. This may be due to the challenging nature of classifying potential outcomes (eg, classification of hand function outcomes) or a lack of data; future studies should aim to harness the potential of this technology. From a technological standpoint, this review identified three key areas for improving future algorithms: expanding the dataset size using data augmentation, utilizing novel deep learning models, and making proper use of algorithm validation in research. Data augmentation can be invaluable in the creation of future algorithms, addressing the main obstacle of access to the large amounts of data needed to train these models. It is a process by which one can artificially enhance the diversity of a patient database without actually collecting new data. (See figure, Supplemental Digital Content 1, which displays data augmentation utilizing random cropping, random rotation, and mirroring (horizontal flipping); a single datapoint has been augmented into seven novel datapoints.) This was utilized in only five studies in this review. O'Neil et al utilized data augmentation to expand a database of 11 patients to 269, allowing the creation of an algorithm to predict the probability of total free flap failure in microvascular breast reconstruction.[24] Until large-scale anonymized medical datasets become more readily available, such as through the OpenSAFELY platform, data augmentation can help clinicians overcome the challenges of limited patient datasets. Secondly, future research could substantially benefit from more recent advances in the field of NNs and deep learning. Compared with traditional ML, deep NNs can process vast amounts of data efficiently and discover complex underlying patterns in the data at scale.
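The geometric augmentations mentioned above (mirroring and rotation; random cropping is omitted here) can be sketched on a toy "image" represented as a nested list; a real pipeline would operate on pixel arrays, and all names are illustrative:

```python
def hflip(img):
    """Mirror an image (nested list of pixel values) horizontally."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """One datapoint in, eight out: the original plus its mirror and the
    three 90-degree rotations of each."""
    variants = [img, hflip(img)]
    rotated = img
    for _ in range(3):
        rotated = rot90(rotated)
        variants.extend([rotated, hflip(rotated)])
    return variants
```

For an asymmetric input this yields the original plus seven distinct novel variants, the same seven-fold augmentation illustrated in the supplemental figure.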
A limitation here is the large volume of appropriately structured data needed to train these models. Lastly, future research should ensure that all algorithms created are validated before testing. Separating the validation and test sets is crucial because it prevents overfitting of an algorithm to a given set of data and the reporting of misleadingly high performance. Our review identified that only 10 of the 51 studies utilized validation, indicating a high risk of bias in the remaining studies, as their high accuracies could be the result of overfitting.

The evidence in this study is limited by the lack of high-quality level I evidence. The existing studies are mostly small retrospective case series that are inherently at risk of bias. There are no prospective, randomized controlled trials evaluating these technologies in the clinical setting against clinician acumen, which limits our assessment of the safety and utility of the technologies. Further, the mean accuracy, sensitivity, and specificity of included algorithms were reported collectively for all algorithms, rather than via subgroup analysis based on the condition examined, because of insufficient studies in the specialty. This pooling of results is not an indication of the accuracy of any individual model, and each algorithm should be examined in isolation; nevertheless, it still provides invaluable insight into the accuracy of these algorithms in plastic surgery. Finally, because of the limited MeSH terms currently available for ML in medicine, potentially important studies on the topic may have been missed. These are expected to be minimal, as we performed a wide library search complemented by extensive reference checking to provide an accurate, up-to-date review.
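The validation practice urged above is often implemented as k-fold ("K-cross") validation, as reported by several of the tabulated burn studies; a minimal sketch of the fold bookkeeping (function name illustrative):

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross-validation:
    every sample serves in the validation fold exactly once."""
    # Distribute n samples into k folds, the first n % k folds one larger.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Within each fold the model is never scored on data it was fit to, which is exactly the separation whose absence this review flags as a source of overfitting bias.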

CONCLUSIONS

ML has the potential to enhance clinical decision-making in plastic surgery by making highly accurate diagnostic and outcome predictions; however, the technology is still in its infancy. There is vast heterogeneity between published studies with regard to the clinical task the algorithms are designed for and the model utilized, which precluded data synthesis and meta-analysis. There is a pressing need for larger prospective, randomized controlled trials providing level I and II data, in which these algorithms are utilized in the clinical setting. Future research could benefit from larger datasets, data augmentation, state-of-the-art deep learning models, and more rigorous validation during design.