Zifan Jiang, Mark Luskus, Salman Seyedi, Emily L Griner, Ali Bahrami Rad, Gari D Clifford, Mina Boazak, Robert O Cotes.
Abstract
BACKGROUND: Schizophrenia is a severe psychiatric disorder that causes significant social and functional impairment. Currently, the diagnosis of schizophrenia is based on information gleaned from the patient's self-report, what the clinician observes directly, and what the clinician gathers from collateral informants, but these elements are prone to subjectivity. Utilizing computer vision to measure facial expressions is a promising approach to adding more objectivity in the evaluation and diagnosis of schizophrenia.
Year: 2022 PMID: 35395049 PMCID: PMC8992987 DOI: 10.1371/journal.pone.0266828
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. PRISMA flow diagram.
Overview of the participants, objective types, descriptions and findings.
| Article | Year | Subject | Type | Description | Findings |
|---|---|---|---|---|---|
| [ | 2007 | 11 SZ, 10 NC | Descriptive | Developed a computational framework to quantify intended emotional expression differences between patients with schizophrenia and healthy controls matched for age, ethnicity, and gender. | Significant difference in average abilities to express emotions, especially in the case of anger. The average abilities to express emotions correlated significantly with clinical severity of flat affect. |
| [ | 2007 | 12 SZ, 12 NC | Descriptive | Provided a framework to quantify the facial expression abnormality of patients with schizophrenia in posed and evoked emotions by combining 2D and 3D facial features, and compared the results with those from human raters. | Human raters could only correctly identify a low percentage (mostly 40% to 70%, except for happiness) of intended emotions for both controls and patients, with different accuracies for controls and schizophrenia patients. A significant group difference in evoked disgust was found. |
| [ | 2007 | 12 SZ, 12 NC | Descriptive | Captured facial expressions of individuals and quantified their expression flatness by estimating the overlap between different facial expression clusters in the learned embedding. | The patient group had a much larger facial expression overlap than the control group, demonstrating that flat affect is an important symptom in diagnosing schizophrenia. |
| [ | 2008 | 1 SZ, 1 NC | Descriptive | Created an automated computerized scoring system as an alternative to FACS for systematic analysis of facial expressions of healthy controls, schizophrenia patients, and patients with Asperger’s syndrome. | The healthy control expressed intended emotions better than the patients with Asperger’s syndrome and schizophrenia (especially for fear). The control showed more neutral expression than the two patients. |
| [ | 2010 | 27 SZ, unreported number of NC | Descriptive | The authors aimed to determine whether automated video-based quantification of body movement could be a reliable indicator of nonverbal behavior in schizophrenia patients, and whether body movement is valid as a measure of expressiveness. | Automated MEA-based detection of body and head movement and movement speed was found to be highly reliable, with clear indications of its validity. MEA provides an objective assessment of body movement. |
| [ | 2011 | 4 SZ, 4 NC | Descriptive | Developed an automated FACS based on advanced computer science technology and derived quantitative measures of flat and inappropriate facial affect automatically from temporal AU profiles. | NA |
| [ | 2013 | 20 SZ, 100 NC | Descriptive and predictive | Determined whether schizophrenia patients display fewer speaking gestures and listener nods, and whether patients’ increased symptom severity and poorer social cognition are associated with reduced gestures and nods. Additionally, the authors aimed to determine whether patients’ partners compensate for patients’ reduced nonverbal behavior by gesturing more when speaking and nodding more when listening. | Patients with schizophrenia exhibit reduced rates of gesture making compared to healthy controls. Increased levels of negative symptoms are associated with poorer rapport with patients. |
| [ | 2014 | 28 SZ, 26 NC | Descriptive and predictive | The authors worked to develop novel measures of facial expressivity using information theory. In particular, they developed measures of ambiguity and distinctiveness in facial expressivity, and hoped that these measures could be used to analyze large data sets of dynamic expressions. | Results indicated that ambiguity and distinctiveness of expression were both associated with a diagnosis of schizophrenia. The method developed is more repeatable and objective than observer-based rating scales. Predictions were highest for measures of overall facial expression, with an F-score of 12. |
| [ | 2015 | 34 SZ, 33 NC | Descriptive and predictive | This study aimed to pair data-driven analysis of facial expression with descriptive methods using machine learning tools and other technology. | Results from this study are in agreement with previous studies, which demonstrate that schizophrenia symptoms result in changes to AUs when compared to healthy controls. |
| [ | 2016 | 34 SZ, 33 NC | Descriptive and predictive | The authors aimed to create ‘prototype’ facial expression clusters in order to study a wider range of facial features than traditional AU and FACS computation allows for. | The authors’ findings were consistent with prior studies, which showed that schizophrenia patients overall have lower levels of facial expressivity. |
| [ | 2016 | 34 SZ, 33 NC | Descriptive | The authors aimed to compute discriminative features of AU activity for the purpose of measuring the following qualities, which represent symptomology used in the diagnosis of schizophrenia: flat affect, incongruent affect, and inappropriate affect. | In contrast with previous studies, the authors found that patients with schizophrenia exhibited reduced amounts of expression in positive emotional responses. Their findings also suggest that the magnitude of changes in facial expression may correlate to symptom severity. |
| [ | 2016 | 18 SZ | Descriptive and predictive | The overarching goal was to create novel methods for examining clinical behavior by identifying behavioral indicators relevant to various symptoms. Application to psychiatric populations could provide a needed method to collect objective behavioral data. The authors worked to identify behavioral indicators relevant to certain psychosis symptoms as measured by clinical scales, and to determine which structured interview questions correlate with facial findings suggestive of specific psychotic symptoms. | Negative and positive symptoms are best elicited via different questions; for example, positive symptoms were elicited via questions regarding the patient’s energy, and negative symptoms via questions regarding self-confidence. AU5 and AU6 are activated more frequently in patients with depression. AU12 is negatively correlated with the PANSS Negative summative scale. The overall conclusion was that AUs can be used to detect psychotic symptoms as measured on the PANSS, BPRS, and MADRS. There is value in evaluating facial expressions at the question level. |
| [ | 2017 | 1 SZ, 1 NC | Descriptive | Compared facial expressions of a patient with schizophrenia and a healthy control, utilizing marker-based technology that recognizes facial features. | Facial expressivity intensity was higher in the healthy control, and analysis of facial expressions using marker-based technology displayed high fidelity. |
| [ | 2018 | 91 SZ | Descriptive and predictive | Proposed SchiNet, a novel neural network architecture trained on large-scale FACS datasets that estimates the presence and intensity of action units. SchiNet was then used to predict expression-related symptoms from two commonly used assessment interviews: the Positive and Negative Syndrome Scale (PANSS) and the Clinical Assessment Interview for Negative Symptoms (CAINS). | Significant correlations were found between symptoms and the frequency of occurrence of automatically detected facial expressions. The scores of several symptoms in the PANSS and CAINS interviews can be estimated with an MAE of less than 1 level. Automatic estimation of symptom severity needs further improvement to reach human-level performance. |
| [ | 2019 | 25 SZ | Descriptive and predictive | Developed a proof-of-concept for the potential of using the machine learning FAR system as a clinician-supporting tool, in an attempt to improve the consistency and reliability of the mental status examination. | There was a lack of inter-rater reliability among five senior adult psychiatrists working in the same mental health center. Automatic facial analysis may be able to predict the label provided by psychiatrists. |
| [ | 2019 | 74 SZ | Predictive | Incorporated temporal information into SchiNet using stacked GRUs to directly address the problem of Treatment Outcome Estimation (TOE) in schizophrenia; more specifically, the method aims to determine whether specific symptoms have improved by jointly analysing two videos of the same patient, one before and one after treatment. | The proposed method can determine the TOE of CAINS expression symptoms and PANSS negative symptoms with an accuracy of about 0.7 (0.64–0.71) and an F1 score of around 0.4 (0.33–0.46). The determination is more accurate with the proposed, specifically designed TOE method than when applying symptom severity estimation to before- and after-treatment videos independently. |
| [ | 2021 | 18 SZ, 9 NC | Descriptive and predictive | Developed remote, smartphone-based assessments to capture objective measurements of head movement, which were then used as features to predict both PANSS subscale scores and individual items in each of those subscales, with age and gender as confounding variables. | Head movements acquired remotely through smartphones were able to classify schizophrenia diagnosis and quantify symptom severity in patients with schizophrenia. |
SZ = schizophrenia patient, NC = normal control.
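The motion-energy analysis (MEA) in the 2010 study above quantifies body movement as frame-to-frame grayscale change inside regions of interest, normalized by ROI size. As a minimal sketch of that idea (the function names, the `(y0, y1, x0, x1)` ROI format, and the threshold are our assumptions, not the published MEA software):

```python
import numpy as np

def motion_energy(frames, roi):
    """Mean absolute grayscale change between consecutive frames inside a
    region of interest (y0, y1, x0, x1), normalized by ROI area.
    Hypothetical re-implementation of the motion-energy idea."""
    y0, y1, x0, x1 = roi
    crops = [np.asarray(f, dtype=float)[y0:y1, x0:x1] for f in frames]
    area = (y1 - y0) * (x1 - x0)
    return np.array([np.abs(b - a).sum() / area
                     for a, b in zip(crops, crops[1:])])

def percent_moving(energy, threshold):
    """Percentage of frame pairs with detectable movement
    (energy above a chosen threshold)."""
    return 100.0 * (np.asarray(energy) > threshold).mean()
```

From such per-frame energies one can derive the subject-level quantities the study reports, e.g. the percentage of time with detectable movement in an ROI.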
Overview of participant interviews.
| Article | Passive/Evoked | Interview Structure |
|---|---|---|
| [ | Evoked | Subjects were asked to make facial expressions of happiness, sadness, anger, and fear. |
| [ | Both | Participants were asked to express happiness, anger, fear, sadness, and disgust at mild, moderate, and peak levels. In the evoked session, participants were guided through vignettes, provided by the participants themselves, that described a situation in their life pertaining to each emotion. |
| [ | Passive | Researchers conducted role-play tests (RPT), which were used to measure social competence in schizophrenia. All RPTs were video recorded. Each test consisted of 14 social scenes that represented three response domains. |
| [ | Evoked | Researchers had participants express the following emotions: happiness, sadness, anger, fear, and disgust. |
| [ | Both | Patients were recorded expressing sadness, anger, happiness, fear, and disgust. Each emotion recording lasted approximately 2 minutes. Additionally, patients were recorded while being read self-recorded vignettes about times in their lives when they experienced these emotions. |
| [ | Passive | The interview was semi-structured and involved a single question of “Tell me about yourself” followed by three emotionally evocative questions that were not described. |
| [ | Unknown | Participants underwent a short, structured interview that was not described by the authors. |
| [ | Passive | Participants were interviewed in a style consistent with a routine clinical encounter for a patient under inpatient treatment for schizophrenia. The interview was semi-structured, consisted of 13 questions, and was approximately 10 minutes in length. |
| [ | Passive | Recordings from a previous [ |
| [ | Passive | Participants underwent a semi-structured 10-minute interview that consisted of the following ten questions: (1) Can you please present yourself and tell me a bit about yourself? (2) How do you feel? (3) Can you tell me about the events that led to your current hospitalization? (4) Can you tell me some things about your family? (5) Can you tell me of something sad that has recently happened to you? (6) Can you tell me of something pleasing that has recently happened to you? (7) Is there anything else you want to add? (8) What do you think about the recent situation in the country? (9) What are your future plans? (10) How did you feel about talking with me in front of the camera? |
| [ | Evoked | Open-ended questions such as “What have you been doing for the past few hours?” and “What are your plans for the rest of the day?” were asked to elicit a free verbal response. |
Overview of data processing and statistical analyses.
| Article | Frame-level Features | Subject-level Features | Statistical Tests | Performance Metrics | Validation |
|---|---|---|---|---|---|
| Studies with 2D or 3D image data | |||||
| [ | SVM output of the intended expression normalized with outputs from other SVMs. | Average normalized output. | Paired t-test | PCC | NA |
| [ | 2D features: the area of facial regions, the distance between some fiducial points; 3D Curvature Features and 3D Gabor moment invariants for six facial regions. | Lower dimensional embedding of the frame level features was learned with the ISOMAP manifold learning algorithm. | Paired t-test | NA | NA |
| Studies with video data | |||||
| [ | Geometric features similar to [ | Lower dimensional embedding of the frame level features was learned with the ISOMAP manifold learning algorithm. A “Flatness Index” was defined as the minimal pair-wise overlap between one expression to other expressions in the ISOMAP embedding. | Paired t-test | NA | NA |
| [ | Motion energy: the amount of grayscale change from one frame to the next in the ROIs, normalized by ROI size. | Percentage of time with detectable movement in ROIs and the speed of body movement. | Paired t-tests; ANOVA | PCC; Cronbach’s alpha | NA |
| [ | Confidence and presence of the 15 AUs. | Frequency (percentage of frames present) of single AUs and AU combinations; Flatness measure: frequency of neutral frames (no AU present); Inappropriateness measure: frequency of “disqualifying” AUs defined in [ | NA (method paper) | NA | NA |
| [ | Confidence and presence of the 15 AUs. | Same as [ | Two-way ANOVA | PCC; Cohen’s d | None |
| [ | Intensities and presence of the 20 AUs. | Mean and standard deviations of intensities of each AU during answers to specific questions. | NA | PCC | LOSO |
| [ | Normalized intensities of ten AUs and smile. | The Fisher vector representation of the distribution of intensities over time, from unsupervised learned Gaussian Mixture Model. | NA | SCC, PCC, MAE, RMSE | LOSO |
| [ | Intensities of seven emotions: neutral, anger, disgust, fear, happiness, sadness, and surprise; Mean grayscale of the face. | Mean intensity of the emotions; Number of transitions between emotions; Standard deviation of mean grayscale. | NA | ACC | LOSO |
| [ | Normalized intensities of ten AUs and smile. | Two stacked GRUs were used to extract clip-level (15s segments of the videos) and patient-level representations. | NA | F1, ACC | LOSO |
| [ | Head location of each subject relative to the camera. | Average head movement. | t-test | None | |
| Studies with IR videos | |||||
| [ | Identities of listener and speaker; Head and hand locations of each subject. | Head and hand movement rate; Percentage of time spent in speaking, nodding/gesture as listener of the patients, patients’ partners and controls. | t-test | QICC, SE | None |
| [ | 3D locations of the facial markers. | Average value of distances traveled by markers during shifts from a neutral position. | NA | NA | NA |
| Studies with depth camera videos | |||||
| [ | Output of the five SVMs trained for classifying five expressions (happiness, sadness, anger, fear, and neutral). | Outputs of the SVMs were modeled as the observed variable in an HMM, where the hidden variable indicates emotions. Four features were used: the average posterior probabilities of the intended and neutral emotions, and the occurrence frequencies of the appropriate and neutral expressions. | NA (method paper) | NA | NA |
| [ | Activity level of each AU. | Activation Ratio: Fraction of segment during which the AU was activated; Activation Level: Mean intensity of AU activation; Activation Length: Number of frames that the AU activation lasted; Change Ratio: fraction of the period of AU activation when there was a change in activity level; Fast Change Ratio: fraction of fast changes in activation level. | One-way ANOVA, t-test | AUC, PCC | LOSO |
| [ | Activity level of each AU. | Richness: how many prototype expressions appeared; Typicality: how similar they were to the prototype. Distribution: which expressions were more prevalent. | Bonferroni correction, one-way ANOVA, t-test | PCC, AUC | LOSO |
| [ | Activity level of each AU. | Flatness Measures: the sum of the variance in facial activity for similarly/differently rated photos; Congruity Measures: the ratio between the sum of the variance within similarly rated photos and the sum of total variance; Inappropriateness measure: the sum of the squared difference between the average facial activity of all controls and each subject’s individual facial activities. | t-test | PCC, Cohen’s d | LOSO |
ACC = accuracy, Adj. R2 = adjusted R square, ANOVA = analysis of variance, AU = action unit, GRU = gated recurrent units, HMM = hidden Markov model, ISOMAP = isometric mapping, LOSO = leave-one-(subject)-out, MAE = mean absolute error, PCC = Pearson correlation coefficient, QICC = corrected quasi-likelihood under independence model criterion, RMSE = root mean square error, ROI = region of interest, SCC = Spearman rank-order correlation coefficient, SE = standard error. “NA” in the Statistical Tests column indicates that no statistical test was used or was clearly reported. “NA” in the validation column indicates that no classification or regression was conducted in the study, hence the validation was not needed.
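To illustrate the AU-based subject-level features in the table above (Activation Ratio, Activation Level, Activation Length), here is a minimal sketch computing them from a single AU's frame-level intensity series. The function name and the activation threshold are our assumptions for illustration, not the reviewed authors' code:

```python
import numpy as np

def au_summary(intensity, active_thresh=0.5):
    """Subject-level summaries of one AU's frame-level intensity series,
    in the spirit of the activation features described above."""
    intensity = np.asarray(intensity, dtype=float)
    active = intensity > active_thresh
    # Activation Ratio: fraction of frames during which the AU is active
    activation_ratio = active.mean()
    # Activation Level: mean intensity over the active frames
    activation_level = intensity[active].mean() if active.any() else 0.0
    # Activation Length: mean run length of consecutive active frames
    runs, run = [], 0
    for a in active:
        if a:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    activation_length = float(np.mean(runs)) if runs else 0.0
    return {"activation_ratio": activation_ratio,
            "activation_level": activation_level,
            "activation_length": activation_length}
```

The same per-AU summaries could then be concatenated across AUs to form the subject-level feature vectors used for prediction.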
Fig 2. Visualization of the data pipelines.
Different combinations of the methods in each section were adopted in different studies. Pre-processing and recognition methods used in commercial software were not included due to the lack of clarity about which algorithms they use. The face used in the illustration was an average face generated from http://faceresearch.org/demos/average, which is available open access (CC-BY-4.0) [50]. The icons used in the figure are available open access (CC-BY) from the NounProject.com. 2D/3D: two-/three-dimensional, ANOVA: analysis of variance, AUROC: area under the receiver operating characteristic, CNN: convolutional neural network, IR: infrared, ISOMAP: isometric mapping, KNN: k-nearest neighbor, LDA: linear discriminant analysis, ML: machine learning, RNN: recurrent neural network, ROI: region of interest, SVM: support vector machine.
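Most of the predictive studies reviewed here validate their models with leave-one-subject-out (LOSO) cross-validation, so that no subject's frames appear in both the training and test folds. A minimal sketch of generating such splits (a hypothetical helper, not taken from any of the reviewed papers):

```python
import numpy as np

def loso_splits(subject_ids):
    """Leave-one-subject-out splits: for each unique subject, yield
    (train_idx, test_idx) index arrays with all of that subject's
    samples held out for testing."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.where(subject_ids == s)[0]
        train = np.where(subject_ids != s)[0]
        yield train, test
```

Splitting by subject rather than by frame matters because adjacent frames from the same interview are highly correlated; frame-level splits would leak identity information and inflate reported accuracy.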