
An evaluation to determine if reading the mind in the eyes scores can be improved through training.

Jacklin Hope Stonewall¹, Kaitlyn M Ouverson², Andrina Helgerson¹, Stephen B Gilbert¹, Michael C Dorneich¹.

Abstract

The Reading the Mind in the Eyes Test (RMET) has received attention due to its correlation with collective intelligence. If the RMET is a marker of collective intelligence, training to improve RMET could result in better teamwork, whether for human-human or human-AI (artificial intelligence) in composition. While training on related skills has proven effective in the literature, RMET training has not been studied. This research evaluates the development of RMET training, testing the impact of two training conditions (Naturalistic Training and Repeated RMET Practice) compared to a control. There were no significant differences in RMET scores due to training, but speed of response was positively correlated to RMET score for high-scoring participants. Both management professionals and AI creators looking to cultivate team skill through the application of the RMET may need to reconsider their tool selection.


Year: 2022 PMID: 35482660 PMCID: PMC9049333 DOI: 10.1371/journal.pone.0267579

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Imagine you find some delicious cookies in the kitchen pantry. You hide the cookies under the sink so your sibling does not eat them. When your sibling enters the kitchen, where will they look for the cookies? If you say in the pantry, you have demonstrated Theory of Mind (ToM). In other words, ToM allows one to think through the example from the sibling’s perspective to determine where they would look for the sweets. Theory of Mind is defined as the ability to attribute mental states to ourselves and others [1, 2]. Humans use this unique ability to predict or understand another’s behavior [3]. Young children typically have not yet developed ToM and will expect a sibling to look under the sink, as they assume the sibling holds the same knowledge they do within their minds [4, 5]. The term “ToM” is sometimes referred to as “Mentalizing,” or (the more magical-sounding) “mindreading” [6]; however, the authors in the present paper will use ‘ToM’ as an umbrella term referring to all three, simultaneously. Many social and interpersonal skills associated with ToM have been linked to team success [7]. As more classrooms and workplaces acknowledge the benefits of better learning outcomes through group and team-work, and as enterprises continue to pursue human-AI (artificial intelligence) teaming, efforts to understand factors affecting these collaborations have also gained momentum. Research shows that increased ability to understand another’s feelings or perceptions aids in the ability to manage social situations that are critical to successful teamwork [8, 9]. In human-AI interactions the machine’s ability to attribute mental states to others greatly enhances the quality of the interaction [10]. As such, AI researchers have turned to ToM as a way to provide automation with the ability to process human facial expression data in real time [10]. Likewise, the ability of humans to utilize ToM to manage human-AI interactions also needs to be studied. 
This requires the capability to measure ToM and to understand if a ToM-based ability to succeed in cooperative tasks can be increased through training. ToM may be measured in various ways, including the Reading the Mind in the Eyes Test (RMET). This test of social sensitivity evaluates individuals’ ability to “tune in” to another’s mental state by looking at images of the eye region of the face and matching the expression they see to the closest-matching one-word description [11]. In this way, participants demonstrate their ability to attribute a mental state to others. Scores on the RMET correlate positively with collective intelligence, or an individual’s ability to succeed in a variety of cooperative tasks [9, 12]. However, it is unknown whether an improvement on RMET scores leads to an improvement in collective intelligence. The first step in investigating if this is the case is to establish whether or not it is possible to improve RMET scores through training. While training on ToM is commonly used in the treatment of autism [13], training designed specifically to increase individual ability to glean mental state information from the faces of others has not been developed. By developing better “face-reading” skills, it is hoped that individuals (human or otherwise) could thus be trained to be better teammates. Therefore, the aim of this work was to develop and test methods of training individuals to perform better on the RMET. The RMET is a widely applied and accepted test for measuring ToM [14, 15]. The RMET presents participants with 36 photographs of the eye regions of given individuals. Paired with each photograph are four descriptor words from which the participant must select the one best corresponding with the emotion exhibited in the eyes. Participants score one point for every correct answer, with a maximum score of 36. Evaluations against other tests for measuring ToM have shown that RMET has fair reliability [14, 15]. 
There are a few known limitations of the RMET: neurodiversity, cultural differences in emotion recognition and processing, and knowledge of the English language. First, and perhaps most clearly, the RMET has demonstrated a strong ability to identify individuals with different social intelligence but otherwise typical cognitive intelligence; examples include neurodiverse individuals, such as those with high-functioning autism or Asperger syndrome [11, 14]. This was the purpose for which the RMET was originally developed [11]. Second, while the basic emotions (happiness, sadness, surprise, fear, disgust, anger, and contempt) are accepted as universal by some [16-18], research by Jack et al. [19] countered this belief by demonstrating a cultural influence on how individuals express different emotions. Jack and colleagues [19] found that for East Asian individuals, whose facial expressions are governed by different, culture-specific rules for how emotions should be displayed, Ekman’s basic emotions were not consistently identified, and other emotions fundamental to the culture were not included in these “basic emotions.” Further, the RMET features exclusively light-skinned faces, which could affect the ability of non-white participants to recognize the emotion displayed in the stimuli, reminiscent of the “other-race effect,” in which individuals have difficulty recognizing faces of individuals whose race differs from their own [20]. Lastly, the RMET involves the association of facial expressions with emotion words, which may be difficult for participants whose native language is not English. The difficulty of completing the RMET outside of one’s native language and the utility of offering the test in multiple languages are evidenced by the translation of the test into French [21] and Spanish, among others [22].
Various studies have been conducted on increasing ToM through training, including that of Kidd and Castano [23], which identified a correlation between reading literary fiction and improved RMET scores. Participants in Kidd and Castano’s [23] experiments read one brief text and showed only short-term improvements to RMET scores, a result that is debated in the literature [24-26]. Studies which have looked into establishing long-term improvements have focused primarily on children with more-extensive (compared to Kidd and Castano’s work [23]) literature-based training [27, 28]. These studies found that by discussing stories filled with mental-state vocabulary, children’s ability to understand this vocabulary and to interpret the emotions of others was improved. However, limited studies have been performed on ToM training for adults [29, 30], the purpose for which the RMET was created [6]. Further, understanding how RMET may or may not train ToM may help to establish deep-learning strategies for teaching AI to infer mental states [10].
The study had four main hypotheses. ToM training has been shown to be helpful for those with diminished ToM capacity. As the RMET is a Theory of Mind measure [11], a Naturalistic RMET Training was developed based on traditional ToM training. By basing Naturalistic RMET Training on ToM training, it is hypothesized that individuals who receive training will improve their RMET scores.
H1: Naturalistic RMET Training will result in an RMET score increase compared to No Training.
To ensure that the effect of the Naturalistic Training is more than any changes in RMET score due to familiarity with the content of the test, a portion of participants were placed in the Repeated RMET Training group, where they were instructed to take the test multiple times without feedback. The Naturalistic Training is hypothesized to result in a larger score increase by teaching participants to read emotions, rather than just expecting them to correct their ability without feedback.
H2: Naturalistic RMET Training will result in a greater increase in score than Repeated RMET Training.
Participants with higher initial RME ability are hypothesized to benefit less from training than those whose initial RME ability was low.
H3: Low initial RMET performers will see greater improvements in RMET score from training than those who initially scored high on the RMET.
We expect that RMET score will be higher for individuals who answer the questions faster, i.e., in less time. This hypothesis was first published by Tracy and Robins [31] as part of their second study in that manuscript. The present study does not restrict response time or induce cognitive load, instead allowing participants to deliberate and observing whether, within an environment that encourages deliberation, higher average response times per question were related to lower RMET score.
H4: Time per question is negatively correlated to RMET score.

Experimental method

The objective of the study was to determine if training can impact RMET scores. This study was approved as exempt by the Institutional Review Board of Iowa State University (#18–075). Electronic informed consent was obtained from all participants.

Participants

The study included 429 participants (307 women, 117 men, 3 non-binary) recruited from a public university and social media. Participants averaged 30.4 years of age (range 18–74). English was the most comfortable language for 76% of participants and 73% of participants identified their native country as the United States. In terms of sexual orientation, 90.4% of participants identified as heterosexual or straight, 0.7% identified as lesbian, 1.2% identified as gay, 3.5% identified as bisexual, 2.3% identified as an orientation not listed, and 1.9% chose not to respond. For completing the study, participants were given a chance to win one of three $99 Amazon e-gift certificates.

Experimental design and procedure

The experiment was a between-subjects design in the form of an online survey via Qualtrics. All participants were instructed to complete the experiment on a laptop or desktop computer and the use of smartphones or tablets was discouraged. Participation in the experiment lasted approximately 45 minutes. The specifics of the procedure are discussed below. Once participants had given electronic consent and verified their age, they completed the Pre-training RMET (henceforth referred to as the Pre-RMET). After completing the Pre-RMET, each participant was randomly assigned to one of three conditions, Naturalistic Training, Repeated Training, or No Training, using Qualtrics’s randomization function. Time in the assigned condition was approximately 20 minutes for all conditions. In the “Repeated Training” condition, participants repeated the RMET three times without feedback. In the “Naturalistic Training” condition, participants were guided through the training described below, which was designed to improve their ability to read faces. In the “No Training” condition, participants were given unrelated distractor tasks in the form of visual puzzles and videos to keep them occupied for the same amount of time taken in other training groups. The distractor tasks (I-spy puzzle, spot-the-difference puzzle, videos to watch) were specifically chosen to be as unrelated to the test as possible yet require visual processing similar to the RMET. After each distractor video, the participant completed a quiz on the content to allow for control for participant inattention. All participants then completed the Post-training RMET (Post-RMET) and a demographics survey. Participants in the Naturalistic Training condition were introduced to the concept of basic emotions by watching a video on decoding facial expressions [32] and reading an article which explains those facial expressions in more detail [33]. 
Drawing on the Theory of Mind training described by Adibsereshki, Nesayan, Asadi Gandomani, & Karimlou [34], in which participants were given feedback on how well they sorted pictures and drawings of facial expressions into emotion categories, all of the Naturalistic Training media were followed by quizzes with correct/incorrect feedback. Training on complex emotions featured two Pixar short films, Lifted [35] and Partly Cloudy [36]. After watching these short films, participants answered questions about the animated characters’ emotions based on still images from the films. Naturalistic Training ended with a sample RMET-like quiz that used dynamic images, or video clips, of an eye-region of faces expressing emotions as stimuli, rather than static images [37], and feedback was given on the answers to each question. The next section briefly describes the systematic methods used to develop these training stimuli, while more detail can be found in Ouverson, Stonewall, Gilbert and Dorneich [38].

Naturalistic training stimuli development

The objective of the Naturalistic Training (previously referred to as Strategic Training; [38]) was to give participants practice interpreting facial cues, patterns, and expressions so that they could ultimately interpret the emotional state of the person. To develop the materials, a set of training stimuli different from the 36 faces in the RMET was needed. As detailed by Ouverson et al. [38], 92 complex emotion answer choices from the original RMET were assigned to the Pixar stills. First, the answer choices from the RMET were grouped into seven categories corresponding to basic emotions: Anger, Sadness, Surprise, Happiness, Disgust, Fear, and Contempt. Each category had approximately 20 complex emotion answer choices listed (some words appeared in multiple categories). Stills were also uniquely sorted into the seven basic emotion categories. A survey then randomly paired each Pixar still with four of the roughly 20 RMET answer choices in its basic emotion category. A sample of participants separate from this paper’s primary study (n = 136) rated how well the choices fit with the emotion seen (1 = “Does Not Fit at All” and 7 = “Fits Very Well”). The answer choice with the highest mean score was chosen as the “correct” answer for the still, while the three answer choices with the lowest mean scores were chosen as the distractor choices. The results from this initial survey were used to create the Naturalistic Training on reading complex emotions.

Independent variable

The study had one independent variable, Training Type, with three levels: Repeated Training (the RMET was taken three times without feedback), Naturalistic Training (participants were trained using different materials with feedback), and No Training (participants completed distractor tasks in lieu of training).

Quasi-independent variable

A quasi-independent variable was Initial RME Ability (low, high). Initial RME ability was calculated using the Pre-RMET score. If a participant scored at or below the 25th percentile of collected data (i.e., the observed first quartile, a score of 25 or less), they were considered to have low RME ability. Conversely, if the participant achieved a score of at least 30 (75th percentile), they were considered to have high RME ability. In order to observe the effects of high and low, but not moderate Initial RME ability, the middle quartiles were not included in analysis.
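The quartile split described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' analysis code: the cutoffs of 25 and 30 come from the text, while the function names and the use of `None` for the excluded middle quartiles are assumptions made for the example.

```python
from typing import Optional

# Cutoffs reported in the text (observed quartiles of the Pre-RMET scores).
LOW_CUTOFF = 25   # at or below the 25th percentile -> low Initial RME ability
HIGH_CUTOFF = 30  # at or above the 75th percentile -> high Initial RME ability
MAX_SCORE = 36    # one point per RMET item


def initial_rme_ability(pre_rmet_score: int) -> Optional[str]:
    """Return 'low', 'high', or None for the middle quartiles (hypothetical helper)."""
    if not 0 <= pre_rmet_score <= MAX_SCORE:
        raise ValueError("RMET scores range from 0 to 36")
    if pre_rmet_score <= LOW_CUTOFF:
        return "low"
    if pre_rmet_score >= HIGH_CUTOFF:
        return "high"
    return None  # middle quartiles were not included in this analysis


def change_in_rmet(pre_score: int, post_score: int) -> int:
    """Change score: final (Post-RMET) minus initial (Pre-RMET) score."""
    return post_score - pre_score
```

Returning `None` rather than a third label makes it easy to drop the middle quartiles with a simple filter before running the high-versus-low comparison.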

Dependent variables

The study included three dependent variables: Post-RMET score (out of 36), Change in RMET score (range of -36 to +36), and Time per Question (in seconds; measured for each RMET question). The Post-RMET was replicated in Qualtrics after training using the same images and choices as the pre-training RMET and scored out of 36. Change in RMET score was calculated during analysis by subtracting the initial RMET score from the final RMET score and used to delineate “low-scoring” and “high-scoring” participants. Because participants are introduced to manipulations between the Pre- and Post-RMET, the RMET scores compared in most analyses are the Pre-RMET scores.

Data analysis plan

After assumptions of normality and homogeneity of variance were met, ANOVAs and Tukey’s Honestly Significant Difference (HSD) post-hoc tests were used to analyze the data. Results at the p < .05 level are reported as statistically significant, while marginal significance is assigned to those at the p < .10 level [39]. Kendall’s tau-b was calculated to assess the relationship between Time per Question (a continuous variable) and RMET score (an ordinal variable) for both the Pre- and Post-RMETs. Cohen’s d was used for effect size. Cohen [40] indicated that, when interpreting effect sizes, 0.2 ≤ d < 0.5 showed a small effect, 0.5 ≤ d < 0.8 could be considered a medium-sized effect, and d ≥ 0.8 showed a large effect, an interpretation the authors adopt in the present work.
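As a rough illustration of the effect-size convention adopted above, the following sketch computes Cohen's d for two groups and labels it with the thresholds from Cohen [40]. The pooled-standard-deviation form of d is an assumption here, since the text does not state which variant was used, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def interpret_d(d):
    """Label an effect size using the thresholds quoted in the text."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"  # the text gives no label for d < 0.2
```

For example, `interpret_d(cohens_d(naturalistic_changes, control_changes))` would label the size of a difference in change scores between two conditions, where both lists are hypothetical inputs.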

Data integrity

To ensure maximum data integrity, the researchers removed data that failed to meet a set of criteria designed to ensure that the remaining participants were ones who were assumed to have treated the survey seriously:
Participants must answer at least 75% of questions on each scale.
Participants must spend at least 940 seconds (just under 16 minutes) on the entire survey.
Participants must stay on the Naturalistic Training videos at least long enough to see all important content (skipping the video credits will not result in removal).
Participants must score at least six of eight points on the distractor video quizzes (random guessing resulted in a score of four, so this higher threshold ensured attention).
Participants must take at least half the time that the researcher in charge of choosing the puzzles took to complete those distractor activities (considered the minimum time).
After 100 participants were excluded via the criteria above, the Naturalistic Training condition contained 118 participants, the Repeated Training condition contained 145 participants, and the No Training/Control condition contained 166 participants. Demographics by condition are presented in Table 1. The randomization resulted in groups that have roughly similar representation across demographic groups in each condition.
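The exclusion criteria in this section amount to a per-participant filter, which could be sketched as follows. The record fields, condition labels, and function name are hypothetical and chosen only for illustration; the thresholds (75%, 940 seconds, six of eight points, half the researcher's puzzle time) are the ones stated in the text, and the assumption that the video and distractor checks apply only to the conditions that saw those materials is an inference from the procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyRecord:
    """Hypothetical per-participant summary; field names are assumptions."""
    condition: str                      # "naturalistic", "repeated", or "control"
    min_scale_completion: float         # lowest fraction answered on any scale
    total_seconds: float                # time spent on the entire survey
    saw_all_video_content: Optional[bool] = None  # Naturalistic Training only
    distractor_quiz_score: Optional[int] = None   # control only, out of 8
    puzzle_seconds: Optional[float] = None        # control only


def passes_integrity_checks(record: SurveyRecord, researcher_puzzle_seconds: float) -> bool:
    """Apply the exclusion criteria listed in the text to one record."""
    if record.min_scale_completion < 0.75:
        return False
    if record.total_seconds < 940:
        return False
    if record.condition == "naturalistic" and not record.saw_all_video_content:
        return False
    if record.condition == "control":
        if record.distractor_quiz_score is None or record.distractor_quiz_score < 6:
            return False
        if record.puzzle_seconds is None or record.puzzle_seconds < researcher_puzzle_seconds / 2:
            return False
    return True
```

Keeping each criterion as its own early return makes it straightforward to count how many participants each rule excludes.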

Table 1

Demographics of participants per condition.

Training Group	Gender (count)	Median Age (IQR)	Native English Speakers (count)	Race/Ethnicity (count)	US-Natives (count)
Control	F(118) M(48)	22 (14)	Native(155), Non-Native(11)	White(140), Asian(10), Hispanic/Latino(7), Black/African American(5), Indigenous American/Native Hawaiian(1), Other(3)	Native(149), Non-native(17)
Repeated RMET	F(107) M(38)	23 (15.8)	Native(135), Non-Native(10)	White(118), Asian(5), Hispanic/Latino(10), Black/African American(6), Indigenous American/Native Hawaiian(2), Other(4)	Native(135), Non-native(10)
Naturalistic	F(83) M(35)	28 (22)	Native(108), Non-Native(10)	White(98), Asian(8), Hispanic/Latino(4), Black/African American(4), Indigenous American/Native Hawaiian(0), Other(4)	Native(99), Non-native(19)

Results

RMET training

Fig 1 illustrates the pre- and post-RMET score by condition. The mean Pre-RMET score across all conditions was 27.6 (SD = 3.5) while the mean Post-RMET score across all conditions was 27.8 (SD = 4.2). There were no significant differences between mean Pre-RMET score and mean Post-RMET score for any of the training conditions.

Fig 1

Pre- and Post-RMET score by condition.

Bars represent one standard deviation from the mean.


Change in RMET performance for high-scorers vs. low-scorers

A two-way ANOVA was run to assess the change in RMET performance for individuals who scored in the top quartile on the initial RMET (M = 31.7, SD = 1.3) and those who scored in the bottom quartile on the initial RMET (M = 23.0, SD = 2.0). The test revealed a statistically significant difference in change in scores between high (M = -1.45, SD = 2.9) and low (M = 1.7, SD = 4.1) initial RMET ability (F(1, 232) = 48.38, p < .001, d = 0.17) and a marginally significant difference between training conditions (F(2, 232) = 2.74, p = .07). A post-hoc Tukey HSD test showed that the Naturalistic Training group’s change in RME score was marginally significantly greater than the Control group’s change (p = .07, 95% CI (-2.60, 0.07)). The interaction term was not significant. Fig 2 visualizes these differences.

Fig 2

Difference in mean change in RMET score by condition and initial RME ability.

Error bars represent 95% Confidence Intervals.

Two ANOVAs were used to inspect the marginal effects of training condition at each level of Initial RME ability, with significance evaluated at an alpha level of .025 to adjust for the use of multiple tests. For those participants who presented with Low Initial RME ability (F(1, 108) = 2.418, p = .12; M = 1.7, SD = 3.4) and those who scored initially high (High Initial RME ability) (F(1, 128) = 0.59, p = .59; M = -1.45, SD = 2.9), the change in RMET scores was not statistically significant.

Time per question by RMET score

A Kendall’s tau-b correlation was used to evaluate the relationship between RMET Score and average time per RMET question, in seconds (Mdn = 9.5; IQR = 3.5–15.5). While RMET Score and time per question were not significantly associated in the pretest (τb = -0.001, p = .97), there was a marginally significant, weak, and negative association in the post-test (τb = -.06, p = .07). Examining this association for both low- and high-skill RME participants separately, no association was found between overall RMET Score and overall average time per question for low-skill participants (τb = 0.002, p = .98), but a marginally significant, weak, and negative association was uncovered between overall RMET Score and overall average time per question (τb = -.11, p = .09) for high-skill participants.
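For readers unfamiliar with the statistic, Kendall's tau-b can be computed directly from concordant and discordant pairs, with a correction for ties in either variable. The sketch below is a plain-Python illustration of that definition, not the authors' analysis code; it returns only the coefficient and does not compute the p-values reported above.

```python
import math
from collections import Counter
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D count concordant and discordant pairs, n0 is the number of pairs,
    and n1 and n2 count pairs tied on x and on y respectively.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie on x or y; it counts toward neither
    n0 = len(x) * (len(x) - 1) // 2
    tied_x = sum(t * (t - 1) // 2 for t in Counter(x).values())
    tied_y = sum(t * (t - 1) // 2 for t in Counter(y).values())
    denominator = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denominator
```

The tie correction matters here because RMET scores are integers out of 36, so many participants share the same score. The pairwise loop is quadratic in the number of participants, which is acceptable for samples of a few hundred.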

Discussion

Hypotheses H1 and H2 were not supported. There were no significant changes between Pre- and Post-RMET for any of the three training conditions. The lack of effect, specifically for the Naturalistic Training condition, could imply that it is difficult to train individuals to perform better on the RMET. This could imply that the skills needed to succeed at the RMET are linked to those social skills. The Naturalistic Training used in this study was based upon standard ToM training [34]. When used to help autistic individuals, ToM training is administered in person and by a professional. The method of delivery used in this experiment could have therefore impaired the effectiveness of the training. More work is necessary to decide conclusively whether any type of training can improve an individual’s ability to “read faces”. Similarly, the ToM training that informed the Naturalistic Training is generally only administered to children. While some studies have indicated that standard ToM training can be effective for older adults [29, 30], little work has examined whether ToM training can be effective for younger to middle-aged adults. Given the average age of the participants in the Naturalistic Training group (M = 32.6 years) and the basis for the Naturalistic Training, the training may not have been effective for the population studied. Hypothesis H3 was not supported. While those with low initial RME ability saw changes which approached an acceptable level of significance, the result was not statistically significant. This could indicate that the training is simply not effective for people who are already good at Reading the Mind in the Eyes, or it could be evidence of regression to the mean, a statistical phenomenon showing that high performers regress toward the mean or average while low performers will more typically see performance gains relative to their starting measurement [41]. 
Regression to the mean, in this context, speaks to the observation that scores for the top 25% of performers were on average about one-and-a-half points lower (1.45) in the post-RMET, while the bottom 25% saw net gains of nearly two points (1.7), bringing each group closer to the average than where they started.
Hypothesis H4 was partially supported. All explored associations between average time per question and RMET score were negative, except for the low-skill individuals. While no results reached true statistical significance, the marginally significant results suggest a need for further testing. The data possibly suggest that emotion-state reading is a function of “fast-thinking.” In addition, the lack of a result for the low-skill participants, when comparing average response speed and RMET score, suggests that low-skill individuals may not have the proper schemata to classify the emotions in a fast manner. Therefore, no association between speed and score would be expected.
The researchers note a few limitations of the current research. First, because the survey was distributed over the Internet, the experiment was not as tightly controlled as one distributed in a lab setting. Additionally, participants in Naturalistic Training could pause the videos. In sum, despite the researchers’ best efforts, completion times and medium of interaction may have varied from participant to participant. Second, the RMET describes specific emotion states using words that were potentially problematic for non-native English speakers. The nuanced differences between, say, “despondent” and “dispirited” led to challenges during Naturalistic Training development. Lastly, the sample may have contained more naturally good mind-in-the-eyes readers than the general population. Cursory analysis revealed that RMET performance in this study’s sample (Mean = 27.6, SD = 3.5) differed significantly from that of the general population reported by Baron-Cohen et al. [11] (Mean = 26.2, SD = 3.6; t(132.4) = 2.82, p = .01), but not from that of their student population (Mean = 28.0, SD = 3.5; t(113.4) = -0.75, p = .30).
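The regression-to-the-mean explanation raised in this discussion can be illustrated with a small simulation in which no training occurs at all. This is a hedged sketch, not a model fitted to the study's data: the ability and noise distributions, the sample size, and the function name are all assumptions chosen for illustration, and scores are left as unbounded real numbers rather than integers out of 36.

```python
import random
from statistics import mean


def simulate_quartile_changes(n=2000, seed=0):
    """Simulate pre/post scores that differ only by measurement noise.

    Returns the mean change (post minus pre) for the bottom and top
    quartiles of the simulated pre-test scores.
    """
    rng = random.Random(seed)
    pre, post = [], []
    for _ in range(n):
        ability = rng.gauss(27.6, 2.5)           # stable "true" ability (assumed)
        pre.append(ability + rng.gauss(0, 2.5))   # pre-test = ability + noise
        post.append(ability + rng.gauss(0, 2.5))  # post-test = ability + new noise
    ranked = sorted(range(n), key=lambda i: pre[i])
    quarter = n // 4
    bottom, top = ranked[:quarter], ranked[-quarter:]

    def mean_change(indices):
        return mean(post[i] - pre[i] for i in indices)

    return mean_change(bottom), mean_change(top)


bottom_change, top_change = simulate_quartile_changes()
print(f"bottom quartile mean change: {bottom_change:+.2f}")
print(f"top quartile mean change: {top_change:+.2f}")
```

Because participants are grouped by a noisy pre-test score, the simulated bottom quartile gains and the top quartile loses on average, even though nothing changed between the two measurements. The same pattern in real data is therefore not by itself evidence of a training effect.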

Conclusion

The aim of this work was to develop and test methods of training individuals to perform better on the RMET. The design of the training was informed by standard Theory of Mind (ToM) training often used as a way to help autistic people address some of the more difficult components of their neurological difference. As ToM has been recognized as a pivotal component of creating more human-friendly AI, RMET and other methods of training ToM seem like logical sources of training data or strategies. The results of the study indicated that there was no effect of this training on RMET score, but the training’s effectiveness may have been impacted by its delivery mechanism. Overall, the training was not shown to be effective. Before using the RMET for evaluation or training, generalization issues must be addressed. Further, eye-region imagery alone may not be enough to teach ToM, whether to human or AI agents. In future work related to training development, the delivery of training could be in-person to better echo proven ToM training methods and may need to occur over a period of days to weeks. Especially in the case of training machine ToM, ethical considerations around the use of imagery of white faces, cultural differences in emotions, and other concerns must be addressed.

6 Oct 2021

PONE-D-21-26174

An Evaluation to Determine if Reading the Mind in the Eyes Scores Can Be Improved Through Training

Both Reviewers are quite critical with regard to the way the manuscript is written and I feel it would need some substantial restructuring before it is ready for publication. However, I also think there is scope for revision and therefore, would like to offer you the opportunity to resubmit the revised paper, if you feel you can address the points both Reviewers make.

Please submit your revised manuscript by Nov 20 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Magdalena Ewa Król, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.
3. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.
4. Please ensure that you refer to Figure 1 and 3 in your text as, if accepted, production will need this reference to link the reader to the figure.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: No
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: No ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: General comment In an online study, the authors tested how training “face based mindreading” in participants could help them performing better at the Reading the Mind in the Eyes Test (RMET). They included two control conditions: “repeated training” (performing the RMET 3 times in a row) and “distraction” (visuospatial tasks unrelated to mindreading). They also explored how sociodemographical variables such as self-declared gender, race, height, etc. influence the results, with the underlying hypothesis that marginalized groups (females, LGBTQ+ people, POC, etc.) 
could be better motivated to mindread faces in their environment. They found no effect of training. Some demographical variables showed the expected effect (gender, race, native language), but not others. I really appreciated the good English of the manuscript, and how the authors explored variations of cognitive skills based on social and cultural status. But I found the manuscript to be very confusing from a theoretical point of view. A lot of problematic assertions are backed up with dated literature, or sometimes are not even justified or referenced. There are some mistakes on how hypotheses and results are articulated. Also, it seems that there is no cohesion between hypotheses pertaining to training, and those pertaining to sociodemographical variables. At the end of the reading, I don’t see how the “collective intelligence” framework even relate to the present work. I don’t think the manuscript is suitable for publication in its present form. Major points 1. The references in the Introduction section are outdated. A lot of work has been done in the last decades to understand the steps of ToM development. Please add more recent references, and more nuance in some assertions (for example page 6: “Due to the proposed reliance on instinct in the process, we believe ToM is best attributed to System 1 thinking, theorized by Kahneman (2011).).” See also major point number 5 on a related matter. 2. Page 6: I don’t see how referencing Kahneman’s framework is useful to understand the experiment or the results. Same for believers related data page 7. Same for mood related data page 7. 3. Page 11: “Analogous to the concept of diminishing returns”. Please provide a definition and references to back up the concept of “diminishing returns”. 4. Page 12: “Because mental state interpretation is quick, automatic, and universal”. Whether mental state understanding is automatic, universal or quick is the subject of entire research programs. 
I don’t think these questions are elucidated to this day. Please provide references to back up this assertion, and develop a theoretical argument to justify it, or either drop or adjust the assertion. 5. Page 14: “while more detail can be found in Authors (2018).” Please refrain from citing unpublished data. I couldn’t find the referenced work on the internet. 6. Page 19: “One-way ANOVAs were used to inspect the simple main effects of training condition,”… I don’t see the point of this analysis. The previous ANOVA on change scores, with Training Type as factor, already showed a main effect of High Vs Low Scorers. Why then use three different ANOVAs for each type of training to just show again this effect for each training? If the authors want to explore the marginal effect, they should instead run two additional ANOVAs: one for the High Scorers with Training Type as factor, and the same for the Low Scorers. 7. Page 19: “As predicted, the effect of previous experience with the RMET on Pre-RMET score was significant, t(426) = -2.41, p = .008. (Figure 4a). »… I don’t see how this was predicted in any of the hypotheses listed at the end of the Introduction. Please add the corresponding hypothesis in the Introduction and justify it, or correct this section. What was predicted is less progression between pre and post training for participants with previous exposure to the RMET, not lower scores at pre-training (see H3a). Please also edit the Discussion section, page 24. “In support of H3A, participants with previous experience taking the RMET performed significantly worse than participants who had no previous experience with the RMET.” 8. The authors say in the Introduction section that empathy could be a mediating factor explaining why women (or hypothetically marginalized groups), could perform better on the RMET. Why not include a measure of empathy in the present experiment? Could the authors discuss this matter? 9. Conclusion section. 
This section usually contains a single paragraph outlining the main results of the study. Please summarize. 10. The “Related work” section seems not appropriate for a peer reviewed article. It presents data that are not relevant to the experiment. Please extract the relevant information from this section, explicitly link them to your experiment and hypotheses, and drop irrelevant literature. 11. Since the present design is across subjects, it is possible that the Qualtrics randomization led to potential biases. For example, it is possible that a majority of female participants was in the “Repetitive Training” condition, then leading to the effect of training being confounded with the gender factor. Could the authors provide the repartition of participants and their characteristics in each training category? Could they also discuss this potential bias in the Discussion section? Minor points 1. There is a typo in Baron-Cohen’s citations. There should be no space after “Baron-“. Please correct. 2. I don’t see the use of Table 1… Rather, a Table showing the characteristics of participants in each training group would be more useful. 3. Please move the Limitations and Assumptions section to the end of the paper. 4. Page 19: “Figure 2Error! Reference source not found.” Please correct. 5. Page 23: “, but a marginally significant, weak, and negative association was uncovered between overall RMET Score and overall average time per question”. I think that this part of the sentence refers to high skill participants, but this needs to be explicit. Please correct. Reviewer #2: 1. Yes 2. Yes 3. Yes 4. Several parts of the paper should be rewritten. The paper deals with an interesting topic. However, there are important issues that prevent from considering the paper ready for publication, at least in its actual form. The paper would benefit from a change of the structure. 
The introduction should be focused on the topic of the study, that is the training, and not on other irrelevant information such as benefits of higher Reading the Mind in the Eyes Ability or studies on theory of mind. I would suggest discussing in a critical way the training interventions reported in the literature. This could help the Authors explain the relevance and novelty of their study, at the moment it is not clear and defined. The method section needs to be more specific. Several information such the length of the training should be added. The procedure section is very important, and I do not think is well presented. This makes difficult the interpretation of the results. I do not think that the label “strategic training” is correct, I would suggest using something different that highlights the content of the intervention. The limitations of the study should be moved to the end of the discussion section and the aims should be presented in the introduction. I would suggest reducing the number of hypotheses. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Matias Baltazar Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 16 Dec 2021 We would like to thank the reviewers for their feedback and helpful comments to improve this work. They were extremely helpful and well targeted towards improving the paper. We assigned numbers to the comments so that we could refer back to them when different reviewers raised similar points. Editor [E.1] Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. [Author Response] We added the following statement: "This study was approved as exempt by the Institutional Review Board of Iowa State University (#18-075). Electronic informed consent was obtained from all participants." [E.2] Please ensure that you refer to Figure 1 and 3 in your text as, if accepted, production will need this reference to link the reader to the figure [Author Response] The citation of Figure 1 has been added to the text. Figure 3 was deleted to address later review comments. [E.3] Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. 
The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf [Author Response] All style requirements have been implemented in the manuscript. [E.4] Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. [Author Response] We have an Open Science Framework repository for this study’s data linked here: https://osf.io/bd4rp/ Reviewer #1 [R1.1] The references in the Introduction section are outdated. A lot of work has been done in the last decades to understand the steps of ToM development. Please add more recent references, and more nuance in some assertions (for example page 6: “Due to the proposed reliance on instinct in the process, we believe ToM is best attributed to System 1 thinking, theorized by Kahneman (2011).).” See also major point number 5 on a related matter. [Author Response] More recent references have been added to the introduction in the areas of training and the potential for training to be useful in creating better Human-AI collaboration. The assertion on page 6 has been removed as it was unnecessary when introducing ToM. Major point 5 is concerned with a potentially missing reference, Authors, 2018. The author names were redacted for blind review, but the paper has been published. [R1.2] Page 6: I don’t see how referencing Kahneman’s framework is useful to understand the experiment or the results. Same for believers related data page 7. Same for mood related data page 7. 
[Author Response] Referencing Kahneman's framework is not necessary when introducing ToM, the experiment, or the results. Therefore, the assertion on page 6 has been removed. As part of the effort to focus the manuscript on the training, we have edited the manuscript to no longer include mentions of the research on the impact of belief in a god and mood, and other demographic variables, on the ability to engage in mindreading, or reading the mind in the eyes. [R1.3] 3. Page 11: “Analogous to the concept of diminishing returns”. Please provide a definition and references to back up the concept of “diminishing returns”. [Author Response] The connection to "diminishing returns" was unnecessary in explaining the hypothesis. Therefore, the statement has been removed. [R1.4] 4. Page 12: “Because mental state interpretation is quick, automatic, and universal”. Whether mental state understanding is automatic, universal or quick is the subject of entire research programs. I don’t think these questions are elucidated to this day. Please provide references to back up this assertion, and develop a theoretical argument to justify it, or either drop or adjust the assertion. [Author Response] The assertion has been removed from the paper as it was unnecessary in explaining the inclusion of H4. [R1.5] 5. Page 14: “while more detail can be found in Authors (2018).” Please refrain from citing unpublished data. I couldn’t find the referenced work on the internet. [Author Response] This is a published paper, and was referenced as "Authors" for the blind review. It was published in the proceedings of HFES 2018. [R1.6] Page 19: “One-way ANOVAs were used to inspect the simple main effects of training condition,”… I don’t see the point of this analysis. The previous ANOVA on change scores, with Training Type as factor, already showed a main effect of High Vs Low Scorers. 
Why then use three different ANOVAs for each type of training to just show again this effect for each training? If the authors want to explore the marginal effect, they should instead run two additional ANOVAs: one for the High Scorers with Training Type as factor, and the same for the Low Scorers. [Author Response] Reviewer 1's suggestion to reduce the complexity of the results by running two separate ANOVAs to replace the three one-way ANOVAs is a clear improvement, and as such, we have redone our analysis to explore the marginal effects of score on training type and updated the discussion and results as necessary. [R1.7] Page 19: “As predicted, the effect of previous experience with the RMET on Pre-RMET score was significant, t(426) = -2.41, p = .008. (Figure 4a). »… I don’t see how this was predicted in any of the hypotheses listed at the end of the Introduction. Please add the corresponding hypothesis in the Introduction and justify it, or correct this section. What was predicted is less progression between pre and post training for participants with previous exposure to the RMET, not lower scores at pre-training (see H3a). Please also edit the Discussion section, page 24. “In support of H3A, participants with previous experience taking the RMET performed significantly worse than participants who had no previous experience with the RMET.” [Author Response] Reviewer 1 is correct that the results discussed on page 19 and visualized in Figure 4a are not related to H3A. This text was included mistakenly, and was meant to, instead, give more context to the findings (similar to the distribution of genders across conditions) and the (new) placement of the other demographic variable discussion. To correct this, as well as address Reviewer 2's comment about paring down the arguments to those which are essential, all of figure 4 has been removed. 
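As an illustration of the analysis structure discussed in [R1.6] (one one-way ANOVA per skill group, with Training Type as the factor), the following is a minimal sketch only. The function name `one_way_anova_f`, the dictionary layout, and all of the change scores are hypothetical and are not taken from the study's data or analysis scripts; the sketch computes only the F statistic and its degrees of freedom, not a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per level of the factor.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)       # number of factor levels (training conditions)
    n = len(all_scores)   # total number of observations

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical RMET change scores (post minus pre); NOT the study's data.
change_scores = {
    "high scorers": {
        "naturalistic": [-1, 0, -2, 1],
        "repeated": [0, -1, 1, 0],
        "control": [-2, -1, 0, -1],
    },
    "low scorers": {
        "naturalistic": [2, 3, 1, 4],
        "repeated": [3, 2, 4, 1],
        "control": [1, 2, 0, 3],
    },
}

# One ANOVA per skill group, with training condition as the single factor.
for skill, by_condition in change_scores.items():
    f_stat, df_b, df_w = one_way_anova_f(list(by_condition.values()))
    print(f"{skill}: F({df_b}, {df_w}) = {f_stat:.2f}")
```

Splitting by skill group first keeps each test focused on the question of interest (does training condition matter within that group?) rather than re-testing the High versus Low difference for every condition.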
[R1.8] The authors say in the Introduction section that empathy could be a mediating factor explaining why women (or hypothetically marginalized groups), could perform better on the RMET. Why not include a measure of empathy in the present experiment? Could the authors discuss this matter? [Author Response] Reviewer 2 suggested that we reduce the number of hypotheses and restructure the arguments of the introduction. As we did not measure Empathy, we have dropped hypothesis H4 and the related discussion of empathy as a mediating factor. [R1.9] Conclusion section. This section usually contains a single paragraph outlining the main results of the study. Please summarize. [Author Response] We have rewritten the Conclusion section to a single paragraph that summarizes the results and suggests future work. [R1.10] The “Related work” section seems not appropriate for a peer reviewed article. It presents data that are not relevant to the experiment. Please extract the relevant information from this section, explicitly link them to your experiment and hypotheses, and drop irrelevant literature. [Author Response] The Related Work section has been re-structured to focus on training. Other topics, such as the effect of demographics on RMET score, have been removed. [R1.11] 11. Since the present design is across subjects, it is possible that the Qualtrics randomization led to potential biases. For example, it is possible that a majority of female participants was in the “Repetitive Training” condition, then leading to the effect of training being confounded with the gender factor. Could the authors provide the repartition of participants and their characteristics in each training category? Could they also discuss this potential bias in the Discussion section? [Author Response] In the (newly named) Data Integrity section (under Experimental Methods), we have added a Table that contains a description of participant characteristics by condition (Table 1). 
In addition, we have added a statement indicating that there is a relative balance of demographic groups across conditions. [R1.12] 1. There is a typo in Baron-Cohen’s citations. There should be no space after “Baron-“. Please correct. [Author Response] Thank you for catching this. It has been corrected. [R1.13] 2. I don’t see the use of Table 1… Rather, a Table showing the characteristics of participants in each training group would be more useful. [Author Response] Thank you for this suggestion. We have added a table (Table 1) that gives counts by gender, race/ethnicity, native English speaker status, US-native status, and age for each condition. [R1.14] 3. Please move the Limitations and Assumptions section to the end of the paper. [Author Response] The Limitations and Assumptions section has been moved to the end of the Discussion section, just before the (summary) Conclusion. [R1.15] 4. Page 19: “Figure 2Error! Reference source not found.” Please correct. [Author Response] The Figure 2 citation has been corrected. [R1.16] 5. Page 23: “, but a marginally significant, weak, and negative association was uncovered between overall RMET Score and overall average time per question”. I think that this part of the sentence refers to high skill participants, but this needs to be explicit. Please correct. [Author Response] That is correct, the second part of the sentence refers to high skill participants. The sentence has been rewritten to read, "but a marginally significant, weak, and negative association was uncovered between overall RMET Score and overall average time per question (τb = -.11, p = .09) for high-skill participants." Reviewer #2 [R2.1] The paper would benefit from a change of the structure. The introduction should be focused on the topic of the study, that is the training, and not on other irrelevant information such as benefits of higher Reading the Mind in the Eyes Ability or studies on theory of mind. 
I would suggest discussing in a critical way the training interventions reported in the literature. This could help the Authors explain the relevance and novelty of their study, at the moment it is not clear and defined. [Author Response] We have rewritten the Introduction to focus on the central theme of the paper: can RMET be trained. We have added more discussion on previous work in training. The section on the benefits of higher RMET score has been removed from the literature review, as well as the material on demographics. [R2.2] The method section needs to be more specific. Several information such the length of the training should be added. [Author Response] We have added detail to the method section including: ethics statement, the length of the experiment overall, the length of each condition, limitations, and participant demographics by condition. [R2.3] The procedure section is very important, and I do not think is well presented. This makes difficult the interpretation of the results. [Author Response] The procedure section has been restructured to better convey at what point in the study participants were split into conditions and the differences among the three conditions. [R2.4] I do not think that the label “strategic training” is correct, I would suggest using something different that highlights the content of the intervention. [Author Response] We've changed the condition label to "naturalistic" to reference the trainings' focus on identifying emotions in naturalistic settings (i.e., observing human faces or anthropomorphic faces in more context, including the addition of a mouth or narrative arc). [R2.5] The limitations of the study should be moved to the end of the discussion section [Author Response] The limitations were moved to the end of the Discussion section. 
[R2.6] the aims should be presented in the introduction [Author Response] The aim of the study has been added to the last paragraph of the Introduction, which now reads, "The aim of this work was to develop and test methods of training individuals to perform better on the RMET." [R2.7] I would suggest reducing the number of hypotheses. [Author Response] Upon reviewing our hypotheses, we have removed H4. This information deserves its own fully-fledged study, and it is beyond the scope of our original question about training. Thus, because of the breadth of our study, the demographic variables are no longer included in the hypotheses, only as discussion points. We also have removed H3A, as this hypothesis distracted from the focus of the paper. Submitted filename: Response to Reviewers.docx 3 Feb 2022

PONE-D-21-26174R1

An Evaluation to Determine if Reading the Mind in the Eyes Scores Can Be Improved Through Training

PLOS ONE Dear Dr. Stonewall, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Thank you for carefully revising the manuscript based on the feedback received. One reviewer has raised additional or remaining concerns, which we feel should be addressed. Please see their suggestions below for enhancing the reproducibility and clarity of the study. Please submit your revised manuscript by Mar 19 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Hanna Landenmark Senior Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. 
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: (No Response) Reviewer #2: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. 
participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: General comment The authors really put effort into editing their manuscript. However, I think some points need clarification. Also, the structure of the paper is still unorthodox, which is a little confusing for readers. Major points 1. The authors have added sentences pertaining to human-human interaction or human to AI interaction. What is the rationale for this? Please provide a rationale, if possible with references, or drop this argument. 2. I am still not satisfied with the “Related Work Section”. It really hinders the interest of your work as it looks more like a patchwork of data rather than a well-constructed introduction. I never found such section in any published article I ever read. I think it would really be better if the authors just picked the relevant parts of this section to inject them in a more streamlined Introduction Section. 3. I don’t understand the rationale for H4. If the authors really follow the line of reasoning in the experiments by Tracy and Robins (2008), they should hypothesize just the opposite, or maybe that there is no correlation between RT and accuracy... 
I find the H4 hypothesis very confusing and counterintuitive. 4. Page 18: “the RMET has demonstrated a strong ability to identify individuals with impaired social intelligence but otherwise normal cognitive intelligence”. This phrasing could be interpreted as offensive for people with Autism Spectrum Disorder. Please use "typical" instead of "normal", or another similar word. Please use another expression than "impaired social intelligence". The neurodiversity movement is all about being considered as different people and not as impaired or incomplete beings. So please, if you mention this idea, think about the people involved and how they will interpret your phrasing. 5. Page 18: “There are a few known limitations of the RMET, etc.” and the two following paragraphs. These paragraphs are interesting but should be integrated in the Introduction section. Usually, in the Methods section, when the literature guides choices regarding the methods, references are briefly provided with a one sentence rationale. If the rationale requires a full paragraph, it should be fully developed in the Introduction, and then briefly reminded in the Methods, where the relevant procedure is described. 6. Page 19: “Lastly, The RMET involves the association of facial expressions to high-level emotion words which may be difficult for even native English speakers”. Please add at least one reference to back up what is said here. 7. Page 23: “One-way ANOVAs were used to inspect the simple marginal effects, etc.” I don't understand why the authors would want to explore marginal effects. Usually one would want to test for differences (at p<.05) and report marginal effects when they are found instead. 8. Page 25: “or it could be evidence of regression to the mean, as explained by Kahneman”. I don't understand this sentence. What is regression to the mean in this context? Reviewer #2: (No Response) ********** 7. 
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Matias Baltazar Reviewer #2: No

18 Mar 2022

Author Responses to Address Reviewer Comments
Paper Title: An Evaluation to Determine if Reading the Mind in the Eyes Scores Can Be Improved Through Training
Submitted to: PLOS One
Manuscript Number: PONE-D-21-26174

We would like to thank the reviewers for their feedback and helpful comments to improve this work. They were extremely helpful and well targeted towards improving the paper. We assigned numbers to the comments so that we can refer back to any of them when different reviewers raise similar comments.

Editor

[E.1] Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Author Response] We have reviewed the reference list to ensure it is complete and correct.

Reviewer #1

[R1.1] The authors have added sentences pertaining to human-human interaction or human to AI interaction. What is the rationale for this? Please provide a rationale, if possible with references, or drop this argument.

[Author Response] In recent years, researchers have been looking to guide the creation of AI agents which better approximate human social intelligence. One of the threads of this research leads to Theory of Mind, which underlies the Reading the Mind in the Eyes Test. The sentences pertaining to human-human and human-AI interaction were added to demonstrate the utility of ToM training. The sentences pertaining to human-AI interaction have been rewritten, and another reference [10] added: "In human-AI interactions, the machine’s ability to attribute mental states to others greatly enhances the quality of the interaction [10]. As such, AI researchers have turned to ToM as a way to provide automations with the ability to process human facial expression data in real time [10]".

[R1.2] I am still not satisfied with the “Related Work Section”. It really hinders the interest of your work as it looks more like a patchwork of data rather than a well-constructed introduction. I never found such section in any published article I ever read. I think it would really be better if the authors just picked the relevant parts of this section to inject them in a more streamlined Introduction Section.

[Author Response] The separate "Related Work" section has been removed. The relevant content from this section has been integrated into the Introduction. Additionally, information that was redundant with the content of the introduction was removed.

[R1.3] I don’t understand the rationale for H4. If the authors really follow the line of reasoning in the experiments by Tracy and Robins (2008), they should hypothesize just the opposite, or maybe that there is no correlation between RT and accuracy... I find the H4 hypothesis very confusing and counterintuitive.

[Author Response] Thank you for drawing this to our attention. We have altered the description of our reasoning to clarify that H4 is not meant to take Tracy and Robins’s line of reasoning and extend it, but to attempt to replicate or validate their findings in different contexts, in response to the ongoing "replication crisis" within the social sciences.

[R1.4] Page 18: “the RMET has demonstrated a strong ability to identify individuals with impaired social intelligence but otherwise normal cognitive intelligence”. This phrasing could be interpreted as offensive for people with Autism Spectrum Disorder. Please use "typical" instead of "normal", or another similar word. Please use another expression than "impaired social intelligence". The neurodiversity movement is all about being considered as different people and not as impaired or incomplete beings. So please, if you mention this idea, think about the people involved and how they will interpret your phrasing.

[Author Response] We have replaced "normal" with "typical" and "impaired social intelligence" with "different social intelligence".

[R1.5] Page 18: “There are a few known limitations of the RMET, etc.” and the two following paragraphs. These paragraphs are interesting but should be integrated in the Introduction section. Usually, in the Methods section, when the literature guides choices regarding the methods, references are briefly provided with a one sentence rationale. If the rationale requires a full paragraph, it should be fully developed in the Introduction, and then briefly reminded in the Methods, where the relevant procedure is described.

[Author Response] These paragraphs have been integrated into the Introduction section.

[R1.6] Page 19: “Lastly, The RMET involves the association of facial expressions to high-level emotion words which may be difficult for even native English speakers”. Please add at least one reference to back up what is said here.

[Author Response] This statement has been changed to reflect that taking the RMET in a language other than one's native language may affect the score. Additional references have been added: “Lastly, The RMET involves the association of facial expressions to emotion words which may be difficult for participants whose native language is not English. The difficulty of completing the RMET outside of one’s native language and the utility of offering the test in multiple languages is evidenced by the translation of the test into French [48] and Spanish among others [49].”

[R1.7] Page 23: “One-way ANOVAs were used to inspect the simple marginal effects, etc.” I don't understand why the authors would want to explore marginal effects. Usually one would want to test for differences (at p<.05) and report marginal effects when they are found instead.

[Author Response] This was a hold-over from our last iteration. We thank you for your insight. The statement has been changed to read “Two ANOVAs were used to inspect the marginal effects of training condition at each level of Initial RME ability, with significance evaluated at an alpha level of .025 to adjust for the use of multiple tests.”

[R1.8] “or it could be evidence of regression to the mean, as explained by Kahneman”. I don't understand this sentence. What is regression to the mean in this context?

[Author Response] We have added text to explain the statement within the context of our study: “This could indicate that the training is simply not effective for people who are already good at Reading the Mind in the Eyes, or it could be evidence of regression to the mean, a statistical phenomenon showing that high performers regress toward the mean or average while low performers will more typically see performance gains relative to their starting measurement [47]. Regression to the mean, in this context, speaks to the observation that average scores for the top 25% performers were on average one-and-a-half points lower (1.45) in the post-RMET, while the bottom 25% saw net gains of near two points (1.7), bringing each group closer to the average than where they started.”

Submitted filename: Response to Reviewers.docx
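As a hedged illustration of the regression-to-the-mean argument in the authors' response above, the following Python sketch simulates a pre-test and a post-test with no training effect at all. Every number in it (the mean score of 26, the standard deviations, the sample size) and the function name `simulate_regression_to_mean` are assumptions chosen for illustration; none of them are values or code from the study.

```python
import random
import statistics


def simulate_regression_to_mean(n=400, true_sd=3.0, noise_sd=3.0, seed=0):
    """Return mean pre-to-post change for the bottom and top quartiles.

    Each simulated participant has a fixed "true" ability. Pre and post
    scores are that ability plus independent measurement noise, so no
    training effect is present. Quartiles are defined by the pre score.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        ability = rng.gauss(26.0, true_sd)  # hypothetical mean score (assumption)
        pre = ability + rng.gauss(0.0, noise_sd)
        post = ability + rng.gauss(0.0, noise_sd)
        people.append((pre, post))

    # Rank participants by their initial (pre-test) score.
    people.sort(key=lambda p: p[0])
    quarter = n // 4
    bottom = people[:quarter]
    top = people[-quarter:]

    bottom_change = statistics.mean(post - pre for pre, post in bottom)
    top_change = statistics.mean(post - pre for pre, post in top)
    return bottom_change, top_change


bottom_change, top_change = simulate_regression_to_mean()
print(f"bottom 25% mean change: {bottom_change:+.2f}")
print(f"top 25% mean change:    {top_change:+.2f}")
```

Under these assumptions the bottom quartile shows a positive mean change and the top quartile a negative one, even though no simulated participant's ability changed. This is the pattern the authors attribute to regression to the mean, and it is why quartile-level gains and losses alone cannot establish a training effect.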
12 Apr 2022 An Evaluation to Determine if Reading the Mind in the Eyes Scores Can Be Improved Through Training PONE-D-21-26174R2 Dear Dr. Stonewall, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Thomas Suslow, Ph.D. Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. 
Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. 
You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: I thank the authors for the hard work addressing my concerns. I am satisfied with the responses and/or text edits regarding all my major and minor concerns. I apologize to the authors for my previous R1.7 point (regarding the exploration of a marginal effect with ANOVAs). I just wasn’t careful in my rereading of the manuscript and missed that your point was to explore a marginal effect shown in a previous analysis. I hope my useless point didn’t cost you too much time. I have just one minor point left. But I don’t think it is important enough to prevent the manuscript from being published, so I will let the authors and the editor decide whether it should be addressed or not. I don’t think it is necessary that I assess the revised manuscript myself. 1. Page 7, Hypothesis 4. I am sorry, but I think relevant theory is still lacking in order for readers to understand this hypothesis. I am not a big fan of this theory, but maybe cite Kahneman’s System 1/System 2 framework? As you suggest, maybe the “face reading” mechanisms involved in the RMET should conform to a System 1 (intuitive and fast) cognitive style, and not to a System 2 (slow and reflective) cognitive style? And if that is the case, we should expect associations between short reaction times and good accuracy? Reviewer #2: (No Response) ********** 7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: Yes: Matias Baltazar Reviewer #2: Yes: Elena Cavallini 19 Apr 2022 PONE-D-21-26174R2 An evaluation to determine if reading the mind in the eyes scores can be improved through training Dear Dr. Stonewall: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Professor Thomas Suslow Academic Editor PLOS ONE

23 in total

1. Evidence for a collective intelligence factor in the performance of human groups.

Authors: Anita Williams Woolley; Christopher F Chabris; Alex Pentland; Nada Hashmi; Thomas W Malone
Journal: Science Date: 2010-09-30 Impact factor: 47.728

2. Reading literary fiction improves theory of mind.

Authors: David Comer Kidd; Emanuele Castano
Journal: Science Date: 2013-10-03 Impact factor: 47.728

3. Social Cognition Psychometric Evaluation: Results of the Initial Psychometric Study.

Authors: Amy E Pinkham; David L Penn; Michael F Green; Philip D Harvey
Journal: Schizophr Bull Date: 2015-05-04 Impact factor: 9.306

4. Does reading a single passage of literary fiction really improve theory of mind? An attempt at replication.

Authors: Maria Eugenia Panero; Deena Skolnick Weisberg; Jessica Black; Thalia R Goldstein; Jennifer L Barnes; Hiram Brownell; Ellen Winner
Journal: J Pers Soc Psychol Date: 2016-09-19

5. Face-blind for other-race faces: Individual differences in other-race recognition impairments.

Authors: Lulu Wan; Kate Crookes; Amy Dawel; Madeleine Pidcock; Ashleigh Hall; Elinor McKone
Journal: J Exp Psychol Gen Date: 2016-11-28

6. Beliefs about beliefs: representation and constraining function of wrong beliefs in young children's understanding of deception.

Authors: H Wimmer; J Perner
Journal: Cognition Date: 1983-01

7. Validation of the Reading the Mind in the Eyes Test in a healthy Spanish sample and women with anorexia nervosa.

Authors: Iratxe Redondo; David Herrero-Fernández
Journal: Cogn Neuropsychiatry Date: 2018-04-11 Impact factor: 1.871

8. Training for generalization in Theory of Mind: a study with older adults.

Authors: Elena Cavallini; Federica Bianco; Sara Bottiroli; Alessia Rosi; Tomaso Vecchi; Serena Lecce
Journal: Front Psychol Date: 2015-08-04

9. The Effectiveness of Theory of Mind Training On the Social Skills of Children with High Functioning Autism Spectrum Disorders.

Authors: Narges Adibsereshki; Abbas Nesayan; Roghayeh Asadi Gandomani; Masood Karimlou
Journal: Iran J Child Neurol Date: 2015

10. Knowing me, knowing you: theory of mind in AI.

Authors: F Cuzzolin; A Morelli; B Cîrstea; B J Sahakian
Journal: Psychol Med Date: 2020-05-07 Impact factor: 7.723