
Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.

Karyn Ayre^1,2, André Bittar^3, Joyce Kam^4, Somain Verma^4, Louise M Howard^1,2, Rina Dutta^2,3.

Abstract

BACKGROUND: Self-harm occurring within pregnancy and the postnatal year ("perinatal self-harm") is a clinically important yet under-researched topic. Current research likely under-estimates prevalence due to methodological limitations. Electronic healthcare records (EHRs) provide a source of clinically rich data on perinatal self-harm. AIMS: (1) To create a Natural Language Processing (NLP) tool that can, with acceptable precision and recall, identify mentions of acts of perinatal self-harm within EHRs. (2) To use this tool to identify service-users who have self-harmed perinatally, based on their EHRs.
METHODS: We used the Clinical Record Interactive Search system to extract de-identified EHRs of secondary mental healthcare service-users at South London and Maudsley NHS Foundation Trust. We developed a tool that applied several layers of linguistic processing based on the spaCy NLP library for Python. We evaluated mention-level performance in the following domains: span, status, temporality and polarity. Evaluation was done against a manually coded reference standard. Mention-level performance was reported as precision, recall, F-score and Cohen's kappa for each domain. We also assessed performance at 'service-user' level and explored whether a heuristic rule improved it. We report per-class statistics for service-user performance, as well as likelihood ratios and post-test probabilities.
RESULTS: Mention-level performance: micro-averaged F-score, precision and recall for span, polarity and temporality >0.8. Kappa for status 0.68, temporality 0.62, polarity 0.91. Service-user level performance with heuristic: F-score, precision, recall of minority class 0.69, macro-averaged F-score 0.81, positive LR 9.4 (4.8-19), post-test probability 69.0% (53-82%). Considering the task difficulty, the tool performs well, although temporality was the attribute with the lowest level of annotator agreement.
CONCLUSIONS: It is feasible to develop an NLP tool that identifies, with acceptable validity, mentions of perinatal self-harm within EHRs, although with limitations regarding temporality. Using a heuristic rule, it can also function at a service-user-level.


Year: 2021 PMID: 34347787 PMCID: PMC8336818 DOI: 10.1371/journal.pone.0253809

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Self-harm is defined by the National Institute for Health and Care Excellence as an “act of self-poisoning or self-injury carried out by a person, irrespective of their motivation” [1]. Data from several high-income countries indicates that self-harm is increasingly common, particularly in young women [2, 3]. During pregnancy and the postnatal year, a time known as “the perinatal period”, around 5–14% of women are estimated to experience thoughts of self-harm [4]. Yet there remains an evidence gap around acts of perinatal self-harm [5]. Given self-harm is strongly associated with mental disorder [6], this is likely to be the case for perinatal self-harm. It may therefore be a marker of unmet treatment need. Suicide is a leading cause of maternal death and such suicides are frequently preceded by acts of perinatal self-harm [7, 8]. Current evidence regarding the prevalence of perinatal self-harm is mainly derived from studies using administrative hospital discharge datasets which may under-represent the true prevalence [5]. Evidence suggests perinatal self-harm is more common in women with serious mental illness (SMI) [5], meaning this population should be a focus of research. The widespread use of electronic healthcare records (EHRs) means that large amounts of nuanced clinical information can be centrally stored for large cohorts of service-users. However, free-text documentation means many clinical variables are not readily extractable. A free-text search strategy could identify self-harm synonyms but would lack the contextual “awareness” required to distinguish relevant from non-relevant mentions, such as thoughts of or statements of negation of self-harm. Natural Language Processing (NLP) can recognise relevant linguistic context (e.g. lexical variation, grammatical structure, negation) and is increasingly used in clinical research to extract information from EHRs [9, 10]. 
The use of NLP to investigate suicidality is relatively new and the literature is small [11]. However, NLP has been used to identify suicidality in EHRs [12-14], including those of adolescents with autism spectrum disorders [15]; general hospital [16] and primary care attenders [17]. To our knowledge, only one other group has used NLP to identify perinatal self-harm. Self-harm was identified as part of a composite measure of both thoughts of suicide and acts of self-harm [18, 19] and not specifically among women with SMI. In this study, we aimed to develop an NLP tool for the purpose of identifying acts of perinatal self-harm at a mention and service-user level, within de-identified EHRs of women with SMI.

Materials and methods

Data sources

The South London and Maudsley (SLaM) National Institute For Health Research (NIHR) Biomedical Research Centre (BRC) Clinical Record Interactive Search (CRIS) system [20] provides regulated access to a database of de-identified EHRs of all service-users accessing South London and Maudsley NHS Foundation Trust (SLaM), which is the largest secondary mental healthcare provider in the United Kingdom. In this context, “EHR” refers to a single clinical document, within one universal electronic healthcare recording system called the “Electronic Patient Journey System”. CRIS is linked with Hospital Episode Statistics (HES) [21], a database of anonymised clinical, administrative and demographic details of NHS hospital admissions of service-users over the age of 18. By searching for codes indicating delivery, linkage with CRIS has been demonstrated to be a valid way of generating a cohort of women accessing secondary mental healthcare during the perinatal period [22].

Ethical approval

CRIS has pre-existing ethical approval via the Oxfordshire Research Ethics Committee C (ref 18/SC/0372). Linkage with HES data is managed by the Clinical Data Linkage Service. The BRC has ethical and Section 251 approval to enable linkage with HES (Ref: ECC 3-04(f)/2011). This project was done under the CRIS Oversight Committee approval 16–069 that relates to KA’s Fellowship. The CRIS Oversight Committee is chaired by a service-user and member of the SLaM BRC Stakeholder Participation theme.

Development of coding rules

Self-harm is a complex concept and may be defined in different ways. The clinical validity of describing self-harm based on suicidal intent (e.g. “suicide attempts” versus “non-suicidal self-injury”) has been questioned [23]. The NICE definition of self-harm does not incorporate intent [1]. Therefore, when creating a list of synonyms or “keywords” for self-harm, we conceptualised self-harm broadly and utilised several sources: the secondary mental healthcare clinical expertise of the first author, the general literature on self-harm [24, 25] and terms used in other studies of self-harm in EHRs [22, 26–28]. See S1 File for a full list of keywords. Mentions of these keywords within the EHRs were appropriately annotated in a sample of 131 EHRs pre-selected from previous research into self-harm in pregnant women with affective and non-affective psychotic disorders by Taylor et al [22, 28]. We devised rules regarding the span of text to annotate as a mention and how to annotate mention attributes (see S2 File).

Span

Only the keyword within the mention, not the surrounding text or whole sentence, was annotated. The keyword was usually a noun and direct synonym of self-harm, e.g. overdose. Occasionally, the keyword was a noun, but not a direct synonym, e.g. in the phrase “she had scratches on her arm” only the indicative noun (i.e. “scratches”) was annotated. If the keyword was an adjective that modified a noun, it was annotated along with the noun it described, e.g. in the phrase “she had a self-harming impulse”, both “self-harming” and “impulse” were annotated. Where the keyword was a verb, the direct object noun/pronoun that it related to was also annotated, e.g. in the phrase “she cut herself”, both “cut” and “herself” were annotated. Occasionally, the verb implied a passive or non-deliberate action. For example: “she climbed out a window and fell off”. Falling is a passive or unintentional event, as opposed to jumping. However, in this case the prior act of climbing indicates an active element. Although falling is passive, it was the fall that caused harm. Therefore, the verb, the pronoun it related to and the intervening words “fell off” were annotated.

Attributes and coding rules

We identified three main attributes of mentions of self-harm: status, temporality and polarity. Status specified whether a self-harm event occurred or not. For example, if a mention described thoughts of self-harm, rather than an act of self-harm, it was annotated as non-relevant. Mentions of third-party self-harm (e.g. “her mother took an overdose”) were also annotated as non-relevant. We included an “uncertain” category because, in a very small number of cases, it was not possible, even with whole document context, to determine whether a mention was referring to an act of self-harm or not. Temporality specified whether an act was current or historical. We were interested in self-harm occurring during pregnancy and/or the postpartum year. As only EHRs created within the service-user’s perinatal period were being annotated, non-perinatal temporality was sometimes obvious, e.g. “took an overdose ten years ago”. Events which occurred within one month prior to the EHR were coded as current. This time frame is the same as that used in previous work investigating the prevalence of self-harm in the EHRs of a cohort of pregnant women in CRIS [28] and reflects the standard time period often used in clinical interviews that ask about self-harm, such as the Mini-International Neuropsychiatric Interview [29]. Ambiguous references to chronic events were problematic, e.g. “chronic history of self-harm”. Although this mention describes a chronic occurrence, i.e. one happening in the past, it also indicates that the events are potentially ongoing. We decided to code such mentions as current. We initially included an “uncertain” category in order to flag complex cases during manual annotation, although not as an attribute option for the final tool. Polarity specified whether or not the mention expressed a negation of self-harm (e.g. “she denied self-harm”). The purpose of this attribute was to allow the algorithm to filter out negations. Occasionally negation was written using symbols, e.g. “Suicide attempts: X”. Here, the meaning of the mention was annotated, i.e. polarity negative.

Manual annotation of a reference standard

For the purposes of developing and evaluating the tool’s performance, we created a reference standard, manually annotated, corpus of EHRs. First, we randomly sampled 400 EHRs from Taylor’s study of self-harm in pregnant SLaM service-users with affective and non-affective psychotic disorders [22, 28]. All EHRs were independently double-annotated by three annotators (KA, JK, SV) according to the coding rules, using Extensible Human Oracle Suite of Tools (eHOST) software [30]. We measured pairwise inter-annotator agreement in terms of precision (positive predictive value), recall (sensitivity) and F-score (harmonic mean of precision and recall), as well as kappa [31] (agreement adjusted for chance) for attributes. Agreement scores were calculated using the scikit-learn (version 0.21.3) machine learning library for Python [32]. The final reference standard was created by adjudication of disagreements by KA. This was split into development (N = 320 EHRs, 152 service-users) and test (N = 80 EHRs, 59 service-users) sets.
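The agreement measures named above can be written out in a few lines. The study computed them with scikit-learn; the plain-Python sketch below is only an illustration of the definitions, and the function names and the toy inputs are assumptions, not taken from the study.

```python
from collections import Counter

def precision_recall_f(reference_spans, candidate_spans):
    """Treat one annotator as the reference and score the other against it."""
    ref, cand = set(reference_spans), set(candidate_spans)
    tp = len(ref & cand)  # spans marked identically by both annotators
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def cohen_kappa(labels_a, labels_b):
    """Observed agreement on attribute labels, adjusted for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    # Undefined when expected agreement is 1 (both annotators use one label only).
    return (observed - expected) / (1 - expected)

# Invented toy labels for four mentions, purely for illustration.
a = ["current", "current", "historical", "historical"]
b = ["current", "historical", "historical", "historical"]
print(cohen_kappa(a, b))
```

Spans are compared by exact match here; whether the study counted partial overlaps as agreement is not stated in this section.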

NLP development

System description

We developed a rule-based tool around spaCy (version 2.1.3), an NLP library for Python. Code for the tool is available online [33]. The tool takes a text as input and applies five processing layers in sequential order, outputting an XML file in which all detected self-harm mentions and their attributes are annotated with XML tags. Each layer of processing adds annotations that are available in subsequent layers. The five processing layers are as follows:

1. Linguistic pre-processing

Sentence detection, tokenisation (segmentation of the text into word tokens), part-of-speech tagging (determining the grammatical category of words), lemmatisation (finding the “root” form of inflected words) and dependency parsing (determining the grammatical relations between words). The tokenisation step includes a set of custom tokenisation rules to deal with errors made by spaCy’s default tokeniser (e.g. self-harm, self-injury, fh/o which are incorrectly split into several word tokens). The dependency parsing step identifies syntactic relations such as subject, direct object, modifier and negation. Dependency parsing has been used in prior work on the analysis of clinical texts, for such tasks as relation extraction [34, 35], identifying family history [36] and negation detection [37].
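The purpose of the custom tokenisation rules can be illustrated without spaCy. The sketch below is a standard-library stand-in, not the tool's implementation: the `PROTECTED` list and the regular expression are assumptions chosen to show how terms such as self-harm or fh/o can be kept as single tokens.

```python
import re

# Hypothetical list of terms that must survive tokenisation intact;
# the three entries are the examples named in the text.
PROTECTED = ["self-harm", "self-injury", "fh/o"]

# Protected terms are tried first (longest first), then ordinary words,
# then any single non-space character such as punctuation.
_alternatives = sorted(PROTECTED, key=len, reverse=True)
TOKEN_RE = re.compile(
    "|".join(re.escape(term) for term in _alternatives) + r"|\w+|[^\w\s]",
    re.IGNORECASE,
)

def tokenise(text):
    """Split text into tokens, keeping protected terms whole."""
    return TOKEN_RE.findall(text)

print(tokenise("fh/o self-harm; denies self-injury."))
```

Without the protected alternatives, the same expression would split "self-harm" into three tokens, which is the kind of error the custom rules are described as correcting.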

2. Lexical rules

This step consists of tagging of words with a given semantic category according to a set of 13 manually created lexicons. These lexicons include terms for self-harm, body parts, as well as relevant negation and temporal markers. A full list of these lexicons and example content is shown in Table 1.

Table 1

Lexicons used for tagging of semantic categories.

Category	Example terms	Annotation	Example
Self-harm	DSH, overdose	SH	She took an overdose
Body part	wrist, hand, torso	BODY_PART	She had cut her left wrist
Harm action	cut, burn, hit, lacerate	HARM_ACTION	She lacerated her arm
Family members	mother, father, daughter	FAMILY	Her mother took an overdose
Uncertainty	plan, prone, risk, thought	HEDGING	She would cut herself
Intention	aim, deliberately, intend	INTENT	She cut herself deliberately
Medication	olanzapine, paracetamol, aspirin	MED	She took 12 paracetamol tablets
Modality	could, would, possible	MODALITY	Possibility of self-harm
Negation	not, never, no, deny	NEGATION	Denies self-harm
Reported speech	say, claim, disclose	R_SPEECH	She disclosed having thoughts of cutting herself
Life stages	adolescent, teenager, young, kid	LIFE_STAGE	She started self-harming in her teens
Past references	previous, past, historical	PAST	Previous episodes of self-harm
Present references	Monday, current, recent	PRESENT	Current episode of self-harm
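Lexical tagging of this kind amounts to a dictionary lookup on lemmas. The following sketch uses a handful of entries copied from Table 1; the `LEXICONS` structure and the `tag_lemmas` function are illustrative assumptions rather than the tool's own code.

```python
# A few entries copied from Table 1; the real lexicons are larger.
LEXICONS = {
    "SH": {"dsh", "overdose"},
    "BODY_PART": {"wrist", "hand", "torso"},
    "HARM_ACTION": {"cut", "burn", "hit", "lacerate"},
    "NEGATION": {"not", "never", "no", "deny"},
    "PAST": {"previous", "past", "historical"},
}

def tag_lemmas(lemmas):
    """Return (lemma, category) pairs; category is None when no lexicon matches."""
    tagged = []
    for lemma in lemmas:
        category = next(
            (name for name, terms in LEXICONS.items() if lemma.lower() in terms),
            None,
        )
        tagged.append((lemma, category))
    return tagged

# Lemmas are assumed to come from the pre-processing layer,
# e.g. "lacerated" -> "lacerate" and "denies" -> "deny".
print(tag_lemmas(["she", "lacerate", "her", "wrist"]))
```

Matching on lemmas rather than surface forms is what lets a single lexicon entry such as "deny" cover "denies", "denied" and "denying".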

3. Token sequence rules

This layer of processing consists of a sequence of regular token-based grammars that take into account the context in which words appear. Grammar rules have access to all linguistic features added during pre-processing, as well as semantic categories added during lexical tagging. These rules are applied to detect self-harm expressions in context and to correct and update the annotations added by previous processing layers. They are used both to detect or exclude mention spans and to assign attribute values. A specific set of token sequence rules is used to identify history sections in EHRs. Each rule has three attributes: ‘name’ is the unique rule name used for development purposes, ‘pattern’ is the token sequence pattern to match in the text, and ‘annotation’ is the attribute and value that is marked on the recognised token sequence.
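A rule with the three attributes described above might look like the following. This is a minimal sketch under stated assumptions: the rule content, the token dictionaries and the `match_rule` function are hypothetical, and the matcher is a plain-Python stand-in rather than the tool's grammar engine.

```python
# Hypothetical rule: a harm verb, then a pronoun, then a body part.
RULE = {
    "name": "harm_action_on_body_part",
    "pattern": [{"tag": "HARM_ACTION"}, {"pos": "PRON"}, {"tag": "BODY_PART"}],
    "annotation": {"label": "SH"},
}

def match_rule(tokens, rule):
    """Return (start, end, annotation) for every token span matching the pattern."""
    pattern = rule["pattern"]
    width = len(pattern)
    matches = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every constraint in each pattern element must hold for the aligned token.
        if all(
            token.get(key) == value
            for token, constraint in zip(window, pattern)
            for key, value in constraint.items()
        ):
            matches.append((start, start + width, rule["annotation"]))
    return matches

# "She lacerated her arm" with part-of-speech and lexicon tags assigned by hand.
TOKENS = [
    {"text": "She", "pos": "PRON", "tag": None},
    {"text": "lacerated", "pos": "VERB", "tag": "HARM_ACTION"},
    {"text": "her", "pos": "PRON", "tag": None},
    {"text": "arm", "pos": "NOUN", "tag": "BODY_PART"},
]
print(match_rule(TOKENS, RULE))
```

Because each pattern element can constrain either the grammatical category or the semantic category of a token, one rule can cover many surface variants of the same construction.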

4. Negation detection

Negation is detected using the syntactic dependency tree for each sentence. Any mention that heads a ‘neg’ grammatical dependency is annotated as negative (e.g. “she did not cut herself”). If a mention’s governor is a negated reported speech verb (R_SPEECH), the mention is also assigned negative polarity (e.g. “she did not report harming herself”). Finally, any mention governed by a word annotated as NEGATION is also annotated as having negative polarity (e.g. “she denies any self-harm”).
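The three negation rules can be expressed over a simple dependency representation. In the sketch below the `Token` class, the function names and the hand-written parse are assumptions made for illustration; in the tool itself the dependency tree comes from the parser.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    head: int      # index of the governing token (the root points to itself)
    dep: str       # dependency label linking this token to its head
    tag: str = ""  # semantic category from the lexical layer, if any

def has_neg_child(tokens, index):
    """True if any token attaches to tokens[index] with a 'neg' dependency."""
    return any(t.head == index and t.dep == "neg" for t in tokens)

def polarity(tokens, mention):
    """Apply the three rules in turn; default to positive polarity."""
    if has_neg_child(tokens, mention):          # "she did not cut herself"
        return "negative"
    governor = tokens[mention].head
    if governor != mention:
        gov = tokens[governor]
        if gov.tag == "R_SPEECH" and has_neg_child(tokens, governor):
            return "negative"                   # "she did not report harming herself"
        if gov.tag == "NEGATION":
            return "negative"                   # "she denies any self-harm"
    return "positive"

# Hand-written parse of "she denies any self-harm" (an assumed analysis).
sentence = [
    Token("she", 1, "nsubj"),
    Token("denies", 1, "ROOT", "NEGATION"),
    Token("any", 3, "det"),
    Token("self-harm", 1, "dobj", "SH"),
]
print(polarity(sentence, 3))
```

For multi-token mentions, the index passed in would be that of the mention's head word; how the tool selects that head is not described in this section.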

5. Contextual search

To further assign values to attributes for identified mentions, a contextual search is used to detect markers of temporality and status. A window of ten tokens to the left and right of a mention is used as context. If a token labelled ‘past’ is found within this window, the mention is labelled as historical. Similarly, if a token labelled ‘hedging’ or ‘modality’ is found within the window, the mention is annotated as non-relevant.
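The contextual search can be sketched as a lookup over the semantic tags of neighbouring tokens. The function name and the list-of-tags input are assumptions; the default values ("current" and "relevant") are also assumed here, although a default of current temporality is consistent with the error analysis reported later.

```python
def contextual_attributes(tags, mention_index, window=10):
    """Assign temporality and status from tags within `window` tokens of a mention.

    `tags` holds one semantic category (or None) per token in the document.
    """
    lo = max(0, mention_index - window)
    hi = min(len(tags), mention_index + window + 1)
    # Context is everything inside the window except the mention itself.
    context = set(tags[lo:mention_index] + tags[mention_index + 1:hi])
    temporality = "historical" if "PAST" in context else "current"
    status = "non-relevant" if context & {"HEDGING", "MODALITY"} else "relevant"
    return {"temporality": temporality, "status": status}

# "Previous episodes of self-harm": PAST marker three tokens before the mention.
print(contextual_attributes(["PAST", None, None, "SH"], mention_index=3))
```

A symmetric window is simple but blind to word order, which is one way the errors with modals appearing after a mention, described in the error analysis, can arise.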

Unusual linguistic cases

During development of the coding rules, we identified unusual examples that did not fit with our pre-defined strategy. This led to the refinement of the tool’s processing layers. Some examples are detailed in Fig 1, with the relevant keyword highlighted in italics.

Fig 1

Examples of unusual linguistic cases.

Further development: Service-user selection heuristic

In our reference standard, the majority of service-users who had self-harmed perinatally had more than one mention in their EHRs (see S1 Table). Based on this, we explored the use of a service-user selection heuristic, whereby we restricted flagging of service users as true positive cases to only those who had two or more mentions of perinatal self-harm in their EHRs.
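The selection heuristic reduces to counting qualifying mentions per service-user. In this sketch the dictionary layout, the helper names and the identifiers "A", "B" and "C" are invented for illustration; the definition of a qualifying mention (relevant status, positive polarity, current temporality) follows the one given with the service-user-level results.

```python
from collections import Counter

def is_true_mention(m):
    """A mention counts only if it is a relevant, non-negated, current act."""
    return (
        m["status"] == "relevant"
        and m["polarity"] == "positive"
        and m["temporality"] == "current"
    )

def flag_service_users(mentions, min_mentions=2):
    """Flag service-users with at least `min_mentions` qualifying mentions."""
    counts = Counter(m["service_user"] for m in mentions if is_true_mention(m))
    return {user for user, n in counts.items() if n >= min_mentions}

def mention(user, status="relevant", polarity="positive", temporality="current"):
    return {"service_user": user, "status": status,
            "polarity": polarity, "temporality": temporality}

# Hypothetical mentions pooled across all EHRs of three service-users.
mentions = [
    mention("A"), mention("A"),
    mention("B"), mention("B", polarity="negative"),
    mention("C", temporality="historical"),
]
print(flag_service_users(mentions))                  # heuristic: two or more
print(flag_service_users(mentions, min_mentions=1))  # no heuristic
```

Setting `min_mentions=1` reproduces flagging without the heuristic, so the same function covers both configurations being compared.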

Results

Inter-annotator agreement

Table 2 presents micro-averaged pairwise inter-annotator agreement on mention spans and attributes using precision, recall and F-score and Cohen’s kappa (39), within the development set of EHRs (N = 320 documents). Due to the very small number of cases of “uncertain” status and temporality, the high degree of class imbalance means macro-averaged figures were not a fair representation of performance (see S2 Table). All figures are rounded to 2 decimal places.

Table 2

Micro-averaged pairwise inter-annotator agreement.

	Precision	Recall	F-score	Kappa
Span	0.83	0.89	0.85	N/A
Polarity	0.96	0.96	0.96	0.92
Temporality	0.90	0.90	0.90	0.78
Status	0.94	0.94	0.94	0.88

Evaluation of the tool

We evaluated the tool on two levels: mention and service-user. Table 3 shows the micro-averaged mention-level evaluation statistics from both the development (N = 320 documents) and test (N = 80 documents) datasets. Again, class imbalance for status meant macro-averaging was not appropriate (only 9 mentions of “uncertain” status in reference standard, see S3 Table for macro-averaged results) so micro-averaged results are presented.

Table 3

Micro-averaged mention-level evaluation results.

	Development set				Test set
	Precision	Recall	F-score	Kappa	Precision	Recall	F-score	Kappa
Span	0.97	0.85	0.90	N/A	0.94	0.81	0.87	N/A
Polarity	0.94	0.94	0.94	0.88	0.96	0.96	0.96	0.91
Temporality	0.81	0.81	0.81	0.57	0.83	0.83	0.83	0.62
Status	0.89	0.89	0.89	0.76	0.88	0.88	0.88	0.68
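The difference between micro- and macro-averaging under class imbalance can be shown with a small example. The labels below are invented and the function names are assumptions; the sketch only illustrates the general definitions. When every mention receives exactly one label, micro-averaged precision, recall and F-score coincide, which is consistent with the identical values in the attribute rows of Table 3.

```python
def counts(gold, pred, label):
    """True positives, false positives and false negatives for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return tp, fp, fn

def f_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def micro_f(gold, pred):
    """Pool the counts over all classes, then compute a single F-score."""
    labels = sorted(set(gold) | set(pred))
    totals = [sum(c) for c in zip(*(counts(gold, pred, l) for l in labels))]
    return f_score(*totals)

def macro_f(gold, pred):
    """Compute an F-score per class, then take the unweighted mean."""
    labels = sorted(set(gold) | set(pred))
    return sum(f_score(*counts(gold, pred, l)) for l in labels) / len(labels)

# Invented labels: one rare "uncertain" mention that the classifier misses.
gold = ["relevant"] * 8 + ["non-relevant", "uncertain"]
pred = ["relevant"] * 8 + ["non-relevant", "relevant"]
print(micro_f(gold, pred), macro_f(gold, pred))
```

A single error on the rare class leaves the micro-average high but pulls the macro-average down sharply, which is the effect the text describes for the "uncertain" category.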

Service-user-level performance indicates how well the tool identifies service-users who have at least one recorded “true” self-harm mention in any of their EHRs. A “true” mention has the attribute values status = relevant, polarity = positive, temporality = current. We present results with and without the heuristic rule of at least two positive mentions, derived from the test set (Table 4). When the tool was run with the heuristic, there were no false positives, meaning there were issues with perfect prediction. Total absence of false positives is unlikely to occur in a very large sample and, in this case, most likely indicates the sample size of the test set (N = 59 service-users) is too small for patient-level analysis. We therefore also present service-user results in the development set (N = 152 service-users, Table 5).

Table 4

Service-user-level evaluation results on the test set (N = 59 service-users).

	Manual Coding	Tool	Tool with Heuristic
Service-users flagged	11	14	4
Prevalence	18.6%	23.7%	6.8%
Precision_MAJ	N/A	0.91	0.87
Precision_MIN	N/A	0.50	1
Precision_MACRO	N/A	0.71	0.94
Recall_MAJ	N/A	0.85	1
Recall_MIN	N/A	0.64	0.36
Recall_MACRO	N/A	0.75	0.68
F-score_MAJ	N/A	0.88	0.93
F-score_MIN	N/A	0.56	0.53
F-score_MACRO	N/A	0.72	0.73
Kappa	N/A	0.44	0.48
LR_POS (95% CI)	N/A	4.4 (1.9–9.9)	Infinity
LR_NEG (95% CI)	N/A	0.4 (0.2–0.9)	0.6 (0.4–1.0)
Post-test probability_POS (95%CI)	N/A	50.0 (31–69%)	100%
Post-test probability_NEG(95%CI)	N/A	8.9 (4–18%)	12.7 (9–18%)

Table 5

Service-user-level evaluation results on the development set (N = 152 service-users).

	Manual Coding	Tool	Tool with Heuristic
Service-users flagged	29	46	29
Prevalence	19.1%	30.3%	19.1%
Precision_MAJ	N/A	0.97	0.93
Precision_MIN	N/A	0.57	0.69
Precision_MACRO	N/A	0.77	0.81
Recall_MAJ	N/A	0.84	0.93
Recall_MIN	N/A	0.90	0.69
Recall_MACRO	N/A	0.87	0.81
F-score_MAJ	N/A	0.90	0.93
F-score_MIN	N/A	0.69	0.69
F-score_MACRO	N/A	0.80	0.81
Kappa	N/A	0.60	0.62
LR_POS (95%CI)	N/A	5.5 (3.6–8.4)	9.4 (4.8–19)
LR_NEG (95%CI)	N/A	0.1 (0.04–0.4)	0.3 (0.2–0.6)
Post-test probability_POS (95%CI)	N/A	56.5 (46–66%)	69.0 (53–82%)
Post-test probability_NEG(95%CI)	N/A	2.8 (1–8%)	7.3 (4–12%)

Due to class imbalance, we report per-class precision, recall and F-score (e.g. precision_MAJ, precision_MIN) as well as the macro-averaged value (e.g. precision_MACRO). The ultimate purpose of this tool is to identify service-users who have self-harmed perinatally within a cohort. For this reason, we also present positive and negative likelihood ratios (LR_POS, LR_NEG) and post-test probabilities.
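Likelihood ratios and post-test probabilities follow directly from a two-by-two table. In the sketch below the function name is an assumption, and the four counts are not reported in the paper: they are back-calculated from the heuristic column of Table 5 (29 service-users flagged, minority-class precision and recall both 0.69, N = 152) and should be read as an inference.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Likelihood ratios and post-test probabilities at the sample prevalence."""
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)  # 1 - specificity
    specificity = tn / (fp + tn)
    return {
        # With no false positives the positive likelihood ratio is unbounded.
        "lr_pos": sensitivity / false_positive_rate if fp else float("inf"),
        "lr_neg": (1 - sensitivity) / specificity,
        # Probability of self-harm given a positive / negative flag.
        "post_test_pos": tp / (tp + fp),
        "post_test_neg": fn / (fn + tn),
    }

# Counts inferred from Table 5 (tool with heuristic); an assumption, see above.
summary = diagnostic_summary(tp=20, fp=9, fn=9, tn=114)
print({name: round(value, 3) for name, value in summary.items()})
```

The `float("inf")` branch corresponds to the "Infinity" entry in Table 4, where the heuristic produced no false positives on the test set.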

Error analysis

Span errors

To identify remaining weaknesses in the tool, we performed error analysis on the mention-level evaluation of the test set. The most common recurrent span error was the tool missing mentions of “suicide” that had been annotated in the reference standard test set. Whilst death by suicide is not the same as self-harm (which is, by definition, non-fatal), the conceptual line between suicide and self-harm is, in terms of clinician documentation, often blurred. For example, we found clinicians would document: “no history of suicide”. Clearly, in a clinical entry on a living service-user, a history of death by suicide is impossible. However, this phrase most likely reflects a clinician’s attempt to express that the service-user has no history of attempted suicide, i.e. non-fatal self-harm. There were a small number of instances where the tool erroneously identified phrases not annotated in the test set. This was largely for two reasons. Firstly, there were unusual examples of clinician documentation style that referred to things that were not self-harm, e.g. “OD AM”, to indicate “once daily in the morning”. We had included “OD” in the coding structure, as a synonym for “overdose”. Secondly, there were some specific and uncommon examples of self-harm that were not included in the coding structure, e.g. “drinking” specific poisonous substances. Finally, the grammatical context of the verb “jump” also proved difficult to capture reliably, as this verb can be used with a variety of prepositions that do not always indicate attempted self-harm, e.g. “jump [to] kill herself”/“jump [out of] a window” are valid mentions of self-harm, while “jump [down] the stairs” and “jump [to] conclusions” are not.

Attribute errors

Regarding errors on the status attribute, we assumed the modal auxiliary “would” to be a marker of non-relevance in the tool’s contextual search, as it would usually indicate a future conditional event that had not yet happened. However, the tool sometimes erroneously considered modals appearing after mentions, for example, “she thought the [self-harm] would kill her”. A further recurring status error was that we assumed “risk to self” headings in EHRs indicated the part of clinical assessment known as “risk assessment”, which is a discussion of the service-user’s future risk to self and would therefore contain non-relevant mentions. However, error analysis revealed this phrase was occasionally used as a section header detailing past self-harm events. Regarding temporality, our default approach was to mark events as current unless there was a clear historical marker. However, we found that temporality indicators became unclear in cases using a coordination, e.g. “no current [or] past suicide attempts”. The tool annotated this mention as current, whilst in the reference standard it was annotated as historical. Assessing temporality was also problematic where there were no contextual markers due to the short-hand note-taking style of the clinician, for example “2 x OD”.

Discussion

Self-harm is a conceptually and clinically complex area. Framing it temporally within the narrow time-period of pregnancy and the postnatal year increases the complexity. However, we have shown that it is possible to develop an NLP tool that, with acceptable precision and recall, can identify perinatal self-harm within electronic healthcare records at both a mention and patient level. Given the limitations in existing data on the prevalence of perinatal self-harm [5], this is a significant step forward. The pair-wise inter-rater agreement suggests that temporality was the hardest attribute for annotators to agree on. This may reflect the high degree of complexity and ambiguity in ways that self-harm is documented. Micro-averaged mention-level evaluation figures reflect this pattern, although precision, recall and F-score for all attributes were all still >0.8. After adjustment for chance, temporality remains the weakest attribute, although kappa is still almost 0.8. We felt that the test set (Table 4) was too small to evaluate service-user level performance. Using the much larger development set (Table 5), we showed that, by using a heuristic rule of two, we could generate a tool with macro-averaged F-score 0.81 and a high positive likelihood ratio of 9.4 (95% CI 4.8–19). Overall, scores for kappa were lower than precision/recall/F-score (patient-level kappa 0.62), suggesting some agreement may have been due to chance. However the limitations of using kappa in dichotomous classification system performance analysis with unbalanced datasets should be noted [38], particularly where the sample size is small [39]. How the tool performs in a much larger sample would be an interesting area of further study. The use of heuristic rules is commonplace in NLP literature [40-42] and it is well-recognised that in clinical contexts moving from mention to person-level performance often requires “post-processing” [43]. 
We believe the use of a heuristic in this case does have face validity, as in reality if a service-user has self-harmed perinatally this is a significant clinical event, meaning it is likely to be followed up at subsequent visits or by other clinicians, i.e. further mentions of it would be generated within the service-user’s body of EHRs. We believe this tool could potentially be adapted to ascertain self-harm in other contexts. Work is currently underway to investigate whether it can be adapted to ascertain self-harm in adolescent populations and among women with eating disorders.

Strengths

We believe this is a novel development in the field of using NLP to investigate self-harm, as it focusses specifically on acts of perinatal self-harm among women with SMI. We used a bespoke NLP strategy developed using both clinical and NLP expertise. Our iterative approach meant that we could use unusual examples encountered in development phases of annotation to refine the tool.

Limitations

Our corpus was relatively small and generalisability to EHRs from other populations and mental healthcare providers is uncertain. The main outcome of error analysis is that it is often hard to find reliable contextual markers for ambiguous mentions. The use of syntactic coordination (and, or, etc.) often makes this even more problematic. Temporality is notoriously difficult to analyse with NLP and is a field of research in its own right [44, 45]. The analysis also reveals something about how clinical note-taking is done, e.g. the high variability in the words and formulations used by clinicians.

Conclusions

We have shown, using novel methods and a combination of clinical and linguistic processing expertise, that it is possible to develop an NLP tool that will, with acceptable precision and recall, identify perinatal self-harm in electronic healthcare records, albeit with limitations, particularly in terms of defining temporality.

Supporting information

S1 Table. Number of true mentions of self-harm per service-user, within the reference standard dataset.

(DOCX)

S2 Table. Macro-averaged pairwise inter-annotator agreement.

(DOCX)

S3 Table. Macro-averaged mention-level performance.

(DOCX)

S1 File. List of synonyms for self-harm.

(DOCX)

S2 File. Annotation guidelines.

(DOCX)

18 May 2021

PONE-D-21-06370

Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.

PLOS ONE

Dear Dr. Ayre,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 02 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Natalia Grabar

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

3. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information. If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments:

Dear Authors,

Please consider the reviewers' comments and take them into account when preparing the new version of the submission. The issues on generalizability and reusability of the methods should be addressed. Please, prepare the letter with answers.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: [REVIEW] Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.

PLOS ONE

21st April 2021

# SUMMARY

This paper presents work on the development of an NLP algorithm to detect individuals who have experienced perinatal self-harm. 400 clinical notes were sampled from the South London & Maudsley hospital system secondary mental healthcare service representing 232 distinct patients. Interannotator agreement ranged from 0.78 to 0.92. The notes were divided into development and test sets (320 and 80, respectively) and a rule-based system built around the Spacy Python NLP library was developed. The mention level performance of the system performed quite well, as did the patient-level classification, resulting in an F-score of 0.81.
This was a clear, well-presented paper on an interesting and important topic (i.e. the identification of perinatal self-harm). The methodology adopted was appropriate and reasonable for the stated research goals. I have made some additional comments/suggestions below. Thanks for the opportunity to review this work. # COMMENTS * ln 56. This isn’t a major issue, but I believe “reference standard” is the preferred term (rather than “gold standard”) * ln 57. “As service users usually had more than one EHR…” This is a bit ambiguous. It could mean either EHR System (i.e. a patient uses two medical systems with different EHR systems) or particular EHR encounters. I believe — based on usage later in the paper — that you mean it in the latter sense. * ln 61. The results section of the abstract is a little schematic * ln 129. “We devised rules around the span of text to annotate as a mention and the attributes to annotate the mention with.” I think there is an grammar issue here * ln 145. “For example, if a mention described someone who had thoughts of self-harm, by definition, no act of self-harm had occurred” I’m not so sure about this. I think “by definition” may be too strong given that it is quite possible that someone may have suicidal ideation and also harm themselves. I’d suggest weakening this (e.g. “indicating…”) * ln 169. “First, we randomly sampled 400 EHRs from Taylor’s study of self harm in pregnant SLaM service-users with affective and non-affective psychotic disorders (23).” See comment re EHRs above * I may have missed this, but do you have the annotation guidelines as supplementary materials * ln 187. I’m not sure “tokens” should be in quotation marks * ln 195. “A full list of these lexicons and example content is shown in Table 1” * ln 209. You might include a couple of sentences on dependency parsing here (what it is, how it has been used with EHR data in the past) * ln 216. Isn’t negation covered in the previous section * ln 308. 
The structure of the discussion section is a bit confusing (2 “discussion” headers) * ln 331. I agree with the strengths listed, but limitations should also — I believe — include the relatively small annotated corpus and unknown generalisability to other (non-SLAM) records. * ln 347. Some of the references are incomplete * It would be nice to have more discussion of (potential) downstream applications of the tool in the discussion section Reviewer #2: The article proposes to implement an NLP tool apply in a very specific context to identify perinatal self-harm. From a general point of view, the article is sound, easy to read and precise. We do not have access to the all data for ethical concerns. In the end, the results are in general good. The authors choose to work on different tasks and some of them are complex. Thus it is not a surprise to have some disagreement on Kappa scores or other. The final tool seems to be useful in the very specific situation of the authors. My main concerns is about the re-usability of the result. Authors pre-process the data and write ad hoc rules for the data. I am not sure how it generalizes over a given medicine process or on the description of the disease itself. It is quite important in my understanding of the paper because, although the outcome is relevant in the given situation, it seems to rely on artefacts of the institution rather than the description of the disease. This does not call into question the intrinsic value of the result for the hospital service, but it would give it more perspective. Without talking about the reproducibility of the research, it would be interesting to work on the possibility of reusing the proposed strategy for another pathology, especially since the authors are working on a language with many resources. On a more specific point, I am not exactly sure to understand how the negation is used in the tool. This is a crucial point for a correct interpretation, exactly as for the coordination. 
The authors address this issue but it is not clear how it is done. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 2 Jun 2021 Thank you for your letter of 18.05.21 and the two reviewers’ immensely helpful comments. We have undertaken all the revisions to the manuscript (ref PONE-D-21-06370) recommended by them to improve the quality and readability of our manuscript. Below I shall incorporate both reviewers’ comments in order (C1, C2 etc), state our responses (R1, R2 etc) and reference the changes we have made in the revised manuscript. Line numbers refer to the revised manuscript with tracked changes. ============================= Reviewer #1: C1. 
ln 56. This isn’t a major issue, but I believe “reference standard” is the preferred term (rather than “gold standard”)

R1. We have changed “gold standard” to “reference standard” throughout the manuscript and in the title of Supporting Information Table 1.

C2. ln 57. “As service users usually had more than one EHR…” This is a bit ambiguous. It could mean either EHR System (i.e. a patient uses two medical systems with different EHR systems) or particular EHR encounters. I believe, based on usage later in the paper, that you mean it in the latter sense.

R2. Thank you, we appreciate this was not ideally phrased and could lead to confusion. We have now removed the phrase “As service users usually had more than one EHR…” from the relevant sentence in the abstract (lines 61-62). We have also included a sentence in the “Data Sources” paragraph of the Methods section, stating: “In this context, “EHR” refers to a single clinical document, within one universal electronic healthcare recording system called the “Electronic Patient Journey System”” (lines 113-115).

C3. ln 61. The results section of the abstract is a little schematic

R3. We have included a sentence at the end of the results section of the abstract, to improve the narrative quality: “Considering the task difficulty, the tool performs well, although temporality was the attribute with the lowest level of annotator agreement” (lines 68-9).

C4. ln 129. “We devised rules around the span of text to annotate as a mention and the attributes to annotate the mention with.” I think there is an grammar issue here

R4. We have changed this to: “We devised rules regarding the span of text to annotate as a mention and how to annotate mention attributes” (lines 139-40).

C5. ln 145. “For example, if a mention described someone who had thoughts of self-harm, by definition, no act of self-harm had occurred” I’m not so sure about this. I think “by definition” may be too strong given that it is quite possible that someone may have suicidal ideation and also harm themselves. I’d suggest weakening this (e.g. “indicating…”)

R5. We entirely agree and have changed the phrasing to: “For example, if a mention described thoughts of self-harm, rather than an act of self-harm, they were inferred and annotated to be non-relevant” (lines 157-9).

C6. ln 169. “First, we randomly sampled 400 EHRs from Taylor’s study of self-harm in pregnant SLaM service-users with affective and non-affective psychotic disorders (23).” See comment re EHRs above

R6. We believe we have addressed this issue in our response to Point 2.

C7. I may have missed this, but do you have the annotation guidelines as supplementary materials

R7. We did not previously, but we have now included the annotation guidelines as “Supporting Information 2”, and have re-numbered the other Supporting Information and Supporting Information Tables in the manuscript accordingly.

C8. ln 187. I’m not sure “tokens” should be in quotation marks

R8. We have removed the quotation marks.

C9. ln 195. “A full list of these lexicons and example content is shown in Table 1”

R9. Apologies but we are not sure what Reviewer 1 is referring to in point 9. We think it is possible that their comment is missing?

C10. ln 209. You might include a couple of sentences on dependency parsing here (what it is, how it has been used with EHR data in the past)

R10. At lines 206-9, we have added an explanatory sentence and given some examples of previous work which uses dependency parsing on clinical texts.

C11. ln 216. Isn’t negation covered in the previous section

R11. We agree and have replaced “To further assign values to attributes for identified mentions, a contextual search is used to detect markers of negation, temporality and status” with: “To further assign values to attributes for identified mentions, a contextual search is used to detect markers of temporality and status” (lines 236-7).

C12. ln 308. The structure of the discussion section is a bit confusing (2 “discussion” headers)

R12. We have removed the “main findings” and “discussion” sub-headings, merging the content of the discussion in a logical structure, with strengths and limitations having separate sub-headings. We hope this makes it easier to follow.

C13. ln 331. I agree with the strengths listed, but limitations should also, I believe, include the relatively small annotated corpus and unknown generalisability to other (non-SLAM) records.

R13. We have included the following statement in the limitations section: “Our corpus was relatively small and generalisability to EHRs from other populations and mental healthcare providers is uncertain” (lines 364-5).

C14. ln 347. Some of the references are incomplete

R14. Thank you for highlighting this. We have gone through the references and believe all are now complete.

C15. It would be nice to have more discussion of (potential) downstream applications of the tool in the discussion section

R15. We have included the following statement: “We believe this tool could potentially be adapted to ascertain self-harm in other contexts. Work is currently underway to investigate whether it can be adapted to ascertain self-harm in adolescent populations and among women with eating disorders” (lines 354-6).

Reviewer #2:

C1. The final tool seems to be useful in the very specific situation of the authors. My main concerns is about the re-usability of the result. Authors pre-process the data and write ad hoc rules for the data. I am not sure how it generalizes over a given medicine process or on the description of the disease itself. It is quite important in my understanding of the paper because, although the outcome is relevant in the given situation, it seems to rely on artefacts of the institution rather than the description of the disease. This does not call into question the intrinsic value of the result for the hospital service, but it would give it more perspective.

R1. Thank you for this feedback. We believe this point is similar to point 13 made by reviewer 1, to which we responded by including the following sentence in the limitations section: “Our corpus was relatively small and generalisability to EHRs from other populations and mental healthcare providers is uncertain” (lines 364-5).

C2. Without talking about the reproducibility of the research, it would be interesting to work on the possibility of reusing the proposed strategy for another pathology, especially since the authors are working on a language with many resources.

R2. We believe this point is similar to point 15 made by Reviewer 1, to which we responded by including the following sentences in the Discussion section: “We believe this tool could potentially be adapted to ascertain self-harm in other contexts. Work is currently underway to investigate whether it can be adapted to ascertain self-harm in adolescent populations and among service-users with eating disorders” (lines 354-6).

C3. On a more specific point, I am not exactly sure to understand how the negation is used in the tool. This is a crucial point for a correct interpretation, exactly as for the coordination. The authors address this issue but it is not clear how it is done.

R3. Thank you. We have included a further citation of the use of dependency parsing in the context of negation detection (lines 206-9).

Submitted filename: Response To Reviewers_26thMay2021.docx
14 Jun 2021

Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.
PONE-D-21-06370R1

Dear Dr. Ayre,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Natalia Grabar
Academic Editor
PLOS ONE

30 Jun 2021

PONE-D-21-06370R1
Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.

Dear Dr. Ayre:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Natalia Grabar
Academic Editor
PLOS ONE

32 in total

1. Parasuicide in Europe: the WHO/EURO multicentre study on parasuicide. I. Introduction and preliminary analysis for 1989.

Authors: S Platt; U Bille-Brahe; A Kerkhof; A Schmidtke; T Bjerke; P Crepet; D De Leo; C Haring; J Lonnqvist; K Michel
Journal: Acta Psychiatr Scand Date: 1992-02 Impact factor: 6.392

Review 2. Prevalence of suicidality during pregnancy and the postpartum.

Authors: V Lindahl; J L Pearson; L Colpe
Journal: Arch Womens Ment Health Date: 2005-05-11 Impact factor: 3.633

3. Text Classification to Inform Suicide Risk Assessment in Electronic Health Records.

Authors: André Bittar; Sumithra Velupillai; Angus Roberts; Rina Dutta
Journal: Stud Health Technol Inform Date: 2019-08-21

Review 4. The Mini-International Neuropsychiatric Interview (M.I.N.I.): the development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10.

Authors: D V Sheehan; Y Lecrubier; K H Sheehan; P Amorim; J Janavs; E Weiller; T Hergueta; R Baker; G C Dunbar
Journal: J Clin Psychiatry Date: 1998 Impact factor: 4.384

5. The characteristics and health needs of pregnant women with schizophrenia compared with bipolar disorder and affective psychoses.

Authors: Clare L Taylor; Robert Stewart; Jack Ogden; Matthew Broadbent; Dharmintra Pasupathy; Louise M Howard
Journal: BMC Psychiatry Date: 2015-04-17 Impact factor: 3.630

6. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project.

Authors: Richard G Jackson; Rashmi Patel; Nishamali Jayatilleke; Anna Kolliakou; Michael Ball; Genevieve Gorrell; Angus Roberts; Richard J Dobson; Robert Stewart
Journal: BMJ Open Date: 2017-01-17 Impact factor: 2.692

7. Identification of suicidal behavior among psychiatrically hospitalized adolescents using natural language processing and machine learning of electronic health records.

Authors: Nicholas J Carson; Brian Mullin; Maria Jose Sanchez; Frederick Lu; Kelly Yang; Michelle Menezes; Benjamin Lê Cook
Journal: PLoS One Date: 2019-02-19 Impact factor: 3.240

8. Risk Assessment Tools and Data-Driven Approaches for Predicting and Preventing Suicidal Behavior.

Authors: Sumithra Velupillai; Gergö Hadlaczky; Enrique Baca-Garcia; Genevieve M Gorrell; Nomi Werbeloff; Dong Nguyen; Rashmi Patel; Daniel Leightley; Johnny Downs; Matthew Hotopf; Rina Dutta
Journal: Front Psychiatry Date: 2019-02-13 Impact factor: 4.157

9. Kappa and Beyond: Is There Agreement?

Authors: Joseph R Dettori; Daniel C Norvell
Journal: Global Spine J Date: 2020-03-03

10. Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing.

Authors: Andrea C Fernandes; Rina Dutta; Sumithra Velupillai; Jyoti Sanyal; Robert Stewart; David Chandran
Journal: Sci Rep Date: 2018-05-09 Impact factor: 4.379

1 in total

1. Using natural language processing to extract self-harm and suicidality data from a clinical sample of patients with eating disorders: a retrospective cohort study.

Authors: Charlotte Cliffe; Aida Seyedsalehi; Katerina Vardavoulia; André Bittar; Sumithra Velupillai; Hitesh Shetty; Ulrike Schmidt; Rina Dutta
Journal: BMJ Open Date: 2021-12-31 Impact factor: 2.692
