Literature DB >> 32817624

Using de-identified electronic health records to research mental health supported housing services: A feasibility study.

Christian Dalton-Locke1, Johan H Thygesen1,2, Nomi Werbeloff1,2, David Osborn1,2, Helen Killaspy1,2.   

Abstract

BACKGROUND: Mental health supported housing services are a key component in the rehabilitation of people with severe and complex needs. They are implemented widely in the UK and other deinstitutionalised countries but there have been few empirical studies of their effectiveness due to the logistic challenges and costs of standard research methods. The Clinical Record Interactive Search (CRIS) tool, developed to de-identify and interrogate routinely recorded electronic health records, may provide an alternative to evaluate supported housing services.
METHODS: We assessed the feasibility of using the Camden and Islington NHS Foundation Trust CRIS database to identify a sample of users of mental health supported accommodation services. Two approaches to data interrogation and case identification were compared: using structured fields indicating individuals' accommodation status, and iterative development of free text searches of clinical notes referencing supported housing. The data used were recorded over a 10-year period (01-January-2008 to 31-December-2017).
RESULTS: Both approaches were carried out by one full-time researcher over four weeks (150 hours). Two structured fields indicating accommodation status were found; 2,140 individuals had a value in at least one of the fields representative of supported accommodation. The free text search of clinical notes returned 21,103 records pertaining to 1,105 individuals. A manual review of 10% of the notes indicated that an estimated 733 of these individuals had used a supported housing service, a positive predictive value of 66.4%. Over two-thirds of the individuals returned in the free text search (768/1,105, 69.5%) were also identified via the structured fields approach. Although the estimated positive predictive value was relatively high, a substantial proportion of the individuals appearing only in the free text search (337/1,105, 30.5%) are likely to be false positives.
CONCLUSIONS: It is feasible and requires minimal resources to use de-identified electronic health record search tools to identify large samples of users of mental health supported housing using structured and free text fields. Further work is needed to establish the availability and completion of variables relevant to specific clinical research questions in order to fully assess the utility of electronic health records in evaluating the effectiveness of these services.

Year:  2020        PMID: 32817624      PMCID: PMC7444482          DOI: 10.1371/journal.pone.0237664

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Background

Specialist mental health supported accommodation services are a key component of community-based mental health services in many countries and support people with complex and longer term mental health problems [1]. It is estimated that around 60,000 people use these services in the UK at any time [2, 3]. Three main types have been described: residential care homes provide long-term, high support for the most disabled group (communal facilities, staffed 24 hours, support with all activities of daily living (ADL) including meals, self-care, cleaning, budgeting, medication management, etc.); supported housing services aim to help individuals gain ADL and vocational skills so they can move on to more independent accommodation (self-contained or shared tenancies, staffed up to 24 hours a day, time-limited); floating outreach services provide visiting support for a few hours a week to individuals living in a permanent, self-contained, individual tenancy, with the aim of reducing the hours of support over time to zero. Recently, a new typology for supported accommodation has been developed, the Simple Taxonomy–Supported Accommodation (STAX-SA) [4]. Residential care maps to STAX-SA Type 1, supported housing services map to Types 2 and 3, and floating outreach services map to Types 4 and 5. A recent national programme of research into mental health supported accommodation services across England (the ‘QuEST study’), which included a national survey [1] and a cohort study following 619 service users from 87 services over 30 months [5], demonstrated that it is possible to conduct high-quality studies in this area using traditional research methods (face-to-face interviews with participants). However, it also demonstrated that trials in this area are not feasible [6] and that studies using these research methods in this field are lengthy and expensive to conduct.
It took a team of three full-time researchers over five years to complete data collection, with a total programme cost of around £2 million. Furthermore, a recent systematic review found a lack of empirical evidence and, among the studies that have been conducted, wide heterogeneity in the terminology used to describe services and the outcomes assessed, such that the evidence could not be synthesised to inform practice or service planning [7]. Therefore, more high-quality studies using alternative and more resource-efficient research methods are required. The use of routinely collected healthcare records may provide an alternative approach to assessing the outcomes and effectiveness of these services. These records contain detailed, longitudinal, clinical data and, in recent years, their use in research has been facilitated by the switch from paper to digital records, encouraged by the UK Government’s aim to create a ‘paperless NHS’ [8]. Researchers no longer need to interrogate handwritten paper records available only at the sites where they are physically stored but can instead access electronic health records (EHR) on a computer/device with the appropriate access and information governance permissions. However, EHR contain confidential personal data and are only accessible to researchers who have the informed consent of the individual, requiring participant recruitment and the appropriate permissions from the relevant healthcare organisation(s) (and incurring many of the costs associated with traditional research). To address this issue, the Biomedical Research Centre at South London and Maudsley NHS Foundation Trust (SLaM) developed a tool to extract and de-identify data from EHR, the Clinical Record Interactive Search (CRIS) system [9-11]. It locates ‘Patient Identifiers’ as stipulated by the Caldicott Code on Confidentiality and de-identifies them by removing or replacing the identifiable data (for example, replacing the individual’s name with ‘ZZZZZ’).
The tool can interrogate both structured and free text fields. Structured fields hold a range of demographic and clinical information (such as sex, date of birth, diagnosis, etc.) completed by selecting from a list of options (for example, male/female) or by using a specific format (date). Although structured fields provide data that easily lends itself to quantitative analysis, their utility in research is limited by the structured fields that are available and their poor completion rates owing to the preference of clinicians to record data in natural language [12]. Free text fields include any entered text and typically comprise clinical notes and uploaded documents. They represent an estimated 60–70% of the data in EHR [10, 12] and are a rich source of clinical information, ranging from psychiatric assessments to logs of clinical appointments and referrals. CRIS has been shown to be very accurate in its de-identification [10] and has been approved for use in mental health research, without requiring individuals’ informed consent [9]. Dozens of studies using CRIS with SLaM’s EHR have been published on a number of topics including the characteristics of trafficked adults and children with severe mental illness [13], and the outcomes for users of mother and baby units [14]. However, de-identified EHR have not yet been used to research mental health supported accommodation. It is currently unknown whether it is feasible to identify a sample of individuals who use supported accommodation services using CRIS. In the UK, these services are mainly provided by charities or housing associations and not the NHS. However, these services are an essential component of care for people with complex and enduring mental health problems, most of whom are also using NHS mental health services and therefore will have EHR [1]. There is potential then to employ CRIS as a research tool to evaluate supported accommodation services. 
In 2013, the CRIS tool was deployed in four additional Trusts in the UK, including Camden and Islington NHS Foundation Trust (CIFT). We undertook a recent audit of mental health supported accommodation services in Camden and Islington which identified 57 services providing 783 places at any one time (12 residential care services providing 230 places, 34 supported housing services providing 390 places and 9 floating outreach services providing 163 places). We aimed to assess the feasibility of using CRIS to derive a sample of past and present users of mental health supported accommodation services from CIFT’s EHR by identifying any relevant structured fields and developing a search of free text records.

Methods

Setting

Camden and Islington are both inner London boroughs with a combined population of approximately 470,000. They have a lower proportion of older adults compared to the rest of England (aged 65+ in Camden 11.7%, Islington 8.7%, England 17.7%), more people from Black, Asian and Ethnic minority groups (Camden 33.7%, Islington 32.0%, England 14.5%) [15] and a higher prevalence of adults with psychotic disorders (0.6% Camden, 0.7% Islington, 0.4% England) [16]. Of the 326 local authority areas in England, Camden ranks 74th and Islington 14th as the most deprived [17]. Across both boroughs, supported housing services are provided by several different voluntary organisations and housing associations. CIFT provides inpatient and community mental health services for both boroughs, including general adult, rehabilitation, substance misuse, learning disability and homelessness services, and holds statutory care co-ordination responsibility for people with severe mental health problems subject to the Care Programme Approach (CPA). The Trust’s records were paper based until 2008 when an EHR system was installed. At the end of 2017 there were 126,769 individuals with records in this system [18]. The CRIS tool has been used to de-identify these records making them available for research.

Search approach

Using CRIS, we explored the potential utility of two approaches to obtain a sample of de-identified individuals who have used a supported housing service: i) using structured fields relevant to the individual’s accommodation; and ii) developing a free text search of clinical notes. We also compared the two approaches to see if they identified the same individuals. Finally, we investigated whether it was possible to describe the sample in terms of their clinical characteristics and sociodemographics using structured fields in the CRIS data set, and compare this to a national survey carried out in 2014 which included 619 service users from 87 supported accommodation services. Records between 1 January 2008 and 31 December 2017 were examined. The searches were developed and conducted by CD-L, who has experience as a clinician and researcher in the field of supported accommodation. To assess the potential resource effectiveness of this approach compared to the amount of time and multiple researchers normally required in primary research that requires participant recruitment and research interviews, we limited the available time to carry out the searches to 150 hours (one full-time researcher for four weeks).

Structured fields search for individuals using supported accommodation services

Two structured fields relevant to an individual’s accommodation were identified in the sections of the EHR where the clinician is expected to record and update CPA meeting outcomes (‘cpa_accommodation_desc’) and demographic details (‘accommodation_status_desc’). Both these fields had values (options for the clinician to enter) representative of mental health supported accommodation services. We included values representative of all types of supported accommodation and not just supported housing services because of the heterogeneity in the terminology used to describe supported housing services, and the likelihood that single values are used by different clinicians to record different types of service. The included values were: ‘Supported accommodation’, ‘Supported lodgings’, ‘Supported group home’, ‘Mental Health Registered Care Home’, and ‘Other accommodation with mental health care and support’. For a full list of all the response options (values) available to clinicians for both structured fields, see S1 Table. All entries using either of these fields are stored, so it is possible to identify previous as well as current accommodation status.
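In outline, this approach reduces to selecting distinct individuals who have any of the included values in either accommodation field. The sketch below shows the idea with sqlite3; the field name and included values come from the paper, but the table name, ID column and sample rows are assumptions, since the actual CRIS schema is not public.

```python
import sqlite3

# Included values, as listed in the paper.
SUPPORTED_VALUES = (
    "Supported accommodation",
    "Supported lodgings",
    "Supported group home",
    "Mental Health Registered Care Home",
    "Other accommodation with mental health care and support",
)

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the CPA structured-field records.
conn.execute("CREATE TABLE cpa (person_id INTEGER, cpa_accommodation_desc TEXT)")
conn.executemany(
    "INSERT INTO cpa VALUES (?, ?)",
    [
        (1, "Supported accommodation"),
        (1, "Supported lodgings"),          # all entries are stored, so one
        (2, "Owner occupier"),              # individual can have several records
        (3, "Mental Health Registered Care Home"),
    ],
)

# Count distinct individuals, not records, since accommodation status is
# re-recorded over time.
placeholders = ",".join("?" * len(SUPPORTED_VALUES))
query = (
    f"SELECT COUNT(DISTINCT person_id) FROM cpa "
    f"WHERE cpa_accommodation_desc IN ({placeholders})"
)
n_individuals = conn.execute(query, SUPPORTED_VALUES).fetchone()[0]
print(n_individuals)  # 2 in this toy data (persons 1 and 3)
```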

Free text search of de-identified clinical notes

The flow diagram shown in Fig 1 illustrates the iterative process for the free text search of clinical notes. First, a list of all supported housing services in the area was generated, based on a previous audit we carried out and verified with senior managers of local mental health supported accommodation services. The final list comprised 35 services. A single service search was developed for each service based on its name. As there were four pairs of services with similar names, 31 single service searches were developed before combining these into an ‘all service search’. Each search started simply, using the most distinctive word from the name of the service, so that all clinical notes mentioning this word were returned. Returned results included the unique identification number randomly assigned by CRIS to each individual on the database, the text of the clinical notes that contained the search term and the date each note was recorded. Results were ordered by identification number and note date to facilitate manual review. Refinement of the search was iterative and based on a manual notes review, so that the returned results could be improved in terms of the number of notes they contained, the number of individuals they pertained to and the ratio of true positives to total positives, i.e. the positive predictive value.
Fig 1

Flow diagram of free text search development.

The manual notes review consisted of clinical notes pertaining to the first 10% of individuals listed. If it was clear from the individual’s note(s) that they had previously used or were currently using a supported housing service, the individual (not individual notes) was designated a true positive. A typical example would be a note documenting a clinician’s visit to a service to see the individual. An individual was designated a false positive if the notes pertaining to that individual were not actually referring to a supported housing service, or if a service was referred to but it was unclear whether the individual had ever actually used the service. Reasons for false positives were noted and, if any pattern(s) emerged, used to improve the search term. For example, a search for the fictitious service ‘Forward View’ would initially be based on the search term FORWARD, which would return clinical notes mentioning Forward View but also any other mention of the word ‘forward’. The search could then be improved by further specifying the search term as FORWARD V, or by adding terms so that results could not contain FORWARD THINKING or FORWARD PLANNING. Patterns in the false positives were not limited to the text of the clinical notes; they could also include patterns such as the number of notes returned per individual. For example, if a false positive was more likely than a true positive to have only a single note returned by the search, the search could be refined to include only individuals with more than one matching note. This process was repeated until there was no longer a consistent pattern to the false positives and the positive predictive value was acceptable, i.e. over 25%. If the positive predictive value was high, i.e. over 75%, the search was revised to see if a higher number of returns could be achieved whilst maintaining a high positive predictive value.
The process was therefore a matter of attempting to achieve the optimal balance between specificity (a search that is not over-inclusive and lacking in accuracy) and sensitivity (a search that is not over-exclusive and lacking in sample size). If no pattern to the false positives emerged and the positive predictive value was not acceptable, development of that single service search was stopped and it was not included in the all service search. This procedure was repeated for each supported housing service. The finalised searches for each service were then combined to produce an ‘all service search’. This search went through the same procedure of development and refinement as the single service searches, eventually producing a ‘final all service search’ from which the estimated positive predictive value was determined.
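The review-and-refine loop above can be sketched in a few lines. The 25% and 75% thresholds come from the paper, as does the fictitious ‘Forward View’ example; the matching function and its exclusion list are illustrative only, not the actual SQL used.

```python
# Hypothetical refined single-service search for the fictitious 'Forward View':
# keep notes matching 'FORWARD V', drop known false-positive phrases.
EXCLUDE = ("FORWARD THINKING", "FORWARD PLANNING")

def matches(note: str) -> bool:
    """Return True if a clinical note matches the refined search term."""
    text = note.upper()
    return "FORWARD V" in text and not any(term in text for term in EXCLUDE)

def ppv(true_positives: int, reviewed: int) -> float:
    """Positive predictive value estimated from a manual review sample."""
    return true_positives / reviewed

def decision(estimated_ppv: float) -> str:
    """Accept/refine decision rule with the paper's 25%/75% thresholds."""
    if estimated_ppv < 0.25:
        return "refine further or remove search"
    if estimated_ppv > 0.75:
        return "try broadening while keeping PPV high"
    return "accept"

assert matches("Visited pt at Forward View this morning")
assert not matches("Forward planning discussed at CPA review")
print(decision(ppv(77, 116)))  # final all service search review -> 'accept'
```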

Ethics statement

Researchers who wish to use the CIFT Research Database (CIFT records de-identified by CRIS) are required to have an honorary contract or letter of access with the Trust, and to submit the CRIS Project Application form to the Oversight Committee, detailing the proposed study including the parameters of searches. The form is available here: http://www.candi.nhs.uk/health-professionals/research/ci-research-database/researchers-and-clinicians. An application for the present study was submitted and approved, and the lead researcher (CDL) has an honorary contract with CIFT. All studies using the CIFT Research Database have been granted ethical approval by the NRES Committee East of England—Cambridge Central (14/EE/0177).

Results

Structured fields search

Values representative of mental health supported accommodation in the CPA and demographics accommodation status structured fields were recorded for a total of 1,635 and 882 individuals, respectively. A large majority of the total 126,769 individuals in the database did not have a record in either of these structured fields, even though both fields are intended to record any type of accommodation. There are a total of 59,408 records using the CPA accommodation field and 65,065 records using the demographics accommodation field; in both cases multiple records can pertain to the same individual. See S1 Table for a full list of response options for each structured field and how these options were grouped, and S2 and S3 Tables for the number of individuals in each group.

Free text search of clinical notes

Table 1 illustrates the development of the free text search of clinical notes to identify people who had used a supported housing service. Of the 31 single service searches, 28 attained acceptable positive predictive values; the remaining three were removed and not included in the all service search. Half (14) of the single service searches had an acceptable positive predictive value after the first search; the most iterations required to develop an acceptable single service search was nine (single service search 13).
Table 1

Free text search development: The returned results for the first search, the first iteration and the final search for each service.

Search | First search (clinical notes, individuals, PPV**) | First iteration (clinical notes, individuals, PPV**) | Final search (search or iteration no.*, clinical notes, individuals, PPV**)
Single service search1158226658.3%1797381.8%1st158226658.3%
2185643925.0%167734033.3%5th2338866.6%
341010736.4%42271st41010736.4%
482410850.0%1381st82410850.0%
52937435.7%31171st2937435.7%
67492450.0%7112177.1%4th2564785.7%
7725617740.0%18519380.0%5th410117100.0%
819798420.0%59526211.1%6th268108100.0%
9103928414.3%9822416.5%4th9636100.0%
102555785.7%100317385.7%2nd100317385.7%
111194038.5%118110066.6%2nd118110066.6%
121554581.8%2667385.7%2nd2667385.7%
1314726370.0%9654640.0%10th1608462.5%
144239133.3%1615240.0%3rd543341.7%
15292319163.6%441861st292319163.6%
16157321236.4%1st157321236.4%
17711030550.0%65602883rd448724483.3%
18222461225.0%101825017.6%3rd1005240
191758897164382316.6%7th44719844.4%
204317344.4%281141st4317344.4%
217523440.0%811715.4%REMOVED
222177471.4%217752nd21775
2379812433.3%33201st79812433.3%
2411073150.0%10783040.0%REMOVED
255639815115877th104634528.6%
26211117566.7%1st211117566.6%
2731910166.7%26711733.3%1st31910166.7%
2828592REMOVED
291178704396th2213100.0%
301503237.5%1st1503237.5%
312226160.0%1st2226160.0%
All service search | 23,501 notes, 1,822 individuals | 22,755 notes, 1,076 individuals, 59.4% | 3rd: 21,103 notes, 1,105 individuals, 66.4%

Date range for search results: 01-Jan-2008 to 31-Dec-2017.

*1st = first search, 2nd = first iteration.

**Based on a manual review of a random 10% of identified individuals.

The final all service search returned a total of 21,103 de-identified clinical notes pertaining to 1,105 individuals. Notes for 116 individuals (10.5%) were reviewed, giving a positive predictive value of 77/116 (66.4%). Extrapolating this rate to the remainder of the results produced an estimated 733/1,105 (66.4%) true positives. In the initial all service search, one of the key differences between true positive and false positive individuals was that false positives were much more likely to have only a single clinical note returned, whereas true positives most often had multiple matching notes. Therefore, a condition was added to the search whereby individuals were removed from the results if they had only a single note matching the search terms. This largely explains the reduction in the number of individuals relative to the number of clinical notes between the 1st search (1,822 individuals and 23,501 notes) and the 1st iteration of the all service search (1,076 individuals and 22,755 notes): a reduction of 746 individuals and, because each removed individual had exactly one matching note, exactly 746 notes. This was the only search condition based on frequency patterns rather than the text content of notes. A full log of the search term development, including the identification of false positive patterns and the SQL search code, is archived on the CIFT CRIS Research Database and is available on request.
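The two calculations in this paragraph (the extrapolated PPV and the single-note filter) are easy to reproduce. The reviewed/true-positive counts are from the paper; the per-person note counts below are toy data standing in for the real search results.

```python
from collections import Counter

# Extrapolating the manual-review PPV (numbers reported in the paper).
reviewed, true_pos = 116, 77
ppv = true_pos / reviewed                 # ~0.664
estimated_true = round(ppv * 1105)        # estimated true positives of 1,105

# Single-note filter: drop individuals with only one matching note. Because
# each removed individual contributes exactly one note, the individual count
# and the note count fall by the same amount (746 and 746 in the paper).
notes_per_person = Counter({"a": 5, "b": 1, "c": 3, "d": 1})  # toy data
kept = {p: n for p, n in notes_per_person.items() if n > 1}
removed_people = len(notes_per_person) - len(kept)
removed_notes = sum(notes_per_person.values()) - sum(kept.values())
print(estimated_true, removed_people, removed_notes)  # 733 2 2
```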

Comparing the structured fields and free text search approach

Fig 2 shows how many individuals appeared in each of the three searches (the free text search of clinical notes and the two structured field searches), and the overlap between them. Of the 1,105 identified in the free text search, 739 (66.9%) were also identified in the CPA structured field, but only 249 (22.5%) also appeared in the demographics structured field. A total of 768/1,105 (69.5%) of those identified in the free text search were also identified by one of the two structured field searches. The structured fields combined identified 2,140 unique individuals. All sources combined identified a sample of 2,477 unique individuals in total. Overall, 925 (37.2%) appeared in at least two of the searches and 220 (8.8%) appeared in all three. A total of 337 individuals appeared only in the free text search.
Fig 2

A Venn diagram showing the overlap of individuals between the structured fields and free text search.
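Once each search returns a set of de-identified IDs, the quantities in the Venn diagram reduce to set operations. The toy ID sets below are illustrative; the counts quoted in the text come from the real searches.

```python
# Hypothetical de-identified ID sets for the three searches.
free_text = {1, 2, 3, 4, 5}
cpa_field = {2, 3, 4, 6, 7}
demo_field = {3, 4, 8}

in_any_structured = free_text & (cpa_field | demo_field)  # cf. 768/1,105
only_free_text = free_text - cpa_field - demo_field       # cf. 337
in_all_three = free_text & cpa_field & demo_field         # cf. 220
total_unique = len(free_text | cpa_field | demo_field)    # cf. 2,477

print(sorted(only_free_text), len(in_all_three), total_unique)  # [1, 5] 2 8
```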

Sociodemographics of the structured fields and the clinical notes free text search samples

Table 2 shows the sociodemographics and diagnosis of the individuals identified from each search approach, extracted from structured fields within the EHR CRIS database, and from service users that participated in a national survey of supported accommodation carried out in 2014 [1]. Around two-thirds in each are male (59.3% - 66.8%), the mean age is between 41.7 and 47.1, the proportions that are White range from 53.7% to 81%, most are single (66% - 83.9%), and the most frequently recorded diagnosis is schizophrenia or psychosis (39.0% - 63.6%). The greatest difference between the search approaches and the national survey was ethnicity, where 53.7% to 60.4% in the search approaches were White compared to 81% in the national survey. This reflects the greater proportion that are from Black, Asian and Ethnic minority groups in Camden and Islington compared to the rest of the country.
Table 2

Sociodemographics and diagnosis of the individuals identified by the different approaches, and from the national survey [1].

 | | CPA structured field (N = 1635) | Demographics structured field (N = 882) | Clinical notes free text search (N = 1105) | National survey (N = 619)*
Sex — n (%) | Male | 1051 (64.3%) | 521 (59.1%) | 738 (66.8%) | 410 (66%)
 | Unknown/Missing | 2 (0.1%) | 4 (0.5%) | 1 (0.1%) | 0 (0%)
Age† | Mean (SD) | 47.1 (16.3) | 41.7 (15.8) | 43.7 (14.4) | 46.0 (13.5)
 | Unknown/Missing | 0 | 2 | 0 | 0
Ethnicity — n (%) | Asian | 88 (5.4%) | 43 (4.9%) | 63 (5.7%) | -
 | Black | 419 (25.6%) | 175 (19.8%) | 324 (29.3%) | -
 | Mixed | 70 (4.3%) | 35 (4.0%) | 65 (5.9%) | -
 | White | 988 (60.4%) | 492 (55.8%) | 593 (53.7%) | 499 (81%)
 | Unknown/Missing | 70 (4.3%) | 137 (15.5%) | 60 (5.4%) | -
Marital status‡ — n (%) | Divorced/Separated/Widowed | 211 (12.9%) | 91 (10.3%) | 108 (9.8%) | -
 | Married/Civil partner | 86 (5.3%) | 48 (5.4%) | 44 (4.0%) | -
 | Single | 1311 (80.2%) | 619 (70.2%) | 927 (83.9%) | 406 (66%)**
 | Unknown/Missing | 27 (1.7%) | 124 (14.1%) | 26 (2.4%) | -
Diagnosis§ — n (%) | Dementia/organic disorder | 92 (5.6%) | 23 (2.6%) | 21 (1.9%) | -
 | Alcohol/substance misuse§§ | 86 (5.3%) | 92 (10.4%) | 63 (5.7%) | -
 | Schizophrenia/psychosis | 921 (56.3%) | 344 (39.0%) | 703 (63.6%) | 381 (62%)***
 | Affective disorder | 209 (12.8%) | 113 (12.8%) | 120 (10.9%) | 169 (27%)****
 | Personality disorder | 139 (8.5%) | 77 (8.7%) | 71 (6.4%) | -
 | Other | 51 (3.1%) | 18 (2.0%) | 29 (2.6%) | 66 (11%)
 | Unknown/Missing | 137 (8.4%) | 215 (24.4%) | 98 (8.9%) | 3 (0.5%)

*National survey of supported accommodation services in England 2014; 159 residential care service users, 251 supported housing and 209 floating outreach [1].

†Calculated from the median date within the search parameters (01-January-2008 to 31-December-2017, median date: 31-December-2012) and date of birth.

**'Never married or cohabited'.

***'Schizophrenia' & 'Schizoaffective disorder'.

****'Bipolar affective disorder' & 'Depression or anxiety'.

‡The most frequently recorded marital status for individuals.

§The most recently recorded diagnosis.

§§Mental health or behavioural problem due to alcohol/substance misuse.


Discussion

To our knowledge, this is the first study to investigate the feasibility of using de-identified EHR within a confidentiality framework to identify a sample of mental health supported housing service users. We have shown that, with very limited time and personnel resources relative to traditional research methods, it is possible to identify a large sample of de-identified people using these services and describe this sample in terms of their sociodemographics. The overlap between the structured fields and the free text searches, which totalled 768 individuals, provides some degree of validation that this group had used a supported housing service. However, the utility of conducting the free text search in addition to the structured field searches needs to be considered. It is unlikely a clinician would take the time to complete the structured field to record that an individual is residing in a supported accommodation service if this were not the case. However, there was a high level of missing data in the structured fields, and human error is always a possibility, so using this approach alone could introduce bias. The free text search did facilitate a focus on a specific type of supported accommodation, but this was only possible with knowledge of the local supported housing services, and this approach will always return some false positives. The pros and cons of the two approaches therefore need to be weighed in deciding which to use or whether both are required. This will depend on the focus of the evaluation being undertaken. For example, whilst the structured field search does not allow comparison of outcomes for different types of supported accommodation, combined with other routinely recorded clinical information (such as inpatient service use) it could potentially be used to evaluate the effectiveness of supported accommodation services.
Free text data could be used to categorise the type of supported accommodation used for comparison, and to provide additional outcome data, such as whether the individual moved on to more independent accommodation successfully.

Strengths and limitations

Most NHS Trusts have a clinical records policy that emphasises the importance of staff keeping certain fields up to date to facilitate best practice and patient safety. However, the structured fields for accommodation used in this study had poor completion rates, an issue that has been well documented [12] and reported in other studies [18]. As these variables are not completed systematically by staff, there may be reasons why these fields are completed for some individuals but not others (e.g. greater stability in housing), which would introduce selection bias in this approach. There may also be causes of selection bias in the free text search approach, as individuals with greater clinical contact are more likely to have a greater number of records and therefore more likely to be returned by the search. However, most people in supported accommodation have complex and longer term mental health problems [1] and are therefore likely to have an extensive history of contact with NHS mental health services. Poor completion rates are a limitation applicable to all research that uses routinely recorded data. A further issue relevant to all secondary research is that the data are not collected specifically for the purpose of the study, and the potential research questions that can be addressed are limited by the available data. This ought to be balanced against the access to large datasets and the relatively low resource required. It should also be noted that there was a relatively good completion rate for the fields used in this study to collect sociodemographic data, including diagnosis. We would also expect that, as staff become more familiar with the technology and record systems, the accuracy of records and completion rates improve. To our knowledge this has not yet been confirmed by research, but stratifying results by year in future validation studies may be of value.
Although we consider it a strength of the study that the researcher was able to complete the search in a short timeframe, more researcher time to investigate the availability and completion rate of further relevant variables would have strengthened the study. More time would also have allowed us to expand the free text searches to all uploaded documents and to all types of supported accommodation service. Using a second rater to manually review and code clinical notes would also have increased the validity of our free text search of clinical notes. As expected, developing the free text search took the majority of the time available for the study. However, an unforeseen issue arose that inevitably reduced the number of free text results. Service names, our key search terms, often also included part of the service address. As the patient’s address is considered a Patient Identifier, all mentions of it in their records are either removed (in structured fields) or masked (in free text fields). Therefore, any patient using a service whose name included any part of their address would not have been included in our free text search of clinical notes. Finally, as our free text search approach requires knowledge of the local services available, contact with key personnel or a freedom of information request is needed.

Future directions

An alternative approach would be to match the address field of patients to a list of known supported accommodation addresses within the Trust catchment area. The address field is a Patient Identifier, so it would need to be kept blinded to the researcher. Such an approach has already been used to identify admissions to care homes for older people in a study using the South London and Maudsley CRIS de-identified database [19]. Another possibility would be the use of natural language processing (NLP) to identify instances of supported accommodation in free text, or other potential clinical outcomes (e.g. successfully moving from a supported accommodation service to a more independent setting). NLP draws on machine learning and is used to automate the analysis of free text. It requires two data sets: a training set and a test set. Both sets contain the same type of input data, but only the training set contains the output. The output is typically generated by a human analysing the input data and deciding what the output should be (similar to our manual notes review). The computer program analyses the training data set, looking for associations between the input and output data, and creates a model that can be used on the test data set to predict the output. This methodology has been widely and effectively applied to EHR systems [11]; many NLP applications have been developed to assist in the identification of samples/outcomes that are not readily available in structured fields (e.g. identifying symptoms of severe mental illness [20] and suicidal ideation [21]).
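As a deliberately toy illustration of the training/test idea described above, the sketch below fits a minimal bag-of-words Naive Bayes classifier on a handful of invented, human-labelled notes and applies it to an unseen note. Real NLP applications for EHR are far more sophisticated; every note, label, and name here is hypothetical:

```python
from collections import Counter, defaultdict
import math

def tokenize(text):
    return text.lower().split()

class NaiveBayesTextClassifier:
    """Minimal bag-of-words Naive Bayes: the training set pairs input notes with
    human-assigned labels; the fitted model then predicts labels for unseen notes."""

    def fit(self, notes, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for note, label in zip(notes, labels):
            for tok in tokenize(note):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, note):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(note):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training notes, labelled as a human coder might label them.
train_notes = [
    "moved into supported accommodation last week",
    "resident at supported housing project doing well",
    "attended outpatient clinic, lives independently",
    "discharged home to own flat, no housing support",
]
train_labels = ["supported", "supported", "not_supported", "not_supported"]

model = NaiveBayesTextClassifier().fit(train_notes, train_labels)
print(model.predict("now living in supported accommodation"))  # → "supported"
```

In a real CRIS study the training labels would come from a manual notes review like the one described in this paper, and the fitted model would be evaluated on a held-out test set before being applied at scale.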

Conclusions

This study demonstrates that it is possible to use de-identified EHR to identify a large sample of individuals who have used mental health supported accommodation services. This is a promising development in a field that is difficult and expensive to study through traditional research methods. However, it is important to consider the limitations of secondary research. Studies need to be designed with knowledge of the clinical data that are routinely collected, the variables that have sufficient completion rates and, for free text searches, the local supported accommodation services.

Original values in structured fields and their grouping term.


Search results for the CPA structured field.


Search results for the demographics structured field.

19 May 2020

PONE-D-20-10079
Using de-identified electronic health records to research mental health supported housing services: a feasibility study
PLOS ONE

Dear Mr Dalton-Locke,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Jul 03 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:
• A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as a separate file and labeled 'Response to Reviewers'.
• A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as a separate file and labeled 'Revised Manuscript with Track Changes'.
• An unmarked version of your revised paper without tracked changes. This file should be uploaded as a separate file and labeled 'Manuscript'.
Please note that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,
Sreeram V. Ramagopalan
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers.
Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. We will update your Data Availability statement on your behalf to reflect the information you provide.

3. Please include a separate caption for each figure in your manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Editor, thank you very much for giving me the opportunity to review manuscript PONE-D-20-10079. In this feasibility study the authors investigated whether de-identified electronic health records (EHR) can be used effectively as a tool to identify large samples of users of mental health supported housing, using structured fields and free text searches. The authors concluded that it is feasible and resource efficient to use the Clinical Record Interactive Search (CRIS) tool to identify individuals who have used mental health supported accommodation services. The manuscript is well structured and is relevant in a digital world where we have patient data with huge potential for research. The study is a good first step in utilizing EHR data in the field of mental health support services. The study sample is very large, and the authors discuss multiple ways to identify individuals using mental health supported accommodation services. However, there is potential to improve the description of the methodology and to add more analysis to increase the value of the study. I see a main problem with the methodology that needs further clarification and explanation to convey the results of the study clearly. First, this feasibility study explains different approaches to identify users of mental health supported accommodation services.
But the estimated true positive value of only one approach (i.e. the free text search approach) is presented. It would add value to the manuscript if true positive values of the structured field search approaches were also estimated and presented. In line 265 the authors mention that it is unlikely that a clinician would add false information on a supported accommodation service in a structured field; however, there have been multiple validation studies showing less than perfect positive predictive value (PPV) of clinical diagnoses in EHR. Therefore, I think it is likely that the PPV of the CPA structured approach is less than 100%.

Second, the study is missing the 'validity' of the identification methods used by the authors. Ideally, manual chart review of 'random' samples from the identified individuals should be performed to get the positive predictive value (PPV), e.g. PPV of combining the CPA structured field approach and the free-text approach; PPV of combining the CPA structured approach and the structured demographic field approach; and PPV of combining all three approaches together. The closest estimate provided is the true positive rate of the free text search, which was performed for the first 10% of individuals after sorting the results by note date (line 166) (i.e. not random). Having information about the validity of the individual and combined search approaches will certainly add value to this study.

Finally, the technical details in the Method section should be expanded to ensure that readers understand exactly how the authors identified individuals; and missingness and selection bias need to be further discussed. Please see my detailed comments for each section below:

Abstract:
• The abstract is well-written.
• Line 28: the study is not based on data over the last 10 years.
• It would be informative if the authors could add something about the 'setting' of the study or add the name of the mental health trust in the Method section.
• Result section, line 34: "A manual review of these notes…" Please add, "…manual review of 10% of the notes…".
• Result section, lines 35-36: is there any reason for using the term 'true positive rate'? I think the more widely used term is 'Positive Predictive Value'.
• Result section, line 39: The statement that these 337 individuals are likely to be false positives assumes that individuals identified by the structured field search are all true positives. This is a strong assumption. Please see my comment on the Discussion section below.
• Conclusions: The term 'efficient' is very subjective and I suggest using it carefully. In this study the authors fixed the resources before the study. Hence, I cannot see that the conclusion of efficiency is based on evidence generated by this study. Please see my comment on the Discussion section below.
• Conclusions are based only on results of structured fields; why is free text not mentioned?

Background:
• This section provides background and a good overview of the key literature. However, the section is missing background on the need for the problem addressed. It appears that the problem is 'identification' of people in EHR who have used mental health supported accommodation services. What methodologies have been used in the past to identify such people in the EHR in the same field, or other closely related fields, using the CRIS platform? And what were the challenges?

Methods:

Setting:
• Lines 127-128: Are there any studies that have investigated the completion of EHR data over the years since it started in 2008? I expect the completion of data to improve over time. If there are differences, then it is a good idea to stratify the results according to years.
• I assume there must have been changes to the EHR system or healthcare system in the 10 years of the study. Was there any reason for including all available EHR data since 2008 and not restricting the study to recent years only?
I assume the reason was to increase the sample size, but since it is a feasibility study a smaller sample would be acceptable.

Search approach:
• Lines 135-136: "….sample in terms of their sociodemographics using structured fields, and…" I think Table 2 also has information from the free text search.
• Lines 136-137: What was the reason for deciding to compare the sociodemographic data to the national survey from 2014? The study is based on only 2 of the 326 local authority areas, spread from 2008-2017, and is known to be different (as the authors mentioned in the first paragraph of the Setting).
• Lines 140-142: I could not find any details on how the authors assessed resource effectiveness. How was it measured, and what was measured? We cannot assess effectiveness by fixing the time (= resources); e.g. if we provide only 8 hours to work on something then we will get results, but quality will be compromised. Therefore, it depends on what quality was desired, which is not explained. Hence, we cannot make the resource assessment. However, I do agree that database studies are in general less resource demanding than a prospective real-world study or a clinical trial, but this study does not provide evidence supporting that.

Free text search of de-identified clinical notes:
• Line 166: "Results were ordered by identification number and note date." What was the reason for sorting the results by date and identification number? Ideally, manual chart review should be of a random sample.
• Line 169: I assume the true positive rate is the same as positive predictive value (PPV). If this assumption is correct then the definition of true positive rate is incorrect: it should be the ratio of true positives to total positives. Although the calculation is correct in the result section, the definition is incorrect.
• Lines 186-187: Do the authors have any reference to support this methodology of balancing sensitivity and specificity? Any previous study that has used a similar approach and calculated sensitivity and specificity, providing evidence that this approach truly balances sensitivity and specificity?

Results:

Structured fields search:
• What was the total base population? I assume it was 126,769 (line 128).
• Lines 197-199: It is not clear how many individuals have no records of mental health accommodation services. I assume that out of the total individuals (i.e. 126,769), 1,635 had records of mental health supported accommodation services, 9,545 had missing/unknown values, and the rest did not use any mental health supported accommodation services. Is that correct? Please clarify. Also, it is not clear how many individuals in total had CPA field records.
• Missingness could reflect true absence of use of mental health supported accommodation. What was the assumption made for missing/unknown subjects? Were they assumed to have no mental health supported accommodation, or was the data 'missing'? Maybe a flowchart would help.
• Table 1: The authors should add a footnote explaining that the true positive value was derived from manual review of 10% of the identified individuals.
• Lines 218-220: Please mention this exclusion criterion in the Method section; it is not currently explained there. Please also add how many individuals were excluded by this criterion in the final search.
• Lines 218-220: "Therefore, a condition was added to the search whereby individuals were removed from the results if they only had a single note matching the search term." Why were these individuals not added again after iterating the search term?

Comparing the structured fields and free text search approach:
• Table 2: It illustrates not only the sociodemographic but also the clinical characteristics.

Discussion:
• Lines 255-261: These can be moved to the result section or deleted.
• Lines 260-261: "….it is likely that many of these 337 are false positives." This is a big assumption considering we have 'estimated' true positive rates. With similar assumptions, 45.2% (Figure 2) of individuals identified in the CPA structured field would 'likely' be false positives, as they appeared only in the CPA structured search. Because the authors did not identify exactly who the false positives in the text search are, and did not estimate a true positive value for the CPA structured field search, it will be difficult to conclude anything.
• Line 285: "However, the free text search did not appear to significantly enhance sensitivity." The authors did not estimate 'sensitivity' in this study, so there is no data to support this statement.
• Line 299: "However, an unforeseen issue arose that inevitably reduced the number of free text results." I think it would be worthwhile to mention what the issue was and how it impacted the results. If it was fixed, then this can be deleted.
• Maybe the authors can shed light on the impact of ever-changing technology during the study. I assume there have been changes to the EHR, and I assume the quality of data (e.g. completion) varies as familiarity with the system increases.
• Some of the single service searches reached a PPV of 100% and others remained as low as 35%. Is it possible that data quality varies between different housing services? Is it possible that data quality varies between providers as well? E.g. some clinicians would not ask or report on accommodation status.
• I assume the structured variables that are used in this study to identify the cases are not systematically reported. Therefore, missingness is not at random, leading to selection bias. Maybe it is worthwhile to discuss selection bias in this study.

Conclusions:
• I think the authors' claim of 'resource efficiency' is not substantially supported with evidence.

Other comments:
• References are not in the journal's acceptable style. Many are missing the volume and page number or DOI.
• The authors have used some terms inconsistently, e.g. effectiveness and efficiency are interchanged. I suggest using terms consistently to make the manuscript easy to read.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

29 Jul 2020

UCL DIVISION OF PSYCHIATRY
FACULTY OF BRAIN SCIENCES

23 July 2020

Dear Sreeram,

Many thanks to you and the reviewer for the very helpful comments and the opportunity to improve our manuscript.
As requested, we have responded to the reviewer's comments (in red, below) and have submitted a version of the manuscript with tracked changes to illustrate the revisions made since the original submission, as well as a clean version without tracked changes. We hope that you both find this satisfactory and look forward to your response.

Yours sincerely,
Christian Dalton-Locke
PhD student

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Response: Files renamed and separate files created for supporting documents (previously 'Appendices').

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

Response: There are ethical and legal restrictions on sharing this data set. The data cannot be shared upon request unless the person is an authorised researcher adhering to the relevant protocols.
Further details can be found under the headings 'Ethics approval and consent to participate' and 'Availability of data and material'.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. We will update your Data Availability statement on your behalf to reflect the information you provide.

3. Please include a separate caption for each figure in your manuscript.

Reviewers' comments:

Reviewer #1: First, this feasibility study explains different approaches to identify users of mental health supported accommodation services, but the estimated true positive value of only one approach (i.e. the free text search approach) is presented. It would add value to the manuscript if true positive values of the structured field search approaches were also estimated and presented.

Response: Thank you for this comment and thoughtful suggestion. Estimated true positive values of the structured field approaches would indeed add value to this study, but unfortunately we do not believe it feasible to obtain such a measure. The closest we would be able to get to an estimate of the true positive values of the structured field approach would be to review uploaded documents and/or clinical notes recorded around the same date the structured field was completed. However, we believe that this would likely produce an estimate that greatly underestimates the actual true positive rate for this sample. This is mainly because it is not standard or routine practice for NHS staff to record whether their patients are using supported accommodation (these services are not usually provided by the NHS), and because they have recorded it using structured fields does not necessarily mean they will also record it in clinical notes or uploaded documents. Furthermore, the de-identified nature of the data means it is not possible to use data available outside of the EHR system to validate the structured fields.
Reviewer #1: In line 265 the authors mention that it is unlikely that a clinician would add false information on a supported accommodation service in a structured field; however, there have been multiple validation studies showing less than perfect positive predictive value (PPV) of clinical diagnoses in EHR. Therefore, I think it is likely that the PPV of the CPA structured approach is less than 100%.

Response: We agree that structured fields on accommodation status are unlikely to be 100% accurate; human error is always going to be an issue/limitation when using health records for research. We have made a tracked change to reflect this (p.19).

Reviewer #1: Second, the study is missing the 'validity' of the identification methods used by the authors. Ideally, manual chart review of 'random' samples from the identified individuals should be performed to get the positive predictive value (PPV), e.g. PPV of combining the CPA structured field approach and the free-text approach; PPV of combining the CPA structured approach and the structured demographic field approach; and PPV of combining all three approaches together.

Response: As described above, we agree that adding PPVs for the structured fields would add value to this study, but unfortunately we believe it unfeasible to obtain accurate estimates of these. We feel that showing the overlap between the three different approaches still provides a degree of validation of each approach (Figure 2, Venn diagram, p.16): a researcher would reasonably be more confident of their sample using the overlap of all three indicators rather than working with a sample identified from just one of the approaches.

Reviewer #1: The closest estimate provided is the true positive rate of the free text search, which was performed for the first 10% of individuals after sorting the results by note date (line 166) (i.e. not random).

Response: Thank you for this comment. The results were ordered by patient identification number, which is randomly assigned by CRIS, and then by note date; the sample is therefore random, but this was not clear. We have added tracked changes to clarify this (p.9).

Reviewer #1: Having information about the validity of the individual and combined search approaches will certainly add value to this study. Finally, the technical details in the Method section should be expanded to ensure that readers understand exactly how the authors identified individuals; and missingness and selection bias need to be further discussed.

Response: I have added text explaining exactly which values in the structured fields were used to identify our sample (p.8-9), and a reference to Table S1, which lists all possible response options for both fields. I have also added discussion of the potential selection bias of both search approaches under 'Limitations' (p.19-20).

Reviewer #1: Please see my detailed comments for each section below:

Abstract:
• The abstract is well-written.
• Line 28: the study is not based on data over the last 10 years.

Response: Thank you for spotting this. This line has been corrected (p.2).

• It would be informative if the authors could add something about the 'setting' of the study or add the name of the mental health trust in the Method section.

Response: Name of mental health Trust added (p.2).

• Result section, line 34: "A manual review of these notes…" Please add, "…manual review of 10% of the notes…".

Response: Added (p.2).

• Result section, lines 35-36: is there any reason for using the term 'true positive rate'? I think the more widely used term is 'Positive Predictive Value'.

Response: We agree, thank you for this comment. All mentions of 'true positive rate' changed to 'positive predictive value' (p.2).

• Result section, line 39: The statement that these 337 individuals are likely to be false positives assumes that individuals identified by the structured field search are all true positives. This is a strong assumption. Please see my comment on the Discussion section below.
We feel this statement only assumes that individuals appearing in both the free text search and one of the structured fields are more likely to be true positives than those appearing only in the free text search, which we think is a reasonable assumption.

Conclusions:
• The term 'efficient' is very subjective and I suggest using it carefully. In this study the authors fixed the resources before the study. Hence, I cannot see that the conclusion of efficiency is based on evidence generated by this study. Please see my comment on the Discussion section below. The word 'efficient' has been replaced with 'requires minimal resources' (p.3).
• The conclusions are based only on the results of the structured fields; why is the free text approach not mentioned? Mention of the free text approach added (p.3).

Background:
• This section provides background and a good overview of the key literature. However, it is missing the background on the problem being addressed. It appears that the problem is the 'identification' of people in EHRs who have used mental health supported accommodation services. What methodologies have been used in the past to identify such people in EHRs, in the same field or other closely related fields, using the CRIS platform? And what were the challenges? Thank you for this comment, enabling us to address this. We have added text to the Background (p.6) explaining that de-identified EHRs have not yet been used to research mental health supported accommodation. Regarding challenges, we have added that although supported accommodation services are not usually provided by the NHS, there is potential for using CRIS to identify their users, as most are also using NHS mental health services.

Methods: Setting:
• Lines 127-128: Are there any studies that have investigated the completion of EHR data over the years since the system started in 2008? I expect the completion of data to improve over time. If there are differences, then it would be a good idea to stratify the results according to year.
We found a study reviewing the evolution of EHRs since 1992 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5171496/) but we are unaware of any study reporting completion rates of records by year. However, we agree it is likely that as staff become more familiar with the records system and the technology, completion rates improve, and we have added text discussing this (p.20).

• I assume there must have been changes to the EHR system or the healthcare system over the 10 years of the study. Was there any reason for including all available EHR data since 2008 rather than restricting the study to the most recent few years? I assume the reason was to increase the sample size, but since this is a feasibility study a smaller sample would be acceptable. We wanted to see if we could identify a relatively large sample, not just a sample, and therefore wanted to maximise our sampling frame and the period of time covered, especially given that lengths of stay at these services are four years on average (https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-016-0797-6).

Search approach:
• Lines 135-136: "….sample in terms of their sociodemographics using structured fields, and…" I think Table 2 also has information from the free text search. Here, we meant that we used structured fields to investigate the sociodemographic (and clinical) characteristics of the sample identified using the different search approaches (including the free text search), and to compare this to a national survey. We have added text here to hopefully make this clear (p.18).
• Lines 136-137: What was the reason for deciding to compare the sociodemographic data to the national survey from 2014? The study is based on only 2 of the 326 local authority areas, spread from 2008-2017, and is known to be different (as the authors mentioned in the first paragraph of the Setting).
This national survey provides the best description available of individuals who use mental health supported accommodation services, and so although different, we think it is still a useful comparison to make. The last sentence in this section notes this difference and its effect on the results in this table (p.18).

• Lines 140-142: I could not find any details on how the authors assessed resource effectiveness. How was it measured, and what was measured? We cannot assess effectiveness by fixing the time (i.e. the resources). For example, if we provide only 8 hours to work on something then we will get results, but quality will be compromised. Therefore, it depends on what quality was desired, which is not explained; hence we cannot make the resource assessment. However, I do agree that database studies are in general less resource demanding than a prospective real-world study or a clinical trial, but this study does not provide evidence supporting that. We wanted to show whether this approach could return a large sample with limited researcher hours compared to what would normally be invested in a primary research study requiring participant recruitment and research interviews. We have added text to the Methods to clarify this (p.8).

Free text search of de-identified clinical notes:
• Line 166: "Results were ordered by identification number and note date." What was the reason for sorting the results by date and identification number? Ideally, the manual chart review should be performed on a random sample. Thank you for noting this. As described above, text has been added to clarify how this is a random sample (p.9).
• Line 169: I assume the true positive rate is the same as the positive predictive value (PPV). If this assumption is correct, then the definition of the true positive rate is incorrect; it should be the ratio of true positives to total positives. Although the calculation in the Results section is correct, the definition is incorrect. Thank you for noting this. Definition corrected (p.9).
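To make the corrected definition concrete: the PPV is simply true positives divided by all positives returned by the search. A minimal sketch in Python (the 73/110 review counts below are illustrative placeholders, not the study's actual manual-review tallies):

```python
def ppv(true_positives: int, total_positives: int) -> float:
    """Positive predictive value: true positives / all positives returned."""
    if total_positives == 0:
        raise ValueError("no positives to review")
    return true_positives / total_positives

# Illustrative only: if, say, 73 of 110 manually reviewed individuals were
# confirmed users of supported housing, the PPV would be ~66.4%, the same
# magnitude reported for the free text search (733 of 1,105 individuals).
print(f"{ppv(73, 110):.1%}")  # 66.4%
```

Note this is distinct from sensitivity (true positive rate in the formal sense), which would require knowing how many true users the search missed.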
• Lines 186-187: Do the authors have any reference to support this methodology of balancing sensitivity and specificity? Is there any previous study that has used a similar approach and calculated sensitivity and specificity, providing evidence that this approach truly balances sensitivity and specificity? We have added text to explain the reason for this methodology: iterations were an attempt to increase/maintain accuracy whilst maintaining the number of returns (p.10).

Results: Structured fields search:
• What was the total base population? I assume it was 126,769 (line 128).
• Lines 197-199: It is not clear how many individuals have no records of mental health supported accommodation services. I assume that, of the total individuals (i.e. 126,769), 1,635 had records of mental health supported accommodation services, 9,545 had missing/unknown values, and the rest did not use any mental health supported accommodation services. Is that correct? Please clarify. Also, it is not clear how many individuals in total had CPA field records.
• Missingness could reflect true absence of use of mental health supported accommodation. What was the assumption made for missing/unknown subjects? Were they assumed to have no mental health supported accommodation, or was the data 'missing'? Maybe a flowchart would help. This section has largely been re-written to clarify the issues raised here (p.10).
• Table 1: The authors should add a footnote explaining that the true positive value was derived from manual review of 10% of the identified individuals. Footnote to table added (p.14).
• Lines 218-220: Please mention this exclusion criterion in the Method section; it is not currently explained there. We have added the following text under the Methods sub-section Free text search (p.10): “Patterns to false positives were not limited to text included in the clinical note but could also include patterns such as the number of notes returned per individual.
For example, if a false positive were more likely than a true positive to have only a single note returned by the search, then the search could be refined to include only individuals who have more than one note pertaining to them.” Please also add how many individuals were excluded by this criterion in the final search. We have added text stating the difference in the number of individuals and notes between the first search and the first iteration (p.15).
• Lines 218-220: “Therefore, a condition was added to the search whereby individuals were removed from the results if they only had a single note matching the search term.” Why were these individuals not added back after iterating the search term? Because it was reasoned that individuals with only a single note returned would still be more likely (though not always) to be false positives. The logic behind this was that if an individual really was using supported accommodation then they would likely have more than one clinical note referencing this, and so more than one note returned by the search.

Comparing the structured fields and free text search approaches:
• Table 2: It illustrates not only sociodemographic but also clinical characteristics. Table title and relevant text in the main text changed to reflect this (p.17-18).

Discussion:
• Lines 255-261: These can be moved to the Results section or deleted. Deleted (p.18).
• Lines 260-261: “….it is likely that many of these 337 are false positives.” This is a big assumption, considering we have only 'estimated' true positive rates. Under similar assumptions, 45.2% (Figure 2) of individuals identified in the CPA structured field would 'likely' be false positives, as they appeared only in the CPA structured search. Because the authors did not identify exactly who the false positives in the free text search were, and did not estimate a true positive value for the CPA structured field search, it will be difficult to conclude anything. Deleted (p.18).
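The single-note exclusion criterion quoted above amounts to grouping search hits by individual and dropping anyone with only one matching note. A minimal sketch, assuming a simple (individual, note) pair structure for the search results (not the actual CRIS output format):

```python
from collections import Counter

def drop_single_note_individuals(hits):
    """Keep only hits for individuals with more than one matching note.

    `hits` is a list of (individual_id, note_id) pairs returned by a
    free text search. Individuals with a single matching note are
    treated as likely false positives and removed from the results.
    """
    notes_per_person = Counter(pid for pid, _ in hits)
    return [(pid, nid) for pid, nid in hits if notes_per_person[pid] > 1]

hits = [("A", 1), ("A", 2), ("B", 3), ("C", 4), ("C", 5)]
print(drop_single_note_individuals(hits))
# [('A', 1), ('A', 2), ('C', 4), ('C', 5)]  -- individual B is dropped
```

As the response notes, this trades some sensitivity (genuine users with a single note are lost) for a higher PPV.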
• Line 285: “However, the free text search did not appear to significantly enhance sensitivity.” The authors did not estimate 'sensitivity' in this study, so there is no data to support this statement. Correct, thank you for spotting this. This text has been deleted (p.20).
• Line 299: “However, an unforeseen issue arose that inevitably reduced the number of free text results.” I think it would be worthwhile to mention what the issue was and how it impacted the results. If it was fixed, then this can be deleted. The issue is explained in the text following this line; due to the nature of the issue, it was not possible to calculate its impact, nor were we able to fix it (p.20-21).
• Maybe the authors can shed light on the impact of ever-changing technology during the study. I assume there have been changes to the EHR, and I assume the quality of the data (e.g. completion) varies as familiarity with the system increases. We agree that the impact of changing technology and general familiarity with it (including among healthcare clinicians) on the completion and quality of EHRs is interesting and worth researching, but we feel it is perhaps beyond the scope of this study to address here; we have added text suggesting it as a useful addition to future studies (p.20).
• Some of the single service searches reached a PPV of 100% while others remained as low as 35%. Is it possible that the data quality varies between different housing services? Is it possible that data quality varies between providers as well, e.g. some clinicians would not ask or report on accommodation status? It is possible, but we believe it is more likely due to the varying distinctiveness of the names of services.
The free text search was based on the service name, so a more distinctive service name would produce a search with fewer false positives.
• I assume the structured variables that are used in this study to identify the cases are not systematically reported. Therefore, missingness is not at random, leading to selection bias. Maybe it is worthwhile to discuss selection bias in this study. Text added to Strengths and limitations (p.19-20) regarding potential selection bias: “It is unknown, but as these variables are not completed by staff systematically, there may be reasons why these fields are not completed for some individuals whilst completed for others (e.g. greater stability in housing) which would lead to selection bias in this approach. There may also be causes for selection bias with the free text search approach as individuals with greater clinical contact are more likely to have a greater number of records and therefore more likely to be returned by the search. However, most people in supported accommodation have complex and longer term mental health problems (1) and are therefore likely to have an extensive history of contact with NHS mental health services.”

Conclusions:
• I think the authors' claim of 'resource efficiency' is not substantially supported with evidence. Agreed; claim deleted (p.22).

Other comments:
• References are not in the journal's accepted style. Many are missing the volume and page number or DOI. Volume and page number or DOI added where missing. Unfortunately, these corrections do not appear as tracked changes because of the referencing software used (p.25-27).
• The authors have used some terms inconsistently, e.g. 'effectiveness' and 'efficiency' are used interchangeably. I suggest using terms consistently to make the manuscript easier to read. Thank you for noting this. Use of the terms 'effectiveness' and 'efficiency' has been reviewed throughout the manuscript and we believe they are now used consistently, and not interchangeably.
31 Jul 2020

Using de-identified electronic health records to research mental health supported housing services: a feasibility study (PONE-D-20-10079R1)

Dear Dr. Dalton-Locke,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Sreeram V. Ramagopalan
Academic Editor
PLOS ONE

11 Aug 2020

Using de-identified electronic health records to research mental health supported housing services: a feasibility study (PONE-D-20-10079R1)

Dear Dr. Dalton-Locke:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Sreeram V. Ramagopalan
Academic Editor
PLOS ONE
  14 in total

1.  Characteristics of trafficked adults and children with severe mental illness: a historical cohort study.

Authors:  Siân Oram; Mizanur Khondoker; Melanie Abas; Matthew Broadbent; Louise M Howard
Journal:  Lancet Psychiatry       Date:  2015-10-17       Impact factor: 27.083

2.  Predictors of moving on from mental health supported accommodation in England: national cohort study.

Authors:  Helen Killaspy; Stefan Priebe; Peter McPherson; Zohra Zenasni; Lauren Greenberg; Paul McCrone; Sarah Dowling; Isobel Harrison; Joanna Krotofil; Christian Dalton-Locke; Rose McGranahan; Maurice Arbuthnott; Sarah Curtis; Gerard Leavey; Geoff Shepherd; Sandra Eldridge; Michael King
Journal:  Br J Psychiatry       Date:  2020-06       Impact factor: 9.319

3.  Predictors of care home and hospital admissions and their costs for older people with Alzheimer's disease: findings from a large London case register.

Authors:  Martin Knapp; Kia-Chong Chua; Matthew Broadbent; Chin-Kuo Chang; Jose-Luis Fernandez; Dominique Milea; Renee Romeo; Simon Lovestone; Michael Spencer; Gwilym Thompson; Robert Stewart; Richard D Hayes
Journal:  BMJ Open       Date:  2016-11-18       Impact factor: 2.692

4.  Mother and Baby Units matter: improved outcomes for both.

Authors:  Lucy A Stephenson; Alastair J D Macdonald; Gertrude Seneviratne; Freddie Waites; Susan Pawlby
Journal:  BJPsych Open       Date:  2018-04-19

5.  Feasibility Randomised Trial Comparing Two Forms of Mental Health Supported Accommodation (Supported Housing and Floating Outreach); a Component of the QuEST (Quality and Effectiveness of Supported Tenancies) Study.

Authors:  Helen Killaspy; Stefan Priebe; Peter McPherson; Zohra Zenasni; Paul McCrone; Sarah Dowling; Isobel Harrison; Joanna Krotofil; Christian Dalton-Locke; Rose McGranahan; Maurice Arbuthnott; Sarah Curtis; Gerard Leavey; Rob MacPherson; Sandra Eldridge; Michael King
Journal:  Front Psychiatry       Date:  2019-04-17       Impact factor: 4.157

6.  The South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case register: development and descriptive data.

Authors:  Robert Stewart; Mishael Soremekun; Gayan Perera; Matthew Broadbent; Felicity Callard; Mike Denis; Matthew Hotopf; Graham Thornicroft; Simon Lovestone
Journal:  BMC Psychiatry       Date:  2009-08-12       Impact factor: 3.630

7.  Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records.

Authors:  Andrea C Fernandes; Danielle Cloete; Matthew T M Broadbent; Richard D Hayes; Chin-Kuo Chang; Richard G Jackson; Angus Roberts; Jason Tsang; Murat Soncul; Jennifer Liebscher; Robert Stewart; Felicity Callard
Journal:  BMC Med Inform Decis Mak       Date:  2013-07-11       Impact factor: 2.796

8.  Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: current status and recent enhancement of an Electronic Mental Health Record-derived data resource.

Authors:  Gayan Perera; Matthew Broadbent; Felicity Callard; Chin-Kuo Chang; Johnny Downs; Rina Dutta; Andrea Fernandes; Richard D Hayes; Max Henderson; Richard Jackson; Amelia Jewell; Giouliana Kadra; Ryan Little; Megan Pritchard; Hitesh Shetty; Alex Tulloch; Robert Stewart
Journal:  BMJ Open       Date:  2016-03-01       Impact factor: 2.692

9.  Mental health supported accommodation services: a systematic review of mental health and psychosocial outcomes.

Authors:  Peter McPherson; Joanna Krotofil; Helen Killaspy
Journal:  BMC Psychiatry       Date:  2018-05-15       Impact factor: 3.630

10.  Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing.

Authors:  Andrea C Fernandes; Rina Dutta; Sumithra Velupillai; Jyoti Sanyal; Robert Stewart; David Chandran
Journal:  Sci Rep       Date:  2018-05-09       Impact factor: 4.379

