Andrea C Fernandes, Danielle Cloete, Matthew T M Broadbent, Richard D Hayes, Chin-Kuo Chang, Richard G Jackson, Angus Roberts, Jason Tsang, Murat Soncul, Jennifer Liebscher, Robert Stewart, Felicity Callard.
Abstract
BACKGROUND: Electronic health records (EHRs) provide enormous potential for health research but also present data governance challenges. Ensuring de-identification is a pre-requisite for use of EHR data without prior consent. The South London and Maudsley NHS Trust (SLaM), one of the largest secondary mental healthcare providers in Europe, has developed, from its EHRs, a de-identified psychiatric case register, the Clinical Record Interactive Search (CRIS), for secondary research.
Year: 2013 PMID: 23842533 PMCID: PMC3751474 DOI: 10.1186/1472-6947-13-71
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Table 1. List of key Patient Identifiers (PIs) specified to be de-identified in the U.K. to create a de-identified database as stipulated by the Caldicott Code on Confidentiality
| Patient identifier |
|---|
| First Name |
| Middle Name |
| Last Name |
| Current and Old Address Line 1 |
| Current and Old Address Line 2 |
| Current and Old Address Post code |
| Current and Old Telephone Numbers |
| Current and Old Email addresses |
| Date of birth |
| National Health Service (NHS) Identification (ID) numbers* |
| Hospital specific ID numbers |
| Rare or unique characteristics |
| Aliases/Nicknames |
*NHS numbers are assigned to every resident of the U.K.
These are all personal to the patient. There is no obligation to de-identify all non-patient information such as clinical staff names.
Figure 1. Diagrammatic description of converting source medical records (Electronic Patient Journey System [ePJS]) to Clinical Record Interactive Search (CRIS).
Table 2. An example of the CRIS dictionary
| Patient identifier field | Value in source record |
|---|---|
| First name | Joe |
| **Middle name | (blank) |
| Second name | Bloggs |
| *Date of birth | 20/08/1987 |
| Trust ID | 12–34–56 |
| Post Code | SW9 6TJ |
| **Nick Name | (blank) |
| **Key Contact First Name | (blank) |
| Key Contact Last Name | O’Connell |

| Dictionary term |
|---|
| Joe |
| Bloggs |
| 20/08/1987 |
| 20/08/’87 |
| 20–08–1987 |
| 20–08–87 |
| 20.08.1987 |
| 20.08.87 |
| 20.8.87 |
| 20th Aug 1987 |
| 20th Aug ‘87 |
| 20th of August 1987 |
| 20th of Aug 1987 |
| 12–34–56 |
| 123456 |
| 12 34 56 |
| SW9 6TJ |
| SW96TJ |
| Connell |
*There are many more ways of recording the date of birth; a non-exhaustive list is provided here.
**No information was entered in these fields, so they are not included in the CRIS PI dictionary. These are patient identifier fields that are sometimes left blank in the source.
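The expansion of a few structured fields into many dictionary terms can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the CRIS implementation: the function names (`date_variants`, `split_variants`, `build_dictionary`) and the field keys are hypothetical, ASCII hyphens and apostrophes stand in for the typographic dashes and quotes shown in the table, and the set of date formats is only a subset of what a real system would need.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def ordinal(day):
    """Return the day with its English ordinal suffix, e.g. 20 -> '20th'."""
    if 11 <= day % 100 <= 13:
        return f"{day}th"
    return f"{day}" + {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")

def date_variants(dob):
    """Generate a non-exhaustive set of written forms of a date of birth."""
    d, m = dob.day, dob.month
    yy = f"{dob.year % 100:02d}"
    years = (f"{dob.year:04d}", yy, "'" + yy)
    variants = set()
    # Numeric forms: three separators, with and without zero padding.
    for sep in "/-.":
        for day in {f"{d:02d}", str(d)}:
            for month in {f"{m:02d}", str(m)}:
                for year in years:
                    variants.add(f"{day}{sep}{month}{sep}{year}")
    # Word forms: "20th Aug 1987", "20th of August 1987", and so on.
    full = MONTHS[m - 1]
    for month_name in {full, full[:3]}:
        for joiner in (" ", " of "):
            for year in years:
                variants.add(f"{ordinal(d)}{joiner}{month_name} {year}")
    return variants

def split_variants(value, joiners):
    """Re-join the parts of an identifier with each of the given joiners."""
    parts = [p for p in re.split(r"[-\s]+", value) if p]
    return {j.join(parts) for j in joiners}

def build_dictionary(fields):
    """Collect dictionary terms; blank fields contribute nothing."""
    terms = set()
    for key in ("first_name", "middle_name", "last_name", "nick_name"):
        if fields.get(key):
            terms.add(fields[key])
    if fields.get("date_of_birth"):
        terms |= date_variants(fields["date_of_birth"])
    if fields.get("trust_id"):
        terms |= split_variants(fields["trust_id"], ("-", " ", ""))
    if fields.get("post_code"):
        terms |= split_variants(fields["post_code"], (" ", ""))
    return terms

terms = build_dictionary({
    "first_name": "Joe", "middle_name": "", "last_name": "Bloggs",
    "date_of_birth": date(1987, 8, 20),
    "trust_id": "12-34-56", "post_code": "SW9 6TJ",
})
```

Blank fields such as the middle name add no terms, which mirrors the footnote above: a PI that was never entered in the source cannot be masked by dictionary lookup.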
Table 3. List of XML scopes or XML tags, through which the de-identification algorithm is run
| Fields within each scope |
|---|
| Summary texts, Event notes, Correspondence notes, Ward Progress Notes, etc. |
| NHS number field, Trust specific ID field, First Name, Middle Name, Last Name, Telephone Number, etc. |
| Date of Birth, Post Code, Ethnicity, etc. |
Figure 2. Source Electronic Patient Journey System (ePJS) record input and CRIS output. Note that this example uses the dictionary from Table 2. “Jie” and “Mary” have not been masked because they are, respectively, a typographical error and an un-entered PI. All details in this record are fictitious: any resemblance to real persons is entirely coincidental. There is also no legal requirement to de-identify names of clinical staff such as “Terry Scott”, the fictitious assistant psychologist, whose name therefore appears in full in the CRIS record.
Table 4. Heuristics to identify names, with entirely fictitious examples, as they would appear in the source record and CRIS output
| Text in source record | Beginning | Name | Punctuation | Suffix | End | CRIS output |
|---|---|---|---|---|---|---|
| …replaced. Mark will also be able… | .(space) | Mark | None | None | (space) | …replaced. ZZZZZ will also be able… |
| …knowing Mark’s diagnosis… | (space) | Mark | ‘ | S | (space) | …knowing ZZZZZ diagnosis… |
| …7)Mark is compliant… | ) | Mark | None | None | (space) | …7)ZZZZZ is compliant… |
| …OMark is compliant… | No beginning identified | None | None | None | None | …OMark is compliant… |
| …was awarded 9 mark out of 30 in… | (space) | mark | None | None | (space) | …was awarded 9 ZZZZZ out of 30 in… |
| …Nurse informed Mark. Earlier… | (space) | Mark | None | None | . | …Nurse informed ZZZZZ. Earlier… |
| …Marik will be attending… | (space) | None identified due to misspelling | None | None | None | …Marik will be attending… |
| …O’Mark is at the… | O’ | Mark | None | None | (space) | …ZZZZZ is at the… |
| …his father, John, was also present… | , (space) | John | , | None | (space) | …his father, QQQQQ, was also present… |
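The rows above can be approximated with a delimiter-aware regular expression. The following is a hedged sketch rather than the CRIS code: `build_name_pattern` and `mask_names` are hypothetical names, and the use of QQQQQ for names of contacts or relatives is inferred from the last table row. It assumes a name counts only when it is not glued to a letter or digit on either side, that an "O’" prefix and a possessive "’s" are absorbed into the mask, and that matching is case-insensitive (which is why the ordinary word "mark" is also masked).

```python
import re

def build_name_pattern(name):
    """Match a dictionary name bounded by non-alphanumeric delimiters.

    An optional O' prefix and an optional possessive 's are consumed too.
    """
    return re.compile(
        r"(?<![A-Za-z0-9])(?:O[’'‘])?"
        + re.escape(name)
        + r"(?:[’'‘]s)?(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def mask_names(text, patient_names, contact_names=()):
    """Replace patient names with ZZZZZ and contact names with QQQQQ."""
    for name in patient_names:
        text = build_name_pattern(name).sub("ZZZZZ", text)
    for name in contact_names:
        text = build_name_pattern(name).sub("QQQQQ", text)
    return text

masked = mask_names("…knowing Mark’s diagnosis…", ["Mark"])
```

Because the lookup is exact, a misspelling such as "Marik" or a name fused to a preceding letter such as "OMark" passes through unmasked, as in the table.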
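Table 5. Heuristics to identify date of birth, with entirely fictitious examples, as they would appear in source records and CRIS output
| Text in source record | Beginning | First component | Delimiter | Second component | Delimiter | Year | End | CRIS output |
|---|---|---|---|---|---|---|---|---|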
| Dob: 01/01/2001 | : | 01 | / | 01 | / | 2001 | (space) | Dob: ZZZZZ |
| 1st of January 2001 | (Space) | 1st | (space) of (space) | January | (space) | 2001 | (space) | ZZZZZ |
| …born in Jan 1st 01… | (space) | Jan | (space) | 1st | (space) | 01 | (space) | …born in ZZZZZ… |
| …01-01-’01… | (space) | 01 | - | 01 | -‘ | 01 | (space) | …ZZZZZ… |
| …01 Jan 2001 | (space) | 01 | (space) | Jan | (space) | 2001 | (space) | …ZZZZZ… |
| Dob: 01//01/2001 | : | 01 | / | None identified owing to typographical error in the source record | None | None | None | Dob: 01//01/2001 |
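For the purely numeric rows, one way to express the heuristic is a pattern built from the patient's own date of birth. This is a sketch under assumptions, not the published algorithm: `dob_numeric_pattern` and `mask_dob` are hypothetical names, only day-month-year order is handled, and the accepted separators are limited to `/`, `-` and `.`. Word-form dates are outside its scope.

```python
import re
from datetime import date

def dob_numeric_pattern(dob):
    """Build a regex for numeric day-month-year spellings of one date."""
    # Single-digit days and months may appear with or without a leading zero.
    day = f"0?{dob.day}" if dob.day < 10 else str(dob.day)
    month = f"0?{dob.month}" if dob.month < 10 else str(dob.month)
    # Four-digit year, or two-digit year with an optional apostrophe.
    year = rf"[’'‘]?(?:{dob.year}|{dob.year % 100:02d})"
    sep = r"[/\-.]"
    return re.compile(rf"(?<!\d){day}{sep}{month}{sep}{year}(?!\d)")

def mask_dob(text, dob):
    """Replace numeric occurrences of the date of birth with ZZZZZ."""
    return dob_numeric_pattern(dob).sub("ZZZZZ", text)

masked = mask_dob("Dob: 01/01/2001", date(2001, 1, 1))
```

A doubled separator, as in the last row of the table, breaks the expected component order, so the string is left as it appears in the source.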
Table 6. Heuristics to identify post codes, with entirely fictitious examples, as they would appear in source records and CRIS output
| Text in source record | Post code identified | CRIS output |
|---|---|---|
| He lives at EN1 5SR | EN1 5SR | He lives at ZZZZZ |
| Lives at EN1. No… | None | Lives at EN1. No… |
| Lives at EN1 S5R | None identified owing to typographic error | Lives at EN1 S5R |
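The three rows can be reproduced with a pattern that accepts only a complete post code: an outward part, an optional space, and an inward part of one digit followed by two letters. This is an illustrative sketch, not the CRIS rule; the simplified pattern and the name `mask_postcodes` are assumptions, it matches upper-case text only, and it does not validate against real post code areas.

```python
import re

# Outward code (1-2 letters, a digit, optional letter or digit),
# optional space, inward code (a digit followed by two letters).
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")

def mask_postcodes(text):
    """Replace full post codes with ZZZZZ; partial or malformed ones remain."""
    return POSTCODE.sub("ZZZZZ", text)

masked = mask_postcodes("He lives at EN1 5SR")
```

A partial code such as "EN1" and a transposed inward code such as "S5R" fail the inward-part test and are left in place, which matches the behaviour shown in the table.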
Table 7. Algorithms to de-identify date of birth (word form), phone numbers, NHS identification numbers and addresses

Algorithm to identify date of birth (word date)
::= <beginning> <day> <spaces> <day_suffix> <spaces> <optional_word_delimiter> <spaces> <word_month> <spaces> <optional_comma> <spaces> <year> <end>
  | <beginning> <word_month> <spaces> <optional_month_delimiter> <spaces> <day> <spaces> <day_suffix> <spaces> <optional_comma> <spaces> <year> <end>

Algorithm to identify phone numbers
::= <optional_open_bracket> <first 5 digits> <optional_close_bracket> <space> <digits 6-8> <space> <digits 9-11>
  | <optional_open_bracket> <first 3 digits> <optional_close_bracket> <space> <digits 4-7> <space> <digits 8-11>
  | <optional_open_bracket> <first 4 digits> <optional_close_bracket> <space> <digits 5-7> <space> <digits 8-11>

Algorithm to identify NHS Identification numbers

Algorithm to identify addresses
::= <address_term1> | <address_term2> …. <address_termn>
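The word-date and phone-number grammars above translate fairly directly into regular expressions. The sketch below is one possible reading, not the authors' implementation: the names `WORD_DATE`, `PHONE` and `mask_dates_and_phones` are hypothetical, "of" and "the" are assumed values for the optional word and month delimiters, and the date pattern matches any word-form date rather than only the patient's date of birth. The NHS number and address rules are not sketched because the source gives no usable detail for them.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?"
         r"|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?"
         r"|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"(?:0?[1-9]|[12]\d|3[01])"
SUFFIX = r"(?:st|nd|rd|th)?"
YEAR = r"[’'‘]?(?:\d{4}|\d{2})"
GAP = r"(?:\s*,\s*|\s+)"  # optional comma and/or spaces before the year

# <day> <day_suffix> [of] <word_month> [,] <year>
# | <word_month> [the] <day> <day_suffix> [,] <year>
WORD_DATE = re.compile(
    rf"(?<![A-Za-z0-9])(?:"
    rf"{DAY}\s*{SUFFIX}\s*(?:of\s+)?{MONTH}{GAP}{YEAR}"
    rf"|{MONTH}\s+(?:the\s+)?{DAY}\s*{SUFFIX}{GAP}{YEAR}"
    rf")(?![A-Za-z0-9])",
    re.IGNORECASE,
)

# Eleven digits grouped 5-3-3, 3-4-4 or 4-3-4, area code optionally bracketed.
PHONE = re.compile(
    r"(?<!\d)(?:"
    r"\(?\d{5}\)?\s\d{3}\s\d{3}"
    r"|\(?\d{3}\)?\s\d{4}\s\d{4}"
    r"|\(?\d{4}\)?\s\d{3}\s\d{4}"
    r")(?!\d)"
)

def mask_dates_and_phones(text):
    """Mask word-form dates first, then phone numbers, with ZZZZZ."""
    return PHONE.sub("ZZZZZ", WORD_DATE.sub("ZZZZZ", text))

masked = mask_dates_and_phones("born 1st of January 2001, tel (020) 7946 0123")
```

Requiring a space or comma before the year is a design choice of this sketch: it keeps a phrase such as "May 2001" from being read as "May 20, '01".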
Figure 3. CRIS security model.
Table 8. Precision and recall rates from the machine learning approach and CRIS’ pattern matching approach
| 20 | 20 | |
| 191 | 191 | |
| 154 | 169 | |
| 43 | 22 | |
| 8 | 0 | |
Table 9. Precision and recall from the CRIS performance test
| 500 | |
| 3603 | |
| 3573 | |
| 89* | |
| 30 | |
*See Table 10.
Table 10. Instances of ‘potential’ PIs: none of which is a breach as none occurred in isolation
| | | |
| 20 | Un-entered PI | |
| 18 | Un-entered PI or misspellings | |
| 14 | Un-entered PI or misspellings | |
| 13 | Un-entered PI or misspellings | |
| 2 | Un-entered PI | |
| 10 | Un-entered PI or misspellings | |
| 5 | Un-entered PI or misspellings | |
| 4 | Un-entered PI | |
| 2 | Un-entered PI or misspellings | |
| 1 | Un-entered PI or misspellings |
Table 11. Instance of a potential breach: 3 or more PIs appearing for a single patient
| 1 | Patient third line of address | 2006 | Misspellings | Low: Outdated information, and cannot be verified, and incorrect spellings [ |
| Patient post code | | | | |
| Patient last name |
Table 12. Potential instances that could lead to inferring patient information
| 2 | Patient prison reference number | 2012 | 12 | No rule to de-identify prison reference numbers or prison contact information | Low: cannot be verified, and misspellings [ |
| Prison contact phone number | | | Typographical error in patient’s last name | | |
| Patient last name |