Stephane M Meystre, F Jeffrey Friedlin, Brett R South, Shuying Shen, Matthew H Samore.
Abstract
BACKGROUND: In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Institutional Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often performed manually and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here.
Year: 2010 PMID: 20678228 PMCID: PMC2923159 DOI: 10.1186/1471-2288-10-70
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Figure 1. Patient identifiers defined in the HIPAA "Safe Harbor" legislation.
Automatic de-identification systems and their principal characteristics
| First author | System name | Availability/license | Programming language/resources | Knowledge resources | Document types |
|---|---|---|---|---|---|
| Aramaki | System for the i2b2 de-identification challenge | Not publicly available | CRF++ (1) | Lists of names, locations, dates | Discharge summaries |
| Beckwith | HMS Scrubber | Open source (GNU LGPL v2) | Java, JDOM, MySQL | Lists of names, locations | Surgical pathology reports |
| Berman | Concept-Match | System freely available | Perl | UMLS Metathesaurus | Surgical pathology reports |
| Fielstein | (VA system) | Not publicly available | Perl | Lists of names, locations, email addresses | VA compensation and pension examinations |
| Friedlin | MeDS | Not publicly available | Java | Lists of names, locations, medical terms | HL7 messages |
| Gardner | HIDE | Open source (Common Public License v1) | Perl, Java, Mallet (2) | None | Surgical pathology reports |
| Guo | System for the i2b2 de-identification challenge | Not publicly available | GATE (3) | Lists of locations, hospitals | Discharge summaries |
| Gupta | DE-ID (DE-ID Data Corp., Richboro, PA) | Commercial system, not freely available | Unknown | List of U.S. census names, user-defined dictionaries | Surgical pathology reports |
| Hara | System for the i2b2 de-identification challenge | Not publicly available | C++, BACT and YamCha (5) | None | Discharge summaries |
| Morrison | MedLEE | Not publicly available | Prolog | MedLEE lexicon, UMLS Metathesaurus | Outpatient follow-up notes |
| Neamatullah | (MIT system) | Open source (GNU GPL v2) | Perl | Lists of common English words (non-PHI), terms indicating PHI, names and locations, known PHI (patient and staff lists) | Nursing progress notes, discharge summaries |
| Ruch | MEDTAG framework-based | Not publicly available | Unknown | MEDTAG lexicon (based on UMLS Metathesaurus; only in French) | Various clinical documents (multilingual) |
| Sweeney | Scrub | Not publicly available | Unknown | Lists of area codes, names | Various clinical documents |
| Szarvas | System for the i2b2 de-identification challenge | Not publicly available | Weka (6) | Lists of first names, locations, diseases, non-PHI (general English) | Discharge summaries |
| Taira | (UCLA system) | Not publicly available | Unknown | Lists of names and drugs | Various clinical documents |
| Thomas | (Regenstrief Institute system) | Not publicly available | Java, XSL | List of names, UMLS Metathesaurus terms | Surgical pathology reports |
| Uzuner | Stat De-id | Not publicly available (open source release planned) | LIBSVM (7) | MeSH terms, lists of names, locations, and hospitals | Discharge summaries |
| Wellner | System for the i2b2 de-identification challenge | Open source (BSD) | OCaml (8) | Lists of US states, months, common English words | Discharge summaries |
1 http://crfpp.sourceforge.net/
2 http://mallet.cs.umass.edu/
3 http://gate.ac.uk/
4 http://svmlight.joachims.org/
5 http://www.chasen.org/~taku/software/
6 http://www.cs.waikato.ac.nz/ml/weka/
7 http://www.csie.ntu.edu.tw/~cjlin/libsvm
8 http://caml.inria.fr/ocaml/index.en.html
9 http://sourceforge.net/projects/carafe/
Types of PHI and other data detected by the de-identification systems
| De-identification system | PHI | Clinical data | ||||||
|---|---|---|---|---|---|---|---|---|
| Person names | Ages > 89 | Geographical locations | Dates | Contact information | IDs | |||
| Aramaki | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Beckwith | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Berman | ✸ | ✸ | ✸ | ✸ | ✸ | ✸ | ✸ | UMLS |
| Fielstein | ✔ | ✔ | ✔ | ✔ | ✔ | None | ||
| Friedlin | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Gardner | ✔ | ✔ | ✔ | None | ||||
| Guo | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Gupta | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Hara | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Morrison | ✸ | ✸ | ✸ | ✸ | ✸ | ✸ | ✸ | MedLEE |
| Neamatullah | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Ruch | ✔ | ✔ | ✔ | ✔ | MEDTAG | |||
| Sweeney | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Szarvas | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
| Taira | - | None | ||||||
| Thomas | - | None | ||||||
| Uzuner | ✔ | ✔ | ✔ | ✔ | ✔ | None | ||
| Wellner | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | None | |
✸ Only extracted concepts (i.e. UMLS or other clinical concepts) are retained.
P+D = Patient and healthcare provider names; P = Patient name
Figure 2. Principal methods used for each class of PHI.
Resources used by systems mostly based on pattern matching and/or rule-based methods.
| De-identification system | Knowledge resources | Principal methods |
|---|---|---|
| Beckwith | Lists of proper names, locations | Regular expressions and dictionaries. |
| Berman | UMLS Metathesaurus, stop words | Dictionaries |
| Fielstein | Lists of cities and VA PHI (patient names, SSNs, MRNs...) | Regular expressions and dictionaries. |
| Friedlin | Lists of names (including Regenstrief patients), locations. | Regular expressions and dictionaries; identifiers in HL7 messages. |
| Gupta (De-ID system) | UMLS Metathesaurus, institution-specific identifiers | Regular expressions and dictionaries; identifiers in report headers. |
| Morrison (MedLEE) | MedLEE lexicon and UMLS Metathesaurus. | Rules/grammar-based, with dictionaries. |
| Neamatullah | Lists of common English words (non-PHI), names, locations, UMLS Metathesaurus and other medical terms, known patients and healthcare providers in the institution. | Regular expressions and dictionaries. |
| Ruch | MEDTAG lexicon (enriched with healthcare institution names, drug names, procedures, and devices) | Rule-based, with dictionaries. |
| Sweeney | Lists of names, U.S. states, countries, medical terms. | Rule-based, with dictionaries. |
| Thomas | List of names, UMLS Metathesaurus, Ispell terms. | Regular expressions and dictionaries. |
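
The combination of regular expressions and dictionaries listed in the table above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patterns, the toy name list, the replacement tags, and the function name `scrub` are hypothetical and are not taken from any of the reviewed systems, which use far larger resources and more careful rules.

```python
import re

# Hypothetical patterns for a few PHI classes (illustrative only;
# none of these come from the reviewed systems).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Toy dictionary of surnames; real systems rely on large name lists
# (for example census names or institutional patient lists).
NAME_LIST = {"smith", "jones"}


def scrub(text):
    """Replace dictionary and pattern matches with class tags."""

    def replace_name(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in NAME_LIST else word

    # Dictionary lookup on every alphabetic token.
    text = re.sub(r"[A-Za-z]+", replace_name, text)
    # Pattern matching for numeric identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub("Mr. Smith was seen on 03/14/2009; call 555-123-4567."))
```

A plain dictionary lookup like this one also tags words that are names only in some contexts, which is one reason several systems in the table add lists of common non-PHI words or medical terms.
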
Algorithms and features used by systems mostly based on machine learning methods.
| De-identification system | Machine learning algorithm | Lexical/morphological features | Syntactic features | Semantic features |
|---|---|---|---|---|
| Aramaki | CRF | Word, surrounding words (5-word window), capitalization, word length, regular expressions (date, phone), sentence position and length. | POS (word + 2 surrounding words) | Dictionary terms (names, locations) |
| Gardner | CRF | Word lemma, capitalization, numbers, prefixes/suffixes, 2-3 character n-grams | POS (word) | None |
| Guo | SVM | Word, capitalization, prefixes/suffixes, word length, numbers, regular expressions (date, ID, phone, age) | POS (word) | Entities extracted by ANNIE (doctors, hospitals, locations) |
| Hara | SVM | Word, lemma, capitalization, regular expressions (phone, date, ID) | POS (word) | Section headings |
| Szarvas | Decision Tree | Word length, capitalization, numbers, regular expressions (age, date, ID, phone), token frequency | None | Dictionary terms (first names, US locations, countries, cities, diseases, non-PHI terms), section heading. |
| Taira | Maximum Entropy | Capitalization, punctuation, numbers, regular expressions (prefixes, physician and hospital name, syndrome/disease/procedure) | POS (word) | Semantic lexicon, dictionary terms (proper names, prefixes, drugs, devices), semantic selectional restrictions |
| Uzuner | SVM | Word, lexical bigrams, capitalization, punctuation, numbers, word length. | POS (word + 2 surrounding words), syntactic bigrams (link grammar) | MeSH ID, dictionary terms (names, US and world locations, hospital names), section headers. |
| Wellner | CRF | Word unigrams/bigrams, surrounding words (3-word window), prefixes/suffixes, capitalization, numbers, regular expressions (phone, ID, zip, date, locations/hospitals) | None | Dictionary terms (US states, months, general English terms). |
CRF = Conditional Random Fields; SVM = Support Vector Machine; POS = Part-of-speech
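
To make the feature categories in the table above more concrete, the following Python sketch builds a feature dictionary for one token, of the kind that could be passed to a CRF, SVM, or decision tree learner. It is a simplified illustration under stated assumptions: the feature names, the date pattern, the toy name list, the padding symbol, and the function `token_features` are all hypothetical, and syntactic features such as POS tags are left out because they would require a tagger.

```python
import re

# Hypothetical date pattern and toy first-name dictionary (illustrative only).
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
FIRST_NAMES = {"john", "mary"}


def token_features(tokens, i, window=2):
    """Return lexical and dictionary features for tokens[i]."""
    word = tokens[i]
    feats = {
        # Lexical/morphological features
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "length": len(word),
        "has_digit": any(c.isdigit() for c in word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "looks_like_date": bool(DATE_RE.match(word)),
        # Semantic (dictionary) feature
        "in_name_list": word.lower() in FIRST_NAMES,
    }
    # Surrounding words within a fixed window, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = "Patient John seen 03/14/2009".split()
print(token_features(tokens, 1))
```

With `window=2` each token sees two neighbors on each side, which corresponds to a 5-word window; a classifier trained on such dictionaries would then label each token as PHI or non-PHI.
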