Richard Jackson, Ismail Kartoglu, Clive Stringer, Genevieve Gorrell, Angus Roberts, Xingyi Song, Honghan Wu, Asha Agrawal, Kenneth Lui, Tudor Groza, Damian Lewsley, Doug Northwood, Amos Folarin, Robert Stewart, Richard Dobson.
Abstract
BACKGROUND: Traditional health information systems are generally devised to support clinical data collection at the point of care. However, as the significance of the modern information economy expands in scope and permeates the healthcare domain, there is an increasing urgency for healthcare organisations to offer information systems that address the expectations of clinicians, researchers and the business intelligence community alike. Amongst other emergent requirements, the principal unmet need might be defined as the 3R principle (right data, right place, right time): addressing deficiencies in organisational data flow while retaining the strict information governance policies that apply within the UK National Health Service (NHS). Here, we describe our work on creating and deploying a low-cost architecture for structured and unstructured information retrieval and extraction within King's College Hospital, the management of governance concerns, and the associated use cases and cost-saving opportunities that such components present.
Keywords: Clinical informatics; Elasticsearch; Electronic health records; Information extraction; Natural language processing
Year: 2018 PMID: 29941004 PMCID: PMC6020175 DOI: 10.1186/s12911-018-0623-9
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 CogStack architecture and dataflow. All components can be deployed via the Docker containerisation software.
1. New job execution: The master instance of CogStack periodically identifies new data in Trust data sources.
2. Partitioning: The job is partitioned into a user-definable number of work units.
3a. Derive the free-text content: Extract plain and/or formatted text from common proprietary binary document formats (performing OCR where necessary) using the Tika library, to enable downstream processing of high-value unstructured data elements.
3b. Supplement the text content with metadata: Filter and de-normalise a subset of the structured clinical data to provide a patient-orientated, transparent representation of high-value metadata concepts. For example, this might include calculated fields representing patient age at document date, first part of postcode, ethnicity and lab results.
3c. De-identification: Transform the resulting text documents into de-identified text documents by masking personal health identifiers using the Cognition de-identification algorithms. This is necessary to address governance concerns associated with the secondary use of patient data. Identifiers in structured data can be excluded via SQL query, according to business requirements.
4. Information extraction: Apply generic clinical IE pipelines to derive additional structured data from free text, supplementing the quantity of structured data available at the point of query.
5. Indexing: Build a JSON object from the resulting structured and unstructured data, which can then readily be indexed into an Elasticsearch cluster.
6. Visualisation: The Kibana suite provides a range of options for viewing, aggregating and dashboarding the loaded data.
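
Steps 3b and 5 of the caption can be illustrated with a minimal Python sketch. This is not CogStack's own code: the function names, the record layout and the JSON field names (`body`, `age_at_document_date`, `postcode_district`, `ethnicity`) are all hypothetical, chosen only to show how calculated metadata fields might be combined with de-identified text into a single JSON object ready for indexing.

```python
import json
from datetime import date


def age_at(document_date: date, date_of_birth: date) -> int:
    """Whole years between the date of birth and the document date."""
    had_birthday = (document_date.month, document_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return document_date.year - date_of_birth.year - (0 if had_birthday else 1)


def outward_postcode(postcode: str) -> str:
    """First part of a UK postcode, e.g. 'SE5 9RS' -> 'SE5'."""
    compact = postcode.replace(" ", "").upper()
    # The second (inward) part of a UK postcode is always three characters.
    return compact[:-3] if len(compact) > 3 else compact


def build_index_document(text: str, record: dict) -> str:
    """Combine de-identified text with de-normalised metadata as JSON.

    The field names here are hypothetical, not a CogStack schema.
    """
    doc = {
        "body": text,
        "age_at_document_date": age_at(
            record["document_date"], record["date_of_birth"]
        ),
        "postcode_district": outward_postcode(record["postcode"]),
        "ethnicity": record["ethnicity"],
    }
    return json.dumps(doc, sort_keys=True)


# Entirely synthetic example record.
example = {
    "document_date": date(2016, 3, 1),
    "date_of_birth": date(1980, 6, 15),
    "postcode": "SE5 9RS",
    "ethnicity": "Not specified",
}
print(build_index_document("Patient seen in clinic. [de-identified]", example))
```

Keeping only the first part of the postcode and an age rather than a date of birth reflects the caption's intent of exposing useful metadata while limiting identifiers; the resulting JSON string is the kind of object that could then be sent to an Elasticsearch index.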
Fig. 2 Kibana interface loaded with pseudo-data
Patient demographics, King’s College Hospital 2004-2016
| Characteristic | Count | % |
|---|---|---|
| Age (years) | | |
| ≤ 20 | 435 796 | 14.80 |
| 21-40 | 811 865 | 27.57 |
| 41-60 | 876 467 | 29.77 |
| 61-80 | 490 153 | 16.65 |
| ≥ 80 | 326 453 | 11.09 |
| Unknown | 3 792 | 0.13 |
| Gender | | |
| Male | 1 369 074 | 46.50 |
| Female | 1 571 717 | 53.38 |
| Indeterminate | 550 | 0.02 |
| Unknown | 3 185 | 0.11 |
| Race (self-assigned) | | |
| Asian or Asian British | 95 682 | 3.25 |
| Black or Black British | 326 618 | 11.09 |
| Mixed | 59 214 | 2.01 |
| Not specified | 1 506 703 | 51.17 |
| Other | 97 277 | 3.30 |
| White | 859 032 | 29.17 |
ICD10 Code assignment by clinical coders at King’s College Hospital
| Group | Unique patient count |
|---|---|
| I Certain infectious and parasitic diseases | 171 988 |
| II Neoplasms | 259 975 |
| III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | 72 939 |
| IV Endocrine, nutritional and metabolic diseases | 272 317 |
| IX Diseases of the circulatory system | 504 581 |
| V Mental and behavioural disorders | 706 990 |
| VI Diseases of the nervous system | 179 710 |
| VII Diseases of the eye and adnexa | 183 841 |
| VIII Diseases of the ear and mastoid process | 13 416 |
| X Diseases of the respiratory system | 242 282 |
| XI Diseases of the digestive system | 598 165 |
| XII Diseases of the skin and subcutaneous tissue | 131 227 |
| XIII Diseases of the musculoskeletal system and connective tissue | 343 803 |
| XIV Diseases of the genitourinary system | 212 198 |
| XIX Injury, poisoning and certain other consequences of external causes | 351 608 |
| XV Pregnancy, childbirth and the puerperium | 327 111 |
| XVI Certain conditions originating in the perinatal period | 78 541 |
| XVII Congenital malformations, deformations and chromosomal abnormalities | 104 242 |
| XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified | 513 384 |
| XX External causes of morbidity and mortality | 650 984 |
| XXI Factors influencing health status and contact with health services | 1 520 107 |
| XXII Codes for special purposes | 385 417 |
Performance of de-identification on simulated data
| Mutator type | True positives | False positives | False negatives | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| Character substitution (3%) | 8 191 | 538 | 391 | 93.9 | 95.5 |
| Character substitution (10%) | 7 740 | 447 | 826 | 94.6 | 90.4 |
| Character substitution (20%) | 6 969 | 271 | 1 537 | 96.3 | 82 |
| Address Alias Substitution | 8 171 | 486 | 455 | 94.4 | 94.8 |
| Address Token Removal | 2 761 | 99 | 237 | 96.6 | 92.1 |
| OCR (3% char. sub., 3% white space) | 8 464 | 160 | 1 555 | 98.2 | 84.5 |
| OCR (10% char. sub., 10% white space) | 5 327 | 180 | 7 282 | 96.8 | 42.3 |
| OCR (20% char. sub., 20% white space) | 1 802 | 151 | 14 719 | 92.3 | 11.0 |
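
As a reading aid for the table above, the sketch below recomputes precision and recall from the true positive, false positive and false negative counts using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name is hypothetical, and the source does not state how the published percentages were rounded, so recomputed values may differ from the table in the last decimal place.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) as percentages from raw counts."""
    precision = 100.0 * tp / (tp + fp)  # share of masked spans that were correct
    recall = 100.0 * tp / (tp + fn)     # share of true identifiers that were masked
    return precision, recall


# Counts taken from the "Character substitution (3%)" row of the table.
p, r = precision_recall(tp=8191, fp=538, fn=391)
print(f"precision={p:.2f}% recall={r:.2f}%")
```

For this row the recomputed values come to roughly 93.8% and 95.4%, within 0.1 of the published 93.9 and 95.5. Applied to the OCR rows, the same calculation shows that recall falls sharply as the simulated noise increases while precision stays above 90%.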