Francisco S Roque, Peter B Jensen, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, Lars J Jensen, Søren Brunak.
Abstract
Electronic patient records remain a rather unexplored, but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort dependent manner. By extracting phenotype information from the free-text in such records we demonstrate that we can extend the information contained in the structured record data, and use it for producing fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Disease ontology and is therefore in principle language independent. As a use case we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which subsequently can be mapped to systems biology frameworks.
Year: 2011 PMID: 21901084 PMCID: PMC3161904 DOI: 10.1371/journal.pcbi.1002141
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Precision of text-mining associations.
| | Incidence precision (#mining hits) | | | Association precision (#ICD10 codes) | | |
| Chapter | Correct | False | Precision | Correct | False | Precision |
| I | 7 | 10 | 41.18% | 7 | 6 | 53.85% |
| II | 0 | 1 | 0.00% | 0 | 1 | 0.00% |
| IV | 30 | 4 | 88.24% | 17 | 4 | 80.95% |
| V | 486 | 20 | 96.05% | 128 | 7 | 94.81% |
| VI | 124 | 16 | 88.57% | 46 | 9 | 83.64% |
| VII | 19 | 13 | 59.38% | 11 | 9 | 55.00% |
| IX | 26 | 11 | 70.27% | 13 | 5 | 72.22% |
| X | 78 | 11 | 87.64% | 36 | 4 | 90.00% |
| XI | 67 | 12 | 84.81% | 19 | 2 | 90.48% |
| XII | 73 | 10 | 87.95% | 29 | 9 | 76.32% |
| XIII | 57 | 2 | 96.61% | 17 | 2 | 89.47% |
| XIV | 12 | 2 | 85.71% | 6 | 1 | 85.71% |
| XVIII | 1234 | 115 | 91.48% | 252 | 53 | 82.62% |
| XIX | 141 | 101 | 58.26% | 36 | 8 | 81.82% |
| XX | 4 | 0 | 100.00% | 3 | 0 | 100.00% |
| XXI | 33 | 5 | 86.84% | 27 | 3 | 90.00% |
| All | 2391 | 333 | 87.78% | 647 | 123 | 84.03% |
Precision is the number of true positives divided by the sum of true and false positives. Incidence precision classifies every individual mining hit as either correct or false. In association precision each ICD10 code is counted only once per patient and is considered correct if at least one of the incidences of the code for that patient is correct. The final row contains the precision over all chapters.
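The two precision measures can be illustrated with a minimal sketch. The function names and the toy hit list below are hypothetical and not taken from the paper; only the three (correct, false) pairs are copied from the incidence columns of the table.

```python
from collections import defaultdict


def precision(correct: int, false: int) -> float:
    """Precision = true positives / (true positives + false positives)."""
    total = correct + false
    if total == 0:
        raise ValueError("precision is undefined when there are no hits")
    return correct / total


def association_counts(hits):
    """Collapse mining hits to one entry per (patient, ICD10 code).

    hits: iterable of (patient_id, icd10_code, is_correct) tuples.
    A (patient, code) pair counts as correct if at least one of its hits is correct.
    Returns (number of correct pairs, number of false pairs).
    """
    any_correct = defaultdict(bool)
    for patient, code, ok in hits:
        any_correct[(patient, code)] |= ok
    correct = sum(any_correct.values())
    return correct, len(any_correct) - correct


# (chapter, correct, false) copied from the incidence columns of the table
incidence_rows = [("V", 486, 20), ("XIX", 141, 101), ("All", 2391, 333)]
incidence_pct = {ch: round(100 * precision(c, f), 2) for ch, c, f in incidence_rows}

# Hypothetical toy hits, for illustration only
toy_hits = [
    ("p1", "F20", True),
    ("p1", "F20", False),
    ("p1", "J45", False),
    ("p2", "F20", True),
]
toy_incidence = precision(
    sum(ok for _, _, ok in toy_hits),
    sum(not ok for _, _, ok in toy_hits),
)
toy_association = precision(*association_counts(toy_hits))
```

In the toy data the two hits for patient p1 and code F20 collapse into a single correct association, so incidence and association precision differ for the same set of hits.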
Figure 1. Disease chapter networks.
ICD10 chapters are shown as nodes; links represent correlations. Link weight represents the correlation strength between two chapters; node area represents the proportion of codes from that chapter in the entire corpus. (A) Network based on the assigned codes for each patient. The most frequent chapter is chapter V, ‘Mental and behavioral disorders’, with a frequency of 81%. The strongest correlation is between chapters V and XXI, with a cosine similarity score of 0.45. Chapters IX, ‘Diseases of the circulatory system’, and IV, ‘Endocrine, nutritional and metabolic diseases’, have a score of 0.3. (B) Full network containing both the assigned and mined codes for all patients. Chapters V and XVIII have frequencies of 24% and 35%, respectively, and a score of 0.92. After mining, chapter X, ‘Diseases of the respiratory system’, and chapter XIX, ‘Injury, poisoning and certain other consequences of external causes’, now have cosine similarity scores of 0.6 and 0.78, respectively.
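As an illustration of the cosine similarity score used for the links, here is a minimal sketch. The caption does not state how the chapter vectors are built, so the binary per-patient encoding and the toy vectors below are assumptions, not the paper's data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Assumed encoding: one entry per patient, 1 if the patient has at least
# one code from the chapter, else 0. The values are hypothetical.
chapter_v = [1, 1, 1, 0, 1]
chapter_xxi = [1, 0, 1, 0, 0]

score = cosine_similarity(chapter_v, chapter_xxi)
```

With non-negative vectors the score lies between 0 (no shared patients) and 1 (identical profiles), which matches the range of the values quoted in the caption.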
Figure 2. Disease-disease correlations.
Heatmap of the most significant 100 ICD10 codes, based on ranking the list of 802 candidate pairs by their comorbidity scores. Chapter colors are highlighted next to the ICD10 codes. Diseases that occur often together have red color in the heatmap, while those with lower than expected co-occurrence are colored blue. The color label shows the log2 change of comorbidity between two diseases when compared to the expected level.
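The log2 scale of the heatmap can be sketched as follows. The caption does not say how the expected level is computed; the independence estimate used here (product of the two marginal counts divided by the number of patients) is an assumption, and the function name and counts are hypothetical.

```python
import math


def log2_comorbidity(n_both: int, n_a: int, n_b: int, n_patients: int) -> float:
    """log2 of observed over expected co-occurrence of two diagnoses.

    Assumption: the expected count is the independence estimate
    n_a * n_b / n_patients.
    """
    expected = n_a * n_b / n_patients
    if n_both == 0 or expected == 0:
        raise ValueError("log2 ratio is undefined for zero counts")
    return math.log2(n_both / expected)


# Hypothetical counts: 1000 patients, 100 with disease A, 200 with disease B,
# so 20 co-occurrences are expected under independence.
over = log2_comorbidity(n_both=40, n_a=100, n_b=200, n_patients=1000)
under = log2_comorbidity(n_both=10, n_a=100, n_b=200, n_patients=1000)
```

Positive values correspond to pairs that co-occur more often than expected (red in the heatmap) and negative values to pairs that co-occur less often than expected (blue).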
Figure 3. Patient cohort network.
(A) Nodes represent 1,497 patients from 26 clusters. Edges are correlations between patients. Node color denotes cluster membership. (B) Heatmap showing the ICD10 composition of each cluster. Values are the fraction of the cluster ICD10 vector covered by each code. Only the 26 ICD10 codes that most distinguish a cluster are shown. The heatmap columns match the network clusters in a counterclockwise direction starting at cluster 27.