Identifying and characterizing highly similar notes in big clinical note datasets.

Rodney A Gabriel¹, Tsung-Ting Kuo², Julian McAuley³, Chun-Nan Hsu².

Abstract

BACKGROUND: Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication.
METHODS: We used a three-phase approximation algorithm that minimizes pairwise comparisons: (1) MinHashing with Locality Sensitive Hashing; (2) clustering using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine output notes, or similar notes) via pairwise comparison of notes in each cluster. We used the Jaccard Similarity (JS) to measure similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III.
RESULTS: We analyzed 1,528,940 notes from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Across all JS thresholds, no clusters contained pairs of notes that were incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of the random pairs in the validation set whose JS was at least as high as the set threshold. Similar performance was noted when analyzing the MIMIC-III dataset.
CONCLUSIONS: We showed that both the EHR from our institution and the publicly available MIMIC-III dataset contained a substantial number of near-to-exact duplicated notes. Published by Elsevier Inc.
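
The three phases named in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-level shingling, the seeded BLAKE2b hashes standing in for random permutations, the band and hash counts, and every function and class name below are assumptions made for illustration. The sketch also checks the exact JS of each candidate pair before merging, which is a simplification and may differ from the order of operations in the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles of a note (k=3 is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard Similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed is passed as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Band the signatures; notes sharing any band bucket become candidates."""
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


class DisjointSet:
    """Tree-structured disjoint sets (union-find) with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster_notes(notes, threshold=0.7, num_hashes=64, bands=16):
    """Return clusters (lists of note indices) containing at least two notes."""
    sets = [shingles(n) for n in notes]
    sigs = [minhash_signature(s, num_hashes) for s in sets]
    ds = DisjointSet(len(notes))
    for i, j in lsh_candidate_pairs(sigs, bands):
        # Simplification: verify the exact JS before merging two notes.
        if jaccard(sets[i], sets[j]) >= threshold:
            ds.union(i, j)
    clusters = defaultdict(list)
    for i in range(len(notes)):
        clusters[ds.find(i)].append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

For example, calling cluster_notes on a list holding two identical notes and one unrelated note returns a single cluster with the indices of the two copies. Only pairs that share an LSH bucket are ever compared exactly, which is how this kind of scheme avoids comparing all pairs of notes.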

Keywords: De-duplication; Electronic medical record; Natural language processing

Year: 2018 PMID: 29679685 DOI: 10.1016/j.jbi.2018.04.009

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

Cited by

5 in total

1. RadBERT: Adapting Transformer-based Language Models to Radiology.

Authors: An Yan; Julian McAuley; Xing Lu; Jiang Du; Eric Y Chang; Amilcare Gentili; Chun-Nan Hsu
Journal: Radiol Artif Intell Date: 2022-06-15

2. CAS: corpus of clinical cases in French.

Authors: Natalia Grabar; Clément Dalloux; Vincent Claveau
Journal: J Biomed Semantics Date: 2020-08-06

3. A Year of Papers Using Biomedical Texts: Findings from the Section on Natural Language Processing of the IMIA Yearbook.

Authors: Natalia Grabar; Cyril Grouin
Journal: Yearb Med Inform Date: 2019-08-16

4. Predicting the Mortality of ICU Patients by Topic Model with Machine-Learning Techniques.

Authors: Chih-Chou Chiu; Chung-Min Wu; Te-Nien Chien; Ling-Jing Kao; Jiantai Timothy Qiu
Journal: Healthcare (Basel) Date: 2022-06-11

5. Impact of Different Approaches to Preparing Notes for Analysis With Natural Language Processing on the Performance of Prediction Models in Intensive Care.

Authors: Malini Mahendra; Yanting Luo; Hunter Mills; Gundolf Schenk; Atul J Butte; R Adams Dudley
Journal: Crit Care Explor Date: 2021-06-11
