Naima Oubenali, Sabrina Messaoud, Alexandre Filiot, Antoine Lamer, Paul Andrey.
Abstract
BACKGROUND: Analyzing the unstructured textual data contained in electronic health records (EHRs) has always been a challenging task. Word embedding methods have become an essential foundation for neural network-based approaches in natural language processing (NLP): they learn dense, low-dimensional word representations from large unlabeled corpora, and these representations capture the implicit semantics of words. Models such as Word2Vec, GloVe and FastText have been broadly applied and reviewed in the bioinformatics and healthcare fields, most often to embed clinical notes or activity and diagnostic codes. A subset of these works visualize the learned embeddings, for either exploratory or evaluation purposes. However, visualization practices tend to be heterogeneous and lack overall guidelines.
Keywords: Data mining; Deep learning; Medical; Natural language processing; Visualization; Word embeddings
Year: 2022 PMID: 35351120 PMCID: PMC8962592 DOI: 10.1186/s12911-022-01822-9
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Article selection process
Detailed information on the data used in the studies: sources, sizes, types, and visualized features
| Study | Algorithm/method | Training data | Test data (size) | Visualized data |
|---|---|---|---|---|
| Wang et al. | Word2Vec SG, Pretrained GloVe | Mayo Clinic clinical notes (n = 113 k patients: 103 k words, d = 100); MedLit articles from PubMed Central (n = 2 million words, d = 60); GloVe Wikipedia (n = 400 k embedded words, d = 100); Google News (n = 3 million embedded words, d = 300) | Pedersen, Hliaoutakis, MayoSRS, UMNSRS | 377 medical terms (symptoms and drugs) selected among all embedded ones |
| Shah et al. | Word2Vec SG | The Indiana University Health’s EHR system (n = 500 patients: 154 738 clinical notes, d = 300) | The Indiana University Health’s EHR system (NA) | Patient-wise historical list of diseases and symptoms (assigned to 50 embedding-based clusters) |
| Beaulieu et al. | Poincaré embeddings | Administrative claims database including diagnostic ICD-9 codes (n = 63 million patients, d = 200) | Administrative claims database including diagnostic billing ICD-9 codes (13.38 million patients) | All embedded ICD-9 diagnosis codes (223 million) |
| Dynomant et al. | Word2Vec SG, Word2Vec CBOW, FastText SG, FastText CBOW, GloVe | Rouen University Hospital data (641 279 documents); French medical paper abstracts from the LiSSa corpus (1.25 million abstracts) | Rouen University Hospital clinical data (607 135 health documents for the Word2Vec SG model) | All embedded words (50 066), with a focus on visually extracted sub-regions |
| De Freitas et al. | Word2Vec SG, FastText SG, GloVe | Medical concepts (diagnostics, procedures, lab tests and medication codes, plus selected words) extracted from structured EHRs and unstructured clinical notes from the Mount Sinai Health System (n = 2 208 741 patients: 49 234 medical concepts, d = 200) | De-identified EHRs from the MSHS data warehouse (validation: 4.5 million patients and 57 464 clinical concepts, testing: 1 608 741 patients) | All embedded medical concepts (57 464), with ten disease codes and their neighborhoods highlighted |
| Chen et al. | Word2Vec, GloVe, Dependency-based word embeddings | Wikipedia health-related articles (n = 322 339, d = 300) | Wikipedia general articles (NA), WordNet (9000 analogy questions), UMLS (33 000 analogy questions) | Relation terms of specific words and their top-10 nearest neighbors, shown in a 2-D reduction of the embedding space |
| El-Assady et al. | LDA, pre-trained embeddings from ConceptNet | Utterances from the 2nd Obama-Romney 2012 US presidential debate | NA | Words and their grouping into user- or model-defined concepts and topics |
NA is used when the information is not available
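Several of the visualized features listed above rely on the neighborhood of a term in the embedding space (for example, the top-10 nearest neighbors of selected words). The following is a minimal sketch of such a lookup using cosine similarity. The terms, the 3-dimensional vectors, and the helper names `cosine` and `nearest_neighbors` are hypothetical and chosen for illustration only; the reviewed studies do not necessarily use this similarity measure or this code.

```python
import math

# Toy 3-dimensional embeddings. Terms and values are hypothetical,
# not taken from any of the reviewed studies.
EMBEDDINGS = {
    "fever": [0.9, 0.1, 0.0],
    "cough": [0.8, 0.2, 0.1],
    "headache": [0.7, 0.3, 0.0],
    "aspirin": [0.1, 0.9, 0.2],
    "ibuprofen": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(term, embeddings, k=10):
    """Return up to k (term, similarity) pairs, most similar first."""
    query = embeddings[term]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != term  # a term is not its own neighbor
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for name, score in nearest_neighbors("fever", EMBEDDINGS, k=2):
        print(f"{name}: {score:.3f}")
```

With real embeddings, the dictionary would hold the trained vectors (d = 100 to 300 in the studies above), and an approximate nearest-neighbor index would usually replace the exhaustive scan.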
The different visualization methods and the visualization objectives
| Study | Visualization objective | Visualization method |
|---|---|---|
| Wang et al. | Exploration of "visual" clusters | t-SNE with focus on manually selected areas |
| Shah et al. | Synthesis of patients' medical history + soft validation of the clusters, and hence of the embeddings | ad hoc chronology of symptoms and diseases, organized visually based on the embeddings’ cluster assignments |
| Beaulieu et al. | Illustration of the embeddings’ consistency with the ICD-9 hierarchy | Plot of 2-D embeddings and hierarchy tree |
| Dynomant et al. | Exploration of "visual" clusters, presented as an evaluation of the embeddings | t-SNE with focus on manually selected areas |
| De Freitas et al. | Illustration of the embeddings and of some identified disease phenotypes | UMAP with neighborhoods around selected diagnostic codes highlighted |
| Chen et al. | Illustration of the embeddings and of some relations between terms used as part of the evaluation tasks | PCA with selected words highlighted, some with their neighborhoods (used for word-retrieval evaluation) |
| El-Assady et al. | Topic models’ exploration & user feedback gathering (part of the modeling process) | ad hoc t-SNE-based visualization of topics and related words grouped into concepts + interactive system to gather feedback and trigger model evolutions |
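All the methods in the table above reduce high-dimensional embeddings to two dimensions before plotting. As an illustration of the simplest of them, PCA, the sketch below projects a handful of toy vectors onto their first two principal components. It is written in plain Python and uses power iteration with deflation, a stand-in for the SVD-based solvers that numerical libraries normally provide. The input vectors and the function names `power_iteration` and `pca_2d` are hypothetical, and nothing here reproduces the pipeline of any reviewed study.

```python
import math


def power_iteration(matrix, steps=500):
    """Approximate the dominant unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    # Fixed, non-uniform start vector so the result is deterministic.
    v = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: keep the current vector
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project each row onto the first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    components = []
    for _ in range(2):
        v = power_iteration(cov)
        eigenvalue = sum(
            v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d)
        )
        components.append(v)
        # Deflation: remove the component just found from the matrix.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)
        ]
    return [
        [sum(c[j] * comp[j] for j in range(d)) for comp in components]
        for c in centered
    ]


if __name__ == "__main__":
    # Hypothetical vectors: three "symptom-like" rows, then two "drug-like" rows.
    vectors = [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.7, 0.3, 0.0],
        [0.1, 0.9, 0.2],
        [0.2, 0.8, 0.1],
    ]
    for x, y in pca_2d(vectors):
        print(f"{x:+.3f} {y:+.3f}")
```

The resulting 2-D coordinates can be passed to any plotting tool. Unlike PCA, t-SNE and UMAP are nonlinear and depend on hyperparameters and random initialization, so their plots call for more caution when read as evidence of cluster structure.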