
Using word embeddings to expand terminology of dietary supplements on clinical notes.

Yadan Fan1, Serguei Pakhomov1,2, Reed McEwan3, Wendi Zhao1, Elizabeth Lindemann4, Rui Zhang1,2.   

Abstract

OBJECTIVE: The objective of this study is to demonstrate the feasibility of applying word embeddings to expand the terminology of dietary supplements (DS) using over 26 million clinical notes.
METHODS: Word embedding models (ie, word2vec and GloVe) trained on clinical notes were used to generate a list of the top 40 semantically related terms for each of 14 commonly used DS. Each list was further evaluated by experts to identify semantically similar terms. We investigated the effect of corpus size and other settings (ie, vector size and window size), as well as the choice between the 2 word embedding models, on performance for DS term expansion. We compared the number of clinical notes (and patients they represent) retrieved using the word embedding expanded terms with the numbers retrieved using both the baseline terms and the terms expanded from external DS sources.
RESULTS: Using the word embedding models trained on clinical notes, we could identify 1-12 semantically similar terms for each DS. Using the word embedding expanded terms, we were able to retrieve on average 8.39% more clinical notes and 11.68% more patients for each DS compared with the 2 comparison sets of terms. Increasing the corpus size resulted in more misspellings, but not more semantic variants or brand names. The word2vec model was also found to be more capable of detecting semantically similar terms than GloVe.
CONCLUSION: Our study demonstrates the utility of word embeddings on clinical notes for terminology expansion on 14 DS. We propose that this method can be potentially applied to create a DS vocabulary for downstream applications, such as information extraction.


Keywords:  clinical notes; dietary supplements; natural language processing; terminology expansion; word embeddings

Year:  2019        PMID: 31825016      PMCID: PMC6904105          DOI: 10.1093/jamiaopen/ooz007

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


INTRODUCTION

The safety of dietary supplements (DS) has received increasing attention in recent years due to evidence showing that DS can cause adverse events, leading to potentially dangerous clinical outcomes. Results from an annual survey on DS by the Council for Responsible Nutrition (CRN) revealed that 76% of US adults took DS in 2017, an increase of 5% compared with 2016. The current postmarketing surveillance relies on voluntarily submitted reports of suspected adverse events caused by DS. This reporting scheme often underestimates adverse events, since only a fraction of severe events (eg, death) are reported. Although the National Health and Nutrition Examination Survey (NHANES) has reported DS use on the population level, there remains a critical need to investigate their use on the individual level. Such information is critical for better understanding the effects of supplement use with coadministered medications and attendant adverse events. Moreover, the inherent limitations of both voluntary reporting and clinical trials have created an imperative need for complementary data sources and data-driven methods for automatic identification and detection. Electronic health record (EHR) data, especially clinical notes, offer a potentially effective data source for active pharmacovigilance on DS. One main advantage of EHR data is the availability of comprehensive clinical information obtained during the course of care, especially information related to patient safety that is extensively documented in clinical notes, such as signs and symptoms. Analyzing clinical notes provides a promising approach for assessing DS use on the individual level, which can further facilitate DS safety research and clinical decision support. However, one main obstacle surrounding the secondary use of EHR data is the lack of standardized terminology for DS.
Furthermore, a biomedical terminology such as RxNorm usually fails to cover the various expressions of DS in clinical notes, including misspellings, brand names, and other lexical variants. Domain-specific terminology plays a significant role in a variety of applications. To facilitate the meaningful use of EHR data for improving patient safety in terms of DS consumption, it is vital to understand how DS are represented in the EHR, namely to gain insights into the syntactic and semantic variability of DS in clinical notes. A DS terminology developed on EHR data is critical for identifying the DS use status of patients, which is beneficial for subsequent DS safety research and the development of clinical decision support systems. Additionally, a comprehensive DS terminology based on EHR data can further contribute to identifying, both accurately and thoroughly, patients who meet DS consumption criteria for placement in clinical trials. This has been demonstrated by the 2018 shared tasks of the National Natural Language Processing (NLP) Clinical Challenges (n2c2), one aim of which was to determine whether a patient has used DS (excluding Vitamin D) in the past 2 months. Due to the nature of clinical natural language, the names of DS in clinical notes often have tremendous syntactic and semantic variability. Existing terminologies such as the Unified Medical Language System (UMLS) have a low level of coverage for DS variants. Although there are databases representing DS (eg, Natural Medicines Comprehensive Database), these syntactic and semantic variants are usually outside the scope of the databases. In addition, because DS form a very specific subdomain language in medicine, a comprehensive terminology for DS does not exist.
Therefore, the method to efficiently explore the semantic variants, brand names, and misspellings of DS is required for a number of downstream applications, such as information extraction through natural language processing techniques, which will serve as an initial step for future DS safety surveillance systems. Generally, there are two classes of methods used to expand semantically similar terms based on word similarity. One is a thesaurus-based method, such as measuring the similarity between two senses defined by a thesaurus like MeSH or SNOMED-CT. The limitation of this method is that thesauri might be missing new words or may not be available in every language or sublanguage. The other method is based on the distributional semantics, in which the word similarity is estimated based on the distributions of the words in the corpus. Distributional semantics makes the assumption that words with similar meanings tend to occur in similar contexts. Distributional methods, including spatial and probabilistic models, have been applied to estimate the semantic similarity between two medical terms. To capture the word similarity, vector models, such as co-occurrence vector using some weighting functions including pointwise mutual information (PMI), are most commonly used. However, such representation methods often suffer from the limitation that they are high-dimensional, which requires a large amount of storage. Another problem is that the matrix has sparsity issues, making the subsequent machine learning models less robust and generalizable. Word embedding models have been shown to be able to reveal hidden semantic relationships between words, such as similarity or relatedness. The concept of “word embedding,” as defined by Bengio et al in 2003, refers to the representations for words occupying a real-valued low-dimensional and dense vector space where the similarity between words is measured by cosine similarity. 
Compared with traditional distributional semantics models, word embedding models are more efficient and scalable since they can be trained on a large amount of unannotated data. Word2vec and GloVe are two popular word embedding models. Word2vec and GloVe train the word vectors in different ways, and very few studies have investigated the advantage of one model over the other. In the clinical domain, word embedding models have been applied to a variety of NLP tasks, such as named entity recognition and clinical text classification. Pretrained word vectors are often used as input features for such tasks. Nguyen et al utilized word2vec to discover the variants of adverse drug reaction terms in social media data. The results of this study showed that the lexicon expanded by word2vec can improve the performance of using social media data to capture the prevalence of adverse events. Bethany et al applied word2vec for automatic lexicon expansion of radiology terms with promising results. Pakhomov et al evaluated word2vec on a document retrieval task; the results showed that queries expanded with semantically similar phrases could identify more patients with heart disease. Wang et al evaluated word embeddings in an information retrieval task by expanding the search query with the five most similar terms from the word embeddings. Currently, no prior study has investigated the effects of the corpus size for the word embeddings on the performance of NLP tasks. Based on the theoretical ground of distributional semantics, we hypothesized that word embedding models can be used to detect semantically or syntactically similar terms for DS in clinical notes. Thus, the objective of this study is to use word embeddings to expand the terminology of DS from clinical notes.
Specifically, we evaluate the effects of various settings (eg, corpus size, window size, and vector size) of word embedding models, and compare the performance of different word embedding models (ie, word2vec and GloVe) on the task of expanding DS terminology in clinical notes.

METHODS

Study design

The study was carried out in three steps outlined as follows: (1) collecting and preprocessing clinical notes; (2) training word vectors using two word embedding models (ie, word2vec and GloVe) and experimenting on the different settings with respect to corpus size, window size, vector size, and the type of vectors (ie, CBOW, skip-gram); (3) conducting both intrinsic and extrinsic evaluations. The overview and workflow of the method is shown in Figure 1.
Figure 1.

The overview and workflow of the method. EHR: electronic health record.


Data collection and preprocessing

Clinical notes from April 2015 to December 2016 were collected from clinical data repository (CDR) at the University of Minnesota Medical Center. The CDR houses the EHR of patients seeking healthcare at 8 hospitals and over 40 clinics. The CDR contains 130 million clinical notes of over 2 million patients. Institutional review board (IRB) approval was obtained for accessing the clinical notes. The collected corpus went through minimal preprocessing work including punctuation removal and lowercasing. All the notes were compiled as a single text file with all the words separated by a single space for subsequent model training.
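The minimal preprocessing described above can be sketched as follows. This is an illustration only, not the authors' script: the function names are hypothetical, and it assumes that punctuation means ASCII punctuation and that it is deleted rather than replaced by a space (the text does not say which).

```python
import string

# Translation table that deletes every ASCII punctuation character
# (assumption: the paper does not specify the exact punctuation set).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_note(text: str) -> str:
    """Lowercase a note, delete punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())


def compile_corpus(notes):
    """Join preprocessed notes into one string of space-separated words."""
    processed = (preprocess_note(note) for note in notes)
    return " ".join(p for p in processed if p)


# Two example sentences taken from Table 4.
notes = [
    "Increase calicum carb (tums) to 3 times a day.",
    "Stop Citracal but continue vitamin D.",
]
corpus = compile_corpus(notes)
print(corpus)
```

One consequence of deleting punctuation outright is that tokens such as "600/200" collapse into "600200"; replacing punctuation with a space would be an equally plausible reading of the description.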

Model training and parameter tuning

In this study, we first applied word2vec to generate word vectors for preprocessed corpora of different sizes with the default parameter settings (ie, CBOW, window size of 8, and vector size of 200). Specifically, starting with the first 3 months of clinical notes (April to June 2015), we increased the corpus size in 3-month increments. Thus, we obtained 7 corpora with time spans of 3, 6, 9, 12, 15, 18, and 21 months. Seven word2vec models were then trained on these 7 corpora. By inputting the name (eg, “garlic”) of each of the 14 DS into these trained word2vec models, we obtained from each model a ranked list of 40 semantically related terms for each DS. Based on the human annotations (details described below), we investigated how the change in corpus size affects the number of the various semantically similar terms. Once the optimal corpus size was determined based on the human evaluation of the top 40 terms, we investigated different parameter settings for the window size (ie, 4, 6, 8, 10, and 12) and the vector size (ie, 100, 150, 200, and 250) on the optimal sized corpus. We also trained the word2vec skip-gram model on the corpus with the optimal size. The threshold for subsampling was set to 1e−4. The number of threads was set to 20 and the number of iterations to 25. In addition, to compare the performance of the GloVe model with that of the word2vec model, we trained the GloVe model on the same optimal sized corpus used to train the word2vec model. Different parameter settings were also tested, including the vector size (ie, 50, 100, 150, and 200) and the window size (ie, 8 and 15). For both models, the optimal parameters were chosen based on the number of semantically similar terms annotated by the human experts.
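The construction of the 7 cumulative corpora can be made concrete with a short sketch. It only computes the date range covered by each corpus; the helper name `add_months` and the half-open interval convention are assumptions made for illustration, not details given by the authors.

```python
from datetime import date


def add_months(first_of_month: date, months: int) -> date:
    """Return the first day of the month `months` after `first_of_month`."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


START = date(2015, 4, 1)  # first month of collected notes (April 2015)
STEP_MONTHS = 3           # each corpus adds 3 more months of notes
N_CORPORA = 7

# Each corpus covers the half-open interval [START, end_exclusive).
corpora = []
for k in range(1, N_CORPORA + 1):
    span = k * STEP_MONTHS
    corpora.append((span, START, add_months(START, span)))

for span, start, end_exclusive in corpora:
    print(f"{span:2d} months: {start:%Y-%m} up to (not including) {end_exclusive:%Y-%m}")
```

The last interval ends before January 2017, which matches the stated collection period of April 2015 to December 2016.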

Annotation and intrinsic evaluation

Fourteen commonly used DS were chosen for evaluation based on online survey results and peer-reviewed publications: calcium, chamomile, cranberry, dandelion, flaxseed, garlic, ginger, ginkgo, ginseng, glucosamine, lavender, melatonin, turmeric, and valerian. For each DS name used as input, the trained word2vec model returned a list of the 40 top-ranked semantically related terms with their cosine similarity scores. Similarly, we applied the cosine similarity measure to the word embeddings obtained by GloVe to generate a list of the 40 top-ranked semantically related terms for each of the 14 DS. Two experts with both clinical and informatics backgrounds independently annotated the lists, using their judgment to identify the semantically similar terms. Annotation guidelines were first created to classify terms on the list into four categories: semantic variants, brand names, misspellings, and irrelevant terms. Disagreements were settled by discussion and further judged by another informatics expert. The interannotator agreement was calculated using Cohen’s kappa. We used the expert-curated terms as the gold standard to intrinsically evaluate the mean average precision (MAP) of the returned 40 top-ranked terms for each of the 14 DS (560 terms in total). We compared the performance of word2vec and GloVe using the MAP score and the number of semantically similar terms annotated by human experts.
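Producing a top-ranked list by cosine similarity can be sketched in a few lines. The vectors below are made-up 3-dimensional values chosen purely for illustration (the study used vectors of size 50 to 250 trained on clinical notes), and `most_similar` is a hypothetical helper, not a call into the word2vec or GloVe tools.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=40):
    """Rank every other vocabulary word by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy vectors; real embeddings would come from a trained model.
toy_vectors = {
    "turmeric": [0.9, 0.1, 0.0],
    "tumeric": [0.8, 0.2, 0.0],
    "curcumin": [0.7, 0.1, 0.3],
    "lasix": [0.0, 0.1, 0.9],
}

for word, score in most_similar("turmeric", toy_vectors, topn=3):
    print(f"{word}\t{score:.3f}")
```

With these toy values the misspelling "tumeric" ranks first and the unrelated drug name last, which mirrors the kind of list the annotators reviewed.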

Extrinsic evaluation (note identification)

We combined the terms identified by both word2vec and GloVe and applied them in two notes identification tasks using NLP-PIER (Patient Information Extraction for Research), a tool developed by the NLP-IE group at the University of Minnesota specifically for indexing the collection of clinical notes used in this study. PIER allows researchers to input keywords to easily access the clinical notes. However, simple keyword searching for DS is often not effective. For example, a keyword of “Vitamin C” in identifying patients taking vitamin C is insufficient without considering its semantically similar terms such as “ascorbic acid” and “Vit C,” which are well-represented in clinical notes. Therefore, we evaluated the effectiveness of our expanded DS terms through notes identification task. Specifically, for querying clinical notes, we compared these terms with two sets of baseline terms: (1) a single DS term for each of 14 DS; (2) a set of expanded terms using only the external DS knowledge bases. Since this query expansion is not involved in an IR system, no relevance related to the identified notes is evaluated. We described the experiments in the following two tasks.

Task 1: Comparing performance of the word embedding expanded queries with the baseline queries

For each DS, the baseline query (using only a single DS term) was used to identify the clinical notes through NLP-PIER. We call query terms identified by the two word embedding models and human experts as “word embedding expanded terms.” The word embedding expanded terms were augmented with the baseline term for query expansion. The expanded queries were used to identify the notes for each DS. The number of the distinct clinical notes and patients were counted for both baseline queries and word embedding expanded queries. The number of additional notes and patients found by expanded queries and percentage increase were calculated.
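The counting in this task can be illustrated with a small sketch. It assumes plain case-insensitive substring matching, which is a simplification and not a description of how NLP-PIER executes queries. The note and patient identifiers are invented, the first three sentences come from Table 4, and the fourth sentence is made up so that the baseline term has a match.

```python
def matching_notes(notes, terms):
    """Return the notes whose text contains any query term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return [n for n in notes if any(t in n["text"].lower() for t in lowered)]


def count_notes_and_patients(notes, terms):
    """Count distinct notes and distinct patients matched by the query terms."""
    hits = matching_notes(notes, terms)
    return len({n["note_id"] for n in hits}), len({n["patient_id"] for n in hits})


def percentage_increase(base, expanded):
    """Gain of the expanded count over the base count, in percent."""
    return 100.0 * (expanded - base) / base


notes = [
    {"note_id": 1, "patient_id": "A", "text": "Try valarian root for sleep."},
    {"note_id": 2, "patient_id": "A", "text": "Falling asleep better with valarian."},
    {"note_id": 3, "patient_id": "B", "text": "Take the volarian root every night for a few weeks."},
    {"note_id": 4, "patient_id": "C", "text": "Continue valerian at bedtime."},  # invented
]

base_terms = ["valerian"]
expanded_terms = base_terms + ["valarian", "volarian"]

base_counts = count_notes_and_patients(notes, base_terms)
expanded_counts = count_notes_and_patients(notes, expanded_terms)
print("baseline (notes, patients):", base_counts)
print("expanded (notes, patients):", expanded_counts)

# Same formula applied to the chamomile note counts reported in Table 2.
print(round(percentage_increase(5221, 6120), 2))
```

Counting patients separately from notes matters because several matching notes can belong to the same patient, as with patient "A" in the toy data.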

Task 2: Comparing performance of the word embedding expanded queries with the queries expanded using external DS knowledge sources

We further compared the performance of the word embedding expanded queries with queries based on 2 external knowledge sources including Natural Medicines Comprehensive Database (NMCD) and Dietary Supplement Label Database (DSLD). NMCD, managed by the therapeutic research center, is one of the most comprehensive and reliable natural medicine resources. For each product, the database provides 15 categories of information including comprehensive other names the product is known by. DSLD is created and managed by the Office of Dietary Supplements (ODS) and National Library of Medicine (NLM) at the National Institutes of Health. DSLD provides users the access to the full label derived information from DS products marketed in the United States. DSLD also provides a list of alternate names or synonyms for the ingredients. For each selected DS, two domain experts manually reviewed the information on other names available on NMCD and DSLD to be used in the search queries. The names were restricted to English and Latin names and the names used to be sold in the US market. We used the word embedding expanded queries and external source expanded queries to identify clinical notes through NLP-PIER and compared the number of identified clinical notes and patients. Similar to task 1, the number of additional notes and patients found by expanded queries and percentage increase were calculated.

RESULTS

A total of 26 531 085 clinical notes containing 66 214 049 847 tokens were used to train the word embedding models in this study. The vocabulary size is 635 176. The Cohen’s kappa score between the two annotators was 0.869, which indicates high reliability. The number of semantically similar terms identified by word2vec and human annotators for the 14 DS, based on the 40 top-ranked terms from corpora of varied sizes, is shown in Table 1. The MAP scores for the 7 corpora are also shown in this table. The general trend shows that as the corpus size (and vocabulary size) increases, the total number of semantically similar terms annotated by human experts from the 40 top-ranked terms increases. As the corpus size increased, more misspellings were found within the top 40 terms, but the numbers of semantic variants and brand names peaked when the corpora were created using 6 months’ and 12 months’ notes, respectively. However, we found that the terms obtained from corpora of different sizes partly overlap, while each corpus also contributes some new terms. To include more semantically similar terms, we chose to use all the available notes (21 months) to train the final word embedding models and tuned the hyperparameters. We trained CBOW and skip-gram with the default parameter settings. We found that the words returned by CBOW and skip-gram were the same, so we used CBOW in the final model training. After the hyperparameter tuning, the optimal window size was set to 8 and the optimal vector size to 200 for the word2vec CBOW model. For the GloVe model, we tried different parameters, and the optimal window size was also set to 8 and the optimal vector size to 200.
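For readers unfamiliar with the agreement statistic reported above, the following sketch computes Cohen's kappa for two annotators using the four annotation categories. The labels are toy data invented for illustration, not the study's annotations, and the function is a plain implementation of the standard formula rather than the authors' code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators use a single label.
    return (observed - expected) / (1 - expected)


# Invented labels for 8 candidate terms.
annotator_1 = ["variant", "brand", "misspelling", "irrelevant",
               "irrelevant", "irrelevant", "misspelling", "irrelevant"]
annotator_2 = ["variant", "brand", "misspelling", "irrelevant",
               "irrelevant", "misspelling", "misspelling", "irrelevant"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In the toy data the annotators agree on 7 of 8 items, yet kappa is lower than 0.875 because part of that agreement is expected by chance.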
Table 1.

The number of semantically similar terms identified by human experts based on the 40 top-ranked terms by word2vec for each of the 14 DS from 7 corpora

Time span of clinical notes | 3 months | 6 months | 9 months | 12 months | 15 months | 18 months | 21 months
Vocabulary size | 214 948 | 312 557 | 388 891 | 454 459 | 520 127 | 577 362 | 635 176
Semantic variants | 12 | 14 | 13 | 13 | 11 | 10 | 9
Brand names | 7 | 9 | 8 | 9 | 6 | 7 | 5
Misspellings | 4 | 8 | 10 | 14 | 13 | 14 | 21
Total | 23 | 31 | 31 | 36 | 30 | 31 | 35
MAP | 0.313 | 0.294 | 0.356 | 0.247 | 0.242 | 0.280 | 0.263

MAP: mean average precision; DS: dietary supplements.
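The MAP values can be read with the following sketch in mind. It uses a common definition of average precision (the mean of precision at each rank that holds a relevant term); the exact variant the authors used is not stated, so this is an assumption. The ranked lists and gold sets are invented toy examples.

```python
def average_precision(ranked_terms, relevant):
    """Mean of precision@k over the ranks k that hold a relevant term."""
    hits = 0
    precisions = []
    for k, term in enumerate(ranked_terms, start=1):
        if term in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_lists, gold):
    """Average the per-query AP; `gold` maps each query to its relevant terms."""
    aps = [average_precision(terms, gold[q]) for q, terms in ranked_lists.items()]
    return sum(aps) / len(aps)


# Invented rankings (the study evaluated 40 terms per DS, 14 DS).
ranked_lists = {
    "turmeric": ["tumeric", "ginger", "curcumin", "cumin"],
    "valerian": ["melatonin", "valarian"],
}
gold = {
    "turmeric": {"tumeric", "curcumin"},
    "valerian": {"valarian"},
}

print(round(mean_average_precision(ranked_lists, gold), 3))
```

Under this definition a list scores higher when its expert-accepted terms sit near the top, so a corpus can yield more accepted terms in total while still having a lower MAP.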

The word embedding expanded terms (semantic variants, brand names, and misspellings) for the 14 DS are shown in Supplementary Table S1. In total, the word2vec model detected 35 semantically similar terms for the 14 DS. For cranberry, semantic variants, brand names, and misspellings were all detected. The word2vec model identified various misspellings for DS such as calcium and glucosamine. It also detected several brand names for DS that are commonly purchased over the counter, such as calcium. For some DS, such as calcium, lavender, and ginkgo, the expert-annotated terms appear in the top 10 words on the returned list. The MAP score for expanding DS terms using word2vec is 0.263. A total of 17 semantically similar terms were identified by GloVe and human annotators. Compared with the word2vec model, the GloVe model is less capable of detecting misspellings, as only two misspellings were found by GloVe. For lavender and ginger, GloVe found semantic variants that the word2vec model failed to detect. The MAP score for expanding DS terms using GloVe is 0.236, which is close to that for the word2vec generated terms. We further applied the word embedding expanded terms in two clinical note identification tasks. The results of the comparison between the baseline and word embedding expanded queries in terms of the number of notes and the number of distinct patients are shown in Table 2. For all the DS, the word embedding expanded queries identified more notes and more distinct patients, with increases ranging from 14 to 93 308 notes and from 5 to 20 086 patients. For ginger and dandelion, the increase is relatively small.
However, for ginkgo and turmeric, the inclusion of semantic variants, brand names, and misspellings increased the number of identified notes and patients by a large amount. For glucosamine and valerian, combining the baseline term with only the detected misspellings led to an increase in the number of notes, indicating that misspellings have great value in identifying patients taking DS.
Table 2.

Results of comparison between word embedding expanded queries and baseline queries (task 1) for 14 dietary supplements

Dietary supplement | Number of word embedding expanded terms | Clinical notes: base query | Clinical notes: word embedding query | Clinical notes: additional records found | Clinical notes: percentage increase (%) | Patients: base query | Patients: word embedding query | Patients: additional patients found | Patients: percentage increase (%)
Calcium | 12 | 7 450 261 | 7 543 569 | 93 308 | 1.25 | 1 000 561 | 1 002 211 | 1650 | 0.16
Chamomile | 3 | 5221 | 6120 | 899 | 17.22 | 3504 | 4146 | 642 | 18.32
Cranberry | 3 | 196 862 | 198 625 | 1763 | 0.90 | 76 664 | 77 327 | 663 | 0.86
Dandelion | 2 | 4468 | 4564 | 96 | 2.15 | 2377 | 2419 | 42 | 1.77
Flaxseed | 2 | 104 007 | 169 343 | 65 336 | 62.82 | 25 136 | 45 222 | 20 086 | 79.91
Garlic | 1 | 92 803 | 93 941 | 1138 | 1.23 | 31 273 | 31 400 | 127 | 0.41
Ginger | 1 | 96 438 | 96 452 | 14 | 0.01 | 59 693 | 59 698 | 5 | 0.01
Ginkgo | 3 | 20 259 | 28 093 | 7834 | 38.67 | 5854 | 7791 | 1937 | 33.09
Ginseng | 2 | 9926 | 11 277 | 1351 | 13.61 | 4023 | 4469 | 446 | 11.09
Glucosamine | 5 | 466 617 | 467 758 | 1141 | 0.24 | 70 842 | 70 938 | 96 | 0.14
Lavender | 3 | 18 793 | 20 667 | 1874 | 9.97 | 11 855 | 13 011 | 1156 | 9.75
Melatonin | 1 | 753 511 | 753 753 | 242 | 0.03 | 118 846 | 118 896 | 50 | 0.04
Turmeric | 3 | 33 573 | 48 749 | 15 176 | 45.20 | 8379 | 13 486 | 5107 | 60.95
Valerian | 2 | 15 883 | 16 219 | 336 | 2.12 | 7051 | 7207 | 156 | 2.21
The word embedding expanded terms and terms from two external DS databases are shown in Supplementary Table S2. The numbers of clinical notes and patients found by word embedding expanded queries and external source queries are shown in Table 3. Compared with the external source queries, the word embedding expanded queries found more clinical notes for most of the 14 DS, the exceptions being chamomile, flaxseed, and ginger. The terms from the two external sources are mainly scientific names or other names of DS. Even though DSLD contains some brand names for DS sold in the US market, its coverage of brand names is incomplete. Our finding demonstrates that the terms identified by word embedding models capture the semantic variants of DS in clinical notes well and also include some brand names and misspellings that the external sources failed to cover. On the other hand, for chamomile, flaxseed, and ginger, the fact that the external source queries found a larger number of clinical notes indicates that the external resources can be a good complementary source of DS terminology, especially in terms of scientific names.
Table 3.

Results of comparison between word embedding expanded queries and external source expanded queries (task 2) for 14 dietary supplements

Dietary supplement | Number of external source terms | Number of word embedding expanded terms | Clinical notes: external source query | Clinical notes: word embedding query | Clinical notes: additional records found | Clinical notes: percentage increase (%) | Patients: external source query | Patients: word embedding query | Patients: additional patients found | Patients: percentage increase (%)
Calcium | 15 | 12 | 7 453 873 | 7 543 569 | 89 696 | 1.20 | 1 000 906 | 1 002 211 | 1305 | 0.13
Chamomile | 5 | 3 | 6193 | 6120 | −73 | −1.18 | 4243 | 4146 | −97 | −2.29
Cranberry | 21 | 3 | 196 944 | 198 625 | 1681 | 0.85 | 76 697 | 77 327 | 630 | 0.82
Dandelion | 15 | 2 | 4509 | 4564 | 55 | 1.22 | 2383 | 2419 | 36 | 1.51
Flaxseed | 10 | 2 | 169 349 | 169 343 | −6 | 0.00 | 45 229 | 45 222 | −7 | −0.02
Garlic | 6 | 1 | 92 913 | 93 941 | 1028 | 1.11 | 31 328 | 31 400 | 72 | 0.23
Ginger | 15 | 1 | 96 499 | 96 452 | −47 | −0.05 | 59 719 | 59 698 | −21 | −0.04
Ginkgo | 6 | 3 | 20 275 | 28 093 | 7818 | 38.56 | 5855 | 7791 | 1936 | 33.07
Ginseng | 21 | 2 | 10 158 | 11 277 | 1119 | 11.02 | 4151 | 4469 | 318 | 7.66
Glucosamine | 7 | 5 | 466 617 | 467 758 | 1141 | 0.24 | 70 842 | 70 938 | 96 | 0.14
Lavender | 5 | 3 | 18 798 | 20 667 | 1869 | 9.94 | 11 856 | 13 011 | 1155 | 9.74
Melatonin | 3 | 1 | 753 513 | 753 753 | 240 | 0.03 | 118 847 | 118 896 | 49 | 0.04
Turmeric | 18 | 3 | 35 719 | 48 749 | 13 030 | 36.48 | 8962 | 13 486 | 4524 | 50.48
Valerian | 10 | 2 | 15 886 | 16 219 | 333 | 2.10 | 7051 | 7207 | 156 | 2.21
Selected example sentences mentioning the semantic variants, brand names, and misspellings for DS are shown in Table 4.
Table 4.

Selected example sentences with mentions of semantic variants, brand names, and misspellings for dietary supplements

Dietary supplements | Examples
Calcium

Increase calicum carb (tums) to 3 times a day.

Stop Citracal but continue vitamin D.

Patient was taking Calcarb D 600/200.

I stopped the Oysco, and put in Rx for cholecalciferol for her.

Chamomile

Recommend chamomile tea for sleep.

A product called No Jet Lag contains homeopathic remedies leopard’s bane (Arnica montana), daisy (Bellis perennis), and wild chamomile (Matricaria chamomilla).

She will try the camomille.

Cranberry

Continue to increase fluids and cran juice.

Restart the methenamine and Ellura a couple of days before you complete your course of atibiotics.

She started craberry tabs.

Dandelion

Ok to take dandilion root but needs to keep taking Lasix and needs follow up appt.

He is taking some dandilion for its potassium sparing effects as well.

Flaxseed

Start flax seed oil 1000 mg daily.

She should stop fish oil and start flaxseed.

You may try linseed for constipation.

Garlic

Pt states she is going to try “Garlique” for 6 months.

She is on Garlique.

Ginger

Zingiber officinale rhizome is also known as ginger.

Ginkgo

Okay to start gingko.

Can begin multivitamin and ginko and calcium now.

She had been taking Ginkoba and Vitamin C but she stopped taking them.

Ginseng

Sent my chart message telling her to discontinue the ashwagandha.

Pt states he takes ginsing and has for a couple of years.

Glucosamine

Questions about discontinuing glucosomine.

Please ask her to resume arimidex and us OTC glucosmaine prn for acheness.

Recommended medication clucosamine and eye drops for allergies.

She would like to take glucosame, fish oil, and folic acid.

Lavender

She could try melatonin or lavendar and ginger scents to help you relax and decrease your nausea.

She used lavander oil and super glue on it.

Ok to add a few drops of essential oil to of lavender (Lavandula angustifolia) in milk.

Melatonin

Patient is still having problems sleeping even while taking the melotonin.

Patient wants to know if it’s okay to take melotonin and if she can have an RX for this medication.

She is not sleeping well even on the melotonin.

Turmeric

Stop her curcumin and fenugreek.

Pt is allergic to tumeric.

I would recommend not starting tumeric at this time.

Valerian

Try valarian root for sleep.

Falling asleep better with valarian.

Take the volarian root every night for a few weeks.


DISCUSSION

Accessing information on DS in clinical notes can help us to understand its use on the individual level and related safety problems. Without a standard terminology, our ability is very limited to identify comprehensive information on DS in clinical notes, which might lead to biased knowledge. In this study, we attempted to apply word embedding models to overcome this limitation and tried to generate relatively comprehensive terms for commonly used DS. We trained two word embedding models on clinical notes to detect and identify semantically similar terms for DS. The terms identified by word embedding models and human experts were applied in two clinical note identification tasks for further evaluation. Our results support the hypothesis that semantic variants, brand names, and misspellings of DS appear in similar context in our clinical note corpus and that applying the word embedding models based on distributional semantics can help detect such syntactic and semantic variants. We conducted a set of comprehensive experiments on the corpus size and hyperparameters. We found out that when the corpus size is small, a relatively small number of semantically similar terms were found. Another finding is that a larger corpus can only help detect more misspellings. Unfortunately, continuously increasing the corpus size cannot generate more semantic variants and brand names. However, the limitation is that we only evaluated the 40 top-ranked terms. In the future, we could potentially extend to evaluate more terms. Our future work will also include investigating new ranking systems. We also evaluated some hyperparameters, including window size and vector size. We tested 5 values of the window size and 4 values of the vector size. 
We found that these 2 parameters have a large impact on model performance and that default settings should be used with caution, especially for the GloVe model, which failed to generate any valuable semantically similar words when the default settings were applied. One limitation is that we did not test other parameters, such as the number of iterations and the number of negative samples, which might also affect model performance. For CBOW and skip-gram, the available evidence on which model performs better was limited and inconclusive. We tested both models and found that they did not differ in this term expansion task.

When comparing word2vec and GloVe, we found that the GloVe model is more efficient than word2vec. However, the 2 models differ in how they train word vectors: word2vec trains the vectors from contextual information with a predictive method, whereas GloVe trains them by constructing a co-occurrence matrix from global information with a “count-based” method. The resulting word vectors therefore also differ. We found that the word2vec model performs better in this word similarity task and, in particular, is more capable of detecting misspellings.

When reviewing the word lists returned by the trained word embedding models, we found that the lists for some DS contain variants of other DS. For example, “ginkgo” appeared in the word list for ginseng. We believe this is because DS share very similar contexts and expression patterns. We also found that the lists for some DS contain related diseases, symptoms, and medications with similar pharmacological effects. For example, the list of terms for “melatonin” contains the related symptom “insomnia” as well as the brand name “Lunesta” and its corresponding generic name “Eszopiclone,” a commonly prescribed medication for insomnia.
This finding also demonstrates that the words in the list cannot be added indiscriminately as search terms, since a varying number of false positives might be introduced into the query results. Human annotation is necessary to exclude the false-positive terms.

There are several limitations in this study. We tested only one-word DS terms. In the future, we will apply this method to multiword DS terms for further investigation and evaluation. Additionally, we focused only on comparing word embedding models on the task of DS terminology development. In future work, we will explore other count-based methods (eg, PMI) and compare their performance with that of the word embedding models to gain further insights. Motivated by one study that used task-oriented additional resources, we would also introduce other data resources, such as biomedical literature, Wikipedia articles, and social media data, into the training corpus for expanding DS terminology.

The method used in this study can potentially be applied to a wider range of DS and ultimately contribute to the construction of a DS terminology based on clinical notes. The results also indicate that the 2 external sources have less coverage of brand names and misspellings but provide rather complete information on scientific names. Therefore, the syntactic or lexical variants of DS expanded from EHR data through word embedding models can be further standardized and integrated with online resources, including knowledge databases, open-access biomedical publications, and social media data, to construct a comprehensive terminology for DS.
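
The candidate-generation and review workflow discussed in this section can be sketched in a few lines. This is an illustrative sketch rather than the study's implementation: the 3-dimensional vectors are invented for the example (real vectors would come from a word2vec or GloVe model trained on the clinical notes), cosine similarity is assumed as the ranking measure, and the `top_candidates` helper and the `REJECTED` set standing in for expert review are hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real embeddings would be learned from the clinical note corpus.
VECTORS = {
    "melatonin": [0.90, 0.10, 0.05],
    "melotonin": [0.88, 0.12, 0.07],  # misspelling, nearly identical context
    "lunesta": [0.70, 0.60, 0.10],    # related medication, not a variant
    "insomnia": [0.60, 0.20, 0.70],   # related symptom, not a variant
    "lasix": [0.05, 0.90, 0.30],      # unrelated medication
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_candidates(target, vectors, k=40):
    """Rank every other word by similarity to the target; keep the top k."""
    scored = [
        (word, cosine(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Candidate list for one supplement (k is small only because the toy
# vocabulary is small).
candidates = top_candidates("melatonin", VECTORS, k=3)

# Stand-in for expert review: terms judged related but not equivalent.
REJECTED = {"lunesta", "insomnia"}
accepted = [word for word, _ in candidates if word not in REJECTED]

for word, score in candidates:
    print(f"{word}\t{score:.3f}")
print("accepted:", accepted)
```

The related medication and symptom rank high in the toy candidate list, which is why the review step is kept separate from the ranking step: only the reviewed terms would be used as additional query terms.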

CONCLUSION

Word embedding models trained on clinical notes can feasibly be used to expand DS terminology by identifying semantically similar terms in clinical notes. The expanded query terms help identify more clinical notes and unique patients. Our results show that distributional methods are a potential way to automatically detect semantically or syntactically similar terms for DS. The query terms identified by the word embedding models captured the semantic variants of DS in clinical notes well. The generated DS terms can also support further extraction of DS use information and potentially support the development of a DS safety surveillance system.

FUNDING

This research was supported by the National Center for Complementary & Integrative Health Award (#R01AT009457, PI: Zhang); and the National Center for Advancing Translational Sciences (#U01TR002062, PIs: Liu/Pakhomov/Jiang and #1UL1TR002494, PI: Blazar).

AUTHORS’ CONTRIBUTIONS

YF, SP, and RZ conceived the study idea and design. YF preprocessed the data and trained the word embedding models. RM retrieved the clinical notes from CDR using PIER. EL and WZ annotated the candidate lists returned by the models. All authors participated in writing and reviewed the manuscript. All authors read and approved the final manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

Conflict of interest statement. None declared.