Significant and Distinctive n-Grams in Oncology Notes: A Text-Mining Method to Analyze the Effect of OpenNotes on Clinical Documentation.

Maryam Rahimian, Jeremy L Warner, Sandeep K Jain, Roger B Davis, Jessica A Zerillo, Robin M Joyce.

Abstract

PURPOSE: OpenNotes is a national movement established in 2010 that gives patients access to their visit notes through online patient portals, and its goal is to improve transparency and communication. To determine whether granting patients access to their medical notes will have a measurable effect on provider behavior, we developed novel methods to quantify changes in the length and frequency of use of n-grams (sets of words used in exact sequence) in the notes.
METHODS: We analyzed 102,135 notes of 36 hematology/oncology clinicians before and after the OpenNotes debut at Beth Israel Deaconess Medical Center. We applied methods to quantify changes in the length and frequency of use of sequential co-occurrence of words (n-grams) in the unstructured content of the notes by unsupervised hierarchical clustering and proportional analysis of n-grams.
RESULTS: The number of significant n-grams averaged over all providers did not change, but for individual providers, there were significant changes. That is, all significant observed changes were provider specific. We identified eight providers who were late note signers. This group significantly reduced its late signing behavior after OpenNotes implementation.
CONCLUSION: Although the number of significant n-grams averaged over all providers did not change, our text-mining method detected major content changes in specific providers' documentation at the n-gram level. The method successfully identified a group of providers who decreased their late note signing behavior.

Year:  2019        PMID: 31184919      PMCID: PMC6873977          DOI: 10.1200/CCI.19.00012

Source DB:  PubMed          Journal:  JCO Clin Cancer Inform        ISSN: 2473-4276


INTRODUCTION

History has documented numerous efforts toward enabling patients to become more engaged in their own care. After development of the clinical record in the 19th century in America,[1-7] the 1973 Patient’s Bill of Rights[8] is considered one of the first major steps to give patients the right to receive considerate and high-quality care and to access the details and written records about that care.[8] Just over 20 years later, the 1996 Health Insurance Portability and Accountability Act established national standards for protecting the privacy of patient health data.[9,10] Coincident with the enactment of the Health Insurance Portability and Accountability Act was the rise in personal home computing and Internet access, which allowed for the development of online patient portals where patients could access summary information about their medications, immunizations, visits, and laboratory results online.[11,12] However, barriers still existed to patients viewing important parts of their health record, such as clinic visit notes and correspondence with consultants. In 2010, OpenNotes was launched as a national initiative to promote greater transparency in doctor-patient communication. Patients received open access to unstructured clinician notes in their electronic health records through online patient portals. OpenNotes began as a 12-month demonstration project with primary care physicians at three US institutions.[13] In surveys at the end of the pilot period, participating patients and doctors reported favorably on their experiences.[14] Since the original study, more than 100 institutions with over 30 million patients have implemented OpenNotes.[15] Despite positive initial findings, some doctors expressed concerns about unintended consequences. One ongoing concern about OpenNotes is whether the phenomenon of patients having access to this information would change how doctors constructed notes. 
One field of medical practice that is likely to have been affected by OpenNotes is hematology/oncology. We hypothesized that granting patients with cancer full access to their health records would influence providers to alter their documentation of patient encounters, given the sensitive and potentially anxiety-provoking nature of cancer diagnosis, the team-based nature of cancer care, and the importance of such notes for documenting communication between clinicians as well as between clinician and patient. Most OpenNotes research to date has relied on surveys and subjective assessments. In previous preliminary work, we conducted to our knowledge the first objective assessment using visual analytic techniques to show that certain single-word co-occurrences had statistically significant changes before and after OpenNotes.[16] In the current article, we describe methods to quantify changes in the length and frequency of sequential co-occurrence of words (n-grams) in the unstructured content of clinical notes by unsupervised hierarchical clustering and proportional analysis. We sought to explore quantitatively whether the introduction of OpenNotes has changed documentation of patient encounters on the basis of repeating sequential occurrences of words (n-grams) and modes of expression as seen in cluster/clique formation. Most providers use a combination of specific words, templates, and their own unique expressions when completing patient notes, and patterns in the unstructured content of these providers’ notes can be used to track documentation changes over time. We hypothesized that changes in the frequency of common phrases may act as a bellwether for changes in institutional attitudes and/or policy, reflect changing social norms and customs among providers, and/or reflect changes in administrative procedures.

METHODS

Numerical analyses were performed using R version 3.3.1 and RStudio version 0.99.903 software (https://cran.r-project.org/bin/macosx/). The R libraries data.table, stringr, and edgebundleR were used. This study was approved by the Beth Israel Deaconess Medical Center (BIDMC) institutional review board (#2014P000158).

Data Sources and Inclusion Criteria

Notes written by providers in the hematology/oncology department from January 1, 2012, to September 1, 2016, were retrieved. These dates bracket the November 25, 2013, OpenNotes rollout date for hematology/oncology clinics. The post-rollout period is longer than the antecedent period to capture the dynamics and transients of adoption. The antecedent period of 18 or more months was believed to be long enough to sample the stationary behavior of the unperturbed system. Analysis was restricted to initial notes, progress notes, and letters written by full-time medical doctors/doctors of osteopathic medicine and nurse practitioners; part-time faculty, fellows, and trainees were excluded because their documentation style might have changed over time as a result of on-the-job training. Finally, analysis was restricted to providers who wrote at least 100 (progress/initial) notes before and 100 notes after the rollout date. The letters were filtered to include communications between doctors, and only those letters that began with the token “Dear Dr” were included. The explicit bigram Dear Dr is a templated salutation that begins all clinical correspondence at BIDMC. The analysis was restricted to providers who had written a minimum of 10 letters before and after the OpenNotes debut date. The final list included 36 providers for initial and progress notes and 12 providers for letters (Table 1).
TABLE 1.

Number of Notes by Provider Type

Definitions

Definitions of word, corpus, n-gram, distinctive/significant n-grams, and threshold are described in the Appendix.

Cleaning and Preprocessing

The initial and progress notes were divided into two subcorpora: before November 25, 2013, and after. Preprocessing steps involve conversion of all text to lowercase, deletion of punctuation and new line tokens, splitting of the text on the basis of white space, and deletion of all words that are a single character in length (eg, a and I).[1] We chose to avoid stemming for several reasons: Stems are hard to interpret in an actionable sense and although they are supposed to decrease word space, they may increase it by introducing hypothetical word stems that may potentially collide with actual words.
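The preprocessing steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' R code: the function name `preprocess` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption, chosen because example n-grams in the Results such as and_or_communicated suggest that tokens joined by punctuation end up as separate words.

```python
import string

# Assumption: each punctuation character is replaced by a space, so that
# "and/or" becomes the two tokens "and" and "or".
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(note):
    """Lowercase a note, strip punctuation and newlines, split on white space,
    and drop single-character words. No stemming is applied."""
    text = note.lower().translate(_PUNCT_TO_SPACE)
    # str.split() with no argument splits on any white space, including newlines.
    return [word for word in text.split() if len(word) > 1]


if __name__ == "__main__":
    print(preprocess("Dear Dr. Smith,\nI saw a patient; pain/nausea noted."))
```

Single-character words such as "a" and "i" are removed after splitting, so a possessive like "smith's" would contribute only "smith" under this assumption.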

n-Gram Algorithm

The algorithm begins with the selection of a provider and the analysis of the provider’s before and after corpora of initial and progress notes or letters by treating each corpus as a collection of 1-grams. All single words with frequencies less than Ω are eliminated (see Appendix). The corpus is then re-analyzed as a collection of 2-grams. Before analyzing the 2-grams, all 2-grams that do not consist of significant 1-grams are eliminated. The corpus is then analyzed as a collection of the remaining 2-grams, and all 2-grams with frequency less than Ω are eliminated. The analysis proceeds this way until the threshold requirement reduces the list of significant n-grams to 0 (described in the Appendix). The data are analyzed using two complementary strategies. First, the proportions of use of each n-gram for each provider are analyzed in the before and after corpora. The proportion-of-use changes and their significance are determined using a simple statistical proportionality comparison test. Second, the providers are compared with one another using unsupervised cluster analysis to determine clusters of providers who use similar n-grams in their notes.
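The level-by-level growth of n-grams described above might look like the following Python sketch. Several details are assumptions, because the precise definitions live in the Appendix: the frequency of an n-gram is taken here to be the fraction of notes that contain it at least once, Ω defaults to 0.10, and a candidate n-gram is kept only if both of its (n-1)-gram halves survived the previous level. The function name `significant_ngrams` is hypothetical.

```python
from collections import Counter


def significant_ngrams(notes, omega=0.10):
    """Return {n: set of n-gram tuples} for one corpus.

    notes: list of token lists (one list per note).
    omega: assumed threshold on the fraction of notes containing the n-gram.
    """
    n_notes = len(notes)
    result = {}
    previous = None  # significant (n-1)-grams from the previous level
    n = 1
    while True:
        counts = Counter()
        for tokens in notes:
            seen = set()  # count each n-gram at most once per note
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Skip candidates whose constituent (n-1)-grams were eliminated.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                seen.add(gram)
            counts.update(seen)
        kept = {g for g, c in counts.items() if c / n_notes >= omega}
        if not kept:
            # The threshold has reduced the list of significant n-grams to 0.
            return result
        result[n] = kept
        previous = kept
        n += 1


if __name__ == "__main__":
    toy = [
        ["no", "chest", "pain"],
        ["no", "chest", "pain"],
        ["no", "fever"],
        ["chest", "pain", "noted"],
    ]
    for length, grams in significant_ngrams(toy, omega=0.5).items():
        print(length, sorted(grams))
```

Pruning on the previous level keeps the candidate list small, which is what makes very long templated n-grams tractable to find without enumerating every possible sequence.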

Simple Proportional Use Analysis

All notes are re-analyzed for the presence or absence of each n-gram. A binary use matrix U, where U[i,j] = 0 or 1, is built. Any note i with at least one instance of the n-gram j is noted as present for that n-gram or U[i,j] = 1. Alternatively, if there are no instances of that n-gram, it is noted as absent or U[i,j] = 0. The proportion of use for each n-gram j is the sum of the column j divided by the number of rows. Assessments are made whether a provider changes his or her use of an n-gram between the before and after OpenNotes subcorpora in a statistically significant way by testing the null hypothesis that no such change occurs. A two-sided 95% confidence proportion test (α/2 = .025) in the normal approximation is used. The SE is based on the pooled proportion. Only n-grams where the expected number of occurrences or nonoccurrences of a given n-gram on the basis of the pooled proportions before and after exceeds 10 are considered. This is guaranteed by the requirement of Ω = 0.10 and the number of notes before greater than 100 and number of notes after greater than 100.
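A minimal Python sketch of the comparison described above is given here, assuming the standard two-proportion z-test with a pooled SE and a normal approximation. The function names and the example counts are hypothetical; the inputs are the column sums of the binary use matrix U for one n-gram in the before and after subcorpora.

```python
import math


def expected_counts_ok(x_before, n_before, x_after, n_after, minimum=10):
    """Check that expected occurrences and nonoccurrences exceed `minimum`
    in both subcorpora, based on the pooled proportion."""
    pooled = (x_before + x_after) / (n_before + n_after)
    return all(
        n * p > minimum
        for n in (n_before, n_after)
        for p in (pooled, 1 - pooled)
    )


def two_proportion_z_test(x_before, n_before, x_after, n_after):
    """Return (z, two-sided p-value) for a change in proportion of use.

    x_*: number of notes containing the n-gram; n_*: number of notes.
    """
    p_before = x_before / n_before
    p_after = x_after / n_after
    pooled = (x_before + x_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        raise ValueError("n-gram is used in none or all of the notes")
    z = (p_after - p_before) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 of 200 notes before, 60 of 200 notes after.
    if expected_counts_ok(30, 200, 60, 200):
        z, p = two_proportion_z_test(30, 200, 60, 200)
        print(f"z = {z:.2f}, p = {p:.4f}")
```

Rejecting the null hypothesis when the two-sided p-value is below .05 is equivalent to the α/2 = .025 criterion in each tail.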

Clustering Algorithm

The providers’ list of n-grams represents sets that overlap when certain n-grams are used by two or more providers. These overlaps are used to perform unsupervised cluster analysis through a greedy aggregation algorithm[17] (described in the Appendix). The lists of n-grams for all providers are compared, and the two that are the most similar on the basis of a similarity score are found. These two providers’ lists are merged, and the process is repeated until there is only one list left. The remaining list is the superset list of n-grams from which all provider lists can be sourced. Figure 1 summarizes the protocol.
FIG 1.

The protocol of cleaning the notes and building n-grams.
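The greedy aggregation step described in the Clustering Algorithm section can be sketched as follows. The actual similarity score is defined in the Appendix and is not reproduced here, so this Python sketch substitutes the Jaccard index as a stand-in; the function names and the toy provider lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (an assumed stand-in for the similarity score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_merge(provider_ngrams):
    """Repeatedly merge the two most similar n-gram lists until one remains.

    provider_ngrams: dict mapping provider name -> set of n-grams.
    Returns (history, superset), where history lists each merge as
    (left_cluster, right_cluster, similarity) and can be drawn as a dendrogram.
    """
    clusters = {(name,): set(grams) for name, grams in provider_ngrams.items()}
    history = []
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        left, right = max(
            combinations(sorted(clusters), 2),
            key=lambda pair: jaccard(clusters[pair[0]], clusters[pair[1]]),
        )
        score = jaccard(clusters[left], clusters[right])
        # Merge the two lists into their union and record the step.
        clusters[left + right] = clusters.pop(left) | clusters.pop(right)
        history.append((left, right, score))
    superset = next(iter(clusters.values())) if clusters else set()
    return history, superset


if __name__ == "__main__":
    toy = {"A": {"x", "y", "z"}, "B": {"x", "y", "z", "w"}, "C": {"q", "r"}}
    merges, everything = greedy_merge(toy)
    for step in merges:
        print(step)
    print(sorted(everything))
```

Because merged lists are unions, the final list is the superset from which every provider's list can be sourced, matching the description above.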

RESULTS

Results for n-Gram Analysis

The cohort of 36 providers demonstrated several detectable shifts in n-gram use between the before and after corpora. A measure of note-writing behavior before versus after the OpenNotes debut is shown in Figure 2. On average, there was no significant change in the total number of n-grams used. The regression line is N_after = 1.01 × N_before + 80. The uncertainty in the slope is 0.08, and the uncertainty in the intercept is 560. The dashed lines indicate the 95% CI for the values around the regression line.
FIG 2.

Number of primary and constituent n-grams used by providers in their initial and progress notes in the after corpus versus before corpus.

Although the number of significant n-grams averaged over all providers did not change, approximately one half of the providers did have significant changes individually. Eight providers markedly decreased their use of significant n-grams, whereas nine increased their use. The observed effect was highly provider specific. For example, for one provider, the OpenNotes debut correlated with the decreased use of some long n-grams (Fig 3A). This provider significantly reduced reliance on certain long n-grams, specifically some templated text, such as the following:
96-gram: review_of_systems_negative_unless_marked_general_night_sweats_fever_chill_heent_oral_complaints_headache_visual_changes_or_sore_throat_cardiovascular_chest_pain_dizziness_palpitations_respiratory_cough_shortness_of_breath_or_wheezing_abdomen_abdominal_pain_nausea_vomiting_diarrhea_or_constipation_genitourinary_dysuria_or_change_in_urinary_pattern_ms_bone_pain_changes_in_muscle_strength_or_muscle_pain_skin_rashes_itching_endocrine_fatigue_frequent_urination_excessive_thirst_change_in_hair_texture_heme_lymph_easy_bruising_blood_clotting_or_bleeding_problems_increase_in_frequency_or_unusual_infections_neuro_psych_depression_si_hi_weakness_numbness_tingling_vertigo_physical_examination
As an example for another provider (Fig 3B), the debut of OpenNotes led to an increased dependence on longer n-grams. Most of this provider’s n-grams are associated with physical examination and prescription templates. After OpenNotes, this provider significantly increased use of longer n-grams, including:
32-gram: Full_affect_heent_clear_mucous_membranes_moist_sclera_anicteric_conjunctiva_pink_neck_soft_and_supple_chest_cta_no_wheezes_rales_or_rhonchi_heart_rrr_nl_s1_s2_abd_soft_nt_nd_bs
FIG 3.

Distribution of lengths of primary and constituent n-grams before and after OpenNotes for two providers. (A) Provider A decreased creation of long n-grams after OpenNotes. (B) Provider B increased creation of long n-grams after OpenNotes.

Results for Clustering Analysis

We observed some significant changes to the clustering structure of notes before and after the OpenNotes debut. The before dendrogram (Fig 4A) has three large clusters. The first cluster on the left consists of eight providers. This cluster is defined by 2,392 n-grams, which were used significantly by five of the eight providers. This indicates that these n-grams were 62% sensitive for that group. Ninety-five percent of those n-grams were also 100% specific to that group and, therefore, constituted distinctive n-grams to that cluster. All were constituent of the following 53-gram, which is the late signing attestation at BIDMC:
accurately_reflects_the_documentation_made_when_assessed_diagnosed_treated_and_or_communicated_about_the_above_named_patient_also_attest_that_this_information_is_true_accurate_and_complete_to_the_best_of_my_knowledge_and_understand_that_any_falsification_omission_or_concealment_of_material_fact_may_subject_me_to_administrative_civil_or_criminal_liability
The middle cluster of this dendrogram is not as easily interpreted. However, 944 n-grams distinguish it by being more than 60% sensitive and more than 80% specific. There were essentially no interpretatively distinctive n-grams in this group. In the third cluster, there were 1,214 n-grams that were at least 60% sensitive. Of these, 25 n-grams had at least 80% specificity. These n-grams seem to focus on prescriptions, for example:
prochlorperazine_maleate_prochlorperazine_maleate_10_mg_tablet_tablet_by_mouth
After the debut of OpenNotes, the dendrogram and clusters were much more dispersed (Fig 4B). We observe that the eight providers in the first cluster from the first dendrogram have been scattered around the new set of clusters. We determined that this was due to the five late attesters having substantially reduced their late attestations after OpenNotes. Consequently, their relatedness dropped, and they were deemed closer to other providers on the basis of more subtle aspects of expression and usage. Of the five, one provider’s use of the late note attestation fell below the 10% threshold, and the corresponding n-grams were no longer significant. This result was confirmed with the proportional analysis of the 53-gram late note attestation. To summarize, unsupervised hierarchical clustering identified late note signers. Moreover, we can assert that late note signing was the largest driver of n-gram use variance in our study group.
FIG 4.

Dendrograms of the 36 included providers clustered by their use of similar n-grams. Red boxes show relatedness among the providers in the first cluster (late note signers). (A) Before the OpenNotes debut. (B) After the OpenNotes debut.
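The sensitivity and specificity figures quoted for the clusters can be illustrated with a small Python calculation. It assumes the usual definitions: sensitivity is the fraction of cluster members who use the n-gram significantly, and specificity is the fraction of providers outside the cluster who do not. The function name and the toy data (provider labels P1 to P36) are hypothetical; the toy data mirror the first cluster, in which five of eight members used the attestation n-grams and no provider outside the cluster did.

```python
def sensitivity_specificity(ngram, cluster, provider_ngrams):
    """Return (sensitivity, specificity) of one n-gram for one cluster.

    cluster: set of provider names in the cluster.
    provider_ngrams: dict mapping every provider name -> set of significant n-grams.
    """
    members = [p for p in provider_ngrams if p in cluster]
    others = [p for p in provider_ngrams if p not in cluster]
    if not members or not others:
        raise ValueError("need at least one provider inside and outside the cluster")
    sensitivity = sum(ngram in provider_ngrams[p] for p in members) / len(members)
    specificity = sum(ngram not in provider_ngrams[p] for p in others) / len(others)
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical data: 36 providers, the first 8 form the cluster,
    # and 5 of those 8 use the (abbreviated) attestation n-gram.
    gram = "late_signing_attestation"
    providers = {f"P{i}": set() for i in range(1, 37)}
    for i in range(1, 6):
        providers[f"P{i}"].add(gram)
    cluster = {f"P{i}" for i in range(1, 9)}
    print(sensitivity_specificity(gram, cluster, providers))
```

With these toy inputs the sensitivity is 5/8 = 0.625, which corresponds to the 62% reported for the first cluster, and the specificity is 1.0.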

Proportional Analysis

The statistical test of comparison of proportions was used to determine, provider by provider, whether the fractional use of certain n-grams increased or decreased significantly between the before and after corpora. In addition to the n-grams with a significant increase or decrease in proportion of use, the n-grams that rose above or fell below the threshold of significance (Ω) were also included. Because not all n-grams are interpretable as tokens of valued medical communication, some meaningful n-grams were handpicked to display their increase or decrease in usage (Figs 5A and 5B). For example, the bigrams follow_up, distress_score, and concerning_for demonstrated a significant increase in frequency of use in most of the providers. The bigram distress_score was picked as a control because it was added to the vital signs sheet as a matter of policy on April 11, 2013, and its use was encouraged coincident with the OpenNotes debut (Fig 5C). Thirty-four of 36 providers demonstrated significantly increased use of distress_score in the after corpus.
FIG 5.

Relatedness of 10 providers by comparison of (A) increased and (B) decreased frequency of use of similar selected n-grams. The circle with segmented arcs in distinct colors indicates different providers. Each arc consists of several n-grams selected to indicate increase (or decrease) in use frequency. The lines that connect the n-grams emphasize the correlated increase (or decrease) in use frequency between providers for the specific word. (C) The occurrence of the word distress per 1,000 notes per month. Smoothing/aggregation interval = 1 month. The vertical line indicates the OpenNotes rollout date. GVHD, graft-versus-host disease.

Letter Analysis

Only five of 36 providers in the study wrote letters that started with the token Dear Dr and with sufficient abundance to guarantee more than 100 letters before and after the OpenNotes rollout. In one case, 403 n-grams appeared in both corpora, and only 88 changed proportion of use (P < .025). The use of the bigrams concerning_for and distress_score both increased significantly. For the others, 711 n-grams were used significantly in letters before and after the OpenNotes rollout. Of those, 113 changed use significantly, including follow_up and distress_score.

DISCUSSION

The current study explored whether the introduction of OpenNotes changed documentation of patient encounters. We explored both the content and the related meta-data. To analyze changes in providers’ notes content over time, we deconstructed medical notes on the basis of repeating sequential occurrences of words (n-grams) and modes of expression as seen in cluster/clique formation. Most providers used a combination of specific words, templates, and their own unique expressions when completing patient notes, and patterns in the unstructured content of these providers’ notes can be used to track documentation changes over time. We have shown that although the number of distinct common n-grams did not change on average, nine providers dramatically increased the diversity of their n-gram repertoire and eight contracted it. Clustering analysis revealed that one driver for this change was the dramatic decrease in late note attestation by a subset of providers. Most observed changes were provider specific.
A previous effort to quantify redundancy in electronic medical records focused on 100 randomly selected patients admitted during a 6-month period at NewYork-Presbyterian Hospital in 2010. The study showed that more than 50% of notes (initial, progress, discharge, etc) borrowed extensively from previous notes from the same patient, which thereby indicates a cut-and-paste approach to note filling. Our approach differs significantly because we compared all notes from the same doctor with one another and not simply within one patient history. Therefore, we found features of doctors’ styles that are particular to the doctor rather than details that are particular to the patient.[18]
A major advantage of this novel method is the ability to detect the largest n-gram that is consistently used by providers in their notes. A second advantage of this method of n-gram construction is the independence from any prespecified lexicon. The need for predefined word lists, such as Unified Medical Language System, and the concomitant limitations was eliminated. A third advantage of this method is the use of aggregation clustering, which promotes the detection of sensitive and specific n-grams for clusters. Clusters are not merely mathematical or algorithmic associations. Rather, they are truly cohorts with shared communication modalities. A final advantage is the unsupervised nature of clustering itself. Future work will include an in-depth analysis of the similarities of providers grouped by cluster: Were they trained at similar institutions or during similar time periods? Do they share subspecialization in the field of hematology/oncology?
A limitation of this method is that it assumes a static n-gram universe in each corpus. That is, the notes represent a sampling of a presumably time-independent distribution. Human writing, however, is not a static sampling, and as patients’ needs change from patient to patient and over time, the vocabulary shifts accordingly. The choice of a clustering metric is always arbitrary, and there may well be better metrics that give clusters with clearer delineations of sensitive and specific n-grams. Future work will focus on optimizing the metric for this purpose. However, the existing metric provides useful insights into shared n-gram usage. Similarly, the selection of primary n-grams rather than primary and constituent n-grams for analysis is an open question and depends strongly on the choice of clustering metric. We chose not to use stemming, which may have led to substantially different results that were due to errors in over- and understemming.[19,20] Although vectorization has shown excellent results in many natural language processing (NLP) domains, written communication also relies heavily upon cadence, as introduced by punctuation. Thus, two similarly vectorized n-grams possibly would convey a different message to a patient. For example, “Mr Smith’s prognosis is dismal, given his age. We will proceed with aggressive chemotherapy” versus “Mr Smith’s prognosis is dismal. Given his age, we will proceed with aggressive chemotherapy.” Finally, n-grams are summary-level extractions of a document, which can be highly complex. Future work will focus on scaling this analysis to include prosody, tone, and reading-level metrics. Work on parsing the grammar of sentences has been done by the Stanford NLP project, among others. Recent work has sought to compare algorithms and assess for computational efficiencies. This work is an example of the state of the art of NLP, but it is beyond the scope of this article to implement.[21]
A limitation of the data source is that after exclusion, the universe of providers was relatively small and may have unmeasured biases that relate to the single institutional nature of the study. Of note, we analyzed notes at the provider level and did not focus on patient characteristics. Although the majority of patients seen in the hematology/oncology clinic have a diagnosis of cancer, the nature (curability) of their cancers may vary widely. We will explore the use of existing tools, such as DeepPhe,[22] to delineate further the patient characteristics in future work.
In conclusion, we have performed an objective analysis of large corpora of hematology/oncology notes written before and after the OpenNotes rollout. Significant differences were seen in the content, which can be explained at least partially by the OpenNotes rollout.

1.  A.H.A. Bill of Rights.

Authors:  George J Annas
Journal:  Trial       Date:  1973 Nov-Dec

2.  Open notes: doctors and patients signing on.

Authors:  Tom Delbanco; Jan Walker; Jonathan D Darer; Joann G Elmore; Henry J Feldman; Suzanne G Leveille; James D Ralston; Stephen E Ross; Elisabeth Vodicka; Valerie D Weber
Journal:  Ann Intern Med       Date:  2010-07-20       Impact factor: 25.391

3.  The evolution of the electronic health record.

Authors:  Susan Doyle-Lindrud
Journal:  Clin J Oncol Nurs       Date:  2015-04       Impact factor: 1.027

Review 4.  Interval examination: moving toward open notes.

Authors:  Jan Walker; Tom Delbanco
Journal:  J Gen Intern Med       Date:  2013-04-26       Impact factor: 5.128

5.  Phyloproteomics: species identification of Enterobacteriaceae using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry.

Authors:  G C Conway; S C Smole; D A Sarracino; R D Arbeit; P E Leopold
Journal:  J Mol Microbiol Biotechnol       Date:  2001-01

6.  Inviting patients to read their doctors' notes: a quasi-experimental study and a look ahead.

Authors:  Tom Delbanco; Jan Walker; Sigall K Bell; Jonathan D Darer; Joann G Elmore; Nadine Farag; Henry J Feldman; Roanne Mejilla; Long Ngo; James D Ralston; Stephen E Ross; Neha Trivedi; Elisabeth Vodicka; Suzanne G Leveille
Journal:  Ann Intern Med       Date:  2012-10-02       Impact factor: 25.391

7.  Giving patients access to their medical records via the internet: the PCASSO experience.

Authors:  Daniel Masys; Dixie Baker; Amy Butros; Kevin E Cowles
Journal:  J Am Med Inform Assoc       Date:  2002 Mar-Apr       Impact factor: 4.497

Review 8.  The effects of promoting patient access to medical records: a review.

Authors:  Stephen E Ross; Chen-Tan Lin
Journal:  J Am Med Inform Assoc       Date:  2003 Mar-Apr       Impact factor: 4.497

9.  Providing a web-based online medical record with electronic communication capabilities to patients with congestive heart failure: randomized trial.

Authors:  Stephen E Ross; Laurie A Moore; Mark A Earnest; Loretta Wittevrongel; Chen-Tan Lin
Journal:  J Med Internet Res       Date:  2004-05-14       Impact factor: 5.428

10.  Assessment of US Hospital Compliance With Regulations for Patients' Requests for Medical Records.

Authors:  Carolyn T Lye; Howard P Forman; Ruiyi Gao; Jodi G Daniel; Allen L Hsiao; Marilyn K Mann; Dave deBronkart; Hugo O Campos; Harlan M Krumholz
Journal:  JAMA Netw Open       Date:  2018-10-05