Literature DB >> 34396057

A comparison of attentional neural network architectures for modeling with electronic medical records.

Anthony Finch1, Alexander Crowell1, Yung-Chieh Chang1, Pooja Parameshwarappa1, Jose Martinez1, Michael Horberg1,2.   

Abstract

OBJECTIVE: Attention networks learn an intelligent weighted averaging mechanism over a series of entities, improving both performance and interpretability. In this article, we propose a novel time-aware transformer-based network and compare it to another leading model with similar characteristics. We also decompose model performance along several critical axes and examine which features contribute most to our model's performance.
MATERIALS AND METHODS: Using data sets representing patient records obtained between 2017 and 2019 by the Kaiser Permanente Mid-Atlantic States medical system, we construct four attentional models with varying levels of complexity on two targets (patient mortality and hospitalization). We examine how incorporating transfer learning and demographic features contribute to model success. We also test the performance of a model proposed in recent medical modeling literature. We compare these models with out-of-sample data using the area under the receiver-operator characteristic (AUROC) curve and average precision as measures of performance. We also analyze the attentional weights assigned by these models to patient diagnoses.
RESULTS: We found that our model significantly outperformed the alternative on a mortality prediction task (91.96% AUROC against 73.82% AUROC). Our model also outperformed on the hospitalization task, although the models were considerably more competitive in that space (82.41% AUROC against 80.33% AUROC). Furthermore, we found that demographic features and transfer learning, which are frequently omitted from new models proposed in the EMR modeling space, contributed significantly to the success of our model.
DISCUSSION: We proposed an original construction of deep learning electronic medical record models which achieved very strong performance. Our unique model construction outperformed a leading literature alternative on several tasks, even when input data was held constant between them. We obtained further improvements by incorporating several methods that are frequently overlooked in new model proposals, suggesting that it will be useful to explore these options further in the future.
© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association.

Entities:  

Keywords:  artificial intelligence; attention network; electronic medical record; neural network; patient modeling

Year:  2021        PMID: 34396057      PMCID: PMC8358476          DOI: 10.1093/jamiaopen/ooab064

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


BACKGROUND

Introduction

With recent advances in computational resources, deep learning has become an increasingly popular methodology for producing models of expected outcomes. In general, deep learning has been most successful in domains where unstructured or semistructured data have rendered more conventional models impractical. One application which has benefited greatly from the advent of deep learning has been the modeling of patient outcomes based on their electronic medical records (EMRs). This domain has been particularly ripe for exploration by deep learning models because EMR data is typically sufficiently large to construct high-quality deep learning models and sufficiently complex that prior methods left some significant facets of the data underexploited.

Modeling modalities

The starting point for most patient modeling typically consists of the patient’s record of diagnoses, pharmaceutical prescriptions, surgical procedures, and lab tests (hereafter summarized as medical entities or medical concepts). Some studies have also examined provider notes and demonstrated significant utility from this data, although it is often much less structured and may contain similar content to the semistructured diagnostic information provided by medical entities. Beginning with techniques such as Med2Vec, it has become standard for modelers to employ entity embedding methods to construct dense representations of these entities. This practice allows models to learn more efficiently by sharing information about similar diagnoses. Frequently, physician decisions can be traced to a relatively small subset of a patient’s health record. To approximate this intelligent filtering process, modelers can use a neural attention mechanism which operates similarly. In such architectures, the model is trained to assign a weight to each embedding. Then, entities with nonzero weights are combined using a weighted averaging function to construct a dense representation of the patient’s health state. This process is interpretable, since the relative sizes of entity weights indicate rough measures of the entities’ relative importance in the model’s decision. Furthermore, some attention mechanisms can be constructed so as to sparsify the weighting function, resulting in a smaller set of entities which could have contributed to the model’s conclusion. The general construction of attention-based networks was most simply implemented in the natural language processing literature by Luong et al for the purpose of machine translation. This construction can be adapted to the medical domain by substituting the occurrence of medical codes for the occurrence of words, leading to a simple and easily interpretable medical prediction.
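The weighted-averaging attention mechanism described above can be sketched in a few lines. The context vector, entity count, and embedding dimensionality below are illustrative placeholders, not values from the article:

```python
import numpy as np

def global_attention(entity_embs, context):
    """Luong-style global attention: a weighted average of entity
    embeddings, where the weights come from a softmax over scores.

    entity_embs: (n_entities, d) matrix of embeddings
    context:     (d,) learned context/query vector
    Returns the patient summary vector and the attention weights."""
    scores = entity_embs @ context              # one score per entity
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    summary = weights @ entity_embs             # weighted average
    return summary, weights

# Toy usage: 5 medical entities embedded in 16 dimensions
rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 16))
ctx = rng.normal(size=16)
summary, w = global_attention(embs, ctx)
```

The weights sum to one, so their relative sizes can be read directly as rough importance measures, which is the interpretability property the text describes.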
While there are advantages to the simplicity of this model, there are also reasons to suspect that it is inefficient. This model cannot make use of complex relationships between diseases, observe the effects of disease progression over time, or adapt disease representation based on the context within a patient’s medical record. To capture the basic relationships between diseases, several models have incorporated hierarchical structures in which patient visits are aggregated in an intermediary step between entity embedding and patient summary. Such models benefit by encoding logically related information into a structural element of the model. This can help to address questions related both to the relationships between entities and the complexities of capturing time information in the model. Visit-level vectors may be aggregated from individual entities by incorporating within-visit attention, simple average, or through more complex mechanisms. In addition to aggregating visit data, several models have employed more sophisticated self-attention architectures to allow for complex relationships between entities. Typically, self-attention mechanisms incorporate a pairwise matching approach, whereby the model learns to assign weights to the relationships between entities and then re-embed entities to reflect those relationships. Thus, models can learn the differences between, for example, treated and untreated versions of the same disease. This technique has proved extremely useful and interpretable in the natural language processing literature; since its introduction there, it has enjoyed great success in modeling patient data as well. In order to incorporate time as an element in patient modeling, researchers have taken two divergent approaches. Traditionally, models such as RETAIN have incorporated time using a recurrent neural network (RNN).
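The pairwise matching that self-attention performs can be sketched as scaled dot-product attention. This is a minimal illustration: the learned query/key/value projection matrices used by a full Transformer are omitted here for brevity:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over entity embeddings.
    Each entity is re-embedded as a weighted mix of all entities,
    letting the model reflect pairwise relationships between codes
    (e.g. treated vs untreated versions of the same disease).

    X: (n_entities, d) embedding matrix."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # pairwise compatibility
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    return A @ X                                  # context-aware re-embedding
```

Each output row is a convex combination of all input rows, which is what allows the re-embedded code to carry information about its context in the record.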
RNNs employ decay to steadily reduce the impact of older data and allow the proximity of observations to influence the strength of relationships between data. RNNs operate explicitly over sequences and can therefore ignore the distance between visits unless this data is provided separately. Several authors have also proposed purely attentional methods to incorporate time information into models. Typically, these models have been inspired by the transformer model proposed by Vaswani et al for the task of machine translation. This methodology adapts the Transformer’s architecture to encode time data into the entity embedding structure and relies on the attentional structures to interpret this data. In addition to patient diagnoses, several studies have demonstrated that patient demographic features can be useful inputs. Unfortunately, many leading deep learning models such as RETAIN, HiTANet, and ConCare have omitted these features as primary inputs when proposing new models. Furthermore, several studies have examined the impact of improving medical concept embedding initialization by pretraining embeddings with an alternative model. Recently, Rasmy et al demonstrated in Med-BERT that such initialization can substantially improve even sophisticated modern architectures.

Objective

In this article, we propose an efficient new transformer-based architecture for predicting patient outcomes from EMR data. Our model synthesizes improvements described by several authors and simplifies the architecture. Critically, we then examine the impact of each element of the model on overall performance and compare it against a leading alternative. Our model differs from prior works in several significant ways. We simplify the construction of time awareness by incorporating a trigonometric decomposition. We also flatten the hierarchical embedding structure used in previous works, relying instead on the time encoding to capture the relevant information. This adjustment is conceptually simple and easy to implement. We employ a sparsified global attention mechanism to maximize interpretability and incorporate both demographic data and transfer learning to optimize model performance.

METHODS

In this study, we propose a new adaptation of the Transformer model. This model implements a Trigonometrically encoded Time-aware Transformer Network (T3Net). To assess the efficacy of T3Net in patient prediction, we compare it against a leading recent alternative, HiTANet. We tested these models on two targets (mortality and hospitalization) trained on EMR data from a large regional medical group. Models were trained using medical records from 2017 with targets in 2018 and validated using medical records from 2018 with targets occurring in 2019. Models were evaluated for their performance on average precision (AP) and area under the receiver-operator characteristic (AUROC).

Model architecture

T3Net takes as its primary input a set of patient medical entities, including diagnoses, procedures, lab tests, and pharmaceutical codes. Each entity is converted to a numeric vector embedding. Code embeddings are then decomposed using a trigonometric decomposition and submitted to a transformer-style self-attention encoder, as in Vaswani et al. The re-embedded codes are then concatenated with their original embeddings and submitted to a traditional attention layer, as in Luong et al. This attention layer yields a single vector which we consider to be a numeric summary of the patient’s known health state. This patient vector is concatenated with a demographic feature vector and submitted to a traditional feedforward neural classifier. For a more complete discussion of T3Net’s architecture, please refer to Supplementary Appendix A: Model Architecture.

Trigonometric time decomposition

Our model incorporates an original trigonometric time decomposition. Prior to self-attention re-embedding, each code is decomposed into two elements by multiplying the code embedding by sin(2πt/T) and by cos(2πt/T), where T indicates the desired period (in our case, 365 days) and t indicates the time since code assignment. This decomposition can also be applied multiple times with a variety of periods to construct a more nuanced time encoding, although we do not incorporate multiple periods in this study. By decomposing codes in this way, we allow the model to perfectly reconstruct the original embedding while losslessly (with an appropriately selected period) encoding time data. In principle, this compares favorably with additive methods, where it may not be possible for the model to perfectly reconstruct either element from the available data.
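A concrete sketch of this decomposition follows. The standard 2πt/T angular form is an assumption stated here explicitly; the function name and shapes are illustrative, not taken from the authors' implementation:

```python
import numpy as np

def trig_time_decompose(emb, t, period=365.0):
    """Decompose one code embedding into two time-modulated copies:
    emb * sin(2*pi*t/period) and emb * cos(2*pi*t/period).
    Because sin^2 + cos^2 = 1, the original embedding can be
    perfectly reconstructed, and the elapsed time t is encoded
    losslessly for t within one period."""
    angle = 2.0 * np.pi * t / period
    return np.concatenate([emb * np.sin(angle), emb * np.cos(angle)])
```

The reconstruction property can be checked directly: multiplying the two halves by sin and cos of the same angle and summing recovers the original embedding, which is the lossless-encoding claim made above.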

Data

We employed data from patients in the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780 000 members in Maryland, Virginia, and the District of Columbia. We trained models on EMR records for 294 698 patients with active coverage on January 1st, 2018; then, we validated models on a set of 311 156 patients with active coverage on January 1st, 2019. Patients were included in each year of data if they were age 45 or older by the model date. We excluded younger patients from our sample based on preliminary analysis of target prevalence, which showed that mortality and hospitalization were both very rare among younger patients. In addition, patients without any recorded medical history (typically new patients) were excluded. Each patient record included both demographic and EMR data. We modeled two targets for patient outcomes: 12-month mortality and 6-month hospitalization. We chose to incorporate a 6-month hospitalization target instead of 12 months because preliminary experimentation demonstrated that the hospitalization target became less predictable after 6 months. For the hospitalization model, we included all inpatient hospital admissions submitted as claims to the health system. For the 1-year mortality target, we identified death records based on operational records available in our EMR system. Table 1 displays the rates of various demographic groups and outcome measures across the two years studied. We observed no significant departures between these two sets.
Table 1.

Patient demographic summary

Demographic/outcome group    2018 frequency (%)    2019 frequency (%)
Male                         45.7                  45.6
Female                       54.3                  54.4
Asian/Pacific Islander       12.8                  13.5
Black/African American       38.7                  38.2
Hispanic/LatinX              10.2                  10.4
White                        31.7                  31.2
Unknown/other                6.6                   6.6
1-year mortality             0.9                   0.9
6-month hospitalization      3.8                   3.7

This table describes a patient demographic breakdown for our training and validation data sets. Overall, demographic statistics were very similar across the two data sets.

Our research was approved by the Kaiser Permanente Mid-Atlantic States Institutional Review Board (IRB).

Comparison model

We compare our model to the state-of-the-art HiTANet model. HiTANet incorporates similar architectural elements, particularly an alternative version of time-aware self-attention and a secondary attention model applied over code embeddings. In contrast to T3Net, however, HiTANet incorporates a hierarchical visit-based embedding structure prior to applying self-attention. HiTANet also omits demographic data and does not use transfer learning to initialize code embeddings.

Ablation studies

One of the primary drawbacks to employing HiTANet as a comparison is that T3Net incorporates certain elements not present in HiTANet. In particular, HiTANet’s published implementation does not offer a method for initializing embeddings through transfer learning or inputting demographic data. For this reason, we also examine the results of several ablation studies. In these studies, we omit either transfer learning, demographic inputs, or both. Note that models which did not incorporate transfer learning to initialize code embeddings were instantiated with random embeddings generated from an independent (by vector and dimension) uniform distribution between −0.05 and 0.05.
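The random initialization used in the ablated models amounts to the following; the vocabulary size and seed are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim = 10_000, 16   # hypothetical vocabulary; 16-dim as in the study

# Each component drawn independently (by vector and dimension) from a
# uniform distribution on [-0.05, 0.05], matching the non-pretrained
# initialization described for the ablation studies.
embeddings = rng.uniform(-0.05, 0.05, size=(vocab_size, dim))
```

In contrast, the full model replaces this matrix with vectors pretrained by a Word2Vec model on the same training data, then co-trains them.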

Model training

For each version of T3Net, we employed a single attentional head (both self-attention and simple attention, where applicable) to maximize interpretability. We examined results for 16-dimensional embeddings. For the final classification task, we employed a deep network with an initial layer size of 256 and 4 layers deep. Models were trained with a dropout rate of 0.3 for layers in the primary classification network and the simple attentional network. Each layer in both networks was also regularized with a penalty of 0.001. Entity embeddings were initialized using pretrained vectors obtained from a Word2Vec model applied to the same training data; however, entity embeddings were also allowed to co-train with the model. Entities were deduplicated prior to submission to either the embedding algorithm or to the core model. For models which incorporated time-sensitivity, we retained the most recent allocation of any individual code as the canonical diagnosis time. This yielded a final count of 8.3 million deduplicated diagnosis, procedure, pharmaceutical, and lab test codes in our training set and 8.9 million deduplicated codes in our validation set.

Models were implemented in TensorFlow version 2.3.0, primarily using the Keras functional interface. They were optimized using the Adam optimization algorithm for 20 epochs each. All models were trained on a locally hosted IBM CloudPak4Data instance with 2 cores and 64 GB of RAM.

To train the HiTANet model, we used the code published with the original article. Models were trained in PyTorch version 1.3.1. Once again, we used a locally hosted IBM CloudPak4Data instance with 2 cores and 64 GB of RAM to train models. To alter our data set for use with HiTANet, we had to aggregate patient codes into visit lists. For each patient, we combined all codes which occurred on the same date and labeled those as a single visit. Due to computational constraints, we were only able to train the HiTANet model once for each target.
To remain consistent with settings for T3Net, we trained the model with an embedding dimensionality of 16. We also employed a classification hidden layer size of 256, which was the same as T3Net and which was also the default suggested by the authors. We used default values for all other hyperparameters. We trained each HiTANet model for 20 epochs and recorded performance at the end of each epoch.
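The deduplication rule described above (keep each code once per patient, retaining its most recent assignment date as the canonical diagnosis time) can be sketched as follows. The ICD-10 codes and dates are hypothetical examples:

```python
from datetime import date

def deduplicate(records):
    """Deduplicate one patient's (code, date) pairs, keeping the most
    recent date for each code as the canonical diagnosis time."""
    latest = {}
    for code, when in records:
        if code not in latest or when > latest[code]:
            latest[code] = when
    return sorted(latest.items())

# Hypothetical patient history with a repeated diagnosis code
history = [("E11.9", date(2017, 3, 1)),
           ("E11.9", date(2017, 9, 14)),
           ("I10",   date(2017, 5, 2))]
deduped = deduplicate(history)
```

Applied across all patients, a rule like this would produce the deduplicated code counts reported for the training and validation sets.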

Model comparison

For each model, we examined model performance using the AUROC and the average precision (AP) when using the model to predict our validation data. For each version of T3Net, we performed model training 10 times independently and computed performance on our evaluation metrics after each epoch. We then examined the median performance of the given set of models over all epochs and selected the epoch with the best median AP as our representative for that class of models. We employed a median-based strategy to minimize the impacts of individual runs of each model, which could sometimes produce highly variant results due to poor random starting conditions. Although we took the best epoch of each model as our canonical result, results were robust to our selection of epoch.
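The median-based epoch selection described above can be expressed compactly. The AP values below are toy numbers for illustration, not results from the study:

```python
import numpy as np

def best_epoch_by_median_ap(ap_scores):
    """ap_scores: (n_runs, n_epochs) array of average-precision values,
    one row per independent training run. Returns the epoch whose
    median AP across runs is highest, along with that median value."""
    medians = np.median(ap_scores, axis=0)   # median over runs, per epoch
    best = int(np.argmax(medians))
    return best, float(medians[best])

# Toy example: 3 runs x 4 epochs
ap = np.array([[0.10, 0.18, 0.20, 0.19],
               [0.09, 0.17, 0.21, 0.18],
               [0.11, 0.16, 0.19, 0.20]])
epoch, med = best_epoch_by_median_ap(ap)
```

Taking the median rather than the mean or maximum limits the influence of individual runs with poor random starting conditions, as the text explains.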

RESULTS

Here, we examine the performance of T3Net in comparison to HiTANet on our mortality and hospitalization targets. In Supplementary Appendix B: Attention Analysis, we analyze the interpretability of T3Net by observing the network’s prediction and attentional responses to a synthetic patient profile. In Supplementary Appendix C: Architectural Ablation Studies, we present a detailed breakdown of results from ablating specific architectural elements of T3Net.

Convergence and overfitting

We found that the models which excluded Word2Vec pretraining tended to converge very quickly (optimal epochs were 2–3), but that these models quickly began overfitting (Figures 1–4). This tendency was not evidenced by the full T3Net model or by the model which ablated only demographics. These observations held true for both the mortality target and the hospitalization target.
Figure 1.

Mortality AUROC by epoch. This figure demonstrates the minimum, median, and maximum area under the receiver-operator characteristic (AUROC) performance of each model on the mortality target after each epoch. Model performance peaked early for most model categories; however, our model which omitted Demographics took several more epochs to converge. Furthermore, the models without Word2Vec pretraining displayed a tendency to overfit after only a few epochs.

In both experiments, HiTANet’s optimal epoch was quite late; however, HiTANet demonstrated an asymptotic convergence and was no longer improving significantly by the final stopping point. On the mortality target, HiTANet did not appear to substantially improve after the fourth epoch, although it continued improving until about the 10th epoch on the hospitalization target.

One-year mortality prediction

T3Net significantly outperformed the HiTANet benchmark on the 1-year mortality target (Figures 1 and 2). In its optimal epoch, median performance from T3Net achieved an AUROC of 91.96% and an AP of 20.35% with the full model. With the fully ablated model (which included neither transfer learning nor demographic data), T3Net achieved an AUROC of 90.17% and an AP of 19.81%. This compares favorably with results achieved by HiTANet, which achieved a maximum AUROC of 73.82% and an AP of 3.90%. As demonstrated by Table 2, the median performance of each version of T3Net outperformed HiTANet on both metrics. Interestingly, ablating only demographic features severely limited model performance and yielded the worst performance over all T3Net models with an optimal median AUROC of 89.01% and AP of 15.89%.
Figure 2.

Mortality AP by epoch. This figure demonstrates the minimum, median, and maximum average precision (AP) performance of each model on the mortality target after each epoch. Model performance peaked early for most model categories; however, our model which omitted Demographics took several more epochs to converge. Furthermore, the models without Word2Vec pretraining displayed a tendency to overfit after only a few epochs.

Table 2.

Results summary by model

Model name                                 Epoch (M)  AUROC (M)         AP (M)            Epoch (H)  AUROC (H)         AP (H)
T3Net (full model)                         4          91.96%            20.35%            6          82.41%            23.80%
                                                      (91.61%, 92.26%)  (16.15%, 21.50%)             (82.24%, 82.74%)  (22.31%, 24.46%)
T3Net (ablate demographics)                11         89.01%            15.89%            7          80.33%            20.80%
                                                      (87.67%, 89.26%)  (11.38%, 16.64%)             (77.47%, 81.00%)  (15.81%, 22.71%)
T3Net (ablate Word2Vec)                    2          91.59%            20.12%            2          82.09%            23.67%
                                                      (90.99%, 91.89%)  (19.02%, 20.71%)             (79.68%, 82.64%)  (20.03%, 24.42%)
T3Net (ablate Word2Vec and demographics)   2          90.17%            19.81%            2          81.72%            23.16%
                                                      (89.37%, 90.44%)  (17.82%, 20.19%)             (81.02%, 82.09%)  (22.01%, 23.61%)
HiTANet                                    19         73.82%            3.90%             19         80.16%            21.85%

This table displays the value of the median score achieved by each model in its best-performing epoch by average precision (AP). The table also indicates performance on the area under the receiver-operator characteristic curve (AUROC). Columns marked with an (M) display values for model performance on the mortality target; columns marked with an (H) display values for model performance on the hospitalization target. Each column also indicates the best and worst scores for models on the given metric at the indicated epoch. Note that the best performance in both cases is achieved by the full T3Net model, although the most appropriate comparison network is the fully ablated model.

Six-month hospitalization prediction

Median performance by the full T3Net model outperformed HiTANet on both metrics using the hospitalization target, even after ablating both demographics and transfer learning (Figures 3 and 4). The full T3Net model achieved an AUROC of 82.41% and an AP of 23.80%, with the fully ablated model achieving an AUROC of 81.72% and an AP of 23.16%. HiTANet’s performance was much closer to our own models’ on this target, achieving an optimal AUROC of 80.16% and AP of 21.85%. On AP, HiTANet outperformed the model which ablated only demographic data, which achieved an optimal AUROC of 80.33% and AP of 20.80%.
Figure 3.

Hospitalization AUROC by epoch. This figure demonstrates the minimum, median, and maximum area under the receiver-operator characteristic (AUROC) performance of each model on the hospitalization target after each epoch. Model performance peaked early for most model categories; however, the HiTANet model took several epochs to converge. Furthermore, the models without Word2Vec pretraining displayed a tendency to overfit after only a few epochs.

Figure 4.

Hospitalization AP by epoch. This figure demonstrates the minimum, median, and maximum average precision (AP) performance of each model on the hospitalization target after each epoch. Model performance peaked early for most model categories; however, the HiTANet model took several epochs to converge. Furthermore, the models without Word2Vec pretraining displayed a tendency to overfit after only a few epochs.

Runtime

As demonstrated by Table 2, T3Net achieved significantly shorter runtimes than HiTANet in our experiments. On training tasks, the full T3Net model averaged approximately 640 seconds per epoch, compared to approximately 18 000 seconds per epoch for HiTANet. Over the entire training cycle, we found that HiTANet took 55 633 s (15.45 h) on average to train, store model weights, predict the validation set, and perform other internal maintenance for each epoch using a slightly altered version of the authors’ provided training scripts. In comparison, we observed an average runtime of 13 101 s per complete run of T3Net (including 20 epochs of training, forecasting at each epoch, and model saving).

DISCUSSION

Contributions

In this study, we have proposed a novel implementation of the Transformer architecture which incorporates Trigonometrically encoded Time data, named T3Net. Our model incorporates several innovations over current state-of-the-art patient EMR models, including our unique implementation of time encoding. Furthermore, our model incorporates transfer learning and demographic data; while other studies have demonstrated the value of these features, they have been underutilized in the deep learning EMR literature, especially in Transformer models. When applied to a real-world data set, our model outperformed a leading alternative in the space. In addition, we performed a comprehensive series of ablation studies. Interestingly, we found that a self-attentional Transformer model significantly underperformed (Supplementary Appendix C: Architectural Ablation Studies) when time encoding was omitted, lending considerable weight to the observations by Luo et al indicating that time data is critical to the performance of EMR models. In additional ablation studies, we found that including demographic features significantly improved model performance when keeping all other features constant and that using transfer learning to initialize medical entity embeddings significantly reduced the model’s tendency to overfit. Models with pretrained embeddings tended both to outperform and to be more robust to overfitting (Figures 1–4).

Comparison to HiTANet

We compared our model construction to the recently published alternative HiTANet. HiTANet employs several important architectural features similar to T3Net, including the use of time-aware self-attention; however, there are important distinctions between these models that led to substantial differences in performance. To account for these differences, we examined the performance of several ablated versions of T3Net. We found substantial differences in performance between the ablated versions of T3Net and HiTANet. Our model significantly outperformed HiTANet on the mortality target and marginally outperformed HiTANet on the hospitalization target. Because input data was held constant, these differences must be attributable to differences in the mathematical construction of the two networks.

Model interpretation

We found that it was possible to construct a reliable total attention weight which incorporates the impacts of both global attention and self-attention (Supplementary Appendix B: Attention Analysis). Our construction of total attention was strongly correlated with the absolute difference in risk score obtained by adding or removing the given diagnosis, demonstrating that this measure is a reliable tool for interpreting T3Net’s decisions. Finally, we observed that T3Net’s decisions reflected clinical intuition. For example, it learned that time information was significantly more important when incorporating the effects of acute codes than when examining long-term chronic diagnoses.
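The perturbation measure against which total attention was correlated can be sketched generically. Here `model_fn` and the toy risk model are placeholders for illustration, not the article's network:

```python
def perturbation_importance(model_fn, codes, target_code):
    """Absolute change in predicted risk when one code is removed from
    the patient's record -- the perturbation measure that a reliable
    total attention weight should correlate with.
    model_fn: any function mapping a list of codes to a risk score."""
    base = model_fn(codes)
    reduced = model_fn([c for c in codes if c != target_code])
    return abs(base - reduced)

# Toy risk model (hypothetical): risk grows linearly with code count
toy_model = lambda codes: 0.01 * len(codes)
delta = perturbation_importance(toy_model, ["A", "B", "C"], "B")
```

Correlating attention weights with a score like this is a model-agnostic check that the attention values genuinely track each code's contribution to the prediction.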

Limitations

In this article, we present the results from applying several complex models to a single data set. Further study will be necessary to verify that our results are generalizable to alternative data sets. Unfortunately, there is a notable lack of large, publicly available electronic health record data that focus on long-term, chronic conditions. The most popular research data set, MIMIC-III, focuses on critical care patients; we do not believe that our methodology is well-optimized for this type of application since it focuses on code occurrences over long periods of time (particularly chronic codes). The data set we have employed may not have been perfectly accurate. In a health system as large as ours, it is inevitable that some patients will not be assigned diagnoses correctly or that demographic data will be recorded incorrectly. Furthermore, our definition of mortality is based on an operational definition which our health system uses in practice; however, it is possible that some mortality events are not recorded through this operational system. Similarly, it is possible that some patients were hospitalized without submitting claims. In any of these cases, we note that these omissions would be likely to degrade model performance. To limit the scope of computational resources required, we have employed only models of moderate size. Our experiments did not explore the impacts of incorporating additional attentional heads or compare our results with other modern architectures. We note, however, that these are practical concerns present in most healthcare organizations that may seek to deploy patient outcome modeling. Furthermore, HiTANet has compared well with other leading recent alternatives. All models presented here were able to run on a virtual machine with only 2 cores and 64 GB of RAM. Our experimentation with HiTANet was limited by model training times.
We found that HiTANet epochs took approximately 10× longer to train than our own largest models, with similarly long times required to produce validation predictions. We suspect that this difference was due in large part to our computational infrastructure. HiTANet was implemented and optimized by the original authors for use in a GPU-based environment. Unfortunately, we did not have access to a robust GPU training environment and were therefore unable to make use of these optimizations.

Future directions

This article has indicated several interesting new directions for patient modeling. Our work suggests that incorporating demographic features and transfer learning into other model architectures could improve their performance. Further research will be required to determine how embedding strategy influences the performance of various architectures, particularly when embedding weights are initialized by training on alternative data sets. We also proposed a new method for encoding time data into models. Our results indicate that an encoding based on a trigonometric decomposition can drastically improve model performance, although further investigation is required to determine the relative efficiency of the various ways this data can be encoded. Finally, we found that incorporating advanced re-embedding structures such as self-attention can complicate model interpretation. The influence of such structures will naturally vary by model architecture; however, our results strongly indicate that it is useful to confirm a researcher's intuition and intentions by correlating attention results with a perturbation study.
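Since the paper's exact time-encoding construction is not reproduced here, the following sketch illustrates one common trigonometric decomposition: the sinusoidal encoding of Vaswani et al., applied to the number of days elapsed since each diagnosis. The dimension and period constants are illustrative choices, not T3Net's actual values.

```python
import numpy as np

def trig_time_encoding(elapsed_days, dim=8, max_period=10_000.0):
    """Sinusoidal encoding of elapsed time (in days) between a diagnosis
    and the prediction date, in the style of Vaswani et al.'s positional
    encoding. `dim` and `max_period` are illustrative choices."""
    positions = np.asarray(elapsed_days, dtype=float)[:, None]        # (n, 1)
    # Geometrically spaced frequencies, from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions * freqs                                        # (n, dim/2)
    enc = np.empty((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)  # even dimensions: sine components
    enc[:, 1::2] = np.cos(angles)  # odd dimensions: cosine components
    return enc

# Encode the time gaps for three diagnosis events: 1 day, 30 days, 2 years.
E = trig_time_encoding([1, 30, 730])
print(E.shape)  # (3, 8)
```

Because the components vary at multiple time scales, a downstream attention layer can weight recent acute events differently from long-standing chronic ones, which is consistent with the behavior observed in the attention analysis.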

CONCLUSION

In this article, we have examined the effectiveness of several attentional models on patient prediction tasks using real EMR data. Our results indicate that attentional networks can produce strong models of patient outcomes with relatively small computational requirements. Furthermore, our adaptation of Vaswani et al.'s Transformer model (T3Net) proved superior to a leading literature alternative on the given tasks. Finally, our findings indicate several important considerations for future models in this space, including the value of pretraining medical concept embeddings, of demographic features, of incorporating time data into the model architecture, and of using a code's full contribution to the model (total weight) rather than its simple attention weight (global weight) when interpreting a model with self-attentional elements.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONTRIBUTORS

Anthony Finch proposed the work, performed the core data analysis, implemented the core neural network models, and supervised other technical members of the team. Alexander Crowell built the original model of patient mortality and the accompanying data set, helped design several of the core model architectures, and assisted in developing the model code. Yung-Chieh Chang built the original model of patient hospital admissions and the accompanying target data, and assisted in developing the model code. Pooja Parameshwarappa contributed significantly to the literature review and assisted in developing the model code. Jose Martinez managed and supported several of the personnel who participated in this project. Michael Horberg directly supervised this project and contributed significantly to the research plan.

CONFLICT OF INTEREST STATEMENT

None declared.

Data availability

All data underlying this article will be shared on reasonable request to the corresponding author.