Jeffrey Thompson, Jinxiang Hu, Dinesh Pal Mudaranthakam, David Streeter, Lisa Neums, Michele Park, Devin C Koestler, Byron Gajewski, Roy Jensen, Matthew S Mayo.
Abstract
Electronic health records (EHR) represent a rich resource for conducting observational studies, supporting clinical trials, and more. However, much of the data consists of unstructured text, which is an obstacle to automated extraction. Natural language processing (NLP) can structure and learn from text, but NLP algorithms were not designed for the unique characteristics of EHR. Here, we propose Relevant Word Order Vectorization (RWOV) to aid with structuring. RWOV is based on finding the positional relationships between the words most relevant to predicting the class of a text. This allows machine learning algorithms to use not just the interaction of keywords but also positional dependencies (e.g. a relevant word occurs 5 relevant words before some term of interest). As a proof of concept, we attempted to classify the hormone receptor status of breast cancer patients treated at the University of Kansas Medical Center, comparing RWOV to other methods using the F1 score and AUC. RWOV performed as well as or better than other methods in all but one case. For F1 score, RWOV had a clear edge on most tasks. AUC tended to be closer, but for HER2, RWOV was significantly better in most comparisons. These results suggest RWOV should be further developed for EHR-related NLP.
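The positional idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `rwov_like_vector`, the choice to keep only the occurrence closest to the term of interest, and the use of 0 for "absent" are all assumptions made for illustration. The relevant vocabulary is assumed to be given. Offsets are counted in relevant words, not raw tokens, as in the abstract's example.

```python
def rwov_like_vector(tokens, toi, vocab):
    """Encode each vocab word by its signed offset from the term of interest (TOI).

    Offsets are counted over relevant words only (hypothetical encoding):
    negative = before the TOI, positive = after, 0 = word or TOI absent.
    """
    vocab_set = set(vocab)
    # Drop every token that is neither relevant nor the TOI itself.
    kept = [t for t in tokens if t in vocab_set or t == toi]
    if toi not in kept:
        return [0] * len(vocab)
    toi_pos = kept.index(toi)  # first occurrence of the TOI
    vector = []
    for word in vocab:
        offsets = [i - toi_pos for i, t in enumerate(kept)
                   if t == word and i != toi_pos]
        # Assumption: keep the occurrence closest to the TOI.
        vector.append(min(offsets, key=abs) if offsets else 0)
    return vector


# "pr" is not in the vocabulary, so "neg" is 2 relevant words after "er".
example = rwov_like_vector(["tum", "er", "posit", "pr", "neg"],
                           toi="er", vocab=["posit", "neg", "tum"])
print(example)  # [1, 2, -1]
```

A vector like this could then be passed to a classifier such as an SVM or a neural network, which is how the comparison methods in the tables are labeled.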
Year: 2019 PMID: 31239489 PMCID: PMC6592944 DOI: 10.1038/s41598-019-45705-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Counts of subjects by hormone receptor status.
| Hormone Receptor | Positive (%) | Negative (%) | Total |
|---|---|---|---|
| Estrogen receptor (ER) | 491 (78.1) | 138 (21.9) | 629 |
| Progesterone receptor (PR) | 396 (66.1) | 203 (33.9) | 599 |
| Human epidermal growth factor receptor 2 (HER2) | 42 (18.3) | 187 (81.7) | 229 |
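As a quick check on the table above, the percentages can be recomputed from the raw counts. The sketch below uses only the counts shown in the table; the helper name `class_percentages` is hypothetical.

```python
# (positive, negative) counts taken from the table above
counts = {
    "ER": (491, 138),
    "PR": (396, 203),
    "HER2": (42, 187),
}


def class_percentages(positive, negative):
    """Return (positive %, negative %, total), rounded to one decimal place."""
    total = positive + negative
    return (round(100 * positive / total, 1),
            round(100 * negative / total, 1),
            total)


for receptor, (pos, neg) in counts.items():
    print(receptor, class_percentages(pos, neg))
```

The HER2 task has both the fewest labeled subjects and the strongest class imbalance, which is worth keeping in mind when reading the per-class F1 scores.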
Classification performance based on vectorization method.
| Method | ER+ F1 | ER+ AUC | ER− F1 | ER− AUC | PR+ F1 | PR+ AUC | PR− F1 | PR− AUC | HER2+ F1 | HER2+ AUC | HER2− F1 | HER2− AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RWOV-NN | 0.95 | |||||||||||
| RWOV-SVM | 0.90 | 0.88 | 0.70 | 0.88 | 0.76 | 0.72 | 0.58 | 0.71 | 0.53 | 0.77 | 0.84 | 0.77 |
| SVM(1,2) | 0.92 | 0.94 | 0.69 | 0.94 | 0.91 | 0.95 | 0.81 | 0.95 | 0.31 | 0.67 | 0.87 | 0.68 |
| SVM(2,2) | 0.93 | 0.94 | 0.72 | 0.94 | 0.91 | 0.82 | 0.37 | 0.73 | 0.90 | 0.74 | ||
| SVM(1,3) | 0.94 | 0.94 | 0.73 | 0.94 | 0.90 | 0.95 | 0.80 | 0.95 | 0.32 | 0.69 | 0.89 | 0.70 |
| SVM(2,3) | 0.94 | 0.94 | 0.75 | 0.94 | 0.91 | 0.95 | 0.82 | 0.95 | 0.40 | 0.72 | 0.74 | |
| SVM(3,3) | 0.93 | 0.92 | 0.71 | 0.92 | 0.90 | 0.92 | 0.79 | 0.92 | 0.35 | 0.71 | 0.72 | |
| SVM-W2V | 0.71 | 0.64 | 0.40 | 0.64 | 0.67 | 0.69 | 0.57 | 0.69 | 0.20 | 0.45 | 0.60 | 0.43 |
| NN(1,2) | 0.93 | 0.71 | 0.94 | 0.90 | 0.95 | 0.82 | 0.95 | 0.35 | 0.81 | 0.79 | ||
| NN(2,2) | 0.93 | 0.93 | 0.72 | 0.94 | 0.92 | 0.81 | 0.95 | 0.39 | 0.80 | 0.80 | ||
| NN(1,3) | 0.94 | 0.94 | 0.72 | 0.95 | 0.91 | 0.95 | 0.81 | 0.95 | 0.30 | 0.79 | 0.79 | |
| NN(2,3) | 0.93 | 0.92 | 0.70 | 0.93 | 0.92 | 0.94 | 0.80 | 0.95 | 0.33 | 0.78 | 0.90 | 0.78 |
| NN(3,3) | 0.93 | 0.92 | 0.69 | 0.91 | 0.89 | 0.92 | 0.79 | 0.92 | 0.20 | 0.76 | 0.90 | 0.76 |
| NN-W2V | 0.85 | 0.76 | 0.40 | 0.75 | 0.81 | 0.75 | 0.63 | 0.78 | 0.07 | 0.43 | 0.87 | 0.41 |
RWOV had the most consistent performance across classification tasks. The best-performing method for each metric on each task is shown in bold in the table. In only one case did RWOV-NN not have the best performance (PR− AUC); however, it was very close to the top performer.
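For readers less familiar with the two metrics, the following sketch shows one standard way to compute a per-class F1 score and AUC. How the authors computed them is not stated in this record, so the function names and the pairwise (rank-based) AUC formulation are assumptions for illustration.

```python
def f1_score(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores, positive):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not from the study).
y_true = ["pos", "pos", "neg", "pos", "neg"]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]            # predicted probability of "pos"
y_pred = ["pos" if s >= 0.5 else "neg" for s in scores]

print(f1_score(y_true, y_pred, "pos"))         # F1 with "pos" as target class
print(f1_score(y_true, y_pred, "neg"))         # F1 with "neg" as target class
print(auc(y_true, scores, "pos"))
```

On this toy data the F1 score differs depending on which class is treated as the target, while the AUC of a binary scorer is the same from either side. That is the reason per-class F1 and AUC can tell different stories on imbalanced tasks.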
P-values for difference in AUC between RWOV-NN and other methods.
| Method | ER+ | ER− | PR+ | PR− | HER2+ | HER2− |
|---|---|---|---|---|---|---|
| RWOV-SVM | ||||||
| SVM(1,2) | 3.56E-01 | 1.85E-01 | 4.34E-01 | 9.54E-01 | ||
| SVM(2,2) | 4.75E-01 | 2.56E-01 | 8.66E-01 | 5.32E-01 | ||
| SVM(1,3) | 5.70E-01 | 3.08E-01 | 3.03E-01 | 8.78E-01 | ||
| SVM(2,3) | 5.04E-01 | 2.55E-01 | 3.16E-01 | 9.02E-01 | ||
| SVM(3,3) | 5.52E-02 | |||||
| SVM-W2V | ||||||
| NN(1,2) | 9.43E-01 | 3.81E-01 | 2.68E-01 | 5.94E-01 | 8.48E-02 | |
| NN(2,2) | 2.99E-01 | 1.93E-01 | 8.79E-01 | 3.90E-01 | 1.65E-01 | |
| NN(1,3) | 5.83E-01 | 5.43E-01 | 3.53E-01 | 7.23E-01 | 5.34E-02 | |
| NN(2,3) | 1.33E-01 | 6.79E-02 | 2.14E-01 | 8.44E-01 | ||
| NN(3,3) | 5.79E-02 | |||||
| NN-W2V | ||||||
Significant results are highlighted in bold.
Figure 1. 95% confidence intervals for AUC across breast cancer subtypes. Our approach is shown on the left, in black. RWOV-NN has consistently high AUC across the tasks.
Figure 2. 95% confidence intervals for F1 across breast cancer subtypes. Our approach is shown on the left, in black. In every case, RWOV-NN has the highest F1 score across the tasks.
Figure 3. ROC curves for our method and the best of each of the comparison methods. RWOV-NN shows that more clinically useful cut points, with low false positives and high true positives, are possible consistently across the tasks.
Top 30 most frequently occurring words for each term of interest (TOI). The mean frequency of occurrence per observation is shown.
| Rank | ER Top Words | ER Frequency | PR Top Words | PR Frequency | HER2 Top Words | HER2 Frequency |
|---|---|---|---|---|---|---|
| 1 | comm | 1.16 | comm | 1.18 | comm | 0.98 |
| 2 | pospct | 0.96 | pospct | 1.01 | pospct | 0.73 |
| 3 | er | 0.71 | pr | 0.71 | her2 | 0.72 |
| 4 | tum | 0.61 | tum | 0.58 | neg | 0.51 |
| 5 | slash | 0.52 | slash | 0.53 | pr | 0.44 |
| 6 | not | 0.55 | not | 0.48 | er | 0.40 |
| 7 | receiv | 0.44 | er | 0.42 | plu | 0.42 |
| 8 | mark | 0.44 | posit | 0.39 | commabef | 0.37 |
| 9 | posit | 0.42 | between | 0.37 | tum | 0.38 |
| 10 | ident | 0.40 | mark | 0.37 | not | 0.37 |
| 11 | pr | 0.39 | hour | 0.34 | stain | 0.38 |
| 12 | outsid | 0.35 | outsid | 0.36 | slash | 0.30 |
| 13 | neg | 0.34 | neg | 0.35 | mark | 0.31 |
| 14 | prognost | 0.33 | prognost | 0.32 | posit | 0.31 |
| 15 | nod | 0.30 | nod | 0.30 | on | 0.29 |
| 16 | per | 0.28 | plu | 0.32 | outsid | 0.29 |
| 17 | plu | 0.29 | ident | 0.30 | in | 0.28 |
| 18 | lymph | 0.27 | per | 0.31 | for | 0.25 |
| 19 | for | 0.28 | lymph | 0.28 | carcinom | 0.29 |
| 20 | between | 0.28 | tim | 0.27 | invas | 0.29 |
| 21 | perform | 0.24 | ki-67 | 0.24 | cel | 0.29 |
| 22 | necros | 0.22 | on | 0.23 | between | 0.25 |
| 23 | sampl | 0.22 | sampl | 0.22 | per | 0.24 |
| 24 | ki-67 | 0.21 | negpct | 0.21 | hour | 0.23 |
| 25 | tim | 0.23 | 1 | 0.22 | nod | 0.23 |
| 26 | hour | 0.23 | her2 | 0.22 | ki-67 | 0.24 |
| 27 | on | 0.20 | patholog | 0.20 | prognost | 0.24 |
| 28 | her2 | 0.19 | remov | 0.21 | stag | 0.21 |
| 29 | follow | 0.19 | necros | 0.20 | grad | 0.18 |
| 30 | negpct | 0.19 | perform | 0.21 | with | 0.19 |
Words have been stemmed (shortened to common roots/parts).
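A mean frequency per observation, as reported in the table above, can be computed along the following lines. This is a minimal sketch that assumes each observation is already a list of stemmed tokens; the function name and the toy observations are hypothetical, and the authors' exact preprocessing is not described here.

```python
from collections import Counter


def mean_frequency_per_observation(observations, top_n=30):
    """Return the top_n words with their total count divided by the number of observations."""
    if not observations:
        return []
    totals = Counter()
    for tokens in observations:
        totals.update(tokens)  # count every occurrence, including repeats
    n = len(observations)
    return [(word, count / n) for word, count in totals.most_common(top_n)]


# Toy observations (made up for illustration, already stemmed).
observations = [
    ["comm", "er", "posit", "comm"],
    ["er", "neg"],
    ["comm", "tum"],
]
for word, freq in mean_frequency_per_observation(observations, top_n=2):
    print(word, round(freq, 2))
```

Because repeated occurrences within one observation are all counted, a mean frequency above 1 is possible, as with `comm` in the ER and PR columns.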