Meijian Guan, Samuel Cho, Robin Petro, Wei Zhang, Boris Pasche, Umit Topaloglu.
Abstract
OBJECTIVES: Natural language processing (NLP) and machine learning approaches were used to build classifiers to identify genomic-related treatment changes in the free-text visit progress notes of cancer patients.
Keywords: cancer; electronic health records; genomics; machine learning; natural language processing
Year: 2019 PMID: 30944913 PMCID: PMC6435007 DOI: 10.1093/jamiaopen/ooy061
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Workflow of text processing and document classification using machine learning models.
Figure 2. Dimensional reduction of the term frequency-inverse document frequency (TF-IDF) representation of the documents via singular-value decomposition (SVD). Data points are colored by treatment-change (1) and nontreatment-change (0) groups.
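To make the Figure 2 pipeline concrete, here is a minimal Python sketch using scikit-learn. The `notes` list and `labels` are hypothetical stand-ins for the de-identified progress notes and treatment-change annotations; the vectorizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the Figure 2 pipeline: TF-IDF vectorization
# followed by truncated SVD for a 2-D projection of the documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

notes = [  # stand-ins for de-identified visit progress notes
    "patient started on targeted therapy after genomic testing",
    "no change in treatment plan at this visit",
    "regimen switched based on mutation panel results",
    "continuing current chemotherapy without modification",
]
labels = [1, 0, 1, 0]  # 1 = treatment change, 0 = no treatment change

# Term frequency-inverse document frequency representation of the corpus
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(notes)

# Singular-value decomposition down to two components for plotting
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="coolwarm", alpha=0.7)
plt.xlabel("SVD component 1")
plt.ylabel("SVD component 2")
plt.show()
```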
Best hyperparameters for the classifiers
| Classifier | Hyperparameters |
|---|---|
| Deep learning classifiers | |
| LSTM_onFly | Optimizer=Adam, batch size=64, dropout rate=0, word embedding=trained on the fly, recurrent layer=single directional LSTM |
| LSTM_Pre | Optimizer=Adam, batch size=64, dropout rate=0, word embedding=pretrained on the whole corpus, recurrent layer=single directional LSTM |
| LSTM_Bi | Optimizer=Adam, batch size=64, dropout rate=0, word embedding=pretrained on the whole corpus, recurrent layer=bidirectional LSTM |
| GRU | Optimizer=Adam, batch size=64, dropout rate=0, word embedding=pretrained on the whole corpus, recurrent layer=single directional GRU |
| Conventional classifiers | |
| KNN | Number of neighbors=7 |
| LR | L2 penalty parameter=10 |
| NB | Smoothing parameter alpha=0 |
| RF | Maximum depth of a tree=6, minimum number of samples required to split an internal node=5, minimum number of samples required to be at a leaf node=5 |
| SVC | Kernel=linear, L2 penalty parameter=30 |
GRU: gated recurrent unit; KNN: K-nearest neighbors; LR: logistic regression; LSTM: long short-term memory; NB: naive Bayes; RF: random forest; SVC: support vector machine for classification.
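A hedged sketch of how the table's conventional-classifier settings might map onto scikit-learn. Mapping the "penalty parameter" onto sklearn's `C` (an inverse regularization strength) is an assumption, not something the table confirms.

```python
# Possible scikit-learn equivalents of the tabled hyperparameters.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "LR": LogisticRegression(penalty="l2", C=10),  # C mapping is assumed
    "NB": MultinomialNB(alpha=0),                  # alpha=0: no smoothing
    "RF": RandomForestClassifier(max_depth=6,
                                 min_samples_split=5,
                                 min_samples_leaf=5),
    "SVC": SVC(kernel="linear", C=30),             # C mapping is assumed
}
```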
Figure 3. Architecture of the RNN models. GRU: gated recurrent unit; LSTM: long short-term memory; LSTM_Bi: bidirectional LSTM; RNN: recurrent neural network.
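An illustrative Keras sketch of an LSTM_Bi-style model consistent with the hyperparameter table (Adam optimizer, batch size 64, dropout rate 0, bidirectional LSTM). The vocabulary size, embedding dimension, and layer width are assumptions; Figure 3 summarizes the authors' actual architecture.

```python
# Illustrative LSTM_Bi-style classifier. Only the optimizer, batch size,
# dropout rate, and bidirectional LSTM layer come from the table above;
# all dimensions are assumed for the sketch.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, embed_dim = 20000, 100  # assumed vocabulary/embedding sizes

model = Sequential([
    # In the LSTM_Pre/LSTM_Bi/GRU variants, the embedding weights would be
    # initialized from vectors pretrained on the whole corpus.
    Embedding(vocab_size, embed_dim),
    Bidirectional(LSTM(64, dropout=0.0)),  # dropout rate = 0, per the table
    Dense(1, activation="sigmoid"),        # binary: treatment change or not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=64, epochs=15, validation_split=0.1)
```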
Performance of the classifiers on the document classification task, repeated 100 times
| Classifier | Accuracy (mean±SD) | Precision (mean±SD) | Recall (mean±SD) | F1 score (mean±SD) |
|---|---|---|---|---|
| Deep learning classifiers | | | | |
| LSTM_onFly | 0.821±0.026 | 0.850±0.029 | 0.872±0.040 | 0.860±0.023 |
| LSTM_Pre | 0.849±0.015 | 0.874±0.023 | 0.890±0.022 | 0.882±0.013 |
| LSTM_Bi | 0.862±0.019 | 0.885±0.020 | 0.900±0.026 | 0.892±0.015 |
| GRU | 0.859±0.014 | 0.882±0.021 | 0.899±0.022 | 0.890±0.012 |
| Conventional classifiers | | | | |
| KNN | 0.806±0.016 | 0.834±0.022 | 0.913±0.024 | 0.829±0.015 |
| LR | 0.829±0.015 | 0.836±0.022 | 0.904±0.023 | 0.826±0.014 |
| NB | 0.772±0.016 | 0.875±0.016 | 0.811±0.023 | 0.806±0.016 |
| RF | 0.809±0.015 | 0.804±0.023 | 0.926±0.017 | 0.809±0.015 |
| SVC | 0.826±0.014 | 0.814±0.024 | 0.830±0.019 | 0.772±0.016 |
GRU: gated recurrent unit; KNN: K-nearest neighbors; LR: logistic regression; LSTM: long short-term memory; NB: naive Bayes; RF: random forest; SD: standard deviation; SVC: support vector machine for classification.
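The mean±SD columns imply re-splitting, re-fitting, and re-scoring each model 100 times. A plausible sketch of that protocol, assuming a random hold-out split per repetition (the exact split strategy is not shown in this record):

```python
# Repeat train/test evaluation n_repeats times and report mean and SD
# for accuracy, precision, recall, and F1.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def repeated_scores(clf, X, y, n_repeats=100):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)  # assumed split ratio
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        scores.append([accuracy_score(y_te, pred),
                       precision_score(y_te, pred),
                       recall_score(y_te, pred),
                       f1_score(y_te, pred)])
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0)  # mean, SD per metric
```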
Figure 4. Performance comparisons of the nine machine learning algorithms based on (A) a single run and (B) 100 repeated runs. Mean metrics (dots) and their standard deviations (bars) are shown.
Figure 5. Training curves over the first 15 epochs for the RNN-based models: the upper panel shows model accuracy and the lower panel shows model loss on the training and validation datasets. RNN: recurrent neural network.
Figure 6. Confusion matrices of (A) the RNN-based models and (B) the conventional machine learning models. GRU: gated recurrent unit; LSTM: long short-term memory; NB: naive Bayes; RF: random forest; RNN: recurrent neural network; SVC: support vector machine for classification.
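Per-model confusion matrices like those in Figure 6 can be produced in scikit-learn along these lines; `y_test` and `pred` below are hypothetical held-out labels and model predictions, not the paper's data.

```python
# Hypothetical labels/predictions standing in for a model's test output
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_test = [0, 1, 1, 0, 1, 1, 0, 0]
pred   = [0, 1, 0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_test, pred, labels=[0, 1])
ConfusionMatrixDisplay(cm, display_labels=["no change", "change"]).plot()
plt.show()
```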