| Literature DB >> 35702624 |
Enshuo Hsu1, Ioannis Malagaris1, Yong-Fang Kuo1, Rizwana Sultana2, Kirk Roberts3.
Abstract
Objective: Scanned documents in electronic health records (EHR) have posed a challenge for decades and are expected to remain one for the foreseeable future. Current approaches to processing them include image preprocessing, optical character recognition (OCR), and natural language processing (NLP). However, there is limited work evaluating the interaction of image preprocessing methods, NLP models, and document layout.
Keywords: electronic health records; natural language processing; optical character recognition; polysomnography; scanned document
Year: 2022 PMID: 35702624 PMCID: PMC9188320 DOI: 10.1093/jamiaopen/ooac045
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Scanned document images after image preprocessing. (A) The original scanned image. (B) The gray-scaled image. (C) The image with 20% increased contrast. (D) The image with 60% increased contrast. (E) The image with dilation and erosion and 20% increased contrast. (F) The image with dilation and erosion and 60% increased contrast.
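The preprocessing steps shown in Figure 1 (gray-scaling, contrast adjustment, dilation and erosion) can be sketched with plain NumPy. The function names and the 3×3 structuring element below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale conversion (ITU-R BT.601 weights)."""
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    return np.rint(gray).astype(np.uint8)

def adjust_contrast(gray, factor):
    """Scale pixel deviations from mid-gray; factor=1.2 corresponds to '+20% contrast'."""
    out = 128.0 + factor * (gray.astype(np.float64) - 128.0)
    return np.rint(np.clip(out, 0, 255)).astype(np.uint8)

def _filter3x3(gray, reduce_fn):
    """Apply a 3x3 sliding-window reduction with edge padding."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    windows = np.stack([padded[r:r + h, c:c + w]
                        for r in range(3) for c in range(3)])
    return reduce_fn(windows, axis=0)

def dilate(gray):
    """3x3 morphological dilation: expands bright regions by one pixel."""
    return _filter3x3(gray, np.max)

def erode(gray):
    """3x3 morphological erosion: shrinks bright regions by one pixel."""
    return _filter3x3(gray, np.min)
```

Applying dilation followed by erosion (a morphological closing) smooths character strokes before OCR; combining it with a contrast increase yields variants (E) and (F) above.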
Figure 2. Output of OCR for visual inspection.
Analytical data for text classification
| Left | Top | Width | Height | Page | Numeric value | Segment | Label |
|---|---|---|---|---|---|---|---|
| 1048 | 385 | 111 | 50 | 1 | 19.5 | The Medicare scoring rule. The total APNEA/HYPOPNEA INDEX (AHI) was | AHI |
| 231 | 558 | 76 | 23 | 1 | 87.0 | Versus a non-REM AHI of 15.1. The lowest desaturation was | SaO2 |
| 735 | 388 | 61 | 26 | 1 | 26.0 | Hypopneas, 120 met the AASM Version 2 scoring rule, while | Other |
Note: The columns “Left” and “Top” give the pixel coordinates of the top-left corner of the word region. The columns “Width” and “Height” give the word region’s width and height in pixels. The column “Page” indicates the page of the document from which the numeric value was extracted. The column “Numeric value” is the floating-point representation of the numeric value. The column “Segment” holds the free-text segment of 21 words; the numeric value is labeled in bold. The column “Label” was derived from manual chart review and was used as the label for the supervised learning classifiers.
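A row of this analytical table can be assembled from word-level OCR output (each word paired with its bounding box). The helper below is a hypothetical sketch, not the authors' pipeline: it finds each numeric token and records its box, page, value, and a 21-word context segment (10 words on each side):

```python
import re

def numeric_rows(words, boxes, page, window=10):
    """Build one analytical-table row per numeric token found in an OCR word list.

    words  -- list of OCR word strings, in reading order
    boxes  -- parallel list of (left, top, width, height) pixel boxes
    page   -- page number the words came from
    window -- context words on each side of the numeric value (10 -> 21-word segment)
    """
    rows = []
    for i, word in enumerate(words):
        if not re.fullmatch(r"\d+(?:\.\d+)?", word):
            continue  # keep only plain integer/decimal tokens
        left, top, width, height = boxes[i]
        segment = " ".join(words[max(0, i - window): i + window + 1])
        rows.append({"Left": left, "Top": top, "Width": width,
                     "Height": height, "Page": page,
                     "Numeric value": float(word), "Segment": segment})
    return rows
```

Each resulting dict mirrors the table's columns; the "Label" column would then be attached from manual chart review.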
Figure 3. Parent neural network architecture. The structured input branch (top-left) takes in position indicators, page number, and numeric value. The sequence input branch (top-right) takes in encoded segments, processed by specific deep learning architectures, and flattened to remove time steps. The classifier layers (bottom) connect the structured input branch (green) and sequence input branch (blue) and make predictions.
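The merge in Figure 3 amounts to flattening the sequence branch's output and concatenating it with the structured features before the classifier layers. A minimal NumPy shape sketch (the batch size, token count, and embedding dimension are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Structured branch: left, top, width, height, page, numeric value
structured = rng.normal(size=(batch, 6))

# Sequence branch output: 32 tokens x 100-dim encodings per segment
sequence = rng.normal(size=(batch, 32, 100))

flat = sequence.reshape(batch, -1)                   # flatten to remove time steps
merged = np.concatenate([structured, flat], axis=1)  # input to classifier layers
```

The classifier layers then operate on `merged`, so position, page, and value features are learned jointly with the text representation.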
Figure 4. Data pipeline flowchart.
Figure 5. Collection of scanned sleep study reports. The images have been intentionally blurred; their purpose here is to convey the overall structure and consistency (and lack thereof) across scanned documents.
Summary of data and labels
| | Reports | Pages | Numeric values | Instances of AHI | Instances of SaO2 | Instances of other |
|---|---|---|---|---|---|---|
| Entire data set | 955 | 2988 | 83 915 | 1904 | 1698 | 80 313 |
| Development set | 669 | 2031 | 56 839 | 1323 | 1146 | 54 370 |
| Test set | 286 | 957 | 27 076 | 581 | 552 | 25 943 |
Note: Reports and Pages are counted from the PDF documents; numeric values and label instances are counted from the OCR outputs.
Evaluation of different classifiers
| Model family | Classifier | Segment-level recall | Segment-level precision | Segment-level F1 | Document-level AUROC (95% CI) | Document-level accuracy (95% CI) |
|---|---|---|---|---|---|---|
| AHI | | | | | | |
| Bag-of-words models | LR | 0.4819 | 0.8383 | 0.612 | 0.9093 (0.8932–0.9254) | 87.41 (83.57–91.26) |
| | LASSO (L1) | 0.4819 | 0.8889 | 0.625 | 0.9169 (0.9014–0.9325) | 89.16 (85.56–92.76) |
| | Ridge (L2) | 0.4802 | 0.8429 | 0.6118 | 0.9176 (0.9021–0.9331) | 87.41 (83.57–91.26) |
| | SVM | 0.6093 | 0.9752 | 0.75 | 0.9050 (0.8886–0.9215) | 93.01 (90.05–95.96) |
| | kNN | 0.6713 | 0.8534 | 0.7514 | 0.8644 (0.8454–0.8834) | 93.57 (90.36–96.78) |
| | Naive Bayes | 0.5577 | 0.4367 | 0.4898 | 0.9179 (0.9024–0.9334) | 75.87 (70.92–80.83) |
| | Random Forest | 0.6299 | 0.9865 | 0.7689 | 0.9476 (0.9350–0.9603) | 93.71 (90.89–96.52) |
| Sequence models | BiLSTM | 0.6454 | 0.9843 | 0.7796 | 0.9637 (0.9530–0.9743) | 94.06 (91.32–96.80) |
| | BERT | 0.747 | 0.8803 | 0.8082 | 0.9705 (0.9609–0.9802) | |
| | ClinicalBERT | 0.7315 | 0.914 | 0.8126 | | 94.76 (92.17–97.34) |
| SaO2 | | | | | | |
| Bag-of-words models | LR | 0.567 | 0.4914 | 0.5265 | 0.9153 (0.8992–0.9314) | 82.87 (78.50–87.23) |
| | LASSO (L1) | 0.538 | 0.5103 | 0.5238 | 0.9151 (0.8990–0.9312) | 84.62 (80.43–88.80) |
| | Ridge (L2) | 0.5543 | 0.4904 | 0.5204 | 0.9143 (0.8981–0.9305) | 83.22 (78.89–87.55) |
| | SVM | 0.6105 | 0.9133 | 0.7318 | 0.8860 (0.8678–0.9042) | 87.76 (83.96–91.56) |
| | kNN | 0.587 | 0.8663 | 0.6998 | 0.8429 (0.8223–0.8634) | 87.86 (83.84–91.88) |
| | Naive Bayes | 0.6322 | 0.2705 | 0.3789 | 0.9082 (0.8915–0.9248) | 51.75 (45.96–57.54) |
| | Random Forest | 0.6087 | 0.9307 | 0.736 | 0.9264 (0.9113–0.9415) | 89.51 (85.96–93.06) |
| Sequence models | BiLSTM | 0.6739 | 0.9051 | 0.7726 | 0.9274 (0.9123–0.9424) | |
| | BERT | 0.7319 | 0.8651 | | 0.9358 (0.9215–0.9500) | |
| | ClinicalBERT | 0.683 | 0.8871 | 0.7718 | 0.9523 (0.9398–0.9647) | 91.61 (88.40–94.82) |
Note: Logistic regression applies no penalty; LASSO regression uses an L1 penalty (λ = 0.01); Ridge uses an L2 penalty (λ = 0.01). The support vector machine uses a polynomial kernel. kNN uses k = 3. The Naive Bayes classifier uses alpha = 0.5. BiLSTM uses Word2Vec embeddings pretrained on the training set with CBOW and 100-dimensional input vectors. BERT and ClinicalBERT are fine-tuned for 100 epochs with sequence length 32 and batch size 64. We highlight the highest F1, AUROC, and accuracy in bold.
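The segment-level F1 values in the table are the harmonic mean of precision and recall; for example, BERT's AHI recall and precision reproduce its reported F1. A quick check (illustrative, not the authors' code):

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# BERT, AHI: precision 0.8803, recall 0.747 -> reported F1 0.8082
bert_ahi_f1 = round(f1(0.8803, 0.747), 4)
```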
Figure 6. ROC curve for each classifier.
Comparing ClinicalBERT with BERT, BiLSTM, and Random Forest
| Adjusted P-value | ClinicalBERT vs BERT (AHI) | ClinicalBERT vs BERT (SaO2) | ClinicalBERT vs BiLSTM (AHI) | ClinicalBERT vs BiLSTM (SaO2) | ClinicalBERT vs Random Forest (AHI) | ClinicalBERT vs Random Forest (SaO2) |
|---|---|---|---|---|---|---|
| AUROC | 0.4528 | | | | | |
| Document accuracy | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Note: AUROCs were compared pairwise with DeLong’s test. Document accuracies were compared pairwise with the chi-squared test. All P-values are corrected with the Bonferroni procedure. We highlight statistically significant P-values in bold.
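The Bonferroni procedure multiplies each raw P-value by the number of comparisons (capped at 1), and the document-accuracy comparisons reduce to chi-squared tests on 2×2 contingency tables. A pure-Python sketch of both under their standard definitions (not the authors' statistical code):

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each P-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def chi2_statistic_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table
    [[a, b], [c, d]], without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

With many adjusted P-values reaching 1.0000 in the table above, the correction's cap at 1 is doing most of the work: the raw pairwise differences in document accuracy were already far from significant.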
Figure 7. Evaluation of the effect of training set size.
Comparison of different image preprocessing methods
| | Image preprocessing | Segment-level recall | Segment-level precision | Segment-level F1 | Document-level AUROC (95% CI) | Document-level accuracy (95% CI) |
|---|---|---|---|---|---|---|
| AHI | Gray scale | 0.7187 | 0.9249 | 0.8089 | 0.9699 (0.9601–0.9796) | 95.45 (93.04–97.87) |
| | Gray scale + dilate and erode | 0.6961 | 0.9687 | 0.81 | 0.9679 (0.9573–0.9784) | 94.41 (91.74–97.07) |
| | Gray scale + contrast 20% | 0.7126 | 0.9324 | 0.8078 | 0.9705 (0.9609–0.9802) | 94.06 (91.32–96.80) |
| | Gray scale + contrast 60% | 0.7268 | 0.9216 | 0.8127 | 0.9692 (0.9593–0.9790) | 95.45 (93.04–97.87) |
| | Gray scale + dilate and erode + contrast 20% | 0.7315 | 0.914 | 0.8126 | | 94.76 (92.17–97.34) |
| | Gray scale + dilate and erode + contrast 60% | 0.7268 | 0.9216 | 0.8172 | 0.9715 (0.9620–0.9810) | |
| SaO2 | Gray scale | 0.7258 | 0.8617 | 0.7879 | 0.9334 (0.9190–0.9478) | 91.61 (88.40–94.82) |
| | Gray scale + dilate and erode | 0.7427 | 0.8819 | 0.8063 | | 90.21 (86.77–93.65) |
| | Gray scale + contrast 20% | 0.6957 | 0.8889 | 0.7805 | 0.9431 (0.9296–0.9566) | 91.26 (87.99–94.53) |
| | Gray scale + contrast 60% | 0.6863 | 0.8671 | 0.7662 | 0.9495 (0.9366–0.9623) | 91.61 (88.40–94.82) |
| | Gray scale + dilate and erode + contrast 20% | 0.683 | 0.8871 | 0.7718 | 0.9523 (0.9398–0.9647) | 91.61 (88.40–94.82) |
| | Gray scale + dilate and erode + contrast 60% | 0.6863 | 0.8671 | 0.7684 | 0.9486 (0.9356–0.9616) | |
Note: Each image preprocessing method was followed by fine-tuning a downstream ClinicalBERT. We highlight the highest AUROC and accuracy in bold.
Comparison of different sequence model architectures
| | Model architecture | Segment-level recall | Segment-level precision | Segment-level F1 | Document-level AUROC (95% CI) | Document-level accuracy (95% CI) | P-value |
|---|---|---|---|---|---|---|---|
| AHI | Sequence input | 0.7522 | 0.8723 | 0.8078 | 0.9703 (0.9606–0.9800) | 94.41 (91.74–97.07) | 1.0000 |
| | Sequence input + structured input | 0.7315 | 0.914 | 0.8126 | | 94.76 (92.17–97.34) | |
| SaO2 | Sequence input | 0.692 | 0.8761 | 0.7733 | 0.9430 (0.9295–0.9565) | 90.91 (87.58–94.24) | 0.8823 |
| | Sequence input + structured input | 0.683 | 0.8871 | 0.7718 | 0.9523 (0.9398–0.9647) | 91.61 (88.40–94.82) | |
Note: We highlight the highest AUROC and accuracy, and statistically significant P-values, in bold.