Hao Liu1, Huijin Wang1, Jieyun Bai2,3, Yaosheng Lu2,3, Shun Long1.
Abstract
Background: Complete electronic health records (EHRs) are not often available, because information barriers are caused by differences in the level of informatization and the type of the EHR system. Therefore, we aimed to develop a deep learning system [deep learning system for structured recognition of text images from unstructured paper-based medical reports (DeepSSR)] for structured recognition of text images from unstructured paper-based medical reports (UPBMRs) to help physicians solve the data-sharing problem.Entities:
Keywords: Deep learning; paper-based medical reports; table detection; text detection; text recognition
Year: 2022 PMID: 35957704 PMCID: PMC9358495 DOI: 10.21037/atm-21-6672
Source DB: PubMed Journal: Ann Transl Med ISSN: 2305-5839
Figure 1. The pipeline of DeepSSR. DeepSSR, deep learning system for structured recognition of text images from unstructured paper-based medical reports.
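The captioned pipeline chains table detection, text detection, text recognition, and structuring. A minimal sketch of that composition, with every stage function a hypothetical placeholder rather than DeepSSR's actual interfaces:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def run_pipeline(image,
                 detect_tables: Callable,  # image -> list of table-region crops
                 detect_text: Callable,    # region -> list of text-line boxes
                 recognize: Callable,      # (region, box) -> recognized string
                 structure: Callable):     # list of (box, text) -> structured record
    """Chain the four DeepSSR-style stages over one report image."""
    records = []
    for table in detect_tables(image):
        lines = [(box, recognize(table, box)) for box in detect_text(table)]
        records.append(structure(lines))
    return records
```

Each stage can then be swapped independently (e.g., a different detector) without touching the rest of the chain.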
Figure 2. The architecture of YOLOv3-MobileNet. MobileNet, Efficient Convolutional Neural Network for Mobile Vision Application; SPP, spatial pyramid pooling.
Figure 3. The architecture of the text detector DB. DB, differentiable binarization network; N, the upsampling factor.
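For reference, the step that gives DB its name (Liao et al., AAAI 2020) replaces hard thresholding of the probability map with a scaled sigmoid of the probability minus a learned per-pixel threshold, keeping the binarization differentiable during training. A one-function sketch:

```python
import math

def db_binarize(prob: float, thresh: float, k: float = 50.0) -> float:
    """Approximate (differentiable) binarization from the DB paper:
    sigmoid of k * (probability - learned threshold). The factor k
    (50 in the paper) sharpens the transition toward a step function."""
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))
```

At inference time the output is effectively binary: pixels well above the threshold map to ~1, those well below to ~0.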
Figure 4. The architecture of the text recognizer CRNN. CRNN, convolutional recurrent neural network; Bi-LSTM, bi-directional long short-term memory network; N, the number of convolutions.
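CRNN-style recognizers are conventionally trained with CTC; assuming that here, the per-frame outputs of the Bi-LSTM are collapsed by the standard greedy (best-path) rule — merge consecutive repeats, then drop blanks. A minimal sketch of that collapse:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-timestep argmax label sequence using the standard
    CTC best-path rule: merge consecutive repeats, then remove blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

Note that a blank between two identical labels keeps them as two separate output symbols, which is what lets CTC emit doubled characters.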
Figure 5. Text box assignment. (A) 1-1 represents the first text box of the first line. (B) The blue line represents the range of the current column, and 1-1 represents the text box in the first row and first column.
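The row-column labeling in Figure 5 implies grouping detected boxes into lines before ordering them. A minimal sketch of that idea — the vertical-centre grouping and the tolerance are assumptions for illustration, not the paper's exact rule:

```python
def assign_rows(boxes, row_tol=10):
    """Group text boxes (x1, y1, x2, y2) into rows by vertical centre,
    then order each row left-to-right, so boxes can be labeled
    row-column style (1-1 = first box of the first line)."""
    rows = []  # each entry: [anchor_centre_y, [boxes]]
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        cy = (box[1] + box[3]) / 2
        if rows and abs(cy - rows[-1][0]) <= row_tol:
            rows[-1][1].append(box)   # same line as the previous box
        else:
            rows.append([cy, [box]])  # start a new line
    return [sorted(r, key=lambda b: b[0]) for _, r in rows]
```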
Figure 6. UPBMR images. (A) has a slope, (B) has a shadow, and (C) has a black border. The structures of (A), (B), (C), and (D) are all different. UPBMR, unstructured paper-based medical reports.
Comparison of table detection algorithms
| Detection algorithm | AP50 (%) | AP75 (%) | Test time per image (s) |
|---|---|---|---|
| Faster RCNN | 96.5 | 94.1 | 0.030 |
| YOLOv3 | 97.5 | 94.9 | 0.014 |
| YOLOv3-MobileNet | 97.8 | 94.7 | 0.006 |
AP50, average precision at Intersection-over-Union =0.5; AP75, average precision at Intersection-over-Union =0.75; Faster RCNN, Faster Region-based Convolutional Neural Network; YOLOv3, You Only Look Once, Version 3 (a real-time object detection algorithm); MobileNet, Efficient Convolutional Neural Network for Mobile Vision Application.
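The AP50/AP75 thresholds above are defined via Intersection-over-Union between a predicted and a ground-truth box, which can be computed directly for axis-aligned boxes:

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2),
    the overlap criterion behind the AP50 / AP75 thresholds."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

A detection counts as a true positive for AP50 when its IoU with a ground-truth box is at least 0.5, and for the stricter AP75 when it is at least 0.75.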
Comparison results of experiments
| Table detection model | Accuracy (%) | Test time per image (s) |
|---|---|---|
| Faster RCNN | 90.85 | 0.986 |
| YOLOv3 | 89.51 | 0.670 |
| YOLOv3-MobileNet | 91.10 | 0.668 |
Faster RCNN, Faster Region-based Convolutional Neural Network; YOLOv3, You Only Look Once, Version 3 (a real-time object detection algorithm); MobileNet, Efficient Convolutional Neural Network for Mobile Vision Application.
Figure 7. An example of a well-processed document. (A) The input image. (B) The character recognition result. (C) Structured data.
Figure 8. An example of a poorly processed document. (A) The input image. (B) The character recognition result. (C) Structured data.
Text detection results on the ICDAR 2015 dataset. Methods marked “+” are collected from Lu et al. 2021 (49); methods marked “−” from Liao et al. 2019 (23); methods marked “#” from Zhang et al. 2021 (50). Poly-FRCNN-3 is similar to that of Xue et al. 2019 (48).
| Methods | Precision | Recall | F1 score | Note |
|---|---|---|---|---|
| Seglink + VGG16 | 73.10 | 76.80 | 75.00 | + |
| WordSup | 77.03 | 79.33 | 78.16 | + |
| EAST + VGG16 | 80.05 | 72.80 | 76.40 | + |
| EAST + ResNet50 | 77.32 | 81.66 | 79.43 | + |
| EAST + PAVNET2x | 83.60 | 73.50 | 78.20 | + |
| EAST + PAVNET2x MS | 84.64 | 77.23 | 80.77 | + |
| STN-OCR (Saif et al.) | 78.53 | 65.20 | 71.86 | + |
| Poly-FRCNN-3 (Ch’ng et al.) | 80.00 | 66.00 | 73.00 | + |
| RFRN-4s (Deng et al.) | 85.10 | 76.80 | 80.80 | + |
| EAST (Lu et al.) | 85.59 | 76.94 | 81.03 | + |
| CTPN (Tian et al.) | 74.20 | 51.60 | 60.90 | − |
| EAST (Zhou et al.) | 83.60 | 73.50 | 78.20 | − |
| SSTD (He et al.) | 80.20 | 73.90 | 76.90 | − |
| WordSup (Hu et al.) | 79.30 | 77.00 | 78.20 | − |
| Corner (Lyu et al.) | 94.10 | 70.70 | 80.70 | − |
| TB (Liao, Shi, and Bai 2018) | 87.20 | 76.70 | 81.70 | − |
| RRD (Liao et al.) | 85.60 | 79.00 | 82.20 | − |
| MCN (Liu et al.) | 72.00 | 80.00 | 76.00 | − |
| TextSnake (Long et al.) | 84.90 | 80.40 | 82.60 | − |
| PSENet (Wang et al.) | 86.90 | 84.50 | 85.70 | − |
| SPCNet (Xie et al.) | 88.70 | 85.80 | 87.20 | − |
| LOMO (Zhang et al.) | 91.30 | 83.50 | 87.20 | − |
| ATRR (Wang et al.) | 89.20 | 86.00 | 87.60 | # |
| CRAFT (Baek et al.) | 89.80 | 84.30 | 86.90 | − |
| PAN (Wang et al.) | 84.00 | 81.90 | 82.90 | # |
| ContourNet (Wang et al.) | 87.60 | 86.10 | 86.90 | # |
| SAE (720) (Tian et al.) | 85.10 | 84.50 | 84.80 | − |
| GCN (Zhang et al.) | 88.50 | 84.70 | 86.60 | # |
| Texts as Lines (Wu et al.) | 81.70 | 77.10 | 79.40 | # |
| WSSTD (Zhang et al.) | 83.10 | 85.70 | 84.40 | # |
| SAE (990) (Tian et al.) | 88.30 | 85.00 | 86.60 | − |
| Ours (DB) | 91.80 | 83.20 | 87.30 | − |
ICDAR, International Conference on Document Analysis and Recognition; Seglink, Segment Linking; VGG16, Visual Geometry Group Network; WordSup, Exploiting Word Annotations for Character based Text Detection; EAST, Efficient and Accurate Scene Text Detector; ResNet50, Residual Neural Network; PAVNET, Deep but Lightweight Neural Network; STN-OCR, Spatial Transformer Network; Poly-FRCNN, Polygon-Faster-Region-based Convolutional Neural Network; RFRN, Recurrent Feature Refinement Network; CTPN, Connectionist Text Proposal Network; SSTD, Single Shot Text Detector; Corner, scene text detector that localizes text by corner point detection and position-sensitive segmentation; TB, TextBoxes++; RRD, Rotation-sensitive Regression Detector; MCN, Markov Clustering Network; TextSnake, A Flexible Representation for Detecting Text of Arbitrary Shapes; PSENet, Progressive Scale Expansion Network; SPCNet, Scale Position Correlation Network; LOMO, Look More Than Once; ATRR, Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation; CRAFT, Character Region Awareness for Text Detection; PAN, Pixel Aggregation Network; SAE, Shape-Aware Embedding; GCN, Graph Convolutional Network; WSSTD, Weakly Supervised Scene Text Detection; DB, differentiable binarization network.
Text recognition results on a dataset of 357 medical laboratory report images. The methods are collected from Xue et al. 2019 (48).
| Methods | Accuracy (%) | mED | Size (MB) |
|---|---|---|---|
| Attention OCR (Brzeski et al.) | 83.8 | 2.51 | 221.5 |
| Xue et al. | 95.8 | 3.29 | 42.9 |
| Ours (CRNN) | 90.6 | 3.79 | 34.0 |
mED, mean edit distance; CRNN, convolutional recurrent neural network; MB, megabyte; OCR, optical character recognition.
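The mED metric above averages the Levenshtein edit distance between each predicted string and its ground truth. A standard dynamic-programming implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def mean_edit_distance(pairs):
    """Average edit distance over (prediction, ground_truth) pairs."""
    return sum(edit_distance(p, g) for p, g in pairs) / len(pairs)
```

Lower mED is better: a model can have lower exact-match accuracy yet a lower mED if its errors are single-character slips rather than whole-field failures, which is why the table reports both.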