Pilar López-Úbeda, Manuel Carlos Díaz-Galiano, Teodoro Martín-Noguerol, Antonio Luna, L Alfonso Ureña-López, M Teresa Martín-Valdivia.
Abstract
COVID-19 diagnosis is usually based on PCR testing, complemented by radiological images, mainly chest Computed Tomography (CT), for the assessment of lung involvement by COVID-19. However, textual radiological reports also contain relevant information for determining the likelihood that radiological signs of COVID-19 lung involvement are present. Automatic COVID-19 detection systems based on Natural Language Processing (NLP) techniques could be of great help in supporting clinicians and detecting COVID-19 related disorders within radiological reports. In this paper we propose a text classification system based on the integration of different information sources. The system can be used to automatically predict whether or not a patient has radiological findings consistent with COVID-19 on the basis of chest CT radiological reports. To carry out our experiments we use 295 radiological reports from chest CT studies provided by the "HT médica" clinic. All of them come from radiological requests with suspected chest involvement by COVID-19. To train our text classification system we apply Machine Learning approaches and Named Entity Recognition. The system takes two sources of information as input: the text of the radiological report and COVID-19 related disorders extracted from SNOMED-CT. The best system is trained using SVM, and the baseline results achieve 85% accuracy in predicting lung involvement by COVID-19, which is already a competitive value that is difficult to improve on. Moreover, we apply mutual information in order to integrate the best quality information extracted from SNOMED-CT. In this way, we achieve around 90% accuracy, improving the baseline results by 5 points.
Keywords: COVID-19; Named entity recognition; Natural language processing; Radiological report; Text classification
Year: 2020 PMID: 33130435 PMCID: PMC7577869 DOI: 10.1016/j.compbiomed.2020.104066
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 4.589
Fig. 1. Example of a Spanish radiology report annotated with COVID-19. (See the English translation in Figure A.11 in Appendix A).
Dataset analysis.
| | COVID-19 | Non-COVID-19 |
|---|---|---|
| Number of documents | 158 | 137 |
| Vocabulary size | 2162 | 2017 |
| Avg. sentences per report | 24.58 | 23.32 |
| Avg. tokens per report | 161.11 | 147.97 |
Fig. 2. Example of disorder extraction system using SNOMED-CT terminology. Translation: Signs compatible with bilateral basal pneumonic focus associated with pleural effusion, low probability of COVID-19. Cardiomegaly. Signs of pulmonary hypertension. Discrete tracheomalacia.
Fig. 3. Example of SNOMED-CT concept representation vector using TF.
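The concept representation named in the Fig. 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that TF counts how often each SNOMED-CT concept is detected in a report and that TF-bin (used in the results tables) is a binary presence indicator. The small vocabulary reuses codes listed in this document's top-10 table, and `report_codes` is a hypothetical entity-recognizer output.

```python
from collections import Counter

# SNOMED-CT codes that appear in this document's top-10 table; using them as
# the vector vocabulary is an assumption made for illustration only.
VOCABULARY = ["407671000", "63531000122103", "95436008", "59282003"]


def tf_vector(detected_codes, vocabulary=VOCABULARY):
    """Count how often each vocabulary concept was detected in one report."""
    counts = Counter(detected_codes)
    return [counts[code] for code in vocabulary]


def tf_bin_vector(detected_codes, vocabulary=VOCABULARY):
    """Mark each vocabulary concept as present (1) or absent (0)."""
    present = set(detected_codes)
    return [1 if code in present else 0 for code in vocabulary]


# Hypothetical concepts detected in a single report (one code appears twice).
report_codes = ["63531000122103", "407671000", "63531000122103"]

print(tf_vector(report_codes))      # [1, 2, 0, 0]
print(tf_bin_vector(report_codes))  # [1, 1, 0, 0]
```

Under these assumptions, either vector would be concatenated with the document's text representation before classification.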
Top 10 high-scoring features of COVID-19 related disorders.
| # | SNOMED-CT code | Disorder |
|---|---|---|
| 1 | 407671000 | Bilateral pneumonia |
| 2 | 63531000122103 | Ground-glass opacities |
| 3 | 396286008 | Bilateral bronchopneumonia |
| 4 | 68409003 | Organized pneumonia |
| 5 | 63521000122101 | Patchy infiltrate |
| 6 | 95436008 | Lung consolidation |
| 7 | 101401000119103 | Pulmonary granuloma |
| 8 | 233935004 | Pulmonary thromboembolism |
| 9 | 59282003 | Pulmonary embolism |
| 10 | 63541000122106 | Interstitial pneumonia |
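The abstract states that mutual information (MI) is used to keep the best quality SNOMED-CT features. A feature's MI score against the class label can be sketched as below. This is a hedged illustration of the standard discrete MI formula, not the paper's code; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: 1 = COVID-19 report, 0 = non-COVID-19 report.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
ground_glass = [1, 1, 1, 0, 0, 0, 0, 0]  # mostly tracks the label
cardiomegaly = [1, 0, 1, 0, 1, 0, 1, 0]  # statistically independent of the label

print(mutual_information(ground_glass, labels))
print(mutual_information(cardiomegaly, labels))
```

A feature that is independent of the label scores zero, while one that tracks the label scores higher, which is the property that makes MI usable as a ranking criterion.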
Fig. 4. Machine learning system architecture for COVID-19 detection.
Summary of accuracy (Acc.) results obtained with the SVM algorithm for different document representations (rows) and SNOMED-CT representations (columns).
| Document representation | Baseline (%) | TF (%) | TF + MI (%) | TF-bin (%) | TF-bin + MI (%) |
|---|---|---|---|---|---|
| TF-IDF | 85.08 | 83.39 | 87.10 | 85.02 | |
| TF-IDF disable IDF | 84.90 | 81.69 | 87.39 | 82.71 | |
| TF-IDF n-grams | 84.88 | 81.02 | 86.42 | 81.36 | |
| TF-IDF n-grams disable IDF | 85.42 | 81.36 | 87.14 | 82.36 | |
Detailed results for the SVM algorithm, showing the performance improvement obtained with different document and feature representations.
| Doc. Repr. | SNOMED-CT Repr. | COVID-19 P (%) | COVID-19 R (%) | COVID-19 F1 (%) | Non-COVID-19 P (%) | Non-COVID-19 R (%) | Non-COVID-19 F1 (%) | Macro-avg P (%) | Macro-avg R (%) | Macro-avg F1 (%) | AUC (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | Baseline | 84.34 | 88.61 | 86.42 | 86.05 | 81.02 | 83.46 | 85.19 | 84.81 | 84.94 | 84.81 | 70.01 |
| TF-IDF | TF-bin + MI | 87.50 | 93.04 | 90.18 | 91.34 | 84.67 | 87.88 | 89.42 | 88.85 | 89.03 | 88.95 | 78.27 |
| TF-IDF disable IDF | Baseline | 85.19 | 87.34 | 86.25 | 84.96 | 82.48 | 83.70 | 85.07 | 84.91 | 84.98 | 84.91 | 69.99 |
| TF-IDF disable IDF | TF-bin + MI | 84.88 | 92.41 | 88.48 | 90.24 | 81.02 | 85.38 | 87.56 | 86.81 | 87.03 | 86.71 | 74.27 |
| TF-IDF n-grams | Baseline | 83.53 | 89.87 | 86.59 | 87.20 | 79.56 | 83.21 | 85.36 | 84.72 | 84.90 | 84.72 | 70.08 |
| TF-IDF n-grams | TF-bin + MI | 86.50 | 89.24 | 87.85 | 87.12 | 83.94 | 85.50 | 86.81 | 86.59 | 86.68 | 86.59 | 73.40 |
| TF-IDF n-grams disable IDF | Baseline | 84.43 | 89.24 | 86.77 | 86.72 | 81.02 | 83.77 | 85.57 | 85.13 | 85.27 | 85.13 | 70.70 |
| TF-IDF n-grams disable IDF | TF-bin + MI | 85.80 | 91.77 | 88.69 | 89.68 | 82.48 | 85.93 | 87.74 | 87.13 | 87.31 | 87.13 | 74.87 |
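To make the columns of this table concrete, the sketch below computes per-class precision, recall and F1, their macro averages, accuracy and the Matthews correlation coefficient (MCC) from a 2x2 confusion matrix. The counts passed in are an assumption: they are inferred from the reported recalls of the TF-IDF baseline row and the class sizes in the dataset table (158 and 137 reports), and are not listed in the source. The function name is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Per-class, macro-averaged and global metrics from a 2x2 confusion matrix."""

    def prf(hit, missed, false_alarm):
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + missed)
        return precision, recall, 2 * precision * recall / (precision + recall)

    covid = prf(tp, fn, fp)      # COVID-19 treated as the positive class
    non_covid = prf(tn, fp, fn)  # same counts seen from the non-COVID-19 side
    macro = tuple((a + b) / 2 for a, b in zip(covid, non_covid))
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"covid": covid, "non_covid": non_covid, "macro": macro,
            "accuracy": accuracy, "mcc": mcc}


# Assumed counts, inferred from the reported recalls and class sizes.
baseline = binary_metrics(tp=140, fn=18, tn=111, fp=26)
for name in ("covid", "non_covid", "macro"):
    print(name, [round(100 * value, 2) for value in baseline[name]])
print("accuracy", round(100 * baseline["accuracy"], 2))
print("mcc", round(100 * baseline["mcc"], 2))
```

With these assumed counts the sketch reproduces the precision, recall, F1 and MCC values of the TF-IDF baseline row, which suggests the reported figures come from predictions pooled over all 295 reports.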
Fig. 5. Cross-validation ROC curve using SNOMED-CT disorders detected.
Results using different neural networks and document representations for detecting suspicious cases of COVID-19.
| Model | Doc. Repr. | SNOMED-CT Repr. | P (%) | R (%) | F1 (%) | Acc. (%) | AUC (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|
| Basic ANN | TF-IDF n-grams disable IDF | Baseline | 79.57 | 81.05 | 79.55 | 81.49 | 81.05 | 63.23 |
| Basic ANN | TF-IDF n-grams disable IDF | TF-bin + MI | 85.66 | 85.23 | 85.28 | 85.54 | 85.24 | 70.90 |
| CNN | FastText SUC | Baseline | 84.17 | 84.52 | 83.66 | 84.03 | 84.52 | 68.67 |
| CNN | FastText SUC | TF-bin + MI | 85.17 | 84.96 | 84.47 | 84.75 | 84.96 | 70.11 |
| LSTM | FastText SUC | Baseline | 80.22 | 77.45 | 75.77 | 76.59 | 77.45 | 57.46 |
| LSTM | FastText SUC | TF-bin + MI | 82.66 | 81.01 | 79.11 | 79.69 | 81.01 | 63.51 |
| BiLSTM | FastText SUC | Baseline | 78.01 | 74.40 | 71.63 | 73.00 | 74.40 | 52.01 |
| BiLSTM | FastText SUC | TF-bin + MI | 78.92 | 76.45 | 75.41 | 76.62 | 76.45 | 55.09 |
Fig. 6. Results in terms of accuracy according to the selected percentile of MI.
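Selecting features by MI percentile, as the Fig. 6 caption describes, can be sketched as keeping only the top-ranked share of features. The helper below is a hypothetical illustration; the exact selection rule and tie-breaking used in the paper are not given in this record, and the scores are invented toy values.

```python
import math


def select_top_percentile(scores, percentile):
    """Return the feature names whose score ranks in the top `percentile` percent."""
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percentile / 100))
    return ranked[:keep]


# Hypothetical MI scores for five concept features (toy values, not from the paper).
mi_scores = {
    "ground-glass opacities": 0.21,
    "bilateral pneumonia": 0.30,
    "cardiomegaly": 0.01,
    "lung consolidation": 0.12,
    "tracheomalacia": 0.00,
}

print(select_top_percentile(mi_scores, 40))
```

Sweeping `percentile` and measuring accuracy at each setting would yield a curve of the kind the caption refers to.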
Fig. 7. The 20 most representative SNOMED-CT disorders detected to enhance COVID-19 text classification.
Fig. 8. Confusion matrix using SNOMED-CT disorders detected.
Fig. 9. False positive radiology report in Spanish for a patient with no COVID-19. (See the English translation in Figure A.12 in Appendix A).
Fig. 10. False negative radiology report in Spanish for a patient with COVID-19. (See the English translation in Figure A.13 in Appendix A).