Claudio Crema, Giuseppe Attardi, Daniele Sartiano, Alberto Redolfi.
Abstract
Natural language processing (NLP) is rapidly becoming an important topic in the medical community. The ability to automatically analyze any type of medical document could be the key factor to fully exploit the data it contains. Cutting-edge artificial intelligence (AI) architectures, particularly machine learning and deep learning, have begun to be applied to this topic and have yielded promising results. Our literature search retrieved 1,024 papers that used NLP technology in neuroscience and psychiatry from 2010 to early 2022. After a selection process, 115 papers were evaluated. Each publication was classified into one of three categories: information extraction, classification, and data inference. Automated understanding of clinical reports in electronic health records has the potential to improve healthcare delivery. Overall, the performance of NLP applications is high, with average F1-scores and AUCs above 85%. We also derived a composite measure in the form of Z-scores to compare, on a common scale, the performance of NLP models and of their different classes as a whole. This unbiased comparison found no statistically significant differences. The main limitations are a strong asymmetry between English and non-English models, the difficulty of obtaining high-quality annotated data, and training biases that reduce generalizability. This review suggests that NLP can help clinicians gain insights from medical reports, clinical research forms, and other documents, making it an effective tool for improving the quality of healthcare services.
Keywords: deep learning; electronic health record; information extraction; natural language processing; neuroscience; psychiatry
Year: 2022 PMID: 36186874 PMCID: PMC9515453 DOI: 10.3389/fpsyt.2022.946387
Source DB: PubMed Journal: Front Psychiatry ISSN: 1664-0640 Impact factor: 5.435
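The abstract mentions a composite measure in the form of Z-scores used to compare NLP models on a common scale. A minimal sketch of such a standardization is shown below; the function name is ours, and the input values are illustrative F1-scores taken from the tables later in this record (the review's actual compositing procedure may differ):

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of performance scores to Z-scores
    (zero mean, unit sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Illustrative F1-scores (%) drawn from the information-extraction tables
f1 = [95.3, 86.1, 83.0, 91.4, 94.0, 69.0]
z = z_scores(f1)
```

Once every model's score is expressed as a Z-score, models evaluated with different metrics (e.g., F1 vs. AUC) can be pooled and compared as a whole, which is what allows the review's cross-class comparison.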
FIGURE 1. Number of scientific papers with the keyword “Natural Language Processing” published on PubMed from 1990 to 2021, normalized to the total number of papers published on PubMed.
Qualitative evaluation of BERT models, generic NLP tools, and medical NLP tools.
| Category | Tool | Availability of materials (e.g., pre-trained models, ontologies, corpora) | Possibility of customization | Basic NLP tasks (e.g., tokenization, stemming, POS tagging) | Advanced NLP tasks (e.g., QABot, summarization) | Usability of APIs for the analysis of medical data | Community activeness | Scalability/User-friendliness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT models | BioBERT | Good | Good | † | † | Excellent | Average | Average |
| | Clinical BioBERT | Good | Good | † | † | Excellent | Poor | Poor |
| | Umls-BERT | Good | Good | † | † | Excellent | Poor | Poor |
| NLP generic tools | NLTK | Good | Excellent | Excellent | None | None | Excellent | Excellent |
| | spaCy | Excellent | Good | Excellent | Good | Average | Excellent | Excellent |
| | HF | Excellent | Excellent | Excellent | Excellent | Average | Excellent | Excellent |
| NLP medical tools | cTAKES | Good | Good | Excellent | None | Excellent | Good | Average |
| | Bio-YODIE | Average | Average | Good | None | Excellent | Poor | Poor |
| | MetaMap | Average | Average | Good | None | Excellent | Average | Average |
| | MedCAT | Average | Average | Good | None | Excellent | Excellent | Good |
The rating scale, from worst to best, is: None, Poor, Average, Good, Excellent. NLP, natural language processing; BERT, bidirectional encoder representations from transformers; BioBERT, bidirectional encoder representations from transformers for biomedical text mining; NLTK, natural language toolkit; HF, Hugging Face; cTAKES, clinical text analysis and knowledge extraction system; Bio-YODIE, biomedical-yet another open data information extraction system; MedCAT, medical concept annotation toolkit. ¹Model is free to download. †The ability of BioBERT, Clinical BioBERT, and Umls-BERT to perform NLP tasks cannot be evaluated a priori because they require fine-tuning first.
Natural language processing (NLP) tools and websites.
| NLP tool | Brief description | Web-site |
| --- | --- | --- |
| NLTK | Open-source library for low-level NLP operations (stemming, part-of-speech tagging) | |
| spaCy | Open-source library for advanced NLP in Python; supports 64+ languages and implements DL models | |
| Hugging Face | Open-source library for NLP in Python; features thousands of pre-trained models and datasets in several languages | |
| cTAKES | Open-source NLP tool for IE from unstructured clinical EHR text | |
| Bio-YODIE | Open-source NLP tool providing a biomedical named-entity-linking pipeline | |
| MetaMap | Highly configurable program to map biomedical text to the UMLS Metathesaurus | |
| MedCAT | Open-source NLP tool for IE from EHRs, linking extracted concepts to biomedical ontologies such as SNOMED-CT and UMLS | |
NLP, natural language processing; DL, deep learning; IE, information extraction; EHR, electronic health records; UMLS, unified medical language system; SNOMED-CT, systemized nomenclature of medicine-clinical terms.
FIGURE 2. PRISMA chart.
Performance measures.
| Measure | Synonym(s) | Formula |
| --- | --- | --- |
| Precision | Positive predictive value | TP/(TP + FP) |
| Recall | Sensitivity, true positive rate (TPR) | TP/(TP + FN) |
| F1-score | F-measure, F-score | 2 · (P · R)/(P + R) |
| Accuracy | – | (TP + TN)/(TP + FP + TN + FN) |
| False positive rate | False alarm ratio | FP/(FP + TN) |
| Area under ROC curve | AUC, AUROC, C-statistic | Trade-off between TPR and FPR. AUC = 1: perfect classifier; AUC = 0.5: random classifier |
TP, true positive; TN, true negative; FP, false positive; FN, false negative; P, precision; R, recall; TPR, true positive rate; FPR, false positive rate.
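The formulas in the table above can be computed directly from a binary confusion matrix. The sketch below is ours (the function name and the example counts are illustrative, not taken from any reviewed paper):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the measures from the table above for a binary confusion matrix."""
    precision = tp / (tp + fp)               # positive predictive value
    recall = tp / (tp + fn)                  # sensitivity / TPR
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)                     # false alarm ratio
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy, "fpr": fpr}

# Illustrative example: 80 TP, 20 FP, 90 TN, 10 FN
m = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```

Note that the F1-score is the harmonic mean of precision and recall, so it penalizes a model that trades one for the other, which is why the review reports it alongside AUC.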
Comparison of NLP performance for information extraction papers via traditional ML models.
| Author | Task | NLP model | Embeddings and Corpus | F1-score [%] |
| --- | --- | --- | --- | --- |
| Savova et al. | Development of cTAKES, tool for IE from EHRs | Rule-based | PTB corpus | 95.3 |
| | | | GENIA corpus | 93.2 |
| | | | Mayo Clinic EHR: ∼100 k tokens | 92.4 |
| Fonferko-Shadrach et al. | IE on epilepsy clinical texts | Rule-based | 200 unstructured clinic letters | 86.1 |
NLP, natural language processing; cTAKES, clinical text analysis and knowledge extraction system; IE, information extraction; EHR, electronic health record; PTB, Penn Treebank.
Comparison of NLP performance for information extraction papers via DL models.
| Author | Task | NLP model | Embeddings and Corpus | F1-score [%] |
| --- | --- | --- | --- | --- |
| Lopes et al. | NER pipeline for Portuguese EHR | LASSO | i2b2 corpus 2010: ∼1,500 EHR | 83 |
| Weng et al. | IE pipeline built on cTAKES using annotated texts | cTAKES + DL | iDASH corpus: 431 EHR | 84.5 |
| | | | Massachusetts General Hospital corpus: ∼90 k EHR | 87 |
| Yu et al. | NER pipeline using unannotated texts | BERT-based + CNN-based | BioWordVec: ∼2.3 M words | 91.4 |
| Kraljevic et al. | Development of MedCAT, tool for IE from EHRs | BERT-based | Self-supervised training: ∼8.8 B words | 94 |
| Vaci et al. | Extraction of data on individuals with depression from EHR | LSTM | 1.8 M EHRs from UK-CRIS database | 69 |
NLP, natural language processing; NER, named entity recognition; LASSO, L1-regularized logistic regression; i2b2, informatics for integrating biology and the bedside; IE, information extraction; cTAKES, clinical text analysis and knowledge extraction system; DL, deep learning; BERT, bidirectional encoder representations from transformers; CNN, convolutional neural network; MedCAT, medical concept annotation toolkit; EHR, electronic health record; PTB, Penn Treebank; LSTM, long short-term memory.
Comparison of NLP performance for classification of texts produced by patients.
| Author | Classification | NLP model | Embeddings and Corpus | AUC |
| --- | --- | --- | --- | --- |
| Takano et al. | Study 1: specific vs. non-specific memory | SVM | Study 1: ∼12,400 EHR | 0.92 |
| | Study 2: novel memories | | Studies 2/3: ∼8,500 EHR | 0.89 |
| Clark et al. | MCI vs. AD | RF + SVM + naïve Bayes + multilayer perceptrons | Fluency scores from ∼150 patients | 0.872 |
| Wang et al. | COVID-19 Twitter data analysis | RF | 50 M tweets | 0.966 |
| Yu et al. | Negative life events into categories | SVM | Unlabeled corpus: 5,000 forum posts | 0.897 |
NLP, natural language processing; AUC, area under curve; SVM, support vector machine; RF, random forest; EHR, electronic health record; MCI, mild cognitive impairment; AD, Alzheimer’s Disease. *Only best classifier performance is reported.
Comparison of NLP performance for classification from clinicians’ notes.
| Author | Task | NLP model | Embeddings and Corpus | AUC |
| --- | --- | --- | --- | --- |
| Xia et al. | Identify patients with complex neurological disorder | cTAKES-based | Train corpus: ∼600 clinical notes | 0.96 |
| Wissel et al. | Identify candidates for epilepsy surgery | Multiple linear regression | Train corpus: ∼1,100 clinical notes | 0.9 |
| Heo et al. | Prediction of stroke outcomes | CNN + LSTM + multilayer perceptron | Train corpus: ∼1,300 clinical notes | 0.81 |
| Lineback et al. | Prediction of 30-day readmission after stroke | Logistic regression + naïve Bayes + SVM + RF + gradient boosting + XGBoost | Train corpus: ∼2,300 clinical notes | 0.64 |
| Lin et al. | Identify UAU in hospitalized patients | Logistic regression | Train corpus: ∼58 k clinical notes | 0.91 |
| Bacchi et al. | Prediction of cause of TIA-like presentations | RNN + CNN | Corpus: 2,201 clinical notes (∼150 words each) | 0.88 |
NLP, natural language processing; AUC, area under curve; cTAKES, clinical text analysis and knowledge extraction system; CNN, convolutional neural network; LSTM, long short-term memory; SVM, support vector machine; RF, random forest; UAU, unhealthy alcohol use; RNN, recurrent neural network; TIA, transient ischemic attack; ISP-D, internet-based self-assessment program for depression.
Comparison of NLP performance for predicting patient disposition.
| Author | Prediction | NLP model | Embeddings and Corpus | AUC |
| --- | --- | --- | --- | --- |
| Tahayori et al. | Patients’ disposition from triage notes | BERT-based | ∼250 k EHR | 0.88 |
| Ahuja et al. | Relapse risk for multiple sclerosis | LASSO | Train corpus: ∼1,400 clinical notes | 0.71 |
| Zhang et al. | Clinical risk | CNN + LSTM | 2.5 M patients’ EHR | 0.85 |
| Klang et al. | Neuroscience ICU admission | XGBoost | ∼412 k patients’ EHR | 0.93 |
NLP, natural language processing; AUC, area under curve; BERT, bidirectional encoder representations from transformers; LASSO, L1-regularized logistic regression; CNN, convolutional neural network; LSTM, long short-term memory; EHR, electronic health record; ICU, intensive care unit.
Comparison of NLP performance for identification of the best candidates for a specific treatment.
| Author | Identification | NLP model | Embeddings and Corpus | F1-score [%] |
| --- | --- | --- | --- | --- |
| Cohen et al. | Candidates for surgery for drug-resistant epilepsy | Naïve Bayes | ∼6,300 patients’ EHR | 82 |
| Sung et al. | Intravenous thrombolytic therapy candidates | MetaMap-based | 234 clinical notes | 98.6 |
NLP, natural language processing; EHR, electronic health record.
Comparison of NLP performance for analysis of specific pathologies.
| Author | Analyzed pathology | NLP model | Embeddings and Corpus | F1-score [%] |
| --- | --- | --- | --- | --- |
| Castro et al. | Cerebral aneurysms | cTAKES-based + LASSO | Train corpus: ∼300 clinical notes (manually annotated) | 84 |
| Katsuki et al. | Primary headache | DL-based | Test corpus: ∼850 clinical notes | 63.5 |
NLP, natural language processing; cTAKES, clinical text analysis and knowledge extraction system; LASSO, L1-regularized logistic regression; DL, deep learning.
FIGURE 3. Performance box plot of the NLP methods, showing medians, interquartile ranges, and outliers. IE, information extraction; ML, traditional machine learning; DL, deep learning; Patients, patients’ texts; Clinical, clinical notes; Disp., patients’ disposition; Cand., best candidates; Pathol., specific pathologies.