Literature DB >> 35543808

Construction of an Assisted Model Based on Natural Language Processing for Automatic Early Diagnosis of Autoimmune Encephalitis.

Yunsong Zhao¹, Bin Ren², Wenjin Yu¹, Haijun Zhang¹, Di Zhao¹, Junchao Lv¹, Zhen Xie³, Kun Jiang², Lei Shang⁴, Han Yao⁵, Yongyong Xu⁶, Gang Zhao^7,8.

Abstract

INTRODUCTION: Early diagnosis and etiological treatment can effectively improve the prognosis of patients with autoimmune encephalitis (AE). However, anti-neuronal antibody tests which provide the definitive diagnosis require time and are not always abnormal. By using natural language processing (NLP) technology, our study proposes an assisted diagnostic method for early clinical diagnosis of AE and compares its sensitivity with that of previously established criteria.
METHODS: Our model is based on the text classification model trained by the history of present illness (HPI) in electronic medical records (EMRs) that present a definite pathological diagnosis of AE or infectious encephalitis (IE). The definitive diagnosis of IE was based on the results of traditional etiological examinations. The definitive diagnosis of AE was based on the results of neuronal antibodies, and the diagnostic criteria of definite autoimmune limbic encephalitis proposed by Graus et al. used as the reference standard for antibody-negative AE. First, we automatically recognized and extracted symptoms for all HPI texts in EMRs by training a dataset of 552 cases. Second, four text classification models trained by a dataset of 199 cases were established for differential diagnosis of AE and IE based on a post-structuring text dataset of every HPI, which was completed using symptoms in English language after the process of normalization of synonyms. The optimal model was identified by evaluating and comparing the performance of the four models. Finally, combined with three typical symptoms and the results of standard paraclinical tests such as cerebrospinal fluid (CSF), magnetic resonance imaging (MRI), or electroencephalogram (EEG) proposed from Graus criteria, an assisted early diagnostic model for AE was established on the basis of the text classification model with the best performance.
RESULTS: The comparison results for the four models applied to the independent testing dataset showed the naïve Bayesian classifier with bag of words achieved the best performance, with an area under the receiver operating characteristic curve of 0.85, accuracy of 84.5% (95% confidence interval [CI] 74.0-92.0%), sensitivity of 86.7% (95% CI 69.3-96.2%), and specificity of 82.9% (95% CI 67.9-92.8%), respectively. Compared with the diagnostic criteria proposed previously, the early diagnostic sensitivity for possible AE using the assisted diagnostic model based on the independent testing dataset was improved from 73.3% (95% CI 54.1-87.7%) to 86.7% (95% CI 69.3-96.2%).
CONCLUSIONS: The assisted diagnostic model could effectively increase the early diagnostic sensitivity for AE compared to previous diagnostic criteria, assist physicians in establishing the diagnosis of AE automatically after inputting the HPI and the results of standard paraclinical tests according to their narrative habits for describing symptoms, avoiding misdiagnosis and allowing for prompt initiation of specific treatment.

Entities: Chemical

Keywords: Autoimmune encephalitis; Computer-assisted diagnosis; Early diagnosis; Electronic medical records; Natural language processing

Year: 2022 PMID： 35543808 PMCID： PMC9338198 DOI： 10.1007/s40120-022-00355-7

Source DB: PubMed Journal: Neurol Ther ISSN： 2193-6536

Key Summary Points

Introduction

Encephalitis is an inflammatory disease that damages the brain parenchyma and has high morbidity and mortality [1, 2]. According to the etiology, encephalitis can generally be divided into two types: autoimmune encephalitis (AE) and infectious encephalitis (IE). Previous studies have shown that early diagnosis and etiological treatment can effectively improve the prognosis of patients with encephalitis [3-9]. AE refers to a large class of encephalitis associated with anti-neuronal antibodies [10]. At present, there are more than 30 anti-neuronal antibodies related to AE, which can be divided into cell surface antigen antibodies and intracellular antigen antibodies. Among all the antibodies, the most common anti-neuronal antibody is anti-N-methyl-d-aspartate receptor (NMDAR) antibody [11]. Although the clinical presentations of AE are varied, some AE can present with typical manifestations, such as psychiatric symptoms, seizures, short-term memory deficits, decreased level of consciousness, and dyskinesias in anti-NMDAR encephalitis [12]. Currently, a definitive diagnosis of AE is largely dependent on positive test results for anti-neuronal antibodies in cerebrospinal fluid and serum. However, anti-neuronal antibody tests have the following limitations for early diagnosis of AE: (1) many medical institutions do not have the ability to conduct anti-neuronal antibody tests and therefore may delay the optimal treatment time for patients. (2) Some AE cases are negative for anti-neuronal antibodies; thus, even if the results of the anti-neuronal antibody tests are negative, the possibility that encephalitis is immune-mediated cannot be ruled out [12]. The greatest challenge in early diagnosis of AE is differential diagnosis of encephalitis of other etiologies. IE is one of the most common differential diagnoses of AE. Difficulties in the differential diagnosis of AE and IE are mainly reflected in the following aspects: (1) AE often has core symptoms similar to IE; and (2) approximately 50% of patients with AE are negative for anti-neuronal antibodies [13]. To avoid delaying the best opportunity for therapy, Graus et al. [12] proposed criteria for early clinical diagnosis of AE based on typical clinical symptoms and the results of standard paraclinical tests. The clinical symptoms and the standard paraclinical test results, for example, cerebrospinal fluid (CSF), magnetic resonance imaging (MRI), or electroencephalogram (EEG), can be obtained at the early stage of admission. The standard paraclinical tests that are conventional and accessible to most clinicians could be checked as soon as possible. Graus et al. believe that early clinical diagnosis of AE can be obtained and treatment can be carried out quickly using their diagnostic approach. At the same time, the diagnosis and treatment can be adjusted after the results of anti-neuronal antibody testing are clear. As a diagnosis approach, the flowchart proposed from Graus criteria is meant to be used as a whole. One point worth emphasizing is the first criterion for possible autoimmune encephalitis (PAE): the first diagnostic criterion is to screen as many AE cases as possible; thus, high sensitivity is important for PAE criteria to reduce the omission diagnostic rate. Only when the sensitivity of the PAE diagnostic criteria reaches 100% can the subsequent diagnostic process for other subgroups of AE be correctly ensured; otherwise, the omission diagnostic rate may increase [14]. However, the sensitivity of these criteria for PAE has been controversial, ranging from 57.6% to 100.0% [14-17]. The sensitivity of previous diagnostic criteria proposed by Graus et al. [12] is relatively good. On the one hand, the limited number of symptoms (six) that can be incorporated within a clinically usable scale and the simple logic diagnosis rules increase the operability and convenience of clinical practice. On the other hand, the diagnostic criteria may ignore the opportunity of utilizing other symptoms and weighting those symptoms individually. In fact, a great number of atypical initial symptoms can also occur in AE [18-20]. At the same time, there are some negative symptoms in the history of present illness (HPI) in electronic medical records (EMRs) that may aid in differential diagnosis. On the basis of diagnostic criteria proposed previously, taking full advantage of enhancing the initial symptom selection (these atypical positive symptoms and negative symptoms in the HPI described in EMRs) may make some progress in clinical diagnosis of AE via natural language processing (NLP) technology. However, because of the different habits of physicians in writing EMRs, there are a variety of synonyms for the same symptom in raw texts. If we want to analyze the symptoms, we should recognize and normalize the symptoms that have synonyms. For a long time, comprehensive analysis of the characteristics of AE symptoms through traditional labor-dependent methods has represented a great challenge because the symptoms are complex and diverse. NLP is a subfield of linguistics, computer science, and artificial intelligence. The technology can accurately process, extract, and analyze large amounts of information from natural language data. Clinical named entity recognition (CNER) [21] and text classification [22] are both common missions in NLP technology, which could be applied to the healthcare domain for automatically extracting complex and diverse clinical symptoms or classifying the clinical texts from a large corpus of EMRs, using smaller subset of cases for training models. The bidirectional long short-term memory conditional random field (BiLSTM-CRF) model is one of the most popular models to achieve the CNER mission [23, 24]. Equally important, following the arrival and development of the bidirectional encoder representations from transformers (BERT) [25, 26], CNER has started to adopt the BERT-based approach, which brings the advantage of allowing pretrained models [27]. Text classification usually needs to achieve feature selection at first. Feature selection is the process of selecting the features which contribute most to the classification of a given text. Text classification methods and feature selection methods are rapidly growing [28]. To conclude, with the rapid development of NLP technology [29-31], we may have the opportunity to recognize and extract medical terminology, especially for symptoms described in the HPI in EMRs; perform a comprehensive analysis of clinical symptoms; and develop an assisted diagnosis method for PAE. Our study aims to develop an assisted model for early clinical diagnosis of AE based on NLP technology and to compare its sensitivity with that of previously established criteria from Graus et al. [12].

Methods

Study Population and Design

Our data included 2514 Chinese EMRs with a diagnosis of central nervous system (CNS) infectious or inflammatory diseases, which were identified using the International Classification of Diseases-Tenth Revision (ICD-10) codes from the medical records database of the neurology department of Xijing Hospital during a period of 10 years (October 2010–December 2020). The ICD-10 codes are provided in Table S1 in the supplementary material. De-identification of all EMRs was conducted for every known identifier type (ID, name, gender, age, address, etc.). Standard paraclinical tests (e.g., CSF, MRI, or EEG studies) and other routine laboratory data were obtained by review of the EMRs. The Xijing Hospital Ethics Committee approved this study (KY20192071-F-1). Our study constructed two datasets by using 2514 Chinese EMRs: the first to train CNER as part of the NLP (552 cases) analysis pipeline, and the second (199 cases) to train and test the diagnosis classification model between AE and IE. The two datasets were filtered respectively from 2514 cases, so there was overlap between the CNER dataset and the text classification model dataset. To eliminate subjective interference and evaluate the classification model more objectively, the EMR samples included in the text classification dataset were AE or IE cases with definite evidence of etiology. The definitive diagnosis of IE was based on the results of traditional etiological examinations, which included microscopic staining, pathogenic microbiological analysis, and PCR. The definitive diagnosis of AE was based on the exclusion of other definite causes (e.g., IE). Then, all patients underwent an extensive search for neuronal antibodies and they required positive antibodies, either in CSF or serum, using commercial cell-based assay kits, according to published guidelines [32, 33]. Finally, because the possibility that autoantibodies may not be detected in definite autoimmune limbic encephalitis, the clinical diagnostic criteria proposed by Graus et al. only applied to definite autoimmune limbic encephalitis in this study [12]. A synopsis of the overall NLP analysis pipeline is shown in Fig. 1. First, we automatically extracted symptoms (i.e., CNER) from all HPI texts by training the BiLSTM-CRF model on the basis of a dataset of 552 cases with a single diagnosis of central nervous system (CNS) infection or AE. Second, post-structuring of the HPI in EMRs for all cases was implemented after normalizing symptom terminologies in English language. Third, four text classification models trained by a dataset of 199 cases were established for differential diagnosis of AE and IE based on a post-structuring text dataset of every HPI. The optimal model was identified by evaluating and comparing the performance of the four models. Finally, combined with three typical symptoms and the results of standard paraclinical tests (e.g., CSF, MRI, or EEG studies) proposed from Graus criteria, an assisted early diagnostic model for AE was established on the basis of the text classification model with the best performance.

Fig. 1

Synopsis of the overall NLP analysis pipeline. NLP natural language processing, EMRs electronic medical records, CNS central nervous system, AE autoimmune encephalitis, BiLSTM-CRF bidirectional long short-term memory conditional random field, HPI history of present illness, SNOMED-CT Systematized Nomenclature of Medicine-Clinical Terms, MeSH Medical Subject Headings, IE infectious encephalitis

Data Preprocessing for CNER

We filtered all 552 patient records identified with a single diagnosis of CNS infection or AE from 2514 EMRs with CNS infectious or inflammatory diseases. While all 552 cases were used at different stages of the CNER development (training word embedding, identifying the symptoms), a random subset of 140 cases (25% of 552 patient records) were selected for manual word annotations to assist with training. It has been proven that Chinese medical text segmentation is very important for producing high-quality word embedding and promoting downstream information extraction applications [34]. Therefore, we used the Jieba Chinese Word Segmentation Library supported by Python programming language to segment the HPI in EMRs. In our CNER approach, annotated data were represented in the BMESO format, in which each word was assigned to one of five classes: B, beginning of an entity; M, middle of an entity; E, ending of an entity; S, single word for an entity; O, outside of an entity. Therefore, the CNER problem became a classification problem requiring assignment of one of the five class tags to each word. The annotation guidelines were similar to those in Yang et al.’s study [35]. One main difference was that we only manually annotated symptoms in the HPI in EMRs. Thus, we only had one type of entity in this study. Another difference was that negative symptoms were recognized as a whole entity [36]. The statistics of the training subset of the HPI used for CNER is shown in Table S2 in the supplementary material. There were 26,655 words and 1055 symptoms in the HPI in the training subset, and B, M, E, S, and O were annotated 1006, 479, 1006, 49, and 24,115 times, respectively. We explained the annotation guidelines in this study to three specialized neurology physicians. The final annotation results were identified by the following rules: Word boundaries of symptoms were marked using B, M, E, S, and O tags by manual annotations performed by two physicians. Examples of some annotated sequences are provided in Table S3 in the supplementary material. After the manual annotation by the two physicians was completed, the inter-annotator agreement was calculated with Kohen’s kappa and was 0.87. Then, the manual annotation results provided by the two physicians were compared. When the results were the same, the same annotation result was the consensus result. In the case of different annotation results by the two physicians, the third physician made a final interpretation of the two annotation results, and one was selected as the consensus result.

Training and Evaluation for CNER Using the BiLSTM-CRF Model

To improve the quality of training word embedding, 552 HPI texts were used as a corpus for training word embedding. We used 140 HPI texts for the BiLSTM-CRF model. The Word2Vec tool is used to train word embedding using a continuous bag of words (CBOW) model or skip-gram model [37]. Various empirical evidence shows that the CBOW model performs better than the skip-gram model for databases with only hundreds of thousands of words [38, 39]. Therefore, the CBOW model was adopted in this study to achieve word embedding. The dimension of the word embedding vector was set to 128 dimensions, and the other hyper-parameters are provided in Table S4 in the supplementary material. The dataset, which contained 140 HPI texts, was split into two mutually exclusive subsets: training (80% of dataset) and testing (20% of dataset) subsets [40]. We divided every HPI with full stops into sentences. Each sentence of every HPI was used as an input sequence in the following model. The implementation of CNER in this study is based on the BiLSTM-CRF model, with word embedding as the input of sequences [23, 24]. To optimize the hyper-parameters of the BiLSTM-CRF model, a tenfold cross-validation method was adopted using the training subset. When the optimal hyper-parameters of the BiLSTM-CRF model were determined, the entire training subset was trained as the final BiLSTM-CRF model according to the optimal hyper-parameters. The hyper-parameters of the BiLSTM-CRF model were fine-tuned, training the model with the training subset and evaluating it with the F measure. The results for each configuration of the parameters (batch size, number of epochs, dropout, and learning rate with the Adam algorithm [41]) in the hyper-parameter fine-tuning stage are shown in Fig. S1 in the supplementary material. In the case of the embedding created from words by Word2Vec, the best hyper-parameters were as follow: 2 batches, 40 epochs, 0.2 dropout, and 0.001 learning rate. The other hyper-parameters are provided in Table S4 in the supplementary material. The performance of the final BiLSTM-CRF model was measured by precision, recall, and F measure for all entities using the independent testing subset [42]. The evaluation program provided two sets of measures—exact match and inexact match—where exact match means that an entity is correctly predicted if, and only if, the starting and ending offsets are exactly the same as those in the consensus result; the inexact match means that an entity is correctly predicted if it overlaps with any entity in the consensus result [43, 44]. Table S5 in the supplementary material shows the statistics of the independent testing subset of the HPI used in this study. There were 6425 words and 296 symptoms in the HPI in EMRs. Table S5 also shows the performance of the BiLSTM-CRF model applied to the independent testing subset. The numbers in columns 4–6 are precision, recall, and F measure values for all entities using the exact match or inexact match measures.

Post-Structuring of the HPI

The final BiLSTM-CRF model was used to identify and extract the symptoms in 552 HPI texts presented in EMRs. All the symptoms were classified into different groups according to whether they had synonyms. Then, we obtained normalized symptom terminologies by establishing a mapping relationship between the categorized symptoms and the international standard English medical terms set, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and Medical Subject Headings (MeSH). The process of normalization of symptoms was conducted using Python regular expressions according to normalized symptom terminologies in English language. Then, the post-structuring of the HPI was completed using the normalized symptom terminologies; in other words, we rebuilt the HPI texts using the normalized symptom terminologies, which included typical symptoms, atypical symptoms, and negative symptoms.

Training and Evaluation for Text Classification Model

According to the definite diagnostic criteria for AE and IE, we obtained the text classification dataset of 199 cases by reviewing EMRs. The dataset included 83 AE cases and 116 IE cases. There are several autoimmune CNS diseases (primary CNS angiitis, Rasmussen’s encephalitis) that are often considered in the differential diagnosis of autoimmune encephalitis because of the clinical features. These diseases were considered immune-related diseases and excluded [11, 12]. The overall data were randomly divided into a 65% training dataset and a 35% testing dataset. The training dataset included 53 AE cases and 75 IE cases, and the independent testing dataset included 30 AE cases and 41 IE cases. We did not directly perform text classification on raw texts because of the impact of the EMR template. The EMR template makes different EMRs have the same words, which are not related to symptoms of AE. However, the large numbers of same words from EMR template would cover up the role of symptoms of AE and affect the result of text classification. Meanwhile, as a result of different habits in writing EMRs, there were a variety of synonyms for the same symptom in raw texts. Therefore, our text classification model was based on rebuilding the HPI using normalized symptom terminologies rather than the raw HPI texts presented in EMRs. Our text classification models consisted of four models: two different classifiers with two different text feature selection methods respectively. Four text classification models, namely, naïve Bayesian classifier (NBC) and support vector machine (SVM), using two different text feature selection methods, namely, bag of words (BoW) and term frequency–inverse document frequency (TF-IDF), were established to distinguish AE and IE on the basis of structured texts of the HPI, which were rebuilt using normalized symptom terminologies. The process of determining the optimal hyper-parameters of the four text classification models was realized by the grid search function GridSearchCV from the Scikit-Learn library [45]. The hyper-parameters are provided in Table S4 in the supplementary material. When the optimal hyper-parameters of the text classification models were determined, the final models were trained using the whole training dataset. The performances of the four text classification models were evaluated and compared using the independent testing dataset. The performances of the text classification models were measured by sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUROC). Finally, the model that performed best was used to establish the assisted diagnostic model for AE.

Evaluation of the Assisted Diagnostic Model

Because the diagnostic criteria from Graus et al. [12] emphasized the importance of psychiatric symptoms, seizures, and short-term memory deficits, logical diagnosis rules for these three symptoms were added to the basis of the text classification model. Specifically, when the text classification model judges that the case is not an AE case, if the case contains one or more of these three symptoms, the case will be clinically diagnosed as an AE case. Furthermore, our assisted diagnostic model was combined with the results of standard paraclinical tests (e.g., CSF, MRI, or EEG studies) from the Graus criteria. All cases of the independent testing dataset were analyzed according to the diagnostic criteria for PAE from Graus et al. and the assisted diagnostic model to compare the etiological diagnosis. The performance was measured and assessed according to sensitivity, specificity, accuracy, and confusion matrices. We analyzed demographics, clinical, and standard paraclinical test characteristics. Continuous variables were presented as the mean ± standard deviation (SD) in the descriptive analyses, while categorical and binary variables were presented as frequencies (n) and percentages (%). Student’s t test and chi-squared test were used to compare outcomes between patient subgroups for continuous and categorical data, respectively. All data acquisition, processing, and analyses were conducted in the Python programming language (version 3.7.0) [46, 47], TensorFlow library [48, 49], and Scikit-Learn library [45].

Results

Our text classification dataset included 199 cases (83 definite AE cases and 116 definite IE cases). The process of constructing the text classification dataset is shown in Fig. 2. Demographics characteristics and standard paraclinical test characteristics are compared between AE and IE groups in Table 1. Compared to the IE group, the AE group had a significantly higher rate of cases with EEG abnormalities (57.8% vs. 25.9%; P < 0.001), and a lower percentage of male cases (44.6% vs. 62.1%; P = 0.015). There was no significant between-group difference with respect to age, CSF pleocytosis (white blood cell count greater than 5/mm3), CSF-specific oligoclonal bands positive, or MRI abnormalities.

Fig. 2

Table 1

Comparison of demographics characteristics and standard paraclinical test characteristics between AE and IE groups (n = 199)

Variable	AE (n = 83)	IE (n = 116)	P value^a
Demographic characteristics
Age (years)	31 (34.7 ± 15.5)	37 (36.0 ± 13.7)	0.533
Sex, male	37 (44.6)	72 (62.1)	0.015*
Standard paraclinical test characteristics
CSF pleocytosis (white blood cell count > 5/mm³)	53 (63.9)	87 (75.0)	0.09
CSF-specific oligoclonal bands positive	2 (2.4)	1 (0.9)	0.769
MRI abnormalities	40 (48.2)	48 (41.4)	0.340
EEG abnormalities	48 (57.8)	30 (25.9)	0.000**

Data are presented as the mean ± standard deviation (SD) or as n (%)

AE autoimmune encephalitis, IE infectious encephalitis, CSF cerebrospinal fluid, MRI magnetic resonance imaging, EEG electroencephalogram

aSignificant difference at *P < 0.05 and **P < 0.001

Flowchart illustrating the development of an assisted model based on NLP for the diagnosis of AE. NLP natural language processing, EMRs electronic medical records, CNS central nervous system, AE autoimmune encephalitis, IE infectious encephalitis Comparison of demographics characteristics and standard paraclinical test characteristics between AE and IE groups (n = 199) Data are presented as the mean ± standard deviation (SD) or as n (%) AE autoimmune encephalitis, IE infectious encephalitis, CSF cerebrospinal fluid, MRI magnetic resonance imaging, EEG electroencephalogram aSignificant difference at *P < 0.05 and **P < 0.001 By using the BiLSTM-CRF model, we recognized and extracted 5819 symptoms from 552 HPI in EMRs identified with the single diagnosis of CNS infection or AE. After normalizing synonyms of symptom terminologies using SNOMED-CT and MeSH, we finally obtained 219 normalized symptom terminologies in English language. The 219 normalized symptom terminologies and the proportion of the IE and AE cohorts with each symptom are shown in Table S6 in the supplementary material. In the definite AE group, the majority of patients presented with psychiatric symptoms (74.7%), decreased level of consciousness (68.7%), seizures (60.2%), and involuntary movement (59.0%). In the IE group, headache (81.9%) and fever (73.3%) were frequently encountered. Significant differences (only those with P < 0.001) with the frequency of clinical characteristics for comparing between IE and AE groups are shown in Fig. 3. The autoantibodies and microorganisms detected in samples from definite AE and IE cases are shown in Fig. 4.

Fig. 3

Frequency of clinical characteristics for comparing between IE and AE groups (significant difference at P < 0.001). AE autoimmune encephalitis, IE infectious encephalitis

Fig. 4

Autoantibodies and microorganisms detected in samples from definite AE and IE of the whole dataset. a Antibodies detected in samples from definite AE of the whole dataset (n = 83); asterisk indicates the antibodies were found in the same patient. b Microorganisms detected in samples from definite IE of the whole dataset (n = 116). AE autoimmune encephalitis, IE infectious encephalitis, NMDAR anti-N-methyl-d-aspartate receptor, LGI1 anti-leucine-rich glioma-inactivated protein 1, GABAR anti-gamma-aminobutyric acid B receptor, CASPR2 anti-contactin-associated protein 2

Frequency of clinical characteristics for comparing between IE and AE groups (significant difference at P < 0.001). AE autoimmune encephalitis, IE infectious encephalitis Autoantibodies and microorganisms detected in samples from definite AE and IE of the whole dataset. a Antibodies detected in samples from definite AE of the whole dataset (n = 83); asterisk indicates the antibodies were found in the same patient. b Microorganisms detected in samples from definite IE of the whole dataset (n = 116). AE autoimmune encephalitis, IE infectious encephalitis, NMDAR anti-N-methyl-d-aspartate receptor, LGI1 anti-leucine-rich glioma-inactivated protein 1, GABAR anti-gamma-aminobutyric acid B receptor, CASPR2 anti-contactin-associated protein 2 Comparing the results of accuracy, sensitivity, and specificity determined using the independent testing dataset in Fig. 5 shows that NBC with the BoW model performs better than the other three models on the basis of accuracy and sensitivity, obtaining 84.5% (95% confidence interval [CI] 74.0–92.0%) and 86.7% (95% CI 69.3–96.2%), respectively. The AUROC comparison results for the four models applied to the independent testing dataset are shown in Fig. 6, which shows that the AUROC of NBC with BoW achieved the best performance, with an AUROC of 0.85. However, the examination of specificity metrics suggested that the NBC with TF-IDF model, SVM with BoW model, and SVM with TF-IDF model achieved 85.4% (95% CI 70.8–94.4%), which is better than the 82.9% (95% CI 67.9–92.8%) observed with the NBC with BoW model. Since the goal of our study was to improve the sensitivity of the clinical diagnosis of AE, the NBC with BoW model was selected as the best model after comprehensive consideration of various performance measures.

Fig. 5

Fig. 6

AUROC of text classification models distinguished AE from IE using independent testing dataset. NBC with BoW, 0.85; NBC with TF-IDF, 0.83; support vector machine (SVM) with BoW, 0.83; SVM with TF-IDF, 0.81. AUROC area under the receiver operating characteristic curve, AE autoimmune encephalitis, IE infectious encephalitis, NBC naïve Bayesian classifier, SVM support vector machine, BoW bag of words, TF-IDF term frequency–inverse document frequency

Predictive performance of four text classification models to distinguish AE and IE on the basis of symptoms of the HPI using the independent testing dataset. AE autoimmune encephalitis, IE infectious encephalitis, HPI history of present illness, NBC naïve Bayesian classifier, SVM support vector machine, BoW bag of words, TF-IDF term frequency–inverse document frequency AUROC of text classification models distinguished AE from IE using independent testing dataset. NBC with BoW, 0.85; NBC with TF-IDF, 0.83; support vector machine (SVM) with BoW, 0.83; SVM with TF-IDF, 0.81. AUROC area under the receiver operating characteristic curve, AE autoimmune encephalitis, IE infectious encephalitis, NBC naïve Bayesian classifier, SVM support vector machine, BoW bag of words, TF-IDF term frequency–inverse document frequency We used the sensitivity, specificity, and accuracy to evaluate the diagnostic performance for PAE between the assisted diagnostic model and the diagnostic criteria from Graus et al. using the same independent subset of patients (30 definite AE cases and 41 definite IE cases). The performance measures show that our model achieved better sensitivity, specificity, and accuracy than the diagnostic criteria proposed by Graus et al. The performance is as follow: 86.7% (95% CI 69.3–96.2%) vs. 73.3% (95% CI 54.1–87.7%), 75.6% (95% CI 59.7–87.6%) vs. 56.1% (95% CI 39.7–71.5%), and 80.3% (95% CI 69.1–88.8%) vs. 63.4% (95% CI 51.1–74.5%), respectively. The detailed results for the numbers of true positive, false positive, true negative, and false negative are shown in confusion matrices (Fig. 7).

Fig. 7

Confusion matrices of different diagnostic methods for AE using the independent testing dataset. a Confusion matrix for predictions of Graus criteria vs. etiology diagnosis of AE; b Confusion matrix for predictions of assisted diagnostic model vs. etiology diagnosis of AE. AE autoimmune encephalitis, IE infectious encephalitis

Discussion

In this study, we have developed an assisted diagnostic model for AE based on NLP technology by recognizing symptoms described in EMRs using CNER. We subsequently tested this model in an independent testing dataset and compared its performance with previously established Graus criteria. Compared with the diagnostic criteria proposed previously, the early diagnostic sensitivity for PAE using the assisted diagnostic model based on the independent testing dataset was improved from 73.3% (95% CI 54.1–87.7%) to 86.7% (95% CI 69.3–96.2%). Since IE is the most common and most difficult disease to differentiate from AE in clinical work, IE is taken as the most important disease for differential diagnosis of AE. In the past, early clinical diagnosis of AE was mostly based on the diagnostic criteria proposed by Graus et al. in 2016, but the consistency of the performance of the diagnostic criteria for PAE is controversial, ranging from 57.6% to 100.0% [14-17]. Our study presented some of the results of previous studies that evaluated and validated the performance of Graus criteria and compared them with our study (Table 2). The differences in sensitivity for PAE in previous studies may be caused by the following reasons: (1) sample size of cases is small and different. (2) Definition of the AE group is different, e.g., Wagner et al.’s AE group contained 17 definite AE and 16 probable AE cases. However, the sensitivity for PAE would be 29.4% if they excluded cases which diagnosed as probable AE, only bringing the 17 definite AE cases into the AE group. (3) Ratio of different anti-neuronal antibody cases varies, such as a higher ratio of NMDAR cases in the AE group, for which the algorithm seems to have a particularly high sensitivity. (4) Sensitivity for PAE increases with time; the sensitivity reported by Li et al. for the time period of up to 14 days after admission only was 60.4% for PAE.

Table 2

Summary of early diagnostic performance in previous studies using Graus criteria and comparison with our study

Studies	Diagnostic methods for PAE	AE (n)	IE or non-AE^a (n)	Sensitivity (%)	Specificity (%)
Giordano et al. [16]	Graus criteria	22	0	100.0	–
Baumgartner et al. [15]	Graus criteria	50	0	86.0	–
Li et al. [14]	Graus criteria	64	31^a	84.3	93.5
Wagner et al. [17]	Graus criteria	33	51	57.6	7.8
Our study	Graus criteria	30	41	73.3	56.1
Our study	Assisted model	30	41	86.7	75.6

AE autoimmune encephalitis, IE infectious encephalitis, PAE possible autoimmune encephalitis

aCases with non-autoimmune encephalitis in Li et al.’s study, including viral encephalitis, purulent encephalitis, tuberculous meningoencephalitis, central nervous system tumor, and epileptic disorders

Summary of early diagnostic performance in previous studies using Graus criteria and comparison with our study AE autoimmune encephalitis, IE infectious encephalitis, PAE possible autoimmune encephalitis aCases with non-autoimmune encephalitis in Li et al.’s study, including viral encephalitis, purulent encephalitis, tuberculous meningoencephalitis, central nervous system tumor, and epileptic disorders In our study, the sensitivity increased from 73.3% to 86.7% by using the assisted diagnostic model, which compared with the Graus diagnostic criteria applied to the same independent testing dataset. We reviewed the reasons for misdiagnosis of the four cases in the independent testing dataset when using the assisted diagnostic model. Two were due to no abnormalities in the CSF, MRI or EEG, and the other two were due to atypical symptoms, which were not identified by the assisted diagnostic model. However, our model identified four more true positive AE cases than the diagnostic criteria for PAE from Graus et al. The measure of specificity mainly depends on the type of disease that is used for the differential diagnosis of AE. If the disease is IE, differential diagnosis is difficult, leading to relatively poor specificity of PAE both for Graus criteria and our assisted model. If the disease is a non-AE disease that can easily be identified, then specificity would be relatively high. In fact, compared with typical symptoms associated with AE and the results of standard paraclinical tests, “Reasonable exclusion of alternative causes” in diagnostic criteria for AE is the most important factor for enhancing the specificity [17]. Therefore, the specificity requires much evidence to exclude all other etiologies, and our study aims to improve the AE diagnostic sensitivity, (finding out the highest number of cases of PAE), not the diagnostic specificity. In the early differential diagnosis of AE, IE is the most common and difficult disease to differentiate, especially viral encephalitis [50]. In our study, the assisted diagnostic model misdiagnosed 10 IE cases as AE, of which eight cases were viral encephalitis and the other two cases were encephalitis caused by Mycobacterium tuberculosis. Since infection itself is an important cause of AE, an increasing number of studies have begun to focus on AE caused by viral encephalitis [19, 51]. This infection-induced AE is not only difficult to diagnose but also may induce double-peak encephalitis, which brings great challenges in clinical treatment. For physicians, fundamentally solving the differential diagnosis problem between AE and IE requires an accurate diagnosis, namely, an etiological diagnosis. Early clinical diagnosis of AE can only be adopted as a compromise measure to allow for early initial treatment until an etiological diagnosis can be provided. A proactive search for the exact etiology of AE or IE should always be carried out through the clinical diagnosis and treatment process. In our study, all the possible accompanying symptoms of AE (including typical positive symptoms, atypical positive symptoms and negative symptoms) were considered in the development of the assisted diagnosis model. After extracting symptoms from the HPI texts, we normalized synonyms of symptom terminologies in English language using SNOMED-CT and MeSH. Then post-structuring of the HPI texts and development of the text classification model were implemented using symptoms terminologies in English language. This facilitates generalization of our findings and our assisted diagnostic model to English speaking populations. The above method not only accounted for the clinical characteristics of complex and diverse AE symptoms but also effectively improved the sensitivity and specificity of early clinical diagnosis of AE. Physicians could quickly obtain an early clinical diagnosis of AE when they input the HPI into the model according to their narrative habits for describing symptoms in Chinese language or English language for non-Chinese speakers. There are several limitations to this study: First, the Graus diagnostic criteria were not developed to differentiate AE from IE, but to allow for early diagnostic suspicion of AE among all patients with neurologic syndromes. This is a limitation that we used Graus criteria in comparison with our model to distinguish AE from IE. Our assisted diagnostic model shows performance only for the differential diagnosis between AE and IE, which would limit generalizability of our study. There are other differential diagnoses can also be challenging in the evaluation of PAE except IE (e.g., new onset epilepsy due to other causes, other organic encephalopathies, psychiatric disorders, etc.). In addition, patients with antibody-associated AE may have different clinical features to those with antibody-negative AE. However, the antibody-negative AE cases in our study only included the ones with definite limbic encephalitis, which further limits generalizability. In the future, we plan to include more diseases, which need differential diagnosis with AE, and analyze texts in EMRs from patients in primary care, general medical, or psychiatric settings because the greatest unmet need with regard to early diagnosis arises from patients not yet evaluated in neurology departments. Second, NLP research is currently dominated by the use of transformer models [52], such as BERT. In the future, we plan to adopt advanced BERT-based transformer models (e.g., RoBERTa [53], ELECTRA [54]) or specialized models (e.g., MacBERT [55] and BioBERT [56]) as pretrained models, which could be applicable in the present work. Third, model explainability was not discussed in our study. However, the need for explainability is present in NLP [57] and the healthcare domain in particular, which is currently a very active area of research [58-60]. The complexity arising from the large parameter space and the combination of algorithms makes models uninterpretable for humans, i.e., the decision process cannot be fully comprehended [61, 62]. To alleviate the issues present in explainability, an assistive diagnostic model that humans can understand, manage, and trust should be proposed in our future work [63]. Finally, our study is retrospective. The sample size of datasets is small, which includes 83 definite AE cases and 116 definite IE cases. As a result of the small sample size, deep neural networks were not used to perform text classification. To explore the effectiveness of the assisted diagnostic model for early clinical diagnosis of AE, larger prospective studies should be conducted to obtain more powerful clinical evidence.

Conclusion

In this study, we described the development of an assisted diagnostic model for AE based on normalized symptoms from the HPI described in EMRs via NLP technology. We demonstrated that the assisted diagnostic model could effectively increase diagnostic sensitivity for AE compared to previous diagnostic criteria from Graus et al. This model is capable of assisting physicians in establishing the diagnosis of AE automatically after inputting the HPI and the results of standard paraclinical tests according to their narrative habits for describing symptoms, allowing for more accurate diagnosis and prompt initiation of specific treatment. Below is the link to the electronic supplementary material. Supplementary file1 (PDF 958 KB)

Why carry out this study?

Early diagnosis and etiological treatment can effectively improve the prognosis of patients with autoimmune encephalitis (AE).

There are practical syndrome-based diagnostic criteria for AE described previously.

The sensitivity of these criteria is reasonable but could be further improved.

What was learned from the study?

Using our assisted model, we increased the early diagnostic sensitivity for AE by broadening the number of symptoms analyzed compared to previous criteria.

By using natural language processing technology, the model could provide an early diagnosis of AE automatically, avoiding misdiagnosis and allowing for prompt initiation of specific treatment.

37 in total

1. Neuropsychological and psychiatric profiles in acute encephalitis in adults.

Authors: Stephen M Pewter; W Huw Williams; Catherine Haslam; Janice M Kay
Journal: Neuropsychol Rehabil Date: 2007 Aug-Oct Impact factor: 2.868

2. A comprehensive study of named entity recognition in Chinese clinical text.

Authors: Jianbo Lei; Buzhou Tang; Xueqin Lu; Kaihua Gao; Min Jiang; Hua Xu
Journal: J Am Med Inform Assoc Date: 2013-12-17 Impact factor: 4.497

3. Treatment and prognostic factors for long-term outcome in patients with anti-NMDA receptor encephalitis: an observational cohort study.

Authors: Maarten J Titulaer; Lindsey McCracken; Iñigo Gabilondo; Thaís Armangué; Carol Glaser; Takahiro Iizuka; Lawrence S Honig; Susanne M Benseler; Izumi Kawachi; Eugenia Martinez-Hernandez; Esther Aguilar; Núria Gresa-Arribas; Nicole Ryan-Florance; Abiguei Torrents; Albert Saiz; Myrna R Rosenfeld; Rita Balice-Gordon; Francesc Graus; Josep Dalmau
Journal: Lancet Neurol Date: 2013-01-03 Impact factor: 44.182

4. Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models.

Authors: Shaodian Zhang; Tian Kang; Xingting Zhang; Dong Wen; Noémie Elhadad; Jianbo Lei
Journal: J Biomed Inform Date: 2016-02-26 Impact factor: 6.317

Review 5. Diagnostics of autoimmune encephalitis associated with antibodies against neuronal surface antigens.

Authors: Luigi Zuliani; Marco Zoccarato; Matteo Gastaldi; Raffaele Iorio; Amelia Evoli; Tiziana Biagioli; Silvia Casagrande; Elena Bazzigaluppi; Raffaella Fazio; Claudia Giannotta; Eduardo Nobile-Orazio; Francesca Andreetta; Ornella Simoncini; Gianna Costa; Sara Mariotto; Sergio Ferrari; Elisabetta Galloni; Michela Marcon; Diego Franciotta; Bruno Giometto
Journal: Neurol Sci Date: 2017-10 Impact factor: 3.307

6. Admission diagnoses of patients later diagnosed with autoimmune encephalitis.

Authors: Annette Baumgartner; Sebastian Rauer; Tilman Hottenrott; Frank Leypoldt; Friederike Ufer; Harald Hegen; Harald Prüss; Jan Lewerenz; Florian Deisenhammer; Oliver Stich
Journal: J Neurol Date: 2018-11-12 Impact factor: 4.849

7. Emerging clinical applications of text analytics.

Authors: Irena Spasić; Özlem Uzuner; Li Zhou
Journal: Int J Med Inform Date: 2019-09-20 Impact factor: 4.046

8. Herpes simplex virus encephalitis is a trigger of brain autoimmunity.

Authors: Thaís Armangue; Frank Leypoldt; Ignacio Málaga; Miquel Raspall-Chaure; Itxaso Marti; Charles Nichter; John Pugh; Monica Vicente-Rasoamalala; Miguel Lafuente-Hidalgo; Alfons Macaya; Michael Ke; Maarten J Titulaer; Romana Höftberger; Heather Sheriff; Carol Glaser; Josep Dalmau
Journal: Ann Neurol Date: 2014-02-25 Impact factor: 10.422

9. Anti-NMDA receptor encephalitis: clinical characteristics, predictors of outcome and the knowledge gap in southwest China.

Authors: W Wang; J-M Li; F-Y Hu; R Wang; Z Hong; L He; D Zhou
Journal: Eur J Neurol Date: 2015-11-12 Impact factor: 6.089

Review 10. The Laboratory Diagnosis of Autoimmune Encephalitis.

Authors: Sang Kun Lee; Soon-Tae Lee
Journal: J Epilepsy Res Date: 2016-12-31