
A novel deep learning approach to extract Chinese clinical entities for lung cancer screening and staging.

Huanyao Zhang1,2, Danqing Hu1,2, Huilong Duan1,2, Shaolei Li3, Nan Wu4, Xudong Lu1,2.   

Abstract

BACKGROUND: Computed tomography (CT) reports record a large volume of valuable information about patients' conditions and radiologists' interpretations of radiology images, which can be used for clinical decision-making and further academic study. However, the free-text nature of clinical reports is a critical barrier to using these data more effectively. In this study, we investigate a novel deep learning method to extract entities from Chinese CT reports for lung cancer screening and TNM staging.
METHODS: The proposed approach presents a new named entity recognition algorithm, namely the BERT-based-BiLSTM-Transformer network (BERT-BTN) with pre-training, to extract clinical entities for lung cancer screening and staging. Specifically, instead of traditional word embedding methods, BERT is applied to learn deep semantic representations of characters. Following the bidirectional long short-term memory layer, a Transformer layer is added to capture the global dependencies between characters. In addition, a pre-training technique is employed to alleviate the problem of insufficient labeled data.
RESULTS: We verify the effectiveness of the proposed approach on a clinical dataset containing 359 CT reports collected from the Department of Thoracic Surgery II of Peking University Cancer Hospital. The experimental results show that the proposed approach achieves an 85.96% macro-F1 score under the exact match scheme, improving performance by 1.38%, 1.84%, 3.81%, 4.29%, 5.12%, 5.29%, and 8.84% compared to BERT-BTN, BERT-BiLSTM, BERT-fine-tune, BERT-Transformer, FastText-BTN, FastText-BiLSTM, and FastText-Transformer, respectively.
CONCLUSIONS: In this study, we developed a novel deep learning method, i.e., BERT-BTN with pre-training, to extract the clinical entities from Chinese CT reports. The experimental results indicate that the proposed approach can efficiently recognize various clinical entities about lung cancer screening and staging, which shows the potential for further clinical decision-making and academic research.
© 2021. The Author(s).


Keywords:  BERT; CT reports; Lung cancer screening and staging; Named entity recognition; Pre-training; Transformer

Year:  2021        PMID: 34330277      PMCID: PMC8323233          DOI: 10.1186/s12911-021-01575-x

Source DB:  PubMed          Journal:  BMC Med Inform Decis Mak        ISSN: 1472-6947            Impact factor:   2.796


Background

Lung cancer is the most commonly diagnosed cancer and the leading cause of cancer-related deaths, and the situation is particularly urgent in China [1]. Computed tomography (CT) is the primary examination for lung cancer, and CT reports record a large volume of valuable information about patients' conditions and radiologists' interpretations, which can be used for clinical diagnosis and progression assessment. The information in clinical narratives has also been utilized in many academic studies, e.g., risk evaluation [2, 3], staging [4], and decision making [5], with remarkable results. However, the free-text nature of CT reports is a critical barrier to fully using this information [6], and manually extracting structured information from free-text data is time-consuming, error-prone, and costly [7]. To extract information from free-text corpora, Named Entity Recognition (NER) is applied to identify the types and boundaries of entities of interest, and it has been widely investigated [8]. In earlier studies, rule-based approaches [9, 10] were first proposed to tackle this problem. Although valuable, simplified hand-crafted rules can hardly cover all language phenomena, while intricate rules are difficult to update and maintain and often lead to poor generalization and portability [11]. To alleviate these problems, many researchers turned to machine learning algorithms, e.g., support vector machines (SVM) and Conditional Random Fields (CRF), which proved powerful for NER [12-15]. However, the performance of these statistical methods relies heavily on predefined features, which can hardly cover all semantic representations useful for recognition, resulting in poor discriminatory ability [16]. Recently, deep neural networks (DNN), especially Recurrent Neural Networks (RNN), have achieved remarkable performance in Clinical Named Entity Recognition (CNER) tasks.
Mostafiz and Ashraf [17] compared an RNN-based NER method with other information extraction tools, e.g., RapTAT [18] and MTI [19], for extracting pathological terms from chest X-ray radiology reports and demonstrated that the deep neural network outperformed generic tools by a large margin. Gridach [20] added a CRF layer after the RNN layer for the CNER task and obtained remarkable results on both the JNLPBA and BioCreAtIvE II GM data sets. Zhang et al. [21] used a Bi-directional Long Short-Term Memory network with a Conditional Random Field layer (BiLSTM-CRF) to simultaneously identify clinical entities such as diagnoses, symptoms, and treatments from Chinese Electronic Health Records (EHRs) and achieved better performance than a CRF model. Besides the breakthroughs of RNNs, self-attention, a special case of the attention mechanism, has recently been widely used to capture richer correlations between words. RNNs propagate long-range dependencies over many time steps [22], which makes it challenging to learn long-term dependencies when encoding long sequences; in contrast, self-attention can directly capture long-range dependencies by calculating the cross interactions between any two tokens in a sentence regardless of their distance [23]. By focusing on the important information, it assigns higher weights to important tokens and smaller weights to the other information received at the same time [16]. Relying entirely on self-attention to draw global dependencies between input and output, the Transformer [24] has achieved remarkable performance in a variety of sequence learning tasks [25, 26]. Despite these achievements, the Transformer still lacks the components necessary for modeling local structures sequentially and relies heavily on position embeddings, which limits its efficiency [27].
More recently, a novel language representation model, namely Bidirectional Encoder Representations from Transformers (BERT) [28], was proposed, which is pre-trained on large unlabeled corpora using bidirectional Transformers. By pre-training with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives on large plain-text corpora, BERT has achieved significant improvements on various Natural Language Processing (NLP) tasks, e.g., NER, Question Answering (QA), and Machine Reading Comprehension (MRC). One important application of BERT is to provide word embeddings as features for DNNs. As an unsupervised feature learning technique, word embedding maps words to vectors of real numbers to capture the semantic and syntactic information between them [29], and it has become an indispensable component of DNNs for NER tasks. Unlike classical embeddings such as FastText [30] and GloVe, which represent a polysemous word with a single fixed vector, BERT can dynamically adjust the word representation by capturing contextual information and long-distance dependencies between words in the sentence [31]. To build a supervised NER model, data annotation is an essential step, but it is expensive and time-consuming [32]. When labeled data are limited, many linguistic phenomena are not covered in the training corpus, which may lead to poor generalization [33]. Unsupervised pre-training is a popular way to enhance model performance by learning linguistic phenomena from unlabeled data. In the sense of reaching a minimum of the empirical cost function, unsupervised pre-training can near-optimally initialize the model's parameters, thereby making the optimization process more efficient [34]. Although CNER has been extensively studied [17, 20, 21], most previous studies did not focus on extracting entities for staging from radiology reports.
In this paper, we propose a novel deep learning approach, namely the BERT-based-BiLSTM-Transformer network (BERT-BTN) with pre-training, to extract 14 types of clinical entities from chest CT reports for lung cancer screening and TNM staging. Specifically, BERT is applied as the word embedding layer to learn character representations. We then combine an LSTM with a Transformer to enjoy their respective advantages while avoiding their respective limitations: following the traditional BiLSTM layer, we add a Transformer layer to capture the global dependencies between characters. To alleviate the problem of insufficient labeled data, a pre-training technique is employed to initialize the parameters of the proposed model. Experimental results indicate that our method achieves competitive performance in recognizing entities compared with benchmark models. To the best of our knowledge, this is the first study to combine these techniques to extract entities from Chinese CT reports for lung cancer screening and TNM staging.

Methods

Overview

The development pipeline of the proposed method is shown in Fig. 1. To develop our NER model, we first annotated the pre-defined entities in chest CT reports. Then, the pre-training technique was applied to initialize the parameters of the model. After that, the model was trained, validated, and tested on the annotated dataset. The details of the proposed method are elaborated as follows.
Fig. 1

The development pipeline of the proposed method


Data and annotation

A total of 529 chest CT reports were collected from the Department of Thoracic Surgery II of Peking University Cancer Hospital. The data contained heterogeneous fields including patient identification, examination time, findings, conclusion, diagnosis, etc. In this study, we extracted information from the findings section, where the information about cancer screening and staging is mainly recorded. In clinical practice, clinicians usually refer to the TNM staging guideline to stage patients. Based on the 8th edition of the lung cancer TNM staging guideline [35] and consultations with clinicians at the department, we defined a total of 14 types of named entities covering the screening and staging information in chest CT reports. These entities and corresponding instances are shown in Table 1.
Table 1

Entity types for clinical named entity recognition

Entity type | Description | Instance
Vessel | Description of great vessel invasion | 病灶包绕右下肺动脉主 (The lesion surrounds the right lower pulmonary trunk)
Vertebral Body | Description of vertebral body invasion | 颈7椎体压缩变扁 (The cervical vertebra C7 is compressed and flattened)
PAOP^a | Description of pulmonary atelectasis or obstructive pneumonitis | 远端可见片絮影 (Patchy opacities are visible distally)
Bronchus | Description of bronchial invasion | 凹陷 (indentation)
Pleura | Description of pleural invasion or metastasis | 增厚 (thickening)
Shape | Shape of mass | 类圆形 (roundish)
Density | Density of mass | 磨玻璃密度 (ground-glass density)
Mass | Suspected mass/lump/lesion in lung | 结节 (nodule)
Enhancement | Enhancement extent of mass | 强化明显 (marked enhancement)
Size | Size of mass or lymph nodes | 25 × 22 cm
Location | Location of mass or lymph nodes | 左上肺右基底段 (upper left lung, right basal segment)
Lymph | Suspected lymph node metastasis | 肿大淋巴结 (swollen lymph nodes)
Negation | Negative words | 未见 (not seen)
Effusion | Condition of pericardial effusion | 心包积液 (pericardial effusion)

^a PAOP: Pulmonary Atelectasis/Obstructive Pneumonitis

Based on the i2b2 annotation guideline [36] and repeated discussions, we formulated an annotation guideline, which is listed in Additional file 1. Two medical informatics engineers were recruited to annotate the chest CT reports manually following this guideline. We used the BIO label scheme, where B, I, and O denote the beginning, inside, and outside characters of an entity, respectively. Figure 2 shows an example of an annotated chest CT report. We randomly selected 359 chest CT reports to annotate. The summary statistics of the annotations are shown in Table 2. The annotated data were then used as the gold standard to train and evaluate the proposed method. The annotation task was initiated with preliminary practice rounds in which the annotators were given the same set of 50 CT reports to annotate, followed by team meetings where agreement was discussed to clarify ambiguous examples found during the preceding practice sessions. Once a good understanding of the annotation task was achieved, we selected 100 reports to be annotated by both annotators to calculate the inter-rater agreement.
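The character-level BIO scheme described above can be sketched in a few lines (a hypothetical helper for illustration, not the annotation tool used in the study):

```python
# Character-level BIO labeling: given a sentence and annotated entity spans
# (start, end, type), with end exclusive, emit one tag per character.
def bio_tags(sentence, entities):
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # beginning character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # inside characters of the entity
    return tags

# "左上肺" (upper left lung) is a Location entity, "结节" (nodule) a Mass entity.
print(bio_tags("左上肺结节", [(0, 3, "Location"), (3, 5, "Mass")]))
# ['B-Location', 'I-Location', 'I-Location', 'B-Mass', 'I-Mass']
```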
Fig. 2

A chest CT report sample annotated with BIO tags. a Original CT report. LOC: Location; SHP: Shape; MA: Mass; SZ: Size; Ng: Negation; LPH: Lymph. b Its English translation

Table 2

The statistics of annotated named entities in chest CT reports

Entity type | Count | Average length
Vessel | 51 | 10.82
Vertebral Body | 28 | 13.75
PAOP | 85 | 8.77
Bronchus | 58 | 4.66
Pleura | 230 | 4.27
Shape | 513 | 4.37
Density | 340 | 5.00
Mass | 874 | 4.11
Enhancement | 185 | 5.44
Size | 774 | 7.35
Location | 1937 | 8.77
Lymph | 588 | 4.66
Negation | 924 | 4.27
Effusion | 412 | 4.37

Clinical named entity recognition model

As shown in Fig. 3, given a sentence, we first input it into the embedding layer to capture the semantic representation of each character. In this paper, we used the Whole Word Masking version of BERT (BERT-WWM) [37] as the embedding layer, which mitigates a limitation of the original BERT by forcing the model to recover whole words in the MLM pre-training task.
Fig. 3

The architecture of the BERT-BTN model

Following the word embedding layer, a BiLSTM layer was applied to capture nested structures of the sentence and latent dependencies between characters. After that, we used a Transformer layer to draw global dependencies between characters regardless of distance, which alleviates the burden on the LSTM of compressing all relevant information into a single hidden state [38]. Then a linear layer was employed to predict the possible labels of each character in the sentence. To improve predictive accuracy, we added a CRF layer to learn constraints from the annotated labels and ensure that the final predicted label sequences are valid. Finally, a softmax function was used to output the probabilities of all labels for each character in the sentence.
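The constraints that the CRF layer learns can be illustrated with a hand-written validity check for BIO label sequences (a sketch of the kind of transition rule a CRF enforces, not the CRF itself):

```python
# A BIO sequence is valid only if every I-X tag immediately follows
# a B-X or I-X tag of the same entity type X.
def valid_bio(tags):
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            return False
        prev = tag
    return True

print(valid_bio(["B-Mass", "I-Mass", "O"]))     # True: well-formed entity
print(valid_bio(["O", "I-Mass"]))               # False: I- without a preceding B-
print(valid_bio(["B-Size", "I-Location"]))      # False: entity type changes mid-entity
```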

Unsupervised pre-training

When labeled data are limited, pre-training has been proven to effectively improve model performance [39]. In this study, we applied a pre-training method described in the literature [40]. To pre-train the model, we first calculated the Term Frequency-Inverse Document Frequency (TF-IDF) weights based on all CT reports except those in the test set:

TF-IDF(w, d) = tf(w, d) × log(N / df(w))  (1)

where d is a document, w is a word in the document, tf(w, d) is the number of times w occurs in d, N is the total number of documents, and df(w) is the number of documents containing w. Next, we employed Eq. 2 to normalize TF-IDF(w, d), obtaining a normalized weight. Then, we multiplied the normalized weight by the corresponding character embedding (Eq. 3) to obtain the TF-IDF-weighted embedding as the target for pre-training. It was shown that these TF-IDF-weighted embeddings are able to capture some of the natural variation between different sentences [40]. To pre-train the model in an unsupervised manner, we replaced the CRF layer with a tanh layer and used the mean-square-error loss to formulate the objective function:

L = (1/n) Σ_{i=1}^{n} ||o_i − t_i||²  (4)

where n is the number of words in a sentence, o_i is the output for the i-th word in the sentence, and t_i is the corresponding TF-IDF-weighted embedding. During pre-training, we updated only the parameters of the BiLSTM layer, the Transformer layer, and the linear layer, and froze the parameters of the other layers. BERT itself optimizes two training objectives, MLM and NSP. MLM is the task of predicting missing tokens in a sequence from their placeholders: it masks some percentage of the input tokens at random and then predicts those masked tokens. To train a model that understands sentence relationships, BERT also pre-trains on the NSP task, which takes two sequences (s1, s2) as input and predicts whether s2 is the direct continuation of s1. However, pre-training BERT requires a large collection of unlabeled text. Compared to BERT, our pre-training approach is simpler and does not need as much unlabeled text.
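As a concrete illustration of Eq. 1, the sketch below computes a TF-IDF weight over a toy character-level corpus and scales a character embedding with it; the normalization step (Eq. 2) is omitted, and the embedding values are hypothetical, not the model's:

```python
import math

def tf_idf(word, doc, docs):
    tf = doc.count(word)                        # tf(w, d): occurrences of w in d
    df = sum(1 for d in docs if word in d)      # df(w): documents containing w
    return tf * math.log(len(docs) / df)        # Eq. 1

# Toy corpus of three character-tokenized "reports".
docs = [["结", "节"], ["结", "节", "影"], ["积", "液"]]
w = tf_idf("影", docs[1], docs)                 # 1 * log(3/1)

# Scale a toy 2-dimensional character embedding (hypothetical values) to get
# the TF-IDF-weighted pre-training target (Eq. 3, without the normalization).
embedding = {"影": [0.1, 0.2]}
target = [w * x for x in embedding["影"]]
```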

Experiments and results

To train and evaluate the proposed model, we randomly split the CT reports into a training set (70%), a validation set (10%), and a test set (20%). To determine the optimal hyper-parameters, a grid search was applied on the training set. The hyper-parameter spaces were Learning_Rate {1e-4, 5e-4, 1e-3, 5e-3}, Dropout {0, 0.1, 0.2, 0.3, 0.4, 0.5}, Batch_Size {8, 16}, LSTM_Layer {1, 2}, LSTM_Hidden_Size {64, 128}, Transformer_Layer {1, 2, 3, 5}, and Transformer_Head {1, 2, 3, 4, 6, 8, 12}. The hyper-parameters used in this paper are listed in Table 3. Standard back-propagation was used to update all parameters, and the Adam algorithm [41] was employed to optimize the objective function. To avoid overfitting, an early stopping strategy [42] was employed on the validation set.
Table 3

The main hyper-parameters for the proposed model

Parameter | Setting
LSTM_Hidden_Size | 128
LSTM_Layer | 1
Transformer_Layer | 1
Transformer_Head | 1
Dropout | 0.13
Batch_Size | 8
Learning_Rate | 1e-4
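The grid search over the hyper-parameter spaces listed above can be sketched as an exhaustive enumeration (the scoring function here is a stand-in for training the model and measuring validation-set F1):

```python
import itertools

# The hyper-parameter spaces reported in the paper.
space = {
    "Learning_Rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "Dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "Batch_Size": [8, 16],
    "LSTM_Layer": [1, 2],
    "LSTM_Hidden_Size": [64, 128],
    "Transformer_Layer": [1, 2, 3, 5],
    "Transformer_Head": [1, 2, 3, 4, 6, 8, 12],
}

def grid_search(space, score):
    """Return the configuration that maximizes score (e.g., validation F1)."""
    keys = list(space)
    best_score, best_cfg = float("-inf"), None
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        s = score(cfg)          # stand-in: train + evaluate on validation set
        if s > best_score:
            best_score, best_cfg = s, cfg
    return best_cfg

# The full grid enumerates 4*6*2*2*2*4*7 = 5376 configurations.
n_configs = len(list(itertools.product(*space.values())))
print(n_configs)  # 5376
```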
Two evaluation scoring schemes were used, i.e., exact match and inexact match. The exact match scheme counts only perfect matches against the gold standard, while under the inexact match scheme an entity is counted as correctly predicted if it overlaps with the corresponding entity in the gold standard. We selected precision, recall, and F1 score as the evaluation metrics. To investigate the effectiveness of the proposed approach, extensive experiments were carried out on the collected data, including (1) replacing the BERT embedding with the FastText embedding, (2) removing the Transformer layer from the proposed model, (3) removing the BiLSTM layer from the proposed model, (4) canceling the pre-training step, and (5) directly fine-tuning BERT. We ran each experiment five times and averaged the five results as the final result to reduce possible bias from dataset partitioning. Based on the 100 reports annotated by both annotators, the inter-annotator agreement measured by the kappa statistic [43] is 0.937, which indicates the annotation is reliable. Table 4 shows the overall performance of the proposed and benchmark models. As shown in Table 4, BERT-BTN with pre-training achieves the best performance, with an 85.96% macro-F1 score and a 90.67% micro-F1 score under the exact match scheme and a 94.56% macro-F1 score and a 96.78% micro-F1 score under the inexact match scheme, in comparison with the benchmark models.
Table 4

The f1 scores of the proposed and benchmark models

Model | Inexact-match macro | Inexact-match micro | Exact-match macro | Exact-match micro
FastText-Transformer | 89.29 ± 2.64 | 95.25 ± 0.46 | 77.12 ± 4.14 | 86.85 ± 1.18
FastText-BiLSTM | 90.46 ± 1.31 | 95.72 ± 0.70 | 80.67 ± 0.87 | 88.08 ± 1.41
FastText-BTN | 90.47 ± 1.82 | 95.22 ± 0.52 | 80.84 ± 3.16 | 87.76 ± 1.30
BERT-Transformer | 90.94 ± 0.69 | 95.80 ± 0.31 | 81.67 ± 6.14 | 87.35 ± 1.23
BERT-BiLSTM | 93.05 ± 0.89 | 97.27 ± 0.16 | 84.12 ± 1.59 | 90.13 ± 0.92
BERT-BTN | 94.40 ± 0.91 | 97.28 ± 0.60 | 84.58 ± 2.72 | 90.78 ± 1.04
BERT-fine-tune | 92.43 ± 0.61 | 96.22 ± 0.93 | 82.15 ± 3.41 | 88.33 ± 3.00
BERT-BTN (with pre-training) | 94.56 ± 0.80 | 96.78 ± 0.73 | 85.96 ± 0.46 | 90.67 ± 0.51

Bold values indicate the best score for each evaluation metric
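The two scoring schemes can be sketched as follows, representing each entity as a (start, end, type) span with the end exclusive (a simplified illustration; the paper's exact evaluation implementation is not published here):

```python
# Exact match: a predicted span must be identical to a gold span of the same
# type. Inexact match: overlapping a gold span of the same type is enough.
def f1(pred, gold, inexact=False):
    def hit(p, g):
        if p[2] != g[2]:
            return False                         # entity types must agree
        if inexact:
            return p[0] < g[1] and g[0] < p[1]   # spans overlap
        return (p[0], p[1]) == (g[0], g[1])      # spans identical
    tp_p = sum(any(hit(p, g) for g in gold) for p in pred)
    tp_g = sum(any(hit(p, g) for p in pred) for g in gold)
    precision = tp_p / len(pred) if pred else 0.0
    recall = tp_g / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [(0, 3, "Location"), (3, 5, "Mass")]
pred = [(0, 2, "Location"), (3, 5, "Mass")]      # Location boundary missed
print(f1(pred, gold))                 # 0.5  (exact match)
print(f1(pred, gold, inexact=True))   # 1.0  (inexact match)
```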

To prove the effectiveness of the BERT embedding, we selected the FastText embedding, a classical embedding that represents each word with a single fixed vector, as the baseline. Comparing the two word embedding methods, models using the BERT embedding outperform models using the FastText embedding by up to 4.55% macro-F1 under the exact match scheme and 3.93% macro-F1 under the inexact match scheme. These improvements indicate that BERT is more powerful at encoding contextual information by taking both the left and right contexts of target words into account. BERT-BTN provides a 0.46% overall improvement under the exact match scheme and 1.35% under the inexact match scheme compared with BERT-BiLSTM, indicating that the long-term dependencies learned by the Transformer are useful for NER. When comparing BERT-Transformer with BERT-BTN, the macro-F1 score drops by 2.91% under the exact match scheme and 3.46% under the inexact match scheme, indicating that the position information encoded by the BiLSTM has a significant influence on the Transformer's performance. The reason for this reduction may be that the Transformer relies only on self-attention to draw global dependencies of the input and treats every position identically, which may neglect some fixed patterns in the sentences, since some information is described by several clauses in a fixed order. We also directly fine-tuned BERT, and the results show that simple fine-tuning cannot achieve competitive performance under either the exact or the inexact match scheme in comparison with the other BERT-based models, indicating that it remains a challenge to achieve good results by fine-tuning BERT directly on some domain-specific tasks.
Moreover, when applying the pre-training technique, both prediction accuracy and the speed of convergence gain considerable improvements in comparison with BERT-BTN. As depicted in Fig. 4, using TF-IDF–weighted character embeddings to pre-train the model can almost optimally initialize the model’s parameters so as to accelerate convergence.
Fig. 4

The training loss of models using BERT embedding

Table 5 shows the macro-F1 score of each entity type under the exact match scheme. As shown in Table 5, all models achieved competitive performance, with macro-F1 scores over 90% for the Size, Effusion, Lymph, and Negation entity types. For complex entities with varied expressions, i.e., the PAOP, Vessel, and Pleura types, performance differed significantly between models. Specifically, BERT provided the largest improvement because it can dynamically adjust embeddings according to the current context to capture more meaningful semantic information. For instance, we notice that abnormal-finding tokens such as "增厚 (thickening)", "凹陷 (indentation)", and "截断 (truncated)" are labeled as Pleura or Bronchus entities depending on the context; BERT can provide different embeddings for the same token depending on its context, so the BERT-based models achieve better results. Besides, the self-attention mechanism brings some benefits for recognizing complex entities owing to its ability to capture global long-range dependencies and maximize the useful context-related information in the sentence. Moreover, for the two longest entity types, i.e., Vertebral Body and Vessel, pre-training leads to significant macro-F1 improvements, from 70.67% to 82.41% and from 59.05% to 65.95%, respectively. This is because pre-training can capture more general linguistic phenomena from unlabeled text, which helps the model identify long entities.
Table 5

The exact match macro-f1 scores of the proposed and benchmark models about 14 types of entities

Entity type | FastText-Transformer | FastText-BiLSTM | FastText-BTN | BERT-Transformer | BERT-BiLSTM | BERT-BTN | BERT-fine-tune | BERT-BTN (pre-training)
Vessel | 58.01 ± 11.71 | 54.54 ± 3.20 | 56.63 ± 13.56 | 47.42 ± 14.93 | 58.48 ± 5.28 | 59.05 ± 0.51 | 57.31 ± 18.21 | 65.95 ± 7.63
Vertebral Body | 62.67 ± 24.11 | 59.01 ± 22.65 | 65.02 ± 12.48 | 74.41 ± 21.69 | 63.70 ± 24.61 | 70.67 ± 16.36 | 55.71 ± 35.11 | 82.41 ± 10.86
PAOP | 60.54 ± 11.24 | 62.43 ± 8.84 | 66.94 ± 10.75 | 55.27 ± 18.44 | 75.49 ± 9.25 | 74.50 ± 12.46 | 72.32 ± 11.29 | 78.97 ± 11.53
Bronchus | 60.55 ± 7.26 | 72.01 ± 3.48 | 70.79 ± 6.93 | 76.30 ± 7.61 | 79.81 ± 9.71 | 80.15 ± 5.64 | 82.13 ± 5.24 | 79.37 ± 4.06
Pleura | 66.71 ± 10.56 | 84.22 ± 7.32 | 82.46 ± 7.53 | 79.64 ± 7.51 | 85.61 ± 5.83 | 84.30 ± 2.60 | 85.10 ± 6.41 | 85.57 ± 3.87
Shape | 69.90 ± 7.34 | 77.52 ± 2.95 | 73.56 ± 2.64 | 79.25 ± 9.93 | 82.00 ± 3.40 | 80.69 ± 2.77 | 81.65 ± 1.79 | 82.00 ± 2.99
Density | 84.33 ± 1.19 | 81.37 ± 3.23 | 83.85 ± 1.65 | 85.49 ± 8.00 | 87.46 ± 2.51 | 88.75 ± 2.55 | 87.21 ± 4.56 | 86.08 ± 1.88
Mass | 80.16 ± 2.28 | 82.35 ± 3.16 | 83.04 ± 2.11 | 84.99 ± 7.43 | 84.76 ± 2.72 | 85.13 ± 2.47 | 77.44 ± 6.07 | 85.41 ± 3.78
Enhancement | 74.51 ± 5.76 | 80.66 ± 2.81 | 76.54 ± 10.95 | 87.70 ± 7.77 | 85.29 ± 5.76 | 84.33 ± 7.36 | 80.24 ± 14.90 | 84.27 ± 6.36
Size | 93.30 ± 1.65 | 95.58 ± 1.09 | 95.58 ± 1.08 | 95.70 ± 4.59 | 95.63 ± 1.78 | 96.05 ± 1.39 | 96.03 ± 0.87 | 95.70 ± 1.32
Location | 83.87 ± 6.41 | 86.84 ± 2.46 | 86.87 ± 1.65 | 89.00 ± 3.58 | 91.36 ± 0.66 | 91.59 ± 0.97 | 88.55 ± 4.00 | 90.60 ± 2.54
Lymph | 90.51 ± 4.00 | 94.30 ± 2.34 | 94.16 ± 3.24 | 93.13 ± 1.17 | 93.65 ± 7.06 | 93.60 ± 3.66 | 94.09 ± 2.26 | 91.98 ± 3.46
Negation | 98.56 ± 0.41 | 98.97 ± 0.40 | 98.58 ± 0.58 | 98.45 ± 2.97 | 98.84 ± 0.39 | 98.30 ± 0.22 | 94.59 ± 8.53 | 98.79 ± 0.38
Effusion | 96.12 ± 1.72 | 97.84 ± 0.18 | 97.82 ± 1.07 | 96.62 ± 4.10 | 95.61 ± 3.47 | 98.01 ± 1.80 | 97.78 ± 0.92 | 96.52 ± 0.48

Bold values indicate the best score for each evaluation metric


Discussion

In this study, we proposed a novel deep learning method, namely BERT-BTN with pre-training, to recognize 14 types of clinical entities from chest CT reports for lung cancer screening and TNM staging. The results in Tables 4 and 5 indicate that models with the BERT embedding obtain a significant improvement compared with models with the FastText embedding. Besides, the Transformer provides an overall performance improvement, and positional information has an important impact on the ability of Transformer-based models to recognize entities. Pre-training yields significant improvements in both recognition accuracy and convergence speed. Also, fine-tuning BERT directly on some domain-specific tasks may not achieve satisfactory results. The experimental results indicate that the proposed method can efficiently recognize various clinical entities related to lung cancer screening and staging, which shows its potential for further clinical decision-making and academic research. Although the proposed method achieves competitive overall performance on the NER task, our work has some limitations. First, some entity types are still not accurately recognized. As shown in Tables 5 and 6, the Vessel type is not recognized as well as the other types. The first reason may be that the number of Vessel entities is small, so a single inaccurate recognition can significantly reduce its accuracy. Second, the average length of Vessel entities is much longer and their patterns are more complex than those of the other entities, which makes it difficult to identify entity boundaries. When a Vessel entity contains other frequently occurring entity types, such as Mass and Location entities, it is a challenge for the model to recognize the whole Vessel entity exactly.
For instance, the phrase "右肺动脉分支局限性管腔变窄 (the lumen of a right pulmonary artery branch is locally narrowed)" was annotated as a Vessel entity, while our model identified the token "狭窄 (narrowed)" in this phrase as a Bronchus entity. One straightforward approach is to obtain more labeled data containing such entities to train our model. Zhao et al. [44] showed that training on a specific-domain dataset provided better performance than training on a large, general-domain dataset. Moreover, using a larger Chinese clinical corpus to train the BERT-based embedding may be another way to improve the recognition of long and complex entities.
Table 6

The inexact match macro-f1 scores of the proposed and benchmark models about 14 types of entities

Entity type | FastText-Transformer | FastText-BiLSTM | FastText-BTN | BERT-Transformer | BERT-BiLSTM | BERT-BTN | BERT-fine-tune | BERT-BTN (pre-training)
Vessel | 68.74 ± 5.78 | 59.25 ± 8.77 | 62.63 ± 9.30 | 51.06 ± 13.44 | 67.00 ± 9.22 | 73.48 ± 12.46 | 70.74 ± 10.08 | 77.25 ± 6.73
Vertebral Body | 91.81 ± 7.22 | 92.75 ± 6.39 | 93.81 ± 8.52 | 85.95 ± 6.79 | 80.00 ± 27.39 | 85.79 ± 11.08 | 85.24 ± 10.16 | 91.69 ± 13.64
PAOP | 76.25 ± 7.62 | 73.30 ± 14.43 | 77.31 ± 14.70 | 83.17 ± 4.08 | 91.16 ± 1.23 | 93.80 ± 3.13 | 85.44 ± 10.16 | 93.93 ± 2.26
Bronchus | 73.86 ± 9.92 | 85.79 ± 6.32 | 83.21 ± 5.63 | 83.74 ± 3.51 | 93.27 ± 3.88 | 90.54 ± 3.49 | 89.67 ± 4.51 | 91.83 ± 3.00
Pleura | 85.45 ± 7.95 | 93.70 ± 3.08 | 91.71 ± 4.50 | 84.78 ± 3.55 | 96.17 ± 1.77 | 95.25 ± 2.62 | 94.36 ± 4.48 | 96.20 ± 2.24
Shape | 87.30 ± 6.44 | 89.73 ± 1.43 | 88.04 ± 2.74 | 89.01 ± 1.09 | 94.51 ± 1.71 | 93.96 ± 2.10 | 93.28 ± 2.13 | 92.41 ± 1.74
Density | 91.68 ± 1.34 | 91.06 ± 1.77 | 92.72 ± 1.01 | 92.45 ± 1.83 | 93.34 ± 1.16 | 96.17 ± 2.36 | 94.86 ± 2.81 | 95.49 ± 0.78
Mass | 95.32 ± 2.39 | 96.20 ± 1.11 | 96.83 ± 1.11 | 94.34 ± 1.06 | 97.01 ± 0.61 | 97.93 ± 0.84 | 95.38 ± 3.96 | 96.86 ± 0.75
Enhancement | 89.62 ± 5.16 | 92.68 ± 3.00 | 89.04 ± 7.50 | 92.74 ± 1.89 | 95.48 ± 3.98 | 95.28 ± 3.2 | 94.51 ± 4.86 | 95.58 ± 2.98
Size | 98.61 ± 0.46 | 98.44 ± 0.83 | 98.34 ± 0.46 | 97.74 ± 0.74 | 99.04 ± 0.62 | 99.03 ± 0.59 | 98.45 ± 0.55 | 98.62 ± 0.77
Location | 93.52 ± 2.18 | 95.33 ± 1.05 | 95.48 ± 0.48 | 93.77 ± 0.92 | 97.46 ± 0.53 | 97.41 ± 0.61 | 94.51 ± 3.35 | 97.24 ± 2.08
Lymph | 99.58 ± 0.37 | 99.78 ± 0.18 | 99.71 ± 0.30 | 95.85 ± 2.54 | 99.13 ± 0.28 | 98.25 ± 1.76 | 99.41 ± 0.57 | 98.60 ± 2.28
Negation | 98.88 ± 0.36 | 99.06 ± 0.28 | 98.66 ± 0.52 | 99.05 ± 0.14 | 99.08 ± 0.30 | 98.97 ± 0.25 | 99.00 ± 0.25 | 98.88 ± 0.11
Effusion | 99.51 ± 0.28 | 99.31 ± 0.59 | 99.09 ± 0.64 | 97.65 ± 1.01 | 99.05 ± 1.31 | 99.08 ± 0.79 | 99.18 ± 0.61 | 99.30 ± 1.00

Bold values indicate the best score for each evaluation metric

Second, as shown in Tables 5 and 6 and Figs. 5 and 6, the performance differences between the inexact match and exact match schemes indicate that some entities, e.g., the Vessel and Vertebral Body types, were only partially recognized. Yu et al. [45] presented a model that labels the start and end positions separately in a cascade structure and decodes them together with a multi-span decoding algorithm. They found that predicting end positions can benefit from the predicted start positions, which may help narrow the gap between exact match and inexact match. In the future, we can try this strategy to explore whether it further improves performance.
Fig. 5

Comparison of the proposed and benchmark models about 14 types of named entities under exact match scheme

Fig. 6

Comparison of the proposed and benchmark models about 14 types of named entities under inexact match scheme


Conclusion

In this paper, we proposed a novel deep learning method, namely BERT-BTN with pre-training, to extract 14 types of clinical entities from Chinese chest CT reports for lung cancer screening and TNM staging. The experimental results show that our model outperforms the benchmark BERT-BTN, BERT-BiLSTM, BERT-fine-tune, BERT-Transformer, FastText-BTN, FastText-BiLSTM, and FastText-Transformer models and achieves the best macro-F1 score of 85.96%, which shows great potential for further utilization in clinical decision support and academic research.

Additional file 1: A guideline for annotating 14 types of clinical entities from chest CT reports for lung cancer screening and TNM staging.
