Literature DB >> 35602188

A BERT model generates diagnostically relevant semantic embeddings from pathology synopses with active learning.

Youqing Mu1, Hamid R Tizhoosh2, Rohollah Moosavi Tayebi1, Catherine Ross1,3, Monalisa Sur1,3, Brian Leber1,3, Clinton J V Campbell1,3.   

Abstract

Background: Pathology synopses consist of semi-structured or unstructured text summarizing the visual information obtained by observing human tissue. Experts with deep domain-specific knowledge write and interpret these synopses to extract tissue semantics and formulate a diagnosis in the context of ancillary testing and clinical information. The limited number of specialists available to interpret pathology synopses restricts the utility of the inherent information. Deep learning offers a tool for information extraction and automatic feature generation from complex datasets.
Methods: Using an active learning approach, we developed a set of semantic labels for bone marrow aspirate pathology synopses. We then trained a transformer-based deep-learning model to map these synopses to one or more semantic labels, and extracted learned embeddings (i.e., meaningful attributes) from the model's hidden layer.
Results: Here we demonstrate that with a small amount of training data, a transformer-based natural language model can extract embeddings from pathology synopses that capture diagnostically relevant information. On average, these embeddings can be used to generate semantic labels mapping patients to probable diagnostic groups with a micro-average F1 score of 0.779 ± 0.025.
Conclusions: We provide a generalizable deep learning model and approach to unlock the semantic information inherent in pathology synopses toward improved diagnostics, biodiscovery and AI-assisted computational pathology.
© The Author(s) 2021.


Keywords:  Haematological cancer; Pathology

Year:  2021        PMID: 35602188      PMCID: PMC9053264          DOI: 10.1038/s43856-021-00008-0

Source DB:  PubMed          Journal:  Commun Med (Lond)        ISSN: 2730-664X


Introduction

Making a diagnosis in pathology is a complex intellectual process, involving the integration of information from multiple pathological and clinical sources[1]. The pathologist’s central role is to extract visual information from microscopic features of human tissue (morphology), thereby lowering the uncertainty about a suspected disease state[2]. This information is then transferred into a written pathology report, which is synthesized in the context of the inherent world model and the knowledge accrued by the pathologist over many years. Therefore, a pathology report encodes the intrinsic semantics of tissue morphology, which must then be captured and interpreted by an expert reader in the context of their own world model and domain-specific knowledge. This requires years of specialized training, as pathologists often do not make a specific diagnostic interpretation[3]. Rather, a diagnosis often consists of semantic information extracted from the pathology specimen, ancillary testing, and the clinical history, described as either unstructured or semi-structured text (called a synopsis). A pathology synopsis may give one or more probable diagnoses (i.e., a differential diagnosis) or may simply describe the salient morphological information without a differential diagnosis, leaving it to the expert end-reader to extract the semantic content. The reader must then map this semantic content to one of a small number of core concepts that help decide the appropriate next steps and diagnosis. This poses a challenge for knowledge mining given the finite number of experts who can do this, especially when scaled to a large number of synopses. Tools to automatically extract the morphological semantics from pathology synopses would therefore have high value in both the research and clinical domains.
For example, automated annotation of pathology synopses with semantic labels would provide a clinical diagnostic support tool by unlocking the semantics for less experienced interpreters, and a means for knowledge mining by searching large databases of synopses for semantically similar content. Furthermore, the field of pathology is now transitioning to using digitally captured whole-slide images (WSI) for primary diagnosis (digital pathology)[4]. Scalable annotation of large WSI datasets with semantic labels from associated synopses will be essential toward developing computational pathology approaches for diagnostic support[5]. Artificial intelligence (AI) aspires to create human-like intelligence[6]. Successful AI schemes consist largely of numerous statistical and computer science techniques collectively known as machine learning (ML)[7,8]. ML algorithms automatically extract information from data (i.e., learning, or knowledge acquisition) and then use this knowledge to make generalizations about the world[8]. Some notable examples of successful applications of ML include classifying and analyzing digital images[9] and extracting meaning from natural language (natural language processing, NLP)[10]. One particular type of ML, called deep learning (DL), has been extremely successful in many of these tasks, particularly in image and language analysis[11]. DL algorithms are roughly modeled after the neural structure of the human brain, learning automatically to make representations from data as a hierarchy of concepts from simple to more complex [11], a pyramidal multi-resolution approach that should not be foreign to any pathologist. Activation weights within the different layers of the network can be adjusted according to input data, and then used to approximate a function that predicts outputs on new, unseen data[11]. 
The information extracted from data by DL can be represented as a set of real numbers known as “features”; within a neural network, low-dimensional embeddings of features are created to represent information as feature vectors[11]. The feature vectors produced by DL can then be used for a wide array of downstream applications, including image analysis and numerous NLP tasks such as language translation[9,12-14]. Recently, a DL model called a transformer has emerged at the forefront of the NLP field[15]. Compared to previous DL-based NLP methods that mainly relied on gated recurrent neural networks with added attention mechanisms, transformers rely exclusively on attention and avoid a recurrent structure to learn language embeddings[15]. In doing so, transformers process sentences or short text holistically, learning the syntactic relationship between words through multi-headed attention mechanisms and positional word embeddings[15]. Consequently, they have shown high success in the fields of machine translation and language modeling[15,16]. Specifically, Google recently introduced Bidirectional Encoded Representations of Transformers (BERT), a transformer architecture that serves as an English language model trained on a corpus of over 800 million words in the general domain[13]. BERT encodes bidirectional representations of text using self-supervision, allowing for rich embeddings that capture meaning in human language (i.e., syntax and semantics). A classification (CLS) feature vector is an output from the last layer of the BERT model representing the embedding that captures syntactic and semantic information from the input text, which can be used to train additional ML models such as a classifier[13]. Importantly, BERT can be easily adapted to new domains by transfer learning with minimal fine-tuning, providing an ideal language model for specialized domains such as medicine[13,17,18]. 
In the pathology domain, NLP methods have mainly consisted of handcrafted rule-based approaches to extract information from reports or synopses, followed by traditional ML methods such as decision trees for downstream classification[19-23]. Several groups have recently applied DL approaches to analyzing pathology synopses, focusing on keyword extraction rather than the generation of semantic embeddings[24-27]. These approaches also required manual annotation of large numbers of pathology synopses by expert pathologists for supervised learning, limiting scalability and generalization[28]. The requirement for large-scale annotation has been a key obstacle to the supervised training of DL models in specialized domains such as pathology, given the task’s tediousness and the lack of experts with domain-specific knowledge to sufficiently label training data[29]. One approach to help mitigate this problem is known as active learning, where, instead of random samples, specific samples that are underrepresented or expose weaknesses in model performance are queried and labeled as the training data[30]. In this way, a relatively small amount of labeled training data can be generalized to reach a given level of accuracy and scaled to large unlabeled datasets[30-32]. The ideal NLP approach for analyzing pathology synopses would automatically extract features (i.e., require no manual feature engineering), generate embeddings that capture the inherent rich semantic information, and be rapidly trainable and generalizable using a relatively small amount of expert-labeled data. In hematopathology, a bone marrow study is the foundation of making a hematological diagnosis, and consists of both a solid tissue histopathology component, called a trephine core biopsy, and a liquid cytology component, called an aspirate.
As per International Council for Standardization in Hematology standards, an aspirate synopsis presents the morphological information in the specimen extracted by a hematopathologist in a field:description format. Each field contains a semantic summary of the pathologist’s visual interpretation of key elements of a bone marrow specimen, such as adequacy, cellularity, and the status of each hematopoietic cell lineage[33]. These synopses must then be interpreted by an expert end-reader such as a hematologist, who extracts the semantic information and then maps this to one or more core semantic labels, either “normal”, or one of various “abnormal” labels (Fig. 1 and Table 1). These conceptual labels may rarely represent a specific diagnosis; more commonly, they represent broad diagnostic categories or descriptive morphological findings[34]. The hematologist must then integrate these core semantic labels with bone marrow histopathology, ancillary testing, and clinical findings to decide on the most appropriate differential diagnosis and next steps. Often, these semantic labels do not appear in the synopsis; for example, the hematologist may map the content to the semantic label of “normal” based upon their own interpretation, but the word normal may not appear in the synopses. Therefore, bone marrow aspirate synopses form the ideal basis for evaluating NLP tools to extract embeddings that capture morphological semantics.
Fig. 1

Generation of semantic labels for bone marrow aspirate synopses and modeling process.

An expert reader (a clinical hematologist) interprets semi-structured bone marrow aspirate synopses and maps their contents to one or more semantic labels, which impact clinical decision-making. To train a model to assign semantic labels to bone marrow aspirate synopses, a synopsis is first converted into a single text string and then tokenized into an input vector. The input vector passes through BERT and the classifier. The final output is a vector of size 21 (the number of semantic labels in our study), which is compared with the ground-truth vector to adjust the network weights.

Table 1

The evolution of the semantic labels.

Iteration | New labels | Label count | Sample count
1 | Acute lymphoblastic leukemia, acute myeloid leukemia, inadequate, lymphoproliferative disorder, mastocytosis, metastatic, myelodysplastic syndrome, myeloproliferative neoplasm, normal, plasma cell neoplasm | 10 | 50
2 | Erythroid hyperplasia, iron deficiency | 12 | 83
3 | Acute leukemia, acute promyelocytic leukemia, chronic myeloid leukemia, hemophagocytosis, hypercellular, hypocellular | 18 |
4 | Basophilia, eosinophilia | 20 | 282
5 | (none) | 20 | 296
6 | Granulocytic hyperplasia | 21 | 344
7 | (none) | 21 | 393
8 | (none) | 21 | 408
9 | (none) | 21 | 500

In each iteration, new cases and/or new labels are added to the dataset. In some iterations, we reviewed the labeled cases and added new labels to the previous cases, or added a small number of new semantic labels.

Accordingly, here we employ a BERT-based NLP model to automatically extract features and generate low-dimensional embeddings from bone marrow aspirate pathology synopses. We then apply a simple single-layer neural network classifier to map these embeddings to one or more semantic labels, as a hematopathologist would. We approach this problem as a multi-label classification using a binary relevance (BR) method, where multiple semantic labels are turned into multiple binary predictions. The model performs well in label prediction (micro-average F1 score of 0.779 ± 0.025; 0.778 ± 0.034 when evaluated by expert hematopathologists[35]). Using dimensionality reduction, chord diagrams, and a word-knockout approach, we show that the model’s embeddings capture diagnostically relevant semantic information from pathology synopses. Importantly, our model was trained using <5% of our starting dataset of over 11,000 pathology synopses using an active learning approach, with minimal manual data annotation by expert pathologists.
Our model[36] provides an efficient, scalable and generalizable scheme to unlock the semantic information from pathology synopses with relatively little data annotation by pathologists. We see the high relevance of our model and approach to knowledge mining, improved diagnostics and biodiscovery. A schematic illustration of our overall modeling pathway is shown in Fig. 1.

Methods

Pathology synopses data and preprocessing

Our study was approved by the Hamilton Integrated Research Ethics Board, study protocol 7766-C. As this study was a retrospective chart review, it was approved by the REB with waiver of consent. We collected 11,418 historical synopses for bone marrow specimens spanning April 2001 to December 2019. The original text data were saved in a spreadsheet file. Due to the format’s limitations, the synopsis structure was lost and fields were mixed with descriptions. In addition, noise (i.e., irrelevant information), including doctors’ signatures and context from the reporting system, was present in the text. Here, we used our Python program[36] to remove the signatures, inline spaces, trailing spaces, and the reporting-system context. Reducing text noise likely helped the model learn the semantic information in this dataset more effectively, and the cleaned synopses were also easier for experts to read and label.
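The cleanup steps described above can be sketched with standard regular expressions. This is a minimal sketch, not the authors' program: the signature pattern below is a hypothetical stand-in, since the actual signature and reporting-system strings are specific to their data.

```python
import re

def clean_synopsis(text):
    """Remove common noise from a raw synopsis string (a minimal sketch;
    the signature pattern here is a hypothetical example)."""
    # Drop pathologist signature lines, e.g. "Reported by Dr. ..."
    text = re.sub(r"(?im)^\s*reported by dr\..*$", "", text)
    # Collapse runs of inline whitespace to a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip trailing spaces on each line, and blank lines at the ends
    text = "\n".join(line.rstrip() for line in text.splitlines())
    return text.strip()
```

The same pattern-based approach would extend to other reporting-system artifacts by adding one substitution per noise source.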

Active learning

Only the primary dataset of 50 cases was randomly sampled; it was used to train the first model. The model then predicted the labels of the remaining ~11,000 unlabeled cases. We then randomly sampled (Threshold − Num(label)) cases from each rare-label group based on the model’s predictions. These candidate cases with rare labels (CRL candidates) were checked by hematopathologists, who verified their labels. They were then integrated with the existing dataset to create a new dataset, on which a new model was trained. We repeated this process until every label had more cases than the threshold number. We heuristically set the threshold to 20, meaning that labels with fewer than 20 samples were considered rare. In the early iterations (iterations 1–5), the threshold was lowered to 10 and 15 to enrich fewer cases at a time, so that the hematopathologists would not be overwhelmed by the labeling. Iterations consisted of adding new labels and/or editing the previous labels (Table 1). As a result, the number of new labels varied in each iteration, and we did not fix how many samples the dataset was enriched by per iteration (Algorithm 1). If we had still been finding new semantic labels, or if the hematopathologists had judged from experience that the identified semantic labels did not cover most cases’ semantic information, we would have raised the threshold and sampled more cases. We did not discover new semantic labels during the last three iterations (Table 1), and our hematopathologists confirmed that the labels covered the semantic information of most cases, suggesting that the labeling was sufficient and CRL sampling had achieved its goals.

Algorithm 1: Active learning process

Result: a balanced dataset with more than 20 cases for each label
dataset = {50 randomly sampled cases};
while COUNT(rareLabels) > 0, where rareLabels = {label : COUNT(case) < 20} do
    Sampling process; // see Algorithm 2
    while COUNT(candidates) > 100 do
        threshold = threshold − 5;
        Sampling process; // see Algorithm 2
    end
    pathologists verify CRL candidates’ labels and may add new labels;
    dataset = dataset ∪ verified CRL;
end

Algorithm 2: Sampling process

Result: CRL candidates
candidates = ∅;
for label in rareLabels do
    randomly sample (threshold − COUNT(existedCases)) CRL candidates from the predicted label group;
    candidates.append(CRL candidates);
end
return candidates;
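Algorithm 2's sampling step can be sketched in plain Python. The data structures here (a dict of per-case predicted label sets and a dict of per-label counts of already-labeled cases) are representations we chose for illustration, not the authors' code.

```python
import random

def sample_crl(predictions, labeled_counts, threshold):
    """Sample candidate cases with rare labels (CRL), per Algorithm 2.

    predictions:    dict mapping case id -> set of model-predicted labels
    labeled_counts: dict mapping label -> number of labeled cases so far
    """
    # A label is rare if it has fewer than `threshold` labeled cases
    rare = [lbl for lbl, n in labeled_counts.items() if n < threshold]
    candidates = []
    for lbl in rare:
        # Cases the model predicts to carry this rare label
        pool = [c for c, labels in predictions.items() if lbl in labels]
        # Draw up to (threshold - existing count) candidates
        k = min(threshold - labeled_counts[lbl], len(pool))
        candidates.extend(random.sample(pool, k))
    return candidates
```

The verified candidates would then be merged into the labeled dataset before the next training iteration, as in Algorithm 1.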

Model training

Our overall process can be regarded as multi-label classification, a type of supervised learning problem in which an instance may be associated with multiple labels. This is different from the traditional task of single-label classification (i.e., multi-class or binary), where each sample is associated with only a single class label[37]. We approach this classification by problem transformation, which transforms the multi-label problem into one or more single-label classification problems. We used the most common problem transformation method, the BR method[38], to transform the multi-label prediction into multiple single binary predictions. As a result, each case’s semantic label was converted into a binary vector of size 21, the number of distinct individual labels, to frame the training as multiple binary predictions. Sentences in descriptions were combined into a single text string using our augmentation methods. The text was tokenized to form an input vector, which was the concatenation of “input IDs”, “attention mask”, and “token type IDs”. The input IDs were the numerical representations of the words in the text; the attention mask was used to batch texts together; and the token type IDs distinguished the segments of the input, including the classifier token [CLS]. The input vector went through BERT’s 12 encoder layers. Each layer applied self-attention and passed its results through a feed-forward network to the next encoder. The output at the special [CLS] token was used as the input for a classifier. The classifier consisted of a dropout layer with a 0.5 dropout rate to improve generalization and a fully connected layer with 21 nodes. It took a vector of size 768 from [CLS] as input and computed a logit vector of size 21 as output. In prediction, the sigmoid function (Eq. 1)[39] turned each logit into a prediction score between 0 and 1:

σ(x) = 1 / (1 + e^(−x))   (1)

The final output was a vector of size 21, where each entry denoted the model’s confidence that the corresponding label is true.
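As a minimal illustration of the prediction step, the sigmoid of Eq. 1 and binary-relevance thresholding can be written in plain Python. The 0.5 cutoff is an assumed default for illustration, not a value stated in the text.

```python
import math

def sigmoid(x):
    # Eq. 1: maps a raw logit to a score in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, label_names, threshold=0.5):
    """Binary relevance: each of the 21 logits is scored independently,
    and every label whose score exceeds the cutoff is predicted."""
    scores = [sigmoid(z) for z in logits]
    return [name for name, s in zip(label_names, scores) if s > threshold]
```

Because each label is thresholded independently, a single synopsis can receive zero, one, or several semantic labels.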
We treat each label independently and use binary cross-entropy (Eq. 2) to calculate the loss, where N is the batch size and σ is the sigmoid function:

Loss = −(1/N) · Σ_{i=1..N} Σ_{j=1..21} [ y_ij · log σ(x_ij) + (1 − y_ij) · log(1 − σ(x_ij)) ]   (2)

With the loss value, we used the Adam algorithm with weight decay fix[40] (weight decay = 1e−2, learning rate = 1e−3) to fine-tune the network weights interconnecting the layers (Fig. 1), using HuggingFace’s Transformers[41], a Python package. The labeled case set was randomly split into a training set (80%) and a validation set (20%). We trained models on the training set for ten epochs, saved the model after each epoch, and compared the saved models by the micro-average F1 score on the validation set. The best-performing model was later used to predict the labels. During the active learning stage, to make sure the training set included all labels, so that the model could learn all of them and help sample CRL, we first assigned at least one case of each label to the training set, then randomly distributed the rest between the training and validation sets to achieve the 80/20 split. After the active learning stage, we used a modified Monte Carlo cross-validation (MCCV) (Algorithm 3)[42], adapted by us to guarantee that the validation set has at least a certain number of cases for each label, to create four final datasets, from which we trained four final models. Experts reviewed the predictions from these models, whereas the embeddings were taken from one randomly selected final model.
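Eq. 2 can be restated directly in Python. This is a plain sketch of binary cross-entropy over a batch for checking the formula, not the authors' implementation (which used HuggingFace/PyTorch).

```python
import math

def bce_loss(logits, targets):
    """Binary cross-entropy over a batch (Eq. 2), averaged by batch size N.

    logits:  list of per-case logit vectors (size 21 each in the paper)
    targets: matching list of 0/1 vectors (binary relevance encoding)
    """
    n = len(logits)
    total = 0.0
    for logit_vec, target_vec in zip(logits, targets):
        for z, y in zip(logit_vec, target_vec):
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid, as in Eq. 1
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / n
```

For a logit of 0 and a positive target the per-term loss is log 2, which is a quick sanity check on the sign conventions.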

Algorithm 3: The adapted MCCV process

Data: cases, validationSizeRatio
Result: trainSet, validationSet
trainSet = ∅; validationSet = ∅; tmpSet = ∅;
validationSize = len(cases) × validationSizeRatio;
minValidationCaseNum = min(COUNT(case) per label) × validationSizeRatio;
random.shuffle(cases);
for case in cases do
    if any label of case has COUNT(validationCases with that label) < minValidationCaseNum then
        validationSet.add(case);
    else
        tmpSet.add(case);
    end
end
random.shuffle(tmpSet);
for case in tmpSet do
    if len(validationSet) < validationSize then
        validationSet.add(case);
    else
        trainSet.add(case);
    end
end
return trainSet, validationSet;
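A minimal Python sketch of the adapted MCCV split follows, assuming each case is an (id, label-set) pair; this representation and the seed parameter are ours, chosen for illustration.

```python
import random

def mccv_split(cases, ratio=0.2, seed=None):
    """Adapted MCCV (Algorithm 3): the validation set is guaranteed a
    minimum number of cases for each label before the remainder is
    distributed randomly. cases: list of (case_id, set_of_labels)."""
    rng = random.Random(seed)
    label_counts = {}
    for _, labels in cases:
        for lbl in labels:
            label_counts[lbl] = label_counts.get(lbl, 0) + 1
    min_per_label = max(1, int(min(label_counts.values()) * ratio))
    val_size = int(len(cases) * ratio)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    val, tmp, train = [], [], []
    val_counts = {lbl: 0 for lbl in label_counts}
    # First pass: fill per-label quotas in the validation set
    for case in shuffled:
        if any(val_counts[lbl] < min_per_label for lbl in case[1]):
            val.append(case)
            for lbl in case[1]:
                val_counts[lbl] += 1
        else:
            tmp.append(case)
    # Second pass: top up the validation set, send the rest to training
    rng.shuffle(tmp)
    for case in tmp:
        if len(val) < val_size:
            val.append(case)
        else:
            train.append(case)
    return train, val
```

Repeating this split with different random draws yields the four final datasets described above.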

Synopsis conversion and augmentation

The semi-structured synopses first needed to be converted into single text instances. As the schema of a synopsis is a table of field:description pairs, and the order of the table columns does not influence its content, we could construct the text using different orders of the synopses’ parts, i.e., columns (Supplementary Fig. S1 and Supplementary Table S1). In the computer vision field, data augmentation, a technique that increases the diversity of the training set by applying transformations such as image rotation, is commonly used to address data insufficiency[43]. These transformations introduce changes but preserve the data’s core patterns, and therefore act as regularizers that reduce overfitting during training[44]. Likewise, because the order of the synopsis components is irrelevant to the semantic content, we could randomly shuffle their sequence to produce different text strings and thereby augment the dataset. This augmentation could also be applied at prediction time (Supplementary Fig. S2): we shuffled the fields with their descriptions to create different text representations, the model computed prediction scores on all of them, and by taking the maximum value for each label’s score across these representations, we obtained an augmented prediction.
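The shuffling augmentation and the max-aggregation at prediction time can be sketched as follows; the field-pair representation, view count, and "field: description" join format are our assumptions for illustration.

```python
import random

def augment_synopsis(fields, n_views=4, rng=None):
    """Create multiple text views of one synopsis by shuffling its
    field:description pairs; the order does not change the semantics.
    fields: list of (field, description) tuples."""
    rng = rng or random.Random()
    views = []
    for _ in range(n_views):
        shuffled = list(fields)
        rng.shuffle(shuffled)
        views.append(" ".join(f"{f}: {d}" for f, d in shuffled))
    return views

def augmented_prediction(score_vectors):
    """Aggregate the per-view score vectors by taking the maximum
    score for each label, as described for augmented prediction."""
    return [max(scores) for scores in zip(*score_vectors)]
```

In use, the model would score every view from `augment_synopsis`, and `augmented_prediction` would combine those vectors into the final per-label scores.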

Evaluation

We reviewed the NLP system’s performance in predicting labels using precision and sensitivity measures[45]. We recorded specificity, accuracy, and F1-score values based on the counts of true positives (hits), false positives (false hits), true negatives (correct rejections), and false negatives (misses) for each prediction. These performance measures are defined as follows:

Precision (positive predictive value, PPV) = TP / (TP + FP)

Sensitivity (recall or hit rate) = TP / (TP + FN)

F1-score (harmonic mean of precision and sensitivity) = 2 × Precision × Sensitivity / (Precision + Sensitivity)

We used the micro-average F1-score, i.e., the F1-score of all labels’ aggregated contributions, to represent overall performance. Micro-averaging emphasizes the common labels of the dataset because it puts the same importance on each sample. This was suitable for our problem, as labels that were very uncommon in the dataset were not intended to notably affect the overall F1-score if the model performed well on the other, more common labels. The micro-average F1-score[46] is defined by the same formula, with TP, FP, and FN aggregated over all labels.
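The micro-average F1 computation described above (aggregating TP/FP/FN over all labels before computing precision and recall) can be sketched as:

```python
def micro_f1(true_sets, pred_sets, labels):
    """Micro-average F1: pool TP/FP/FN across all labels and samples,
    then precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2PR/(P+R)."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        for lbl in labels:
            if lbl in p and lbl in t:
                tp += 1
            elif lbl in p:
                fp += 1
            elif lbl in t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because the counts are pooled before averaging, rare labels contribute in proportion to how often they occur, which is the behavior the text describes.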

Word knockout

We removed a word from a synopsis and used the model to predict each label’s score. We compared the outputs with the original outputs. Since all other factors remained unaltered, any change in the output was caused by the removed word alone. We call this change the “influence score” (INF) (Supplementary Fig. S3). We repeated the same computation for every word in the 500 labeled synopses’ descriptions. We grouped the influence scores by the synopses’ semantic labels and calculated their sum. We then normalized each word’s summed influence score by dividing it by the L2-norm of Λ (Eq. 3), where Λ = {INF : label/word = x}.
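A minimal sketch of the knockout computation follows, with `score_fn` standing in for the trained model (a stand-in callable we introduce for illustration; it maps a text string to a vector of label scores).

```python
def influence_scores(words, score_fn, label_index):
    """Word knockout: a word's influence is the drop in the model's
    score for one label when that word is removed from the text.

    words:       the synopsis description as a list of words
    score_fn:    callable, text string -> list of per-label scores
    label_index: index of the label being probed
    """
    base = score_fn(" ".join(words))[label_index]
    scores = {}
    for i, w in enumerate(words):
        # Re-score the text with word i knocked out
        knocked = " ".join(words[:i] + words[i + 1:])
        scores[w] = base - score_fn(knocked)[label_index]
    return scores
```

Summing these per-occurrence scores across synopses grouped by label, then normalizing by the L2-norm as in Eq. 3, would yield the normalized influence scores used in the analysis.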

Replication and blinding

This study’s procedure is programmed as a pipeline in our supplied software. The process was repeated four times on the same local servers to ensure repeatability, and it was also partly run once on Google Colab to ensure hardware independence. We also provide a Jupyter Notebook, “demo_BERT_active_learning.ipynb”, in our supplied software to guide other researchers in replicating our study. Blinding is not relevant, as all data were de-identified and the study design did not entail a blinding step. Researchers trained ML models to predict diagnostic labels, and hematopathologists reviewed model performance in predicting them; the pathologists were not aware of the original diagnostic labels when evaluating model performance.