
Attention based automated radiology report generation using CNN and LSTM.

Mehreen Sirshar¹, Muhammad Faheem Khalil Paracha¹, Muhammad Usman Akram¹, Norah Saleh Alghamdi², Syeda Zainab Yousuf Zaidi³, Tatheer Fatima⁴.

Abstract

The automated generation of radiology reports from X-rays has tremendous potential to enhance the clinical diagnosis of diseases in patients. A research direction that is gaining increasing attention involves hybrid approaches based on natural language processing and computer vision techniques to create automatic medical report generation systems. An automatic report generator that produces radiology reports would significantly reduce the burden on doctors and assist them in writing reports manually. Because the sensitivity of chest X-ray (CXR) findings provided by existing techniques is not adequately accurate, producing comprehensive explanations for medical images remains a difficult task. A novel approach to address this issue is proposed, based on the integration of convolutional neural networks and long short-term memory for detecting diseases, followed by an attention mechanism for sequence generation based on these diseases. Experimental results obtained on the Indiana University CXR and MIMIC-CXR datasets showed that the proposed model attained state-of-the-art performance compared with the baseline solutions. BLEU-1, BLEU-2, BLEU-3, and BLEU-4 were used as the evaluation metrics.


Year: 2022 PMID: 34990477 PMCID: PMC8736265 DOI: 10.1371/journal.pone.0262209

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

1. Introduction

Chest diseases are fatal to human life. Common chest diseases such as pneumonia, pneumothorax, and effusion [1] are diagnosed with the help of medical images, such as chest X-rays (CXR) and CT scans. These images provide subsequent evidence of chest abnormalities captured through a proper pathological process. A radiologist conducts an analytical examination for the presence of even a minor abnormality on an X-ray image, followed by a detailed diagnostic textual report of a patient. This manually created report (see Fig 1) describes the condition of the chest in general, detailed findings, and diseases, if they are projected on the X-ray image. Writing medical reports is a laborious task. In developing countries with a large population with poor health conditions, such as Pakistan, radiologists may have to capture hundreds of X-ray images of different patients every day. Generating hundreds of reports on pathological conditions of lungs against CXR is time-consuming and tedious. The process of describing X-rays in terms of text is not efficient, even for specialist doctors in their respective fields. Moreover, this task is error-prone due to inexperienced radiologists, faulty reasoning by radiologists, staff shortage in hospitals, or additional workload in the hospitals that cause errors in the reports [2].

Fig 1

Examples of chest X-ray images and radiology reports.

Additionally, writing accurate reports is a very difficult task for pathologists and radiologists with less experience and for those working in rural areas with barely any healthcare facilities. To properly read and understand a radiograph, the following skills are needed [3]: (i) knowledge of the basic physiology of chest diseases and of the normal and abnormal anatomy of the thorax; (ii) the ability to find associations with other indicative examinations (respiratory function tests, test results, and electrocardiograms); (iii) the ability to understand the changes in the radiographs over time; (iv) familiarity with the clinical history of the patient; and (v) the ability to analyze radiographs through a fixed pattern. In other words, writing medical reports is a strenuous task for both experienced and inexperienced medical professionals. The proposed research is thus motivated by the goal of improving clinical diagnostic systems by adding the functionality to generate reports automatically. The current automatic report generation approaches suffer from various limitations that need to be addressed to complete this task. The first limitation is the understanding of diseases that appear as white projections of some well-understood patterns on the CXR, and the subsequent application of language semantics to express these in a natural language such as English. Therefore, in addition to visual understanding, a natural language processing model is required for report generation. In contrast to the existing models, the proposed research presents a model that addresses both visual representation and sentence generation. As a first step, the proposed model takes CXR images as input I and performs feature extraction for disease identification. In the second step, the model, which consists of LSTM followed by an attention mechanism, is trained to generate the desired report.
To optimize the report, the probability p(S|I) is determined, where S = {S1, S2, S3…} represents a set of words generated for the report from the vocabulary that sufficiently defines the contents in the CXR images [4]. The proposed model is motivated by the recent advancements in machine translation, where the goal is to transform the source composed of a sequence of tokens to the targeted sequence of tokens by maximizing the likelihood p(T|S), where S is the sequence of tokens present in a source space and T is the targeted sequence of words. The remainder of this paper is organized as follows. Section 2 provides a review of the literature and significant work done by researchers in the past few years. Section 3 describes the proposed methodology in detail. Section 4 presents all the datasets and experimental results in detail with the relevant figures and tables. Finally, Section 5 concludes the paper.

2. Related work

In recent years, several chest radiograph datasets have been made publicly available. A summary of these datasets is presented in Table 1. A number of researchers have worked on caption generation for general images and detailed report creation for medical images. Tanti et al. [5] classified generative models into two types: (i) injection architecture and (ii) merge architecture. In the injection architecture, both the tokenized captions and the image vectors are input to an RNN block, whereas in the merge architecture, only the captions are input to the RNN block, and its output is then merged with the image vectors. Effective computational models for image learning can be built by leveraging the information in both medical images and free-text reports in this emerging field. Such a combination of image and textual data helps to further improve the model performance in automatic report generation (Litjens et al. [6]). Correctly reading CXR images is challenging owing to the huge variability and complexity of the diseases as well as their treatments, using computerized tomography (CT) scans (Rubin, 2015) [7].

Table 1

Summary of the specifications of publicly available chest X-ray datasets.

Dataset	Source Institution	Disease Labeling	No. of Images	No. of Reports	No. of Patients
IU Chest X-Ray (Demner-Fushman et al. [8])	Indiana Network for Patient Care	Expert	8,121	3,996	3,996
MIMIC-CXR (Johnson et al. [9])	Beth Israel Deaconess Medical Center	Automatic (CheXpert labeler)	473,057	206,563	63,478
Chest-XRay8 (Wang et al. [10])	National Institutes of Health	Automatic (DNorm + MetaMap)	108,948	-	32,717
PadChest (Bustos et al. [11])	Hospital Universitario de San Juan	Expert + Automatic (Neural network)	160,868	206,222	67,625
CheXpert (Irvin et al. [12])	Stanford Hospital	Automatic (CheXpert labeler)	224,316	-	65,240

Schlegl et al. [13] first proposed a weakly supervised learning approach to utilize semantic descriptions in the reports as labels for better classifying tissue patterns in OCT imaging. They specified how accurate voxel level classifiers would be and how this information increases the classification accuracy for intraretinal SRF, IRC, and normal retinal tissues. In 2015, Shin et al. [14] and Wang et al. [10, 15] proposed a network that comprises a CNN and RNN in the field of radiology that is jointly trained to find abnormalities in CXR. They mined the radiological reports to create disease and symptom concepts as labels. They first used LDA to find the topics for clustering, and then applied disease detection tools such as DNorm, MetaMap, and several other NLP tools for downstream CXR classification using a convolutional neural network. They also released a label set along with image data. Later, Wang et al. [16] used the same exact CXR dataset to further improve the performance of disease classification and report generation from medical images. For report generation, Jing et al. [17] built a multi-task learning framework, which consists of co-attention and a hierarchical LSTM that predicts the tags, localizes the regions with abnormalities, and uses these for the radiological image annotation and report paragraph generation. They performed their experiments on two publicly available datasets: IU CXR [8] and PEIR Gross [17]. Moradi et al. [18] jointly processed image and text signals to produce CXR images of regions of interest. They proposed two architectures to find their region of interest in CXR and then to generate a textual report. 
One of these architectures comprises a CNN and an LSTM and was trained using images, their corresponding reports, and the markings of regions of interest (ROIs) for those X-rays; the second consists of a network pre-trained on a large dataset of the same type of images for feature learning to obtain the findings of interest. Rubin et al. [7] trained a convolutional network to predict common thoracic diseases using CXR images. They proposed a novel architecture called DualNet that processes both frontal and lateral X-ray images, emulating routine clinical practice. The dataset used was the MIMIC dataset, which is almost four times the size of the largest previously used CXR dataset (ChestX-Ray8) [10]. Li et al. [19] suggested a reinforcement-learning-based agent, named the HRGR agent, that trains the report generator to decide whether to make a sentence using a template or to generate a new sentence. This work was believed to be the first to combine human prior knowledge and generative neural networks to generate medical reports. The agent was updated using reinforcement learning. Alternatively, Gale et al. [20] generated interpretable hip fracture X-ray reports by identifying image features and filling text templates. Their approach trains a simple RNN model to produce hip fracture reports that clarify the results of the neural network classifiers. Finally, Hsu et al. [21] proposed a model that learns joint representations of radiological images and reports through unsupervised alignment of the cross-modal embedding spaces, using both local and global information retrieval. Experiments were performed on the MIMIC dataset, which contains both medical images and their corresponding reports. 
Machine translation was for many years performed as a sequence of separate activities, such as translating terms independently, aligning phrases, and reordering; however, recent developments have suggested easier and better ways to perform the same task by utilizing a recurrent neural network (RNN) [22-24], which provides state-of-the-art performance. Such a model is composed of two parts: an encoder and a decoder. The encoder reads the source sequence, which may be either text or an image, and transforms it into a fixed-length vector representation, which then acts as the initial hidden state of the decoder that produces the target sequence of words. The proposed model uses a deep convolutional neural network (CNN) as the encoder for an RNN decoder. This encoder converts the input CXR into a fixed-length vector representation of the kind used in multiple computer vision tasks [5]. The CNN encoder captures the details of the CXR contents, which are used as the input to the decoder LSTM followed by the attention block, which efficiently generates the medical reports (see Fig 3).

Fig 3

VGG-16 architecture and associated parameters.

The main contributions of the proposed research in medical report generation are as follows: (i) an innovative model that provides an end-to-end solution to the problem with state-of-the-art sub-networks, namely a CNN as the encoder and an LSTM followed by attention as the decoder; (ii) an entirely trainable neural network that utilizes vision features along with attention heads for better report generation; and (iii) substantial experiments on the IU and MIMIC CXR datasets that demonstrate the significance of the proposed approach.

3. Model

A probabilistic, neural-network-based model is proposed to produce the radiograph report. Recent advancements in machine translation have demonstrated that, with a strong sequence model, state-of-the-art outcomes can be obtained by directly maximizing the probability of the correct translation given an input sequence in an end-to-end manner, both for training and for inference. Such models use an RNN that encodes a variable-size input into a fixed-size vector. This fixed-size representation is then used as the input to the decoder, which converts it into a meaningful sequence of words. Thus, in the proposed model, the variable-size input is the CXR, the encoder is a CNN, and the decoder is an LSTM with attention, following the same principle as the conversion of a source into a target language. The main objective is to directly maximize the likelihood of the correct medical report, as originally written by a radiologist or pathologist, given the image. This is achieved by the mathematical formulation represented in Eq 1. In that equation, θ denotes the parameters of the proposed model and θ* their optimal values, I is a CXR, and S is its correct medical report. Note that S can be a sentence of any length. Therefore, the chain rule is applied to model the joint likelihood over S0 to SN, where N is the maximum number of words in the report. This is represented by Eq 2, in which the dependency on θ is dropped for ease of notation. Training pairs (S, I) are created, and the goal is to maximize the sum of the log probabilities described in Eq 2 over all the training pairs using gradient descent. Further details regarding the training are discussed in Section 4. 
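Eqs 1 and 2 are referenced above but are not reproduced in this text. A plausible reconstruction from the surrounding definitions, assuming the standard maximum-likelihood formulation of encoder-decoder captioning models (the summation ranges and the exact notation are assumptions), is:

```latex
\begin{align}
\theta^{*} &= \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \tag{1} \\
\log p(S \mid I) &= \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}) \tag{2}
\end{align}
```

Eq 1 sums over all training pairs (I, S), and Eq 2 factorizes the probability of one report word by word, which is what allows a recurrent decoder to model it.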
It is natural to model p(St | I, S0, …, St−1) with an LSTM, in which the variable number of words conditioned upon (up to t − 1), as described in Eq 2, is expressed by a fixed-length hidden state or memory ht. This memory is updated by a nonlinear function f after a new input xt is obtained, as stated in Eq 3: ht+1 = f(ht, xt). Two important structural decisions must be made to render this recurrent model workable: what form the function f should take, and how both the input CXR images and the words can be fed to the same system. For f, the proposed model uses a specific type of network called the long short-term memory (LSTM) network, which has proven highly effective for sequence-related tasks such as translation. However, before the output of the encoder is fed to the LSTM, it is passed through the attention mechanism, whose main purpose is to focus only on those parts of the image that constitute the region of interest and carry the most information. A CNN was applied to represent the contents of the CXR images. CNNs are the current state-of-the-art networks for visual classification and other image-related tasks, and the VGG16 architecture was selected because of its strong performance in the ILSVRC 2014 competition [25]. In addition, it has been shown to generalize to many other tasks, such as scene classification, through transfer learning [26]. The words are fed to the system through an embedding model that converts their one-hot representations into vectors.

3.1 Convolutional Neural Network (CNN)

A CNN is a special type of neural network that provides excellent performance in image processing and visual representation tasks. Some of the best applications of CNNs are feature extraction and tasks built on those features, such as classification, image segmentation, and object detection. A CNN is composed of several convolutional layers followed by fully connected (FC) layers, as in a multilayer neural network [27]. A CNN is designed to take advantage of the 2D structure of the input image. This is accomplished with local connections and tied weights, together with pooling operations that yield translation-invariant features. A key benefit of CNNs is that they are easier to train and have fewer parameters than fully connected networks with the same number of hidden units. The visual geometry group (VGG) network, a deep convolutional neural network for large-scale visual recognition, was used in this study [28]. The VGG has many variants. The best known are VGG16 and VGG19, which have 16 and 19 layers, respectively. The classification errors for both VGG16 and VGG19 were almost the same for both the validation data and the test data, which were 7.4% and 7.3%, respectively. The proposed model used the transfer learning approach to train the VGG16 architecture, as shown in Fig 2, to efficiently extract the features from the input images (CXR) using a combination of multiple 3 × 3 convolution layers and max pooling layers. The Softmax layer of VGG16 was removed, so that the final 1 × 4096 FC layer provides the output of the network. This layer now acts as the input to the decoder for the generation of medical reports. The output of the VGG16 network is thus a vector of size 1 × 4096, which is later converted into a fixed-length vector of size 1 × 256 that represents the features of the image.

Fig 2

VGG-16 neural network architecture with highlighted sizes and each layer units.

A dropout layer with a rate of 0.5 was added to the network to reduce overfitting. The rate is the probability with which each output of the layer is dropped; an optimal value is between 0.5 and 0.8. A dense layer was added after the dropout layer; it applies the activation function to the product of the input and the kernel plus a bias. The activation function used was the rectified linear unit (ReLU), and the size of the output space was specified as 256. These vectors of size 256 are the output of the feature extraction model and are then used as the input of the attention block followed by the LSTM. Fig 3 shows the detailed architecture of VGG-16 along with all parameters.
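The projection head described above (dropout followed by a dense ReLU layer) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the weights are random rather than trained, inverted dropout is assumed, and toy sizes stand in for the 4096 to 256 projection.

```python
import random

def dropout(vec, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def dense_relu(vec, weights, bias):
    """Dense layer ReLU(W x + b), with W given as a list of rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def project_features(fc_features, weights, bias, rate=0.5, rng=None, training=False):
    """Dropout (active only during training) followed by the dense ReLU layer."""
    rng = rng or random.Random(0)
    return dense_relu(dropout(fc_features, rate, rng, training), weights, bias)

# Toy sizes (assumption) standing in for the 4096 -> 256 projection in the text.
IN_DIM, OUT_DIM = 8, 4
_rng = random.Random(42)
W = [[_rng.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
b = [0.0] * OUT_DIM
features = [_rng.random() for _ in range(IN_DIM)]
embedding = project_features(features, W, b)
```

At inference time dropout is disabled, so `embedding` depends only on the dense layer.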

3.2 Word embedding

Word embedding is primarily responsible for processing the captions of each image given as input during the training process. The output of the word embedding is also a vector of size 1 × 256, which is the other input to the decoder. Initially, the captions that accompany each CXR are tokenized. Tokenization is the process through which the words in these sentences are converted to integers so that the neural network can process them efficiently. The tokenized captions are padded to ensure that all sentences have the same length as the longest sentence. Then, an embedding layer is attached to embed the tokenized captions into fixed dense vectors with an output space of 256 × 22. The value 22 was chosen because it is the maximum number of words in any of the findings of the IU CXR dataset. These vectors further ease the processing by providing a convenient way to represent words in the vector space. A dropout layer with a probability of 0.5 is attached again to reduce overfitting in the model.
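The tokenization and padding steps can be illustrated with a short sketch. The helper names and the two example captions are made up for illustration, id 0 is assumed to be reserved for padding, and padding on the right is an assumption (the text does not say on which side the padding is applied).

```python
def build_vocab(captions):
    """Map each distinct word to an integer id; 0 is reserved for padding."""
    vocab = {}
    for caption in captions:
        for word in caption.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def tokenize(caption, vocab):
    """Convert a caption into a list of integer word ids."""
    return [vocab[w] for w in caption.split() if w in vocab]

def pad(seq, max_len, pad_id=0):
    """Right-pad (or truncate) a sequence to exactly max_len ids."""
    return (seq + [pad_id] * max_len)[:max_len]

# Invented example captions, wrapped in the start and end markers.
captions = [
    "startseq the heart is normal in size endseq",
    "startseq no pleural effusion endseq",
]
vocab = build_vocab(captions)
max_len = max(len(c.split()) for c in captions)
padded = [pad(tokenize(c, vocab), max_len) for c in captions]
# padded[1] is [1, 9, 10, 11, 8, 0, 0, 0]
```

An embedding layer would then map every id in `padded` to a 256-dimensional vector.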

3.3 Attention mechanism

The attention mechanism helps the LSTM to focus on just the part of an image that is of interest when generating a new word while completing a caption. In other words, the decoder generates the caption using only some specific information. With this attention block in the architecture, the next word of the medical report is predicted not only from the hidden state of the decoder but also from a context vector that contains the information of interest. The image is divided into n parts; at the i-th position of the report, the hidden state hi of the LSTM is used as the context to select the relevant part of the radiograph. The output of the attention model is denoted by zi. It can be regarded as a vector that contains only those parts of the image that carry the main information, which makes it easier for the LSTM to generate a new word that describes the content of the CXR, the diseases present in it, and the relationship between them. Importantly, after the LSTM generates a new word, it also returns a new hidden state ht+1 for the generation of the next word, and so on. In the mathematical representation of the attention mechanism, e denotes the importance of the j-th pixel location of the input image at the i-th time step of the decoder; s is the previous state of the decoder; h is the state of the encoder; and f is a simple feed-forward neural network that applies a linear transformation of the input, U*h + W*s, then a non-linearity (tanh), and then one more linear transformation, so that its output is a scalar quantity. Once these importance scores are known, the weighted sum of the inputs is fed to the decoder as the context vector C. The decoder then computes its new state from s, the previous state of the decoder; the previously predicted word; and C, the context vector, i.e., the weighted sum of the input. 
Finally, this is a better modeling technique than the previous one, and it is a more informed model because the results are obtained in a more natural way.
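The attention computation described in this section can be sketched as additive attention in plain Python. This is a minimal sketch under assumptions: the name `v` for the final linear transformation, the softmax normalization of the scores, and all toy values are not taken from the source.

```python
import math

def matvec(M, x):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(H, s_prev, U, W, v):
    """H holds the n encoder vectors h_j; s_prev is the previous decoder state.

    e_j     = v . tanh(U h_j + W s_prev)   (scalar importance of region j)
    alpha   = softmax(e)                   (assumed normalization)
    context = sum_j alpha_j * h_j          (context vector C)
    """
    Ws = matvec(W, s_prev)
    scores = []
    for h in H:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(U, h), Ws)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alpha = softmax(scores)
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, context

# Toy example: 3 image regions with 2-dimensional features (values are made up).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]
alpha, context = additive_attention(H, s_prev, U, W, v)
```

The weights in `alpha` sum to one, so `context` is a convex combination of the region features, and it would be passed to the decoder together with the previous state and the previously predicted word.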

3.4 Long Short-Term Memory (LSTM)

It is difficult for a simple RNN to learn long-term dependencies because vanishing and exploding gradients are very common [29]. To overcome this problem, a specific type of recurrent network called the LSTM was introduced [29] and has been applied successfully to translation tasks [23, 30] and sequence generation [31].

3.4.1 LSTM based sentence generator

The main idea behind the LSTM model is a memory cell c, which encodes knowledge of the inputs that have been observed up to each time step (see Fig 4). The behavior of the cell is controlled by "gates", layers that are applied multiplicatively and can therefore either keep a value from the gated layer if the gate is 1 or zero it if the gate is 0. In particular, three gates control whether the current value of the cell should be forgotten (forget gate), whether the cell should read its input (input gate), and whether the new cell value should be output (output gate o). Eqs 8, 9 and 10 represent the input, forget, and output gates, respectively, whereas Eqs 11, 12 and 13 represent the other operations of the LSTM [29].

Fig 4

Complete and combined model of CNN image embedder and LSTM with the word embedding.

The LSTM is shown in the unrolled version. All LSTMs are using the same parameters.

The weight matrices W represent the trained parameters, and ⊙ represents the multiplication with a gate value. These multiplicative gates enable the LSTM to be trained robustly because they cope well with exploding and vanishing gradients [29]. The nonlinearities are the hyperbolic tangent h(·) and the sigmoid σ(·). The mt used in the last equation is fed to the Softmax function, which produces a probability distribution pt over all the words in the vocabulary.
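Eqs 8 to 13 are referenced in this subsection but are not reproduced in this text. A reconstruction consistent with the symbols described here (W, ⊙, h(·), σ(·), mt, and the Softmax), assuming the common LSTM formulation without bias terms, is given below; the subscripts on the weight matrices and the assignment of equation numbers are assumptions.

```latex
\begin{align}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1}) \tag{8} \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \tag{9} \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1}) \tag{10} \\
c_t &= f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \tag{11} \\
m_t &= o_t \odot c_t \tag{12} \\
p_{t+1} &= \mathrm{Softmax}(m_t) \tag{13}
\end{align}
```

Here i_t, f_t, and o_t are the input, forget, and output gates, c_t is the memory cell, and m_t is the output passed to the Softmax.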

3.5 Training

The LSTM is trained to predict each word of the report after seeing the CXR and all the preceding words, as described by p(St | I, S0, …, St−1). For this purpose, it is instructive to think of the LSTM in unrolled form: a copy of the LSTM is created for the image and for each word of the report, and all copies share the same parameters. The word predicted by the LSTM at time t serves as an input to the attention block, and the output of that attention block is then used as the input to the LSTM at time t+1, and so on (see Fig 5). In the unrolled version, all recurrent connections are converted into feed-forward connections. If the input CXR image is denoted as I, and S = (S0, S1, …, SN) is a correct medical report describing the CXR, the unrolling procedure can be represented using Eqs 14, 15 and 16.
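Eqs 14 to 16 are not reproduced in this text. A reconstruction consistent with the description of the unrolling, assuming that the image is presented once at t = -1 and that words are embedded by a matrix W_e (both the symbol W_e and the index ranges are assumptions, and the attention block is omitted for brevity), is:

```latex
\begin{align}
x_{-1} &= \mathrm{CNN}(I) \tag{14} \\
x_t &= W_e S_t, \quad t \in \{0, \ldots, N-1\} \tag{15} \\
p_{t+1} &= \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\} \tag{16}
\end{align}
```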

Fig 5

Long Short-Term Memory (LSTM) network architecture: In the above diagram the memory block comprises a cell c which is essentially controlled by three gates.

These gates are the input, the forget and the output gates.

Every word St is represented as a one-hot vector whose size is equal to the size of the vocabulary. Two special words, S0 and SN, mark the start and end of the medical report: S0 is “startseq” and SN is “endseq”. Specifically, the LSTM signals that a full report has been produced by emitting the stop word “endseq”. Images and words are projected onto the same space: images are mapped using the CNN, whereas words are mapped using the word embedding. The input CXR image is given only once, after passing through the attention block, at t = -1, to inform the LSTM about the diseases present in the radiograph. It has been experimentally verified that giving the input image at every time step produces poorer results, because the network then has to deal with the noise in the image at each time step, which makes it less effective and causes overfitting. The loss of the stated model is calculated by summing the negative log probability of the correct word at each time step, as described in Eq 17. This loss is minimized with respect to all the parameters of the LSTM, the attention block, the CNN, and the word embedding. The hyper-parameters and their respective configuration are given in Table 2.

Table 2

Proposed technique hyper parameters and related configuration.

Hyper-parameters	Configuration
Layers	Encoder (VGG16) + 1 Dense + Decoder-LSTM (9)
Optimizer	Adam
Activation Function	ReLU
Learning Rate	0.001
Batch Size	64
Loss function	Sparse Categorical Cross entropy
Number of attention heads	1
Dropout rate	0.5
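The loss of Eq 17, the summed negative log probability of the correct word at each time step, can be sketched as follows. The function name and the toy probabilities are illustrative assumptions; the sparse categorical cross-entropy listed in Table 2 computes the same per-step quantity from integer word ids.

```python
import math

def report_nll(step_probs, target_ids):
    """Sum over time steps of -log p_t(S_t) for one report."""
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target word is required")
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))

# Toy vocabulary of 3 words; each row is the Softmax output at one time step.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]  # integer ids of the correct words
loss = report_nll(step_probs, target_ids)
```

The loss is zero only when the model assigns probability 1 to every correct word, and it grows as the probabilities of the correct words shrink.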

3.6 Inference

Many approaches can be used to produce a medical report for a given radiograph. One of them is beam search, which iteratively keeps the k best sentences up to time t as candidates for generating sentences of size t + 1 and retains only the best k of the resulting sentences. Another approach is sampling, in which the first word of the report is the word in the vocabulary with the highest probability under p1. The embedding of that word is then used as the input, and the next word is selected using the probability p2. This process continues until the end-of-sentence token is produced or the maximum length is reached. Sampling was used for the experiments discussed in the following section.
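Both decoding strategies can be illustrated with one small beam search routine, since a beam of size k = 1 reduces to always picking the most probable next word. This is a sketch under assumptions: `next_probs` is a hypothetical stand-in for the trained decoder, the toy probability table is invented, and the default maximum length of 22 simply reuses the caption length mentioned earlier.

```python
import math

def beam_search(next_probs, start, end, k=3, max_len=22):
    """next_probs(prefix) returns a dict mapping each next word to its probability."""
    beams = [([start], 0.0)]          # (words so far, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in next_probs(words).items():
                if p > 0.0:
                    candidates.append((words + [word], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:k]:   # keep only the best k hypotheses
            (finished if words[-1] == end else beams).append((words, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]

# Invented toy model: the next-word distribution depends only on the last word.
TABLE = {
    "startseq": {"no": 0.6, "heart": 0.4},
    "no": {"effusion": 0.5, "pneumothorax": 0.5},
    "heart": {"normal": 0.95, "enlarged": 0.05},
    "effusion": {"endseq": 1.0},
    "pneumothorax": {"endseq": 1.0},
    "normal": {"endseq": 1.0},
    "enlarged": {"endseq": 1.0},
}

def toy_next_probs(prefix):
    return TABLE[prefix[-1]]

greedy = beam_search(toy_next_probs, "startseq", "endseq", k=1)
beam = beam_search(toy_next_probs, "startseq", "endseq", k=2)
```

In this toy model the greedy choice commits to "no" and ends with a sequence of probability 0.30, whereas the beam of size 2 recovers "startseq heart normal endseq" with probability 0.38.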

4. Experimentation

We performed a systematic series of experiments to test the efficacy of the proposed model by comparing it with previously developed models using metrics such as the BLEU score.

4.1 Evaluation metrics

As already discussed, describing CXRs is a difficult task, and an experienced radiologist is required to read a CXR. Human evaluation is generally regarded as the most reliable way to measure the performance of a model that describes images in natural language. However, it is not possible in our case because of the difficulties faced in correctly reading CXRs [2]. Prior research has proposed many automatic evaluation metrics to check the performance of a model. The most widely used metric in the research on sentence generation from images has been the BLEU score [32], which is a form of n-gram word precision between the produced and the reference reports. For the proposed architecture, we measured the BLEU score to check the accuracy of the proposed model. The BLEU score ranges from 0.00 to 1.00. The higher the BLEU score, the better the generated medical report, because the score compares candidate sentences with reference sentences: the candidate sentence is the predicted one, and the reference sentence is the actual one. Four types of BLEU scores were observed, with the following n-gram weights: BLEU-1 (1.0, 0, 0, 0), BLEU-2 (0.5, 0.5, 0, 0), BLEU-3 (0.33, 0.33, 0.33, 0) and BLEU-4 (0.25, 0.25, 0.25, 0.25). Cumulative weights were used because they provide better output. The Adam optimizer [33] was used for parameter learning. Researchers working on this subject have identified other metrics that are considered more relevant for medical report assessment. We report only one metric, BLEU, and hope that further debate and work will lead to a consensus on the preferred metric.
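The cumulative BLEU weights listed above can be made concrete with a small single-reference implementation. This is a sketch, not the evaluation code used in the experiments: it assumes one reference per candidate and no smoothing, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total

def bleu(candidate, reference, weights):
    """Unsmoothed sentence BLEU: brevity penalty times weighted geometric mean."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        if w == 0:
            continue
        p = modified_precision(candidate, reference, n)
        if p == 0.0:
            return 0.0
        log_sum += w * math.log(p)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)

reference = "the heart is normal in size".split()
candidate = "the heart size is normal".split()
bleu1 = bleu(candidate, reference, (1.0, 0, 0, 0))
bleu2 = bleu(candidate, reference, (0.5, 0.5, 0, 0))
```

For this pair all five unigrams match but only two of the four bigrams do, so `bleu1` equals the brevity penalty exp(-0.2) and `bleu2` is that value multiplied by the square root of 0.5; no candidate trigram appears in the reference, so BLEU-3 and BLEU-4 are zero without smoothing.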

4.2 Dataset

To demonstrate the validity of the proposed technique through detailed experiments and comparisons, we used two publicly available CXR datasets: the Indiana University CXR dataset [8] and the MIMIC CXR dataset [9]. The Indiana University Chest X-Ray dataset (IU X-Ray) by Demner-Fushman et al. is a collection of CXRs combined with their corresponding medical reports. The X-rays in this dataset are stored as PNG files with a resolution of 512 × 624 and a depth of 24 bits. The dataset includes 3,955 radiology reports from two major health networks within the archive of the Indiana Network for Patient Care and 7,470 related CXRs [8]. Almost every report is associated with two CXRs of the patient, a frontal view and a lateral view. The sections of each report include the impression, findings, comparison, and indication. In this research, we used the findings of doctors as the target medical reports to be generated (Fig 1 provides an example). The second dataset used in this research is the MIMIC CXR dataset by Johnson et al. [9]. This is one of the largest publicly available CXR datasets that also contains free-text reports. It includes imaging studies for 65,379 patients, 377,110 CXR images and 227,835 radiology reports. The dataset contains JPEG images of varying sizes with a depth of 8 bits. The images in this dataset are accompanied by detailed reports, from which we used the findings section because it contains the radiology report for each image. For both datasets, the first step is to pre-process the data by converting the long findings of the doctors into short reports: the findings are split into multiple chunks, which leads to more than one report being associated with each CXR. In addition, all tokens in the reports are converted to lowercase, and all tokens that are not alphabetical are removed. 
In all experiments, a randomly selected 80% of the data was used for training and the remaining 20% for testing, and each experiment was repeated 10 times to report the average performance of the proposed technique.
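The token cleaning and the random 80/20 split can be sketched as follows. The helper names, the example sentence, and the use of a seed per repetition are assumptions; the splitting of long findings into chunks is not shown because its details are not specified above.

```python
import random

def clean_report(text):
    """Lowercase the report and keep only purely alphabetic tokens."""
    return " ".join(tok for tok in text.lower().split() if tok.isalpha())

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed and split into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

raw = "The Heart is normal in size . No pleural effusion 2 views"
cleaned = clean_report(raw)
# cleaned == "the heart is normal in size no pleural effusion views"

# Ten repetitions with different random splits (seeds are illustrative).
splits = [train_test_split(range(10), seed=s) for s in range(10)]
```

Note that this whitespace-based version drops any token with attached punctuation (for example "size."), so punctuation would need to be separated beforehand if such words should be kept.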

4.3 Baselines

The proposed model is compared with several of the current best-performing architectures, namely LRCN [34] by Donahue et al., Soft ATT [35] by Xu et al., ATT-RK [36] by You et al., and Hierarchical Generation [37] by Krause et al. Donahue et al. [34] presented a long-term recurrent convolutional network (LRCN) that considers visual representations along with descriptions. They focused on recurrent convolutional spatio-temporal layers rather than a fixed spatio-temporal receptive field. Xu et al. [35] presented a soft-attention-based model that uses visual features for report generation. Their model can automatically learn to fix its gaze on salient objects while generating reports. A semantic-attention-based model was introduced by You et al. [36]. Their model learned to selectively attend to semantic concept proposals and fused them with an RNN to generate reports. Krause et al. [37] extended the concept presented by You et al. [36] and added a hierarchical RNN model for robust captioning. However, subsequent detailed comparative analysis showed that all these baseline methods fell short in handling long sentences. We implemented all these models for radiology report generation and used VGG-16 [28] as the CNN encoder, keeping in mind that these models were built for short sentence-based reports.

4.4 Quantitative results

We report the results of the medical report generator using the standard image captioning evaluation metric, BLEU [32]. We also performed experiments in which the LSTM was replaced with a gated recurrent unit (GRU) or a bidirectional LSTM, keeping the same VGG16 for feature extraction and the same attention block to focus on the part of the image that is of interest. Various images were tested using the three variants: VGG + LSTM, VGG + GRU, and VGG + bidirectional LSTM. After training, the VGG + LSTM model with attention obtained the highest BLEU-1, BLEU-2, and BLEU-3 scores, while VGG + GRU obtained a slightly higher BLEU-4. A comparison of the BLEU scores of all the variants is presented in Table 3.
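For readers unfamiliar with the metric, the sketch below shows how a sentence-level BLEU-n score can be computed from clipped n-gram precisions and a brevity penalty. It is an illustrative implementation with uniform weights, a single reference, and no smoothing; the exact BLEU configuration used in the experiments is not stated, so the function names and these choices are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(reference, candidate, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    ref_len, cand_len = len(reference), len(candidate)
    # Candidates shorter than the reference are penalised.
    brevity = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return brevity * math.exp(log_mean)
```

Because higher-order n-grams rarely match in long reports, BLEU-4 is typically much lower than BLEU-1, which is the pattern seen in the reported tables.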

Table 3

Comparison of the proposed technique with different combinations of available options in terms of BLEU-1 to BLEU-4 scores for the medical reports generated on the IU CXR dataset.

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4
VGG16 + LSTM with Attention	0.580	0.342	0.263	0.155
VGG16 + LSTM without Attention	0.522	0.262	0.201	0.119
VGG16 + GRU	0.495	0.302	0.250	0.160
VGG16 + Bi-Directional LSTM	0.533	0.321	0.253	0.153

Table 3 shows that, when an encoder and decoder followed by an attention mechanism are used to describe the contents of either a natural image or a medical image, the combination of VGG for feature extraction and LSTM for sentence generation yields the best overall results among the tested variants. Although previous research efforts have used bidirectional LSTM, which has access to past as well as future context and can therefore predict better, it works better mostly in the caption generation of natural images [38]. For report generation, the proposed CNN-LSTM architecture with the unrolled LSTM format followed by the attention mechanism achieves higher BLEU-1, BLEU-2, and BLEU-3 scores than all the other listed networks, as shown in Table 4; only Hierarchical Generation [37] obtains a higher BLEU-4. The contrast between these models indicates the effectiveness of the proposed CNN-LSTM model. This finding is not unexpected, as it is already well established that a single-layer LSTM cannot model long sentences effectively [35]. The methods of Alfarghaly et al. [39] resulted in relatively lower BLEU scores, but they used additional image-tag information from the IU X-Ray dataset and also assigned final automated tags to images along with report generation.

Table 4

Comparison of the proposed technique with the existing state of the art in terms of BLEU-1 to BLEU-4 scores for the medical reports generated on the IU CXR dataset.

Dataset	Methods	BLEU-1	BLEU-2	BLEU-3	BLEU-4
IU X-Ray	LRCN [34]	0.369	0.229	0.149	0.099
	Soft ATT [35]	0.399	0.251	0.168	0.118
	ATT-RK [36]	0.369	0.226	0.151	0.108
	Hierarchical Generation [37]	0.437	0.323	0.221	0.172
	Conditioned Transformers [39]	0.347	0.221	0.156	0.116
	CDGPT2 [39]	0.387	0.245	0.166	0.111
	CNN LSTM (With Attention)	0.580	0.342	0.263	0.155

The proposed algorithm was also tested on the MIMIC-CXR dataset, and the results are presented in terms of the BLEU-4 score. Table 5 compares the proposed technique with state-of-the-art methods that have used the MIMIC-CXR dataset. The meshed-memory transformer presented in [42] gives the results closest to those of the proposed technique. This meshed-memory model was optimized with the help of five loss functions. Both it and the proposed technique clearly outperformed simple CNN- and RNN-based models. The attention heads introduced in the proposed technique helped it obtain state-of-the-art results.

Table 5

Comparison of the proposed technique with current state-of-the-art techniques in terms of BLEU-4 for the medical reports generated on the MIMIC-CXR dataset.

	TieNet[16]	CNN-RNN²[40]	R2Gen [41]	Meshed Memory Trans [42]	Proposed
BLEU-4	0.081	0.076	0.086	0.133	0.153

The model was trained on Google Colab, which provides one NVIDIA Tesla K80 GPU with 12 GB of GDDR5 VRAM. The loss for the LSTM + VGG model was lower than that for the other two models. The LSTM took more time to process than the GRU, because fewer operations occur in the GRU than in the LSTM. GRUs generally train faster than LSTMs on smaller training sets and are simpler and easier to modify. Figs 6 and 7 show the accuracy and the loss of the proposed model, respectively, plotted against the number of epochs. After the 7th epoch, the loss on the testing data starts to increase while the loss on the training data continues to decrease, which indicates overfitting. Therefore, to avoid overfitting, we used the model trained up to the 7th epoch, which corresponds to the minimum testing loss our model attained while training on the CXR images of the IU dataset.
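The checkpoint choice described above, keeping the epoch with the lowest held-out loss, can be sketched as a small early-stopping helper. The function name, the patience value, and the loss values in the usage note are hypothetical; the actual curves are those shown in Fig 7.

```python
def select_checkpoint(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    best_epoch is the epoch with the lowest held-out loss seen so far;
    stop_epoch is where training would halt after `patience` epochs
    without improvement.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)
```

For an illustrative loss sequence such as `[2.1, 1.7, 1.4, 1.2, 1.1, 1.05, 1.0, 1.02, 1.08, 1.15]`, the helper keeps the checkpoint from epoch 7 and stops at epoch 9.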

Fig 6

Accuracy versus number of epochs for the proposed model.

Fig 7

Loss versus number of epochs for the proposed model.

Performance is also expected to improve when the model is trained on a larger dataset with more CXRs. Given the accuracy of the generated medical reports, the system can assist radiologists by speeding up report generation.

4.5 Qualitative results

Sample reports generated by the proposed CNN-LSTM-based model are shown in Fig 8. The medical reports are high-level descriptions of the X-ray. The sentences generated for the different diseases present in the radiograph depend on the features extracted by VGG, that is, the encoder. Many true abnormalities present in the X-ray are correctly described by the CNN-LSTM model, as shown in Fig 8. Any sentence containing words such as “no,” “normal,” “clear,” or “stable” is considered to describe “normality.”
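The keyword rule for labelling a sentence as "normality" can be written as a short check. This is a sketch under assumptions: the keyword set contains only the four words quoted in the text (the phrase "words like" suggests the actual list may be longer), and whole-word matching is assumed so that "abnormal" does not count as "normal".

```python
import re

# Keywords quoted in the text; the full list used in the study may be longer.
NORMALITY_WORDS = {"no", "normal", "clear", "stable"}


def is_normality(sentence):
    """Return True if the sentence contains any normality keyword as a whole word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in NORMALITY_WORDS for token in tokens)
```

Under this rule, "The lungs are clear." is labelled as normality, while "Abnormal opacity in the left lung." is not.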

Fig 8

Different examples of generated results.

Column 1 contains the original image, column 2 contains the attention maps, and column 3 contains the actual and predicted captions.

The performance of the proposed system in terms of BLEU scores, along with the predicted labels, is shown in Fig 8. We have included a few randomly selected best and worst cases of CXRs from the datasets used in the experiments. The first two examples in Fig 8 show that the proposed technique captures variations in the labels and predicts them with good BLEU scores. However, the scores are quite low when the reports are too long, as shown in example 3 of Fig 9.

Fig 9

Randomly selected CXRs from the datasets used, along with the original and predicted reports.

5. Conclusion

The proposed model is an application that creates automated textual reports for CXRs, with the aim of assisting medical professionals in creating reports more efficiently and effectively. It is based on a CNN feature extraction model that acts as an encoder, converting an image into a fixed-size vector representation, followed by an RNN decoder that generates the corresponding sentences based on the learned image features. The effectiveness of the model was analyzed quantitatively and qualitatively on the CXR datasets. A comparative study of various methods was presented to observe the influence of different components on medical report generation, and various use cases of the proposed system were demonstrated. The results show that the LSTM model generally works slightly better than the GRU, although it takes a little more time for training as well as for sentence generation owing to its complexity. The performance is also expected to increase when training on a larger dataset with a greater number of images. Different experiments on the IU dataset validate the effectiveness of the proposed architecture.

19 Aug 2021

PONE-D-21-15198
Attention Based Automated Radiology Report Generation using CNN and RNN
PLOS ONE

Dear Dr. Paracha,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

============================== The proposed method needs to compare with other state-of-the-art methods, such as https://arxiv.org/abs/1912.08226 https://arxiv.org/abs/2010.10042 The experiments need to conduct on several datasets to demonstrate the model's generalizability. The paper needs to be proofread. There are many typos and errors.
============================== Please submit your revised manuscript by Oct 03 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Yifan Peng, Ph.D. 
Academic Editor PLOS ONE Journal requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf. 2.Thank you for stating the following in the Acknowledgments Section of your manuscript: “This research project was funded by the Deanship of Scientific Research, Princess Nourah bint Abdulrahman University, through the “Program of Research Project Funding After Publication, grant No (42-PRFA-P-53)”. We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: “This research project was funded by the Deanship of Scientific Research, Princess Nourah bint Abdulrahman University, through the “Program of Research Project Funding After Publication, grant No (42-PRFA-P-53)”. https://www.pnu.edu.sa/en/Pages/home.aspx No, The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” Please include your amended statements within your cover letter; we will change the online submission form on your behalf. Additional Editor Comments (if provided): [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. 
Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: N/A ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: No ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The author proposed an attention based Automated Radiology Report Generation using CNN and RNN. The paper is easy to read and the results can confirm their idea. But there are still several issues that need to be addressed. (1) A line is missing at the bottom of the table. (2) The equation 2 is missing a ). (3) The second Table 1 should be Table 2 and the Table 2 should be Table 3. (4) The resolution of Figure 3 is low. (5) The reference seems wrong. For example, in the section 4.3 Baselines: The proposed model is compared with many current best performing architectures, for example, LRCN [18] by Donahue et al., Soft ATT [19] by Xu et al., ATT-RK [20] by You et al., and Hieratical Generation [21] by Krause, et al. But in the Table 2 the references are 19-22. You have to check your references carefully. (6) You need to include the latest method to compare with your proposed method, such as Meshed-Memory Transformer for image captioning, Linguistics Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. You can also compare with other methods, but at least those proposed in 2020. Reviewer #2: This manuscript describes an attention-based report generation from X-ray using CNN and RNN. Descriptions of x-ray images were missing from both data and experiment sections. More details on the images will be helpful (size of x-ray, divided image into n parts? what is n in the experiment?) VGG was not elaborated on Page 12. The experiments seem a bit confusing. For example, when LSTM and RNN were used in the experiment? Is it CNN-RNN or CNN-LSTM? In text, it is mentioned CNN-RNN, but in figure, it is CNN-LSTM. The text of the manuscript can be reduced significantly. Please provide all the parameters for all the algorithms. Captions for images are not available. Problems in table caption. The numbering of tables is wrong. ********** 6. 
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 24 Sep 2021 Editor’s Comments & Responses 1. The proposed method needs to compare with other state-of-the-art methods, such as 2. https://arxiv.org/abs/1912.08226 3. https://arxiv.org/abs/2010.10042 Response: We have considered both articles and added some more relevant ones in revised version. We have also compared the results of proposed technique with these articles. Following recent articles have been added in revised version • Alfarghaly, O., Khaled, R., Elkorany, A., Helal, M., & Fahmy, A. (2021). Automated radiology report generation using conditioned transformers. Informatics in Medicine Unlocked, 24, 100557. 
• Liu, G., Hsu, T. M. H., McDermott, M., Boag, W., Weng, W. H., Szolovits, P., & Ghassemi, M. (2019, October). Clinically accurate chest x-ray report generation. In Machine Learning for Healthcare Conference (pp. 249-269). PMLR. • Chen, Z., Song, Y., Chang, T. H., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056. • Miura, Y., Zhang, Y., Tsai, E. B., Langlotz, C. P., & Jurafsky, D. (2020). Improving factual completeness and consistency of image-to-text radiology report generation. arXiv preprint arXiv:2010.10042. 2. The experiments need to conduct on several datasets to demonstrate the model's generalizability. Response: We have added one more publically available dataset for experimentations and results are added in revised article. We have added more results and also have highlighted failure cases to show model’s generalizability. Please refer to table-3, table-5 and figure 9. 3. The paper needs to be proofread. There are many typos and errors. Response: The article was initially proof read by a third party EdiTag. We have again proofread it for possible errors. Reviewer's Comments & Responses 1. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: Yes Reviewer #2: Yes 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: N/A 3. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes Reviewer #2: No Response: The datasets used in this article are publically available 4. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: Yes Reviewer #2: No Response: Manuscript is updated according to the comments of the reviewers and academic Editor and all the grammatical and typographical errors are catered. It has been proof read by third party. 5. 
Review Comments to the Author Reviewer #1 The author proposed an attention based Automated Radiology Report Generation using CNN and RNN. The paper is easy to read and the results can confirm their idea. But there are still several issues that need to be addressed. Response: We are thankful to the reviewer for encouragement and also sharing his comments to improve the quality of our article (1) A line is missing at the bottom of the table. Response: We have updated the formatting of all tables and bottom lines are added at the end of all tables. We have also updated captions of all tables for better understanding. (2) The equation 2 is missing a ). Response: The equation is updated and ‘)’ is added at the end of equation 2. (3) The second Table 1 should be Table 2 and the Table 2 should be Table 3. Response: We have corrected table numbering. (4) The resolution of Figure 3 is low. Response: We have updated Figure 3 to improve its resolution. We have also improved the resolution of other figures. (5) The reference seems wrong. For example, in the section 4.3 Baselines: The proposed model is compared with many current best performing architectures, for example, LRCN [18] by Donahue et al., Soft ATT [19] by Xu et al., ATT-RK [20] by You et al., and Hieratical Generation [21] by Krause, et al. But in the Table 2 the references are 19-22. You have to check your references carefully. Response: References are updated and now all the references are properly aligned. (6) You need to include the latest method to compare with your proposed method, such as Meshed-Memory Transformer for image captioning, Linguistics Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. You can also compare with other methods, but at least those proposed in 2020. Response: As per reviewer suggestion, we have conducted more experiments and compared our results with more recent articles. 
We have also considered more public datasets like MIMIC for experiments and comparisons. Reviewer #2: This manuscript describes an attention-based report generation from X-ray using CNN and RNN. Descriptions of x-ray images were missing from both data and experiment sections. More details on the images will be helpful (size of x-ray, divided image into n parts? what is n in the experiment?) Response: We have added these details with respect to both dataset and experimental setup in section 4.2 of revised article. VGG was not elaborated on Page 12. Response: Section 3.1 of revised article contains details related to VGG and we have also shown all layers of VGG in figures 2 and 3. The Softmax layer of VGG16 was replaced with the final 1 × 4096 FC layer. This layer now acts as an input to the decoder as well as the generation of medical reports. The output of the VGG16 network is a vector of size 1 × 4096, which will later be converted into a fixed vector length of 1 × 256 that is used to represent the features of the images. The experiments seem a bit confusing. For example, when LSTM and RNN were used in the experiment? Is it CNN-RNN or CNN-LSTM? In text, it is mentioned CNN-RNN, but in figure, it is CNN-LSTM. Response: Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks (RNN) which makes it easier to remember past data in memory and considered best of the text generation tasks. The vanishing gradient problem of RNN is resolved here. So as a big picture we write RNN every where even in the title of the paper. But actually, we are using the best model of RNN which is LSTM. But as highlighted by the reviewer, we have replaced RNN with LSTM wherever we are mentioning proposed technique. We have also updated the title accordingly. The text of the manuscript can be reduced significantly. Response: As advised by the reviewer, we have reduced the text in introduction and literature review sections. 
However, more results and discussions are added to emphasize more on proposed system robustness. So in general the length of article is almost same. Please provide all the parameters for all the algorithms. Response: We have added parameters in figure 2,3 and 4 of revised article. We have also added table 2 to show hyper parameters. Captions for images are not available. Report: As per journal formatting, we didn’t add figures in the text. However the image captions are added at relevant places for better understanding. Problems in table caption. The numbering of tables is wrong. Response: We have corrected table numbering and have also improved the captions for better understanding. Submitted filename: Response to Reviewers.docx Click here for additional data file. 8 Nov 2021 PONE-D-21-15198R1Attention Based Automated Radiology Report Generation using CNN and LSTMPLOS ONE Dear Dr. Paracha, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ==============================1. please briefly describe the baseline methods 2. please check figure 4 3. please make sure the codes are publicly available. ============================== Please submit your revised manuscript by Dec 23 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). 
You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Yifan Peng, Ph.D. Academic Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. 
[Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: No ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 5. 
Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: All my previous comments have been addressed. I have one more comments for this revised version. As the author have conducted more experiments and compared their results with more recent articles, they need added the new comparison methods to the Section Baselines. Reviewer #2: The authors addressed all the comments. Figure 4 is black and there is nothing. The authors may share the code. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

23 Nov 2021

Comment: Please briefly describe the baseline methods
Response: We have added brief description of baseline methods in section “4.3 Baseline” of revised article.

Comment: Please check figure 4
Response: Figure 4 is updated. Now it is properly visible.

Comment: Please make sure the codes are publicly available.
Response: We have made our complete code publically available at our research group website. It can be accessed at http://biomisa.org/index.php/downloads/

Submitted filename: Response File.docx

16 Dec 2021

PONE-D-21-15198R2
Attention Based Automated Radiology Report Generation using CNN and LSTM
PLOS ONE

Dear Dr. Paracha,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 30 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Yifan Peng, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

1. Figure 9 on page 39 is black
2. Figure 4 on page 40 is black
3. I feel the table on page 12 is not Fig 4.

18 Dec 2021

We have made all updates with respect to figures mentioned by the editor

Submitted filename: Response File-II.docx

20 Dec 2021

Attention Based Automated Radiology Report Generation using CNN and LSTM
PONE-D-21-15198R3

Dear Dr. Paracha,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Yifan Peng, Ph.D.
Academic Editor
PLOS ONE

27 Dec 2021

PONE-D-21-15198R3
Attention based Automated Radiology Report Generation using CNN and LSTM

Dear Dr. Paracha:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Yifan Peng
Academic Editor
PLOS ONE

9 in total

1. Predicting Semantic Descriptions from Medical Images with Convolutional Neural Networks.

Authors: Thomas Schlegl; Sebastian M Waldstein; Wolf-Dieter Vogl; Ursula Schmidt-Erfurth; Georg Langs
Journal: Inf Process Med Imaging Date: 2015

2. Long short-term memory.

Authors: S Hochreiter; J Schmidhuber
Journal: Neural Comput Date: 1997-11-15 Impact factor: 2.026

Review 3. A survey on deep learning in medical image analysis.

Authors: Geert Litjens; Thijs Kooi; Babak Ehteshami Bejnordi; Arnaud Arindra Adiyoso Setio; Francesco Ciompi; Mohsen Ghafoorian; Jeroen A W M van der Laak; Bram van Ginneken; Clara I Sánchez
Journal: Med Image Anal Date: 2017-07-26 Impact factor: 8.545