Mingjie Li, Rui Liu, Fuyu Wang, Xiaojun Chang, Xiaodan Liang.
Abstract
Medical reports have significant clinical value for radiologists and specialists, especially during a pandemic such as COVID-19. However, beyond the difficulties common to natural image captioning, medical report generation specifically requires the model to describe a medical image with a fine-grained and semantically coherent paragraph that satisfies both medical common sense and logic. Previous works generally extract global image features and attempt to generate a paragraph similar to the referenced reports; however, this approach has two limitations. First, the regions of primary interest to radiologists are usually located in a small area of the global image, meaning that the remaining parts of the image can be regarded as irrelevant noise during training. Second, each medical report contains many similar sentences describing the normal regions of the image, which causes serious data bias. This bias is likely to teach models to generate these inessential sentences on a regular basis. To address these problems, we propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) that mimics radiologists' working patterns. Specifically, auxiliary patches are explored to expand the widely used visual patch features before they are fed to the Transformer encoder, while external linguistic signals help the decoder better master prior knowledge during pre-training. Our approach performs well on common benchmarks, including the CX-CHR, IU X-Ray, and COVID-19 CT Report (COV-CTR) datasets, demonstrating that combining auxiliary signals with a Transformer architecture brings a significant improvement in medical report generation. The experimental results confirm that auxiliary-signal-driven Transformer-based models outperform previous approaches on both medical terminology classification and paragraph generation metrics.
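As a rough illustration of the encoding step described above (auxiliary patches appended to the usual visual patch features before the Transformer encoder), the following NumPy sketch concatenates the two token sets and runs a single toy self-attention pass. The function names, feature dimensions, and patch counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def expand_with_auxiliary(patch_feats, aux_feats):
    """Append auxiliary-region patch features to the global visual patch
    features so the encoder attends to both (hypothetical helper)."""
    # patch_feats: (N, d) global patches; aux_feats: (M, d) auxiliary patches
    return np.concatenate([patch_feats, aux_feats], axis=0)

def self_attention(tokens):
    """One scaled-dot-product self-attention pass over the token sequence,
    standing in for a full Transformer encoder layer."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
# 49 global patches and 5 auxiliary patches, 16-dim features (made-up sizes)
tokens = expand_with_auxiliary(rng.normal(size=(49, 16)),
                               rng.normal(size=(5, 16)))
encoded = self_attention(tokens)
print(encoded.shape)  # (54, 16)
```

Because the auxiliary patches are simply extra tokens, every global patch can attend to them in the same attention pass, which is the mechanism the abstract appeals to.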
Keywords: Auxiliary signals; Generative pre-training; Medical report generation; Transformer
Year: 2022 PMID: 36060430 PMCID: PMC9417931 DOI: 10.1007/s11280-022-01013-6
Source DB: PubMed Journal: World Wide Web ISSN: 1386-145X Impact factor: 3.000
Fig. 1 Two samples from the CX-CHR and our COV-CTR datasets. Red bounding boxes annotated by a radiologist indicate the regions the radiologist pays most attention to when describing the image. Red text describes the abnormalities. Underlined text indicates alignment between ground truth reports and generated reports
Fig. 2 An overview of our ASGK approach. The ASGK model consists of a medical graph encoder and a natural language decoder. The medical graph encoder encodes input features into the corresponding medical tag graph, while the natural language decoder transfers high-level information to sentences or reports. The external signals guide the pre-training procedure, while the internal signals guide the model to bridge linguistic and visual information. T and MCS denote the threshold and max connection select operations, respectively
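The T (threshold) and MCS (max connection select) operations in the caption can be sketched as: drop tags whose classification probability falls below a threshold, then keep only each surviving tag's strongest edges. The cutoffs `t` and `k` below are illustrative assumptions, as is the helper name `build_tag_graph`.

```python
import numpy as np

def build_tag_graph(node_probs, edge_weights, t=0.5, k=2):
    """Sketch of T and MCS: threshold the tag probabilities, then for each
    kept tag retain only its k strongest connections to other kept tags."""
    kept = np.flatnonzero(node_probs > t).tolist()
    edges = []
    for i in kept:
        # candidate neighbours among the kept nodes, excluding self
        nbrs = [j for j in kept if j != i]
        nbrs.sort(key=lambda j: edge_weights[i, j], reverse=True)
        for j in nbrs[:k]:
            edges.append((i, j, float(edge_weights[i, j])))
    return kept, edges

# toy graph: 4 medical tags with classification probabilities and edge weights
probs = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([[0.0, 0.1, 0.8, 0.3],
              [0.1, 0.0, 0.2, 0.4],
              [0.8, 0.2, 0.0, 0.5],
              [0.3, 0.4, 0.5, 0.0]])
kept, edges = build_tag_graph(probs, w)
print(kept)  # [0, 2, 3] -- tag 1 falls below the threshold
```

The resulting sparse graph is what the decoder consumes; sparsifying with MCS keeps only the most informative tag co-occurrences.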
Statistics of the COV-CTR, CX-CHR and IU X-Ray datasets
| Statistics | COV-CTR | CX-CHR | IU X-Ray |
|---|---|---|---|
| Patients | − | 35,609 | 3867 |
| Images | 728 | 45,598 | 7470 |
| Normalities | − | 18 | − |
| Abnormalities | − | 155 | − |
| Vocabulary Size | 235 | 27,683 | 2791 |
| Max. Sen. Num. | 14 | 24 | 18 |
| Max. Sen. Len. | 37 | 38 | 42 |
| Max. Rep. Len. | 127 | 216 | 173 |
| Avg. Sen. Len. | 8.197 | 7.111 | 6.997 |
| Avg. Rep. Len. | 77.274 | 64.858 | 32.450 |
Fig. 3 We evaluate our model after each epoch and report BLEU-4 and CIDEr values on the validation and testing sets
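For reference, the B@n scores reported in the tables below can be approximated with a minimal sentence-level BLEU implementation with a brevity penalty. This is a sketch rather than the official evaluation toolkit, and it omits the smoothing and corpus-level aggregation a real evaluation would use.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Minimal single-pair BLEU up to max_n-grams with brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the heart size is normal".split()
print(round(bleu(ref, ref), 3))  # 1.0 -- an identical report scores perfectly
```

CIDEr additionally weights n-grams by TF-IDF across the corpus, which is why it rewards clinically distinctive phrases more than BLEU does.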
Evaluation metrics on the CX-CHR and COV-CTR datasets comparing ASGK with other methods
| Dataset | Model | C | R | B@1 | B@2 | B@3 | B@4 | Hit(%) |
|---|---|---|---|---|---|---|---|---|
| CX-CHR | CoAtt | 273.5 | | 64.7 | 57.5 | 52.5 | 48.7 | 8.0 |
| | HRGR | 289.5 | 61.2 | 67.3 | 58.7 | 53.0 | 48.6 | − |
| | KERP | 285.0 | 61.8 | 67.3 | 58.8 | 53.2 | 47.3 | − |
| | V-BERT | 302.4 | 63.7 | | 60.1 | 54.1 | 50.3 | 19.0 |
| | V-GPT | 301.8 | 63.0 | 67.9 | 59.6 | 54.0 | 48.7 | − |
| | SAT | 311.2 | 63.3 | 62.3 | 55.2 | 53.9 | 48.1 | − |
| | R2Gen | 310.2 | 63.3 | 68.1 | 60.2 | 54.3 | 50.1 | − |
| | Ours | | 64.1 | | | | | |
| COV-CTR | CoAtt | 67.2 | 74.8 | 70.9 | 64.5 | 60.3 | 55.2 | 25.0 |
| | SAT | 65.9 | 72.3 | 69.7 | 62.1 | 56.8 | 51.5 | − |
| | AdaAtt | 68.2 | 72.6 | 67.6 | 63.3 | 59.6 | 51.4 | − |
| | V-BERT | | 74.7 | 71.0 | 65.3 | 60.6 | 55.8 | 26.0 |
| | V-GPT | 68.0 | | 70.8 | 64.5 | 60.0 | 54.9 | − |
| | R2Gen | 67.2 | 73.2 | 69.3 | 61.1 | 55.9 | 51.8 | − |
| | TopDown | 63.1 | 72.1 | 70.5 | 65.3 | 60.9 | 56.1 | − |
| | Ours | | | | | | | |
C and R are short for CIDEr-D and ROUGE-L. B@n denotes the BLEU score using up to n-grams. Hit represents the human evaluation results
The bold numbers are the largest in each column
Comparison of report generation models on three metrics on the Open-IU dataset
| Model | Bleu-4 | Cider-D | Rouge-L |
|---|---|---|---|
| CARG | 11.3 | − | 35.4 |
| KERP | | 28.0 | 33.9 |
| TieNet | 8.1 | − | 31.1 |
| SentSAT | 14.3 | 26.8 | 35.9 |
| SentSAT+KG | 14.7 | 30.4 | |
| Ours | 12.5 | 27.9 | |
As some of these works are not open-sourced, we directly use the results reported in their papers
The bold numbers are the largest in each column
Ablation studies for different auxiliary signals
| Dataset | Model | CIDER-D | ROUGE-L | BLEU-4 | AUC |
|---|---|---|---|---|---|
| CX-CHR | baseline | 289.7 | 61.3 | 48.3 | 78.7 |
| | baseline+IA+CE | 304.6 | 62.5 | 48.9 | 82.1 |
| | baseline+IA | 305.3 | 62.7 | 49.1 | 83.2 |
| | baseline+EA | 317.2 | 63.8 | 52.0 | 79.3 |
| | baseline+IA+EA | | | | |
| COV-CTR | baseline | 59.1 | 68.3 | 52.5 | 72.7 |
| | baseline+IA+CE | 61.3 | 70.2 | 54.1 | 79.0 |
| | baseline+IA | 62.8 | 70.5 | 54.2 | 79.7 |
| | baseline+EA | 66.9 | 72.0 | 55.6 | 74.5 |
| | baseline+IA+EA | | | | |
IA, EA and CE are short for “internal auxiliary signals”, “external auxiliary signals” and “cross entropy”. Four metrics are adopted to evaluate our model on the two datasets
The bold numbers are the largest in each column
Fig. 4 Sample output of our approach on both the CX-CHR and COV-CTR datasets. We use the outputs before the last pooling layer of DenseNet-121 to generate heat maps, then apply a threshold to produce the suspicious regions
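The heat-map step in the caption (pre-pooling DenseNet-121 activations turned into a normalized map and thresholded into suspicious regions) can be sketched as follows. The channel-averaging and the threshold value are assumptions for illustration; the paper's exact threshold is not stated in this record.

```python
import numpy as np

def heatmap_regions(feature_maps, thresh=0.7):
    """Sketch: average the pre-pooling activations over channels, normalize
    to [0, 1], and threshold into a binary mask of suspicious regions.
    `thresh` is an illustrative value, not the paper's."""
    heat = feature_maps.mean(axis=0)                      # (C, H, W) -> (H, W)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat, heat > thresh

# DenseNet-121 emits 1024 channels of 7x7 activations for a 224x224 input
rng = np.random.default_rng(1)
heat, mask = heatmap_regions(rng.random((1024, 7, 7)))
print(heat.shape)  # (7, 7)
```

In practice the 7x7 map would be upsampled to the input resolution before overlaying it on the radiograph, as in Figure 4.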
Fig. 5 Sample output of our approach on both the CX-CHR and COV-CTR datasets. In the medical tag graphs, we show nodes whose value (equal to the classification probability) exceeds 0.5 and edges whose weight exceeds 0.3. For readability, we show the values of some edges in appropriate places. Underlined text indicates alignment between ground truth reports and generated reports