Article

Automated Radiographic Report Generation Purely on Transformer: A Multicriteria Supervised Approach

Journal

IEEE TRANSACTIONS ON MEDICAL IMAGING
Volume 41, Issue 10, Pages 2803-2813

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMI.2022.3171661

Keywords

Transformers; Visualization; Medical diagnostic imaging; Feature extraction; Task analysis; Decoding; Training; Medical report generation; image caption; transformer; image-text matching

Funding

  1. National Key Research and Development Program of China [2020AAA0108303]
  2. National Natural Science Foundation of China (NSFC) [41876098]

Abstract

This paper proposes a pure transformer-based framework for automated radiographic report generation in the medical field. It addresses the challenges of visual similarity among medical images and the importance of disease-related words. By improving visual-textual alignment, multi-label classification, and word importance weighting, the framework achieves promising performance in generating accurate reports.
Automated radiographic report generation is challenging in at least two respects. First, medical images are very similar to one another, and the visual differences of clinical importance are often fine-grained. Second, disease-related words may be submerged by the many similar sentences describing the common content of the images, causing abnormal findings to be misinterpreted as normal in the worst case.

To tackle these challenges, this paper proposes a pure transformer-based framework that jointly enforces better visual-textual alignment, multi-label diagnostic classification, and word importance weighting to facilitate report generation. To the best of our knowledge, this is the first pure transformer-based framework for medical report generation, and it enjoys the capacity of transformers to learn long-range dependencies for both image regions and sentence words.

Specifically, for the first challenge, we design a novel mechanism that embeds an auxiliary image-text matching objective into the transformer's encoder-decoder structure, so that better-correlated image and text features can be learned to help a report discriminate between similar images. For the second challenge, we integrate an additional multi-label classification task into our framework to guide the model toward correct diagnostic predictions. We also propose a term-weighting scheme that reflects the importance of words during training, so that the model does not miss key discriminative information. Our work achieves promising performance over state-of-the-art methods on two benchmark datasets, including the largest dataset, MIMIC-CXR.
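The abstract names three supervision signals that are combined during training: a term-weighted generation loss, an image-text matching objective, and a multi-label diagnostic classification loss. The paper's exact formulations are not given here, so the following is only a minimal sketch of how such a multicriteria objective could be combined; the function names, the triplet-ranking form of the matching loss, and the weighting factors `lam_match` and `lam_cls` are all assumptions for illustration.

```python
import math

def weighted_token_nll(token_probs, weights):
    # Term-weighted negative log-likelihood: each ground-truth token's
    # probability is penalized in proportion to its importance weight,
    # so rare disease-related words are not drowned out by common ones.
    return sum(w * -math.log(p) for p, w in zip(token_probs, weights)) / sum(weights)

def matching_loss(sim_pos, sims_neg, margin=0.2):
    # Hinge-style image-text matching (triplet ranking, an assumption):
    # the matched image-report pair should score higher than mismatched
    # pairs by at least `margin`.
    return sum(max(0.0, margin - sim_pos + s) for s in sims_neg) / len(sims_neg)

def multilabel_bce(tag_probs, tags):
    # Multi-label binary cross-entropy over disease tags, one independent
    # binary decision per tag.
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(tag_probs, tags)) / len(tags)

def total_loss(token_probs, weights, tag_probs, tags, sim_pos, sims_neg,
               lam_match=1.0, lam_cls=1.0):
    # Joint objective: generation + matching + classification, with
    # hypothetical trade-off weights lam_match and lam_cls.
    return (weighted_token_nll(token_probs, weights)
            + lam_match * matching_loss(sim_pos, sims_neg)
            + lam_cls * multilabel_bce(tag_probs, tags))
```

In a real implementation these terms would operate on encoder/decoder outputs of the transformer; the sketch only shows how the three criteria contribute to a single scalar loss.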
