Article

Fine-Grained Cross-Modal Retrieval for Cultural Items with Focal Attention and Hierarchical Encodings

Journal

COMPUTERS
Volume 10, Issue 9

Publisher

MDPI
DOI: 10.3390/computers10090105

Keywords

attention mechanism; hierarchical representation; indicator function

Funding

  1. Project MARVEL (Mobile Augmented Reality and Vocal Explanations for Landmarks and visual arts), RUN Financing, KU Leuven
  2. Research Foundation-Flanders (FWO) [1S55420N]


This paper presents a weakly supervised alignment model for fine-grained image-text alignment and cross-modal retrieval in the cultural heritage domain; the model uses attention mechanisms and a common semantic space and achieves superior performance compared to two state-of-the-art methods.
In this paper, we target the tasks of fine-grained image-text alignment and cross-modal retrieval in the cultural heritage domain as follows: (1) given an image fragment of an artwork, we retrieve the noun phrases that describe it; (2) given a noun phrase describing an artifact attribute, we retrieve the image fragment it refers to. To this end, we propose a weakly supervised alignment model in which the correspondence between the training visual and textual fragments is not known; instead, the image and text units that refer to the same artwork are treated as a positive pair. The model exploits the latent alignment between fragments across modalities with attention mechanisms, after first projecting them into a shared common semantic space; it is then trained by increasing the image-text similarity of the positive pair in that space. During this process, we encode the model inputs with hierarchical encodings and remove irrelevant fragments with different indicator functions. We also study techniques for augmenting the limited training data with synthetic relevant textual fragments and transformed image fragments. The model is subsequently fine-tuned on a small set of image-text fragment pairs. At test time, we rank image fragments and noun phrases by their intermodal similarity in the learned common space. Extensive experiments demonstrate that our proposed models outperform two state-of-the-art methods adapted to fine-grained cross-modal retrieval of cultural items on two benchmark datasets.
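The abstract outlines the approach at a high level. Below is a minimal PyTorch sketch of the two ideas it describes: attention-based latent alignment of fragments in a common semantic space, and a hinge-style ranking objective over weakly supervised positive pairs. The module names, dimensions, temperature, and margin are illustrative assumptions, not the authors' implementation; the hierarchical encodings, indicator functions, and data augmentation mentioned in the abstract are omitted here.

```python
# Minimal sketch (assumed design, not the paper's code): project image and text
# fragments into a common space, align them with attention, and train with a
# hinge ranking loss where units of the same artwork form the positive pair.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Score an image-text pair by letting each noun phrase attend over the
    image fragments in a shared common semantic space."""

    def __init__(self, img_dim=2048, txt_dim=768, common_dim=512, temperature=9.0):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)   # visual fragments -> common space
        self.txt_proj = nn.Linear(txt_dim, common_dim)   # noun phrases -> common space
        self.temperature = temperature                   # sharpens attention weights

    def forward(self, img_frags, txt_frags):
        # img_frags: (n_regions, img_dim), txt_frags: (n_phrases, txt_dim)
        v = F.normalize(self.img_proj(img_frags), dim=-1)   # (n_regions, d)
        t = F.normalize(self.txt_proj(txt_frags), dim=-1)   # (n_phrases, d)

        # Latent alignment: each noun phrase attends over all image fragments.
        sim = t @ v.T                                        # (n_phrases, n_regions)
        attn = F.softmax(self.temperature * sim, dim=-1)
        attended = attn @ v                                  # phrase-specific visual context

        # Image-text similarity: mean cosine similarity between each phrase
        # and its attended visual representation.
        return F.cosine_similarity(t, attended, dim=-1).mean()


def hinge_ranking_loss(model, pos_pairs, margin=0.2):
    """Weak supervision: fragments of the same artwork form a positive pair;
    other artworks in the batch serve as negatives."""
    losses = []
    for i, (img_i, txt_i) in enumerate(pos_pairs):
        s_pos = model(img_i, txt_i)
        for j, (img_j, txt_j) in enumerate(pos_pairs):
            if i == j:
                continue
            # Push the positive pair above mismatched pairs by a margin.
            losses.append(F.relu(margin - s_pos + model(img_i, txt_j)))
            losses.append(F.relu(margin - s_pos + model(img_j, txt_i)))
    return torch.stack(losses).mean()
```

At test time, the same similarity in the learned common space is used directly for retrieval: candidate noun phrases (or image fragments) are ranked by their score against the query fragment.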
