Article

Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval

Journal

IEEE TRANSACTIONS ON CYBERNETICS
Volume 50, Issue 6, Pages 2400-2413

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCYB.2019.2928180

Keywords

Semantics; Correlation; Knowledge transfer; Standards; Task analysis; Training; Feature extraction; Adversarial learning; cross-modal retrieval; self-supervision; zero-shot learning (ZSL)

Funding

  1. National Natural Science Foundation of China [61602089, 61632007, 61772116, 61572108]
  2. Leading Initiative for Excellent Young Researcher of Ministry of Education, Culture, Sports, Science, and Technology, Japan [16809746]
  3. Research Fund of the Telecommunications Advancement Foundation
  4. Sichuan Science and Technology Program of China [2018GZDZX0032]

Abstract

Given a query instance from one modality (e.g., image), cross-modal retrieval aims to find semantically similar instances from another modality (e.g., text). To perform cross-modal retrieval, existing approaches typically learn a common semantic space from a labeled source set and directly produce common representations in the learned space for the instances in a target set. These methods commonly require that the instances of both sets share the same classes. Intuitively, they may not generalize well to the more practical scenario of zero-shot cross-modal retrieval, in which the target set contains instances of unseen classes whose semantics are inconsistent with those of the seen classes in the source set. Inspired by zero-shot learning, we propose a novel model called ternary adversarial networks with self-supervision (TANSS) in this paper, to overcome the limitation of the existing methods on this challenging task. Our TANSS approach consists of three parallel subnetworks: 1) two semantic feature learning subnetworks that capture the intrinsic data structures of different modalities and preserve the modality relationships via semantic features in the common semantic space; 2) a self-supervised semantic subnetwork that leverages the word vectors of both seen and unseen labels as guidance to supervise the semantic feature learning and enhance the knowledge transfer to unseen labels; and 3) an adversarial learning scheme that maximizes the consistency and correlation of the semantic features between different modalities. The three subnetworks are integrated in our TANSS to form an end-to-end network architecture that enables efficient iterative parameter optimization. Comprehensive experiments on three cross-modal datasets show the effectiveness of our TANSS approach compared with the state-of-the-art methods for zero-shot cross-modal retrieval.
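The retrieval step described in the abstract, ranking instances of one modality against a query from another once both live in a common semantic space, can be illustrated with a minimal sketch. This is not the TANSS model: no subnetworks are trained here, and the embedding values, the item names, and the `cosine` and `retrieve` helpers are all hypothetical, chosen only to show how a shared space makes image-to-text ranking possible. Cosine similarity is assumed as the ranking measure; the abstract does not say which measure the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=2):
    """Rank gallery items by similarity to the query and return the top_k (name, score) pairs."""
    scored = [(name, cosine(query, vec)) for name, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings, assumed to already lie in a learned common space.
# In the zero-shot setting, the gallery classes would be unseen during training.
image_query = [0.9, 0.1, 0.2]
text_gallery = {
    "text_zebra": [0.8, 0.2, 0.1],
    "text_piano": [0.1, 0.9, 0.3],
    "text_canoe": [0.2, 0.1, 0.9],
}

ranking = retrieve(image_query, text_gallery, top_k=3)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the text item whose vector points in nearly the same direction as the image query is ranked first. The quality of such a ranking depends entirely on how well the common space transfers to unseen classes, which is the problem the paper addresses.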
