4.7 Article

Multi-level knowledge-driven feature representation and triplet loss optimization network for image-text retrieval

Related references

Note: Only some of the references are listed.
Article Computer Science, Information Systems

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Kun Zhang et al.

Summary: Image-text matching is a fundamental cross-modal task that bridges the gap between vision and language. We propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism to accurately learn semantic alignment by distinguishing relevant and irrelevant distributions. Experimental results show that UARDA outperforms state-of-the-art methods and reduces retrieval time substantially.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Image-Text Embedding Learning via Visual and Textual Semantic Reasoning

Kunpeng Li et al.

Summary: As a hot research topic, cross-modal retrieval between images and texts is challenging due to the lack of semantic concepts in current image representations. To address this, we introduce an intuitive and interpretable model that learns a common embedding space for image-text alignments. Our model incorporates semantic relationship information and performs global semantic reasoning to capture key objects and concepts. Experiments show that our method surpasses state-of-the-art approaches and is highly efficient at the inference stage.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Computer Science, Information Systems

Rare-aware attention network for image-text matching

Yan Wang et al.

Summary: This paper proposes a novel rare-aware attention network (RAAN) to address the long-tail effect in image and text matching, by exploring and exploiting rare content. The RAAN utilizes rare attention matching and rareness representation to strengthen similarity calculation, achieving leading performance on large-scale databases.

INFORMATION PROCESSING & MANAGEMENT (2023)

Article Computer Science, Information Systems

Label-attention transformer with geometrically coherent objects for image captioning

Shikha Dubey et al.

Summary: This work applies the transformer to encoder-decoder-based image captioning and investigates two previously unexplored ideas: an object-focused label attention module (LAM) and a geometrically coherent proposal (GCP) module. These modules enforce objects' relevance and explore the effectiveness of learning the association between vision and language constructs. Experimental results show that the proposed framework, LATGeO, generates improved and meaningful captions.

INFORMATION SCIENCES (2023)

Article Automation & Control Systems

Image caption generation using a dual attention mechanism

Roshni Padate et al.

Summary: Existing image captioning models can be classified into retrieval-oriented and generation-oriented schemes. This article introduces a new image captioning model that includes three main phases: feature extraction, dual attention generation, and caption generation. The model uses a CNN for visual attention, an LSTM for textual attention, and a BI-LSTM to combine both modalities. The weights of the BI-LSTM are optimized using the SI-EFO algorithm. Experimental results show improvement over several other models.

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Body Part-Based Representation Learning for Occluded Person Re-Identification

Vladimir Somers et al.

Summary: In this work, we propose BPBreID, a body part-based ReID model for solving the challenges of occlusions and non-discriminative local appearance. Our method designs two modules to predict body part attention maps and generate part-based features, and introduces the GiLt training scheme for robust part-based representation learning. Extensive experiments show the effectiveness of our proposed method, achieving a 0.7% mAP improvement and a 5.6% rank-1 accuracy improvement on the challenging Occluded-Duke dataset.

2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) (2023)

Article Engineering, Electrical & Electronic

Region Reinforcement Network With Topic Constraint for Image-Text Matching

Jie Wu et al.

Summary: Image and sentence matching, which combines vision and language, has gained increasing attention. Previous methods ignored the relationships between image regions and considered all region-word pairs equally. This paper proposes a novel method, the Region Reinforcement Network with Topic Constraint (RRTC), to explore the correspondences between images and texts. It builds a region reinforcement network to infer fine-grained correspondence by considering the relationships of regions and re-assigning region-word similarities. The topic constraint module summarizes the central theme of images and constrains the deviation of the original image.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Negative-Aware Attention Framework for Image-Text Matching

Kun Zhang et al.

Summary: This paper proposes a novel Negative-Aware Attention Framework (NAAF) for image-text matching, which utilizes both the positive effect of matched fragments and the negative effect of mismatched fragments to improve the performance.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Proceedings Paper Computer Science, Theory & Methods

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Shunyu Zhang et al.

Summary: This paper proposes a novel model that uses multi-structure commonsense knowledge for reasoning in Visual Dialog. Experimental results demonstrate that this model outperforms the compared methods.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022 (2022)

Proceedings Paper Computer Science, Theory & Methods

Multi-view Multi-label Canonical Correlation Analysis for Cross-modal Matching and Retrieval

Rushil Sanghavi et al.

Summary: In this paper, the problem of cross-modal retrieval in the presence of multi-view and multi-label data is addressed. The authors propose a Multi-view Multi-label Canonical Correlation Analysis (MVMLCCA) method, which generalizes CCA for multi-view data and utilizes high-level semantic information in the form of multi-label annotations. The proposed MVMLCCA method establishes correspondence across multiple views without explicit pairing of multi-view samples. Extensive experiments demonstrate that this approach offers more flexibility without compromising scalability and cross-modal retrieval performance.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022 (2022)

Article Computer Science, Information Systems

Cross-modal Graph Matching Network for Image-text Retrieval

Yuhao Cheng et al.

Summary: Image-text retrieval is a fundamental task in cross-modal research. Existing methods can be classified into independent representation matching and cross-interaction matching. This article proposes a method called CGMN, which explores both intra- and inter-relations without introducing network interaction. The experiments show that CGMN outperforms state-of-the-art methods in image retrieval and is more efficient than interactive matching methods.

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS (2022)

Article Computer Science, Information Systems

Harmonious Multi-branch Network for Person Re-identification with Harder Triplet Loss

Zengming Tang et al.

Summary: This article proposes a novel harmonious multi-branch network (HMBN) to address issues in person re-identification. HMBN learns pedestrian information using different stripes on different branches and solves intra-branch and inter-branch problems through horizontal overlapped partitioning and attention mechanism. Experimental results demonstrate the superiority of HMBN over other methods.

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS (2022)

Article Computer Science, Information Systems

Multi-level similarity learning for image-text retrieval

Wen-Hui Li et al.

Summary: This paper proposes a multi-level representation learning method that improves the quality of image-text retrieval by utilizing semantic-level, structural-level, and contextual-level information. The experiments demonstrate the superiority of this method on two commonly used datasets.

INFORMATION PROCESSING & MANAGEMENT (2021)

Article Engineering, Electrical & Electronic

CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval

Xin Wen et al.

Summary: A novel cross memory network with pair discrimination (CMPD) is proposed for image-text cross-modal retrieval, demonstrating superior performance compared to state-of-the-art approaches. The method uses cross memory as a set of latent concepts and a pair discrimination loss to capture semantic relationships efficiently.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Engineering, Electrical & Electronic

Learning Dual Semantic Relations With Graph Attention for Image-Text Matching

Keyu Wen et al.

Summary: In this work, a novel multi-level semantic relations enhancement approach named DSRAN is proposed to address the issue of mismatch between regional features and global features in image-text matching. DSRAN consists of two modules, performing graph attention for region-level relations enhancement and regional-global relations enhancement simultaneously. The experimental results show that DSRAN outperforms previous approaches by a large margin, demonstrating the effectiveness of the dual semantic relations learning scheme.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Computer Science, Artificial Intelligence

Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval

Jianfeng Dong et al.

Summary: This paper introduces a new task of domain adaptive cross-modal retrieval, addressing the scenario where training and testing data come from different domains. By proposing a Multi-level Alignment Network (MAN), the semantic, modality, and domain gaps are effectively reduced, enhancing the generalization ability for target data. Experiments show that MAN outperforms multiple baselines and achieves a new state-of-the-art in large-scale text-to-video retrieval.

NEUROCOMPUTING (2021)

Article Computer Science, Artificial Intelligence

Scalable multi-label canonical correlation analysis for cross-modal retrieval

Xin Shu et al.

Summary: In this paper, a novel framework is proposed to integrate semantic correlation and feature correlation for cross-modal retrieval. By using semantic transformation, the model avoids explicitly computing the covariance matrix, which substantially reduces computational cost. Experimental results demonstrate the accuracy and efficiency of the proposed method on three multi-label datasets.

PATTERN RECOGNITION (2021)

Proceedings Paper Computer Science, Artificial Intelligence

VSR++: Improving Visual Semantic Reasoning for Fine-Grained Image-Text Matching

Hui Yuan et al.

Summary: The Improved Visual Semantic Reasoning model (VSR++) addresses the challenges in fine-grained image-text matching by jointly modeling global alignment and local correspondence. With a suitable learning strategy to balance their importance, the model achieves state-of-the-art performance on two benchmark datasets by distinguishing image regions and text words at a fine-grained level.

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) (2021)

Article Computer Science, Artificial Intelligence

Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval

Jiangtong Li et al.

Summary: The MEMBER method introduces global memory banks to enable fine-grained alignment and fusion between images and texts within the embedding learning paradigm, achieving mutual embedding enhancement while maintaining retrieval efficiency. Extensive experiments show that MEMBER outperforms state-of-the-art approaches on two large-scale benchmark datasets.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Article Computer Science, Information Systems

Multi-Level Correlation Adversarial Hashing for Cross-Modal Retrieval

Xinhong Ma et al.

IEEE TRANSACTIONS ON MULTIMEDIA (2020)

Article Computer Science, Artificial Intelligence

Cross-Modal Attention With Semantic Consistence for Image-Text Matching

Xing Xu et al.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2020)

Article Computer Science, Artificial Intelligence

Learning Two-Branch Neural Networks for Image-Text Matching Tasks

Liwei Wang et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2019)

Article Computer Science, Artificial Intelligence

Bidirectional image-sentence retrieval by local and global deep matching

Lin Ma et al.

NEUROCOMPUTING (2019)

Article Computer Science, Artificial Intelligence

CycleMatch: A cycle-consistent embedding network for image-text matching

Yu Liu et al.

PATTERN RECOGNITION (2019)

Article Computer Science, Information Systems

CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

Yuxin Peng et al.

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS (2019)