Related references
Article
Computer Science, Information Systems
Kun Zhang et al.
Summary: Image-text matching is a fundamental cross-modal task that bridges the gap between vision and language. We propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism to accurately learn semantic alignment by distinguishing relevant and irrelevant distributions. Experimental results show that UARDA outperforms state-of-the-art methods and reduces retrieval time substantially.
IEEE TRANSACTIONS ON MULTIMEDIA
(2023)
Article
Computer Science, Artificial Intelligence
Kunpeng Li et al.
Summary: As a hot research topic, cross-modal retrieval between images and texts is challenging due to the lack of semantic concepts in current image representations. To address this, we introduce an intuitive and interpretable model that learns a common embedding space for image-text alignments. Our model incorporates semantic relationship information and performs global semantic reasoning to capture key objects and concepts. Experiments show that our method surpasses state-of-the-art approaches and is highly efficient at the inference stage.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2023)
Article
Computer Science, Information Systems
Yan Wang et al.
Summary: This paper proposes a novel rare-aware attention network (RAAN) to address the long-tail effect in image and text matching, by exploring and exploiting rare content. The RAAN utilizes rare attention matching and rareness representation to strengthen similarity calculation, achieving leading performance on large-scale databases.
INFORMATION PROCESSING & MANAGEMENT
(2023)
Article
Computer Science, Information Systems
Shikha Dubey et al.
Summary: This encoder-decoder image captioning framework uses a transformer and investigates two previously unexplored ideas: an object-focused label attention module (LAM) and a geometrically coherent proposal (GCP) module. These modules enforce objects' relevance and explore the effectiveness of learning the association between vision and language constructs. Experimental results show that the proposed framework, LATGeO, generates improved and meaningful captions.
INFORMATION SCIENCES
(2023)
Article
Automation & Control Systems
Roshni Padate et al.
Summary: Existing image captioning models can be classified into retrieval-oriented and generation-oriented schemes. This article introduces a new image captioning model that includes three main phases: feature extraction, dual attention generation, and caption generation. The model utilizes a CNN for visual attention, an LSTM for textual attention, and a BI-LSTM to combine both modalities. The weights of the BI-LSTM are optimized using the SI-EFO algorithm. Experimental results show improvement over several other models.
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Vladimir Somers et al.
Summary: In this work, we propose BPBreID, a body part-based ReID model for solving the challenges of occlusions and non-discriminative local appearance. Our method designs two modules to predict body part attention maps and generate part-based features, and introduces the GiLt training scheme for robust part-based representation learning. Extensive experiments show the effectiveness of our proposed method, achieving a 0.7% mAP improvement and a 5.6% rank-1 accuracy improvement on the challenging Occluded-Duke dataset.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Article
Engineering, Electrical & Electronic
Jie Wu et al.
Summary: Image and sentence matching, which combines vision and language, has gained increasing attention. Previous methods ignored the relationships between image regions and considered all region-word pairs equally. This paper proposes a novel method, the Region Reinforcement Network with Topic Constraint (RRTC), to explore the correspondences between images and texts. It builds a region reinforcement network to infer fine-grained correspondence by considering the relationships of regions and re-assigning region-word similarities. The topic constraint module summarizes the central theme of images and constrains the deviation of the original image.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2022)
Proceedings Paper
Computer Science, Artificial Intelligence
Kun Zhang et al.
Summary: This paper proposes a novel Negative-Aware Attention Framework (NAAF) for image-text matching, which utilizes both the positive effect of matched fragments and the negative effect of mismatched fragments to improve the performance.
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)
(2022)
Proceedings Paper
Computer Science, Theory & Methods
Shunyu Zhang et al.
Summary: This paper proposes a novel model that uses multi-structure commonsense knowledge for reasoning in Visual Dialog. Experimental results demonstrate that this model outperforms the compared methods.
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022
(2022)
Proceedings Paper
Computer Science, Theory & Methods
Rushil Sanghavi et al.
Summary: In this paper, the problem of cross-modal retrieval in the presence of multi-view and multi-label data is addressed. The authors propose a Multi-view Multi-label Canonical Correlation Analysis (MVMLCCA) method, which generalizes CCA for multi-view data and utilizes high-level semantic information in the form of multi-label annotations. The proposed MVMLCCA method establishes correspondence across multiple views without explicit pairing of multi-view samples. Extensive experiments demonstrate that this approach offers more flexibility without compromising scalability and cross-modal retrieval performance.
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022
(2022)
Article
Computer Science, Information Systems
Yuhao Cheng et al.
Summary: Image-text retrieval is a fundamental task in cross-modal research. Existing methods can be classified into independent representation matching and cross-interaction matching. This article proposes a method called CGMN, which explores both intra- and inter-relations without introducing network interaction. The experiments show that CGMN outperforms state-of-the-art methods in image retrieval and is more efficient than interactive matching methods.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
(2022)
Article
Computer Science, Information Systems
Zengming Tang et al.
Summary: This article proposes a novel harmonious multi-branch network (HMBN) to address issues in person re-identification. HMBN learns pedestrian information using different stripes on different branches and solves intra-branch and inter-branch problems through horizontal overlapped partitioning and an attention mechanism. Experimental results demonstrate the superiority of HMBN over other methods.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
(2022)
Article
Computer Science, Information Systems
Wen-Hui Li et al.
Summary: This paper proposes a multi-level representation learning method to improve the quality of image-text retrieval by utilizing semantic-level, structural-level, and contextual-level information. The experiments demonstrate the superiority of this method on two commonly used datasets.
INFORMATION PROCESSING & MANAGEMENT
(2021)
Article
Engineering, Electrical & Electronic
Xin Wen et al.
Summary: A novel cross memory network with pair discrimination (CMPD) is proposed for image-text cross-modal retrieval, demonstrating superior performance compared to state-of-the-art approaches. The method utilizes cross memory as a set of latent concepts and a pair discrimination loss to capture semantic relationships efficiently.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2021)
Article
Engineering, Electrical & Electronic
Keyu Wen et al.
Summary: In this work, a novel multi-level semantic relations enhancement approach named DSRAN is proposed to address the issue of mismatch between regional features and global features in image-text matching. DSRAN consists of two modules, performing graph attention for region-level relations enhancement and regional-global relations enhancement simultaneously. The experimental results show that DSRAN outperforms previous approaches by a large margin, demonstrating the effectiveness of the dual semantic relations learning scheme.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2021)
Article
Computer Science, Artificial Intelligence
Jianfeng Dong et al.
Summary: This paper introduces a new task of domain adaptive cross-modal retrieval, addressing the scenario where training and testing data come from different domains. By proposing a Multi-level Alignment Network (MAN), the semantic, modality, and domain gaps are effectively reduced, enhancing the generalization ability for target data. Experiments show that MAN outperforms multiple baselines and achieves a new state-of-the-art in large-scale text-to-video retrieval.
Article
Computer Science, Artificial Intelligence
Xin Shu et al.
Summary: In this paper, a novel framework is proposed to integrate semantic correlation and feature correlation for cross-modal retrieval. By using semantic transformation, the model avoids explicitly computing the covariance matrix, which leads to a huge saving of computational cost. Experimental results demonstrate the accuracy and efficiency of the proposed method on three multi-label datasets.
PATTERN RECOGNITION
(2021)
Proceedings Paper
Computer Science, Artificial Intelligence
Hui Yuan et al.
Summary: The Improved Visual Semantic Reasoning model (VSR++) addresses the challenges in fine-grained image-text matching by jointly modeling global alignment and local correspondence. With a suitable learning strategy to balance their importance, the model achieves state-of-the-art performance on two benchmark datasets by distinguishing image regions and text words at a fine-grained level.
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)
(2021)
Article
Computer Science, Artificial Intelligence
Jiangtong Li et al.
Summary: The MEMBER method introduces global memory banks to enable fine-grained alignment and fusion between images and texts in the embedding learning paradigm, achieving mutual embedding enhancement while maintaining retrieval efficiency. Extensive experiments show that MEMBER outperforms state-of-the-art approaches on two large-scale benchmark datasets.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2021)
Article
Computer Science, Information Systems
Xinhong Ma et al.
IEEE TRANSACTIONS ON MULTIMEDIA
(2020)
Article
Computer Science, Artificial Intelligence
Xing Xu et al.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2020)
Article
Computer Science, Artificial Intelligence
Liwei Wang et al.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2019)
Article
Computer Science, Artificial Intelligence
Lin Ma et al.
Article
Computer Science, Artificial Intelligence
Yu Liu et al.
PATTERN RECOGNITION
(2019)
Article
Computer Science, Information Systems
Yuxin Peng et al.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
(2019)