Masked Vision-language Transformer in Fashion

Article Computer Science, Artificial Intelligence

Modality-Invariant Asymmetric Networks for Cross-Modal Hashing

Zheng Zhang et al.

Summary: This paper proposes Modality-Invariant Asymmetric Networks (MIAN), an architecture that explores asymmetric intra- and inter-modal similarity preservation under a probabilistic modality alignment framework. MIAN outperforms state-of-the-art cross-modal hashing methods. The approach incorporates pairwise, piecewise, and transformed semantics into a unified semantic-preserving hash code learning scheme.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)

Article Computer Science, Artificial Intelligence

Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation

Deyin Liu et al.

Summary: Person image generation conditioned on natural language allows us to personalize image editing in a user-friendly manner. We propose a novel pose-guided multi-granularity attention architecture to synthesize person images. By incorporating sentence-level description and pose feature maps, we generate a coarse person image and further enhance it by drawing human body parts with highly correlated textual nouns and determining the spatial positions with respect to target pose points.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023)

Article Computer Science, Artificial Intelligence

Salient Object Detection via Integrity Learning

Mingchen Zhuge et al.

Summary: Although current salient object detection (SOD) works have made significant progress, they are limited in preserving the integrity of predicted salient regions. To address this issue, a novel Integrity Cognition Network (ICON) is proposed, which leverages diverse feature aggregation, integrity channel enhancement, and part-whole verification to learn strong integrity features. Experimental results on seven benchmarks demonstrate that ICON outperforms baseline methods and achieves around 10% improvement in average false negative ratio (FNR). The code and results are available at: https://github.com/mczhuge/ICON.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Masked Autoencoders Are Scalable Vision Learners

Kaiming He et al.

Summary: This paper presents a self-supervised learning method for computer vision based on masked autoencoders. By masking a portion of the input image and reconstructing the missing pixels, large models can be trained efficiently and effectively. The approach achieves high generalization performance and outperforms supervised pretraining in transfer learning tasks.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Sonam Goenka et al.

Summary: This study proposes FashionVLP, a new vision-language transformer-based model for fashion image retrieval. The model draws on prior knowledge from large image-text corpora and combines visual information from multiple levels of context to capture fashion-related information effectively. FashionVLP achieves state-of-the-art performance on benchmark datasets, with a significant improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Article Automation & Control Systems

Paradigm Shift in Natural Language Processing

Tian-Xiang Sun et al.

Summary: In recent years, with the development of deep learning, modeling for natural language processing tasks has converged on several mainstream paradigms. Driven by the rapid progress of pre-trained language models, however, paradigm shifts have become a trend and have achieved success in many tasks. Some paradigms also show potential to unify a large number of NLP tasks.

MACHINE INTELLIGENCE RESEARCH (2022)

Proceedings Paper Computer Science, Artificial Intelligence

M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining

Junyang Lin et al.

Summary: This work introduces the largest dataset for pretraining in Chinese and proposes a method called M6 for multimodal pretraining, achieving significant success in various domains.

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING (2021)

Proceedings Paper Computer Science, Artificial Intelligence

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Xudong Lin et al.

Summary: VX2TEXT is a text generation framework that converts multimodal inputs (video, text, speech, audio) into language embeddings for fusion, enabling direct application to video-based text generation tasks. It outperforms existing models on video-based text generation tasks and is conceptually simple and effective.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Mingchen Zhuge et al.

Summary: Kaleido-BERT is a novel vision-language pre-training model that uses a kaleido strategy to enhance cross-modality representations. It is trained with self-supervision on five different tasks and achieves significant improvements on four downstream tasks, demonstrating potential for real-world applications.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Information Systems