☆ 3.8 Proceedings Paper

Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 (2021)

Journal

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021

Volume -, Issue -, Pages 3436-3445

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/CVPRW53098.2021.00383

Keywords

Funding

Research Foundation Flanders (FWO Vlaanderen) [77410]
European Union [101017255]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

Automatic sign language recognition involves the intersection of natural language processing and computer vision, utilizing multi-modal multi-head attention mechanism. By pre-extracting useful information from sign language videos, the impact of data limitation on recognition accuracy is reduced, leading to significant improvement in performance.

Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup. It includes per-frame human pose keypoints (extracted by OpenPose) to capture the body movement and hand crops to capture the (evolution of) hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison: the VTN architecture without hand crops and pose flow achieved 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.

Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

Journal

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021

Publisher

IEEE COMPUTER SOC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

Journal

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021

Publisher

IEEE COMPUTER SOC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper