Article

TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition

Journal

Signal Processing: Image Communication
Volume 71, Pages 76-87

Publisher

Elsevier
DOI: 10.1016/j.image.2018.09.003

Keywords

Video understanding; Action recognition; Convolutional neural network; Recurrent neural network

Funding

  1. National Science Foundation
  2. National Robotics Initiative [IIS-1426998]
  3. Division of Information & Intelligent Systems
  4. Directorate for Computer & Information Science & Engineering [1426998] (Funding Source: National Science Foundation)

Abstract

Recent two-stream deep Convolutional Neural Networks (ConvNets) have made significant progress in recognizing human actions in videos. Despite their success, methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences. Furthermore, such methods often build on different baseline two-stream networks, so the differences and distinguishing factors between methods using Recurrent Neural Networks (RNNs) and those using Convolutional Neural Networks on temporally-constructed feature vectors (Temporal-ConvNets) are unclear. In this work, we aim to answer the question: given the spatial and motion feature representations over time, what is the best way to exploit the temporal information? Toward this end, we first demonstrate a strong baseline two-stream ConvNet using ResNet-101. We use this baseline to thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting spatiotemporal information. Building upon our experimental results, we then propose and investigate two different networks to further integrate spatiotemporal information: (1) the Temporal Segment RNN and (2) an Inception-style Temporal-ConvNet. We demonstrate that both RNNs (with LSTM cells) and Temporal-ConvNets, applied to spatiotemporal feature matrices, can exploit spatiotemporal dynamics to improve overall performance. Our analysis identifies specific limitations of each method that could form the basis of future work. Our experimental results on the UCF101 and HMDB51 datasets achieve performance comparable to the state of the art, 94.1% and 69.0% respectively, without requiring extensive temporal augmentation or end-to-end training.
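
The abstract describes feeding per-frame feature matrices into an LSTM over temporal segments (TS-LSTM) and into inception-style temporal convolutions (Temporal-Inception). Below is a minimal PyTorch sketch of these two ideas; the class names, layer widths (2048-d features, 512 hidden units), segment count, and kernel sizes (3, 5, 7) are illustrative assumptions rather than the authors' published configuration, and the two-stream (spatial/flow) fusion from the paper is omitted.

```python
# Illustrative sketch only: minimal modules approximating the two architectures
# named in the abstract. All hyperparameters are assumptions for demonstration.
import torch
import torch.nn as nn

class TemporalSegmentLSTM(nn.Module):
    """LSTM over per-segment averages of a (T, D) feature matrix (TS-LSTM-style sketch)."""
    def __init__(self, feat_dim=2048, hidden=512, num_segments=3, num_classes=101):
        super().__init__()
        self.num_segments = num_segments
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (B, T, D) frame-level features
        B, T, D = x.shape
        # Split the temporal axis into segments and average within each segment.
        segs = x.reshape(B, self.num_segments, T // self.num_segments, D).mean(dim=2)
        out, _ = self.lstm(segs)                 # temporal modeling across segments
        return self.fc(out[:, -1])               # classify from the last hidden state

class TemporalInception(nn.Module):
    """Inception-style parallel 1D temporal convolutions over the same (T, D) matrix."""
    def __init__(self, feat_dim=2048, num_classes=101):
        super().__init__()
        # Parallel temporal branches with different receptive fields (assumed sizes).
        self.branches = nn.ModuleList(
            nn.Conv1d(feat_dim, 64, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        )
        self.fc = nn.Linear(64 * 3, num_classes)

    def forward(self, x):                        # x: (B, T, D)
        x = x.transpose(1, 2)                    # -> (B, D, T) for Conv1d
        feats = [b(x).mean(dim=2) for b in self.branches]  # global temporal pooling
        return self.fc(torch.cat(feats, dim=1))

if __name__ == "__main__":
    feats = torch.randn(4, 12, 2048)             # 4 clips, 12 frames, 2048-d features
    print(TemporalSegmentLSTM()(feats).shape)    # torch.Size([4, 101])
    print(TemporalInception()(feats).shape)      # torch.Size([4, 101])
```

In practice, the (B, T, D) inputs here would be frame-level descriptors from the pretrained two-stream ResNet-101 baseline mentioned in the abstract; each stream would be modeled separately and the resulting class scores fused.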

