Abstract
This paper proposes an improved video action recognition method built on three key components. First, in the data preprocessing stage, we develop multi-temporal-scale video frame extraction and multi-spatial-scale video cropping techniques to enrich content information and standardize the input format. Second, we propose a lightweight Inception-3D (LI3D) network for spatio-temporal feature extraction and design a soft-association feature aggregation module to improve the recognition accuracy of key actions in videos. Third, we employ a bidirectional LSTM to contextualize the feature sequences extracted by LI3D, strengthening the representation of temporal information. The spatial- and temporal-scale augmentation introduced during preprocessing improves the model's robustness and generalization by effectively extracting video key frames and capturing actions in key regions, while the LI3D network, combined with transfer learning, effectively extracts spatial and temporal information from video data. Experimental results demonstrate that the proposed method achieves significant performance improvements on video action recognition tasks, offering new insights and approaches for research in related fields.
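To make the multi-scale preprocessing concrete, below is a minimal sketch in Python/PyTorch of sampling clips at several temporal strides and spatial crop sizes and resizing them to a standard input resolution. The specific strides, crop sizes, center-crop policy, and last-frame padding are illustrative assumptions, not the paper's exact procedure.

    import torch
    import torch.nn.functional as F

    def multiscale_clips(video, strides=(1, 2, 4), clip_len=16,
                         crop_sizes=(224, 192, 160), out_size=224):
        # video: (C, T, H, W) float tensor, assumed to be at least
        # max(crop_sizes) pixels in each spatial dimension.
        c, t, h, w = video.shape
        clips = []
        for s in strides:                      # multi-temporal-scale sampling
            # repeat the last frame if the video is shorter than clip_len * s
            idx = (torch.arange(clip_len) * s).clamp(max=t - 1)
            clip = video[:, idx]               # (C, clip_len, H, W)
            for cs in crop_sizes:              # multi-spatial-scale cropping
                top, left = (h - cs) // 2, (w - cs) // 2  # center crop for simplicity
                crop = clip[:, :, top:top + cs, left:left + cs]
                # resize every sampled frame to the standardized input resolution
                crop = F.interpolate(crop, size=(out_size, out_size),
                                     mode="bilinear", align_corners=False)
                clips.append(crop)
        # (len(strides) * len(crop_sizes), C, clip_len, out_size, out_size)
        return torch.stack(clips)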
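Similarly, the following sketch shows one way an LI3D backbone, a soft-association aggregation module, and a bidirectional LSTM could be composed into a classifier. The layer sizes, the attention-style weighting used for soft association, the classifier head, and the class count are assumptions for illustration only, not the paper's implementation.

    import torch
    import torch.nn as nn

    class LI3DBiLSTM(nn.Module):
        # li3d_backbone stands in for the paper's lightweight Inception-3D
        # network: any module mapping a clip (B, C, T, H, W) to features (B, D).
        def __init__(self, li3d_backbone, feat_dim=512, hidden=256, num_classes=101):
            super().__init__()
            self.backbone = li3d_backbone
            # soft-association aggregation, sketched here as learned
            # attention weights over per-clip features (assumption)
            self.assoc = nn.Linear(feat_dim, 1)
            # bidirectional LSTM contextualizes the per-clip feature sequence
            self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, clips):                        # clips: (B, N, C, T, H, W)
            b, n = clips.shape[:2]
            feats = self.backbone(clips.flatten(0, 1))   # (B*N, D) per-clip features
            feats = feats.view(b, n, -1)                 # (B, N, D) feature sequence
            ctx, _ = self.bilstm(feats)                  # (B, N, 2*hidden)
            w = torch.softmax(self.assoc(feats), dim=1)  # (B, N, 1) soft-association weights
            pooled = (w * ctx).sum(dim=1)                # weighted temporal pooling
            return self.classifier(pooled)               # (B, num_classes) action logits

    # Usage with a stand-in backbone that pools each clip to a 512-d vector
    # (num_classes=101 mirrors UCF101-style benchmarks; an assumption here):
    backbone = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 512))
    model = LI3DBiLSTM(backbone, feat_dim=512, hidden=256, num_classes=101)
    logits = model(torch.randn(2, 4, 3, 16, 112, 112))   # -> (2, 101)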
Keywords
video action recognition
multi-scale preprocessing
lightweight I3D (LI3D)
spatio-temporal feature extraction
bidirectional LSTM
Funding
This work received no external funding.
Cite This Article
APA Style
Wang, F., Jin, X., & Yi, S. (2024). LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition. IECE Transactions on Emerging Topics in Artificial Intelligence, 1(1), 58–70. https://doi.org/10.62762/TETAI.2024.628205
Publisher's Note
IECE stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
IECE or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.