IECE Transactions on Sensing, Communication, and Control, 2024, Volume 1, Issue 1: 60-71

Research Article | 25 October 2024
1 Beijing iQIYI Technology Co., Ltd., China
2 Department of Information Engineering, University of Padua, Italy
* Corresponding author: Shenglun Yi, email: [email protected]
Received: 17 September 2024, Accepted: 18 October 2024, Published: 25 October 2024  

Abstract
Efficiently extracting and fusing video features to accurately identify complex and similar actions remains a central problem in video action recognition. Although prevailing methods are adept at feature extraction, they often perform poorly on complex scenes and similar actions, largely because they rely on uni-dimensional feature extraction and overlook both the interrelations among features and the importance of multi-dimensional fusion. To address this issue, this paper introduces a framework built on a soft correlation strategy that strengthens feature representation through multi-level, multi-dimensional feature aggregation and the concatenation of the temporal features produced by the network. An end-to-end multi-feature encoding soft correlation concatenation aggregation layer, placed at the temporal feature output of the video action recognition network, aggregates and integrates the output temporal features into a single composite feature that cohesively unifies multi-dimensional information, markedly improving the network's ability to distinguish similar video actions. Empirical results show that the proposed approach improves the performance of video action recognition networks, yields a more thorough depiction of the input, and achieves better accuracy and robustness.
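The abstract does not specify the exact form of the soft correlation concatenation aggregation layer, so the following is only a rough, hypothetical sketch of what such a fusion step could look like: pairwise cosine similarities between feature streams are softmaxed into "soft correlation" mixing weights, each stream is re-expressed as a weighted mixture of all streams, and the mixtures are concatenated into one composite feature. The function name and the cosine-similarity weighting are assumptions, not the authors' published method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_correlation_concat(features):
    """Illustrative fusion of equal-length feature vectors (hypothetical).

    1. Compute pairwise cosine similarities between the streams.
    2. Softmax the similarities into per-stream mixing weights
       (the "soft correlation").
    3. Re-express each stream as a similarity-weighted mixture of all
       streams, then concatenate the mixtures into one composite vector.
    """
    F = np.stack(features)                       # (n_streams, dim)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    U = F / np.clip(norms, 1e-8, None)           # unit-normalised streams
    sim = U @ U.T                                # (n, n) cosine similarities
    W = softmax(sim, axis=1)                     # soft correlation weights
    mixed = W @ F                                # weighted mixtures
    return mixed.reshape(-1)                     # concatenated fused feature

# Toy example: fusing two temporal feature vectors (e.g. RGB and flow streams).
rgb = np.array([1.0, 0.0, 2.0])
flow = np.array([0.5, 1.0, 0.0])
fused = soft_correlation_concat([rgb, flow])
print(fused.shape)  # (6,)
```

In a real network this step would sit after the temporal backbone (e.g. a bidirectional LSTM, per the keywords) and be implemented with differentiable tensor ops so the weights are learned end to end; the NumPy version above only illustrates the data flow.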

Graphical Abstract
Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks

Keywords
video action recognition
soft correlation
spatio-temporal feature extraction
concatenation aggregation structure
bidirectional LSTM

References

[1] Kong, J., Wang, H., Wang, X., Jin, X., Fang, X., & Lin, S. (2021). Multi-stream hybrid architecture based on cross-level fusion strategy for fine-grained crop species recognition in precision agriculture. Computers and Electronics in Agriculture, 185, 106134.

[2] Li, J., Wang, B., Ma, H., Gao, L., & Fu, H. (2024). Visual Feature Extraction and Tracking Method Based on Corner Flow Detection. IECE Transactions on Intelligent Systematics, 1(1), 3-9.

[3] Jin, X., Tong, A., Ge, X., Ma, H., Li, J., Fu, H., & Gao, L. (2024). YOLOv7-Bw: A Dense Small Object Efficient Detector Based on Remote Sensing Image. IECE Transactions on Intelligent Systematics, 1(1), 30-39.

[4] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (pp. 20-36). Springer, Cham.

[5] Yang, Z., An, G., Zhang, R., Zheng, Z., & Ruan, Q. (2023). SRI3D: Two-stream inflated 3D ConvNet based on sparse regularization for action recognition. IET Image Processing, 17(5), 1438-1448.

[6] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).

[7] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).

[8] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202-6211).

[9] Saha, A., Mazumdar, M., & Ghosh, A. (2019). Human motion recognition using CNN and SVM. Journal of Ambient Intelligence and Humanized Computing, 10(4), 1561574.

[10] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).

[11] Funke, I., Bodenstedt, S., Oehme, F., von Bechtolsheim, F., Weitz, J., & Speidel, S. (2019, October). Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In International conference on medical image computing and computer-assisted intervention (pp. 467-475). Cham: Springer International Publishing.

[12] Zha, Z., Wang, Y., & Wu, X. (2019). A comparative study of convolutional neural networks for video action recognition. Journal of Visual Communication and Image Representation, 58, 951-960.

[13] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).

[14] Zou, J., Wang, D., & Li, X. (2020). Adaptive Regularization for CNNs. Neural Networks, 134, 151-159.

[15] Bertasius, G., Wang, H., & Torresani, L. (2021, July). Is space-time attention all you need for video understanding? In ICML (Vol. 2, No. 3, p. 4).

[16] Fan, H., Xie, L., & Wang, Z. (2021). Multiscale Vision Transformer for Video Understanding. International Journal of Computer Vision, 29(2), 129-142.

[17] Fu, L., & Laterveer, R. (2023). Special Cubic Four-Folds, K3 Surfaces, and the Franchetta Property. International Mathematics Research Notices, 2023(10), 8872-8902.

[18] Yang, J., & Yu, H. (2022). Temporal Shift Attention for Action Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 121-130.


Cite This Article
APA Style
Wang, F., & Yi, S. (2024). Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks. IECE Transactions on Sensing, Communication, and Control, 1(1), 60–71. https://doi.org/10.62762/TSCC.2024.212751

Article Metrics
Citations: Crossref 0 | Scopus 0 | Web of Science 0
Article Access Statistics: Views 114 | PDF Downloads 5

Publisher's Note
IECE stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions
IECE or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
IECE Transactions on Sensing, Communication, and Control

ISSN: request pending (Online) | ISSN: request pending (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/iece/

Copyright © 2024 Institute of Emerging and Computer Engineers Inc.