IECE Transactions on Sensing, Communication, and Control, 2024, Volume 1, Issue 1: 60-71

Research Article | 25 October 2024
1 Beijing iQIYI Technology Co., Ltd., China
2 Department of Information Engineering, University of Padua, Italy
* Corresponding author: Shenglun Yi, email: [email protected]
Received: 17 September 2024, Accepted: 18 October 2024, Published: 25 October 2024  

Abstract
Efficiently extracting and fusing video features to accurately identify complex and similar actions remains a central problem in video action recognition. Although prevailing methods are adept at feature extraction, they often perform poorly on complex scenes and similar actions, largely because they rely on uni-dimensional feature extraction and overlook both the interrelations among features and the importance of multi-dimensional fusion. To address this issue, this paper introduces a framework built on a soft correlation strategy that strengthens feature representation through multi-level, multi-dimensional feature aggregation and the concatenation of the temporal features produced by the network. An end-to-end multi-feature encoding soft correlation concatenation aggregation layer, placed at the temporal feature output of the video action recognition network, aggregates and integrates the output temporal features into a single composite feature that cohesively unifies multi-dimensional information, markedly improving the network's ability to distinguish similar video actions. Empirical results show that the proposed approach improves the performance of video action recognition networks, yields a more thorough depiction of the input, and achieves better accuracy and robustness.
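The abstract does not specify the exact form of the soft correlation concatenation aggregation layer, so the following is only a rough, hypothetical sketch of what such a fusion step could look like: pairwise cosine similarities between feature streams are softmaxed into "soft correlation" mixing weights, each stream is re-expressed as a weighted mixture of all streams, and the mixtures are concatenated into one composite feature. The function name and the cosine-similarity weighting are assumptions, not the authors' published method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_correlation_concat(features):
    """Illustrative fusion of equal-length feature vectors (hypothetical).

    1. Compute pairwise cosine similarities between the streams.
    2. Softmax the similarities into per-stream mixing weights
       (the "soft correlation").
    3. Re-express each stream as a similarity-weighted mixture of all
       streams, then concatenate the mixtures into one composite vector.
    """
    F = np.stack(features)                       # (n_streams, dim)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    U = F / np.clip(norms, 1e-8, None)           # unit-normalised streams
    sim = U @ U.T                                # (n, n) cosine similarities
    W = softmax(sim, axis=1)                     # soft correlation weights
    mixed = W @ F                                # weighted mixtures
    return mixed.reshape(-1)                     # concatenated fused feature

# Toy example: fusing two temporal feature vectors (e.g. RGB and flow streams).
rgb = np.array([1.0, 0.0, 2.0])
flow = np.array([0.5, 1.0, 0.0])
fused = soft_correlation_concat([rgb, flow])
print(fused.shape)  # (6,)
```

In a real network this step would sit after the temporal backbone (e.g. a bidirectional LSTM, per the keywords) and be implemented with differentiable tensor ops so the weights are learned end to end; the NumPy version above only illustrates the data flow.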

Graphical Abstract
Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks

Keywords
video action recognition
soft correlation
spatio-temporal feature extraction
concatenation aggregation structure
bidirectional LSTM

References

[1] Kong, J., Wang, H., Wang, X., Jin, X., Fang, X., & Lin, S. (2021). Multi-stream hybrid architecture based on cross-level fusion strategy for fine-grained crop species recognition in precision agriculture. Computers and Electronics in Agriculture, 185, 106134.

[2] Li, J., Wang, B., Ma, H., Gao, L., & Fu, H. (2024). Visual Feature Extraction and Tracking Method Based on Corner Flow Detection. IECE Transactions on Intelligent Systematics, 1(1), 3-9.

[3] Jin, X., Tong, A., Ge, X., Ma, H., Li, J., Fu, H., & Gao, L. (2024). YOLOv7-Bw: A Dense Small Object Efficient Detector Based on Remote Sensing Image. IECE Transactions on Intelligent Systematics, 1(1), 30-39.

[4] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (pp. 20-36). Springer, Cham.

[5] Yang, Z., An, G., Zhang, R., Zheng, Z., & Ruan, Q. (2023). SRI3D: Two-stream inflated 3D ConvNet based on sparse regularization for action recognition. IET Image Processing, 17(5), 1438-1448.

[6] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).

[7] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).

[8] Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202-6211).

[9] Saha, A., Mazumdar, M., & Ghosh, A. (2019). Human motion recognition using CNN and SVM. Journal of Ambient Intelligence and Humanized Computing, 10(4), 1561574.

[10] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).

[11] Funke, I., Bodenstedt, S., Oehme, F., von Bechtolsheim, F., Weitz, J., & Speidel, S. (2019, October). Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In International conference on medical image computing and computer-assisted intervention (pp. 467-475). Cham: Springer International Publishing.

[12] Zha, Z., Wang, Y., & Wu, X. (2019). A comparative study of convolutional neural networks for video action recognition. Journal of Visual Communication and Image Representation, 58, 951-960.

[13] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).

[14] Zou, J., Wang, D., & Li, X. (2020). Adaptive Regularization for CNNs. Neural Networks, 134, 151-159.

[15] Bertasius, G., Wang, H., & Torresani, L. (2021, July). Is space-time attention all you need for video understanding? In ICML (Vol. 2, No. 3, p. 4).

[16] Fan, H., Xie, L., & Wang, Z. (2021). Multiscale Vision Transformer for Video Understanding. International Journal of Computer Vision, 29(2), 129-142.

[17] Fu, L., & Laterveer, R. (2023). Special Cubic Four-Folds, K3 Surfaces, and the Franchetta Property. International Mathematics Research Notices, 2023(10), 8872-8902.

[18] Yang, J., & Yu, H. (2022). Temporal Shift Attention for Action Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 121-130.


Cite This Article
APA Style
Wang, F., & Yi, S. (2024). Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks. IECE Transactions on Sensing, Communication, and Control, 1(1), 60–71. https://doi.org/10.62762/TSCC.2024.212751

Article Metrics
Citations: Crossref 0 | Scopus 0 | Web of Science 0
Article Access Statistics: Views 114 | PDF Downloads 5

Publisher's Note
IECE stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions
IECE or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
IECE Transactions on Sensing, Communication, and Control

ISSN: request pending (Online) | ISSN: request pending (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/iece/

Copyright © 2024 Institute of Emerging and Computer Engineers Inc.