Academic Editor: Aniruddha Chandra, National Institute of Technology, India
IECE Transactions on Sensing, Communication, and Control, 2024, Volume 1, Issue 1: 60-71

Free to Read | Research Article | 25 October 2024
Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks
1 Beijing iQIYI Technology Co., Ltd., China
2 Department of Information Engineering, University of Padua, Italy
* Corresponding Author: Shenglun Yi, [email protected]
Received: 17 September 2024, Accepted: 18 October 2024, Published: 25 October 2024  
Abstract
Efficiently extracting and fusing video features to accurately identify complex and similar actions has long been a central challenge in video action recognition. Although prevailing methods are adept at feature extraction, they often perform poorly on complex scenes and similar actions, primarily because they rely on uni-dimensional feature extraction and overlook both the interrelations among features and the importance of multi-dimensional fusion. To address this issue, this paper introduces a framework based on a soft correlation strategy that strengthens the representational capacity of features through multi-level, multi-dimensional feature aggregation and concatenation of the temporal features produced by the network. Our end-to-end multi-feature encoding soft correlation concatenation aggregation layer, placed at the temporal feature output of the video action recognition network, aggregates and integrates the output temporal features into a single composite feature that unifies multi-dimensional information, markedly improving the network's ability to distinguish similar video actions. Experimental results show that the proposed approach improves the performance of video action recognition networks, yields a more complete representation of the video content, and achieves higher accuracy and robustness.
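
The following is a minimal sketch of how a soft-correlation concatenation aggregation layer of the kind described above might be realised on top of bidirectional LSTM temporal features. The module names, feature dimensions (2048-d frame features, 512-d hidden state, 101 classes), and the specific weighting scheme (softmax similarity to the mean temporal feature) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: soft-correlation weighting of BiLSTM temporal
# features followed by concatenation aggregation. All names, dimensions,
# and the exact correlation formulation are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftCorrelationAggregation(nn.Module):
    """Aggregates per-frame temporal features into one composite feature."""

    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=101):
        super().__init__()
        # Bidirectional LSTM over the frame-level feature sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Projection used to compute soft correlation scores.
        self.query = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        # Classifier over the concatenated composite feature:
        # [soft-correlation-weighted summary ; average-pooled summary].
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) per-frame CNN features.
        temporal, _ = self.bilstm(frame_feats)          # (B, T, 2H)
        mean_feat = temporal.mean(dim=1, keepdim=True)  # (B, 1, 2H)
        # Soft correlation: similarity of each time step to the mean
        # temporal feature, normalised so the weights sum to one over time.
        scores = (self.query(temporal) * mean_feat).sum(-1)  # (B, T)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)      # (B, T, 1)
        weighted = (weights * temporal).sum(dim=1)            # (B, 2H)
        pooled = temporal.mean(dim=1)                         # (B, 2H)
        # Concatenation aggregation of the two views into a composite feature.
        composite = torch.cat([weighted, pooled], dim=-1)     # (B, 4H)
        return self.classifier(composite)

if __name__ == "__main__":
    layer = SoftCorrelationAggregation()
    clips = torch.randn(2, 16, 2048)   # 2 clips, 16 frames, 2048-d features
    print(layer(clips).shape)          # torch.Size([2, 101])
```

In this sketch the soft correlation acts as a differentiable weighting over time steps, so the concatenated composite feature preserves both the correlation-weighted and the uniformly pooled views of the temporal sequence; how the paper combines its multi-level features may differ.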

Graphical Abstract
Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks

Keywords
video action recognition
soft correlation
spatio-temporal feature extraction
concatenation aggregation structure
bidirectional LSTM

Funding
This research received no external funding.

Cite This Article
APA Style
Wang, F., & Yi, S. (2024). Spatio-temporal Feature Soft Correlation Concatenation Aggregation Structure for Video Action Recognition Networks. IECE Transactions on Sensing, Communication, and Control, 1(1), 60–71. https://doi.org/10.62762/TSCC.2024.212751


Article Metrics
Citations: Crossref 0 | Scopus 0 | Web of Science 0
Article Access Statistics: Views 472 | PDF Downloads 59

Publisher's Note
IECE stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions
Institute of Emerging and Computer Engineers (IECE) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
IECE Transactions on Sensing, Communication, and Control

ISSN: 3065-7431 (Online) | ISSN: 3065-7423 (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/iece/

Copyright © 2024 Institute of Emerging and Computer Engineers Inc.