
Efficient 3D dense residual network and its application in human action recognition



Abstract

In view of the problem that 3D-CNNs extract spatio-temporal features from video well but demand large amounts of computation and memory, this paper designs an efficient 3D convolutional block to replace the computationally expensive 3×3×3 convolutional layer, and then proposes 3D efficient dense residual networks (3D-EDRNs) that integrate these blocks for human action recognition. The efficient 3D convolutional block is composed of a 1×3×3 convolutional layer that captures the spatial features of the video and a 3×1×1 convolutional layer that captures its temporal features. Efficient 3D convolutional blocks are placed at multiple locations in the dense residual network, which not only exploits the easy optimization of residual blocks and the feature reuse of densely connected networks, but also shortens training time and improves the efficiency and performance of the network's spatio-temporal feature extraction. Experiments on the classical datasets UCF101 and HMDB51 and on the dynamic multi-view complicated 3D database of human activity (DMV action3D) verify that 3D-EDRNs combined with efficient 3D convolutional blocks significantly reduce model complexity and effectively improve classification performance, while requiring few computational resources, few parameters, and little training time.
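To make the factorization concrete, the following is a minimal sketch of such a block in PyTorch. The framework choice, the block name, the channel counts, and the batch-norm/ReLU placement are illustrative assumptions; only the 1×3×3 and 3×1×1 kernel shapes come from the abstract.

```python
import torch
import torch.nn as nn

class Efficient3DBlock(nn.Module):
    """Sketch of a factorized 3D convolution: a 1x3x3 spatial convolution
    followed by a 3x1x1 temporal convolution, standing in for a full
    3x3x3 convolution. Names and BN/ReLU placement are assumptions,
    not the authors' exact design."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Spatial convolution, kernel (T, H, W) = (1, 3, 3)
        self.spatial = nn.Conv3d(in_channels, out_channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        # Temporal convolution, kernel (T, H, W) = (3, 1, 1)
        self.temporal = nn.Conv3d(out_channels, out_channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.spatial(x)))
        x = self.relu(self.bn2(self.temporal(x)))
        return x

# Rough weight-count comparison at 64 -> 64 channels (ignoring BN):
#   full 3x3x3 conv:  64 * 64 * 27            = 110592 weights
#   1x3x3 + 3x1x1:    64 * 64 * 9 + 64 * 64 * 3 = 49152 weights (~2.25x fewer)
block = Efficient3DBlock(64, 64)
x = torch.randn(2, 64, 16, 112, 112)             # a clip of 16 frames at 112x112
assert block(x).shape == (2, 64, 16, 112, 112)   # temporal/spatial sizes preserved
```

With equal channel widths, the factorized pair carries roughly 55% fewer weights per block than a full 3×3×3 convolution, which is consistent with the abstract's claims of lower model complexity and fewer parameters.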

Additional information

CLC number: TP391.4

DOI:10.12086/oee.2020.190139

Section: Research article

Funding: National Natural Science Foundation of China (61673276, 61603255, 61703277)

Received: 2019-03-27

Revised: 2019-06-23

Published online: --

Author affiliations

Li Lianghua: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Wang Yongxiong: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Corresponding author: Li Lianghua (1244094457@qq.com)

Biography: Li Lianghua (1994-), male, Master's degree candidate, whose research focuses on computer vision.


Cite this paper

Li Lianghua, Wang Yongxiong. Efficient 3D dense residual network and its application in human action recognition[J]. Opto-Electronic Engineering, 2020, 47(2): 190139

