Infrared and Laser Engineering, 2018, 47(2): 0203002. Online publication: 2018-04-26

Research on image interpretation based on deep learning

Yang Nan 1,2, Nan Lin 1,2, Zhang Dingyi 1,2, Ku Tao 1,2

Author affiliations
1 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have developed rapidly in image classification, computer vision, natural language processing, speech recognition, machine translation, and semantic analysis, drawing wide research attention to the automatic generation of image descriptions. The main open problems in image description are sparse input text data, model over-fitting, and an oscillating loss function that is difficult to converge. This paper uses NIC as the baseline model. To address data sparsity, the one-hot text representation in the baseline model is replaced with a word2vec mapping of the text. To prevent over-fitting, a regularization term and Dropout are added to the model. As an innovation in word-order memory, the associative memory unit GRU is introduced for text generation. In the experiments, the AdamOptimizer is used for iterative parameter updates. The results show that the improved model has fewer parameters and converges much faster, its loss curve is smoother, the maximum loss falls to 2.91, and its accuracy is nearly 15% higher than that of NIC. The experiments confirm that mapping text with word2vec clearly alleviates data sparsity, that the regularization term and Dropout effectively prevent over-fitting, and that the GRU greatly reduces the number of trainable parameters and speeds up convergence, thereby improving the accuracy of the whole model.
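The abstract's two central modeling changes can be illustrated with a short sketch: replacing a sparse one-hot word vector with a dense word2vec-style embedding lookup, and a single GRU step for word-order memory. This is a minimal NumPy illustration with hypothetical dimensions and random weights, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # hypothetical sizes

# One-hot representation: a vocab_size-dimensional vector with a single 1 (sparse).
word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# word2vec-style representation: rows of a dense embedding matrix; a word
# becomes a 64-d dense vector instead of a 1000-d sparse one.
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
x = E[word_id]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate hidden state
    return (1.0 - z) * h + z * h_tilde         # interpolate old and new state

# Three gate matrices per input/recurrent side (an LSTM needs four, which is
# why a GRU decoder trains fewer recurrent parameters).
params = tuple(
    rng.normal(scale=0.1, size=shape)
    for shape in [(embed_dim, hidden_dim), (hidden_dim, hidden_dim)] * 3
)
h = gru_cell(x, np.zeros(hidden_dim), params)
print(one_hot.shape, x.shape, h.shape)
```

In a full caption decoder along the lines described above, the GRU would be unrolled over the word sequence, with the CNN image feature initializing the hidden state and a softmax over the vocabulary at each step.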
References

[1] Xu Feng, Lu Jiangang, Sun Youxian. Application of neural network in image processing[J]. Chinese Journal of Information and Control, 2003, 4(1): 344-351. (in Chinese)

[2] Farhadi A, Hejrati M, Sadeghi M A, et al. Every picture tells a story: Generating sentences from images[C]//Proceedings of the European Conference on Computer Vision (ECCV), 2010: 15-29.

[3] Kulkarni G, Premraj V, Dhar S, et al. Baby talk: Understanding and generating simple image descriptions[J]. CVPR, 2014, 35(12): 1601-1608.

[4] Cho K, van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. EMNLP, 2014, 14(6): 1078-1093.

[5] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3156-3164.

[6] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012: 1097-1105.

[7] Sermanet P, Eigen D, Zhang X, et al. OverFeat: Integrated recognition, localization and detection using convolutional networks[J]. arXiv preprint arXiv:1312.6229, 2013.

[8] Gerber R, Nagel H H. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences[C]//Proceedings of the IEEE International Conference on Image Processing, 1996: 805-808.

[9] Yao B Z, Yang X, Lin L, et al. I2T: Image parsing to text description[J]. Proceedings of the IEEE, 2010, 98(8): 1485-1508.

[10] Li S, Kulkarni G, Berg T L, et al. Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the Conference on Computational Natural Language Learning, 2011.

[11] Aker A, Gaizauskas R. Generating image descriptions using dependency relational patterns[C]//Proceedings of the Meeting of the Association for Computational Linguistics (ACL), 2010: 1250-1258.

[12] Hodosh M, Young P, Hockenmaier J. Framing image description as a ranking task: Data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899.

[13] Wen Ya, Nan Lin. Research on semantic analysis method of image based on natural language understanding[D]. Shenyang: Shenyang Institute of Automation, Chinese Academy of Sciences, 2017. (in Chinese)

