Facial Expression Recognition Algorithm Based on Cosine Distance Loss Function
1 Introduction
Facial expressions are among the most powerful and natural signals for conveying human emotional states and intentions, and they are widely used in robotics, driver-fatigue detection, and human-computer interaction systems [1-2]. Ekman et al. [3] defined six basic expressions in 1971 (anger, disgust, fear, happiness, sadness, and surprise); a neutral expression was added later. In real-world environments, head-pose variation, illumination changes, occlusion, and subtle changes in facial appearance lead to large intra-class variation and high inter-class similarity, so the accuracy of facial expression recognition remains relatively low and the task is still highly challenging [4].
Artificial intelligence has advanced remarkably in recent years. On one hand, AI chips have improved in processing power and architecture, providing the computational capacity needed for parallel computing; on the other hand, novel neural-network structures and algorithms have delivered more efficient intelligent solutions across industries, so research in many fields has shifted to deep learning methods and achieved very high recognition accuracy [5-9]. More and more researchers now apply deep learning, especially convolutional neural networks, to the challenges that real-world environments pose for facial expression recognition; multi-layer convolutional networks can learn deep features of the input and have produced strong results on this task [10-13].
Traditional convolutional neural networks use the Softmax loss, which optimizes the differences between classes but ignores the variation within each class. Many new loss functions have been proposed to address this. The Island loss, inspired by the Center loss from face recognition [14], adds to the penalty on the distance between each feature and its class center a term that enlarges the distance between classes; this both compresses each cluster and pushes the cluster centers apart [15]. The AM-Softmax loss, also proposed for face recognition, introduces an additive cosine margin term into the Softmax loss, which further separates the classes in angular space [16].
2 Basic Principles
2.1 Facial Expression Recognition Method
A deep-learning-based facial expression recognition algorithm consists of three parts: face preprocessing, expression feature extraction, and expression classification. A sufficiently large labeled training set with as much ethnic and environmental variation as possible is crucial to the design of such an algorithm, and several facial expression datasets have been released for this research [17-21]. For feature extraction, the network designed in this paper is mini-Xception [22], a streamlined variant of the Xception architecture, whose structure is shown in Fig. 1.
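mini-Xception builds on Xception's depthwise separable convolutions, which is where most of its parameter savings come from. As a quick illustration (a generic count, not the paper's exact layer dimensions), a standard k×k convolution from C_in to C_out channels costs k·k·C_in·C_out weights, while the separable version costs only k·k·C_in + C_in·C_out:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weight count of a depthwise separable convolution:
    one k x k filter per input channel, then a 1 x 1 pointwise mix."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 layer with 128 input and 128 output channels.
standard = conv_params(3, 128, 128)             # 147456 weights
separable = separable_conv_params(3, 128, 128)  # 17536 weights
print(standard, separable, standard / separable)
```

For this layer size the separable form uses roughly 8x fewer weights, which is why the overall model stays compact.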
2.2 Traditional Loss Functions
To better understand the proposed cosine-distance-based loss function, the Softmax loss and the Island loss are briefly reviewed. The Softmax loss is

$$L_S = -\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{\mathrm{T}}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{\mathrm{T}}x_i + b_j}},$$

where $N$ is the number of training samples, $n$ is the number of classes, $x_i$ is the deep feature of the $i$-th sample, $y_i$ is its label, and $W_j$ and $b_j$ are the weight vector and bias of the $j$-th class.

The Island loss is an improvement on the Center loss: during training, the Center loss reduces the variation among intra-class features, while the Island loss additionally enlarges the distance between classes. The Center loss is

$$L_C = \frac{1}{2}\sum_{i=1}^{N}\left\|x_i - c_{y_i}\right\|_2^2,$$

where $c_{y_i}$ is the center of the features of class $y_i$. On top of $L_C$, the Island loss penalizes the cosine similarity between every pair of distinct class centers:

$$L_I = L_C + \lambda_1\sum_{c_j}\sum_{c_k \neq c_j}\left(\frac{c_k \cdot c_j}{\left\|c_k\right\|_2\left\|c_j\right\|_2} + 1\right),$$

where the sums run over the class centers and $\lambda_1$ balances the pairwise term against the Center loss. The overall Island loss combines this term with the Softmax loss:

$$L_{IL} = L_S + \lambda L_I,$$

where $\lambda$ weights the Island term against the Softmax term.
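The Center loss and the Island penalty can be sketched numerically as follows (a minimal NumPy illustration on hypothetical toy features and centers, not the paper's training code):

```python
import numpy as np

def center_loss(x, labels, centers):
    """L_C: half the squared distance of each feature to its class center.
    x: (N, d) features, labels: (N,) ints, centers: (n_classes, d)."""
    diff = x - centers[labels]
    return 0.5 * np.sum(diff ** 2)

def island_penalty(centers):
    """Pairwise cosine-similarity penalty between distinct class centers;
    each term lies in [0, 2] and vanishes when two centers point in
    opposite directions."""
    unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = unit @ unit.T                     # (n, n) cosine similarities
    n = centers.shape[0]
    off_diag = cos[~np.eye(n, dtype=bool)]  # drop the c_k == c_j terms
    return np.sum(off_diag + 1.0)

# Toy example: 4 features from 2 classes in 2-D, centers nearly opposite.
x = np.array([[1.0, 0.1], [0.9, -0.1], [-1.0, 0.2], [-1.1, 0.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.95, 0.0], [-1.05, 0.1]])
print(center_loss(x, labels, centers), island_penalty(centers))
```

Because the two toy centers point in nearly opposite directions, the Island penalty is close to zero, which is exactly the configuration the loss encourages.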
The classification behavior of the Softmax, Center, and Island losses is illustrated in Fig. 2.

Fig. 2. Deep features learned by three loss functions. (a) Softmax loss; (b) Center loss; (c) Island loss
2.3 Loss Function Based on Cosine Distance
The analysis of the Island loss shows that its classification accuracy depends to a large extent on the Softmax term; the basic idea of the cosine-distance-based loss is therefore to improve the Softmax term in order to raise classification accuracy.
The AM-Softmax loss from face recognition is an improvement of the Softmax loss that yields larger inter-class distances and smaller intra-class distances. In addition to fixing the network bias parameter $b$ to 0, AM-Softmax normalizes both the features and the class weight vectors, so that the inner product $W_j^{\mathrm{T}}x_i$ reduces to the cosine $\cos\theta_j$, and it introduces a scale factor $s$ and an additive cosine margin $m$ [16]:

$$L_{AMS} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)} + \sum_{j \neq y_i}e^{s\cos\theta_j}},$$

where $\theta_j$ is the angle between feature $x_i$ and weight vector $W_j$.
The idea of the proposed cosine-distance-based loss is to substitute the AM-Softmax loss for the Softmax term of the Island loss, which gives

$$L = L_{AMS} + \lambda L_I,$$
Fig. 3. Comparison of the traditional Softmax loss function and the AM-Softmax loss function
where the hyperparameters $s$ and $m$ are the scale factor and additive margin of the AM-Softmax term, and $\lambda$ and $\lambda_1$ weight the Island term and its pairwise center penalty.
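The AM-Softmax term above can be sketched in NumPy as follows (a minimal illustration assuming the features and class weights are already L2-normalized; the defaults s = 10 and m = 0.35 are the values tuned in the experiments below):

```python
import numpy as np

def am_softmax_loss(x, W, labels, s=10.0, m=0.35):
    """AM-Softmax loss for L2-normalized features x (N, d) and
    L2-normalized class weight columns W (d, n_classes)."""
    cos = x @ W                                 # (N, n) cosine similarities
    n_samples = x.shape[0]
    # subtract the additive margin m from the target-class cosine only
    cos_m = cos.copy()
    cos_m[np.arange(n_samples), labels] -= m
    logits = s * cos_m
    # numerically stable log-softmax over the margined, scaled logits
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n_samples), labels].mean()

# Toy example: 2 samples, 2 classes, unit vectors along the axes.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])   # class 0 along e1, class 1 along e2
labels = np.array([0, 1])
print(am_softmax_loss(x, W, labels))
```

Even on perfectly separated toy data the loss stays slightly above zero, because the margin m forces the target cosine to beat the others by at least 0.35.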
Fig. 4. Schematic of the classification boundaries of the Island loss function and the loss function based on cosine distance
3 Experiments
3.1 Experimental Setup
3.1.1 Preprocessing
To eliminate the influence of inconsistent face size and angle in the datasets, all images are face-aligned. Facial landmarks are detected with the Dlib face-detection library [24], and the face region is aligned on three key points: the two eyes and the center of the mouth. The aligned face images are then resized to 100×100, and the pixel values of the three color channels are normalized from [0, 255] to [0, 1].
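The normalization step is straightforward; a minimal sketch (assuming the aligned crop is already a 100×100×3 uint8 array; the Dlib alignment itself is omitted here):

```python
import numpy as np

def normalize_face(img_uint8):
    """Map a 100 x 100 x 3 uint8 face crop from [0, 255] to [0, 1] floats."""
    assert img_uint8.shape == (100, 100, 3)
    return img_uint8.astype(np.float32) / 255.0

# Toy example: a random image standing in for an aligned face crop.
img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
out = normalize_face(img)
print(out.shape, out.min(), out.max())
```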
Because facial expression datasets are limited in size, data augmentation is used to enlarge the training data [25]: images are randomly rotated between -20° and 20°, horizontally flipped with probability 50%, randomly shifted by up to 10% both horizontally and vertically, and randomly zoomed by up to 10%.
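These augmentations map directly onto Keras' ImageDataGenerator; a configuration sketch under that assumption (the paper does not state which augmentation implementation was used):

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # random rotation in [-20, 20] degrees
    width_shift_range=0.1,    # up to 10% horizontal shift
    height_shift_range=0.1,   # up to 10% vertical shift
    zoom_range=0.1,           # up to 10% random zoom
    horizontal_flip=True,     # flips are applied with probability 0.5
)
# Usage: model.fit_generator(datagen.flow(x_train, y_train, batch_size=128), ...)
```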
3.1.2 Facial Expression Dataset
RAF-DB is a real-world facial expression dataset [21] containing 29672 highly diverse facial expression images downloaded from the Internet. Through manual annotation and reliable estimation, the samples are labeled with 7 basic and 11 compound expression categories. The basic-expression subset is used here to evaluate the effectiveness of the cosine-distance-based loss.
3.1.3 Network Framework and Training Configuration
The mini-Xception network with the cosine-distance-based loss is implemented in Keras under Ubuntu 16.04 and trained for 10000 iterations on an NVIDIA GeForce GTX 1080Ti GPU. Training uses the adaptive moment estimation (Adam) optimizer with a batch size of 128 and an initial learning rate of 0.01; when the validation loss (val_loss) stops decreasing, the learning rate is multiplied by a decay factor of 0.2.
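This training schedule can be expressed with standard Keras components; a sketch assuming the decay rule is implemented with the ReduceLROnPlateau callback (the patience value is a placeholder, as the paper does not report one):

```python
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau

optimizer = Adam(lr=0.01)  # initial learning rate from the paper

# Multiply the learning rate by 0.2 whenever val_loss stops decreasing.
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,   # placeholder: not reported in the paper
)

# model.compile(optimizer=optimizer, loss=cosine_distance_loss, ...)
# model.fit(..., batch_size=128, callbacks=[reduce_lr])
```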
3.2 Experimental Results
The two hyperparameters $s$ and $m$ of the cosine-distance-based loss are tuned experimentally; the recognition accuracies are listed in Table 1 and Table 2.
Table 1. Accuracy of facial expression recognition when s = 10 and m takes different values
Table 2. Accuracy of facial expression recognition when m = 0.35 and s takes different values
Fig. 5. Confusion matrix of the experimental results of the network model based on the cosine distance loss function
The accuracy of the proposed cosine-distance-based loss is compared with that of classical facial expression recognition algorithms; the results are listed in Table 3.
Table 3. Accuracy of facial expression recognition under different loss functions
4 Conclusion
A loss function based on cosine distance is proposed. The experiments use the mini-Xception network, whose compact structure has far fewer parameters than traditional network models. Training the model with the cosine-distance-based loss produces highly discriminative expression features, minimizing intra-class distance while maximizing inter-class distance, and the recognition accuracy of the mini-Xception model improves once this loss is added. The experimental results show that the proposed loss offers a clear advantage in expression recognition tasks. Future work will focus on further improving accuracy for classes with few training samples and on refining the loss function to better guide network training.
[1] Tian Y I, Kanade T, Cohn J F. Recognizing action units for facial expression analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(2): 97-115.
[2] Darwin C, Prodger P. The expression of the emotions in man and animals[M]. USA: Oxford University Press, 1998.
[4] Valstar M F, Mehu M, Jiang B H, et al. Meta-analysis of the first facial expression recognition challenge[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2012, 42(4): 966-979.
[5] Long X, Su H S, Liu G H, et al. A face recognition algorithm based on angular distance loss function and convolutional neural network[J]. Laser & Optoelectronics Progress, 2018, 55(12): 121505.
[6] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems 25 (NIPS 2012), December 3-6, 2012, Lake Tahoe, Nevada, United States. Canada: NIPS, 2012.
[7] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J/OL]. (2015-04-10)[2019-04-24]. https://arxiv.org/abs/1409.1556.
[8] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 7-12, 2015, Boston, MA, USA. New York: IEEE, 2015: 15523970.
[9] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA. New York: IEEE, 2016: 770-778.
[10] Kim B K, Lee H, Roh J, et al. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition[C]//Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15), November 9-13, 2015, Seattle, Washington, USA. New York: ACM, 2015: 427-434.
[11] Yu Z D, Zhang C. Image based static facial expression recognition with multiple deep network learning[C]//Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15), November 9-13, 2015, Seattle, Washington, USA. New York: ACM, 2015: 435-442.
[12] Ng H W, Nguyen V D, Vonikakis V, et al. Deep learning for emotion recognition on small datasets using transfer learning[C]//Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15), November 9-13, 2015, Seattle, Washington, USA. New York: ACM, 2015: 443-449.
[13] Yao A B, Cai D Q, Hu P, et al. HoloNet: towards robust emotion recognition in the wild[C]//Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), November 12-16, 2016, Tokyo, Japan. New York: ACM, 2016: 472-478.
[14] Wen Y D, Zhang K P, Li Z F, et al. A discriminative feature learning approach for deep face recognition[M]//Leibe B, Matas J, Sebe N, et al. Computer Vision - ECCV 2016. Lecture Notes in Computer Science. Cham: Springer, 2016, 9911: 499-515.
[15] Cai J, Meng Z B, Khan A S, et al. Island loss for learning discriminative features in facial expression recognition[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), May 15-19, 2018, Xi'an, China. New York: IEEE, 2018: 302-309.
[16] Wang F, Cheng J, Liu W Y, et al. Additive margin softmax for face verification[J]. IEEE Signal Processing Letters, 2018, 25(7): 926-930.
[17] Kanade T, Cohn J F, Tian Y L. Comprehensive database for facial expression analysis[C]//Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), March 28-30, 2000, Grenoble, France. New York: IEEE, 2000: 6577271.
[18] Lucey P, Cohn J F, Kanade T, et al. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, June 13-18, 2010, San Francisco, CA, USA. New York: IEEE, 2010: 94-101.
[20] Dhall A, Ramana Murthy O V, Goecke R, et al. Video and image based emotion recognition challenges in the wild[C]//Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15), November 9-13, 2015, Seattle, Washington, USA. New York: ACM, 2015: 423-426.
[21] Li S, Deng W H, Du J P. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA. New York: IEEE, 2017: 2584-2593.
[22] Chollet F. Xception: deep learning with depthwise separable convolutions[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA. New York: IEEE, 2017: 1251-1258.
[23] Deng J K, Zhou Y X, Zafeiriou S. Marginal loss for deep face recognition[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 21-26, 2017, Honolulu, HI, USA. New York: IEEE, 2017: 2006-2014.
[24] Saragih J M, Lucey S, Cohn J F. Face alignment through subspace constrained mean-shifts[C]//2009 IEEE 12th International Conference on Computer Vision, September 29-October 2, 2009, Kyoto, Japan. New York: IEEE, 2009: 1034-1041.
[25] Simard P Y, Steinkraus D, Platt J C. Best practices for convolutional neural networks applied to visual document analysis[C]//Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), August 3-6, 2003, Edinburgh, UK. New York: IEEE, 2003.
[26] Zhao L M, Li X, Zhuang Y T, et al. Deeply-learned part-aligned representations for person re-identification[C]//2017 IEEE International Conference on Computer Vision (ICCV), October 22-29, 2017, Venice, Italy. New York: IEEE, 2017: 3239-3248.
[27] Isola P, Zhu J Y, Zhou T H, et al. Image-to-image translation with conditional adversarial networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA. New York: IEEE, 2017: 5967-5976.
Huihua Wu, Hansong Su, Gaohua Liu, Shen Li, Xiao Su. Facial Expression Recognition Algorithm Based on Cosine Distance Loss Function[J]. Laser & Optoelectronics Progress, 2019, 56(24): 241502.