Spectroscopy and Spectral Analysis, 2023, 43(4): 1043, published online: 2023-05-03

Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method
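Author affiliations
1 College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing, Heilongjiang 163319, China
2 Quality Supervision, Inspection and Testing Center for Agricultural Products and Processed Products (Daqing), Ministry of Agriculture and Rural Affairs, Daqing, Heilongjiang 163319, China
3 College of Electrical and Information, Northeast Agricultural University, Harbin, Heilongjiang 150030, China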
Abstract
Feature wavelength selection is one of the more effective ways to improve prediction accuracy when building rapid quantitative analysis models from near-infrared spectra: it picks out the informative wavelengths, reduces data redundancy, and improves the effectiveness of the data. Random forest (RF), an ensemble algorithm, can select features according to their computed feature importance. RF takes the average mean square error computed by the mean decrease accuracy (MDA) method on out-of-bag (OOB) data as the feature importance, and the variables that pass a feature importance threshold form the feature wavelength subset. There is, however, no theoretical basis for setting this threshold range, so the range has to be explored. In addition, because RF is random, the feature wavelength subset may contain invalid or even interfering variables, and the effectiveness of the selected variables cannot be guaranteed. The RF-iPLS wavelength selection method is therefore proposed. Interval partial least squares (iPLS) tends to select continuous wavebands; dividing the RF feature wavelength subset into intervals compensates for the invalid variables introduced by the randomness of RF, while the discrete wavelengths selected by RF address the redundant information contained in the continuous bands selected by iPLS. To illustrate the rationality of RF-iPLS, an RF-MC-iPLS algorithm was constructed in which the feature subset is resampled 500 times by the Monte Carlo (MC) method. Although the two algorithms are similar in structure, the running time of RF-iPLS is 11.12% shorter, which indicates that RF-iPLS selects feature wavelengths effectively for the prediction model and has relatively low time complexity. To further verify its effectiveness, RF-iPLS was applied to a public grain protein near-infrared spectral data set and PLSR models were established.
The RF-iPLS-based PLSR model was compared with the full-spectrum PLSR model and with PLSR models built on other wavelength selection methods. Compared with the 117 wavelength points of the full spectrum, RF-iPLS selected 12 feature wavelength points. The RMSEC of the modeling (calibration) set fell from 2.61 to 0.64, an improvement in accuracy of about 75.5%, and the RMSEP of the prediction set fell from 2.63 to 0.69, an improvement of 73.8%. RF-iPLS gave the best prediction results among the methods compared, which shows that it is an effective feature wavelength selection method that can simplify near-infrared quantitative analysis models and achieve efficient dimensionality reduction.
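The reported improvements are consistent with reading "improvement" as the relative reduction in RMSE. The short check below makes that arithmetic explicit; the function name is illustrative, and the interpretation is an assumption, since the abstract does not spell out the formula.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a value fell, relative to its starting point."""
    return (before - after) / before * 100.0


# Values taken from the abstract.
rmsec_gain = relative_reduction(2.61, 0.64)      # modeling (calibration) set
rmsep_gain = relative_reduction(2.63, 0.69)      # prediction set
wavelength_cut = relative_reduction(117, 12)     # full spectrum vs. selected points

print(f"RMSEC reduction: {rmsec_gain:.1f}%")      # 75.5%
print(f"RMSEP reduction: {rmsep_gain:.1f}%")      # 73.8%
print(f"Wavelength reduction: {wavelength_cut:.1f}%")  # 89.7%
```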
CHEN Rui, WANG Xue, WANG Zi-wen, QU Hao, MA Tie-min, CHEN Zheng-guang, GAO Rui. Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method[J]. Spectroscopy and Spectral Analysis, 2023, 43(4): 1043.
