Spectroscopy and Spectral Analysis, 2023, 43 (4): 1043, Published online: 2023-05-03

Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method
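Author affiliations
1 College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, Heilongjiang, China
2 Quality Supervision, Inspection and Testing Center for Agricultural Products and Processed Products (Daqing), Ministry of Agriculture and Rural Affairs, Daqing 163319, Heilongjiang, China
3 College of Electrical and Information, Northeast Agricultural University, Harbin 150030, Heilongjiang, China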
Abstract
Feature wavelength selection is one of the more effective ways to improve prediction accuracy when building fast quantitative analysis models from near-infrared (NIR) spectra: it retains the informative wavelengths, reduces data redundancy, and improves the usefulness of the data. Random forest (RF), an ensemble algorithm, can select features by computing feature importance. RF uses the average mean square error obtained with the mean decrease accuracy (MDA) method on out-of-bag (OOB) data as the feature importance, and a feature wavelength subset is formed by keeping the variables that pass a feature importance threshold. However, there is no theoretical basis for setting this threshold range, so the range of feature importance thresholds needs to be explored. In addition, because of the randomness of RF, the feature wavelength subset may contain invalid or even interfering variables, so the validity of the selected variables cannot be guaranteed. The RF-iPLS wavelength selection method is therefore proposed. Interval partial least squares (iPLS) tends to select continuous wavebands. Dividing the RF feature wavelength subset into intervals with iPLS compensates for the invalid variables introduced by the randomness of RF, while the discrete wavelengths selected by RF address the redundant information contained in the continuous bands selected by iPLS. To examine the rationality of RF-iPLS, an RF-MC-iPLS algorithm was constructed for comparison by sampling the feature subset 500 times with the Monte Carlo (MC) method. Although RF-iPLS and RF-MC-iPLS are similar in structure, the running time of RF-iPLS is 11.12% shorter, which indicates that RF-iPLS selects feature wavelengths effectively for the prediction model and has low time complexity. To further verify its effectiveness, RF-iPLS was applied to a public NIR data set of grain protein. A PLSR model was built and compared with the full-spectrum PLSR model and with PLSR models based on other wavelength selection methods. Compared with the 117 wavelengths of the full spectrum, RF-iPLS selects 12 feature wavelengths. The RMSEC of the modeling set falls from 2.61 to 0.64, an improvement of about 75.5%, and the RMSEP of the prediction set falls from 2.63 to 0.69, an improvement of 73.8%. RF-iPLS gives the best prediction results among the compared methods, which shows that it is an effective feature wavelength selection method that simplifies NIR quantitative analysis models and achieves efficient dimensionality reduction.
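The importance-and-threshold step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a generic prediction function in place of a trained random forest, uses the evaluation data itself in place of out-of-bag samples, and measures importance as the mean increase in mean square error when one column is shuffled. The names `mda_importance`, `select_by_threshold`, and `toy_predict`, the toy data, and the threshold value 0.01 are all hypothetical. The iPLS interval step is not shown.

```python
import random

def mse(y_true, y_pred):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mda_importance(predict, X, y, n_repeats=20, seed=0):
    """Permutation-style importance: mean increase in MSE when a column is shuffled.

    `predict` is any function mapping one row to a prediction. In the paper's
    setting it would be a random forest evaluated on OOB data; here it is a
    stand-in supplied by the caller (assumption).
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def select_by_threshold(importances, threshold):
    """Indices of variables whose importance exceeds the threshold."""
    return [j for j, score in enumerate(importances) if score > threshold]

# Toy data (assumption): 4 "wavelengths", the target depends on columns 0 and 2 only.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(60)]
y = [3.0 * row[0] + 1.0 * row[2] for row in X]

def toy_predict(row):
    # Stand-in model that reproduces the toy target exactly.
    return 3.0 * row[0] + 1.0 * row[2]

scores = mda_importance(toy_predict, X, y)
selected = select_by_threshold(scores, threshold=0.01)
print(selected)
```

Columns the stand-in model never reads get an importance of exactly zero, so only the informative columns pass the threshold. With a real random forest the scores of uninformative wavelengths are noisy rather than zero, which is why the choice of threshold range matters in the abstract's argument.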

CHEN Rui, WANG Xue, WANG Zi-wen, QU Hao, MA Tie-min, CHEN Zheng-guang, GAO Rui. Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method[J]. Spectroscopy and Spectral Analysis, 2023, 43(4): 1043.
