A Variable Selection Method of the Selectivity Ratio Competitive Model Population Analysis for Near Infrared Spectroscopy

WANG Yu-xi; JIA Zhen-hong; YANG Jie; Nikola K Kasabov

doi:10.3964/j.issn.1000-0593(2020)04-1056-07

Spectroscopy and Spectral Analysis, 2020, 40(4): 1056. Published online: 2020-12-11

A Variable Selection Method of the Selectivity Ratio Competitive Model Population Analysis for Near Infrared Spectroscopy

WANG Yu-xi ¹, JIA Zhen-hong ¹,*, YANG Jie ², Nikola K Kasabov ³

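Author affiliations

¹ College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China

² Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China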

³ Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland 1020, New Zealand

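Abstract (translated from the Chinese)

Spectral analysis is an important application area of chemometrics and has been widely applied in many fields, and spectral variable selection is a key step within it. It is essential to study variable selection methods that objectively identify useful informative variables and eliminate irrelevant or interfering ones. A new variable selection method is proposed, named selectivity ratio competitive model population analysis (SRCMPA). The algorithm adopts the ideas of the selectivity ratio, adaptive weighted sampling, and model population analysis, and combines them with variable ranking and an exponentially decreasing function. Key wavelengths are defined as those with large score values in a multivariate linear regression model, and the selectivity ratio score under the linear PLS model serves as the index of each wavelength's importance. According to this importance, SRCMPA then selects N wavelength subsets in sequence through Monte Carlo sampling, running in an iterative and competitive manner. In each sampling run, a fixed ratio of the samples is used to build a PLS calibration model and the selectivity ratio of each variable is calculated. Based on the ranked selectivity ratio scores, with the normalized SR (selectivity ratio) scores as weights, key variables are selected in a two-step procedure: enforced selection by the exponentially decreasing function, followed by competitive selection by adaptive weighted sampling. Finally, cross-validation (CV) is applied to choose the subset with the lowest root mean square error of cross-validation (RMSECV) as the optimal subset. The algorithm was tested on a wheat protein data set and a beer data set and compared with three efficient algorithms.

The experimental results indicate that the algorithm finds the best combination of key wavelength variables in each data set, that the selected variables can be used to interpret the chemical properties of interest, and that the resulting models also give the best evaluation results. On the beer spectral data set, compared with the full-spectrum PLS model, the number of variables decreased from 567 to about 42. The RMSECV fell from 0.622 to 0.115 and the RMSEP from 0.823 to about 0.263, improving prediction accuracy by 81.5% and 68.1%, respectively. Q2_CV and Q2_test rose from 0.940 and 0.852 to 0.994 and 0.995. On the wheat protein spectral data set, compared with the full-spectrum PLS model, the number of variables decreased from 175 to about 18. The RMSECV fell from 0.607 to 0.292 and the RMSEP from 0.519 to about 0.234, improving prediction accuracy by 51.9% and 54.9%, respectively. Q2_CV and Q2_test rose from 0.748 and 0.774 to 0.931 and 0.839.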

Abstract

Spectral analysis is an important application of chemometrics and is widely used in many fields, and spectral variable selection is a key step in it. It is therefore critical to study variable selection methods that objectively identify useful informative variables and eliminate irrelevant or interfering ones. This study proposes a new variable selection method, selectivity ratio competitive model population analysis (SRCMPA). The algorithm adopts the ideas of the selectivity ratio (SR), adaptive weighted sampling, and model population analysis, and combines them with variable ranking and an exponentially decreasing function. Key wavelengths are defined as those with high score values in the multivariate linear regression model; here, the SR score under the PLS model is the index used to evaluate the importance of each wavelength. According to this importance, SRCMPA selects N wavelength subsets in sequence over Monte Carlo sampling runs, in an iterative and competitive manner. In each sampling run, a PLS calibration model is built on a fixed ratio of the samples and the SR of each variable is calculated. Using the ranked SR scores, with the normalized SR scores as weights, key variables are then selected in two steps: enforced selection by the exponentially decreasing function, followed by competitive selection by adaptive weighted sampling. Finally, cross-validation (CV) is applied to choose the subset with the lowest root mean square error of cross-validation (RMSECV) as the optimal subset. The algorithm was tested on a wheat protein data set and a beer data set and compared with three efficient algorithms.
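The abstract gives no formulas, so the following is only a minimal Python sketch of the procedure it describes, under stated assumptions. The selectivity ratio is computed in the usual target-projection way (explained over residual variance per variable), the exponentially decreasing function borrows the constants commonly used in CARS-style algorithms, and the number of PLS components is fixed rather than tuned. All function names (`pls1_coef`, `selectivity_ratio`, `edf_ratio`, `srcmpa_subsets`, `rmsecv`, `best_subset`) and all default parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """Regression vector of a PLS1 model (NIPALS) on mean-centered data."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w
        p, c = Xk.T @ t / (t @ t), yk @ t / (t @ t)
        Xk, yk = Xk - np.outer(t, p), yk - c * t  # deflate X and y
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def selectivity_ratio(X, y, n_comp):
    """SR per variable: explained / residual variance after target projection."""
    Xc = X - X.mean(axis=0)
    b = pls1_coef(X, y, n_comp)
    t_tp = Xc @ (b / np.linalg.norm(b))   # target-projection scores
    p_tp = Xc.T @ t_tp / (t_tp @ t_tp)    # target-projection loadings
    explained = np.outer(t_tp, p_tp)
    residual = Xc - explained
    # Small constant guards against a zero residual (assumption of this sketch).
    return (explained ** 2).sum(axis=0) / ((residual ** 2).sum(axis=0) + 1e-12)

def edf_ratio(i, n_runs, n_vars):
    """Share of variables kept in run i (1-based): 1 in run 1, 2/n_vars in the last run."""
    k = np.log(n_vars / 2) / (n_runs - 1)
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    return a * np.exp(-k * i)

def srcmpa_subsets(X, y, n_runs=50, sample_ratio=0.8, max_comp=5, seed=0):
    """Return one variable subset per Monte Carlo run; subset sizes never increase."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kept, subsets = np.arange(p), []
    for i in range(1, n_runs + 1):
        rows = rng.choice(n, size=int(sample_ratio * n), replace=False)
        sr = selectivity_ratio(X[np.ix_(rows, kept)], y[rows], min(max_comp, kept.size))
        # Step 1: enforced selection, keep the top-ranked share given by the EDF.
        n_keep = min(kept.size, max(2, int(round(edf_ratio(i, n_runs, p) * p))))
        top = np.argsort(sr)[::-1][:n_keep]
        # Step 2: competitive selection, draw with replacement using normalized SR weights.
        drawn = rng.choice(top, size=n_keep, replace=True, p=sr[top] / sr[top].sum())
        kept = kept[np.unique(drawn)]
        subsets.append(kept.copy())
    return subsets

def rmsecv(X, y, n_comp, n_folds=5):
    """Root mean square error of k-fold cross-validation for a PLS1 model."""
    idx, sq_err = np.arange(len(y)), 0.0
    for f in range(n_folds):
        test = idx[f::n_folds]
        train = np.setdiff1d(idx, test)
        b = pls1_coef(X[train], y[train], n_comp)
        pred = (X[test] - X[train].mean(axis=0)) @ b + y[train].mean()
        sq_err += ((y[test] - pred) ** 2).sum()
    return float(np.sqrt(sq_err / len(y)))

def best_subset(X, y, subsets, max_comp=5):
    """Subset with the lowest RMSECV, together with that RMSECV."""
    scores = [rmsecv(X[:, s], y, min(max_comp, s.size)) for s in subsets]
    return subsets[int(np.argmin(scores))], min(scores)
```

A typical call would be `subsets = srcmpa_subsets(X, y)` followed by `best, score = best_subset(X, y, subsets)`, where `X` is a samples-by-wavelengths array and `y` the reference values. Each run refits the model only on the variables that survived the previous run, which is what makes the selection competitive.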
The experimental results indicate that the algorithm finds the best combination of key wavelength variables in each data set, that the selected variables can be used to interpret the chemical properties of interest, and that the resulting models also give the best evaluation results. For the beer data set, compared with the full-spectrum PLS model, the number of variables was reduced from 567 to about 42. The RMSECV decreased from 0.622 to 0.115 and the RMSEP from 0.823 to 0.363, improving prediction accuracy by 81.5% and 55.9%, respectively. Q2_CV and Q2_test increased from 0.940 and 0.852 to 0.994 and 0.995. For the wheat protein data set, compared with the full-spectrum PLS model, the number of variables was reduced from 175 to about 18. The RMSECV decreased from 0.607 to 0.292 and the RMSEP from 0.519 to 0.234, improving prediction accuracy by 51.9% and 54.9%, respectively. Q2_CV and Q2_test increased from 0.748 and 0.774 to 0.931 and 0.839.
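The accuracy gains quoted above appear to be relative reductions in RMSE with respect to the full-spectrum model. The short Python sketch below checks that reading against the quoted figures. The abstract does not define Q2, so the `q2` helper uses the common form 1 - PRESS/TSS as an assumption; the helper names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error (RMSECV or RMSEP, depending on how y_pred was obtained)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def q2(y_true, y_pred):
    """Q2 = 1 - PRESS / TSS (a common definition, assumed here)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    press = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - press / tss)

def relative_improvement(rmse_full, rmse_selected):
    """Percentage reduction in RMSE relative to the full-spectrum model."""
    return 100.0 * (rmse_full - rmse_selected) / rmse_full

# Values quoted in the abstract: (full-spectrum RMSE, RMSE after variable selection).
quoted = {
    "beer RMSECV": (0.622, 0.115),
    "beer RMSEP": (0.823, 0.363),
    "wheat RMSECV": (0.607, 0.292),
    "wheat RMSEP": (0.519, 0.234),
}
for name, (full, selected) in quoted.items():
    print(f"{name}: {relative_improvement(full, selected):.1f}%")
```

Run as written, the loop prints 81.5%, 55.9%, 51.9% and 54.9%, which matches the percentages stated in the paragraph above.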

WANG Yu-xi, JIA Zhen-hong, YANG Jie, Nikola K Kasabov. A Variable Selection Method of the Selectivity Ratio Competitive Model Population Analysis for Near Infrared Spectroscopy[J]. Spectroscopy and Spectral Analysis, 2020, 40(4): 1056.
