A Variable Selection Method Based on Ensemble-SISPLS for Near Infrared Spectroscopy

LI Si-hai; ZHAO Lei

doi:10.3964/j.issn.1000-0593(2019)04-1047-06

Spectroscopy and Spectral Analysis, 2019, 39 (4): 1047, published online: 2019-04-11


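LI Si-hai ¹,*, ZHAO Lei ²

Author affiliations

¹ School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou, Gansu 730000

² Provincial Key Laboratory of Chemistry and Quality Research of Chinese (Tibetan) Medicine in Gansu Universities, Lanzhou, Gansu 730000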

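Abstract (translated from the Chinese)

Near-infrared spectra are characterized by high dimensionality and small sample sizes, and variable selection is an effective way to improve the robustness and interpretability of quantitative analysis models. Sure independence screening (SIS) is a variable selection method for ultrahigh-dimensional data based on marginal correlation, and it is widely used for variable selection in gene microarray data. SIS can reduce the data dimension to the scale of the sample size, a dimension-reduction ability comparable to that of LASSO; under fairly broad asymptotic conditions, its sure screening property means that the probability of retaining all important variables tends to 1. Variable selection based on sure independence screening with partial least squares (SIS-SPLS) is an iterative SIS method: SIS first makes a preliminary selection of important spectral variables; stepwise forward selection then proceeds according to the marginal correlation of those variables, a partial least squares regression model is built, and the final selection is determined by the Bayesian information criterion (BIC). SIS-SPLS screens important variables incrementally through stepwise forward selection; as the number of latent variables increases and the residual of the dependent variable gradually decreases, the number of variables it selects tends to stabilize. However, because variable importance is assessed by marginal correlation alone, when the number of spectral variables far exceeds the number of samples the method still selects too many variables and its selection results are not robust enough. To further improve the robustness of variable selection with small samples, ensemble learning is introduced into SIS-SPLS and an ensemble SIS-SPLS variable selection method (Ensemble-SISPLS) is proposed. The method first applies bootstrap resampling to the calibration set and runs SIS-SPLS on each resulting calibration subset. The selection results of all subsets are then combined through a voting mechanism with a frequency threshold: variables whose frequency exceeds the given threshold are selected, a partial least squares regression model is built, and the 5-fold cross-validation root mean square error is computed. The two key parameters, the frequency threshold and the number of latent variables, are optimized by grid search; sub-model performance is evaluated jointly from the cross-validation root mean square error and the number of variables, and the variables contained in the optimal sub-model are taken as the final selection. Variable selection experiments were carried out on the Corn dataset and an Angelica sinensis dataset to compare the performance of Ensemble-SISPLS, SIS-SPLS and UVE-PLS. The Angelica sinensis dataset contains 77 samples collected from Minxian and Weiyuan Counties, Gansu; near-infrared spectra of all samples were scanned with a Nicolet-6700 near-infrared spectrometer and used to predict the ferulic acid content of Angelica sinensis. On the Corn dataset, the number of selected variables, RMSEP and coefficient of determination were 22, 0.000 8 and 0.999 8 for Ensemble-SISPLS and 97, 0.007 3 and 0.998 8 for SIS-SPLS. On the Angelica sinensis dataset they were 24, 0.018 1 and 0.996 3 for Ensemble-SISPLS and 38, 0.022 6 and 0.994 3 for SIS-SPLS. The results show that the method further improves the robustness and predictive ability of the variable selection results. Ensemble-SISPLS effectively combines the strong variable selection ability of SIS-SPLS with the good generalization ability of ensemble learning, improving the robustness of variable selection. Moreover, because a compromise is made between sub-model predictive ability and number of variables, the number of selected variables is reduced to some extent and model interpretability is improved.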

Abstract

Near-infrared spectra are high-dimensional, small-sample data: the number of variables far exceeds the number of samples. Variable selection is an effective way to improve the robustness and interpretability of quantitative near-infrared analysis models. Sure Independence Screening (SIS), a feature selection method for ultrahigh-dimensional data based on the marginal correlation between each predictor and the response, is widely used for variable selection in gene microarray data. SIS can reduce the dimensionality of the data to the scale of the sample size, a reduction comparable to that of LASSO. Under a fairly general asymptotic framework, SIS has the sure screening property: all important variables are retained with probability tending to one. The variable selection method based on sure independence screening combined with partial least squares regression (SIS-SPLS) is an iterative SIS method. SIS is first used for an initial selection of important variables; stepwise forward selection is then carried out according to the marginal correlation of the selected variables, a partial least squares regression model is built, and the final selection is determined by the Bayesian Information Criterion (BIC). SIS-SPLS thus screens important variables incrementally, and as the number of latent variables increases and the response residual gradually decreases, the number of selected variables tends to stabilize. However, because variable importance is evaluated by marginal correlation alone, when the number of spectral variables is much larger than the number of samples the method can still select too many variables and yield selection results that are not sufficiently robust.
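The screening step of SIS described above ranks variables by their absolute marginal correlation with the response and keeps the top few. A minimal sketch in plain Python is given here for illustration. The function names `pearson` and `sis_screen`, the default cutoff d = n - 1, and the toy data are assumptions of this sketch and are not taken from the paper. The later SIS-SPLS steps (forward selection, partial least squares modelling, BIC) are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis_screen(X, y, d=None):
    """Return indices of the d columns of X with the largest |corr(x_j, y)|.

    X is a list of rows (samples). d defaults to n - 1, so that fewer
    variables than samples are kept (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    if d is None:
        d = n - 1
    d = min(d, p)
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy data: 5 samples, 3 variables; the response depends only on column 2.
X = [[0.1, 5.0, 1.0], [0.4, 4.1, 2.0], [0.3, 5.5, 3.0],
     [0.9, 4.7, 4.0], [0.2, 5.2, 5.0]]
y = [2.0 * row[2] + 1.0 for row in X]
print(sis_screen(X, y, d=1))  # [2]
```

Because each variable is scored on its own, the ranking costs one pass per column, which is what makes this kind of screening practical when variables greatly outnumber samples.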
To improve the robustness of variable selection with small samples, this paper introduces ensemble learning into SIS-SPLS and proposes an ensemble variable selection method (Ensemble-SISPLS). Following the bagging strategy, the calibration set was first resampled at random by the bootstrap, and SIS-SPLS was applied to each resulting calibration subset. The selection results of all subsets were aggregated by voting: variables whose selection frequency exceeded a given threshold were retained, a partial least squares regression model was built on them, and its root mean square error of 5-fold cross-validation was calculated. The two key parameters, the frequency threshold and the number of latent variables, were optimized by grid search. Each sub-model was evaluated jointly on its cross-validation root mean square error and its number of variables, and the variables in the optimal sub-model were taken as the final selection. Variable selection experiments were performed on the Corn dataset and an Angelica sinensis dataset, comparing the three methods Ensemble-SISPLS, SIS-SPLS and UVE-PLS in terms of the number of selected variables and model robustness. The Angelica sinensis dataset comprised 77 samples collected from Minxian and Weiyuan Counties in Gansu Province; near-infrared spectra of all samples were acquired with a Nicolet-6700 near-infrared spectrometer and used to predict the ferulic acid content of Angelica sinensis.
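The bootstrap-and-vote step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `ensemble_select` and its parameters are hypothetical names, the base selector is passed in as any callable (in the paper that role is played by SIS-SPLS), and the threshold is expressed here as a fraction of bootstrap rounds, whereas the paper may use a raw count. The grid search over the threshold and the number of latent variables, and the 5-fold cross-validated partial least squares model, are omitted.

```python
import random
from collections import Counter


def ensemble_select(X, y, selector, n_boot=100, threshold=0.5, seed=0):
    """Bagging-style variable selection by vote counting.

    selector(X_sub, y_sub) must return an iterable of column indices.
    Returns the indices whose selection frequency is strictly greater
    than `threshold`, together with the frequency of every column.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_boot):
        # Bootstrap: draw n sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        # One vote per variable per calibration subset.
        votes.update(set(selector(X_sub, y_sub)))
    freq = {j: votes[j] / n_boot for j in range(len(X[0]))}
    selected = sorted(j for j, f in freq.items() if f > threshold)
    return selected, freq
```

Passing the selector as a callable keeps the voting logic independent of the base method, so the same wrapper could be tried with other selectors when comparing their stability.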
On the Corn dataset, Ensemble-SISPLS selected 22 variables with an RMSEP of 0.000 8 and a coefficient of determination of 0.999 8, whereas SIS-SPLS selected 97 variables with an RMSEP of 0.007 3 and a coefficient of determination of 0.998 8. On the Angelica sinensis dataset, Ensemble-SISPLS selected 24 variables with an RMSEP of 0.018 1 and a coefficient of determination of 0.996 3, whereas SIS-SPLS selected 38 variables with an RMSEP of 0.022 6 and a coefficient of determination of 0.994 3. These results show that Ensemble-SISPLS further improves the robustness and predictive ability of the variable selection result. By combining the strong variable selection ability of SIS-SPLS with the good generalization capacity of ensemble learning, it improves the robustness of variable selection. In addition, because sub-models are evaluated by a compromise between prediction performance and number of variables, the method reduces the number of selected variables to some extent and improves the interpretability of the model.
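For readers unfamiliar with the two figures of merit reported above, the following sketch shows how RMSEP (root mean square error of prediction) and a coefficient of determination are commonly computed on a prediction set. The abstract does not state which definition of the coefficient of determination the authors used, so the form 1 - SS_res / SS_tot below is an assumption, as are the function names.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a prediction set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A lower RMSEP and a coefficient of determination closer to 1 both indicate better agreement between predicted and reference values, which is the sense in which the figures above are compared.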


LI Si-hai, ZHAO Lei. A Variable Selection Method Based on Ensemble-SISPLS for Near Infrared Spectroscopy[J]. Spectroscopy and Spectral Analysis, 2019, 39(4): 1047.
