Spectroscopy and Spectral Analysis, 2019, 39 (12): 3809, published online: 2020-01-07  

NIR Spectral Feature Selection Using Lasso Method and Its Application in the Classification Analysis
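Author affiliations
1 School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013
2 School of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230061
3 School of Food and Biological Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013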
Abstract
Near-infrared spectroscopy (NIRS) is a non-destructive detection method that performs qualitative or quantitative analysis from a sample's characteristic spectral data. The completeness and representativeness of the feature data determine the performance of the resulting model. Existing analytical methods, however, can only screen features within spectral sub-intervals, so the resulting models are unstable and difficult to optimize further. To extract features from high-dimensional NIR spectral data and improve the accuracy and stability of qualitative NIR models, this paper proposes a spectral feature screening method based on the Least Absolute Shrinkage and Selection Operator (LASSO). Tricholoma matsutake from Yunnan, one of China's distinctive high-value export products, is taken as the object of a classification study. The paper discusses the effectiveness of the method for screening high-dimensional spectral features and compares the prediction accuracy and stability of matsutake authentication models and edible fungus classification models built on LASSO-selected feature variables with those built on principal component analysis (PCA) dimensionality reduction. Fresh Yunnan matsutake is easy to identify by its distinctive shape, but sliced dried matsutake loses that shape, and adulteration of dried matsutake persists in the domestic market despite repeated bans. A total of 166 dried samples of four kinds (Yunnan Tricholoma matsutake, Pleurotus eryngii, Laorentou mushroom ("Jujube hilt nipple mushroom"), and Agaricus blazei) were analyzed. A NIRQuest512 NIR spectrometer with a spectral range of 900~1 700 nm produced 166×512-dimensional raw spectral data. After abnormal data were removed, the spectra were preprocessed with the standard normal variate (SNV) transformation.
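The SNV preprocessing step can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the common definition of SNV, in which each spectrum is centered by its own mean and divided by its own sample standard deviation. The function name `snv` and the toy data are hypothetical.

```python
from statistics import mean, stdev


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own.

    Assumes no spectrum is perfectly flat (a zero standard deviation
    would cause a division by zero).
    """
    corrected = []
    for spectrum in spectra:
        mu = mean(spectrum)
        sigma = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
        corrected.append([(x - mu) / sigma for x in spectrum])
    return corrected


# Hypothetical toy data: 2 spectra x 4 wavelengths (the study used 166 x 512).
# The second spectrum is the first one multiplied by 10.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
scaled = snv(raw)
```

Because SNV works row by row, the two toy spectra, which differ only by a multiplicative factor, become identical after the transformation. Removing that kind of per-sample scaling is the usual motivation for the step.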
LASSO was then applied to the preprocessed spectra to select feature variables across the full spectral range. The Kennard-Stone method was combined with a typical linear modeling algorithm (k-nearest neighbor, KNN) and a nonlinear one (back-propagation neural network, BP) to build a matsutake authentication model and an edible fungus classification model. Both models were tested on blind samples, the differences between LASSO and PCA were analyzed, and the Monte Carlo method was used to assess the stability of both models. The results show that the models based on LASSO feature selection achieve higher prediction accuracy and stability than those based on PCA. With the original spectral data, the authentication model reached a prediction accuracy of 69.57% (BP) and 60.87% (KNN), and the edible fungus classification model 67.39% (BP) and 65.22% (KNN). With LASSO feature selection, the authentication model reached 100% (BP) and 78.26% (KNN), and the classification model 89.13% (BP) and 80.43% (KNN). Over 10 Monte Carlo runs, the two models averaged 99.93% and 97.22%, respectively. Compared with dimensionality-reduction algorithms such as PCA, LASSO performs feature selection and dimensionality reduction over the full spectral range and effectively improves the predictive performance of qualitative NIR models, offering a new feature screening method for NIR analysis.
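How LASSO turns a high-dimensional spectrum into a short list of selected variables can be illustrated with a small sketch. The abstract does not state the exact formulation or solver, so the code below rests on assumptions: the standard penalized least-squares objective, cyclic coordinate descent, centered data with no intercept, and a numeric response `y` standing in for a class coding. All names and the toy design matrix are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes X and y are centered, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Hypothetical toy design: 8 samples x 4 "wavelengths" with centered,
# mutually orthogonal columns; only columns 0 and 2 drive the response.
c0 = [1, 1, 1, 1, -1, -1, -1, -1]
c1 = [1, 1, -1, -1, 1, 1, -1, -1]
c2 = [1, -1, 1, -1, 1, -1, 1, -1]
c3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [[float(c0[i]), float(c1[i]), float(c2[i]), float(c3[i])] for i in range(8)]
y = [3.0 * c0[i] - 2.0 * c2[i] for i in range(8)]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(selected)  # -> [0, 2]
```

The L1 penalty sets the coefficients of uninformative columns to exactly zero, so the indices of the nonzero coefficients serve directly as the selected variables. This contrasts with PCA, where each retained component is a mixture of all original wavelengths. In practice the penalty weight `lam` would be tuned, for example by cross-validation.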

LI Yu-qiang, PAN Tian-hong, LI Hao-ran, ZOU Xiao-bo. NIR Spectral Feature Selection Using Lasso Method and Its Application in the Classification Analysis[J]. Spectroscopy and Spectral Analysis, 2019, 39(12): 3809.
