Spectroscopy and Spectral Analysis, Vol. 39, No. 3, pp. 717-722

特征分层结合改进粒子群算法的近红外光谱特征选择方法研究

Study on Feature Selection of Near Infrared Spectra Based on Feature Hierarchical Combining Improved Particle Swarm Optimization


Abstract

In quantitative modeling of near-infrared (NIR) spectral data, high redundancy and high noise severely degrade the robustness and accuracy of the resulting models. This paper therefore proposes a spectral feature selection method that combines feature stratification with an improved particle swarm optimization (PSO) algorithm. First, an importance score is computed for each feature using mutual information, and the features are sorted in descending order of importance; this avoids the loss of important information that can occur when a dimensionality-reduction method is used to extract principal components. Second, the concept of jump degree is introduced and a feature stratification method is constructed: features of similar importance are merged into the same feature subset, so that the descending-ordered feature set is partitioned into distinct feature subsets. This avoids the uncertainty caused by manually setting an importance-score threshold during feature screening. Finally, PSO, which converges quickly and has few control parameters, is used to search for the optimal feature subset, with two improvements. A chaotic model is introduced to increase population diversity, which improves the global search ability of PSO and helps it avoid local optima. In addition, the number of features is incorporated into the fitness function, and a penalty factor adjusts its influence in the early iterations, which improves the adaptability of the algorithm. The stratified data are then accumulated one feature subset at a time and used as the input to the improved PSO, which selects a highly discriminative feature subset.
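The two PSO modifications described above can be sketched as follows. The abstract does not name the chaotic model or give the form of the penalty, so this is only an illustration under stated assumptions: a logistic map is assumed as the chaotic model, particles are assumed to be binary feature masks, and the penalty is assumed to decay linearly to zero over the first half of the iterations. The names `logistic_map_sequence`, `chaotic_binary_swarm`, `penalized_fitness`, `lam`, and `early_fraction` are hypothetical.

```python
def logistic_map_sequence(x0, length, mu=4.0):
    """Chaotic sequence from the logistic map x <- mu * x * (1 - x).

    Assumption: the logistic map stands in for the unspecified
    "chaotic model" of the abstract.
    """
    values = []
    x = x0
    for _ in range(length):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values


def chaotic_binary_swarm(n_particles, n_features, x0=0.37):
    """Initial swarm of binary feature masks.

    Each bit is obtained by thresholding one value of the chaotic
    sequence at 0.5 (an assumed encoding, 1 = feature selected).
    """
    seq = logistic_map_sequence(x0, n_particles * n_features)
    return [
        [1 if seq[p * n_features + j] > 0.5 else 0 for j in range(n_features)]
        for p in range(n_particles)
    ]


def penalized_fitness(model_score, n_selected, n_total,
                      iteration, max_iter, lam=0.1, early_fraction=0.5):
    """Model score minus a feature-count penalty (assumed form).

    The penalty weight starts at `lam`, decays linearly, and is zero
    once `iteration` reaches `early_fraction * max_iter`, so the
    feature count only matters in the early iterations.
    """
    cutoff = early_fraction * max_iter
    if iteration >= cutoff:
        return model_score
    weight = lam * (1.0 - iteration / cutoff)
    return model_score - weight * (n_selected / n_total)
```

With this form, two masks that give the same model score are ranked by size early in the search (the smaller mask wins), while late iterations compare model quality alone.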
The feature selection process is illustrated using the nicotine index as an example. NIR spectra were acquired with a Nicolet Antaris II near-infrared spectrometer over a scanning range of 4 000~10 000 cm⁻¹. First, mutual information was used to compute the importance score of each of the 1 557 features of the full spectrum for quantitative modeling of the target index, with each score taken as the mean of 30 runs. Second, all features were sorted in descending order of importance score and their jump degrees were computed. The critical points for stratification were located from the jump degrees, and the features were divided into layers, giving a feature set of 8 feature subsets, S={S′1, S′2, S′3, S′4, S′5, S′6, S′7, S′8}. Then the cumulative subsets {S′1}, {S′1, S′2}, {S′1, S′2, S′3}, …, {S′1, S′2, S′3, S′4, S′5, S′6, S′7, S′8} were used in turn as the candidate set for the initial particle swarm. With R/(1+RMSEP) as the criterion for judging feature subsets, each experiment was repeated 50 times, and the feature subset with the largest ratio was taken as the optimal feature subset. To verify the effectiveness of the algorithm, representative tobacco leaf NIR spectra were selected as the training and test sets, PLS quantitative models were built for two indices, nicotine and total sugar, and the results were compared with those obtained from the full spectrum, from the stratified feature spectrum, and from the feature spectrum selected by the particle swarm algorithm. The simulation results show that, for the features selected by the proposed algorithm, the modeling correlation coefficients R for nicotine and total sugar are 0.988 5 and 0.982 2, the root mean square errors of cross-validation (RMSECV) are 0.098 4 and 0.889 3, and the root mean square errors of prediction (RMSEP) are 0.100 7 and 0.901 6, respectively. Model accuracy is clearly higher than that of the other three methods.
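A minimal sketch of the stratification and candidate-set construction described above. The abstract does not define the jump degree or the rule for locating critical points, so both are assumptions here: the jump degree is taken as the drop between consecutive sorted scores, and a new layer starts wherever a drop exceeds the mean drop. The function names are hypothetical; only the criterion R/(1+RMSEP) comes from the text.

```python
def jump_degrees(sorted_scores):
    """Drop between consecutive scores in a descending list (assumed definition)."""
    return [a - b for a, b in zip(sorted_scores, sorted_scores[1:])]


def stratify(scores):
    """Split feature indices into layers of similar importance.

    Features are ranked by descending score; a new layer starts
    wherever the jump exceeds the mean jump (an assumed rule that
    needs no hand-set score threshold).
    """
    if not scores:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    jumps = jump_degrees([scores[i] for i in order])
    if not jumps:
        return [order]
    mean_jump = sum(jumps) / len(jumps)
    layers = [[order[0]]]
    for idx, jump in zip(order[1:], jumps):
        if jump > mean_jump:
            layers.append([])
        layers[-1].append(idx)
    return layers


def cumulative_candidates(layers):
    """Build the candidate sets {S1}, {S1, S2}, ..., {S1, ..., Sk}."""
    candidates = []
    accumulated = []
    for layer in layers:
        accumulated = accumulated + layer
        candidates.append(list(accumulated))
    return candidates


def subset_score(r, rmsep):
    """Evaluation criterion from the text: R / (1 + RMSEP), larger is better."""
    return r / (1.0 + rmsep)


# Toy importance scores for six features (illustrative values only).
scores = [0.50, 0.90, 0.10, 0.88, 0.49, 0.86]
layers = stratify(scores)
candidates = cumulative_candidates(layers)
```

Each candidate set would then be passed to the optimizer and the resulting PLS model scored with `subset_score`; the candidate with the largest score is kept.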
In terms of the number of selected features, the proposed algorithm selects the fewest, effectively removing weakly correlated, noisy, and redundant information from the original feature set. The resulting model also uses the fewest principal factors, which reduces model complexity and makes the model more robust and more widely applicable.

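Supplementary information

CLC number: O657.3

DOI: 10.3964/j.issn.1000-0593(2019)03-0717-06

Funding: National Key Research and Development Program of China (2016YFB1001103); China Tobacco Yunnan Industrial Co., Ltd. projects (2017XX02, 2018JC01)

Received: 2018-01-18

Revised: 2018-05-11

Published online: --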

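Author affiliations

XU Bao-ding (徐宝鼎): College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100
QIN Yu-hua (秦玉华): College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061
YANG Ning (杨 宁): College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100
GAO Rui (高 锐): Technology Center, China Tobacco Yunnan Industrial Co., Ltd., Kunming, Yunnan 650024
YUAN Cheng-cheng (苑程程): College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100

Corresponding author: XU Bao-ding (xbd991@163.com)

Note: XU Bao-ding, born 1990, master's student at the College of Information Science and Engineering, Ocean University of China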


Cite this paper

XU Bao-ding, QIN Yu-hua, YANG Ning, GAO Rui, YUAN Cheng-cheng. Study on Feature Selection of Near Infrared Spectra Based on Feature Hierarchical Combining Improved Particle Swarm Optimization[J]. Spectroscopy and Spectral Analysis, 2019, 39(3): 717-722

徐宝鼎,秦玉华,杨 宁,高 锐,苑程程. 特征分层结合改进粒子群算法的近红外光谱特征选择方法研究[J]. 光谱学与光谱分析, 2019, 39(3): 717-722

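Cited by

【1】LONG Yao-wei, SUN Hong, GAO De-hua, ZHANG Zhi-yong, LI Min-zan, YANG Wei (龙耀威, 孙 红, 高德华, 张智勇, 李民赞, 杨 玮). Data extraction from coated-film spectral imaging and detection of crop chlorophyll distribution (镀膜型光谱成像数据提取与作物叶绿素分布探测研究). Spectroscopy and Spectral Analysis (光谱学与光谱分析), 2020, 40(5): 1581--1

【2】ZHANG Lei, DING Xiang-qian, GONG Hui-li, WU Li-jun, BAI Xiao-li, LUO Lin (张 磊, 丁香乾, 宫会丽, 吴丽君, 白晓莉, 罗 林). Near-infrared spectral feature variable selection with an improved harmony search algorithm (改进和声搜索算法的近红外光谱特征变量选择). Spectroscopy and Spectral Analysis (光谱学与光谱分析), 2020, 40(6): 1869--1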
