Spectroscopy and Spectral Analysis, 2023, 43 (12): 3806, Published online: 2024-01-11

Research on t-SNE Similarity Measurement Method Based on Wasserstein Divergence
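Author affiliations
1 College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong
2 Information Center, China Tobacco Jiangxi Industrial Co., Ltd., Nanchang 330096, Jiangxi
3 Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, Shandong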
Abstract
Near-infrared (NIR) spectra are high-dimensional, highly redundant, and nonlinear, which seriously degrades the accuracy of similarity measurement between samples. This paper therefore proposes a t-distributed stochastic neighbor embedding algorithm based on the Wasserstein divergence (Wt-SNE). Following the idea of manifold learning, a Gaussian distribution is used to convert the distances between high-dimensional data points into a probability distribution, and the heavier-tailed t-distribution is used to represent the probability distribution of the corresponding points in the low-dimensional space. The probability distribution of the high-dimensional data is embedded into the low-dimensional space to reconstruct the low-dimensional manifold structure. The Wasserstein divergence is introduced to measure the difference between the probability distributions in the two spaces, and reducing the divergence value increases the similarity of the two distributions, thereby achieving dimensionality reduction of the high-dimensional data. To verify the effectiveness of Wt-SNE, tobacco leaf NIR spectral data were first projected into a low-dimensional space and the result was compared with PCA, LPP, and t-SNE; the class boundaries between samples were clearer in the low-dimensional space after Wt-SNE dimensionality reduction. Second, KNN, SVM, and PLS-DA classifiers were used to predict tobacco leaf origin from the dimensionality-reduced data, with accuracies of 93.8%, 91.5%, and 92.7%, respectively, indicating that the reduced data not only reconstruct the spatial structure of the original spectra but also retain the similarity relationships between samples. Finally, a tobacco leaf in a cigarette leaf-blend formula was chosen as the target for single-component replacement, and replacement samples were selected according to the Mahalanobis distance between the candidate samples and the target sample. The experiments showed that the replacement leaf selected by Wt-SNE had the highest similarity to the target leaf: its contents of chemical components such as nicotine and total sugar differed little from those of the target, and its aroma, smoke, and taste scores were highly consistent with it. The method can effectively measure the similarity between tobacco leaf NIR spectra and provides a strong basis for maintaining cigarette leaf-blend formulas.
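The two probability distributions named in the abstract can be illustrated with a minimal sketch of the standard t-SNE affinities: a Gaussian kernel on high-dimensional distances and a Student-t kernel (one degree of freedom) on low-dimensional distances. This is not the authors' implementation. The abstract does not give the form of the Wasserstein divergence, so the sketch compares the two distributions with the Kullback-Leibler divergence of ordinary t-SNE as a stand-in. The function names, the single shared bandwidth `sigma` (standard t-SNE tunes one bandwidth per point from a perplexity target), and the toy data are all assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_affinities(X, sigma=1.0):
    """Joint probabilities p_ij from a Gaussian kernel on high-dimensional data.

    Assumption: one shared bandwidth `sigma` for all points. If all other
    points are extremely far from some point, the row sum can underflow to
    zero; this toy sketch does not guard against that.
    """
    n = len(X)
    cond = []  # cond[i][j] = p_{j|i}
    for i in range(n):
        row = [
            0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        cond.append([v / total for v in row])
    # Symmetrize so that the whole matrix sums to 1.
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


def student_t_affinities(Y):
    """Joint probabilities q_ij from a heavy-tailed Student-t kernel (1 d.o.f.)."""
    n = len(Y)
    w = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): the standard t-SNE objective, used here only as a stand-in
    for the Wasserstein divergence, whose form is not given in the abstract."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


# Toy data (hypothetical): two tight clusters in 3-D.
X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [3.0, 3.0, 3.0], [3.1, 3.0, 2.9]]
# One 2-D embedding that keeps the neighbors together, and one that splits them.
Y_good = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]]
Y_bad = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]]

P = gaussian_affinities(X)
Q_good = student_t_affinities(Y_good)
Q_bad = student_t_affinities(Y_bad)
kl_good = kl_divergence(P, Q_good)
kl_bad = kl_divergence(P, Q_bad)
print("divergence, neighbor-preserving embedding:", kl_good)
print("divergence, neighbor-breaking embedding:", kl_bad)
```

In this toy case the embedding that keeps high-dimensional neighbors together yields the smaller divergence. This is the sense in which reducing the divergence increases the similarity of the two distributions. A Wasserstein-based objective would replace `kl_divergence` while leaving the construction of `P` and `Q` unchanged.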

LIU Xin-peng, SUN Xiang-hong, QIN Yu-hua, ZHANG Min, GONG Hui-li. Research on t-SNE Similarity Measurement Method Based on Wasserstein Divergence[J]. Spectroscopy and Spectral Analysis, 2023, 43(12): 3806.
