Chinese Journal of Lasers, 2021, 48(3): 0311002, published online: 2021-02-02

Recognition of Food-Borne Pathogenic Bacteria by Raman Spectroscopy Based on Random Forest Algorithm
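1 School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418
2 Institute of Military Veterinary Medicine, Changchun, Jilin 130062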

Objective Food and drug safety is of great concern to society. Food-borne pathogenic bacteria are bacteria that cause food poisoning or that use food as a vector of transmission, so detecting them quickly and effectively is crucial to protecting public health. The traditional culture separation method depends on the culture medium, separation, and biochemical identification; detection generally takes five to seven days and involves a series of procedures such as pre-enrichment, selective enrichment, microscopic examination, and serological verification. Traditional methods are therefore too slow for preventing and controlling food-borne pathogenic bacteria. Raman spectroscopy, in contrast, is a nondestructive method that can rapidly and accurately identify the functional groups present in molecules. In this study, Raman spectra of 11 food-borne pathogenic bacteria samples were used to construct a recognition and classification model based on the random forest algorithm, with the aim of addressing the low classification accuracy and long detection time of traditional detection methods. The results should help protect public health by enabling rapid and effective detection of pathogens in food and drugs.

Methods All food-borne pathogenic bacteria in this study were purchased from the China Center of Industrial Culture Collection. Each sample was measured with a Raman spectrometer over a Raman shift range of 500-1600 cm⁻¹. LabSpec 6.0 software was used for spectral collection, and each sample was measured 15 times; after screening, 132 Raman spectra were retained. In the preprocessing stage, min-max normalization mapped the intensity to the range [0, 1] for comparison, and the Savitzky-Golay algorithm was used for smoothing to suppress noise and fluorescence interference. Principal component analysis (PCA) was then used to reduce the dimensionality of the high-dimensional spectral data and avoid the problems caused by excessively high dimensions. In the model evaluation stage, K-fold cross-validation was used to check whether the model balanced underfitting against overfitting and to evaluate its stability. By these criteria, the proposed Raman spectral recognition model based on the random forest algorithm effectively distinguished the different food-borne pathogenic bacteria among the collected samples.

Results and Discussions K-nearest neighbors (KNN), logistic regression, support vector machine (SVM), decision tree, and random forest models were used to classify the preprocessed Raman spectra of the food-borne pathogenic bacteria (Table 4). Under 10-fold cross-validation, the random forest model was more accurate than the traditional machine learning algorithms, and the decision tree model performed worst, with an accuracy of 82.63%. A single decision tree is a weak learner, whereas the random forest combines the votes of many trees into a strong learner (Fig. 5), so its classification ability is higher than that of a single decision tree classifier. Compared with traditional machine learning algorithms, the random forest adds two sources of randomness during model construction: random sampling and random feature selection (Table 2). Because a random forest is composed of decision trees, a higher correlation between the trees leads to a higher error rate; random sampling reduces the correlation between the trees. At each node, the feature with the best splitting ability among a small, randomly selected subset of features is used to split the node into left and right subtrees. This amplifies the effect of randomness and further improves the robustness of the model. Because the two sources of randomness strongly reduce the variance of the model, the random forest generally needs no additional pruning; it generalizes better and resists overfitting more strongly. In addition, the Savitzky-Golay filter was used for denoising in the preprocessing stage of the Raman spectral data (Fig. 3) to give the model good resistance to interference.

Conclusions Raman spectroscopy is a mature technology that is effective for detecting and classifying food-borne pathogenic bacteria. In this study, a Raman spectrometer was used to collect spectral data for 11 food-borne pathogens. The spectra were normalized, smoothed, and denoised in the preprocessing stage, which facilitated model construction and training, and a method was developed for identifying and analyzing food-borne pathogenic bacteria by Raman spectroscopy. The experimental results show that the proposed classification model, which combines PCA with the random forest algorithm, is more accurate on the Raman spectral data than the single machine learning methods conventionally used to detect food-borne pathogens. The method is also faster than manual identification of Raman spectra. However, the random forest model was prone to overfitting on sample sets with heavy noise. Future work to improve accuracy could optimize denoising in the preprocessing stage and use the random forest algorithm to optimize feature selection. Only 11 types of food-borne pathogenic bacteria were used in this study; additional samples could be introduced in later models to build a more complete Raman spectral database.

1 Introduction


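Combining Raman spectroscopy with machine learning algorithms to identify and classify the components of substances is now a common approach in spectral analysis. Using Raman spectroscopy together with computer algorithms for recognition and classification can shorten the detection cycle for food-borne pathogenic bacteria and greatly reduce the misjudgment rate of manual Raman peak identification. Zhang et al. [7] proposed a method that combines laser Raman spectroscopy with artificial bee colony support vector regression (ABC-SVR) to rapidly quantify the fatty acid content of three-component blended oil. Wu et al. [8] proposed a similarity learning method based on Raman spectroscopy and a Siamese network that can identify minerals. Žuvela et al. [9] used nature-inspired genetic algorithms to achieve real-time in vivo detection at the molecular level on a Raman diagnostic platform (clinical nasal endoscopy for nasopharyngeal carcinoma). de Souza Lins Borba et al. [10] collected Raman spectra of ink lines drawn on A4 sulfite paper with 14 commercial blue ballpoint pens of different brands and models and built a hierarchical classification model based on partial least squares discriminant analysis (PLS-DA); their work shows that Raman spectroscopy combined with computational methods is a promising, rapid, nondestructive tool for distinguishing very similar ink types in documents.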

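The random forest (RF) algorithm [11-13] is an ensemble learning method. Vigneau et al. [14] applied random forests to sensory analysis and found that the random forest model had better predictive ability than the partial least squares (PLS) regression model. Lin et al. [15] used the random forest algorithm to build a mortality prediction model for patients with acute kidney injury (AKI) in the intensive care unit (ICU) and compared its predictions with those of two other machine learning models and a customized Simplified Acute Physiology Score (SAPS) II model; the results showed that the random forest model helps ICU clinicians make timely clinical intervention decisions and is important for reducing the in-hospital mortality of AKI patients. Huang et al. [16] used the random forest algorithm to classify T-cell epitopes and non-T-cell epitopes; the results showed that a T-cell epitope prediction method combining features with random forest is effective.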

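Shi et al. [17] built a classification model for food-borne pathogenic bacteria based on the Stacking ensemble learning method and successfully separated Escherichia coli O157:H7 from the Brucella S2 strain. However, they studied only 2 classes of food-borne pathogenic bacteria, which does not meet practical needs. Building on that work, this paper increases the number of classes to 11 (132 samples); this makes training and classification more difficult but is closer to practice.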


2 Data collection and preprocessing

2.1 Collection of experimental samples

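All food-borne pathogenic bacteria samples used in this experiment were purchased from the China Center of Industrial Culture Collection (CICC); the CICC numbers and names of the 11 food-borne pathogenic bacteria are listed in Table 1. A Raman spectrometer was used to collect 132 Raman spectra from the 11 food-borne pathogenic bacteria samples over a Raman shift range of 500-1600 cm⁻¹. All strains can be looked up on the CICC website by their standard strain numbers.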

Table 1. CICC numbers and names of eleven food-borne pathogenic bacteria

Number   Latin name
10869    Yersinia enterocolitica
10870    Klebsiella pneumoniae
21482    Salmonella enterica subsp. enterica serovar Infantis
21530    Escherichia coli EHEC O157:H7
21534    Shigella flexneri
21560    Cronobacter sakazakii
21600    Staphylococcus aureus
21617    Vibrio parahaemolyticus
22933    Acinetobacter baumannii
22956    Salmonella enterica subsp. enterica serovar Typhimurium
23794    Vibrio cholerae



Fig. 1. Original Raman spectra

2.2 Data normalization




Fig. 2. Raman spectra after normalization
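The abstract states that min-max normalization maps the intensity to the range [0, 1]. The step can be sketched as below, assuming each row of a NumPy array holds one spectrum and that scaling is applied per spectrum (the source does not say whether it was per spectrum or over the whole data set). The function name `min_max_normalize` and the sample values are illustrative only.

```python
import numpy as np

def min_max_normalize(spectra: np.ndarray) -> np.ndarray:
    """Scale each row (one spectrum) linearly so its values span [0, 1]."""
    spectra = np.asarray(spectra, dtype=float)
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for flat rows
    return (spectra - lo) / span

# Two made-up spectra with very different intensity scales
raw = np.array([[100.0, 150.0, 300.0],
                [2.0, 4.0, 3.0]])
normalized = min_max_normalize(raw)
```

After this step, spectra recorded at different absolute intensities can be compared on a common scale.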

2.3 Savitzky-Golay smoothing and denoising



Fig. 3. Raman spectrum after Savitzky-Golay processing
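A Savitzky-Golay filter fits a low-order polynomial inside a sliding window and replaces each point with the fitted value. The sketch below uses `scipy.signal.savgol_filter` on a synthetic spectrum; the window length, polynomial order, peak shape, and noise level are assumptions chosen for illustration, since the source does not report the settings used.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
shift = np.linspace(500, 1600, 551)            # Raman shift axis, cm^-1
clean = np.exp(-((shift - 1000) / 30) ** 2)    # one synthetic peak
noisy = clean + rng.normal(scale=0.05, size=shift.size)

# An odd window longer than polyorder; both values are illustrative
smoothed = savgol_filter(noisy, window_length=11, polyorder=3)

# Mean squared error against the noise-free peak, before and after smoothing
err_before = float(np.mean((noisy - clean) ** 2))
err_after = float(np.mean((smoothed - clean) ** 2))
```

A wider window removes more noise but flattens narrow peaks, so in practice the window is usually kept below the width of the narrowest band of interest.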

2.4 Spectral feature dimensionality reduction

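When data were collected with the Raman spectrometer in this experiment, the Raman shift range of the spectra was 500-1600 cm⁻¹. Raman spectral data cover a wide band range and are highly redundant. Performing quantitative and qualitative analysis directly on the original high-dimensional data is likely to produce large errors. Principal component analysis can reduce data from N dimensions to M dimensions; this requires finding M vectors onto which the original data are projected such that the projection error (projection distance) is minimized. Principal component analysis can therefore be applied to the original data so that lower-dimensional, uncorrelated data replace the original high-dimensional data, and the transformed data are then used for modeling. The projection error is expressed as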
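For reference, a common textbook form of the PCA projection error is given below. The notation is assumed here and may differ from the authors' own expression: $x_i$ is the $i$-th of $n$ mean-centered samples and $\hat{x}_i$ is its projection onto the $M$ retained directions.

```latex
J = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \hat{x}_i \right\|^{2}
```

PCA chooses the $M$ directions that minimize $J$, which is equivalent to keeping the directions of largest variance.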



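Principal component analysis was applied to the 132 Raman spectra after normalization and smoothing, giving the Pareto chart shown in Fig. 4. The Pareto chart shows that retaining 9 principal components gives a cumulative feature contribution rate of 99.058%, and that each additional principal component adds less than 0.5%. The first 9 principal components are therefore used in the calculations in this paper.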
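The cumulative contribution rate that a Pareto chart displays can be computed from the explained variance ratio of each component. The sketch below uses scikit-learn's `PCA` on synthetic data; the data, the 0.99 threshold, and the rule of picking the smallest component count that reaches it are assumptions for illustration, not the authors' procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for 132 preprocessed spectra with 551 points each:
# five latent factors plus small noise, so a few components dominate
latent = rng.normal(size=(132, 5))
loadings = rng.normal(size=(5, 551))
X = latent @ loadings + 0.01 * rng.normal(size=(132, 551))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative contribution reaches the threshold
threshold = 0.99
n_components = int(np.searchsorted(cumulative, threshold) + 1)

X_reduced = PCA(n_components=n_components).fit_transform(X)
```

Plotting `pca.explained_variance_ratio_` as bars with `cumulative` as a line gives a chart of the same kind as Fig. 4.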

Fig. 4. Pareto chart of principal components

3 Experiments and discussion

3.1 Random forest algorithm



Fig. 5. Architecture of the random forest algorithm


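1) Use a Raman spectrometer to measure the samples over a Raman shift range of 500-1600 cm⁻¹, acquire the spectral data with LabSpec 6.0 software, and save the data in a CSV file;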











Table 2. Work process of decision tree algorithm

Decision tree algorithm
Input: sample X, number of samples N, number of features M
Output: decision tree model
X → for bagging  // processing X with bagging cycles
end for
while extracting ntry (ntry = N) → Xtrain do
    M → mtry (mtry ≪ M)  // random selection of mtry attributes
    mtry → the best node
    X → Xsamples  // build samples using Bootstrap
end while
for (itree = 1; itree ≤ Ntree; itree++)
    // node splitting by optimal attributes to generate decision trees
end for
end procedure
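The two sources of randomness in Table 2 (Bootstrap sampling and random selection of mtry attributes) and the choice of the best node can be sketched as follows. This is a minimal illustration on made-up data, not the authors' code: `gini`, `best_split`, the toy labels, and mtry = 3 are assumptions introduced here, and Gini impurity is used as the split criterion because Table 3 refers to the minimum Gini value.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y, feature_ids):
    """Return (feature, threshold, weighted Gini) of the best split among feature_ids."""
    best = (None, None, np.inf)
    for f in feature_ids:
        for t in np.unique(X[:, f])[:-1]:      # candidate thresholds; both sides stay non-empty
            left = y[X[:, f] <= t]
            right = y[X[:, f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (int(f), float(t), float(score))
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 9))                   # e.g. 9 principal components per spectrum
y = (X[:, 2] > 0).astype(int)                  # toy labels that depend on feature 2 only

rows = rng.integers(0, len(X), size=len(X))    # Bootstrap: draw N rows with replacement
mtry = 3
features = rng.choice(X.shape[1], size=mtry, replace=False)  # random subset of mtry attributes
split = best_split(X[rows], y[rows], features)
```

Repeating the split recursively on the left and right subsets would grow one tree; a forest repeats the whole procedure with a fresh Bootstrap sample for every tree.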



Table 3. Process of algorithm used for prediction of food-borne pathogenic bacteria

Food-borne pathogenic bacteria prediction algorithm
Input: sample X, training set Xtrain, test set Xtest
Output: K trees, prediction result r
for all i = 1 to K do
    while j ≤ N do
    end while
    while stop condition not true do
        // classification attributes are determined by the minimum Gini value
    end while
end for
for all i = 1 to K do
end for
end procedure
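The overall flow of Table 3 (grow K trees, then combine their outputs into a prediction r) can be sketched with scikit-learn's `DecisionTreeClassifier` as the base learner. The synthetic data, the value of K, the use of `max_features="sqrt"`, and the majority vote are assumptions for illustration; the loop bodies that are empty in Table 3 are not reconstructed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; class and feature counts are illustrative
X, y = make_classification(n_samples=200, n_features=9, n_informative=6,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

rng = np.random.default_rng(0)
K = 25                                          # number of trees
trees = []
for i in range(K):
    rows = rng.integers(0, len(X_train), size=len(X_train))   # Bootstrap sample
    tree = DecisionTreeClassifier(criterion="gini", max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Each tree votes; the most frequent label per test sample is the prediction r
votes = np.stack([t.predict(X_test) for t in trees])          # shape (K, n_test)
r = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
accuracy = float(np.mean(r == y_test))
```

Library implementations such as scikit-learn's `RandomForestClassifier` average class probabilities rather than counting hard votes, so their predictions can differ slightly from this sketch.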


3.2 Model construction and training


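1) Preprocess the original Raman spectral data;

2) Combine multiple decision tree models of relatively low accuracy into an ensemble;

3) Vote on the class labels output by the decision trees to determine the final class label;

4) Call the random forest library from Python, which automatically runs multithreaded on the CPU;

5) Use GridSearchCV to perform a grid search and select the optimal parameters.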
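Steps 4) and 5) can be sketched as below, assuming the random forest library is scikit-learn (suggested by the names GridSearchCV, n_estimators, and max_depth). The data are synthetic and the parameter grid is illustrative, not the grid used in the paper. In scikit-learn, parallel tree building is requested with `n_jobs=-1`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for PCA-reduced spectra
X, y = make_classification(n_samples=132, n_features=9, n_informative=6,
                           n_redundant=0, n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [10, 50],     # illustrative values, not the paper's grid
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),  # n_jobs=-1: use all CPU cores
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```

`search.cv_results_` holds the mean accuracy for every parameter combination, which is the kind of data plotted against n_estimators and max_depth in Figs. 6 and 7.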



Fig. 6. Variation of model accuracy with n_estimators

Fig. 7. Variation of model accuracy with max_depth

3.3 Model performance evaluation



Fig. 8. 10-fold cross-validation diagram


Table 4. Accuracy of each model

Model                                  Accuracy /%
PCA + KNN (K-nearest neighbors)        88.19
PCA + logistic regression              88.25
PCA + SVM (support vector machines)    83.86
PCA + decision tree                    82.63
PCA + RF (ours)                        91.36
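A comparison of the kind reported in Table 4 can be set up with 10-fold cross-validation as sketched below, assuming scikit-learn. The data are synthetic and all classifiers use default or illustrative settings, so the resulting numbers are not expected to match Table 4.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for 132 spectra from 11 classes (12 per class)
X, y = make_classification(n_samples=132, n_features=60, n_informative=20,
                           n_classes=11, n_clusters_per_class=1,
                           flip_y=0.0, random_state=0)

models = {
    "PCA + KNN": KNeighborsClassifier(),
    "PCA + logistic regression": LogisticRegression(max_iter=1000),
    "PCA + SVM": SVC(),
    "PCA + decision tree": DecisionTreeClassifier(random_state=0),
    "PCA + RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

mean_accuracy = {}
for name, clf in models.items():
    # PCA sits inside the pipeline so it is refit on each training fold only
    pipeline = make_pipeline(PCA(n_components=9), clf)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = float(scores.mean())
```

Placing PCA inside the pipeline is a design choice of this sketch that keeps test folds out of the dimensionality reduction; the source describes applying PCA to all 132 spectra before modeling.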




4 Conclusions





[1] Gao Y, Yin X B, Wang T. PCR assay for enteropathogenic bacteria and evaluation of its application value[J]. Capital Food Medicine, 2019, 26(22): 101.

[2] Pannetier C. PCR[J]. Immunology Today, 1996, 17(12): 590.

[3] Vinner L, Fomsgaard A. Inactivation of orthopoxvirus for diagnostic PCR analysis[J]. Journal of Virological Methods, 2007, 146(1/2): 401-404.

[4] Zhang M. Rapid identification of species' blood based on Raman spectroscopy[D]. Nanchang: Nanchang University, 2018.

[5] McLaughlin G, Doty K C, Lednev I K. Raman spectroscopy of blood for species identification[J]. Analytical Chemistry, 2014, 86(23): 11628-11633.

[6] Wang S, Zeng H. Real-time in vivo Raman spectroscopy and its clinical applications in early cancer detection[J]. Chinese Journal of Lasers, 2018, 45(2): 0207002.

[7] Zhang Y J, Zhang F C, Fu X H, et al. Detection of fatty acid content in mixed oil by Raman spectroscopy based on ABC-SVR algorithm[J]. Spectroscopy and Spectral Analysis, 2019, 39(7): 2147-2152.

[8] Wu C W, Shi R J, Zeng W D. Mineral Raman spectral recognition based on Siamese network[J]. Laser & Optoelectronics Progress, 2020, 57(9): 093301.

[9] Žuvela P, Lin K, Shu C, et al. Fiber-optic Raman spectroscopy with nature-inspired genetic algorithms enhances real-time in vivo detection and diagnosis of nasopharyngeal carcinoma[J]. Analytical Chemistry, 2019, 91(13): 8101-8108.

[10] de Souza Lins Borba F, Saldanha Honorato R, de Juan A. Use of Raman spectroscopy and chemometrics to distinguish blue ballpoint pen inks[J]. Forensic Science International, 2015, 249: 73-82.

[11] Fang K N, Wu J B, Zhu J P, et al. A review of technologies on random forests[J]. Statistics & Information Forum, 2011, 26(3): 32-38.

[12] Ma L. Research on optimization and improvement of random forests algorithm[D]. Guangzhou: Jinan University, 2016.

[13] Xie J F, Luo J, Xu M, et al. Study on identification of 100% cotton textile by Raman spectroscopy and random forest method[J]. China Fiber Inspection, 2014(22): 76-78.

[14] Vigneau E, Courcoux P, Symoneaux R, et al. Random forests: a machine learning methodology to highlight the volatile organic compounds involved in olfactory perception[J]. Food Quality and Preference, 2018, 68: 135-145.

[15] Lin K, Hu Y, Kong G. Predicting in-hospital mortality of patients with acute kidney injury in the ICU using random forest model[J]. International Journal of Medical Informatics, 2019, 125: 55-61.

[16] Huang J H, Xie H L, Yan J, et al. Using random forest to classify T-cell epitopes based on amino acid properties and molecular features[J]. Analytica Chimica Acta, 2013, 804: 70-75.

[17] Shi R J, Xia F Z, Zeng W D, et al. Raman spectroscopic classification of foodborne pathogenic bacteria based on PCA-stacking model[J]. Laser & Optoelectronics Progress, 2019, 56(4): 043003.

[18] Han X H, Zhang Y H, Sun F J, et al. Method for determining index weight based on principal component analysis[J]. Journal of Sichuan Ordnance, 2012, 33(10): 124-126.

[19] Li X R. Compare and application of principal component analysis, factor analysis and clustering analysis[J]. Journal of Shandong Education Institute, 2007, 22(6): 23-26.

[20] Lei L P. Curve smooth denoising based on Savitzky-Golay algorithm[J]. Computer and Information Technology, 2014, 22(5): 30-31.

[21] Zhu L L, Feng A M, Jin S Z, et al. Fluorescence suppression methods in Raman spectroscopy detection and their application analysis[J]. Laser & Optoelectronics Progress, 2018, 55(9): 090005.

[22] Hu J X, Zhang G J. K-fold cross-validation based selected ensemble classification algorithm[J]. Bulletin of Science and Technology, 2013, 29(12): 115-117.

[23] Bao Q L, Ding J L, Wang J Z. Prediction of soil moisture content by selecting spectral characteristics using random forest method[J]. Laser & Optoelectronics Progress, 2018, 55(11): 113002.

Qi Wang, Wandan Zeng, Zhiping Xia, Zhiping Li, Han Qu. Recognition of Food-Borne Pathogenic Bacteria by Raman Spectroscopy Based on Random Forest Algorithm[J]. Chinese Journal of Lasers, 2021, 48(3): 0311002.
