CN111220565B

CN111220565B - CPLS-based infrared spectrum measuring instrument calibration migration method

Info

Publication number: CN111220565B
Application number: CN202010045812.3A
Authority: CN
Inventors: 赵煜辉; 刘晓东; 李雪晶; 芦鹏程
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2022-07-29
Anticipated expiration: 2040-01-16
Also published as: CN111220565A

Abstract

The invention relates to the technical field of transfer learning within machine learning, and provides a CPLS-based calibration transfer method for infrared spectrum measuring instruments. First, a source domain dataset {X_m, Y} and a target domain dataset {X_s, Y} are collected and centered, giving the centered source domain dataset {X_{m_center}, Y_center} and target domain dataset {X_{s_center}, Y_center}. Then, based on the CPLS algorithm, principal component analysis is performed on the matrices X_{m_center} and Y_center, and on the matrix X_{s_center}. Next, the transfer matrices M_{trans_pre} and M_trans are calculated. Finally, the substance concentration variables of the measured object are predicted. The invention can remove the random noise measured by the master instrument, improve data utilization and modeling accuracy, and reduce time complexity.

Description

A CPLS-based calibration transfer method for infrared spectrum measuring instruments

Technical field

The invention relates to the technical field of transfer learning within machine learning, and in particular to a CPLS-based calibration transfer method for infrared spectrum measuring instruments.

Background

Near-infrared spectroscopy (NIRS) analysis offers simple instrument operation, fast data analysis, low cost, and no contamination of samples, and has been widely applied in many fields. In production, however, models built with near-infrared spectroscopy can fail: measurement conditions and instrument hardware performance are often unstable, which invalidates an existing calibration model.

The main purpose of transfer learning is to extract classification or regression knowledge from one or more tasks in the source domain and apply it to a task in the target domain. If the knowledge of one task is successfully transferred to another, a model for the new task can be obtained without many new samples. Using knowledge learned in one or more source domains to improve learning performance in the target domain addresses problems such as missing target-domain labels, high labeling cost, and time-consuming learning.

Calibration transfer refers to the transfer of multivariate calibration models between different measuring instruments or measurement states. The method exploits the linear relationship between spectral data from different sources to convert spectra measured on a new instrument or in a new state, so that the original model can be used directly to predict new samples. Transfer can be applied between related domains rather than only within the same domain, carrying useful information across domains. The validity of the original model can thus be maintained, or the original information can be used to speed up modeling, which avoids re-sampling or re-modeling the target domain with a large number of target-domain samples, improves the effectiveness of the model, and greatly reduces cost.

Existing calibration transfer methods suffer from low prediction accuracy and restricted applicability. Take PLS-based calibration transfer as an example. Partial least squares (PLS) is one of the algorithms most commonly used in data information extraction and process monitoring. By extracting the feature information with the greatest correlation between process variables and quality variables and partitioning the process variables, it projects the process variables and quality variables into a principal component subspace and a residual subspace, achieving data compression and extraction. However, the PLS algorithm first uses principal component analysis to extract the principal components of the process variables and of the quality variables separately, and the two sets of principal components are not related. It assumes by default that all process variables affect the quality variables, ignoring the state information of internal variables. In many cases the process data lack excitation and there are many unmeasured process and quality disturbances; when the residual information of the quality variables changes, alarms can fail, which leads to poor PLS prediction output. In practice, monitoring changes in quality-variable information is more important than monitoring the process variables.

In addition, the optimization objective of a PLS model maximizes the correlation between the principal components of the process variables and the quality variables without any constraint on the residuals, so the residual variances of the process variables and quality variables are not guaranteed to be minimal, and a large amount of process-variable and quality-variable information may remain in the residuals. Furthermore, near-infrared spectral modeling currently handles large volumes of data, and the serial partial least squares algorithm has high time complexity and long training and testing times.

Summary of the invention

In view of the problems in the prior art, the invention provides a CPLS-based calibration transfer method for infrared spectrum measuring instruments that can remove the random noise measured by the master instrument, improve data utilization and modeling accuracy, and reduce time complexity.

The technical scheme of the invention is:

A CPLS-based calibration transfer method for infrared spectrum measuring instruments, characterized by comprising the following steps:

Step 1: The master infrared spectrum measuring instrument corresponds to the source domain and the slave infrared spectrum measuring instrument corresponds to the target domain. The spectrum of each sample is collected with the master instrument and with the slave instrument, giving the master spectrum and the slave spectrum respectively. Spectral data are extracted from the master and slave spectra at intervals of a nm over the wavelength range, and the values of the substance concentration variables of each sample are collected, giving the source domain dataset {X_m, Y} and the target domain dataset {X_s, Y};

where X_m = (X_m1, X_m2, ..., X_mi, ..., X_mI)^T, X_mi = (x_mi1, x_mi2, ..., x_mij, ..., x_miJ), X_s = (X_s1, X_s2, ..., X_si, ..., X_sI)^T, X_si = (x_si1, x_si2, ..., x_sij, ..., x_siJ); x_mij and x_sij are the j-th master and slave spectral data points of the i-th sample, i = 1, 2, ..., I, j = 1, 2, ..., J; I is the total number of samples and J is the total number of extracted spectral data points; Y = (Y_1, Y_2, ..., Y_i, ..., Y_I)^T, Y_i = (y_i1, y_i2, ..., y_ik, ..., y_iK), y_ik is the value of the k-th substance concentration variable of the i-th sample, k = 1, 2, ..., K, and K is the total number of substance concentration variables;

Step 2: The source domain dataset and the target domain dataset are centered, giving the centered source domain dataset {X_{m_center}, Y_center} and target domain dataset {X_{s_center}, Y_center};

Step 3: Principal component analysis is performed on the matrices X_{m_center} and Y_center based on the CPLS algorithm:

Step 3.1: Based on the PLS algorithm, a calibration model Y_center = X_{m_center} B is built on the dataset {X_{m_center}, Y_center}, and the coefficient matrix B, the score matrix T and loading matrix P of X_{m_center}, and the score matrix U and loading matrix Q of Y_center are computed. A matrix R is introduced such that T = X_{m_center} R, and the number of latent variables l is determined;
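Step 3.1 names the outputs of the PLS fit but not the algorithm variant. The following is a minimal sketch assuming the classical NIPALS form of PLS; the function name `pls_nipals`, the weight matrix `W`, and the toy data are illustrative assumptions, not taken from the patent. It returns B, T, P, U, Q and R with T = X R, where R = W (P^T W)^-1.

```python
import numpy as np

def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS PLS on centered X (I x J) and Y (I x K).

    Returns B, T, P, U, Q, R with T = X @ R and fitted Y = X @ B.
    """
    E = np.array(X, dtype=float)   # deflated copy of X
    F = np.array(Y, dtype=float)   # deflated copy of Y
    I, J = E.shape
    K = F.shape[1]
    l = n_components
    T, U = np.zeros((I, l)), np.zeros((I, l))
    W, P, Q = np.zeros((J, l)), np.zeros((J, l)), np.zeros((K, l))
    for a in range(l):
        # Start from the Y column with the largest sum of squares.
        u = F[:, np.argmax(np.sum(F ** 2, axis=0))].copy()
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)          # X weight vector, unit length
            t = E @ w                       # X score
            q = F.T @ t / (t @ t)           # Y loading
            u_new = F @ q / (q @ q)         # Y score
            converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = E.T @ t / (t @ t)               # X loading
        E -= np.outer(t, p)                 # deflate X
        F -= np.outer(t, q)                 # deflate Y
        T[:, a], U[:, a], W[:, a], P[:, a], Q[:, a] = t, u, w, p, q
    R = W @ np.linalg.inv(P.T @ W)          # chosen so that T = X @ R
    B = R @ Q.T                             # regression coefficients
    return B, T, P, U, Q, R

# Toy data standing in for centered spectra and concentrations (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
Y = X[:, :3] @ rng.normal(size=(3, 2)) + 0.01 * rng.normal(size=(20, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
B, T, P, U, Q, R = pls_nipals(X, Y, n_components=3)
```

In this sketch the number of latent variables l is simply passed in; the embodiment selects it by cross-validation.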

Step 3.2: The predictable substance concentration variable is computed [formula image missing from this text].

Singular value decomposition of the predictable substance concentration variable gives formula (2) [formula image missing from this text],

where U_c is the left singular matrix, D_c is the diagonal matrix of singular values, and V_c is the right singular matrix, which is orthogonal; Q_c = V_c D_c^T contains the l_c non-zero singular values in descending order and the corresponding right singular vectors;

From formula (2), a further relation [formula image missing from this text] is obtained, which gives

R_c = R Q^T V_c D_c^-1 (4)
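The formulas preceding (4) survive only as images in the source. A reconstruction consistent with formula (4) and with the definition Q_c = V_c D_c^T is sketched below. It assumes that the predictable part is the PLS fit; the symbol \hat{Y} and the numbering of (1) and (3) are assumptions.

```latex
\begin{align}
\hat{Y} &= T Q^{T} = X_{m\_center}\, R\, Q^{T} \tag{1}\\
\hat{Y} &= U_c D_c V_c^{T} = U_c Q_c^{T}, \qquad Q_c = V_c D_c^{T} \tag{2}\\
U_c &= \hat{Y} V_c D_c^{-1} = X_{m\_center}\, R\, Q^{T} V_c D_c^{-1} = X_{m\_center}\, R_c \tag{3}
\end{align}
```

Reading off the last equality in (3) gives R_c = R Q^T V_c D_c^-1, which is formula (4) as stated in the text.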

Step 3.3: The unpredictable substance concentration variable is computed [formula image missing from this text].

Principal component extraction on the unpredictable substance concentration variable with l_y principal components gives formula (6) [formula image missing from this text],

where [symbol missing] is the output residual matrix of [symbol missing];

the matrix [symbol missing] is obtained from formula (6);
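The formulas of Step 3.3 are not reproduced in this text. Under the commonly published CPLS formulation they would plausibly read as follows; the symbols \tilde{Y}_c, T_y, P_y and \tilde{Y}, and the number (5), are assumptions rather than the patent's confirmed notation.

```latex
\begin{align}
\tilde{Y}_c &= Y_{center} - U_c Q_c^{T} \tag{5}\\
\tilde{Y}_c &= T_y P_y^{T} + \tilde{Y} \tag{6}
\end{align}
```

In this reading, T_y and P_y are the scores and loadings of the l_y principal components, and \tilde{Y} is the output residual matrix of \tilde{Y}_c.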

Step 3.4: By projection with respect to the space of R_c, the input variable unrelated to the substance concentration variables is obtained [formula image missing from this text],

where R_c^* = (R_c^T R_c)^-1 R_c^T;

Principal component extraction on the input variable unrelated to the substance concentration variables with l_x principal components gives formula (8) [formula image missing from this text],

where [symbol missing] is the input residual matrix of [symbol missing];

the matrix [symbol missing] is obtained from formula (8);
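The formulas of Step 3.4 are likewise missing. Using the stated definition R_c^* = (R_c^T R_c)^-1 R_c^T and the commonly published CPLS formulation, a plausible reconstruction is shown below; the symbols \tilde{X}_c, T_x, P_x and \tilde{X}, and the number (7), are assumptions.

```latex
\begin{align}
\tilde{X}_c &= X_{m\_center} - U_c R_c^{*}, \qquad R_c^{*} = (R_c^{T} R_c)^{-1} R_c^{T} \tag{7}\\
\tilde{X}_c &= T_x P_x^{T} + \tilde{X} \tag{8}
\end{align}
```

In this reading, T_x and P_x are the scores and loadings of the l_x principal components, and \tilde{X} is the input residual matrix of \tilde{X}_c.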

Step 3.5: From Steps 3.1 to 3.4, the principal components of X_{m_center} and Y_center extracted by the PLS algorithm are X_{m_pre} = T P^T and Y_pre = U Q^T respectively, and the residuals of X_{m_center} and Y_center are X_{m_res_c} = X_{m_center} - X_{m_pre} and Y_{res_c} = Y_center - Y_pre respectively, that is, [formula image missing from this text];

Step 4: Principal component analysis is performed on the matrix X_{s_center} by the same method as in Step 3, giving the residual X_{s_res_c} of X_{s_center};

Step 5: The score of the source domain dataset after the PLS algorithm extracts principal components from the master spectrum is computed as T_{m_pre} = X_{m_center} R, and the score of the target domain dataset after the PLS algorithm extracts principal components from the slave spectrum is computed as T_{s_pre} = X_{s_center} R; the transfer matrix M_{trans_pre} is computed from T_{m_pre} and T_{s_pre} by least squares. The score of the source domain dataset after principal components are extracted from the residual of the master spectrum is computed as T_m = X_{m_res_c} P, and the score of the target domain dataset after principal components are extracted from the residual of the slave spectrum is computed as T_s = X_{s_res_c} P; the transfer matrix M_trans is computed from T_m and T_s by least squares;

Step 6: The substance concentration variables of the measured object are predicted:

Step 6.1: The spectrum of the measured object is collected with the slave infrared spectrum measuring instrument, and spectral data are extracted by the same method as in Step 1, giving the matrix X_{s_test} formed by the J slave spectral data points of the measured object;

Step 6.2: Principal component analysis is performed on X_{s_test} based on the CPLS algorithm, giving the residual X_{s_res_c_test} of X_{s_test};

Step 6.3: The matrix of predicted substance concentration variables of the measured object is Y_{test_predict} = (X_{s_test} * R * M_{trans_pre} * P^T + X_{s_res_c_test} * R * M_trans * P^T) * B.
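The formula of Step 6.3 is a chain of matrix products, so its shapes can be checked directly. The sketch below is illustrative only: the function name `predict_concentration` and all matrices in the demo are random stand-ins (assumptions), with shapes chosen to match the embodiment (J = 700 points, K = 4 concentration variables, 16 test samples) and an arbitrary l = 3.

```python
import numpy as np

def predict_concentration(X_s_test, X_s_res_c_test, R, P, M_trans_pre, M_trans, B):
    """Evaluate the Step 6.3 formula.

    Shapes: X_* (n x J), R and P (J x l), M_* (l x l), B (J x K) -> result (n x K).
    """
    mapped_main = X_s_test @ R @ M_trans_pre @ P.T       # PLS part mapped to the master domain
    mapped_res = X_s_res_c_test @ R @ M_trans @ P.T      # residual part mapped to the master domain
    return (mapped_main + mapped_res) @ B

# Random stand-ins; in the method these come from Steps 3 to 6.2.
rng = np.random.default_rng(0)
n, J, l, K = 16, 700, 3, 4
X_s_test = rng.normal(size=(n, J))
X_s_res_c_test = 0.1 * rng.normal(size=(n, J))
R, P = rng.normal(size=(J, l)), rng.normal(size=(J, l))
M_trans_pre, M_trans = rng.normal(size=(l, l)), rng.normal(size=(l, l))
B = rng.normal(size=(J, K))

Y_test_predict = predict_concentration(
    X_s_test, X_s_res_c_test, R, P, M_trans_pre, M_trans, B
)
```

The sketch assumes the test spectra have already been centered in the same way as the training data; the text does not state how the test matrix is centered.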

Further, in Step 1, the samples are grain, the spectral data are absorbance values, and the substance concentration variables include the moisture content, oil content, protein content, and starch content of the grain.

The beneficial effects of the invention are:

Based on the CPLS algorithm, the invention performs one principal component extraction on the source domain dataset and the target domain dataset, then performs a second principal component extraction on the residuals, and computes the transfer matrices from the two extractions. This removes the random noise measured by the master instrument, improves data utilization and modeling accuracy, reduces time complexity, and speeds up training and testing.

Brief description of the drawings

FIG. 1 is a flow chart of the CPLS-based calibration transfer method for infrared spectrum measuring instruments of the invention.

FIG. 2 is a flow chart of the CPLS-based principal component analysis of the source domain dataset in the method of the invention.

FIG. 3 is a flow chart of computing the transfer matrices in the method of the invention.

FIG. 4 is a flow chart of predicting the substance concentration variables of the measured object in the method of the invention.

FIG. 5 is a plot of the cross-validation error for oil content on the corn dataset against the number of principal components in the specific embodiment.

FIG. 6 shows the fitting result for mp6spec-mp5spec in the specific embodiment.

FIG. 7 shows the fitting result for m5spec-mp5spec in the specific embodiment.

Detailed description

The invention is further described below with reference to the drawings and a specific embodiment.

The invention proposes a CPLS-based calibration transfer method for infrared spectrum measuring instruments. In data processing, PLS performs only one principal component extraction on X and Y, but the residuals of X and Y usually also contain useful information, and insufficient extraction leads to a large model error. The concurrent partial least squares (Concurrent PLS, CPLS) algorithm is therefore proposed: on the basis of PLS, principal components are extracted once more from the residuals, so that the resulting model has a smaller error and its linear relationship is closer to the real situation. Because collecting samples is in practice expensive and time-consuming, transfer learning is added on top of CPLS: a mapping is established between the standard sets of the source domain and the target domain, and the test set of the target domain is then predicted.

The CPLS algorithm used in the invention improves on the PLS algorithm by performing principal component analysis separately on the process-variable information that is unrelated to the quality variables and on the quality information that cannot be predicted. The data are divided into five subspaces: the subspace of information relating process variables to quality variables (the correlated principal component subspace), the process variable principal component subspace, the process variable residual subspace, the quality variable principal component subspace, and the quality variable residual subspace.

The CPLS model achieves three goals: (1) scores directly related to the predictable variation of the output are extracted from the standard PLS projection, and these score vectors form the covariation subspace (CVS); (2) the unpredicted output variation is further projected onto the output principal subspace (OPS) and the output residual subspace (ORS), so that abnormal changes in these subspaces can be monitored; (3) the input variation unrelated to the predicted output is further projected onto the input principal subspace (IPS) and the input residual subspace (IRS), so that abnormal changes in these subspaces can be monitored.

The CPLS algorithm divides the process variable data into two main parts: information related to the quality variables and information unrelated to them. The quality variable data are likewise divided into two main parts: information that can be predicted from the process variables and information that cannot. The CPLS-based monitoring method therefore provides a complete monitoring framework that covers the process variables, the quality variables, and the remaining parts of the information.

As shown in FIG. 1, the CPLS-based calibration transfer method for infrared spectrum measuring instruments of the invention comprises the following steps:

In the datasets {X_m, Y} and {X_s, Y} of Step 1, X_m = (X_m1, X_m2, ..., X_mi, ..., X_mI)^T, X_mi = (x_mi1, x_mi2, ..., x_mij, ..., x_miJ), X_s = (X_s1, X_s2, ..., X_si, ..., X_sI)^T, X_si = (x_si1, x_si2, ..., x_sij, ..., x_siJ); x_mij and x_sij are the j-th master and slave spectral data points of the i-th sample, i = 1, 2, ..., I, j = 1, 2, ..., J; I is the total number of samples and J is the total number of extracted spectral data points; Y = (Y_1, Y_2, ..., Y_i, ..., Y_I)^T, Y_i = (y_i1, y_i2, ..., y_ik, ..., y_iK), y_ik is the value of the k-th substance concentration variable of the i-th sample, k = 1, 2, ..., K, and K is the total number of substance concentration variables.

In this embodiment, the samples are corn, a cereal; the spectral data are absorbance values; and the substance concentration variables are the moisture, oil, protein, and starch content of the corn. The corn dataset consists of the data measured by three spectrometers on the same I = 80 samples. The infrared spectrometers m5, mp5, and mp6 measure the spectrum every a = 2 nm over the 1100-2498 nm wavelength range, giving J = 700 attributes. In the first experiment the master-slave spectrum pair is m5spec-mp6spec: the spectrum measured by m5 is the master spectrum, and its spectral dataset is the initial source domain dataset; the spectrum measured by mp6, which differs more from that of m5, is chosen as the slave spectrum, and its spectral dataset is the initial target domain dataset. Five more experiments were then carried out in turn on mp5spec-mp6spec, mp6spec-mp5spec, m5spec-mp5spec, mp5spec-m5spec, and mp6spec-m5spec.
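The figure J = 700 follows from the sampling grid: (2498 - 1100) / 2 + 1 = 700 points. The short check below also enumerates the six ordered master-slave pairs that can be formed from three instruments; the variable names are illustrative.

```python
import numpy as np
from itertools import permutations

a_nm = 2                                              # sampling interval in nm
wavelengths_nm = np.arange(1100, 2498 + a_nm, a_nm)   # 1100, 1102, ..., 2498
J = wavelengths_nm.size                               # number of spectral points

instruments = ["m5", "mp5", "mp6"]
# Ordered (master, slave) pairs, written in the "<master>spec-<slave>spec" form used in the text.
pairs = [f"{m}spec-{s}spec" for m, s in permutations(instruments, 2)]
```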

In this embodiment, the Kennard-Stone (KS) algorithm is used to split the corn dataset. First, 20% of the data in the initial source domain dataset and in the initial target domain dataset are extracted as test samples, 16 samples each; the test samples of the target domain are used to test the calibration transfer model. Then the remaining 80% of the data in each initial dataset are taken as training samples, 64 samples each. The training samples of the source domain are used to build a reference model that predicts the transferred samples of the target domain, and the training samples of the target domain are used to build a standard model of the target domain against which the performance of other transfer models can be compared. Next, the KS algorithm extracts 20% of the data from the training samples of the source domain and of the target domain to form the standard sample sets of the source and target domains. These serve as the source domain dataset {X_m, Y} and the target domain dataset {X_s, Y} used in the method of the invention to establish the transfer relationship between source domain samples and target domain samples.
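The split described above can be sketched with a plain implementation of the Kennard-Stone selection rule (start from the two most distant samples, then repeatedly add the sample whose minimum distance to the selected set is largest). Several details are assumptions: the function name `kennard_stone`, the use of Euclidean distance on the spectra, treating the KS-selected 16 samples as the test set, and rounding 20% of 64 to 13 standard samples, which the text does not specify.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select (>= 2) samples chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.maximum(D, 0.0, out=D)                         # clip tiny negative round-off
    i, j = np.unravel_index(np.argmax(D), D.shape)    # most distant pair
    selected = [int(i), int(j)]
    min_dist = np.minimum(D[i], D[j])                 # distance to nearest selected sample
    min_dist[selected] = -np.inf
    while len(selected) < n_select:
        k = int(np.argmax(min_dist))                  # farthest from the selected set
        selected.append(k)
        min_dist = np.minimum(min_dist, D[k])
        min_dist[selected] = -np.inf
    return selected

# Random stand-in for the 80 x 700 spectra of one instrument (assumption).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(80, 700))

test_idx = kennard_stone(X_all, 16)                   # 20% of 80
test_set = set(test_idx)
train_idx = [i for i in range(80) if i not in test_set]   # remaining 64
n_std = round(0.2 * len(train_idx))                   # rounding rule is an assumption
std_idx = [train_idx[i] for i in kennard_stone(X_all[train_idx], n_std)]
```

Because the same 80 physical samples are measured on every instrument, the indices selected on one instrument's spectra would presumably be reused for the other instrument, so that the standard sets contain matched samples.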

Step 2: The source domain dataset and the target domain dataset are centered: the mean of each column is computed and subtracted from the original data in that column, giving the centered source domain dataset {X_{m_center}, Y_center} and target domain dataset {X_{s_center}, Y_center}. This effectively avoids bias caused by large differences in magnitude.
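Column centering as described in Step 2 takes a few lines of NumPy. In the sketch below the helper name `center_columns` and the random matrices are illustrative assumptions; the shapes follow the embodiment (13 standard samples is itself an assumed rounding of 20% of 64).

```python
import numpy as np

def center_columns(X):
    """Subtract each column's mean; return the centered matrix and the means."""
    X = np.asarray(X, dtype=float)
    col_mean = X.mean(axis=0)
    return X - col_mean, col_mean

# Toy stand-ins for master spectra, slave spectra and concentrations.
rng = np.random.default_rng(0)
X_m = rng.normal(loc=0.5, size=(13, 700))
X_s = X_m + 0.05 * rng.normal(size=(13, 700))
Y = rng.normal(loc=10.0, size=(13, 4))

X_m_center, x_m_mean = center_columns(X_m)
X_s_center, x_s_mean = center_columns(X_s)
Y_center, y_mean = center_columns(Y)
```

Returning the column means is a design choice of this sketch: they are needed later if predictions on centered data are to be shifted back to the original concentration scale.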

Step 3: As shown in FIG. 2, principal component analysis is performed on the matrices X_{m_center} and Y_center based on the CPLS algorithm:

Singular value decomposition (SVD) of the predictable substance concentration variable gives formula (2) [formula image missing from this text].

From formula (2), a further relation [formula image missing from this text] is obtained, which gives

R_c = R Q^T V_c D_c^-1 (4)

Principal component extraction (PCA) on the unpredictable substance concentration variable with l_y principal components gives formula (6) [formula image missing from this text],

where [symbol missing] is the output residual matrix of [symbol missing];

the matrix [symbol missing] is obtained from formula (6).

In the corresponding decomposition of the input variable [formula image missing from this text], [symbol missing] is the input residual matrix of [symbol missing];

the matrix [symbol missing] is obtained from formula (8).

From the CPLS algorithm flow, X_{m_center} and Y_center are each divided into three parts: the principal components extracted by the PLS algorithm, the principal components extracted from the residuals, and the unpredictable error. Compared with the PLS algorithm, the advantage of CPLS is the additional extraction of principal components from the residuals, which improves data utilization.
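The three-part split can be illustrated numerically. The sketch below follows the commonly published CPLS decomposition and uses formula (4) for R_c; the intermediate formulas are missing from this text, so the names `Y_unpred`, `X_unrel`, `T_y`, `P_y`, `T_x`, `P_x` and the helper functions are assumptions. R and Q would normally come from the PLS fit of Step 3.1; random stand-ins are used here so that the block runs on its own.

```python
import numpy as np

def pca_split(M, n_components):
    """Return scores T, loadings P and residual so that M = T @ P.T + residual."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    P = Vt[:n_components].T
    T = M @ P
    return T, P, M - T @ P.T

def cpls_decompose(X, Y, R, Q, l_y, l_x, rel_tol=1e-10):
    """Split centered X and Y into predictable, principal and residual parts."""
    Y_hat = X @ R @ Q.T                                   # predictable output (assumed form)
    U_full, d, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    l_c = int(np.sum(d > rel_tol * d[0]))                 # number of non-zero singular values
    U_c, V_c = U_full[:, :l_c], Vt[:l_c].T
    D_c = np.diag(d[:l_c])
    Q_c = V_c @ D_c.T
    R_c = R @ Q.T @ V_c @ np.linalg.inv(D_c)              # formula (4)
    Y_unpred = Y - U_c @ Q_c.T                            # unpredictable output
    T_y, P_y, Y_res = pca_split(Y_unpred, l_y)
    R_c_star = np.linalg.inv(R_c.T @ R_c) @ R_c.T         # R_c^* as defined in Step 3.4
    X_unrel = X - U_c @ R_c_star                          # input unrelated to the output
    T_x, P_x, X_res = pca_split(X_unrel, l_x)
    return dict(U_c=U_c, Q_c=Q_c, R_c=R_c, R_c_star=R_c_star,
                T_y=T_y, P_y=P_y, Y_res=Y_res, T_x=T_x, P_x=P_x, X_res=X_res)

# Random stand-ins for centered data and for the PLS matrices R and Q.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 30)); X -= X.mean(axis=0)
Y = rng.normal(size=(20, 3)); Y -= Y.mean(axis=0)
R, Q = rng.normal(size=(30, 2)), rng.normal(size=(3, 2))
parts = cpls_decompose(X, Y, R, Q, l_y=1, l_x=2)
```

With these names, Y equals `U_c @ Q_c.T + T_y @ P_y.T + Y_res` and X equals `U_c @ R_c_star + T_x @ P_x.T + X_res`, which mirrors the three parts named in the paragraph above.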

Step 4: Principal component analysis is performed on the matrix X_{s_center} by the same method as in Step 3, giving the residual X_{s_res_c} of X_{s_center}.

In this embodiment, the optimal number of principal components for the PLS algorithm is selected as follows. The number of principal components is chosen by 10-fold cross-validation. Taking oil as an example, FIG. 5 shows how the cross-validation error of the oil content model on the target domain training set of the corn dataset changes with the number of principal components. As FIG. 5 shows, the cross-validation error for oil on the corn set reaches its global minimum at 12 principal components, so the optimal number of principal components for oil is set to 12. The optimal number for the other three components is selected in the same way.

Step 5: As shown in FIG. 3, least squares is used to build the transfer matrices that map the latent structure of the target domain onto the latent structure of the source domain. The score of the source domain dataset after the PLS algorithm extracts principal components from the master spectrum is computed as T_{m_pre} = X_{m_center} R, and the score of the target domain dataset after the PLS algorithm extracts principal components from the slave spectrum is computed as T_{s_pre} = X_{s_center} R; the transfer matrix M_{trans_pre} is computed from T_{m_pre} and T_{s_pre} by least squares. The score of the source domain dataset after principal components are extracted from the residual of the master spectrum is computed as T_m = X_{m_res_c} P, and the score of the target domain dataset after principal components are extracted from the residual of the slave spectrum is computed as T_s = X_{s_res_c} P; the transfer matrix M_trans is computed from T_m and T_s by least squares.
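Since the transfer matrices map target-domain (slave) scores onto source-domain (master) scores, one natural reading is the least-squares problem min ||T_s M - T_m||. The sketch below takes that reading as an assumption; the function name `fit_transfer_matrix` and the random data, including the stand-in for R, are illustrative.

```python
import numpy as np

def fit_transfer_matrix(T_slave, T_master):
    """Least-squares M minimizing ||T_slave @ M - T_master|| (Frobenius norm)."""
    M, *_ = np.linalg.lstsq(T_slave, T_master, rcond=None)
    return M

# Random stand-ins for the centered standard sets and for the PLS matrix R.
rng = np.random.default_rng(0)
X_m_center = rng.normal(size=(13, 700))
X_s_center = 0.9 * X_m_center + 0.02 * rng.normal(size=(13, 700))
R = rng.normal(size=(700, 3))

T_m_pre = X_m_center @ R          # master scores
T_s_pre = X_s_center @ R          # slave scores
M_trans_pre = fit_transfer_matrix(T_s_pre, T_m_pre)
```

The same helper would apply to the residual scores, that is, `fit_transfer_matrix(T_s, T_m)` with T_m = X_{m_res_c} P and T_s = X_{s_res_c} P, to obtain M_trans.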

Step 6: As shown in FIG. 4, the substance concentration variables of the measured object are predicted:

In this embodiment, the model is used to predict the data. The prediction errors (RMSEP) for the different master-slave instrument combinations on the corn dataset are shown in Table 1 below:

Table 1

Analysis of Table 1 shows that, overall, the method of the invention generally performs better between the spectra mp5spec and mp6spec than for the other two groups. This is because mp5spec and mp6spec are relatively similar to each other, while both differ considerably from m5spec; transfer learning between these two is therefore more meaningful and the resulting errors are smaller. With mp6spec as the master spectrum and mp5spec as the slave spectrum, the errors for moisture, oil, protein, and starch are essentially the smallest of the six experiments, whereas transfer between m5spec and mp5spec or mp6spec gives the largest errors of the six.
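The error measure in Table 1 is RMSEP. The text does not spell out its formula; assuming the usual definition (root mean square error of prediction over the test samples, computed separately for each concentration variable), it can be sketched as follows. The function name and the toy numbers are illustrative and are not values from the tables.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, one value per column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# Two test samples, two concentration variables (made-up numbers).
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 2.0], [3.0, 6.0]])
errors = rmsep(y_true, y_pred)
```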

FIG. 6 and FIG. 7 show the fitting results for mp6spec-mp5spec and m5spec-mp5spec in this embodiment. Comparing the two figures makes the difference in fit quality clear. The spectra mp6spec and mp5spec are highly similar and fit well. In contrast with transfer learning between m5spec and mp5spec, most points for the former pair fall on or near the fitted line, while all points for the latter pair fall below it. Transfer learning therefore works markedly better for the former, and transfer between the latter two spectra is not worthwhile, because the prediction is poor.

Because the transfer between spectra mp6spec and mp5spec performs best, this pair is used for the comparison with other algorithms: Multiplicative Scatter/Signal Correction (MSC), Canonical Correlation Analysis (CCA), Slope and Bias Correction (SBC) and Piecewise Direct Standardization (PDS). Table 2 gives the RMSEP of mp6spec-m5spec on the corn data set under each algorithm. Table 2 shows that, overall, the CPLS-based infrared spectrum measuring instrument calibration migration method of the invention transfers well: its predictions of all four components are far better than those of MSC, CCA and PDS; compared with SBC, its predictions of moisture and oil are better, while its predictions of protein and starch are about the same.

Table 2
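The RMSEP used for these comparisons is the root mean square error of prediction, computed per concentration variable. A minimal sketch follows; the function name `rmsep` and the small arrays are hypothetical illustrations, not values from the patent's tables.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, one value per column (concentration variable)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# Hypothetical reference and predicted values: 3 samples x 2 concentration variables
y_ref = np.array([[10.0, 3.0], [11.0, 3.5], [9.0, 4.0]])
y_hat = np.array([[10.5, 3.0], [10.5, 3.5], [9.5, 4.0]])
errors = rmsep(y_ref, y_hat)
```

Computing one value per column matches the way the text reports a separate error for moisture, oil, protein and starch.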

In summary, the six experiments on the corn data set, together with the comparison against the MSC, CCA, SBC and PDS algorithms, show that the prediction performance of the CPLS algorithm of the invention combined with transfer learning is close to that of SBC and far better than that of MSC, CCA and PDS. The invention thus removes the random noise in the master instrument's measurements and improves data utilization and modeling accuracy.
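The score-space transfer evaluated in these experiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the data are synthetic, the helper names `center` and `pls_fit` are hypothetical, a plain PLS with deflation stands in for the full CPLS decomposition, the direction of the transfer matrix (slave scores mapped onto master scores) is inferred from the prediction formula of the method, and only the principal-component term of that formula is computed (the residual term with `M_trans` is omitted).

```python
import numpy as np

def center(A):
    """Column-wise mean centering."""
    return A - A.mean(axis=0)

def pls_fit(X, Y, n_comp):
    """Plain PLS with deflation; returns R (so that T = X R), loadings P and Q, and B = R Q^T."""
    Xk, Yk = X.copy(), Y.copy()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = np.linalg.svd(Xk.T @ Yk, full_matrices=False)[0][:, 0]  # weight vector
        t = Xk @ w                                                  # score vector
        p = Xk.T @ t / (t @ t)                                      # X loading
        q = Yk.T @ t / (t @ t)                                      # Y loading
        Xk = Xk - np.outer(t, p)                                    # deflate X
        Yk = Yk - np.outer(t, q)                                    # deflate Y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = (np.column_stack(m) for m in (W, P, Q))
    R = W @ np.linalg.inv(P.T @ W)
    return R, P, Q, R @ Q.T

# Synthetic stand-in data (hypothetical sizes and instrument difference)
rng = np.random.default_rng(0)
I, J, K, l = 40, 60, 4, 3                    # samples, wavelengths, components, latent variables
S = rng.normal(size=(I, l))                  # latent factors shared by spectra and concentrations
X_m = S @ rng.normal(size=(l, J)) + 0.01 * rng.normal(size=(I, J))   # master spectra
Y = S @ rng.normal(size=(l, K)) + 0.01 * rng.normal(size=(I, K))     # concentrations
X_s = 1.1 * X_m + 0.02 * rng.normal(size=(I, J))                     # slave spectra: gain change + noise

X_m_c, X_s_c, Y_c = center(X_m), center(X_s), center(Y)
R, P, Q, B = pls_fit(X_m_c, Y_c, l)

T_m_pre = X_m_c @ R                          # master scores
T_s_pre = X_s_c @ R                          # slave scores
# Least-squares transfer matrix: T_s_pre @ M_trans_pre ~= T_m_pre
M_trans_pre, *_ = np.linalg.lstsq(T_s_pre, T_m_pre, rcond=None)

# Principal-component term of the prediction formula (residual term omitted)
Y_pred_c = X_s_c @ R @ M_trans_pre @ P.T @ B
```

Because `P.T @ R` is the identity for this PLS variant, the product `R @ M_trans_pre @ P.T @ B` reduces to `R @ M_trans_pre @ Q.T`: the slave scores are mapped onto master scores and then passed through the master calibration.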

Obviously, the embodiments described above are only some, not all, of the embodiments of the invention. They serve only to explain the invention and do not limit its scope of protection. All other embodiments obtained by those skilled in the art on the basis of the above embodiments without creative effort, that is, all modifications, equivalent replacements and improvements made within the spirit and principles of this application, fall within the scope of protection claimed by the invention.

Claims

1. A CPLS-based infrared spectrum measuring instrument calibration migration method, characterized by comprising the following steps:

step 1: letting the infrared spectrum measurement master instrument correspond to the source domain and the infrared spectrum measurement slave instrument correspond to the target domain; collecting the spectrum of each sample with the master instrument and with the slave instrument to obtain a master spectrum and a slave spectrum, respectively; extracting spectral data from the master spectrum and the slave spectrum at intervals of $a$ nm within a wavelength range; and collecting the substance concentration variable values of each sample, to obtain a source domain data set $\{X_m, Y\}$ and a target domain data set $\{X_s, Y\}$;

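wherein $X_m = (X_{m1}, X_{m2}, \ldots, X_{mi}, \ldots, X_{mI})^{T}$, $X_{mi} = (x_{mi1}, x_{mi2}, \ldots, x_{mij}, \ldots, x_{miJ})$, $X_s = (X_{s1}, X_{s2}, \ldots, X_{si}, \ldots, X_{sI})^{T}$, $X_{si} = (x_{si1}, x_{si2}, \ldots, x_{sij}, \ldots, x_{siJ})$; $x_{mij}$ and $x_{sij}$ are the $j$-th spectral data point of the $i$-th sample in the master spectrum and in the slave spectrum, respectively, $i = 1, 2, \ldots, I$, $j = 1, 2, \ldots, J$, where $I$ is the total number of samples and $J$ is the total number of extracted spectral data points; $Y = (Y_1, Y_2, \ldots, Y_i, \ldots, Y_I)^{T}$, $Y_i = (y_{i1}, y_{i2}, \ldots, y_{ik}, \ldots, y_{iK})$, where $y_{ik}$ is the value of the $k$-th substance concentration variable of the $i$-th sample, $k = 1, 2, \ldots, K$, and $K$ is the total number of substance concentration variables;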

step 2: centering the source domain data set and the target domain data set to obtain a centered source domain data set $\{X_{m\_center}, Y_{center}\}$ and a centered target domain data set $\{X_{s\_center}, Y_{center}\}$;

step 3: performing principal component analysis on the matrices $X_{m\_center}$ and $Y_{center}$ based on the CPLS algorithm:

step 3.1: based on the PLS algorithm, establishing the calibration model $Y_{center} = X_{m\_center} B$ for the data set $\{X_{m\_center}, Y_{center}\}$, and calculating the coefficient matrix $B$, the score matrix $T$ of $X_{m\_center}$, the loading matrix $P$ of $X_{m\_center}$, the score matrix $U$ of $Y_{center}$ and the loading matrix $Q$ of $Y_{center}$; introducing a matrix $R$ such that $T = X_{m\_center} R$; and determining the number $l$ of latent variables;

step 3.2: calculating a predictable substance concentration variable of

Performing singular value decomposition on predictable substance concentration variables to obtain

wherein $U_c$ is the left singular matrix, $D_c$ is the diagonal matrix of singular values, $V_c$ is the right singular matrix, and $V_c$ is an orthogonal matrix; $Q_c = V_c D_c^{T}$ contains the $l_c$ non-zero singular values in descending order and the corresponding right singular vectors;

obtained by the formula (2)

To obtain

$R_c = R Q^{T} V_c D_c^{-1}$ (4)

Step 3.3: calculating an unpredictable substance concentration variable as

performing principal component extraction on the unpredictable substance concentration variable, with $l_y$ principal components, to obtain

Wherein,

is composed of

The output residual matrix of (3);

The matrix is obtained by equation (6)

Step 3.4: by spatially R _c Projection of an input variable independent of the material concentration variable as

wherein $R_c^{*} = (R_c^{T} R_c)^{-1} R_c^{T}$;

performing principal component extraction on the input variable unrelated to the substance concentration variable, with $l_x$ principal components, to obtain

Wherein,

is composed of

The input residual matrix of (3);

obtaining a matrix by equation (8)

step 3.5: from steps 3.1 to 3.4, the principal components of $X_{m\_center}$ and $Y_{center}$ extracted by the PLS algorithm are $X_{m\_pre} = T P^{T}$ and $Y_{pre} = U Q^{T}$, respectively, and the residuals of $X_{m\_center}$ and $Y_{center}$ are $X_{m\_res\_c} = X_{m\_center} - X_{m\_pre}$ and $Y_{res\_c} = Y_{center} - Y_{pre}$, respectively, thereby obtaining

step 4: performing principal component analysis on the matrix $X_{s\_center}$ by the same method as in step 3, to obtain the residual $X_{s\_res\_c}$ of $X_{s\_center}$;

step 5: calculating the score $T_{m\_pre} = X_{m\_center} R$ of the source domain data set after principal components are extracted from the master spectrum by the PLS algorithm, and the score $T_{s\_pre} = X_{s\_center} R$ of the target domain data set after principal components are extracted from the slave spectrum by the PLS algorithm, and calculating a transfer matrix $M_{trans\_pre}$ from $T_{m\_pre}$ and $T_{s\_pre}$ by the least squares method; calculating the score $T_m = X_{m\_res\_c} P$ of the source domain data set after principal components are extracted from the residual of the master spectrum, and the score $T_s = X_{s\_res\_c} P$ of the target domain data set after principal components are extracted from the residual of the slave spectrum, and calculating a transfer matrix $M_{trans}$ from $T_m$ and $T_s$ by the least squares method;

Step 6: predicting the substance concentration variable of the measured object:

step 6.1: collecting the spectrum of the measured object with the infrared spectrum measurement slave instrument, and extracting spectral data by the same method as in step 1, to obtain a matrix $X_{s\_test}$ formed by the $J$ spectral data points of the measured object;

step 6.2: performing principal component analysis on $X_{s\_test}$ based on the CPLS algorithm, to obtain the residual $X_{s\_res\_c\_test}$ of $X_{s\_test}$;

step 6.3: predicting the matrix formed by the substance concentration variables of the measured object as $Y_{test\_predict} = (X_{s\_test} R M_{trans\_pre} P^{T} + X_{s\_res\_c\_test} R M_{trans} P^{T}) B$.

2. The CPLS-based infrared spectrum measuring instrument calibration migration method according to claim 1, wherein in step 1 the sample is grain, the spectral data is absorbance, and the substance concentration variables comprise the moisture content, oil content, protein content and starch content of the grain.
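For orientation, the decomposition in step 3 of claim 1 can be summarized in the standard form of concurrent projection to latent structures. This is a sketch that assumes CPLS in the claims denotes that algorithm, whose standard form agrees with the definitions of $Q_c$, $R_c$ and $R_c^{*}$ given in claim 1. The hatted and tilded symbols ($\hat{Y}$, $\tilde{Y}_c$, $\tilde{X}_c$, $\tilde{Y}$, $\tilde{X}$) and the score and loading names $T_y$, $P_y$, $T_x$, $P_x$ are notation assumed here, not taken from the claims, and no equation numbers are implied.

```latex
\begin{aligned}
\hat{Y} &= T Q^{T} = U_c D_c V_c^{T} = U_c Q_c^{T}, & Q_c &= V_c D_c^{T},\\
U_c &= \hat{Y} V_c D_c^{-1} = X_{m\_center} R\, Q^{T} V_c D_c^{-1} = X_{m\_center} R_c, & R_c &= R Q^{T} V_c D_c^{-1},\\
\tilde{Y}_c &= Y_{center} - U_c Q_c^{T} = T_y P_y^{T} + \tilde{Y},\\
\tilde{X}_c &= X_{m\_center} - U_c R_c^{*} = T_x P_x^{T} + \tilde{X}, & R_c^{*} &= (R_c^{T} R_c)^{-1} R_c^{T}.
\end{aligned}
```

Here $\hat{Y}$ is the part of the output predictable from the input, $\tilde{Y}_c$ the unpredictable part (decomposed with $l_y$ principal components and residual $\tilde{Y}$), and $\tilde{X}_c$ the part of the input unrelated to the output (decomposed with $l_x$ principal components and residual $\tilde{X}$).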