CN107818329A

CN107818329A - A mass spectrometry data analysis method

Info

Publication number: CN107818329A
Application number: CN201710674793.9A
Authority: CN
Inventors: 王乾; 胡畅
Original assignee: Shanghai Jiao Tong University
Current assignee: Yinapu Zhejiang Biotechnology Co ltd
Priority date: 2017-08-09
Filing date: 2017-08-09
Publication date: 2018-03-20
Anticipated expiration: 2037-08-09
Also published as: CN107818329B

Abstract

The invention provides a mass spectrum data analysis method, which comprises a sample data collection step, a sample data preprocessing step, a data model construction and cross-validation step, a data model optimization step, and a sample group judgment step.

Description

A method for mass spectrometry data analysis

Technical field

The invention relates to the field of machine learning applications, and in particular to a mass spectrometry data analysis method.

Background

Machine learning (ML) is an interdisciplinary subject that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and is applied in data mining, computer vision, natural language processing, biometric identification, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing and many other fields. A machine learning algorithm automatically analyzes known data to obtain regularities and uses those regularities to make predictions on unknown data.

Mass spectrometry data are obtained by using a dedicated instrument to ionize a sample, generating charged ions with different mass-to-charge ratios, and then applying an external electric field so that ions with different mass-to-charge ratios are separated in space or in time. After being separated by the mass analyzer, the ions of different mass-to-charge ratios are detected and recorded, and a mass spectrum is generated by computer processing.

In biology, chemistry and medicine, body fluid samples often need to be classified according to their composition. Technicians generally analyze each sample individually and compare the results one by one. The advantage of this approach is that the sample composition is clearly known and the classification is accurate; the disadvantage is that, when there are many kinds of body fluid samples to classify, it consumes a great deal of time and resources, and labor costs are high. How to infer the class of a new body fluid sample from body fluid samples of known classes has long been an important research topic.

Taking medicine as an example, the body fluids of patients with certain known diseases often contain the same characteristic components. These components may be the cause of the disease, or they may be a manifestation of it. Clinically, if a certain type of component is found in a patient's body fluid, the patient can be associated with a particular disease or class of diseases, which provides data to support clinical diagnosis. Because the human body is a very complex organism, diagnosing a disease and choosing a treatment plan both require professional medical staff to make judgments from a large amount of data for each individual, so diagnostic efficiency is low and labor costs are high. When many patients need to be examined, patients have to queue for a long time, doctors work continuously under strain, the consultation time for each patient is short, and misdiagnosis can easily occur. Clinical medicine therefore needs equipment that can analyze the composition of a large number of body fluid samples at the same time and that can, based on a large number of body fluid samples from known healthy people and patients, detect and analyze within a short time whether a large number of unknown samples contain certain specific components, thereby helping medical staff to make diagnoses more conveniently and accurately.

Summary of the invention

The purpose of the present invention is to provide a mass spectrometry data analysis method that solves the technical problem in the prior art that, when a large number of body fluid samples need to be classified, a great deal of time and resources are consumed and labor costs are high.

To solve the above technical problem, the present invention provides a mass spectrometry data analysis method comprising the following steps: a sample data collection step, for collecting mass spectrometry data of two or more body fluid samples and generating mass spectra from the mass spectrometry data, wherein the body fluid samples include two or more training samples and at least one test sample, the training samples are divided into two or more groups, and training samples of the same group carry the same group label; a sample data preprocessing step, for preprocessing at least one set of mass spectrometry data and applying coordinate transformations to the mass spectra to obtain standardized mass spectrometry data of the training samples and the test sample; a data model construction and cross-validation step, for constructing a primary data model from the standardized mass spectrometry data and the group labels of the training samples, and performing cross-validation on the primary data model at least once using the standardized mass spectrometry data of the training samples; a data model optimization step, for constructing an optimized data model according to the cross-validation results; and a sample group determination step, for obtaining the group label of the test sample using the standardized mass spectrometry data of the test sample and the optimized data model.

Further, the sample data collection step specifically comprises the following steps: obtaining two or more body fluid samples; arranging all the body fluid samples in a matrix on a plate; and collecting mass spectrometry data of the body fluid samples by mass spectrometry and generating mass spectra, at least one set of mass spectrometry data being collected for each body fluid sample.

Further, the test sample is located in the middle of the plate and the training samples surround the test sample; the plate includes but is not limited to a matrix metal plate; any two adjacent training samples have different group labels; the distance between any two adjacent body fluid samples is greater than or equal to 2 mm and less than or equal to 5 mm.

Further, each set of mass spectrometry data includes the mass-to-charge ratio of an ion in the body fluid sample and the measured signal intensity of that ion; each set of mass spectrometry data corresponds to one sampling point in the mass spectrum; the abscissa of each sampling point is the mass-to-charge ratio of an ion, and its ordinate is the measured signal intensity of that ion.

Further, the sample data preprocessing step specifically comprises the following steps: a baseline correction step, for applying baseline correction to the mass spectrometry data in the mass spectrum; a resampling step, for using a resampling algorithm to resample the ion mass-to-charge ratios in the baseline-corrected mass spectrometry data, transforming the abscissa of the mass spectrum and unifying the mass-to-charge ratios of all mass spectrometry data, to obtain resampled mass spectrometry data; and a standardization step, for standardizing the ion signal intensity values in the resampled mass spectrometry data, transforming the ordinate of the mass spectrum, to obtain standardized mass spectrometry data.

Further, the baseline correction step specifically comprises the following steps: a signal calculation step, for using a window function to calculate the baseline signal intensity corresponding to at least one mass-to-charge ratio in a set of mass spectrometry data; and a signal correction step, for correcting the measured signal intensity at that mass-to-charge ratio according to the baseline signal intensity. The signal calculation step and the signal correction step are repeated until every set of mass spectrometry data of every body fluid sample has been corrected.

Further, the resampling step specifically comprises the following steps: an effective mass-to-charge ratio selection step, for selecting the effective mass-to-charge ratio interval and the number of effective mass-to-charge ratios; an effective mass-to-charge ratio calculation step, for using the resampling algorithm to calculate the mass-to-charge ratios of the resampled mass spectrometry data; and an interpolation step, for interpolating the baseline-corrected mass spectrum using the resampled mass-to-charge ratios and their index numbers, so that the abscissa of the baseline-corrected mass spectrum is transformed from mass-to-charge ratio values to mass-to-charge ratio index numbers.

Further, the resampling algorithm is as follows: let the mass-to-charge ratio interval of the effective mass spectrometry data after resampling be [y₁, y₂] and the number of mass-to-charge ratio coordinates after resampling be N; the resampled mass-to-charge ratio coordinates are calculated by a formula [formula not reproduced in this text],

where N is greater than 10⁴ and less than 10⁵.

Further, the standardization step specifically comprises the following steps: a step of calculating the sum of absolute signal intensities, for calculating the sum S of the absolute values of the ion signal intensity values in all resampled mass spectrometry data; a step of setting the standardized sum, for setting the sum of the absolute values of the ion signal intensity values in all resampled mass spectrometry data after standardization to a constant T; a step of calculating the scaling factor, for calculating the factor T/S by which each signal intensity value changes; and a signal intensity scaling step, for scaling up or scaling down every ion signal intensity value in the resampled mass spectrometry data by the same factor.

Further, the data model construction and cross-validation step specifically comprises the following steps: selecting any one training sample, whose group label is known, as a standard training sample; setting a circular area on the plate with the position of the standard training sample as its center and a specific length r as its radius; constructing a matrix D from the standardized mass spectrometry data of the training samples in the circular area other than the standard training sample, each column of the matrix D corresponding to one set of standardized mass spectrometry data of one training sample; obtaining a label vector from the group labels of the training samples in the circular area other than the standard training sample, the group label of each training sample being recorded in the vector; establishing a primary data model using a sparse learning optimization algorithm; multiplying two or more sets of mass spectrometry data of the standard training sample by the data model, sorting the products by value, and rounding their median to an integer to obtain an inferred group label for the standard training sample; comparing the inferred group label of the standard training sample with its actual group label and, if the two are the same, judging the inference correct and incrementing a correctness counter by one; taking each training sample in turn as the standard training sample and repeating the above steps, so that all training samples are cross-validated, and calculating the group label accuracy of the training samples for radius r, the accuracy being the ratio of the correctness counter to the total number of training samples; adjusting the radius r and repeating the above steps to calculate the group label accuracy for different values of r; and selecting the maximum of the two or more accuracies and taking the corresponding radius as the optimal value R.

Further, the data model optimization step specifically comprises the following steps: setting a circular area on the plate with the position of a test sample as its center and the optimal radius R as its radius; constructing a matrix D_W from the standardized mass spectrometry data of all training samples in the circular area, each column of the matrix D_W corresponding to one set of standardized mass spectrometry data of one training sample; obtaining a label vector from the group labels of all training samples in the circular area, the group label of each training sample being recorded in the vector as a natural number; and establishing an optimized data model using a sparse learning optimization algorithm.
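
The radius search described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not reproduce the sparse learning formula, so `lasso_cd` is a stand-in (plain L1-regularized least squares without an intercept, solved by coordinate descent), the model is assumed to be a weight vector, rounding is assumed to be to the nearest integer, and the sample dictionaries with keys `pos`, `label` and `spectra` are hypothetical.

```python
import math
import statistics


def lasso_cd(rows, y, lam=0.01, iters=200):
    """Stand-in sparse learner (an assumption): minimise
    (1/2n)*||Xw - y||^2 + lam*||w||_1 by coordinate descent."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw for the initial w = 0
    col_sq = [sum(r[j] ** 2 for r in rows) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(r[j] * (e + r[j] * w[j]) for r, e in zip(rows, resid)) / n
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / col_sq[j]
            if new != w[j]:
                d = new - w[j]
                resid = [e - r[j] * d for r, e in zip(rows, resid)]
                w[j] = new
    return w


def loo_accuracy(samples, r, fit):
    """Leave-one-out accuracy when each model is trained only on the
    training samples lying within distance r of the held-out sample."""
    correct = 0
    for i, s in enumerate(samples):
        near = [t for k, t in enumerate(samples)
                if k != i and math.dist(s["pos"], t["pos"]) <= r]
        if not near:
            continue  # no neighbours in the circle: counted as incorrect
        rows = [sp for t in near for sp in t["spectra"]]
        labels = [t["label"] for t in near for _ in t["spectra"]]
        w = fit(rows, labels)
        scores = [sum(x * wj for x, wj in zip(sp, w)) for sp in s["spectra"]]
        guess = math.floor(statistics.median(scores) + 0.5)  # nearest integer
        correct += guess == s["label"]
    return correct / len(samples)


def best_radius(samples, radii, fit):
    """Return the candidate radius with the highest leave-one-out accuracy."""
    return max(radii, key=lambda r: loo_accuracy(samples, r, fit))
```

Calling `best_radius(samples, [3.0, 4.5], lasso_cd)` returns whichever candidate gives the higher accuracy. On an alternating layout, a radius equal to the sample pitch would leave every circle with neighbours of only the other group, which may be one reason the radius has to be tuned.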

Further, the sample group determination step specifically comprises the following steps: multiplying one set of mass spectrometry data of a test sample by the data model and rounding the product to an integer to obtain the group label of the test sample; or multiplying two or more sets of mass spectrometry data of a test sample by the data model, sorting the products by value, and rounding their median to an integer to obtain the group label of the test sample.
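
As a rough illustration of this decision rule, the sketch below assumes that the data model is a weight vector, that "multiplying" means taking the dot product of one spectrum with that vector, and that rounding means rounding to the nearest integer. The function name `predict_label` and these interpretations are assumptions, not details given in the text.

```python
import math
import statistics


def predict_label(spectra, weights):
    """Infer a group label from one or more replicate spectra.

    spectra: list of standardized spectra (each a list of intensities).
    weights: model weight vector of the same length (assumed form).
    """
    # One score per replicate spectrum: dot product with the model.
    scores = [sum(x * w for x, w in zip(s, weights)) for s in spectra]
    # With a single spectrum the median is just that score.
    middle = statistics.median(scores)
    # Round to the nearest integer (assumed reading of "rounding").
    return math.floor(middle + 0.5)
```

Taking the median rather than the mean means that a single outlying replicate does not change the inferred label.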

The advantage of the present invention is that it provides a mass spectrometry data analysis method that builds a grouping model from body fluid samples of known groups, obtains the data model with the highest accuracy through repeated cross-validation over multiple training samples, and can process the mass spectrometry data of a large number of body fluid samples at the same time and group them according to their composition.

Description of drawings

Fig. 1 is a flowchart of the mass spectrometry data analysis method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the sample data collection step according to an embodiment of the present invention;

Fig. 3 is a mass spectrum generated from the mass spectrometry data before preprocessing according to an embodiment of the present invention;

Fig. 4 is a flowchart of the sample data preprocessing step according to an embodiment of the present invention;

Fig. 5 is a flowchart of the baseline correction step according to an embodiment of the present invention;

Fig. 6 is a mass spectrum generated from the mass spectrometry data after baseline correction according to an embodiment of the present invention;

Fig. 7 is a flowchart of the resampling step according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the effective mass-to-charge ratios in the resampled mass spectrometry data according to an embodiment of the present invention;

Fig. 9 is a mass spectrum generated from the resampled mass spectrometry data according to an embodiment of the present invention;

Fig. 10 is a flowchart of the standardization step according to an embodiment of the present invention;

Fig. 11 is a mass spectrum generated from the standardized mass spectrometry data according to an embodiment of the present invention;

Fig. 12 is a flowchart of the data model construction and cross-validation step according to an embodiment of the present invention;

Fig. 13 is a flowchart of the data model optimization step according to an embodiment of the present invention.

Detailed description

An embodiment of the present invention is described below with reference to the accompanying drawings to demonstrate that the invention can be implemented.

As shown in Fig. 1, this embodiment provides a mass spectrometry data analysis method comprising the following steps S1) to S5).

Step S1) Sample data collection step, for collecting at least one set of mass spectrometry data from two or more body fluid samples and generating mass spectra from the mass spectrometry data. The body fluid samples include two or more training samples and at least one test sample; the training samples are divided into two or more groups (which may also be called classes), and training samples of the same group carry the same group label. The body fluid may be any body fluid from a human or another organism. This embodiment preferably uses human blood samples with group labels 0 and 1: samples of group 0 come from patients with a certain disease (such as diabetes or hemophilia), and samples of group 1 come from healthy people without that disease. The training samples are blood samples with known group labels, each marked 0 or 1. In other embodiments, other natural numbers may be used as group labels.

As shown in Fig. 2, step S1) specifically comprises the following steps. Step S101) Obtain two or more body fluid samples; typically dozens or hundreds of samples are used. Step S102) Arrange all the body fluid samples as droplets in a matrix on a plate (preferably a matrix metal plate), with the test sample in the middle of the plate and the training samples surrounding it; any two adjacent training samples have different group labels; the distance between any two adjacent body fluid samples is greater than or equal to 2 mm and less than 5 mm; the plate includes but is not limited to a matrix metal plate. Step S103) Collect mass spectrometry data of the body fluid samples by mass spectrometry and generate mass spectra, as shown in Fig. 3. At least one set of mass spectrometry data is collected for each body fluid sample, preferably three or more sets; classifying on the basis of several sets of data from the same sample reduces the interference caused by errors in any single set and improves accuracy. Each set of mass spectrometry data includes the mass-to-charge ratio of an ion in the body fluid sample and the measured signal intensity of that ion; for each sampling point in the mass spectrum, the abscissa is the mass-to-charge ratio of an ion and the ordinate is the measured signal intensity of that ion, as shown in Fig. 3.
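
The layout constraints of step S102) can be checked with a short sketch like the one below. It assumes that the samples sit on a regular grid with a single pitch, that "adjacent" means horizontally or vertically neighbouring grid cells, and that a test sample is represented by `None`; the function name `check_layout` and these conventions are illustrative assumptions.

```python
def check_layout(labels, pitch_mm):
    """Check a plate layout against the constraints of step S102).

    labels: 2-D list of group labels; None marks a test sample.
    pitch_mm: distance between neighbouring droplets, in millimetres.
    """
    # Spacing: at least 2 mm and less than 5 mm (this embodiment).
    if not (2.0 <= pitch_mm < 5.0):
        return False
    rows, cols = len(labels), len(labels[0])
    for i in range(rows):
        for j in range(cols):
            a = labels[i][j]
            # Compare with the right-hand and lower neighbours only,
            # so that each adjacent pair is examined once.
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols:
                    b = labels[ni][nj]
                    if a is not None and b is not None and a == b:
                        return False
    return True
```

With two groups, an alternating (checkerboard) arrangement such as `[[0, 1, 0], [1, None, 1], [0, 1, 0]]` satisfies the check, with the test sample in the middle.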

Step S2) Sample data preprocessing step, for preprocessing at least one set of mass spectrometry data and applying coordinate transformations to the mass spectrum to obtain standardized mass spectrometry data of the training samples. Because of factors such as sample handling, instrument performance and external contamination, the mass spectrometry data obtained directly from the mass spectrometer need appropriate preprocessing to improve grouping accuracy.

As shown in Fig. 4, step S2) specifically comprises steps S201) to S203). Applying the three processing steps of baseline correction, resampling and standardization to the mass spectrometry data in the mass spectrum prevents external factors from unduly affecting the grouping accuracy.

Step S201) Baseline correction step, for applying baseline correction to the mass spectrometry data in the mass spectrum. The baseline is the underlying intensity level in the mass spectrometry data; the purpose of the baseline correction step is to identify and remove baselines that deviate substantially in the mass spectrum and to remove data with large deviations from the mass spectrometry data. As shown in Fig. 5, the baseline correction step S201) specifically comprises the following steps: step S2011) a signal calculation step, for using a window function to calculate the baseline signal intensity for at least one mass-to-charge ratio in a set of mass spectrometry data; step S2012) a signal correction step, for correcting the measured signal intensity at that mass-to-charge ratio according to the baseline signal intensity, and screening out and removing invalid data with large deviations. Steps S2011) to S2012) are repeated until every set of mass spectrometry data of every body fluid sample has been corrected. When signal processing is implemented on a computer, an infinitely long signal cannot be measured or computed on; instead a finite time segment is cut from the signal and periodically extended to give a virtual infinitely long signal, on which mathematical processing such as the Fourier transform and correlation analysis can then be performed. In practice, different truncation functions can be used to cut the signal, and such a truncation function is called a window function. In this embodiment, the window function parameters STEP and WINDOW are both set to 50. After baseline correction, the baseline-corrected mass spectrum is obtained, as shown in Fig. 6; its abscissa is the mass-to-charge ratio of an ion and its ordinate is the measured signal intensity of that ion.
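
One common way to realize a windowed baseline correction is sketched below. The text gives only the parameter values STEP = 50 and WINDOW = 50, so the rest is assumed: the window and step are measured in data points, the baseline inside each window is taken as the minimum intensity, and the baseline between window centers is linearly interpolated. The function names are hypothetical.

```python
def estimate_baseline(intensity, window=50, step=50):
    """Estimate a baseline as per-window minima, linearly interpolated."""
    n = len(intensity)
    centers, minima = [], []
    for start in range(0, n, step):
        chunk = intensity[start:start + window]
        centers.append(start + (len(chunk) - 1) / 2.0)
        minima.append(min(chunk))
    baseline = []
    j = 0
    for i in range(n):
        # Advance to the pair of window centers that brackets point i.
        while j + 1 < len(centers) - 1 and i > centers[j + 1]:
            j += 1
        if len(centers) == 1 or i <= centers[0]:
            baseline.append(minima[0])    # before the first center
        elif i >= centers[-1]:
            baseline.append(minima[-1])   # after the last center
        else:
            x0, x1 = centers[j], centers[j + 1]
            t = (i - x0) / (x1 - x0)
            baseline.append(minima[j] + t * (minima[j + 1] - minima[j]))
    return baseline


def correct_baseline(intensity, window=50, step=50):
    """Subtract the estimated baseline from the measured intensities."""
    base = estimate_baseline(intensity, window, step)
    return [y - b for y, b in zip(intensity, base)]
```

For a spectrum with a constant offset and one peak, `correct_baseline` removes the offset and leaves the peak height above the baseline unchanged.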

Step S202) Resampling step, for using a resampling algorithm to resample the ion mass-to-charge ratios in the baseline-corrected mass spectrometry data, transforming the abscissa of the mass spectrum, unifying the mass-to-charge ratios of all mass spectrometry data and removing mass spectrometry data with large deviations, to obtain resampled mass spectrometry data.

As shown in Fig. 7, the resampling step S202) specifically comprises the following steps S2021) to S2023). S2021) Effective mass-to-charge ratio selection step, for selecting the effective mass-to-charge ratio interval and the number of effective mass-to-charge ratios; a schematic diagram of the effective mass-to-charge ratios in the resampled data is constructed, whose abscissa is the index number of each effective mass-to-charge ratio retained after resampling and whose ordinate is the mass-to-charge ratio value corresponding to that index number. S2022) Effective mass-to-charge ratio calculation step, for using the resampling algorithm to calculate the mass-to-charge ratios of the resampled mass spectrometry data. The resampling algorithm is as follows: let the mass-to-charge ratio interval of the effective mass spectrometry data after resampling be [y₁, y₂] and the number of mass-to-charge ratio coordinates after resampling be N; the resampled mass-to-charge ratio coordinates are calculated by a formula [formula not reproduced in this text],

where N is greater than 10⁴ and less than 10⁵, so as to balance the accuracy of the algorithm against computation speed. S2023) Interpolation step, for interpolating the baseline-corrected mass spectrum using the resampled mass-to-charge ratios and their index numbers, so that the abscissa of the baseline-corrected mass spectrum is transformed from mass-to-charge ratio values to mass-to-charge ratio index numbers. In this embodiment, the mass-to-charge ratios of the resampled mass spectrometry data all lie in the interval 98.9 to 1003.1, and 10,000 sets of effective mass spectrometry data are retained. The resampled mass-to-charge ratio coordinates are calculated with the formula [formula not reproduced in this text], giving 10,000 mass-to-charge ratios corresponding to the effective mass spectrometry data. Fig. 8 is a schematic diagram of the effective mass-to-charge ratios in the resampled data of this embodiment; its abscissa is the index number of each effective mass-to-charge ratio retained after resampling, and its ordinate is the mass-to-charge ratio value corresponding to that index number.

During interpolation of the mass spectrum, the redundant mass spectrometry data in the baseline-corrected mass spectrum (Fig. 6) are removed and only the resampled effective mass spectrometry data are retained. The abscissa of the baseline-corrected mass spectrum is transformed from mass-to-charge ratio values to mass-to-charge ratio index numbers while the ordinate is left unchanged, which completes the resampling of each set of original mass spectrometry data. Fig. 9 shows the mass spectrum of the resampled mass spectrometry data of this embodiment; its abscissa is the index number of each effective mass-to-charge ratio after resampling, and its ordinate is the measured ion signal intensity corresponding to that index number. After the resampling step, intervals of the mass spectrum with relatively small mass-to-charge ratios contain more sampling points and intervals with larger mass-to-charge ratios contain fewer, consistent with the assumption that intervals with small mass-to-charge ratios carry more information than intervals with large mass-to-charge ratios.
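
The resampling and interpolation can be sketched as follows. The patent's own grid formula is not reproduced in this text, so `make_grid` uses a purely hypothetical grid that is uniform in the square root of the mass-to-charge ratio; it was chosen only because it places more points at low mass-to-charge ratios, as described above, and the real formula may differ. Linear interpolation and the function names are likewise assumptions.

```python
import bisect
import math


def make_grid(y1, y2, n):
    """ASSUMED grid on [y1, y2]: uniform in sqrt(m/z), hence denser
    at low m/z. Not the formula from the patent."""
    a, b = math.sqrt(y1), math.sqrt(y2)
    return [(a + (b - a) * i / (n - 1)) ** 2 for i in range(n)]


def resample(mz, intensity, grid):
    """Linearly interpolate (mz, intensity) onto grid.

    mz must be strictly ascending. The position of each value in the
    returned list serves as the mass-to-charge ratio index number.
    """
    out = []
    for g in grid:
        k = bisect.bisect_left(mz, g)
        if k == 0:
            out.append(intensity[0])       # at or below the first point
        elif k == len(mz):
            out.append(intensity[-1])      # above the last point
        else:
            x0, x1 = mz[k - 1], mz[k]
            t = (g - x0) / (x1 - x0)
            out.append(intensity[k - 1] + t * (intensity[k] - intensity[k - 1]))
    return out
```

For the embodiment's values one would call `make_grid(98.9, 1003.1, 10000)` and then resample every baseline-corrected spectrum onto that shared grid, so that all spectra have the same abscissa.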

Step S203) A standardization step, for standardizing the ion signal intensity values in the resampled mass spectrum data and transforming the ordinate of the mass spectrum, to obtain standardized mass spectrum data. As shown in Figure 10, step S203) comprises steps S2031) to S2034). Step S2031) A step of calculating the sum of absolute signal intensities, which computes the sum S of the absolute values of the ion signal intensities in all resampled mass spectrum data; step S2032) a step of setting the standardized sum, which sets the sum of the absolute values of the ion signal intensities in all resampled mass spectrum data after standardization to a constant T, set to 10000 in this embodiment; step S2033) a step of calculating the scaling factor, which computes the factor T/S by which every signal intensity value is changed; step S2034) a signal intensity scaling step, which scales every ion signal intensity value in the resampled mass spectrum data up or down by the same factor, namely the T/S from step S2033), thereby transforming the ordinate of the mass spectrum. Figure 11 shows the mass spectrum of the standardized mass spectrum data in this embodiment: the abscissa is the valid mass-to-charge ratio index number after resampling, and the ordinate is the standardized ion signal intensity for that index number. The technical effect of the standardization step is that mapping the intensities of the mass spectrum data onto a common range ensures that the intensity distribution of every set of mass spectrum data is essentially the same, which improves the comparability of mass spectrum data from different samples.
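Steps S2031) to S2034) amount to multiplying each intensity by T/S. A minimal Python sketch follows; the function name `normalize_spectrum` is hypothetical, and the sketch assumes the sum S is taken over one resampled spectrum at a time, which the text does not state explicitly.

```python
import numpy as np

def normalize_spectrum(intensity, total=10000.0):
    """Scale intensities so the sum of their absolute values equals `total`."""
    intensity = np.asarray(intensity, dtype=float)
    s = np.abs(intensity).sum()          # S, step S2031
    if s == 0:
        raise ValueError("spectrum has zero total intensity")
    return intensity * (total / s)       # factor T/S, steps S2033 and S2034

spectrum = np.array([1.0, -2.0, 3.0, 4.0])   # made-up intensities, S = 10
scaled = normalize_spectrum(spectrum)        # every value multiplied by 1000
```

The scaling is uniform, so relative peak heights within a spectrum are preserved; only the overall scale changes.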

Step S3) A data model construction and cross-validation step, for constructing a primary data model from the mass spectrum data and group labels of the training samples and performing n rounds of cross-validation on the primary data model using the mass spectrum data of the training samples (n being the number of training samples); that is, machine learning is performed on the mass spectrum data and group labels of known training samples to build a model. As shown in Figure 12, step S3) comprises the following steps: step S301) select any one training sample, whose group label is known, as the standard training sample; step S302) define a circular region on the plate, centered on the position of the standard training sample and with a specific length r as its radius; step S303) construct a matrix D from the standardized mass spectrum data of the training samples inside the circular region other than the standard training sample, each column of D corresponding to one set of standardized mass spectrum data of one training sample; step S304) obtain a label vector from the group labels of the training samples inside the circular region other than the standard training sample, the group label of each training sample being recorded in the vector; step S305) build the primary data model using a sparse learning optimization algorithm; step S306) multiply two or more sets of mass spectrum data of the standard training sample by the data model, sort the products by value, and round their median to obtain the predicted group label of the standard training sample; because every group label in the present invention is an integer (0 or 1), the median is rounded half up to the nearest integer, and this is the rounding referred to here; step S307) compare the predicted group label of the standard training sample with its known group label; if the two are the same, the prediction is judged correct and the correctness counter is incremented by one; step S308) take each training sample in turn as the standard training sample and repeat steps S301) to S307), so that all training samples undergo cross-validation, then calculate the group label judgment accuracy of the training samples for radius r, which is the ratio of the correctness counter value to the total number of training samples; step S309) adjust the radius r and repeat steps S301) to S308) to calculate the group label judgment accuracy for each value of r; step S310) select the maximum among these accuracies and take the corresponding value of r as the optimal radius R.
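The leave-one-out loop of steps S301) to S308) can be sketched as follows. This is not the patent's implementation: `fit_stand_in` replaces the sparse learning solver with an ordinary minimum-norm least-squares fit to keep the example short, and the toy plate, the spectra and all function names are made up for illustration.

```python
import numpy as np

def fit_stand_in(D, labels):
    """Stand-in for the sparse solver: minimum-norm least squares.

    D holds one spectrum per column; the weights w satisfy D.T @ w ~= labels.
    """
    w, *_ = np.linalg.lstsq(D.T, labels, rcond=None)
    return w

def loo_accuracy(positions, spectra, labels, r, fit=fit_stand_in):
    """Leave-one-out label accuracy for neighbourhood radius r (in mm).

    positions: (n, 2) plate coordinates
    spectra:   list of n arrays, each (n_features, n_replicates)
    labels:    (n,) integer group labels (0 or 1)
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        dist = np.linalg.norm(positions - positions[i], axis=1)
        nbrs = [j for j in range(n) if j != i and dist[j] <= r]
        if not nbrs:
            continue  # nothing inside the circle: counted as a miss
        # Matrix D (one column per spectrum) and the matching label vector
        D = np.hstack([spectra[j] for j in nbrs])
        lab = np.concatenate(
            [np.full(spectra[j].shape[1], labels[j], dtype=float) for j in nbrs]
        )
        w = fit(D, lab)
        # Median of the replicate products, rounded half up (step S306)
        predicted = int(np.floor(np.median(spectra[i].T @ w) + 0.5))
        if predicted == labels[i]:
            correct += 1
    return correct / n

# Toy 3x3 plate, 2 mm pitch, checkerboard labels, two replicates per sample
positions = np.array([[2.0 * x, 2.0 * y] for y in range(3) for x in range(3)])
labels = np.array([(x + y) % 2 for y in range(3) for x in range(3)])
patterns = {0: np.array([1.0, 0.0, 0.0]), 1: np.array([0.0, 1.0, 0.0])}
spectra = [np.column_stack([patterns[int(g)]] * 2) for g in labels]

acc_small = loo_accuracy(positions, spectra, labels, r=2.0)
acc_large = loo_accuracy(positions, spectra, labels, r=3.0)
```

On this toy plate the 2 mm circle contains only samples of the opposite group, so the fit never sees both labels and the accuracy is lower than at 3 mm, where each circle contains both groups. This is the kind of dependence on r that the radius search in steps S309) and S310) is meant to resolve.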

A machine learning algorithm is an algorithm that automatically analyzes data to derive regularities and then uses those regularities to make predictions on unknown data. The present invention uses the Lasso regression algorithm from machine learning to analyze mass spectrum data and learn a model, in two phases: training and testing. The basic idea of Lasso is to minimize the residual sum of squares subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant, which drives some regression coefficients to exactly 0 and yields an interpretable model.
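Written out, the constrained problem described above takes the standard Lasso form below. The notation is assumed here, since the text gives none: y_i is the group label of the i-th training spectrum, x_i its standardized intensity vector, β the vector of p regression coefficients, m the number of training spectra, and t the bound on the coefficient sum.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \sum_{i=1}^{m} \left( y_i - x_i^{\top}\beta \right)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

An equivalent penalized form adds λ Σ|β_j| to the residual sum of squares; a smaller t (or larger λ) sets more coefficients to exactly zero.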

In this embodiment, the Lasso algorithm is used for n rounds of cross-validation (n being the number of training samples). The model obtained in each round is combined with 11 intensity thresholds 0, 0.1, …, 1, giving 11 corresponding group label judgment accuracies; over n rounds this yields n*11 data models (groupers), each with its own group label judgment accuracy. The radius is then varied over r = 2.0 mm, 2.2 mm, 2.4 mm, …, 4.8 mm, 5 mm, giving n*11*16 group label judgment accuracies. All of these accuracies are sorted by value to find the maximum, and the radius corresponding to that maximum is the optimal radius R.
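The search grid in this paragraph has 16 radii and 11 thresholds. The sketch below only shows the bookkeeping of picking the best entry; it is a simplification that collapses the per-round results into one accuracy per (radius, threshold) pair, and the accuracy table is filled with placeholder numbers. The text does not say how the intensity thresholds enter the model, so they are treated here purely as a second axis of the table; `select_best` is a hypothetical name.

```python
import numpy as np

radii = np.linspace(2.0, 5.0, 16)        # 2.0, 2.2, ..., 5.0 mm
thresholds = np.linspace(0.0, 1.0, 11)   # 0, 0.1, ..., 1

def select_best(accuracy):
    """accuracy[i, j] is the accuracy for radii[i] and thresholds[j]."""
    i, j = np.unravel_index(np.argmax(accuracy), accuracy.shape)
    return radii[i], thresholds[j], accuracy[i, j]

# Placeholder table (made-up numbers, for illustration only)
accuracy = np.full((len(radii), len(thresholds)), 0.5)
accuracy[4, 3] = 0.9
best_r, best_t, best_acc = select_best(accuracy)
```

If several entries tie for the maximum, `np.argmax` returns the first one in row-major order, so the smallest tied radius is chosen; the text does not specify a tie-breaking rule.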

Step S4) A data model optimization step, for constructing an optimized data model according to the cross-validation results. As shown in Figure 13, step S4) comprises the following steps: step S401) define a circular region on the plate, centered on the position of a test sample and with the optimal radius R from step S310) as its radius; step S402) construct a matrix D_W from the standardized mass spectrum data of all training samples inside the circular region, each column of D_W corresponding to one set of standardized mass spectrum data of one training sample; step S403) obtain a label vector from the group labels of all training samples inside the circular region, the group label of each training sample being recorded as an integer in the vector entry corresponding to that training sample; step S404) build the optimized data model, using a sparse learning optimization algorithm.
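Step S404) names only "a sparse learning optimization algorithm", and the description identifies Lasso as the regression used. One standard way to solve a Lasso problem is iterative soft-thresholding (ISTA), sketched below. Whether the patent uses this particular solver is not stated, and the names `lasso_ista`, `soft_threshold` and the penalty weight `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise 0.5 * ||X w - y||^2 + lam * ||w||_1 by ISTA.

    X: (n_samples, n_features), y: (n_samples,).
    With the patent's layout (one spectrum per column of D_W),
    X would be D_W.T and y the label vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    lipschitz = np.linalg.norm(X, 2) ** 2   # largest eigenvalue of X.T @ X
    if lipschitz == 0:
        return w
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)            # gradient of the squared-error term
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w

# Check on a case with a known closed form: for X = I the solution is
# the soft-thresholded target.
w = lasso_ista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
```

In the identity case the middle coefficient comes out exactly zero, which illustrates the sparsity that makes the resulting model interpretable.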

Step S5) A sample group judgment step, for obtaining the group label of the test sample using the mass spectrum data of the test sample and the optimized data model. In step S5), one set of mass spectrum data of a test sample is multiplied by the data model and the product is rounded to obtain the group label of the test sample; alternatively, two or more sets of mass spectrum data of a test sample are multiplied by the data model, the products are sorted by value, and their median is rounded to obtain the group label of the test sample. In this embodiment, if the rounded result is 0, the person corresponding to the test sample can be considered to have a mass spectrum data pattern associated with a certain disease, which assists the physician in making a diagnosis; if the rounded result is 1, the person corresponding to the test sample can be considered not to have the mass spectrum data pattern associated with that disease, which likewise assists the physician in making a diagnosis.
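Both variants of step S5) reduce to rounding the median of the model outputs, since the median of a single value is that value. A small Python sketch follows; the weights and spectra are made-up numbers and `predict_group` is a hypothetical name.

```python
import numpy as np

def predict_group(spectra, w):
    """Return the integer group label for one sample.

    spectra: (n_features,) for one spectrum, or (n_replicates, n_features)
    w:       (n_features,) model weights
    """
    scores = np.atleast_2d(np.asarray(spectra, dtype=float)) @ w
    return int(np.floor(np.median(scores) + 0.5))   # round half up

w = np.array([0.0, 1.0])                             # hypothetical weights
replicates = np.array([[5.0, 0.1], [4.0, 0.2], [3.0, 1.4]])
label = predict_group(replicates, w)                 # scores 0.1, 0.2, 1.4
single = predict_group(np.array([1.0, 0.8]), w)      # score 0.8
```

In the three-replicate example the median score is 0.2, so the label is 0, whereas the mean (about 0.57) would round to 1; the median is less affected by one outlying replicate.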

The present invention provides a mass spectrometry data analysis method that builds a grouper model from body fluid samples of known group, obtains the data model with the highest accuracy through repeated cross-validation over multiple training samples, and can process the mass spectrum data of a large number of body fluid samples at once, grouping them according to their composition. Clinically, the technical solution of the invention can be applied to assist intelligent disease diagnosis: computer technology is used to test multiple blood samples from multiple subjects simultaneously, so that it can be determined within a short time whether each subject has a mass spectrum data pattern associated with a given disease, helping physicians reach a rapid diagnosis.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also fall within the scope of protection of the invention.

Claims

1. A mass spectrometry data analysis method, characterized in that it comprises the following steps:

a sample data collection step, for collecting mass spectrum data of two or more body fluid samples and generating mass spectra from the mass spectrum data, the body fluid samples including two or more training samples and at least one test sample;

wherein the training samples are divided into two or more groups, and training samples in the same group have the same group label;

a sample data preprocessing step, for preprocessing at least one set of mass spectrum data and applying coordinate transformations to the mass spectra, to obtain standardized mass spectrum data of the training samples and the test sample;

a data model construction and cross-validation step, for constructing a primary data model from the standardized mass spectrum data of the training samples and the group labels of the training samples, and performing cross-validation on the primary data model at least once;

a data model optimization step, for constructing an optimized data model according to the results of the cross-validation; and

a sample group judgment step, for obtaining the group label of the test sample using the standardized mass spectrum data of the test sample and the optimized data model.

2. The mass spectrometry data analysis method according to claim 1, characterized in that

the sample data collection step comprises the following steps:

obtaining two or more body fluid samples;

arranging all of the body fluid samples in a matrix on a plate; and

collecting the mass spectrum data of the body fluid samples by mass spectrometry and generating mass spectra, at least one set of mass spectrum data being collected for each body fluid sample.

3. The mass spectrometry data analysis method according to claim 2, characterized in that

the test sample is located in the middle of the plate, and the training samples surround the test sample;

the plate includes, but is not limited to, a matrix metal plate;

the group labels of any two adjacent training samples are different; and

the distance between any two adjacent body fluid samples is greater than or equal to 2 mm and less than or equal to 5 mm.

4. The mass spectrometry data analysis method according to claim 1 or 2, characterized in that

each set of mass spectrum data includes the mass-to-charge ratio value of an ion in the sample and the measured signal intensity value corresponding to that ion;

each set of mass spectrum data corresponds to one sampling point in the mass spectrum; and

the abscissa of each sampling point represents the mass-to-charge ratio of an ion, and the ordinate represents the measured signal intensity value corresponding to that ion.

5. The mass spectrometry data analysis method according to claim 1, characterized in that

the sample data preprocessing step comprises the following steps:

a baseline correction step, for performing baseline correction on the mass spectrum data in the mass spectrum;

a resampling step, for resampling the mass-to-charge ratios of the ions in the baseline-corrected mass spectrum data using a resampling algorithm, transforming the abscissa of the mass spectrum, and unifying the mass-to-charge ratios of all mass spectrum data, to obtain resampled mass spectrum data; and

a standardization step, for standardizing the ion signal intensity values in the resampled mass spectrum data and transforming the ordinate of the mass spectrum, to obtain standardized mass spectrum data.

6. The mass spectrometry data analysis method according to claim 5, characterized in that

the baseline correction step comprises the following steps:

a signal calculation step, for calculating, using a window function, the baseline signal intensity corresponding to at least one mass-to-charge ratio value in a set of mass spectrum data;

a signal correction step, for correcting the measured signal intensity corresponding to that mass-to-charge ratio according to the baseline signal intensity; and

repeating the signal calculation step and the signal correction step to complete, in turn, the correction of each set of mass spectrum data of each body fluid sample.

7. The mass spectrometry data analysis method according to claim 5, characterized in that

the resampling step comprises the following steps:

a valid mass-to-charge ratio selection step, for selecting the valid mass-to-charge ratio range and the number of valid mass-to-charge ratios;

a valid mass-to-charge ratio calculation step, for calculating the mass-to-charge ratios of the resampled mass spectrum data using a resampling algorithm; and

an interpolation step, for interpolating the baseline-corrected mass spectrum using the resampled mass-to-charge ratios and their index numbers, and converting the abscissa of the baseline-corrected mass spectrum from mass-to-charge ratio values to mass-to-charge ratio index numbers.

8. The mass spectrometry data analysis method according to claim 7, characterized in that

the resampling algorithm is as follows:

the mass-to-charge ratio interval of the valid mass spectrum data after resampling is set to [y₁, y₂], and the number of mass-to-charge ratio coordinates after resampling is set to N;

the mass-to-charge ratio coordinates after resampling are calculated using the following formula: [formula not reproduced]

wherein N is greater than 10⁴ and less than 10⁵.

9. The mass spectrometry data analysis method according to claim 5, characterized in that

the standardization step comprises the following steps:

a step of calculating the sum of absolute signal intensities, for calculating the sum S of the absolute values of the ion signal intensity values in all resampled mass spectrum data;

a step of setting the standardized sum, for setting the sum of the absolute values of the ion signal intensity values in all resampled mass spectrum data after standardization to a constant T;

a step of calculating the scaling factor, for calculating the factor T/S by which each signal intensity value is changed; and

a signal intensity scaling step, for scaling every ion signal intensity value in the resampled mass spectrum data up or down by the same factor.

10. The mass spectrometry data analysis method according to claim 1, characterized in that

the data model construction and cross-validation step comprises the following steps:

selecting any one training sample, whose group label is known, as the standard training sample;

defining a circular region on the plate, centered on the position of the standard training sample and with a specific length r as its radius;

constructing a matrix D from the standardized mass spectrum data of the training samples inside the circular region other than the standard training sample, each column of the matrix D corresponding to one set of standardized mass spectrum data of one training sample;

obtaining a label vector from the group labels of the training samples inside the circular region other than the standard training sample, the group label of each training sample being recorded in the vector;

building a primary data model;

multiplying two or more sets of standardized mass spectrum data of the standard training sample by the data model, sorting the products by value, and rounding their median to obtain the predicted group label of the standard training sample;

comparing the predicted group label of the standard training sample with its known group label and, if the two are the same, judging the prediction correct and incrementing a correctness counter by one;

taking each training sample in turn as the standard training sample and repeating the above steps, so that all training samples undergo cross-validation, and calculating the group label judgment accuracy of the training samples for radius r, the group label judgment accuracy being the ratio of the correctness counter value to the total number of training samples;

adjusting the radius r and repeating the above steps to calculate the group label judgment accuracy for each value of the radius r; and

selecting the maximum among the two or more group label judgment accuracies, and taking the value of the radius r corresponding to that maximum as the optimal radius R.

11. The mass spectrometry data analysis method according to claim 1, characterized in that

the data model optimization step comprises the following steps:

defining a circular region on the plate, centered on the position of a test sample and with the optimal radius R as its radius;

constructing a matrix D_W from the standardized mass spectrum data of all training samples inside the circular region, each column of the matrix D_W corresponding to one set of standardized mass spectrum data of one training sample;

obtaining a label vector from the group labels of all training samples inside the circular region, the group label of each training sample being recorded as a natural number in the vector entry corresponding to that training sample; and

building the optimized data model using a sparse learning optimization algorithm.

12. The mass spectrometry data analysis method according to claim 1, characterized in that

the sample group judgment step comprises the following steps:

multiplying one set of mass spectrum data of a test sample by the data model and rounding the product to obtain the group label of the test sample; or

multiplying two or more sets of mass spectrum data of a test sample by the data model, sorting the products by value, and rounding their median to obtain the group label of the test sample.