CN111896609A

CN111896609A - A method for analyzing mass spectrometry data based on artificial intelligence

Info

Publication number: CN111896609A
Application number: CN202010707525.4A
Authority: CN
Inventors: 钱昆; 徐伟; 曹敬
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-11-06
Anticipated expiration: 2040-07-21
Also published as: CN111896609B

Abstract

A method for analyzing mass spectrometry data based on artificial intelligence, the method comprising: using a laser-assisted desorption/ionization mass spectrometer to collect a metabolite small-molecule fingerprint spectrum for each sample; extracting the absolute intensities of the fingerprint spectra; and inputting the processed data into a multi-layer neural network for sample grouping. Also disclosed is a method for calculating the importance of contributions to sample discrimination, comprising: converting the fingerprint data into two-dimensional images to build a metabolite screening image library; and applying a saliency feature analysis method to the data in that library, ranking all features, and screening out the substances that contribute most to sample discrimination. The beneficial effects of the invention are that samples are grouped rapidly from mass spectrometry data and the interpretability of the classification model is greatly improved.

Description

A method for analyzing mass spectrometry data based on artificial intelligence

Technical field

The invention belongs to the field of artificial-intelligence-assisted mass spectrometry data mining, and specifically relates to the use of metabolite fingerprint spectra obtained by mass spectrometry, together with artificial intelligence analysis techniques, to construct a sample grouping model and to calculate the importance of features to the grouping.

Background

Mass spectrometry offers high-throughput analysis and multi-metabolite detection, and is the main method for untargeted metabolite detection. However, its application faces problems such as lengthy pretreatment, the high complexity of biological samples, and the low abundance of metabolites in them. With the development of nanotechnology, the recently developed nano-assisted laser desorption/ionization mass spectrometry (LDI MS) has become the most practical tool for metabolic analysis, owing to its high analytical throughput (~300 samples/hour) and accurate metabolite identification (mass error <50 ppm).

Deep learning is mainly applied to the analysis of large, high-dimensional datasets. It accepts many types of input data and has become a cutting-edge technique for analyzing medical data of many kinds. As a frontier of machine learning, deep learning is now a mainstream analysis tool: because it optimizes a loss function to learn regularities in the data and to mine latent features, it is widely used in many fields, including biomedicine. However, traditional machine learning methods are poorly suited to analyzing and mining mass spectrometry data, because each sample has a very large number of features, which leads to overfitting, underfitting, and other problems that reduce accuracy. In addition, deep learning is a black box: it is difficult to pick out the important features in order to explain the mechanism behind a diagnosis.

A common feature-selection approach in deep learning is the saliency map. It is used mainly in the image domain: by comparing and detecting the regions of an image that differ saliently, it quickly and intuitively singles out the regions with the most pronounced differences. The technique has been extended to complex scene-understanding problems in fields such as neuroscience, psychology, and medical diagnosis. However, saliency analysis has not yet been applied to mass spectrometry data analysis.

Summary of the invention

To address the long analysis time, high data dimensionality, and complex feature combinations encountered in mass spectrometry data analysis, the present invention provides a fast, accurate, and efficient method that uses a classification model built on an improved multi-layer neural network to group samples rapidly and to calculate the importance of each contribution to the classification, greatly improving the interpretability of the classification model.

A method for analyzing mass spectrometry data based on artificial intelligence uses a multi-layer neural network to analyze and process the mass spectrometry data and thereby group the samples.

The method comprises:

Step 1: pipetting the sample onto a mass spectrometry target plate and drying it to form a thin layer for subsequent mass spectrometry analysis;

Step 2: collecting, for each analyzed sample, a metabolite small-molecule fingerprint spectrum in the range 100-1000 in positive-ion mode using a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;

Step 3: extracting the absolute intensities from the raw metabolic fingerprint spectra, and centering the data extracted from all samples as preprocessing;

Step 4: inputting the data from Step 3 into the neural network for sample grouping.

Further, in Step 2, at least 2 independent experiments are performed on each sample to eliminate intra-individual bias and improve the reproducibility and stability of the analysis.

Further, the multi-layer neural network comprises a network input, a network body, and a network output. The network body comprises a feature extraction part, a nonlinear feature interaction layer, and a classification layer. The network input is first processed by the feature extraction part; the output of the feature extraction part is processed by the nonlinear feature interaction layer; the output of the nonlinear feature interaction layer is processed by the classification layer; and the output of the classification layer is the network output.

The formulas from the network input through the classification layer are:

x_input = concatenate(x_spectral, x_ext) (1)

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4)
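Equations (1)-(4) describe a plain function composition. The following is a minimal sketch of that composition, assuming NumPy; the two stage functions are passed in by the caller, and `toy_extract` and `toy_interact` are hypothetical stand-ins used only to make the example run, not the layers described in this document.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward(x_spectral, x_ext, feature_extract, feature_interaction):
    """Compose equations (1)-(4); the two callables are supplied by the caller."""
    x_input = np.concatenate([x_spectral, x_ext], axis=-1)  # equation (1)
    x_fs = feature_extract(x_input)                          # equation (2)
    x_nl = feature_interaction(x_fs)                         # equation (3)
    return softmax(x_nl)                                     # equation (4)

# Hypothetical stand-ins, for illustration only:
# keep the first two inputs as class scores, then double them.
toy_extract = lambda x: x[..., :2]
toy_interact = lambda x: 2.0 * x

y_pred = forward(np.array([[1.0, 0.0, 3.0]]), np.array([[0.5]]),
                 toy_extract, toy_interact)
```

Each row of `y_pred` is a probability vector over the sample groups, so the predicted group is the index of the largest entry.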

Further, the network input is a 1024-dimensional multimodal feature vector (x_input, dimensions 1-1024) that contains the raw mass spectrometry data input (x_spectral); the remaining positions are filled with 0. Because the number of samples is limited, all multimodal features are simply scaled and centered.
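The input construction can be sketched as follows, assuming NumPy. The text does not say which scaling is used, so per-feature min-max scaling to [-1, 1] is an assumption here, and the function names `build_input` and `scale_centered` are hypothetical.

```python
import numpy as np

INPUT_DIM = 1024  # network input width stated in the text

def build_input(x_spectral, input_dim=INPUT_DIM):
    """Zero-pad a batch of intensity vectors (n_samples, n_features) to input_dim columns."""
    x = np.asarray(x_spectral, dtype=float)
    n_samples, n_features = x.shape
    if n_features > input_dim:
        raise ValueError("more features than input dimensions")
    x_input = np.zeros((n_samples, input_dim))
    x_input[:, :n_features] = x
    return x_input

def scale_centered(x, eps=1e-12):
    """Assumed scaling: per-feature min-max to [-1, 1]; constant columns (padding) stay 0."""
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    span = hi - lo
    mid = (hi + lo) / 2.0
    out = np.zeros_like(x)
    ok = span > eps  # skip columns with no variation to avoid dividing by zero
    out[:, ok] = 2.0 * (x[:, ok] - mid[ok]) / span[ok]
    return out

# Three toy samples with two spectral features each.
raw = np.array([[0.0, 10.0], [4.0, 20.0], [8.0, 30.0]])
x_input = scale_centered(build_input(raw))
```

Because padded columns are constant, they remain exactly 0 after scaling, which matches the zero-filling described above.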

Further, the feature extraction part (feature_extract) is a stack of four locally connected (LocallyConnected1D) layers. Each LocallyConnected1D layer divides all features into 32 intervals and applies fully connected feature extraction to each interval separately (32 fully connected layers, each with its own parameters). This captures the positional correlation of features in mass spectrometry data while remaining compatible with the final 32 external multimodal features. Compared with a four-layer fully connected architecture, it reduces network width and parameter count while modeling the feature extraction of mass-spectrometry-type data in finer detail, which also indirectly reduces overfitting.
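The idea of one independent dense map per interval can be sketched in NumPy as below. The per-interval widths (32, 16, 8, 4, 3), the ReLU activation, and the random initialization are assumptions chosen so that 32 intervals yield 32 x 3 = 96 outputs; the text gives only the number of layers, the number of intervals, and the 96 latent features.

```python
import numpy as np

def locally_connected(x, weights, biases):
    """Apply one independent dense map per interval, followed by ReLU.

    x:       (n_samples, n_intervals * in_per_interval)
    weights: (n_intervals, in_per_interval, out_per_interval)
    biases:  (n_intervals, out_per_interval)
    returns: (n_samples, n_intervals * out_per_interval)
    """
    n_intervals, in_per, out_per = weights.shape
    blocks = x.reshape(x.shape[0], n_intervals, in_per)
    # einsum contracts each interval with its own weight matrix (no weight sharing).
    out = np.einsum("nki,kio->nko", blocks, weights) + biases
    return np.maximum(out, 0.0).reshape(x.shape[0], n_intervals * out_per)

rng = np.random.default_rng(0)
N_INTERVALS = 32
# Hypothetical per-interval widths: 32 -> 16 -> 8 -> 4 -> 3.
widths = [32, 16, 8, 4, 3]
params = [
    (rng.normal(scale=0.1, size=(N_INTERVALS, w_in, w_out)),
     np.zeros((N_INTERVALS, w_out)))
    for w_in, w_out in zip(widths[:-1], widths[1:])
]

def feature_extract(x):
    """Stack of four locally connected layers: 1024 inputs -> 96 latent features."""
    for w, b in params:
        x = locally_connected(x, w, b)
    return x

x_demo = rng.uniform(-1.0, 1.0, size=(5, 1024))
x_fs = feature_extract(x_demo)
```

With these assumed widths the stack holds 21,888 weights, against 700,416 for four fully connected layers of the same widths (1024-512-256-128-96), a 32-fold reduction that follows directly from keeping only the 32 diagonal blocks.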

Further, the formula for the stack of four LocallyConnected1D layers is:

Further, the nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 latent features produced by the feature extraction part. Each layer of the feature interaction part extracts discretized ReLU-activated features and also an approximately quadratic relationship of linear combinations of features; these are merged, by residual connection or concatenation, into fused features, which captures nonlinear features more effectively. Dropout regularization further mitigates overfitting and improves generalization. The nonlinear feature interaction layer can be regarded as a new self-attention mechanism suited to fully connected layers, with the ability to fuse discrete and quadratic features. Compared with a multi-layer fully connected architecture, it increases nonlinearity while reducing network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.
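One possible reading of this description is sketched below, assuming NumPy. The layer's formula is not reproduced in this text, so the exact form here (a ReLU branch plus a squared linear branch, merged by a residual sum, with inverted dropout) is an assumption, and `interaction_layer` is a hypothetical name.

```python
import numpy as np

def interaction_layer(x, w_relu, w_quad, rng=None, drop_rate=0.0):
    """Fuse a ReLU branch and a quadratic branch through a residual sum.

    x:      (n_samples, n_features)
    w_relu: (n_features, n_features) weights of the ReLU branch
    w_quad: (n_features, n_features) weights of the quadratic branch
    """
    relu_branch = np.maximum(x @ w_relu, 0.0)  # ReLU-activated features
    quad_branch = (x @ w_quad) ** 2            # square of a linear combination
    fused = x + relu_branch + quad_branch      # residual merge keeps the width
    if rng is not None and drop_rate > 0.0:
        # Inverted dropout: zero random units and rescale the survivors.
        keep = rng.random(fused.shape) >= drop_rate
        fused = fused * keep / (1.0 - drop_rate)
    return fused

# Toy check with identity weights: x = [1, -1].
# ReLU branch [1, 0], quadratic branch [1, 1], so the sum is [3, 0].
x_toy = np.array([[1.0, -1.0]])
fused_toy = interaction_layer(x_toy, np.eye(2), np.eye(2))
```

Merging by concatenation instead of a residual sum would widen the output and change the shape expected by the classification layer, which is why the residual variant is used in this sketch.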

Further, the formula of the nonlinear feature interaction layer is:

Further, untargeted detection is performed on the preprocessed metabolic fingerprint spectra of the samples to obtain the relevant metabolite database, a mapping between the grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.

Further, the neural network is trained on the sample data: 3/4 of the training set data is randomly assigned to a training group and 1/4 to a test group. The training group samples are used for 10-fold cross-validation training of the multi-layer neural network, and classification is realized by computing the mean accuracy of the final model.
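The split and the cross-validation loop can be sketched as follows, assuming NumPy. The function names are hypothetical, and `fit_and_score` stands for whatever routine trains the network on one set of indices and returns its accuracy on another; the text does not specify that routine.

```python
import numpy as np

def split_train_test(n_samples, test_fraction=0.25, seed=0):
    """Randomly assign 3/4 of the indices to the training group and 1/4 to the test group."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

def k_fold(indices, k=10):
    """Yield (fit_idx, val_idx) pairs; every index is used for validation exactly once."""
    folds = np.array_split(indices, k)
    for i in range(k):
        fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield fit_idx, folds[i]

def mean_cv_accuracy(indices, fit_and_score, k=10):
    """Average the per-fold accuracy returned by the caller-supplied fit_and_score."""
    scores = [fit_and_score(fit_idx, val_idx) for fit_idx, val_idx in k_fold(indices, k)]
    return float(np.mean(scores))

train_idx, test_idx = split_train_test(200)
folds = list(k_fold(train_idx, k=10))
```

The test group is held out of every fold, so the cross-validated mean accuracy is computed on the training group alone.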

A method for calculating the importance of contributions to sample discrimination comprises the following steps:

Step 11: converting the mass spectrometry data into two-dimensional images to build a metabolite screening image library;

Step 12: applying a saliency feature analysis method (saliency maps) to the data in the metabolite screening image library, ranking all features, and screening out the substances that contribute most to sample discrimination.
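Steps 11 and 12 can be illustrated with a small gradient-based saliency sketch, assuming NumPy. The text does not state how a spectrum is laid out as an image or which model the saliency is taken through, so the 32 x 32 reshape and the linear-softmax stand-in model below are assumptions, and all function names are hypothetical.

```python
import numpy as np

SIDE = 32  # assumed layout: 32 * 32 = 1024, matching the stated input width

def to_image(x):
    """Lay a 1024-element vector out as a 32 x 32 image (row-major)."""
    return np.asarray(x, dtype=float).reshape(SIDE, SIDE)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency(x, w, b, target):
    """|d p_target / d x| for a stand-in model p = softmax(x @ w + b)."""
    p = softmax(x @ w + b)
    # Softmax Jacobian row: d p_t / d z_j = p_t * (1[j == t] - p_j)
    dz = p[target] * ((np.arange(p.size) == target) - p)
    return np.abs(w @ dz)

def rank_features(saliency_maps, top=5):
    """Average the maps over samples and return the indices of the largest entries."""
    mean_map = np.mean(saliency_maps, axis=0)
    return np.argsort(mean_map)[::-1][:top]

# Toy model in which only features 7 and 100 influence the output.
rng = np.random.default_rng(0)
w = np.zeros((1024, 2))
w[7, 1] = 2.0
w[100, 0] = 0.5
b = np.zeros(2)
xs = rng.uniform(-1.0, 1.0, size=(4, 1024))
maps = [saliency(x, w, b, target=1) for x in xs]
top_features = rank_features(maps, top=2)
```

Each ranked index corresponds to one input feature, so mapping it back to its m/z position identifies the candidate substance.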

The present invention has the following technical effects: samples are grouped rapidly from mass spectrometry data, and the interpretability of the classification model is greatly improved.

Description of drawings

FIG. 1 is a schematic diagram of the neural network structure in one embodiment of the present invention.

Detailed description

Preferred embodiments of the present application are described below with reference to the accompanying drawings, so that the technical content is clearer and easier to understand. The present application can be embodied in many different forms, and its scope of protection is not limited to the embodiments mentioned herein.

The concept, specific structure, and technical effects of the present invention are further described below so that its purpose, features, and effects can be fully understood; the protection of the present invention is not limited to this description.

In one embodiment of the present invention, the data of the samples to be tested are first prepared as follows:

Step 2: collecting, for each analyzed sample, a metabolite small-molecule fingerprint spectrum in the range 100-1000 in positive-ion mode using a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure; five independent experiments are performed on each sample to eliminate intra-individual bias and improve the reproducibility and stability of the diagnostic results;

Step 3: extracting the absolute intensities from the raw metabolic fingerprint spectra (mass-to-charge ratio between 100 and 1000 m/z), and centering the data extracted from all samples as preprocessing for subsequent machine learning;
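Step 3 can be sketched as follows, assuming NumPy. The text specifies the 100-1000 m/z window, absolute intensities, and centering, but not how peaks are aligned across samples; summing intensities into equal-width m/z bins is an assumption made here so that all samples share one feature axis, and the function names are hypothetical.

```python
import numpy as np

MZ_MIN, MZ_MAX = 100.0, 1000.0  # window stated in the text

def extract_intensities(mz, intensity, n_bins=900):
    """Sum absolute intensities into equal-width m/z bins inside the window (no smoothing)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    inside = (mz >= MZ_MIN) & (mz <= MZ_MAX)
    binned, _ = np.histogram(mz[inside], bins=n_bins, range=(MZ_MIN, MZ_MAX),
                             weights=intensity[inside])
    return binned

def center(matrix):
    """Subtract each feature's mean over all samples."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix - matrix.mean(axis=0, keepdims=True)

# Two toy spectra as (m/z list, intensity list); the peak at m/z 50 lies outside the window.
spectra = [
    ([50.0, 100.5, 999.5], [9.0, 2.0, 4.0]),
    ([100.2, 500.5], [6.0, 8.0]),
]
matrix = np.stack([extract_intensities(mz, inten) for mz, inten in spectra])
centered = center(matrix)
```

Centering uses statistics from all samples, so in practice the means would be computed on the training data and reused for the blind test set to avoid leaking information.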

Untargeted detection is performed on the preprocessed metabolic fingerprint spectra of the samples to obtain the relevant metabolite database, a mapping between the grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.

In this embodiment, the neural network used to process the data extracted from the samples is structured as follows:

The input of the network is a 1024-dimensional multimodal feature vector (x_input, dimensions 1-1024) that contains the raw mass spectrometry data input (x_spectral); the remaining positions are filled with 0. Because the number of samples is limited, all features are simply scaled and centered to the range (-1, 1). The body of the network has two main parts: the feature extraction part (feature_extract), which directly follows the input, and the nonlinear feature interaction layer (feature_interaction), which follows the feature extraction part. Finally, the 96 recombined features are fed to the Softmax classification layer, which outputs the classification probabilities.

The formulas from the 1024-dimensional input to the Softmax layer are:

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4)

The feature extraction part (feature_extract) is a stack of four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 intervals and applies fully connected feature extraction to each interval separately (32 fully connected layers, each with its own parameters). This captures the positional correlation of features in mass spectrometry data while remaining compatible with the final 32 external multimodal features. Compared with a four-layer fully connected architecture, it reduces network width and parameter count while modeling the feature extraction of mass-spectrometry-type data in finer detail, which also indirectly reduces overfitting.

The formula for the stack of four LocallyConnected1D layers is:

The nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 latent features produced by the feature extraction part. Each layer of the feature interaction part extracts discretized ReLU-activated features and also an approximately quadratic relationship of linear combinations of features; these are merged, by residual connection or concatenation, into fused features, which captures nonlinear features more effectively. Dropout regularization further mitigates overfitting and improves generalization. The nonlinear feature interaction layer can be regarded as a new self-attention mechanism suited to fully connected layers, with the ability to fuse discrete and quadratic features. Compared with a multi-layer fully connected architecture, it increases nonlinearity while reducing network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.

The formula of the nonlinear feature interaction layer is:

The neural network described above is trained on the sample data: 3/4 of the training set data is randomly assigned to a training group and 1/4 to a test group. The training group samples are used for 10-fold cross-validation training of the multi-layer neural network, and classification is realized by computing the mean accuracy of the final model;

The trained network is applied to the blind test set samples; analysis of the prediction accuracy verifies that the grouping model based on the multi-layer neural network achieves accurate classification;

The contributions to the classification model are then analyzed further, including the following steps:

Step 11: converting the mass spectrum data into two-dimensional images to build a metabolite screening image library;

Preferred specific embodiments of the present application have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and changes based on the concept of the present application without creative effort. Therefore, any technical solution that a person skilled in the art can obtain from the prior art through logical analysis, reasoning, or limited experimentation according to the concept of the present application shall fall within the scope of protection defined by the claims.

Claims

1. A method for analyzing mass spectrometry data based on artificial intelligence, characterized in that the mass spectrometry data is analyzed and processed using a multi-layer neural network to group samples; the method comprises the following steps:

step 1: pipetting the sample onto a mass spectrometry target plate and drying it to form a thin layer for subsequent mass spectrometry analysis;

step 2: collecting, for each sample, a metabolite small-molecule fingerprint spectrum in the range 100-1000 in positive-ion mode using a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;

step 3: extracting the absolute intensities of the fingerprint spectrum, and centering the extracted data as preprocessing;

step 4: inputting the data processed in step 3 into the multi-layer neural network for sample grouping.

2. The artificial intelligence based method of analyzing mass spectrometry data of claim 1, wherein in step 2, at least 2 independent experiments are performed on each of the samples.

3. The artificial intelligence based method of analyzing mass spectrometry data of claim 1, wherein the multi-layer neural network comprises: a network input, a network body, and a network output; the network body comprises a feature extraction part, a nonlinear feature interaction layer, and a classification layer; the network input is first processed by the feature extraction part, the output of the feature extraction part is processed by the nonlinear feature interaction layer, the output of the nonlinear feature interaction layer is processed by the classification layer, and the output of the classification layer is the network output;

the formulas from the network input through the classification layer are:

x_input = concatenate(x_spectral, x_ext) (1)

x_fs = feature_extract(x_input) (2)

x_nl = feature_interaction(x_fs) (3)

y_pred = softmax(x_nl) (4).

4. The method for artificial intelligence based analysis of mass spectrometry data of claim 3, wherein the network input is a 1-1024 dimensional multimodal feature that includes the raw mass spectrometry data input, and the remaining positions are filled with 0.

5. The method for analyzing mass spectrometry data based on artificial intelligence as claimed in claim 3, wherein the feature extraction part is formed by stacking four locally connected layers, and each of the locally connected layers divides all features into 32 intervals, each of which undergoes fully connected feature extraction.

6. The method for artificial intelligence based analysis of mass spectrometry data of claim 3, wherein the nonlinear feature interaction layer learns the nonlinear relationships of the 96 latent features obtained by the feature extraction part.

7. The artificial intelligence based method of analyzing mass spectrometry data of claim 6, wherein the formula of the nonlinear feature interaction layer is:

8. The method for analyzing mass spectrometry data based on artificial intelligence of claim 1 or 2, wherein untargeted detection is performed on the fingerprint spectrum, a related metabolite database is obtained, a mapping between grouping information and the metabolic spectrum is constructed, and the samples are divided into training set samples and blind test set samples.

9. The method for artificial intelligence based analysis of mass spectrometry data of claim 8, wherein 3/4 of the data of the training set samples is used as a training group and 1/4 is used as a test group; 10-fold cross-validation training is performed on the training group samples based on the multi-layer neural network, and classification is realized by computing the mean accuracy of the final model.

10. A method for calculating the importance of contributions to sample discrimination, comprising the steps of:

step 11: converting the fingerprint spectrum data as set forth in claim 8 into two-dimensional images to construct a metabolite screening image library;

step 12: applying a saliency feature analysis method to the data in the metabolite screening image library, ranking all features, and screening out the substances that contribute most to sample discrimination.