
CN115600667A - A method and device for modeling multi-dimensional state change time series data of a system - Google Patents

A method and device for modeling multi-dimensional state change time series data of a system

Info

Publication number
CN115600667A
CN115600667A (Application CN202211155310.1A)
Authority
CN
China
Prior art keywords
training
data
series
neural network
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211155310.1A
Other languages
Chinese (zh)
Inventor
陈新
杨玉萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211155310.1A priority Critical patent/CN115600667A/en
Publication of CN115600667A publication Critical patent/CN115600667A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for modeling multi-dimensional state-change time series data of a system. Through the design of a unit-time forward change generator (G unit), the model learns the change between two time states of the system, converting the traditional approach of learning a time-series trajectory as a whole into learning the changes between specific time points of the trajectory; this solves the problem that time-series data with incomplete points cannot be learned. Because the model learns the rule of system state change per unit time, it is essentially a model of the system's working mechanism; a new working mechanism can be superimposed on the model to predict what changes in system behavior a modification of the existing working mechanism would produce.

Description

A method and device for modeling multi-dimensional state-change time series data of a system

Technical Field

The invention belongs to the field of computing methods and relates to a method and a device for modeling time series data of multi-dimensional state changes of a system.

Background Art

Establishing a method for building a neural network model of multi-dimensional time series data, so as to learn the mechanism that drives the state change of a system per unit time and to use the neural network model to simulate the time-series behavior produced by the system's internal mechanism, is an important goal of time-series analysis in the field of artificial intelligence.

For example, modeling and prediction of time-series data from biological fermentation processes is an important technology for efficient bio-manufacturing. The time-series data of a fermentation process are the time-stamped sequence data collected at various time points during microbial fermentation; each time point comprises a fixed set of indicators, commonly including metabolome data, fermentation process data and transcriptome data. The purpose of modeling and predicting fermentation time-series data is to explore the laws of fermentation through simulation and to optimize the fermentation process, so as to obtain better economic and social benefits.

At present, the main method for analyzing fermentation processes is the metabolic flux model, which is not a method for analyzing the time-series trajectory of a fermentation process but a method for analyzing the possible equilibrium states of the process. Because this approach relies on systems of partial differential equations to describe the chemical reactions, it has certain limitations: in particular, it cannot perform joint cross-category analysis of metabolome data, fermentation process data and transcriptome data, and its ability to guide fermentation process optimization is limited.

On the other hand, regression algorithms based on mathematical statistics and machine learning theory are used in many fields to model various kinds of time-series data. Statistical methods first design a reasonable functional framework based on the modeler's understanding of how the state of the modeled system changes, and then learn the parameters of that functional to complete the curve fitting of the multi-dimensional time series; commonly used methods of this kind include the periodic factor method, the moving average method and the ARIMA model. Because the functional framework designed by the modeler differs from the mechanism that actually drives the state changes of the modeled system, models built in this way have a system-level upper bound on accuracy that cannot be exceeded by adding more experimental observations. Another strategy is to use machine learning methods, based on general-purpose frameworks such as neural networks, to fit the curves of the multi-dimensional time series. By designing functional frameworks that can fit all kinds of change laws well, machine learning methods remove the requirement of statistical methods that the fitting functional be designed accurately in advance; commonly used methods of this kind include the K-nearest-neighbor algorithm, the SVM algorithm, the LSTM model and the Seq2seq model. Because the parameters of a general-purpose functional are harder to learn than those of a specially designed one, such methods usually require a large number of training examples to achieve good fitting results.

Whether mathematical statistics or machine learning methods are used, the state-change curve of one batch of the modeled system is usually treated as one learning example, and the state indicators of all learning examples are required to be complete. In practice, many tracking observations of the state of the modeled system cannot avoid missing indicators and missing time points because of problems in sample measurement. For example, in time-series modeling of a biological fermentation process, fewer time points are usually sampled in periods of little change and more in periods of large change; if the sampling frequency of the rapidly changing periods is taken as the reference, data for some time points in the slowly changing periods are effectively missing. Moreover, the system state at a given time point of the fermentation process is often characterized by omics technologies such as metabolomics and transcriptomics, which can measure many indicators at once; for technical reasons, however, it cannot be guaranteed that every indicator of every measurement reaches the accuracy required for subsequent modeling, so the time-point state data of the modeled system contain missing indicators. Such incomplete time-series data cannot be learned directly.

When time-point state data of the modeled system are missing, completing them by interpolation makes the training data inaccurate and reduces model precision. If the whole time series of a batch is used as a single learning example, the data size of each example grows, the model becomes more complex, the testing cost of obtaining examples rises and the number of examples shrinks, all of which make model learning difficult.

Furthermore, the time-series modeling methods described above are curve-fitting methods rather than models of the mechanism that generates the time series, so they cannot provide clues about how that generating mechanism should be optimized. For example, the time-series data of a biological fermentation process are driven by the biological mechanisms of the engineered strain; a fermentation process model obtained by curve-fitting time-series modeling does not correspond directly to those biological mechanisms and is therefore of little help in guiding modification of the strain's genome.

The invention discloses that, in a biological fermentation system, a neural network model can be used to model the working mechanism of the fermentation system. The neural network model of the working mechanism can compute the system state one unit of time later from the current system state. Compared with a time-series numerical fitting model in the general sense, this model of the mechanism driving the time-series changes of the biological fermentation system better reflects the influence of the various regulatory mechanisms of the engineered strain on the fermentation process, and better guides genome modification of the strain to obtain improved production performance.

Therefore, current multi-dimensional time-series modeling methods have the following main deficiencies:

1) Time-series data with incomplete time points cannot be learned;

2) Using one batch of the time series as one learning case makes the data of a single case large and the model complex, while the number of cases available for learning is small, which makes model learning difficult;

3) Most time-series modeling methods do not correspond directly to the mechanism that generates the time series and cannot provide clues about how that mechanism should be optimized.

There is currently no general method that can handle multi-dimensional time-series data with different kinds of missing data and learn and simulate the mechanism that drives the state change of the system per unit time. The invention discloses a method for modeling the generating mechanism of a time series with a neural network model; it can use time-series data with incomplete time points, without interpolation and without taking the whole batch of the time series as one example, thereby remedying the deficiencies of the above multi-dimensional time-series modeling methods.

Summary of the Invention

The invention discloses a method and a device for modeling multi-dimensional state-change time series data of a system. Specifically, the invention is realized through the following technical solutions:

A method for modeling multi-dimensional state-change time series data of a system, comprising:

1) Normalizing the original observation data to obtain normalized observation data in a uniform form;

2) Organizing training examples for artificial neural network training on the basis of the normalized observation data;

3) Designing the structure of the artificial neural network and establishing the artificial neural network model;

4) Training the artificial neural network model established in step 3) with the training examples of step 2) to obtain the parameter matrices of the artificial neural network;

5) Using the training examples of step 2) to evaluate how the parameters used in designing the artificial neural network structure and the parameters used in training the established model affect the accuracy of the resulting model, and selecting the optimal artificial neural network model under the different parameter combinations as the final result model;

The neural network structure designed in step 3) has the following characteristics:

its basic structure is a neural network unit called the unit-time forward change generator (G unit), whose input layer and output layer have the same dimension; this neural network unit models the change of the multi-dimensional observation data after one unit of time;

serial training structures are obtained by connecting G units in series; a serial training structure models the law of change of the multi-dimensional observation data over several units of time.

As a further improvement, step 1) of the invention is specifically: multi-dimensional time-series observation data of multiple batches are obtained through multiple batches of observation; the observation data of each batch comprise a set of time points, the observation data of each time point comprise a set of indicators, and the observation value of each indicator is a specific value; the multi-dimensional time-series observation data are organized into quadruples, i.e. batch, time, indicator and value.

As a further improvement, step 2) of the invention is specifically: the state change of the modeled system between any two time points within the same batch is organized as one training example, giving a set of training examples with different time intervals, where each training example comprises the states of the modeled system at two time points; the data of the earlier time point are called the start-time-point data of the example, and the data of the later time point are called the end-time-point data of the example; the start-time-point data and the end-time-point data of each training example are expressed in the quadruple form of the observation data.

As a further improvement, the forward change generator (G unit) of the invention adopts a fully connected structure.

As a further improvement, the serial training structure of the invention is established as follows: for training examples of different time intervals, a serial training structure composed of G units is established for each time interval according to its number of time intervals, denoted n unit times; the serial training structure connects the n G units end to end in series.

As a further improvement, the serial training structure whose number of time intervals is n is called (G)^n; when n = 1, (G)^1 contains only one G unit, and the training data corresponding to the serial training structure (G)^1 are the training examples whose number of time intervals is 1; when n > 1, (G)^n is the serial training structure obtained by connecting n G units in series, and the training data corresponding to (G)^n are the training examples whose number of time intervals is n.

As a further improvement, the training method of step 4) of the invention is specifically:

4.1 Compute the loss value of each serial training structure: for each serial training structure (G)^n, denote one of its corresponding training examples by S_i, its start time point by T_i and its end time point by T_{i+n}; denote the state data of the modeled system at time point T_i by Y_{T_i} and at time point T_{i+n} by Y_{T_{i+n}}; input the data Y_{T_i} at time point T_i into the serial training structure (G)^n to obtain the network output data Ŷ_{T_{i+n}}; from the network output Ŷ_{T_{i+n}} and the real data Y_{T_{i+n}} at time point T_{i+n}, the model loss value Loss can be calculated;

4.2 On the basis of the model loss value, gradients are computed with a limited number of loss back-propagation layers: for the serial training structure whose number of time intervals is 1, which contains only one G unit, the network uses the error back-propagation training mechanism and the update gradient of the G unit is obtained directly by error back-propagation; for a serial training structure whose number of time intervals is greater than 1, which consists of several G units connected in series, the gradients are computed by error back-propagation and the gradient on the last G unit is intercepted as the update gradient of the G unit of that serial training structure;

4.3 On the basis of this gradient calculation with a limited number of loss back-propagation layers, the weights of the several serial training structures are updated in sequence, and the network weight parameters are shared during the sequential weight updates. Specifically: for the serial training structures of different time intervals, the gradients are computed in a certain order and the weights are updated with the gradient descent method; the weight parameters updated in one serial training structure are immediately shared with all the other serial training structures;

4.4 For the K serial training structures (G)^1, (G)^2, (G)^3, ..., (G)^K, the loss value of each serial training structure is computed and it is judged whether every loss value has converged; if all have converged, the result model is obtained, otherwise steps 4.1, 4.2 and 4.3 are repeated until the loss value of every serial training structure converges and the result model is obtained.

As a further improvement, step 5) of the invention specifically optimizes the modeling process at the level of model hyperparameters; during optimization the hyperparameters used in re-modeling are adjusted, specifically: the structural parameters of the G unit, including the number of nodes per layer of the network and the number of hidden layers; the learning rate of the gradient descent method; the number of training-data examples fed in per training step; and the number of cycles of cyclic training. The work of establishing the neural network structure and training the neural network is completed with different combinations of these hyperparameters to obtain new result models; the fitting accuracy of the result model on the observation data is evaluated for each hyperparameter combination, and the optimal model is selected.

The invention also discloses a device that applies the method for modeling multi-dimensional data time series, comprising:

an acquisition unit, for normalizing the original observation data to obtain normalized observation data in a uniform form;

an organization unit, for organizing training examples for artificial neural network training on the basis of the normalized observation data;

a construction unit, for designing the structure of the artificial neural network and establishing the artificial neural network model;

a training unit, for training the established artificial neural network model with the training examples to obtain the parameter matrices of the artificial neural network;

an optimization unit, for using the training examples to evaluate how the parameters used in designing the artificial neural network structure and the parameters used in training the established model affect the accuracy of the resulting model, and for selecting the optimal artificial neural network model under the different parameter combinations as the final result model;

The designed neural network structure has the following characteristics:

its basic structure is a neural network unit called the unit-time forward change generator (G unit), whose input layer and output layer have the same dimension; this neural network unit models the change of the multi-dimensional observation data after one unit of time;

serial training structures are obtained by connecting G units in series; a serial training structure models the law of change of the multi-dimensional observation data over several units of time.

The beneficial effects of the invention are as follows:

1) Through the design of the unit-time forward change generator (G unit), the model can learn the change between two time states of the system; the traditional approach of learning a time-series trajectory as a whole is converted into learning the changes between specific time points of the trajectory, which solves the problem that time-series data with incomplete points cannot be learned;

2) Because of this change in training method, multiple learning examples can be organized from one batch of time-series data, so the number of examples increases and the information efficiency of the training data improves;

3) The model learns the law of system state change per unit time and is therefore essentially a model of the system's working mechanism; a new working mechanism can be superimposed on the model to predict what changes in system behavior a modification of the existing working mechanism would produce (a sketch of such a simulation follows this list);

4) The structure and training method of the neural network model of the invention can be applied to the analysis of biological fermentation time-series data.
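The following Python sketch, which is not part of the original disclosure, illustrates how a trained G unit might be used in this way: an `intervention` callable (its name and form are assumptions introduced here) stands in for a new working mechanism superimposed on the learned one, and repeated application of the G unit predicts the resulting trajectory.

```python
import torch

def simulate(g_unit, y0, steps, intervention=None):
    """Roll a trained G unit forward `steps` unit times from state y0.

    If `intervention` is given it is applied to the state before every step,
    standing in for a working mechanism superimposed on the learned one
    (an illustrative assumption, not a procedure fixed by the invention).
    """
    g_unit.eval()
    trajectory = [y0]
    y = y0
    with torch.no_grad():
        for _ in range(steps):
            if intervention is not None:
                y = intervention(y)
            y = g_unit(y)              # predicted state one unit time later
            trajectory.append(y)
    return torch.stack(trajectory)

# Example (hypothetical indicator index): clamp indicator 5 to a fixed level at every step.
# trajectory = simulate(g_unit, y0, steps=10,
#                       intervention=lambda y: y.index_fill(-1, torch.tensor([5]), 1.0))
```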

Brief Description of the Drawings

Figure 1 is a schematic diagram of the model training steps;

Figure 2 is an example of how training examples are organized;

Figure 3 is a schematic diagram of a serial training structure composed of several G units;

Figure 4 is a schematic diagram of loss-limited back-propagation in the serial training structure.

Detailed Description

The unit-time forward change generator (G unit) designed by the invention can be used to model the overall mechanism of a fermentation biological system, including the mechanisms of cell physiology, catalytic reactions, and the regulation exerted by the corresponding process and nutritional conditions. By analyzing this mechanism model, possible schemes for mechanism optimization can be obtained to guide optimization of the fermentation process, including prediction of the yield of the target product, genome modification of the engineered strain and design of fermentation conditions. Figure 1 is a schematic diagram of the model training steps; the method disclosed by the invention models the overall mechanism of the fermentation biological system through the following steps.

1. Organizing the time-series data into the quadruple form

The object of modeling and analysis is the "biological fermentation system", which comprises the strain and the fermentation environment, so the model must be able to learn and predict indicators reflecting the state of the strain and the fermentation environment. The state of the strain and the fermentation environment can be characterized by various omics technologies, and the fermentation process is a time series composed of multiple time points; these fermentation time-series data can therefore be organized into quadruples of batch, time, indicator and value. The fermentation time-series data are divided into training data and test data: the training data are used to build the model and the test data are used to evaluate its accuracy.

Suppose the training data are time-series data measured during the fermentation of an actinomycete with acarbose as the target product, with (M+U) batches (for example M = 7, U = 3), each batch having (K+1) time points (for example K = 10, the time points being denoted T_1, T_2, T_3, ..., T_11).

For the samples at the (K+1) time points of each of the (M+U) batches of the actinomycete fermentation process, the abundances of various compounds in each sample are measured by mass spectrometry, and q compounds related to acarbose synthesis are selected (for example q = 196); the abundances of these q compounds are extracted from the compound abundance report of the sample at each time point as the observed values of the state of the biological fermentation system at that time point, and the abundance of any compound not observed by mass spectrometry is recorded as missing. In this set of samples, the data of time point T_3 of the second batch and of time point T_2 of the third batch are missing because the corresponding measurement experiments failed.

The (M+U) batches of actinomycete fermentation data are divided into training data and test data, with M batches as training data and U batches as test data. All the data therefore comprise (M+U) batches, each batch has (K+1) time points (T_1, T_2, T_3, ..., T_{K+1}), and each time point has q indicator values; the q-dimensional data of the p-th time point in the m-th batch are denoted Y^m_{T_p}.
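The quadruple organization of this step can be illustrated with the following Python sketch; the record field names and the nested lookup table are assumptions introduced here for illustration, not requirements of the invention.

```python
from collections import defaultdict

def to_quadruples(raw_records):
    """Arrange raw observations as (batch, time, indicator, value) quadruples.

    `raw_records` is assumed to be an iterable of dicts such as
    {"batch": 2, "time": 3, "indicator": "compound_17", "value": 0.84};
    missing measurements are simply absent rather than imputed.
    """
    quadruples = []
    state = defaultdict(lambda: defaultdict(dict))   # state[batch][time][indicator] -> value
    for rec in raw_records:
        quadruples.append((rec["batch"], rec["time"], rec["indicator"], rec["value"]))
        state[rec["batch"]][rec["time"]][rec["indicator"]] = rec["value"]
    return quadruples, state
```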

2. Organizing training data examples

Figure 2 is an example of how training examples are organized. The M batches of actinomycete fermentation data are taken as training data. Following the quadruple form of step 1, the system state change between any two time points within the same batch is organized as one example, with T_a and T_b denoting the start and end time points of the example; the training data are organized as:

Batch 1 example set D_1: all pairs (Y^1_{T_a}, Y^1_{T_b}) with T_a earlier than T_b within batch 1;

Batch 2 example set D_2: all pairs (Y^2_{T_a}, Y^2_{T_b}) with T_a earlier than T_b within batch 2;

Batch 3 example set D_3: all pairs (Y^3_{T_a}, Y^3_{T_b}) with T_a earlier than T_b within batch 3;

Batch m example set D_m: all pairs (Y^m_{T_a}, Y^m_{T_b}) with T_a earlier than T_b within batch m, where 4 ≤ m ≤ M-1, m ∈ N*;

Batch M example set D_M: all pairs (Y^M_{T_a}, Y^M_{T_b}) with T_a earlier than T_b within batch M.

The examples of all batches are then pooled and regrouped according to the size of the interval between the two time points T_a and T_b of each example, giving one example set for every time-interval number from 1 to K.
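A minimal Python sketch of this pairing step follows; the policy of skipping pairs with missing indicator values is an assumption made for simplicity (the invention itself only requires that examples be formed from the time points that are actually available).

```python
def build_examples(state, batches, indicators):
    """Group (start-state, end-state) pairs by their time-interval number n = b - a.

    `state[batch][time]` maps indicator -> value (see the previous sketch).
    Time points that were never measured are simply absent, so incomplete
    batches still contribute whatever pairs they can support.
    """
    examples = {}                                    # n -> list of (start_vector, end_vector)
    for m in batches:
        times = sorted(state[m])
        for i, ta in enumerate(times):
            for tb in times[i + 1:]:
                start = [state[m][ta].get(k) for k in indicators]
                end = [state[m][tb].get(k) for k in indicators]
                if None in start or None in end:
                    continue                         # assumed policy for missing indicators
                examples.setdefault(tb - ta, []).append((start, end))
    return examples
```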

3. Building serial training structures of G units for training examples of different time intervals

Figure 3 is a schematic diagram of a serial training structure composed of several G units. First the network of the G unit is built; the G unit consists of an input layer, hidden layers and an output layer, where the number of nodes in each layer is e and the number of hidden layers is f.
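A possible realization of the G unit in PyTorch is sketched below; reading "e nodes per layer" as the width of the f hidden layers while the input and output layers keep the indicator dimension q, and the choice of ReLU activations, are assumptions of this sketch.

```python
import torch.nn as nn

class GUnit(nn.Module):
    """Unit-time forward change generator (G unit): a fully connected network
    whose input and output both have the indicator dimension q."""
    def __init__(self, q, e, f):
        super().__init__()
        layers = [nn.Linear(q, e), nn.ReLU()]          # input layer -> first hidden layer
        for _ in range(f - 1):                         # remaining hidden layers of width e
            layers += [nn.Linear(e, e), nn.ReLU()]
        layers.append(nn.Linear(e, q))                 # output layer, same dimension as input
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        # y: tensor of shape (batch, q), the system state at time T
        return self.net(y)                             # predicted state one unit time later
```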

For the training examples of different time intervals obtained in step 2, the G units are connected in series according to the number of time intervals to obtain the corresponding serial training structures (G)^1, (G)^2, (G)^3, ..., (G)^K, where each G unit corresponds to a set of network weight parameters (for convenience, the network weights and biases are collectively called the weight parameters). The parameters of each serial training structure are specifically:

(G)^1 contains 1 G unit, whose weight parameters are W^1_1;

(G)^2 contains 2 G units, whose weight parameters are W^2_1 and W^2_2;

(G)^3 contains 3 G units, whose weight parameters are W^3_1, W^3_2 and W^3_3;

...

(G)^K contains K G units, whose weight parameters are W^K_1, W^K_2, ..., W^K_K.

The initial weight parameters of all serial training structures are set to the same random value W_0, i.e. every W^n_j equals W_0 before training starts.

4. Computing the model loss value

The multi-dimensional data at the start time points T_a of c training examples are input into the serial training structure of the corresponding time-interval number obtained in step 3 to compute the loss value. That is, the c start-time-point data Y_{T_a} of the example set whose time-interval number is n are input into (G)^n to obtain the output data Ŷ_{T_b}, where n ≤ K, n ∈ N*; from the output Ŷ_{T_b} and the multi-dimensional real data Y_{T_b} at time point T_b, the loss value Loss_n can be calculated.
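Because all serial training structures share the same G-unit weights, (G)^n can be realized simply by applying one shared G unit n times, as in the following sketch; the mean-squared-error loss is an assumption, since the original gives the Loss_n formula only as an image.

```python
import torch.nn.functional as F

def forward_chain(g_unit, y_start, n):
    """Serial training structure (G)^n realised by applying the shared G unit n times."""
    y = y_start
    for _ in range(n):
        y = g_unit(y)
    return y

def loss_n(g_unit, y_start, y_end, n):
    """Loss_n for a mini-batch of interval-n examples; mean squared error between
    the chained prediction and the observed end state is assumed here."""
    return F.mse_loss(forward_chain(g_unit, y_start, n), y_end)
```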

5. Computing gradients by loss-limited back-propagation and updating the network weight parameters by sequential cyclic optimization

Figure 4 is a schematic diagram of loss-limited back-propagation in the serial training structure. From the loss value of each serial training structure obtained in step 4, the gradient of each structure is computed by limited back-propagation and the network weight parameters are updated by sequential cyclic optimization; the specific calculation is as follows:

For the serial training structure whose time-interval number is 1, which contains only one G unit, the network uses the error back-propagation training mechanism and obtains the update gradient of the G unit directly. A serial training structure whose time-interval number is greater than 1 consists of several G units in series; its gradients are computed by error back-propagation, and the gradient on the last G unit is taken as the update gradient of the G unit. Specifically:

(G)^1 computes only the update gradient of its single G unit, i.e. the gradient of Loss_1 with respect to W^1_1;

(G)^2 computes only the update gradient of its last G unit, i.e. the gradient of Loss_2 with respect to W^2_2;

(G)^3 computes only the update gradient of its last G unit, i.e. the gradient of Loss_3 with respect to W^3_3;

...

(G)^K computes only the update gradient of its last G unit, i.e. the gradient of Loss_K with respect to W^K_K.

Because the above gradient calculation only ever computes the gradient of the last G unit of a serial training structure, the invention calls this method loss-limited back-propagation.
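One way to realize loss-limited back-propagation in PyTorch is to run the first n-1 unit-time steps without gradient tracking, so that only the final application of the shared G unit receives gradients; this detach-based construction and the squared-error loss are assumptions of the sketch, chosen because they yield exactly the gradient of the last G unit.

```python
import torch
import torch.nn.functional as F

def limited_backprop_loss(g_unit, y_start, y_end, n):
    """Loss whose gradient reaches only the last G unit of (G)^n.

    The first n-1 unit-time steps are run without gradient tracking, so that
    back-propagating the loss updates the shared weights only through the final
    step -- one way of realising the loss-limited back-propagation described above.
    """
    y = y_start
    with torch.no_grad():
        for _ in range(n - 1):
            y = g_unit(y)
    y_pred = g_unit(y)                 # only this application contributes gradients
    return F.mse_loss(y_pred, y_end)   # squared-error loss is an assumption
```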

On the basis of the gradients computed as above, the serial training structures of different time intervals are processed in order of increasing time-interval number: the gradient of the structure with the smaller time interval is computed first, its weights are updated with the SGD stochastic gradient descent method, and the updated weights are shared with all other serial training modules; the invention calls this training method sequential cyclic optimization. The number of cycles of cyclic training is set to H, and the t-th training cycle proceeds as follows:

First compute the updated weights of the G unit of (G)^1, i.e. W^{1,t}_1 = W^{1,t-1}_1 - α · ∂Loss_1/∂W^1_1, where α is the learning rate of the SGD gradient descent method, W^{1,t-1}_1 denotes the weight parameters of the G unit of (G)^1 at the end of the (t-1)-th training cycle, and W^{1,t}_1 denotes the weight parameters of the G unit of (G)^1 after the update in the t-th cycle; the updated weight parameters of this G unit are then shared with the other serial training structures, i.e. the weight parameters of all G units in (G)^2, ..., (G)^K are set equal to W^{1,t}_1.

Then compute the updated weights of the last G unit of (G)^2 in the same way, i.e. W^{2,t}_2 = W^{2,t-1}_2 - α · ∂Loss_2/∂W^2_2, and share the updated weight parameters of this G unit with all the other serial training structures.

(G)^3, (G)^4, ..., (G)^K are updated in the same way. When, in the current cycle, the weights of the last G unit of (G)^K have been updated and shared with the other serial training structures, the cycle ends and the next training cycle begins; in total H training cycles are performed.
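The sequential cyclic optimization can be sketched as follows; it reuses the `limited_backprop_loss` sketch above, `sample_minibatch` is a hypothetical helper, and plain SGD follows the description. Because a single shared G unit plays the role of every (G)^n, an update made while training one structure is automatically shared with all the others, as the description requires.

```python
import torch

def train(g_unit, examples, alpha=1e-3, H=100, c=32):
    """Sequential cyclic optimisation over (G)^1 ... (G)^K.

    `examples[n]` is assumed to hold (Y_start, Y_end) tensors for the training
    examples of interval n, and `sample_minibatch` is a hypothetical helper that
    draws c of them.
    """
    optimiser = torch.optim.SGD(g_unit.parameters(), lr=alpha)
    K = max(examples)
    for _ in range(H):                              # H training cycles
        for n in range(1, K + 1):                   # smaller time intervals first
            if n not in examples:
                continue
            y_start, y_end = sample_minibatch(examples[n], c)   # hypothetical helper
            optimiser.zero_grad()
            loss = limited_backprop_loss(g_unit, y_start, y_end, n)
            loss.backward()
            optimiser.step()                        # updated weights are shared by construction
```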

6. Obtaining the result model after the losses of all training structures have converged

After the cyclic training of step 5 is finished, all the training examples of the corresponding time intervals are fed into the K serial training structures (G)^1, (G)^2, (G)^3, ..., (G)^K in the manner of step 4, giving the full-data loss value of each serial training module. Specifically:

compute the loss value of (G)^1, i.e. Loss_1;

compute the loss value of (G)^2, i.e. Loss_2;

compute the loss value of (G)^3, i.e. Loss_3;

...

compute the loss value of (G)^K, i.e. Loss_K.

Judge whether Loss_1, Loss_2, Loss_3, ..., Loss_K have all converged; if so, proceed to the next step, otherwise continue with steps 4 and 5 until Loss_1, Loss_2, Loss_3, ..., Loss_K have all converged.

7. Adjusting the model hyperparameters and optimizing the result model

After the training of step 6 yields a result model, the hyperparameters of the model are adjusted: the network structure parameters of the G unit (the number of nodes per layer e and the number of hidden layers f), the learning rate α of the SGD stochastic gradient descent method, the number of training examples c fed in per training cycle, and the number of cycles H of the cyclic training. After adjusting combinations of these hyperparameters, steps 4-6 are repeated until the network loss difference converges, and training is terminated.
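Step 7 amounts to a search over hyperparameter combinations; a simple grid-search sketch is given below, in which `build_and_train` and `evaluate` are assumed wrappers around steps 3-6 and the accuracy evaluation, and the example grid values are illustrative only.

```python
import itertools

def hyperparameter_search(build_and_train, evaluate, grid):
    """Grid search over the hyperparameters listed in step 7.

    `grid` maps each hyperparameter name to the values to try; the model with
    the best (lowest) evaluation score is kept.
    """
    best_model, best_score = None, float("inf")
    names = sorted(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        model = build_and_train(**params)            # e.g. e, f, alpha, c, H
        score = evaluate(model)
        if score < best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Illustrative grid (values are assumptions, not those of the invention):
# grid = {"e": [64, 128], "f": [2, 3], "alpha": [1e-3, 1e-4], "c": [16, 32], "H": [200]}
```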

8. Obtaining the optimal model

After the hyperparameter optimization of step 7, the optimal G unit is obtained, and the two groups of parameters, the G-unit network structure and the network weights, are saved. The prediction accuracy of the model is then verified with the test data.

The U batches of actinomycete fermentation data described in the first step of the application case are taken as test data; each batch has (K+1) time points (T_1, T_2, T_3, ..., T_{K+1}) and each time point has q indicators. The test data are organized in the manner of steps 1-2 as:

example sets grouped by time-interval number, where the set whose time-interval number is l contains N_l examples, with l ≤ K and l ∈ N*.

Compute the prediction accuracy of (G)^1 on the N_1 test examples whose time-interval number is 1;

compute the prediction accuracy of (G)^2 on the N_2 test examples whose time-interval number is 2;

compute the prediction accuracy of (G)^3 on the N_3 test examples whose time-interval number is 3;

...

compute the prediction accuracy of (G)^K on the N_K test examples whose time-interval number is K.

In this way, the mechanism model of the biological fermentation system (the G unit) is obtained, together with the expected accuracy with which it predicts system state changes in the biological fermentation system over different time intervals.
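The per-interval accuracy evaluation of this step can be sketched as follows; mean squared error stands in for the accuracy measure, which the original gives only as images, and the layout of `test_examples` is an assumption.

```python
import torch
import torch.nn.functional as F

def evaluate_by_interval(g_unit, test_examples):
    """Prediction error of the trained G unit for each time-interval number l.

    `test_examples[l]` is assumed to hold (Y_start, Y_end) tensors for the N_l
    test examples of interval l.
    """
    g_unit.eval()
    errors = {}
    with torch.no_grad():
        for l, (y_start, y_end) in test_examples.items():
            y_pred = y_start
            for _ in range(l):                       # chain the G unit l times
                y_pred = g_unit(y_pred)
            errors[l] = F.mse_loss(y_pred, y_end).item()
    return errors
```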

The invention also discloses a device that applies the method for modeling multi-dimensional data time series, comprising:

an acquisition unit, for normalizing the original observation data to obtain normalized observation data in a uniform form;

an organization unit, for organizing training examples for artificial neural network training on the basis of the normalized observation data;

a construction unit, for designing the structure of the artificial neural network and establishing the artificial neural network model;

a training unit, for training the established artificial neural network model with the training examples to obtain the parameter matrices of the artificial neural network;

an optimization unit, for using the training examples to evaluate how the parameters used in designing the artificial neural network structure and the parameters used in training the established model affect the accuracy of the resulting model, and for selecting the optimal artificial neural network model under the different parameter combinations as the final result model;

The designed neural network structure has the following characteristics:

its basic structure is a neural network unit called the unit-time forward change generator (G unit), whose input layer and output layer have the same dimension; this neural network unit models the change of the multi-dimensional observation data after one unit of time;

serial training structures are obtained by connecting G units in series; a serial training structure models the law of change of the multi-dimensional observation data over several units of time.

The above does not limit the specific embodiments of this patent. It should be pointed out that those of ordinary skill in the art can make various changes, modifications, additions or substitutions without departing from the essential scope of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims (9)

1. A method for modeling multi-dimensional state-change time series data of a system, characterized in that
the method comprises the following steps:
1) Carrying out standardized arrangement on the original observation data to obtain standardized observation data with a uniform form;
2) Organizing training examples for artificial neural network training based on normative observation data;
3) Designing the structure of an artificial neural network, and establishing an artificial neural network model;
4) Training the artificial neural network model established in the step 3) by using the training example in the step 2) to obtain a parameter matrix of the artificial neural network;
5) Evaluating the parameters used in the process of designing the artificial neural network structure and the influence of the parameters used in the process of training the established artificial neural network model on the precision of the obtained artificial neural network model by using the training example in the step 2), and selecting the optimal artificial neural network model under different parameter combinations as a final result model;
the neural network structure designed in step 3) has the following characteristics:
the basic structure of the device is a neural network unit, namely a unit time forward change generator (G unit), and an input layer and an output layer of the device have the same dimension; the neural network unit is used for modeling the change of multidimensional observation data after a unit time;
and obtaining a series training structure by adopting a mode of serially connecting the G units, wherein the series training structure is used for modeling the change rule of the multi-dimensional observation data in a plurality of unit times.
2. The method for modeling the multidimensional state change time series data of the system according to claim 1, wherein the step 1) is specifically as follows: the multi-dimensional time series observation data are arranged into a four-tuple organization form, namely, batch, time, indexes and values, the multi-dimensional time series observation data of multiple batches are obtained through observation of the multiple batches, the observation data of each batch comprise a group of time points, the observation data of each time point comprise a group of indexes, and the observation data of each index is a specific value.
3. The method for modeling system multidimensional state change time series data according to claim 1, wherein the step 2) is specifically: organizing system state changes of any two time points in the same batch into a training example to obtain a group of training examples with different time intervals, wherein each training example comprises states of the two time points, and data of an earlier time point is called as starting time point data of the training example; the data of later time points are called as the data of the ending time points of the training example; the start time point data and the end time point data for each training example are represented as an organization of observation data quadruples.
4. The method for modeling system multi-dimensional state change time series data as recited in claim 1, wherein said forward change generator (G unit) employs a fully connected structure.
5. The method for modeling system multi-dimensional state change time series data according to claim 4,
the series training structure is established as follows: for training examples of different time intervals, recording n unit times for each time interval according to the number of the time intervals, and establishing a series training structure consisting of G units, wherein the series training structure connects the n G units in series end to end.
6. The method of claim 5, wherein the serial training structure whose number of time intervals is n is called (G)^n; when n = 1, (G)^1 contains only 1 G unit, and the training data corresponding to the serial training structure (G)^1 are the training examples whose number of time intervals is 1; when n > 1, (G)^n is a serial training structure obtained by connecting n G units in series, and the training data corresponding to the serial training structure (G)^n are the training examples whose number of time intervals is n.
7. The method for modeling the multidimensional state change time series data of the system according to any one of claims 1 to 6, wherein the training mode of the step 4) is specifically as follows:
4.1) Calculate the loss value of each serial training structure: for each serial training structure (G)^n, denote one of its corresponding training examples by S_i, its start time point by T_i and its end time point by T_{i+n}; denote the state data of the modeled system at time point T_i by Y_{T_i} and at time point T_{i+n} by Y_{T_{i+n}}; input the data Y_{T_i} at time point T_i into the serial training structure (G)^n to obtain the network output data Ŷ_{T_{i+n}}; from the network output Ŷ_{T_{i+n}} and the real data Y_{T_{i+n}} at time point T_{i+n}, the loss value Loss of the model can be calculated;
4.2) On the basis of the model loss value, gradient calculation with a limited number of loss back-propagation layers is performed: for the serial training structure whose number of time intervals is 1, which contains only 1 G unit, the network uses the error back-propagation training mechanism and obtains the update gradient of the G unit directly by error back-propagation; for a serial training structure whose number of time intervals is greater than 1, which is formed by connecting a plurality of G units in series, the gradients are calculated by error back-propagation, and the gradient on the last G unit is intercepted as the update gradient of the G unit of that serial training structure;
4.3) Sequential weight update of the serial training structures: on the basis of the training mechanism limiting the number of loss back-propagation layers, the weights of the plurality of serial training structures are updated in sequence, and the network weight parameters are shared in the course of the sequential weight updates; specifically, the gradients of the serial training structures of different time intervals are calculated in a certain order, the weights are updated by the gradient descent method, and the updated weights are shared with all the other serial training structures;
4.4) For the K serial training structures (G)^1, (G)^2, (G)^3, ..., (G)^K, calculate the loss value of each serial training structure and judge whether the loss value of each structure has converged; if all have converged, the result model is obtained, otherwise 4.1), 4.2) and 4.3) are repeated until the loss value of every serial training structure converges and the result model is obtained.
8. The method according to claim 7, wherein step 5) specifically optimizes the modeling process at the level of model hyper-parameters, the hyper-parameters adjusted in the modeling process being: the structural parameters of the G unit (the number of nodes of each layer of the network and the number of hidden layers); the learning rate of the gradient descent method; the number of training-data examples fed in per training; and the number of cycles of the cyclic training; different combinations of the hyper-parameters are adopted, and the work of establishing the neural network structure and training the neural network is completed according to step 3) and step 4) to obtain new result models; the fitting precision of the result model to the observation data under each hyper-parameter combination is evaluated, and an optimal model is selected.
9. An apparatus for modeling multi-dimensional state change time series data of a system, comprising:
an acquisition unit, configured to carry out standardized arrangement of the original observation data to obtain standardized observation data in a uniform form;
an organization unit, configured to organize the standardized observation data into training examples for artificial neural network training;
a construction unit, configured to design the structure of the artificial neural network and establish an artificial neural network model;
a training unit, configured to train the established artificial neural network model with the training examples to obtain the parameter matrices of the artificial neural network;
an optimization unit, configured to evaluate the influence, on the accuracy of the obtained artificial neural network model, of the parameters used in designing the artificial neural network structure and in training the established artificial neural network model with the training examples, and to select the optimal artificial neural network model among the different parameter combinations as the final result model;
wherein the designed neural network structure has the following characteristics:
its basic structure is a neural network unit, namely a unit-time forward-change generator (G unit), whose input layer and output layer have the same dimension; the neural network unit is used for modeling the change of the multidimensional observation data after one unit time;
and a series training structure is obtained by connecting G units in series, the series training structure being used for modeling the change law of the multidimensional observation data over a plurality of unit times.
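As an illustration of this G-unit/series-structure design (again a sketch; the class and variable names are assumptions, and GUnit refers to the sketch given after claim 7), the same unit-time generator can be wrapped into K series structures that all share its weights, each predicting the state a different number of unit times ahead:

import torch
import torch.nn as nn

class SeriesStructure(nn.Module):
    # (G)_n: the shared unit-time generator applied n times, modeling n unit-time steps.
    def __init__(self, g_unit, n):
        super().__init__()
        self.g, self.n = g_unit, n

    def forward(self, x):
        for _ in range(self.n):
            x = self.g(x)
        return x

K = 4                                      # number of series training structures (example value)
g = GUnit(dim=8)                           # shared G unit (class defined in the sketch under claim 7)
structures = [SeriesStructure(g, n) for n in range(1, K + 1)]   # (G)_1 ... (G)_K share one weight set

x0 = torch.randn(1, 8)                     # current multidimensional state
preds = [s(x0) for s in structures]        # predicted states 1, 2, ..., K unit times ahead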
CN202211155310.1A 2022-09-22 2022-09-22 A method and device for modeling multi-dimensional state change time series data of a system Pending CN115600667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211155310.1A CN115600667A (en) 2022-09-22 2022-09-22 A method and device for modeling multi-dimensional state change time series data of a system

Publications (1)

Publication Number Publication Date
CN115600667A true CN115600667A (en) 2023-01-13

Family

ID=84845593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211155310.1A Pending CN115600667A (en) 2022-09-22 2022-09-22 A method and device for modeling multi-dimensional state change time series data of a system

Country Status (1)

Country Link
CN (1) CN115600667A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101140458A (en) * 2006-09-06 2008-03-12 Fisher-Rosemount Systems, Inc. Process plant monitoring based on multivariate statistical analysis and on-line process simulation
WO2021007812A1 (en) * 2019-07-17 2021-01-21 深圳大学 Deep neural network hyperparameter optimization method, electronic device and storage medium
CN114418071A (en) * 2022-01-24 2022-04-29 中国光大银行股份有限公司 Cyclic neural network training method

Similar Documents

Publication Publication Date Title
CN109407654B (en) A nonlinear causal analysis method for industrial data based on sparse deep neural network
CN110276441B (en) A trapezoidal overlapping nuclear pulse estimation method based on deep learning
CN112836884A (en) An accurate forecasting method for multiple loads in an integrated energy system based on Copula-DBiLSTM
CN113591368A (en) Comprehensive energy system multi-energy load prediction method and system
CN117272040A (en) A small-sample time series forecasting method based on meta-learning framework
CN113095596A (en) Photovoltaic power prediction method based on multi-stage Gate-SA-TCN
CN113836783A (en) Digital regression model modeling method for monitoring reference value of temperature-induced deflection of main girder of cable-stayed bridge
CN105511263A (en) Distributed model predictive control method based on hierarchical decomposition
CN118468092B (en) Radar transceiver fault diagnosis method based on pseudo tag guided transducer learning
CN118377332A (en) Temperature control method, device and equipment of reaction kettle and storage medium
CN103177291A (en) Variable-search-space ribonucleic acid (RNA) genetic algorithm modeling method for continuous stirred tank reactor
CN116029438A (en) Modeling method of water quality parameter prediction model and water quality parameter prediction method and device
CN115860045B (en) Multi-dimensional time sequence data neural network modeling method and device based on time point labels
CN118969096A (en) Cell culture action prediction method, neural network model training method and device
CN115587625A (en) A neural network method and device for modeling multi-dimensional time series data in feature space
CN115600667A (en) A method and device for modeling multi-dimensional state change time series data of a system
CN116306832A (en) A multi-mode generative confrontational neural network modeling method and device for multi-dimensional sequence data
CN102141778B (en) High-order controller parameter optimization method inspired by rRNA (ribosomal Ribonucleic Acid)
CN118468021A (en) A semi-supervised soft sensing method based on improved Tri-training GPR
CN112434856A (en) Steel mill power load prediction method based on equipment combination state conversion
Zou et al. An optimization control of thermal power combustion based on reinforcement learning
CN114861759B (en) A distributed training method for linear dynamic system models
CN110175420A (en) Boiler bed temperature Time Delay of Systems nonlinear model improves particle group parameters discrimination method
CN113988311B (en) Quality variable prediction method, device, terminal and storage medium
CN117034808A (en) Natural gas pipe network pressure estimation method based on graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination