CN117334334B

CN117334334B - A health risk prediction method, device, equipment and medium

Info

Publication number: CN117334334B
Application number: CN202311272180.4A
Authority: CN
Inventors: 冯思玲; 汤乐; 黄梦醒; 王冠军; 冯文龙; 毋媛媛
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-05-03
Anticipated expiration: 2043-09-28
Also published as: CN117334334A

Abstract

The present invention discloses a health risk prediction method, device, equipment and medium. The method obtains current environmental data and inputs it into a prediction model that includes a VARLST hybrid model and an AdaBoost model, where the VARLST model includes a VARMA model and a BI-LSTM model. The prediction results of the VARLST model and of the AdaBoost model are fused by weighting to obtain a final prediction value. Based on the final prediction value under the current environmental data, and on the numbers of diseased and non-diseased people under the risk reference baseline value in a past time period, the relative risk value of the predicted health risk is calculated, and a health risk warning is issued based on that value. When the environmental quality index is abnormal, the invention enables early preventive preparation through rapid prediction and calculation of the relative risk value of health risks.

Description

A health risk prediction method, device, equipment and medium

Technical Field

The present invention relates to the technical field of health risk prediction, and in particular to a health risk prediction method, device, equipment and medium.

Background

In recent years, as urbanization has continued, urban population density has kept rising. Densely populated urban areas are also high-risk areas for the spread of pollutants and for mass infection. When abnormal pollutant values appear near a city, the absence of effective prediction and early warning, combined with the public overlooking the effect of the environment on physical health, can cause irreversible harm to the health of local residents. Smart cities are becoming increasingly widespread, and many existing technologies can integrate a city's service resources to provide more accurate and convenient services. A model that jointly analyzes environmental data and medical data is therefore needed to predict the impact of environmental changes on health.

Algorithms commonly used to predict health risks from multiple time series fall roughly into three categories: statistical models used for environmental time series prediction, such as autoregressive moving average, vector autoregression and ARIMA models; machine learning models such as AdaBoost and random forest; and deep learning models suited to sequence data, such as LSTM. All of these can handle multiple time series prediction to some extent. However, existing techniques for multi-parameter health risk prediction suffer from low generalization ability, poor fitting of nonlinear relationships, and high computational complexity. Complex environmental data is therefore hard to model, and many algorithms are difficult to put into practical use. Deep learning prediction algorithms show clear advantages in time series prediction owing to their network structure, but they place high demands on data: applying a network to multivariate data requires a large number of samples, otherwise prediction accuracy drops sharply. They also need sufficient training data and computing resources, and their prediction results can lag noticeably.

Summary of the Invention

To solve the above technical problems, the present invention proposes a health risk prediction method, device, equipment and medium that enable early preventive preparation, through rapid prediction and calculation of the relative risk value of health risks, when the environmental quality index is abnormal.

To achieve the above object, the technical solution of the present invention is as follows:

A health risk prediction method comprises the following steps:

Obtaining current environmental data;

Inputting the current environmental data into a prediction model that includes a VARLST hybrid model and an AdaBoost model, where the VARLST model includes a VARMA model and a BI-LSTM model. The VARMA model is used for multi-dimensional data prediction; its prediction result is compared with the actual value to obtain a fitting residual, and the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model. The prediction result of the VARMA model and the prediction result of the BI-LSTM model are superimposed to obtain the prediction result of the VARLST model. The prediction result of the VARLST model and the prediction result of the AdaBoost model are then fused by weighting to obtain a final prediction value;

Calculating the relative risk value of the predicted health risk based on the final prediction value under the current environmental data and on the numbers of diseased and non-diseased people under the risk reference baseline value in a past time period, and issuing a health risk warning based on the relative risk value.

Preferably, the training process of the VARLST model is as follows:

Obtain historical environmental data and medical visit statistics and preprocess them, then divide the preprocessed data set into a training set and a test set according to a preset ratio;

Apply differencing to the data set to smooth it, and perform autocorrelation function and partial autocorrelation function analysis on the differenced data. Based on the analysis results, estimate the parameters of the VARMA prediction model and the order of differencing. The VARMA prediction model is:

$$
\begin{bmatrix} y_{1,t} \\ \vdots \\ y_{n,t} \end{bmatrix}
= A \begin{bmatrix} \theta_{1,t} \\ \vdots \\ \theta_{n,t} \end{bmatrix}
+ B_1 \begin{bmatrix} y_{1,t-1} \\ \vdots \\ y_{n,t-1} \end{bmatrix}
+ \cdots
+ B_p \begin{bmatrix} y_{1,t-p} \\ \vdots \\ y_{n,t-p} \end{bmatrix}
+ C \begin{bmatrix} x_{1,t} \\ \vdots \\ x_{m,t} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{bmatrix}
$$

where $y_{n,t}$ is the number of medical visits recorded for department $n$ on day $t$; $y_{n,t-p}$ is the $p$-th order lag of that number; $\theta_{n,t}$ is the time series component structure of the number of medical visits for department $n$ on day $t$; $A$ is the coefficient corresponding to the component structure; $B_p$ is the coefficient corresponding to the $p$-th order lag; $C$ is the coefficient of the exogenous variables; $x_{m,t}$ is the value of environmental index $m$ on day $t$; and $\varepsilon_n$ is the residual term;

Input the training set into the VARLST model for training, where the VARLST model includes a VARMA model and a BI-LSTM model and the VARMA model is used for multi-dimensional data prediction. Compare the prediction result of the VARMA model with the actual value to obtain the fitting residual;

Input the fitting residual into the BI-LSTM model for training until the preset requirement or the maximum number of iterations is reached, and output the trained BI-LSTM model. In each iteration, a backward pass is performed on the BI-LSTM model; from the output of the backward pass, the error of the output layer at each time step and the parameter derivatives of the forward LSTM and backward LSTM are calculated; once the parameter derivatives are obtained, the network parameters of the BI-LSTM model are updated;

Input the test set into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model;

Superimpose the prediction result of the VARMA model and the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST model.

Preferably, the training process of the AdaBoost model is as follows:

Train and optimize the base learner based on the data set and the initial weight distribution;

Calculate the prediction error, proportional error and connection weight of each training sample in the t-th iteration;

Adjust the weight distribution of each training sample for iteration t+1 and continue training until the maximum number of iterations is reached, obtaining K models;

Add the prediction results of the K models together as the prediction result of the AdaBoost model.

Preferably, the historical environmental data includes air quality data, urban drinking water quality data, and total emissions data for waste gas, wastewater and other discharges.

Preferably, obtaining and preprocessing the historical environmental data and medical visit statistics includes the following steps:

Clean the environmental data and medical visit statistics, including handling missing values, outliers and duplicate values;

Perform correlation analysis on the cleaned data, computing the Pearson correlation coefficient between each pair of features to obtain the correlation coefficient matrix C;

Based on the correlation coefficient matrix C, select the environmental indices that are highly correlated with the medical visit statistics as the input variables of the model.

Preferably, MAE, MAPE and RMSE are selected as evaluation metrics to evaluate the performance of the prediction model.

Preferably, the relative risk value RR of the predicted health risk is calculated by the following formula:

$$RR = \frac{IE / (IE + IN)}{CE / (CE + CN)}$$

where IE is the predicted number of people with the disease, IN is the predicted number of people without the disease, CE is the number of people with the disease under the risk reference baseline value, and CN is the number of people without the disease under the risk reference baseline value.

Based on the above, the present invention also discloses a health risk prediction device, comprising a data acquisition module, a hybrid multivariate time series prediction module and a health risk prediction module, wherein:

The data acquisition module is used to obtain current environmental data;

The hybrid multivariate time series prediction module is used to input the current environmental data into a prediction model that includes a VARLST hybrid model and an AdaBoost model, where the VARLST model includes a VARMA model and a BI-LSTM model. The VARMA model is used for multi-dimensional data prediction; its prediction result is compared with the actual value to obtain a fitting residual, and the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model. The prediction result of the VARMA model and the prediction result of the BI-LSTM model are superimposed to obtain the prediction result of the VARLST model. The prediction result of the VARLST model and the prediction result of the AdaBoost model are then fused by weighting to obtain a final prediction value;

The health risk prediction module is used to calculate the relative risk value of the predicted health risk based on the final prediction value under the current environmental data and on the numbers of diseased and non-diseased people under the risk reference baseline value in a past time period, and to issue a health risk warning based on the relative risk value.

Based on the above, the present invention further discloses a computer device, comprising: a memory for storing a computer program; and a processor for implementing any of the methods described above when executing the computer program.

Based on the above, the present invention further discloses a readable storage medium on which a computer program is stored. When the computer program is executed by a processor, any of the methods described above is implemented.

Based on the above technical solution, the beneficial effects of the present invention are as follows. The invention first obtains the individual air quality index data, urban drinking water quality data, total emissions data for waste gas, wastewater and other discharges, and statistics on the number of medical visits, and then performs data cleaning, correlation analysis and feature selection so that the data is more complete and accurate. The selected feature data is used to train the VARLST and AdaBoost models separately. For the VARLST model, ACF and PACF analysis is performed on the preprocessed data, the parameters and differencing order of the VARMA prediction model are estimated, and a multi-dimensional data prediction network is built for analysis and prediction. Its prediction results are compared with the actual values, the resulting fitting residuals are used to train the BI-LSTM network, which outputs predicted residuals, and the two sets of predictions are combined by correction to obtain the prediction result of the VARLST model. For the AdaBoost model, the weight distribution over the preprocessed data is initialized, weak regressors are trained iteratively, and their weights are adjusted and combined to obtain the final prediction result of the AdaBoost model.

The prediction results of the two models are fused by averaging to obtain the final prediction result. The finally predicted numbers of diseased and non-diseased people are compared against the risk reference baseline value to calculate the relative risk value, from which the severity of the health risk is judged, achieving the purpose of predicting health risks in advance. By using the Pearson correlation coefficient to analyze data correlation, the invention addresses the need for large amounts of data in prediction, improves prediction efficiency, and strengthens the link between environmental factors and the number of medical visits. The VARLST model combines the advantages of deep learning and traditional statistical prediction algorithms, giving the prediction model high execution efficiency and generalization ability and thereby improving the accuracy and practicality of the health risk prediction model. Fusing in the AdaBoost model brings the advantages of ensemble learning, automatically selecting important features and improving the prediction accuracy and stability of the model.

Brief Description of the Drawings

FIG. 1 is a diagram of the application environment of a health risk prediction method in one embodiment;

FIG. 2 is a flow chart of a health risk prediction method in one embodiment;

FIG. 3 is a schematic diagram of the architecture of the VARLST-AdaBoost hybrid model in a health risk prediction method in one embodiment;

FIG. 4 is a schematic structural diagram of a health risk prediction device in one embodiment;

FIG. 5 is a diagram of the internal structure of a computer device in one embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments.

The health risk prediction method provided in an embodiment of the present application can be applied in the application environment shown in FIG. 1. As shown in FIG. 1, the application environment includes a computer device 110. The computer device 110 can obtain current environmental data and input it into a prediction model (the VARLST-AdaBoost hybrid model), which includes a VARLST hybrid model and an AdaBoost model, where the VARLST model includes a VARMA model and a BI-LSTM model. The VARMA model is used for multi-dimensional data prediction; its prediction result is compared with the actual value to obtain a fitting residual, and the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model. The prediction result of the VARMA model and the prediction result of the BI-LSTM model are superimposed to obtain the prediction result of the VARLST model, and the prediction results of the VARLST model and the AdaBoost model are fused by weighting to obtain a final prediction value. The computer device 110 can then calculate the relative risk value of the predicted health risk based on the final prediction value under the current environmental data and on the numbers of diseased and non-diseased people under the risk reference baseline value in a past time period, and issue a health risk warning based on the relative risk value. The computer device 110 may be, but is not limited to, a personal computer, laptop computer, robot, tablet computer or similar device.

In one embodiment, as shown in FIGS. 2 and 3, a health risk prediction method is provided, comprising the following steps:

Step 1: Obtain historical data, including air quality data such as AQI, PM2.5, PM10, CO, NO2, SO2 and O3, urban drinking water quality data, total emissions data for waste gas, wastewater and other discharges, and statistics on the number of medical visits.

Step 2: Preprocess all of the historical data obtained. This includes the following steps:

Step 2-1: Clean all of the historical data obtained, including handling missing values, outliers and duplicate values, to ensure the completeness and accuracy of the data;

Step 2-2: Perform correlation analysis on the cleaned data, computing the Pearson correlation coefficient between each pair of features to obtain a correlation coefficient matrix C. A coefficient close to 1 or -1 indicates a strong linear correlation between the two features, and a coefficient close to 0 indicates no linear correlation between them;

Step 2-3: Based on the correlation coefficient matrix C, select the environmental indices that are highly correlated with the medical visit statistics as the input variables of the model. The magnitude and significance of the correlation coefficients determine which environmental indices can serve as input variables for a particular disease. Selecting index dimensions that are highly correlated with the disease and whose correlation coefficients are significant improves the prediction performance of the model more effectively.
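The feature selection in steps 2-2 and 2-3 can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function names `pearson` and `select_features`, the sample values and the 0.6 threshold are assumptions introduced here, and the significance test mentioned in step 2-3 is omitted.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_a * var_b)

def select_features(env, visits, threshold=0.6):
    """Keep the environmental indices whose |r| with the visit counts
    reaches the (hypothetical) threshold."""
    return [name for name, series in env.items()
            if abs(pearson(series, visits)) >= threshold]

# Hypothetical daily readings and visit counts, for illustration only.
env = {"PM2.5": [35, 50, 80, 120, 60], "O3": [90, 40, 85, 45, 88]}
visits = [110, 130, 170, 230, 150]
selected = select_features(env, visits)
```

With these made-up numbers only the PM2.5 series clears the threshold, so it alone would be passed on as a model input.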

Step 3: Take the highly correlated environmental index data together with the medical visit statistics as input data. First determine the autocorrelation and partial autocorrelation of these time series using the autocorrelation function (ACF) and partial autocorrelation function (PACF), then use that analysis to build the VARLST hybrid model and obtain its prediction result. This includes the following steps:

Step 3-1: The multi-parameter time series of environmental indices and medical visit counts are non-stationary. Differencing is applied to smooth the time series and ensure that the conditions for applying the prediction model are met, and the prediction results are later restored by inverting the differencing;

Step 3-2: Analyze the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the differenced data to determine the autocorrelation and partial autocorrelation of the time series and to better understand the characteristics and trends of the data;
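Steps 3-1 and 3-2 can be sketched in plain Python as below. The sketch assumes first-order differencing and the usual sample autocorrelation estimator; the function names and the sample series are hypothetical, and the PACF is left out for brevity.

```python
def difference(series):
    """First-order differencing: d[t] = series[t + 1] - series[t]."""
    return [b - a for a, b in zip(series, series[1:])]

def restore(first_value, diffs):
    """Invert first-order differencing by cumulative summation."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / var
        for k in range(max_lag + 1)
    ]

visits = [100, 104, 111, 121, 134]   # hypothetical daily visit counts with a trend
diffs = difference(visits)            # [4, 7, 10, 13]
restored = restore(visits[0], diffs)  # recovers the original series
lag_correlations = acf(diffs, max_lag=2)
```

Predictions made on the differenced scale would be mapped back to visit counts with `restore`, which is the role the text assigns to inverting the differencing.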

Step 3-3: Based on the autocorrelation and partial autocorrelation functions of the multivariate environmental time series, estimate the parameters and the order of differencing of the vector autoregressive moving average (VARMA) prediction model, ensuring the practicality of the prediction model and the stability of the time series. The VARMA model takes the form of a multi-equation system. It is commonly used to forecast systems of interrelated time series and to analyze the dynamic impact of random disturbances on the system of variables. Assuming that the optimal order is p, that the sample data set covers T days in total, that the number of medical visits for each disease is an endogenous variable, and that the environmental index data is an exogenous variable, the vector autoregressive model is constructed as follows:

$$
\begin{bmatrix} y_{1,t} \\ \vdots \\ y_{n,t} \end{bmatrix}
= A \begin{bmatrix} \theta_{1,t} \\ \vdots \\ \theta_{n,t} \end{bmatrix}
+ B_1 \begin{bmatrix} y_{1,t-1} \\ \vdots \\ y_{n,t-1} \end{bmatrix}
+ \cdots
+ B_p \begin{bmatrix} y_{1,t-p} \\ \vdots \\ y_{n,t-p} \end{bmatrix}
+ C \begin{bmatrix} x_{1,t} \\ \vdots \\ x_{m,t} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} \tag{1}
$$

where $y_{n,t}$ is the number of medical visits recorded for department $n$ on day $t$; $y_{n,t-p}$ is the $p$-th order lag of that number; $\theta_{n,t}$ is the time series component structure of the number of medical visits for department $n$ on day $t$; $A$ is the coefficient corresponding to the component structure; $B_p$ is the coefficient corresponding to the $p$-th order lag; $C$ is the coefficient of the exogenous variables; $x_{m,t}$ is the value of environmental index $m$ on day $t$; and $\varepsilon_n$ is the residual term;

The above equation can be simplified to:

$$Y_t = A\theta_t + B_1 Y_{t-1} + \cdots + B_p Y_{t-p} + C x_t + \varepsilon_t, \quad t = 1, 2, \ldots, T \tag{2}$$

Step 3-4: Compare the fitted values predicted by the model with the true values to obtain the fitting residual. The fitting residual is calculated as follows:

$$E_t = Z_t - X_t \tag{3}$$

where $X_t$ is the fitted output, $Z_t$ is the real data, and $E_t$ is the fitting residual;
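To make equations (2) and (3) concrete, the sketch below evaluates a one-step fitted value for two departments with lag order p = 1 and one exogenous index, then forms the residual. All coefficient values are invented for illustration, the term $A\theta_t$ is passed in as an already evaluated vector (an assumption, since the text does not spell out how $\theta_t$ is built), and no parameter estimation is shown.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def varma_one_step(a_theta, b_list, y_lags, c, x_t):
    """Fitted value X_t = A*theta_t + sum_j B_j * Y_{t-j} + C * x_t.

    a_theta: the vector A*theta_t, already evaluated (assumption)
    b_list:  [B_1, ..., B_p] lag coefficient matrices
    y_lags:  [Y_{t-1}, ..., Y_{t-p}] lagged visit vectors
    """
    fitted = list(a_theta)
    for b, y in zip(b_list, y_lags):
        fitted = [f + q for f, q in zip(fitted, matvec(b, y))]
    return [f + q for f, q in zip(fitted, matvec(c, x_t))]

def residual(actual, fitted):
    """Equation (3): E_t = Z_t - X_t, element by element."""
    return [z - x for z, x in zip(actual, fitted)]

# Hypothetical numbers: 2 departments, p = 1, 1 environmental index.
a_theta = [10.0, 5.0]
b1 = [[0.5, 0.0], [0.1, 0.4]]
c = [[0.2], [0.1]]
fitted = varma_one_step(a_theta, [b1], [[100.0, 50.0]], c, [80.0])
errors = residual([80.0, 40.0], fitted)
```

Here the fitted visit counts come out near 76 and 43, so the residuals passed on to the BI-LSTM stage would be about 4 and -3.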

Step 3-5: Build a BI-LSTM neural network model that selectively forgets and retains the fitting residual data, train it on the residual data, and test it, in order to reduce the error that the VARMA model cannot analyze accurately. This includes the following steps:

Step 3-5-1: Through a pointwise multiplication and a sigmoid neural layer, information is passed selectively, and the forget gate is constructed as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{4}$$

Step 3-5-2: Construct the input gate structure as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{5}$$

$$c_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{6}$$

$$y_t = f_t * y_{t-1} + i_t * c_t \tag{7}$$

where $W_C$ is the weight coefficient of $c_t$, $W_i$ is the weight coefficient of the input, and $\tanh$ is the activation function;

Step 3-5-3: Construct the output gate structure as follows:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{8}$$

$$h_t = o_t * \tanh(y_t) \tag{9}$$

After steps 3-5-1 and 3-5-2, the cell information $y_t$ has been updated. The cell information is passed through the tanh activation function, which outputs a value in the interval [-1, 1]. That value is then multiplied by the output of the sigmoid gate to give the final output;

Step 3-5-4: Calculate the total output of the BI-LSTM structure. At time t, the total output of the BI-LSTM structure is the sum of the forward LSTM output and the backward LSTM output:

$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the outputs of the forward and backward LSTM at time t, and $\oplus$ is the vector sum operation;
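Equations (4) to (9) and the summed bidirectional output can be traced with the scalar sketch below. It assumes a hidden size and input size of 1, so each gate has one recurrent weight, one input weight and one bias; the parameter layout, function names and values are hypothetical, and training is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, y_prev, params):
    """One LSTM step with scalar state, following equations (4)-(9).
    params maps a gate name to (recurrent weight, input weight, bias)."""
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x_t + b)
    f_t = gate("f", sigmoid)          # forget gate, eq. (4)
    i_t = gate("i", sigmoid)          # input gate, eq. (5)
    c_t = gate("c", math.tanh)        # candidate value, eq. (6)
    y_t = f_t * y_prev + i_t * c_t    # cell information update, eq. (7)
    o_t = gate("o", sigmoid)          # output gate, eq. (8)
    h_t = o_t * math.tanh(y_t)        # hidden output, eq. (9)
    return h_t, y_t

def run_lstm(sequence, params):
    """Run the cell over a sequence from zero initial state."""
    h, y, outputs = 0.0, 0.0, []
    for x in sequence:
        h, y = lstm_step(x, h, y, params)
        outputs.append(h)
    return outputs

def bilstm_outputs(sequence, fwd_params, bwd_params):
    """Sum of forward and backward LSTM outputs at each time step."""
    forward = run_lstm(sequence, fwd_params)
    backward = run_lstm(sequence[::-1], bwd_params)[::-1]
    return [a + b for a, b in zip(forward, backward)]

# Hypothetical parameters and residual sequence.
params = {"f": (0.1, 0.5, 0.0), "i": (0.2, 0.4, 0.0),
          "c": (0.3, 0.6, 0.0), "o": (0.1, 0.3, 0.0)}
outputs = bilstm_outputs([0.3, -0.2, 0.3], params, params)
```

The backward direction simply runs the same recurrence over the reversed sequence and is re-reversed before summing, so each time step sees context from both sides.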

Step 3-5-5: Train the BI-LSTM network parameters iteratively. In each iteration, first perform a backward pass on the BI-LSTM network, then use the output of the backward pass to calculate the error of the output layer at each time step and the parameter derivatives of the forward LSTM and backward LSTM;

Once the parameter derivatives are obtained, update the network parameters of the BI-LSTM.

Step 3-6: The trained BI-LSTM model is used to predict the residuals of the test set data, and the predicted residual is recorded as $E$. Finally, the corrected test set prediction, denoted $Y_{VARLST}$, is calculated:

$$Y_{VARLST} = X_t + E \tag{13}$$

Step 4: Take the highly correlated environmental index data obtained in step 2 together with the medical visit statistics as input data, train the AdaBoost model iteratively, and finally combine all weak regressors according to their weights to obtain the final prediction result. This includes the following steps:

Step 4-1: Let the sample set be $U = \{(x_i, y_i) \mid i = 1, 2, \ldots, N\}$, and let $D_t(i)$, $t = 1, 2, \ldots, K$, denote the weight distribution of the samples at the t-th iteration. At the first iteration, t = 1, the weight distribution is initialized.

Step 4-2: Calculate the maximum error on the training set, $E_t = \max |y_i - G_k(x_i)|$, $i = 1, 2, \ldots, N$, calculate the expected output of $f_t(x_i)$ according to the weight distribution, and calculate the prediction error. The calculation formula is as follows:

Step 4-3: Calculate the proportional error. The calculation formula is as follows:

Step 4-4: Calculate the connection weight.

Step 4-5: Adjust the weight distribution.

Step 4-6: After K iterations, the final prediction result is obtained by combining the K weak regressors according to their weights.
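Because the formulas for steps 4-1 to 4-6 are not reproduced above, the sketch below substitutes the widely used AdaBoost.R2 scheme with a linear loss: uniform initial weights, errors scaled by the maximum error, a weight factor beta = err / (1 - err), and a weighted average of the weak regressors. These choices, the decision-stump weak learner, the function names and the data are all assumptions, and may differ from the formulas in the patent.

```python
import math

def fit_stump(xs, ys, weights):
    """Weighted least-squares decision stump on one feature.
    Assumes xs contains at least two distinct values."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [(y, w) for x, y, w in zip(xs, ys, weights) if x <= thr]
        right = [(y, w) for x, y, w in zip(xs, ys, weights) if x > thr]
        lv = sum(y * w for y, w in left) / sum(w for _, w in left)
        rv = sum(y * w for y, w in right) / sum(w for _, w in right)
        loss = (sum(w * (y - lv) ** 2 for y, w in left)
                + sum(w * (y - rv) ** 2 for y, w in right))
        if best is None or loss < best[0]:
            best = (loss, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv

def adaboost_r2(xs, ys, k=5):
    """AdaBoost.R2-style boosting with a linear loss (assumed variant)."""
    n = len(xs)
    dist = [1.0 / n] * n                       # uniform initial weights
    models, alphas = [], []
    for _ in range(k):
        g = fit_stump(xs, ys, dist)
        abs_err = [abs(y - g(x)) for x, y in zip(xs, ys)]
        e_max = max(abs_err)                   # maximum error on the training set
        if e_max == 0:                         # perfect fit: use this model alone
            models, alphas = [g], [1.0]
            break
        rel = [e / e_max for e in abs_err]     # proportional (relative) errors
        err = sum(d * r for d, r in zip(dist, rel))
        if err >= 0.5:                         # weak learner too poor: stop
            if not models:
                models, alphas = [g], [1.0]
            break
        beta = err / (1.0 - err)
        models.append(g)
        alphas.append(math.log(1.0 / beta))    # combination weight
        dist = [d * beta ** (1.0 - r) for d, r in zip(dist, rel)]
        z = sum(dist)
        dist = [d / z for d in dist]           # renormalize the distribution
    total = sum(alphas)
    return lambda x: sum(a * m(x) for a, m in zip(alphas, models)) / total

# Hypothetical data: one environmental index versus visit counts.
model = adaboost_r2([1, 2, 3, 4, 5, 6], [10, 10, 10, 20, 20, 26], k=5)
prediction = model(6)
```

Samples that a weak regressor fits badly keep more of their weight after each round, so later regressors concentrate on them; that is the behavior the weight adjustment in step 4-5 describes.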

Step 4-7: The prediction result Y_VARLST obtained by the VARLST model and the prediction result Y_AdaBoost obtained by the AdaBoost model are fused according to their respective weights to obtain the final prediction result Y. The calculation formula is as follows:

$$Y = W_1 Y_{\mathrm{VARLST}} + W_2 Y_{\mathrm{AdaBoost}} \qquad (17)$$

To reduce the risk of overfitting, reduce the variance of the model, and improve its stability and robustness, this embodiment uses average fusion, so W₁ = W₂ = 0.5.
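The weighted fusion of formula (17) with the average-fusion weights of this embodiment can be sketched as follows. The function name `fuse` and the sample values are hypothetical.

```python
def fuse(y_varlst, y_adaboost, w1=0.5, w2=0.5):
    """Weighted fusion Y = W1 * Y_VARLST + W2 * Y_AdaBoost (formula 17)."""
    if len(y_varlst) != len(y_adaboost):
        raise ValueError("both prediction series must have the same length")
    return [w1 * a + w2 * b for a, b in zip(y_varlst, y_adaboost)]


# Hypothetical predictions from the two models for two time steps.
y = fuse([124.0, 132.0], [120.0, 136.0])
print(y)
```

Other weight pairs can be passed through `w1` and `w2`; the default reproduces W₁ = W₂ = 0.5.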

Step 5: Use the root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) to test how well the nonlinear model fits the training data after the time series distribution, and to evaluate the prediction results. The calculation formulas for these model evaluation indicators are as follows:

Where P_i is the predicted value, X_i is the true value, X̄ is the mean of the true data, and n is the number of data points.
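The three evaluation indicators can be sketched with their standard textbook definitions, which are assumed here to match the intended formulas: RMSE is the square root of the mean squared difference, MAE is the mean absolute difference, and MAPE is the mean of |X_i - P_i| / |X_i| expressed as a percentage. The function names and sample values are hypothetical.

```python
import math


def rmse(pred, true):
    """Root mean square error."""
    return math.sqrt(sum((p - x) ** 2 for p, x in zip(pred, true)) / len(true))


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - x) for p, x in zip(pred, true)) / len(true)


def mape(pred, true):
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    return 100.0 * sum(abs((x - p) / x) for p, x in zip(pred, true)) / len(true)


# Hypothetical predicted (P_i) and true (X_i) monthly visit counts.
p = [110.0, 190.0, 330.0]
x = [100.0, 200.0, 300.0]
print(rmse(p, x), mae(p, x), mape(p, x))
```

Lower values of all three indicators mean a better fit. MAPE is undefined when any true value is zero.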

Step 6: The numbers of people with and without the disease are predicted from the current environmental factor data. The medians of the disease rate and non-disease rate over a past period are used as the risk reference baseline, and the relative risk value of the health risk is calculated. The calculation formula is as follows:

Where RR is the relative risk value of the predicted health risk under the multi-parameter environmental index, IE is the predicted number of people with the disease, IN is the predicted number of people without the disease, CE is the number of people with the disease under the risk reference baseline, and CN is the number of people without the disease under the risk reference baseline.

Health risk warning levels are assigned according to the relative risk value of the predicted health risk under the multi-parameter environmental index. A relative risk value below 1.4 is low risk, a value from 1.4 to 2.9 is medium risk, and a value above 2.9 is high risk. The resulting risk level allows health risks to be predicted in advance.
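Step 6 and the warning thresholds can be sketched as follows. The sketch assumes the standard epidemiological definition of relative risk, RR = [IE / (IE + IN)] / [CE / (CE + CN)], which is consistent with the definitions of IE, IN, CE and CN given above but is an assumption about the exact formula. The function names and the example counts are hypothetical.

```python
def relative_risk(ie, in_, ce, cn):
    """Assumed standard relative risk: predicted disease rate / baseline disease rate."""
    predicted_rate = ie / (ie + in_)
    baseline_rate = ce / (ce + cn)
    return predicted_rate / baseline_rate


def risk_level(rr):
    """Map a relative risk value to the warning level described in the text."""
    if rr < 1.4:
        return "low"
    if rr <= 2.9:
        return "medium"
    return "high"


# Hypothetical counts: 20 of 100 predicted ill versus 10 of 100 at baseline.
rr = relative_risk(ie=20, in_=80, ce=10, cn=90)
print(rr, risk_level(rr))
```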

It should be understood that, although the steps in the above flow chart are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flow chart may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times. They are also not necessarily executed sequentially, and may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Simulation example:

To verify the effectiveness of the algorithm proposed in this example, monthly air quality index data, monthly data on total emissions of waste gas, wastewater and other pollutants, and monthly medical visit statistics for Haikou City, Hainan Province, from 2014 to June 2023 were selected. The VARMA model, the AdaBoost model and the VARLST model were compared with the prediction model of this application (the VARLST-AdaBoost hybrid model), and the fitting performance of the models was then compared using the model evaluation indicators.

In this embodiment, the root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) are used to test how well the nonlinear model fits the training data after the time series distribution. The resulting indicators are shown in the following table.

Table 1. Model comparison experimental results

Table 1 shows that when the total amount of data is small, the conventional VARMA and AdaBoost prediction models do not predict well. The VARLST model improves prediction accuracy to some extent, but carries a risk of overfitting and instability. The VARLST-AdaBoost hybrid model proposed in the present invention improves on the original models by adjusting sample weights according to the classification error rate, which reduces overfitting, yields more accurate prediction results, and improves the prediction performance of the model.

As shown in FIG. 4, one embodiment provides a health risk prediction device 400, including a data acquisition module 410, a hybrid multivariate time series prediction module 420 and a health risk prediction module 430, wherein:

The data acquisition module 410 is used to acquire current environmental data;

The hybrid multivariate time series prediction module 420 is used to input the current environmental data into a prediction model. The prediction model includes a VARLST hybrid model and an AdaBoost model, and the VARLST model includes a VARMA model and a BI-LSTM model. The VARMA model is used for multi-dimensional data prediction: the prediction result of the VARMA prediction model is obtained and compared with the actual value to obtain the fitting residual; the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model; and the prediction result of the VARMA prediction model is superimposed with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST model. The prediction result of the VARLST model and the prediction result of the AdaBoost model are then fused by weighting to obtain the final prediction value;

The health risk prediction module 430 is used to calculate the relative risk value of the predicted health risk based on the final prediction value predicted under the current environmental data and on the numbers of people with and without the disease under the risk reference baseline in a past time period, and to issue a health risk warning based on the relative risk value.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal via a network connection. When executed by the processor, the computer program implements a health risk prediction method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball or touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse.

Those skilled in the art will understand that the structure shown in FIG. 5 is merely a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor. A computer program is stored in the memory, and the processor implements the following steps when executing the computer program:

Obtain current environmental data;

Input the current environmental data into a prediction model. The prediction model includes a VARLST hybrid model and an AdaBoost model, and the VARLST model includes a VARMA model and a BI-LSTM model. The VARMA model is used for multi-dimensional data prediction: the prediction result of the VARMA prediction model is obtained and compared with the actual value to obtain the fitting residual; the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model; and the prediction result of the VARMA prediction model is superimposed with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST model. The prediction result of the VARLST model and the prediction result of the AdaBoost model are fused by weighting to obtain the final prediction value;

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps:

Obtain current environmental data;

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A health risk prediction method, characterized in that it comprises the following steps:

Obtain current environmental data;

Input the current environment data into a prediction model, wherein the prediction model includes a VARLST hybrid model and an AdaBoost model, wherein the VARLST hybrid model includes a VARMA model and a BI-LSTM model, and the VARMA model is used for multi-dimensional data prediction, obtain the prediction result of the VARMA prediction model, and compare the prediction result with the actual value to obtain the fitting residual, use the fitting residual as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model, superimpose the prediction result of the VARMA prediction model with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST hybrid model; perform weighted fusion on the prediction result of the VARLST hybrid model and the prediction result of the AdaBoost model to obtain the final prediction value;

Based on the final predicted value predicted under the current environmental data and the number of people with the disease and the number of people without the disease under the risk reference baseline value in the past time period, the relative risk value of the predicted health risk is calculated; based on the relative risk value, a health risk warning is performed, and the relative risk value RR of the predicted health risk is calculated as follows:

Among them, IE is the predicted number of people with the disease, IN is the predicted number of people without the disease, CE is the number of people with the disease under the risk reference baseline, and CN is the number of people without the disease under the risk reference baseline.

The training process of the VARLST hybrid model is specifically as follows: historical environmental data and medical statistics are obtained and preprocessed, and the preprocessed data set is divided into a training set and a test set according to a preset ratio; the data set is differenced to make it stationary, autocorrelation function and partial autocorrelation function analysis is performed on the differenced data, and the parameters of the VARMA prediction model and the order of differencing are estimated based on the analysis results. The VARMA prediction model is:

Where y_{n,t} is the number of medical visits in n departments on day t; y_{n,t-p} is the p-order lag of the number of medical visits in n departments on day t; θ_{n,t} is the time series composition structure of the number of medical visits in n departments on day t; A is the coefficient corresponding to the composition structure; B_p is the coefficient corresponding to the p-order lag; C is the coefficient of the exogenous variable; x_{m,t} is the environmental index data of m items on day t; and ε_n is a residual term; the training set is input into the VARLST hybrid model for training, the VARLST hybrid model including a VARMA model and a BI-LSTM model, wherein the VARMA model is used for multi-dimensional data prediction, and the prediction result of the VARMA prediction model is compared with the actual value to obtain a fitting residual; the fitting residual is input into the BI-LSTM model for training until the preset requirement or the number of iterations is reached, and the trained BI-LSTM model is output, wherein, in each iteration, back-propagation is performed on the BI-LSTM model, the error of the output layer at each moment and the parameter derivatives of the forward LSTM and the backward LSTM are calculated from the back-propagated output, and after the parameter derivatives are obtained, the network parameters of the BI-LSTM model are updated; the test set is input into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model; the prediction result of the VARMA prediction model is superimposed with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST hybrid model;

The training process of the AdaBoost model is specifically as follows: training a base learner based on a data set and an initialization weight distribution to optimize the base learner; calculating the prediction error, proportional error, and connection weight of each training sample in the t-th iteration; adjusting the weight distribution of each training sample in the t+1 iteration, and continuing training until the maximum number of iterations is reached to obtain K models; and adding the prediction results of the K models as the prediction result of the AdaBoost model.

2. A health risk prediction method according to claim 1, characterized in that the historical environmental data includes air quality data, urban drinking water quality data, waste gas, wastewater and other total emission data.

3. A health risk prediction method according to claim 1, characterized in that the acquisition of historical environmental data and medical statistics data and preprocessing specifically comprises the following steps:

Perform data cleaning on environmental data and medical statistics, including processing missing values, outliers and duplicate values;

Perform correlation analysis on the cleaned data, calculate the correlation coefficients between each feature using the Pearson correlation coefficient, and finally obtain the correlation coefficient matrix C;

According to the correlation coefficient matrix C, environmental indices with high correlation with medical statistics are selected as input variables of the model.

4. A health risk prediction method according to claim 1, characterized in that MAE, MAPE, and RMSE are selected as evaluation indicators to evaluate the performance of the prediction model.

5. A health risk prediction device, characterized by comprising: a data acquisition module, a mixed multivariate time series prediction module and a health risk prediction module, wherein:

The data acquisition module is used to acquire current environment data;

The hybrid multivariate time series prediction module is used to input the current environmental data into the prediction model, the prediction model including a VARLST hybrid model and an AdaBoost model, wherein the VARLST hybrid model includes a VARMA model and a BI-LSTM model, the VARMA model is used for multi-dimensional data prediction, the prediction result of the VARMA prediction model is obtained and compared with the actual value to obtain the fitting residual, the fitting residual is used as the input of the BI-LSTM model to obtain the prediction result of the BI-LSTM model, and the prediction result of the VARMA prediction model is superimposed with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST hybrid model; the prediction result of the VARLST hybrid model and the prediction result of the AdaBoost model are fused by weighting to obtain the final prediction value; the training process of the VARLST hybrid model is specifically as follows: historical environmental data and medical statistics are obtained and preprocessed, and the preprocessed data set is divided into a training set and a test set according to a preset ratio; the data set is differenced to make it stationary, autocorrelation function and partial autocorrelation function analysis is performed on the differenced data, and the parameters of the VARMA prediction model and the order of differencing are estimated based on the analysis results. The VARMA prediction model is:

Where y_{n,t} is the number of medical visits in n departments on day t; y_{n,t-p} is the p-order lag of the number of medical visits in n departments on day t; θ_{n,t} is the time series composition structure of the number of medical visits in n departments on day t; A is the coefficient corresponding to the composition structure; B_p is the coefficient corresponding to the p-order lag; C is the coefficient of the exogenous variable; x_{m,t} is the environmental index data of m items on day t; and ε_n is a residual term; the training set is input into the VARLST hybrid model for training, the VARLST hybrid model including a VARMA model and a BI-LSTM model, wherein the VARMA model is used for multi-dimensional data prediction, and the prediction result of the VARMA prediction model is compared with the actual value to obtain a fitting residual; the fitting residual is input into the BI-LSTM model for training until the preset requirement or the number of iterations is met, and the trained BI-LSTM model is output, wherein, in each iteration, back-propagation is performed on the BI-LSTM model, the error of the output layer at each moment and the parameter derivatives of the forward LSTM and the backward LSTM are calculated from the back-propagated output, and after the parameter derivatives are obtained, the network parameters of the BI-LSTM model are updated; the test set is input into the trained BI-LSTM model to obtain the prediction result of the BI-LSTM model; the prediction result of the VARMA prediction model is superimposed with the prediction result of the BI-LSTM model to obtain the prediction result of the VARLST hybrid model; the training process of the AdaBoost model is specifically as follows: training a base learner based on the data set and the initialized weight distribution to optimize the base learner; calculating the prediction error, proportional error and connection weight of each training sample in the t-th iteration; adjusting the weight distribution of each training sample in the (t+1)-th iteration, and continuing training until the maximum number of iterations is reached to obtain K models; and adding the prediction results of the K models as the prediction result of the AdaBoost model;

The health risk prediction module is used to calculate the relative risk value of the predicted health risk based on the final predicted value predicted under the current environmental data and the number of people with the disease and the number of people without the disease under the risk reference baseline value in the past time period; and to perform health risk warning based on the relative risk value. The relative risk value RR of the predicted health risk is calculated by the following formula:

Among them, IE is the predicted number of people with the disease, IN is the predicted number of people without the disease, CE is the number of people with the disease under the risk reference baseline, and CN is the number of people without the disease under the risk reference baseline.

6. A computer device, characterized by comprising: a memory for storing a computer program; and a processor for implementing the method according to any one of claims 1 to 4 when executing the computer program.

7. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 4 is implemented.