CN111526070A

CN111526070A - Service function chain fault detection method based on prediction

Info

Publication number: CN111526070A
Application number: CN202010359298.0A
Authority: CN
Inventors: 唐伦; 廖皓; 贺兰钦; 胡彦娟; 陈前斌
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Hangzhou Mingshi Technology Co ltd; Shenzhen Wanzhida Technology Transfer Center Co ltd
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2020-08-11
Anticipated expiration: 2040-04-29
Also published as: CN111526070B

Abstract

The invention relates to a prediction-based service function chain fault detection method and belongs to the technical field of communication. First, based on the performance correlation that exists between the virtual network functions of a service function chain in a virtualized network environment, the method collects data by monitoring the performance data of every virtual network function on the entire service function chain and classifies each function's working state. Second, considering the high-dimensional complexity and temporal correlation of network monitoring data, together with the requirement that fault detection be proactive, a gated recurrent unit network is used to detect faults, and the health status of the network is predicted by analyzing the historical performance data of the service function chain. In addition, exploiting the similarity of virtual network function nodes across different service function chains, transfer learning is introduced to speed up model convergence. The invention can effectively detect the occurrence of a fault while meeting the fault detection time requirement.

Description

A prediction-based service function chain fault detection method

Technical Field

The invention belongs to the technical field of communication and relates to a prediction-based service function chain fault detection method.

Background

The arrival of 5G will bring an epoch-making change: 5G networks will provide differentiated business scenarios and customized network services for people's diverse service needs. The existing network architecture is mainly oriented to person-to-person communication and has difficulty supporting emerging services, such as the Internet of Things and the Internet of Vehicles, that require large-scale machine communication. 5G therefore regards network slicing as the ideal network architecture and the key to solving these problems. A 5G network slicing architecture based on SDN and NFV can be orchestrated flexibly according to user needs and can allocate resources dynamically and efficiently. NFV technology decouples the software and hardware of traditional networks and implements network functions in software, which makes it easier to share network resources.

However, while the network slicing architecture brings great flexibility to the 5G network, it also places new demands on network reliability. Because the underlying resources are shared, the failure of an underlying network node can easily cause the failure of multiple Virtual Network Functions (VNFs) that share that node's resources, paralyzing multiple Service Function Chains (SFCs). It is therefore very important to use SON technology to recover the network quickly from faults. Fault detection, as the core of network performance analysis, is the primary prerequisite for self-healing measures.

In the course of studying the prior art, the inventors found that it has the following shortcomings:

Most existing fault detection methods are based on a reactive mechanism, which cannot avoid the inherent delay of reacting and therefore does not meet the network's timeliness requirements. Most methods do not consider the degradation of network service quality and ignore the adverse effect that sustained performance degradation has on the network. They also do not consider that, in the service function chain scenario, faults propagate vertically between the infrastructure layer and the application layer and horizontally between different service function chains, which causes some normal VNF nodes to exhibit abnormal performance and makes it difficult to locate the VNF node where the fault first occurred.

Summary of the Invention

In view of this, the purpose of the invention is to provide a prediction-based service function chain fault detection method that, in a network virtualization environment, can effectively detect node anomalies from changes in the performance data of VNF nodes and thereby meet the reliability requirements of the network.

To achieve the above purpose, the invention provides the following technical solution:

A prediction-based service function chain fault detection method, characterized in that the method specifically includes the following steps:

S1: Taking into account the characteristics of fault propagation in the service function chain scenario and the performance correlation between VNF nodes, data is collected by monitoring the performance data of every VNF on the entire service function chain, and the working state of each VNF is classified as normal, service quality degradation, or fault;

S2: In view of the high-dimensional complexity and temporal correlation of network monitoring data, and the requirement that fault detection be proactive, a gated recurrent unit (GRU) network is used to detect faults, and the health status of the network is predicted by analyzing the historical performance data of the service function chain. Because building a GRU network model takes a long time, which conflicts with the real-time requirement of fault detection, the similarity of VNF nodes across different service function chains is exploited and transfer learning is introduced to speed up model convergence.

Further, in step S1, in the service function chain scenario, a service function chain is usually formed from several functionally independent VNFs arranged in a certain order according to the specific needs of the user, and the VNFs in different service function chains may partially overlap. This property makes it very easy for the fault of one VNF to propagate to associated VNFs and, in turn, to cause large-scale faults in the slicing network.

Because faults propagate between different VNF nodes in a service function chain, performance data is collected by monitoring the working state of every VNF node of the entire service function chain at the application layer of the slicing network. The detection system detects the occurrence of a fault through continuous analysis of the performance data and treats the first VNF node that exhibits the fault as the origin of the fault.

Further, to eliminate the influence on model training of differing dimensions (units), differing value ranges, and indistinct trends in the raw data samples, and at the same time to improve model accuracy and training speed, the performance data collected from all VNF nodes is normalized in a preprocessing step using the linear max-min method, whose transformation function is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is a raw value of a given performance metric, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that metric over the collected samples, and $x'$ is the normalized value.
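The max-min normalization can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the sample delays, and the handling of a constant column are assumptions made for illustration and are not specified in the patent.

```python
from typing import List, Sequence


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Linearly rescale one performance metric to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no trend, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical waiting delays (in ms) sampled from one VNF node.
waiting_delay_ms = [2.0, 4.0, 10.0, 6.0]
normalized = min_max_normalize(waiting_delay_ms)
```

Each metric (for example waiting delay and processing delay) would be normalized separately, so that metrics with different units end up on the same scale before they are fed to the model.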

Further, in step S1, the working state of the service function chain is divided into three categories according to the cause of the fault:

(1) Normal: the network is running well;

(2) Service quality degradation: the network load increases, traffic decreases, delay increases, or packets are lost, but the VNF can still work; possible causes include sudden changes in the surrounding environment, software or hardware problems, or insufficient resources; the node may recover to the normal working state within a short time or may keep deteriorating until it becomes a fault;

(3) Fault: the network function is completely unusable, the delay becomes infinite, and the corresponding service can no longer be provided to users; to keep the SFC containing this VNF node running normally, operations such as node migration or a restart of the software and hardware must be performed.

Further, the rule for executing node migration is as follows: a faulty VNF node may arise from a sudden change in a normal VNF node, or from a VNF node in the service quality degradation state whose performance has deteriorated to a certain degree or changes abruptly while degrading. A VNF node in the service quality degradation state may return to the normal working state if it can adapt to the system's optimization and adjustment; otherwise it deteriorates into a fault, and the network function can be restored only through the necessary healing measures.
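The three working states and the transitions described above can be written down as a small Python sketch. The names `VnfState`, `ALLOWED_TRANSITIONS`, `is_allowed`, and `needs_healing` are hypothetical; the self-loops and the absence of a fault-to-degradation transition are assumptions, since the text does not address them.

```python
from enum import Enum


class VnfState(Enum):
    NORMAL = "normal"
    DEGRADED = "service quality degradation"
    FAULT = "fault"


# Transitions taken from the description; staying in the same state is an
# assumption added so that an unchanged node is also a valid observation.
ALLOWED_TRANSITIONS = {
    # A normal node can degrade gradually or fail abruptly.
    VnfState.NORMAL: {VnfState.NORMAL, VnfState.DEGRADED, VnfState.FAULT},
    # A degraded node either recovers or deteriorates into a fault.
    VnfState.DEGRADED: {VnfState.DEGRADED, VnfState.NORMAL, VnfState.FAULT},
    # A faulty node returns to normal only through healing measures
    # (node migration or a software/hardware restart).
    VnfState.FAULT: {VnfState.FAULT, VnfState.NORMAL},
}


def is_allowed(src: VnfState, dst: VnfState) -> bool:
    """Return True if the description permits moving from src to dst."""
    return dst in ALLOWED_TRANSITIONS[src]


def needs_healing(state: VnfState) -> bool:
    """Only the fault state requires explicit healing measures."""
    return state is VnfState.FAULT
```

A table like this could be used to sanity-check labelled monitoring data before training, for example by flagging label sequences that jump from fault directly to degradation.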

Further, in step S2, in view of the high-dimensional complexity and temporal correlation of network monitoring data, and the requirement that fault detection be proactive, a GRU network is used to detect faults, which specifically includes the following steps:

S201: Process the input time series sample data with three layers of GRU units, and train the model on the historical monitoring data set in mini-batches;

S202: After the three-layer GRU network, use a fully connected layer to integrate the feature information from the preceding layers and improve the learning ability of the network;

S203: Use the output of the fully connected layer as the input of a softmax classifier, and perform supervised fine-tuning by back-propagation using the label data;

S204: Use real-time monitoring data to further optimize the parameters.
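The final classification stage of steps S202 to S203 can be sketched in plain Python. This is only an illustration of a softmax classifier over the three working states; the function names, the state ordering, and the example scores are assumptions, and the GRU layers and fully connected layer that would produce the scores are not shown.

```python
import math
from typing import List, Sequence, Tuple

# Assumed ordering of the three working states named in step S1.
STATES = ("normal", "service quality degradation", "fault")


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_state(fc_output: Sequence[float]) -> Tuple[str, List[float]]:
    """Map the fully connected layer's output to the most likely state."""
    probs = softmax(fc_output)
    best = max(range(len(probs)), key=probs.__getitem__)
    return STATES[best], probs


# Hypothetical output of the fully connected layer for one VNF node.
label, probabilities = predict_state([0.1, 2.0, -1.0])
```

During supervised fine-tuning, these probabilities would be compared with the labelled state (for example through a cross-entropy loss) and the error propagated back through the network.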

Further, in step S2, the health status of the network is predicted by analyzing the historical performance data of the service function chain, which specifically includes:

Assume that the historical performance data consists of the waiting delay and the processing delay. In the training phase, the waiting delay and processing delay of all VNF nodes on the service function chain are first collected as features. Suppose a service function chain consists of m VNF nodes, the monitoring data of all VNF nodes is recorded at every moment, and the sliding window length is d. Then, over the time range from t-d to t, the input data set of the network model is expressed as $x = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, where the data set $x_t$ of all VNF nodes at time t collects, for each VNF from the first to the m-th, its waiting delay and its processing delay;

Prediction method: the input of the GRU network model is time series data, so time series samples are constructed with the sliding window method. A window of length d is moved over the data set with time step h, giving the sample at the current moment, $X_t = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, and the sample at the next moment, $X_{t+h} = \{x_{t+h}, x_{t+h-1}, \dots, x_{t+h-d+1}\}$; the label values, $x_{t+1}$ and $x_{t+h+1}$ respectively, are determined from the actual state of the network at the moment following each window;
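The sliding window construction can be sketched as follows. This is a minimal illustration, assuming the monitoring data is a list of per-moment feature vectors and that a state label is available for every moment; the function name `build_windows` and the toy data are hypothetical. The windows are stored in chronological order, whereas the text lists the newest element first.

```python
from typing import List, Sequence, Tuple

FeatureVector = Sequence[float]


def build_windows(
    series: Sequence[FeatureVector],
    labels: Sequence[int],
    d: int,
    h: int,
) -> List[Tuple[List[FeatureVector], int]]:
    """Slide a window of length d over the series with step h.

    The window ending at index t covers indices t-d+1 .. t, and its label
    is the observed state at the following moment, t+1.
    """
    samples = []
    for t in range(d - 1, len(series) - 1, h):
        window = list(series[t - d + 1 : t + 1])
        samples.append((window, labels[t + 1]))
    return samples


# Toy data: one normalized feature per moment and a state label per moment
# (0 = normal, 1 = service quality degradation, 2 = fault).
series = [[0.1], [0.2], [0.3], [0.4], [0.5]]
labels = [0, 0, 1, 1, 2]
samples = build_windows(series, labels, d=3, h=1)
```

With d = 3 and h = 1 the toy series yields two samples; a larger h reduces the overlap between consecutive windows and therefore the number of samples.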

Then, according to the dimension of the GRU input, the data is divided into a training set and a test set in a certain proportion, and the model is trained in mini-batches. After the three-layer GRU network, a fully connected layer integrates the feature information from the preceding layers to improve the learning ability of the network, and the output of the fully connected layer is finally used as the input of the softmax classifier to obtain the final prediction result. To prevent overfitting when training the network, Dropout regularization is used to discard part of the redundant information generated during training;

Training of the network model: training is the process of continuously optimizing the model parameters. To reduce the error between the output and the actual network state, the network parameters are updated iteratively with the back-propagation algorithm, and the network weights are optimized layer by layer by gradient descent so that the value of the objective loss function is minimized. Because the Adam algorithm has advantages over other parameter optimization algorithms in computational efficiency and convergence speed, the Adam algorithm with an adaptive learning rate is used to speed up convergence.

Further, the Adam algorithm uses moment estimates of the gradient to dynamically adjust the learning rate of each parameter and is suitable for large data sets and high-dimensional spaces. Its update formula is:

$$\theta_{t+1} = \theta_t - \varepsilon \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$$

where $\theta$ is the iteration parameter, $\varepsilon$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-order and second-order moment estimates of the gradient, respectively, and $\delta$ is a smoothing term.
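A single Adam update can be sketched in plain Python. The decay rates `beta1` and `beta2` and all default values are the commonly used Adam defaults and are assumptions here, since the patent does not state them; the quadratic objective is a toy example, not the fault detection loss.

```python
import math
from typing import List, Tuple


def adam_step(
    theta: List[float],
    grad: List[float],
    m: List[float],
    v: List[float],
    t: int,
    lr: float = 0.001,     # learning rate (epsilon in the text)
    beta1: float = 0.9,    # assumed decay rate of the first moment
    beta2: float = 0.999,  # assumed decay rate of the second moment
    delta: float = 1e-8,   # smoothing term
) -> Tuple[List[float], List[float], List[float]]:
    """Apply one Adam update at step t (t starts at 1)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        th - lr * mh / (math.sqrt(vh) + delta)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = [1.0], [0.0], [0.0]
for step in range(1, 51):
    gradient = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, gradient, m, v, step, lr=0.01)
```

Because the update divides by the square root of the second moment, the step size for each parameter stays close to the learning rate regardless of the raw gradient scale, which is the adaptive behavior the text refers to.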

Further, in step S2, transfer learning is introduced to speed up model convergence, which specifically includes the following. In the network slicing scenario, the size of the monitoring data set is limited by how long the slice has been running. For a service function chain in the early stage of slice operation, insufficient monitoring data may lead to low fault detection accuracy. A parameter-based transfer learning method is therefore introduced to speed up model convergence, so that the fault detection model based on GRU neural network prediction can maintain high detection accuracy with a small number of data samples.

The selected source-domain model parameters are trained on data from a service function chain whose performance-index requirements are similar to those of the target domain. The fault detection model parameters of another service function chain whose structure is similar to that of the current service function chain are then transferred to the current chain, helping its fault detection model achieve a better training result. The specific steps are: train the GRU-based fault detection model of SFC b on the historical data sample set of SFC b to obtain the optimal parameter matrices at convergence, taking $W_i^b$ as an example; let the transfer ratio $\varphi(t) \in (0, 1)$ denote the degree to which parameters are transferred from the SFC b model to the SFC a model, where $\varphi(t) = 1/t$ decreases as time t increases; the parameter matrix of SFC a is then $W_i^a = \varphi(t) W_i^b + (1 - \varphi(t)) W_i^a$, with $W_i^a = W_i^b$ at the initial moment; finally, the model is trained and fine-tuned on the data sample set of SFC a to obtain the optimal GRU network model parameters $W_i^{a'}$ of SFC a.
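The parameter blending rule can be sketched in a few lines of Python. The function names and the toy matrices are hypothetical, and treating t as an integer step counter that starts at 1 is an assumption made so that the initial parameters of SFC a equal those of SFC b.

```python
from typing import List

Matrix = List[List[float]]


def transfer_ratio(t: int) -> float:
    """phi(t) = 1/t: full transfer at t = 1, fading as t grows."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return 1.0 / t


def blend_parameters(w_a: Matrix, w_b: Matrix, t: int) -> Matrix:
    """Compute phi(t) * W_b + (1 - phi(t)) * W_a element by element."""
    phi = transfer_ratio(t)
    return [
        [phi * b + (1.0 - phi) * a for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(w_a, w_b)
    ]


# Toy matrices: w_b stands for a converged source-domain (SFC b) matrix,
# w_a for the current target-domain (SFC a) matrix.
w_b = [[1.0, 2.0]]
w_a = [[0.0, 0.0]]
initial = blend_parameters(w_a, w_b, t=1)
later = blend_parameters(w_a, w_b, t=4)
```

As t grows, the source-domain parameters contribute less and the parameters learned from SFC a's own data dominate, which matches the intent that transfer mainly helps while the target chain has little monitoring data.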

The beneficial effects of the invention are as follows: for the fault detection problem in the 5G end-to-end network slicing scenario, the invention can effectively extract features from the massive, high-dimensional data of complex networks while satisfying the system's detection accuracy requirements, and it ensures the timeliness of fault detection, so it has high application value in wireless communication systems.

Other advantages, objectives, and features of the invention will be set forth in part in the description that follows and, in part, will be apparent to those skilled in the art upon study of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained through the description below.

Brief Description of the Drawings

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of a scenario to which the invention can be applied;

FIG. 2 is a node state transition diagram in an embodiment of the invention;

FIG. 3 is a fault detection model based on a GRU network in an embodiment of the invention;

FIG. 4 is a schematic flowchart of the service function chain fault detection method in an embodiment of the invention.

Detailed Description

The embodiments of the invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the invention from the contents disclosed in this specification. The invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention. It should be noted that the drawings provided in the following embodiments illustrate the basic idea of the invention only schematically, and that the following embodiments and the features in them can be combined with each other where they do not conflict.

Please refer to FIG. 1 to FIG. 4. FIG. 1 is a schematic diagram of a scenario to which the invention can be applied. Referring to FIG. 1, the application layer is mainly responsible for providing an ordered set of VNFs for each slice request to process services. The infrastructure layer provides the physical nodes and links that supply the computing, bandwidth, storage, and other resources needed to support the functional requirements of the various slice networks. The virtualization layer implements functions such as slice lifecycle management and network performance data monitoring through NFV MANO and the SDN controller. The system generates a specific service function chain for each business request in order to meet the user's service needs. The traffic, bandwidth, and delay requirements of the VNFs in each service function chain differ, but there is a certain correlation between the resources required by two adjacent VNFs and by the virtual link connecting them. To ensure network stability and service quality, the node states of the VNFs across the entire service function chain must be monitored so that faults are detected in time.

FIG. 2 is the node state transition diagram of this embodiment. A faulty VNF node may arise from a sudden change in a normal VNF node, or from a VNF node in the service quality degradation state whose performance has deteriorated to a certain degree or changes abruptly while degrading. A VNF node in the service quality degradation state may return to the normal working state if it can adapt to the system's optimization and adjustment; otherwise it deteriorates into a fault, and the network function can be restored only through the necessary healing measures.

FIG. 3 is the fault detection model based on a GRU network in an embodiment of the invention. Referring to FIG. 3, the fault detection model based on GRU network prediction consists of three layers of GRU units, a fully connected layer, and a softmax classifier. Each input sample at time t is defined as the health-state feature information $x_t$ of all VNF nodes on the same service function chain; training the GRU network model yields the hidden-layer state $h_t$ of the model at that time and, from it, the predicted state $y_{t+1}$ of each VNF. Because the state of a VNF at the next moment is jointly affected by the observation at the current moment and the hidden-layer state at the previous moment, the GRU network can use historical observations to predict the future health state of VNF nodes.
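How a hidden state depends on the current observation and the previous hidden state can be illustrated with one GRU update in plain Python. This is a sketch of the standard GRU equations, not the patent's implementation: the parameter names (`W_z`, `U_z`, `b_z`, and so on), the all-zero toy weights, and the gate convention (the update gate weights the candidate state) are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t: Vector, h_prev: Vector, params: Dict[str, list]) -> Vector:
    """One GRU update combining observation x_t with hidden state h_prev."""

    def layer(name: str, h_in: Vector, activation) -> Vector:
        wx = matvec(params["W_" + name], x_t)
        uh = matvec(params["U_" + name], h_in)
        bias = params["b_" + name]
        return [activation(a + b + c) for a, b, c in zip(wx, uh, bias)]

    z = layer("z", h_prev, sigmoid)  # update gate
    r = layer("r", h_prev, sigmoid)  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = layer("h", gated_h, math.tanh)
    # Assumed convention: z weights the candidate, (1 - z) the old state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, candidate)]


# Toy parameters (all zeros), purely illustrative: 2 features, 2 hidden units.
features, hidden = 2, 2
params: Dict[str, list] = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = [[0.0] * features for _ in range(hidden)]
    params["U_" + gate] = [[0.0] * hidden for _ in range(hidden)]
    params["b_" + gate] = [0.0] * hidden

h_next = gru_step([0.3, 0.7], [0.2, -0.4], params)
```

With all-zero weights both gates evaluate to 0.5 and the candidate state is zero, so the new hidden state is simply half of the previous one. In a trained model the gates decide, per unit, how much history to keep, and stacking three such layers followed by a fully connected layer and a softmax gives the structure described for FIG. 3.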

FIG. 4 is a schematic flowchart of the service function chain fault detection method in an embodiment of the invention. The specific steps are as follows:

Step 401: Initialize the system parameters, including the model parameters of each layer of the GRU networks of SFC b and SFC a, the learning rate, and the iteration counter k = 0;

Step 402: Input the historical performance data of all VNF nodes of SFC b and the real-time performance data of all VNF nodes of SFC a before time t;

Step 403: Normalize the input data in a preprocessing step;

Step 404: Construct the time series input data according to the sliding window length and the time step;

Step 405: Train the GRU network model on the historical performance data of all VNF nodes of SFC b, fine-tune the model by back-propagation with the adaptive-learning-rate Adam algorithm, and update the corresponding parameters;

Step 406: Determine whether the convergence condition is satisfied. If it is not, set k = k+1 and repeat step 405; otherwise, extract the latest model parameters and perform step 407;

Step 407: Transfer the model parameters extracted from the GRU network model of SFC b to the GRU network model of SFC a, and further train the model on the real-time performance data of SFC a before time t;

Step 408: Determine whether the maximum number of iterations K has been reached. If not, set k = k+1 and return to step 407; otherwise perform step 409;

Step 409: Output the working state of each VNF node in SFC a at time t+1.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the invention can be modified or replaced with equivalents without departing from the spirit and scope of the technical solution, and all such modifications and replacements fall within the scope of the claims of the invention.

Claims

1. A prediction-based service function chain fault detection method, characterized in that the method specifically comprises the following steps:

S1: Taking into account the characteristics of fault propagation in the service function chain scenario and the performance correlation between virtual network function (VNF) nodes, data is collected by monitoring the performance data of every VNF on the entire service function chain, and the working state of each VNF is classified as normal, service quality degradation, or fault;

S2: In view of the high-dimensional complexity and temporal correlation of network monitoring data, and the requirement that fault detection be proactive, a gated recurrent unit (GRU) network is used to detect faults, and the health status of the network is predicted by analyzing the historical performance data of the service function chain; because building a GRU network model takes a long time, which conflicts with the real-time requirement of fault detection, the similarity of VNF nodes across different service function chains is exploited and transfer learning is introduced to speed up model convergence.

2. The prediction-based service function chain fault detection method according to claim 1, characterized in that, in step S1, in the service function chain scenario, the service function chain is composed of a plurality of functionally independent VNFs arranged in a certain order; performance data is collected by monitoring the working state of each VNF node of the entire service function chain at the application layer of the slicing network; the occurrence of a fault is detected, and the first VNF node that exhibits the fault is regarded as the starting point of the fault.

3. The prediction-based service function chain fault detection method according to claim 2, wherein the performance data collected from all VNF nodes is normalized in a preprocessing step using the linear max-min method.

4. The prediction-based service function chain fault detection method according to claim 1, wherein, in step S1, the working states of the service function chain are divided into three categories according to the cause of the fault:

(1) Normal: the network is running well;

(2) Service quality degradation: the network load increases, traffic decreases, delay increases, or packets are lost, but the VNF can still work;

(3) Fault: the network function is completely unavailable, the delay becomes infinite, and the corresponding service can no longer be provided to users; node migration or a restart of the software and hardware is performed.

5. The prediction-based service function chain fault detection method according to claim 4, wherein the rule for executing node migration is: a VNF node in the service quality degradation state returns to the normal working state if it can adapt to the system's optimization and adjustment; otherwise it deteriorates into a fault, and the network function must be restored through healing measures.

6. The prediction-based service function chain fault detection method according to claim 1, wherein, in step S2, in view of the high-dimensional complexity and temporal correlation of network monitoring data, and the requirement that fault detection be proactive, the GRU network is used for fault detection, which specifically includes the following steps:

S201: Use the three-layer GRU unit to process the input time series sample data, and use the historical monitoring data set to train the model in small batches;

S202: After passing through the three-layer GRU network, a fully connected layer is used to integrate the feature information of the previous network to improve the learning ability of the network;

S203: Use the output of the fully connected layer as the input of the softmax classifier, and perform reverse supervised fine-tuning combined with the label data;

S204: Use real-time monitoring data to further optimize parameters.

7. The prediction-based service function chain fault detection method according to claim 6, characterized in that, in step S2, the health status of the network is predicted by analyzing the historical performance data of the service function chain, specifically comprising:

Assume that the historical performance data consists of the waiting delay and the processing delay; in the training phase, the waiting delay and processing delay of all VNF nodes on the service function chain are first collected as features; suppose a service function chain consists of m VNF nodes, the monitoring data of all VNF nodes is recorded at every moment, and the sliding window length is d; then, over the time range from t-d to t, the input data set of the network model is expressed as $x = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, where the data set $x_t$ of all VNF nodes at time t collects, for each VNF from the first to the m-th, its waiting delay and its processing delay;

Prediction method: time series samples are constructed with the sliding window method; a window of length d is moved over the data set with time step h, giving the sample at the current moment, $X_t = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, and the sample at the next moment, $X_{t+h} = \{x_{t+h}, x_{t+h-1}, \dots, x_{t+h-d+1}\}$; the label values, $x_{t+1}$ and $x_{t+h+1}$ respectively, are determined from the actual state of the network at the moment following each window;

Then, according to the dimension of the GRU input, the data is divided into a training set and a test set in a certain proportion, and the model is trained in mini-batches; after the three-layer GRU network, a fully connected layer integrates the feature information from the preceding layers to improve the learning ability of the network, and the output of the fully connected layer is finally used as the input of the softmax classifier to obtain the final prediction result; to prevent overfitting when training the network, Dropout regularization is used to discard part of the redundant information generated during training;

Training of the network model: the network parameters are updated iteratively with the back-propagation algorithm; the network weights are optimized layer by layer by gradient descent to minimize the value of the objective loss function; the Adam algorithm with an adaptive learning rate is used to speed up convergence.

8. The prediction-based service function chain fault detection method according to claim 7, characterized in that the Adam algorithm uses moment estimates of the gradient to dynamically adjust the learning rate of each parameter, and its update formula is:

$$\theta_{t+1} = \theta_t - \varepsilon \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$$

where $\theta$ is the iteration parameter, $\varepsilon$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-order and second-order moment estimates of the gradient, respectively, and $\delta$ is a smoothing term.

9. The prediction-based service function chain fault detection method according to claim 1, wherein, in step S2, transfer learning is introduced to speed up model convergence, which specifically includes: the selected source-domain model parameters are trained on data from a service function chain whose performance-index requirements are similar to those of the target domain; the fault detection model parameters of another service function chain whose structure is similar to that of the current service function chain are then transferred to the current service function chain, helping its fault detection model achieve a better training result; the specific steps are: train the GRU-based fault detection model of Service Function Chain (SFC) b on the historical data sample set of SFC b to obtain the optimal parameter matrix $W_i^b$ at convergence; let the transfer ratio $\varphi(t) \in (0, 1)$ denote the degree to which parameters are transferred from the SFC b model to the SFC a model, where $\varphi(t) = 1/t$; the parameter matrix of SFC a is $W_i^a = \varphi(t) W_i^b + (1 - \varphi(t)) W_i^a$, with $W_i^a = W_i^b$ at the initial moment; the model is trained and fine-tuned on the data sample set of SFC a to obtain the optimal GRU network model parameters $W_i^{a'}$ of SFC a.