CN111444009A

CN111444009A - A resource allocation method and device based on deep reinforcement learning

Info

Publication number: CN111444009A
Application number: CN201911117328.0A
Authority: CN
Inventors: 张海涛; 郭彤宇; 郭建立; 黄瀚; 何晨泽
Original assignee: Beijing University of Posts and Telecommunications; CETC 54 Research Institute
Current assignee: Beijing University of Posts and Telecommunications; CETC 54 Research Institute
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2020-07-24
Anticipated expiration: 2039-11-15
Also published as: CN111444009B

Abstract

Embodiments of the present invention provide a method and device for resource allocation based on deep reinforcement learning. The method includes: determining a variety of services of resources to be allocated included in a user's application request, and the allocation priority of each service; determining the current edge The state parameters of the micro cloud system, the state parameters include the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro cloud; input the state parameters into the pre-trained resource balance optimization model, and get the first The first target computing node of a service; the resource balancing optimization model is completed based on deep reinforcement learning training, and the first service is deployed on the first target computing node; the state parameter is updated, and the parameter input step is returned until the application request contains Resource allocation is completed for each service to which resources are to be allocated. Compared with the traditional resource allocation method, it can not only meet the communication delay requirement, but also achieve a higher degree of resource utilization balance.

Description

A resource allocation method and device based on deep reinforcement learning

技术领域technical field

本发明涉及无线通信技术领域，特别是涉及一种基于深度强化学习的资源分配方法及装置。The present invention relates to the technical field of wireless communication, and in particular, to a method and device for resource allocation based on deep reinforcement learning.

背景技术Background technique

近年来，随着信息化、网络化的不断发展，信息系统在军事、救灾等领域发挥着越来越重要的作用。在这种高度动态的环境中，任务计划及设备构成可能频繁变化，网络连通能力也会出现波动。基于单机设备的服务资源十分有限，无法应对复杂的计算任务。云计算技术是应对这种场景的有效手段。云计算技术中，能够根据任务需求，自定义进行资源配置，从而为大规模应用程序提供方便灵活的管理服务，然而，传统的云平台通常部署在距离用户较远的地区，通信延迟较高，且在网络不稳定的环境中，很难提供持续可靠的服务。In recent years, with the continuous development of informatization and networking, information systems have played an increasingly important role in military, disaster relief and other fields. In this highly dynamic environment, mission plans and equipment composition may change frequently, and network connectivity may fluctuate. Service resources based on stand-alone devices are very limited and cannot cope with complex computing tasks. Cloud computing technology is an effective means to deal with this scenario. In cloud computing technology, resource configuration can be customized according to task requirements, thereby providing convenient and flexible management services for large-scale applications. However, traditional cloud platforms are usually deployed in areas far away from users, and the communication delay is high. And in an environment with an unstable network, it is difficult to provide continuous and reliable services.

为了解决上述问题，产生了边缘微云平台。边缘微云平台是一种新兴的云计算模型，有多个分布式部署的边缘微云组成，每个边缘微云包含若干小型服务器，边缘微云平台的规模可以随着任务需求进行调整。边缘微云大都部署在移动车辆上，根据任务需求进行移动，以提供更高质量的云服务。随着微服务技术的发展，一个应用程序通常由多个相互通信的组合服务构成，每个组合服务对不同维度资源的需求不同。由于单个边缘微云的计算能力有限，不能满足所有服务需求，因此组合服务可以分布式部署在不同的边缘微云中，不同的边缘微云间相互协作共同提供计算能力。In order to solve the above problems, the edge micro-cloud platform was created. The edge micro-cloud platform is an emerging cloud computing model. It consists of multiple distributed edge micro-clouds. Each edge micro-cloud contains several small servers. The scale of the edge micro-cloud platform can be adjusted according to the task requirements. Edge micro-clouds are mostly deployed on mobile vehicles and move according to mission requirements to provide higher-quality cloud services. With the development of microservice technology, an application is usually composed of multiple composite services that communicate with each other, and each composite service has different requirements for resources in different dimensions. Since the computing power of a single edge micro-cloud is limited and cannot meet all service requirements, the combined services can be distributed and deployed in different edge micro-clouds, and different edge micro-clouds cooperate with each other to provide computing power.

然而，现有的边缘微云技术，在为服务分配资源时，经常会出现资源碎片化的情况，即资源分配不均衡，从而造成某一维度资源的浪费。此外，资源分配时不仅需要考虑各个服务的资源需求，还需要考虑服务之间的通信需求，这进一步增加了资源分配的复杂性，而现有的相关技术中均未考虑服务间通信需求。However, in the existing edge micro-cloud technology, when allocating resources for services, resources are often fragmented, that is, resource allocation is not balanced, resulting in a waste of resources in a certain dimension. In addition, resource allocation not only needs to consider the resource requirements of each service, but also needs to consider the communication requirements between services, which further increases the complexity of resource allocation, and the existing related technologies do not consider the communication requirements between services.

可见，亟需一种能够满足服务之间通信需求，且资源利用的均衡度较高资源分配方法。It can be seen that there is an urgent need for a resource allocation method that can meet the communication requirements between services and has a high degree of balanced resource utilization.

发明内容SUMMARY OF THE INVENTION

本发明实施例的目的在于提供一种基于深度强化学习的资源分配方法及装置，以提高资源利用的均衡度。具体技术方案如下：The purpose of the embodiments of the present invention is to provide a resource allocation method and device based on deep reinforcement learning, so as to improve the balance of resource utilization. The specific technical solutions are as follows:

为实现上述目的，本发明实施例提供了一种基于深度强化学习的资源分配方法，应用于边缘微云系统的控制平台，所述边缘微云系统还包括多个微云，每个微云包括多个计算节点，所述方法包括：In order to achieve the above object, an embodiment of the present invention provides a resource allocation method based on deep reinforcement learning, which is applied to a control platform of an edge micro-cloud system. The edge micro-cloud system further includes a plurality of micro-clouds, and each micro-cloud includes A plurality of computing nodes, the method includes:

确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；Determine the various services to be allocated resources contained in the user's application request, and the allocation priority of each service;

确定当前边缘微云系统的状态参数，所述状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；Determine the state parameters of the current edge micro cloud system, the state parameters include resource balance evaluation parameters, response delay evaluation parameters, and the remaining resources of each computing node in each micro cloud;

将所述状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点；所述第一服务为当前分配优先级最高的服务；所述资源均衡优化模型是基于深度强化学习训练完成的，其中，深度强化学习的训练集包括：边缘微云系统的样本状态参数；Input the state parameter into the pre-trained resource balancing optimization model to obtain the first target computing node of the first service; the first service is the service with the highest current allocation priority; the resource balancing optimization model is based on deep reinforcement The learning and training are completed, wherein the training set of deep reinforcement learning includes: the sample state parameters of the edge micro-cloud system;

将所述第一服务部署于所述第一目标计算节点；deploying the first service on the first target computing node;

更新所述状态参数，并返回将所述状态参数输入预先训练完成的资源均衡优化模型的步骤，直到所述应用程序请求中包含的每种待分配资源的服务均完成资源分配。The state parameter is updated, and the step of inputting the state parameter into the pre-trained resource balancing optimization model is returned, until the resource allocation is completed for each service of the to-be-allocated resources included in the application request.

可选的，基于如下公式计算所述资源均衡度评估参数：Optionally, the resource balance degree evaluation parameter is calculated based on the following formula:

其中，

表示第i个微云中第j个计算节点的资源利用方差，D表示资源的种类数，

表示第i个微云中第j个计算节点中第d类资源的资源利用率，

表示第i个微云中第j个计算节点中所有种类资源的资源利用率的平均值，X表示资源分配策略，

表示第i个微云中第j个计算节点的资源均衡率，RUBD_i表示第i个微云的资源利用均衡度，L_i表示第i个微云中计算节点的总数，RUBD_Total表示所述边缘微云系统的资源均衡度评估参数，K表示所述边缘微云系统中微云的总数；in,

represents the resource utilization variance of the jth computing node in the ith microcloud, D represents the number of resource types,

Represents the resource utilization rate of the d-th resource in the j-th computing node in the i-th microcloud,

represents the average resource utilization of all kinds of resources in the jth computing node in the ith microcloud, X represents the resource allocation strategy,

Represents the resource balance rate of the jth computing node in the ith microcloud, RUBD _i represents the resource utilization balance of the _ith microcloud, Li represents the total number of computing nodes in the ith microcloud, and RUBD _Total represents the The resource balance evaluation parameter of the edge micro-cloud system, K represents the total number of micro-clouds in the edge micro-cloud system;

基于如下公式计算所述响应延迟评估参数：The response delay evaluation parameter is calculated based on the following formula:

t_Total＝T_Comp(X)+T_TR(X)t _Total = T _Comp (X)+T _TR (X)

t_Total表示响应延迟评估参数，T_Comp(X)表示计算延迟，T_TR(X)表示传输延迟。t _Total is the response delay evaluation parameter, T _Comp (X) is the computation delay, and T _TR (X) is the transmission delay.

可选的，所述资源均衡优化模型按照如下步骤进行训练：Optionally, the resource balancing optimization model is trained according to the following steps:

获取预设的神经网络模型和所述训练集；Obtain a preset neural network model and the training set;

将所述样本状态参数输入所述神经网络模型，得到服务放置动作；所述服务放置动作表示为样本服务确定所放置的目标计算节点；Inputting the sample state parameters into the neural network model to obtain a service placement action; the service placement action is represented as a sample service to determine the placed target computing node;

基于所述服务放置动作，对所述样本状态参数进行更新，得到更新后的样本状态参数；Based on the service placement action, the sample state parameter is updated to obtain the updated sample state parameter;

基于所述样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，以及所述更新后的样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，计算本次服务放置动作的奖励值；Based on the resource balance evaluation parameters and response delay evaluation parameters included in the sample state parameters, and the resource balance evaluation parameters and response delay evaluation parameters included in the updated sample state parameters, calculate the value of this service placement action. reward value;

将所述样本状态参数，所述更新后的样本状态参数，本次服务放置动作，以及所述本次服务放置动作的奖励值，代入预设的损失函数，计算本次服务放置动作的损失值；Substitute the sample state parameter, the updated sample state parameter, the current service placement action, and the reward value of the current service placement action into a preset loss function to calculate the loss value of this service placement action ;

根据所述损失值确定所述神经网络模型是否收敛；Determine whether the neural network model converges according to the loss value;

若否，则调整所述神经网络模型中的参数值，并返回将更新后的样本状态参数输入所述神经网络模型，得到服务放置动作的步骤；If not, then adjust the parameter values in the neural network model, and return to the step of inputting the updated sample state parameters into the neural network model to obtain the service placement action;

若是，则将当前的神经网络模型确定为资源均衡优化模型。If so, the current neural network model is determined as the resource balance optimization model.

可选的，所述损失函数为：Optionally, the loss function is:

其中，L表示损失函数，E[]表示数学期望，n表示每次迭代所参考的历史迭代数据的组数，t表示时刻，

表示t时刻之后n组历史迭代数据的优先级权重，

表示t时刻后n次迭代的奖励值之和，

表示针对t时刻后n次迭代奖励值的衰减因子，Q_target表示目标网络，Q_eva表示估计网络，s_t表示时刻t的样本状态参数，a_t表示时刻t的服务放置动作，s_t+n表示迭代n次后的样本状态参数，a′表示使估计网络输出最大值的服务放置动作，k表示迭代序号，

表示针对t时刻后第k次迭代奖励值的衰减因子，r_t+k+1表示t时刻后第k次迭代的奖励值。Among them, L represents the loss function, E[] represents the mathematical expectation, n represents the number of groups of historical iteration data referenced in each iteration, t represents the time,

Represents the priority weight of n groups of historical iteration data after time t,

represents the sum of the reward values of n iterations after time t,

represents the decay factor of the reward value for n iterations after time t, Q _target represents the target network, Q _eva represents the estimation network, s _t represents the sample state parameter at time t, at _t represents the service placement action at time t, s _t+n Represents the sample state parameter after n iterations, a' represents the service placement action that maximizes the estimated network output, k represents the iteration number,

Represents the decay factor for the reward value of the k-th iteration after time t, and r _t+k+1 represents the reward value of the k-th iteration after time t.

为实现上述目的，本发明实施例还提供了一种基于深度强化学习的资源分配装置，应用于边缘微云系统的控制平台，所述边缘微云系统还包括多个微云，每个微云包括多个计算节点，所述装置包括：In order to achieve the above purpose, the embodiment of the present invention also provides a resource allocation device based on deep reinforcement learning, which is applied to the control platform of an edge micro-cloud system. The edge micro-cloud system further includes a plurality of micro-clouds, each micro-cloud Including a plurality of computing nodes, the apparatus includes:

第一确定模块，用于确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；a first determining module, used for determining a variety of services of resources to be allocated included in the application request of the user, and the allocation priority of each service;

第二确定模块，用于确定当前边缘微云系统的状态参数，所述状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；The second determination module is used to determine the state parameters of the current edge micro-cloud system, where the state parameters include the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro-cloud;

输入模块，用于将所述状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点；所述第一服务为当前分配优先级最高的服务；所述资源均衡优化模型是基于深度强化学习训练完成的，其中，深度强化学习的训练集包括：边缘微云系统的样本状态参数；The input module is used to input the state parameter into the pre-trained resource balancing optimization model to obtain the first target computing node of the first service; the first service is the service with the highest allocation priority currently; the resource balancing optimization The model is completed based on deep reinforcement learning training, wherein the training set of deep reinforcement learning includes: the sample state parameters of the edge micro-cloud system;

部署模块，用于将所述第一服务部署于所述第一目标计算节点；a deployment module, configured to deploy the first service on the first target computing node;

更新模块，用于更新所述状态参数，并触发所述输入模块，直到所述应用程序请求中包含的每种待分配资源的服务均完成资源分配。An update module, configured to update the state parameter and trigger the input module until the resource allocation is completed for each service of the to-be-allocated resources contained in the application request.

可选的，所述装置还包括：计算模块，所述计算模块，用于基于如下公式计算所述资源均衡度评估参数：Optionally, the device further includes: a calculation module, configured to calculate the resource balance degree evaluation parameter based on the following formula:

其中，

表示第i个微云中第j个计算节点中第d类资源的资源利用率，

t_Total＝T_Comp(X)+T_TR(X)t _Total = T _Comp (X)+T _TR (X)

可选的，所述装置还包括：训练模块，所述训练模块，用于按照如下步骤训练所述资源均衡优化模型：Optionally, the device further includes: a training module, the training module is configured to train the resource balancing optimization model according to the following steps:

可选的，所述损失函数为：Optionally, the loss function is:

表示t时刻之后n组历史迭代数据的优先级权重，

表示t时刻后n次迭代的奖励值之和，

represents the sum of the reward values of n iterations after time t,

为实现上述目的，本发明实施例还提供了一种电子设备，包括处理器、通信接口、存储器和通信总线，其中，处理器，通信接口，存储器通过通信总线完成相互间的通信；To achieve the above object, an embodiment of the present invention also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

存储器，用于存放计算机程序；memory for storing computer programs;

处理器，用于执行存储器上所存放的程序时，实现上述任一方法步骤。The processor is configured to implement any of the above method steps when executing the program stored in the memory.

为实现上述目的，本发明实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质内存储有计算机程序，所述计算机程序被处理器执行时实现上述任一方法步骤。To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, any one of the above method steps is implemented.

可见，本发明实施例提供的基于深度强化学习的资源分配方法及装置，可以确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点，将第一服务部署于第一目标计算节点；更新状态参数，返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。从而综合考虑了响应延迟，资源分配均衡度，并采用深度强化学习的方式训练网络模型，相比于传统的资源分配方法，既能够满足通信延迟需求，又能达到较高的资源利用均衡度。It can be seen that the resource allocation method and device based on deep reinforcement learning provided by the embodiments of the present invention can determine a variety of services to be allocated resources included in the user's application request, and the allocation priority of each service; determine the current edge micro cloud The state parameters of the system, the state parameters include the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro cloud; enter the state parameters into the pre-trained resource balance optimization model to get the first service The first target computing node of the first target computing node, deploy the first service on the first target computing node; update the state parameters, and return to the steps of inputting the state parameters into the pre-trained resource balancing optimization model, until the application request contains each type of to-be-allocated The services of the resource complete the resource allocation. Therefore, the response delay and resource allocation balance are comprehensively considered, and the deep reinforcement learning method is used to train the network model. Compared with the traditional resource allocation method, it can not only meet the communication delay requirements, but also achieve a higher resource utilization balance.

当然，实施本发明的任一产品或方法必不一定需要同时达到以上所述的所有优点。Of course, it is not necessary for any product or method to implement the present invention to simultaneously achieve all of the advantages described above.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained according to these drawings without creative efforts.

图1为本发明实施例提供的基于深度强化学习的资源分配方法的一种流程示意图；1 is a schematic flowchart of a resource allocation method based on deep reinforcement learning provided by an embodiment of the present invention;

图2为本发明实施例提供的训练资源均衡优化模型的一种流程示意图；2 is a schematic flowchart of a training resource balancing optimization model provided by an embodiment of the present invention;

图3为本发明实施例提供的基于深度强化学习的资源分配装置的一种结构示意图；3 is a schematic structural diagram of a resource allocation device based on deep reinforcement learning provided by an embodiment of the present invention;

图4为本发明实施例提供的电子设备的一种结构示意图。FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

为了解决现有的边缘微云领域的服务资源分配方法的资源利用均衡度较低的技术问题，本发明实施例提供了一种基于深度强化学习的资源分配方法、装置、电子设备及计算机可读存储介质。In order to solve the technical problem of low resource utilization balance in the existing service resource allocation method in the edge micro-cloud field, the embodiments of the present invention provide a resource allocation method, device, electronic device, and computer-readable based on deep reinforcement learning storage medium.

为了便于理解，下面先对本发明实施例的应用场景进行说明。For ease of understanding, an application scenario of the embodiment of the present invention is first described below.

本发明实施例提供的基于深度强化学习的资源分配方法可以应用于军事领域，救灾领域等高度动态的场景，这些场景通常采用边缘微云系统来为应用程序提供服务。边缘微云系统可以包括控制平台，多个微云，每个微云中包含多个计算节点，其中计算节点可以表示包含处理器、通信接口、存储器和通信总线的电子设备，例如个人电脑等，通常可以被置于移动车辆上。微云表示多个计算节点的集合，一个微云通常可以包括一个移动车辆上的多个计算节点。本发明实施例提供的基于深度强化学习的资源分配方法可以应用于控制平台，即控制平台决定将应用程序包含的每种服务部署在哪个计算节点中。The resource allocation method based on deep reinforcement learning provided by the embodiments of the present invention can be applied to highly dynamic scenarios such as the military field and disaster relief field, and these scenarios usually use an edge micro-cloud system to provide services for applications. The edge micro-cloud system can include a control platform, multiple micro-clouds, and each micro-cloud contains multiple computing nodes, where the computing nodes can represent electronic devices including processors, communication interfaces, memory, and communication buses, such as personal computers, etc., Typically it can be placed on a moving vehicle. A micro cloud represents a collection of multiple computing nodes, and a micro cloud can usually include multiple computing nodes on a mobile vehicle. The resource allocation method based on deep reinforcement learning provided by the embodiment of the present invention can be applied to the control platform, that is, the control platform decides which computing node to deploy each service included in the application program.

具体的，参见图1，本发明实施例提供的基于深度强化学习的资源分配方法可以包括以下步骤：Specifically, referring to FIG. 1 , the resource allocation method based on deep reinforcement learning provided by the embodiment of the present invention may include the following steps:

S101：确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级。S101: Determine a variety of services of resources to be allocated included in the application request of the user, and the allocation priority of each service.

本发明实施例中，用户的应用程序请求可能包含多种服务，例如定位服务，图像处理服务等，这些服务需要被部署在微云中的计算节点上，从而计算节点能够为这些服务提供相应的资源。本发明实施例中，为服务分配资源的过程，也可以理解为为服务分配计算节点的过程。In this embodiment of the present invention, the user's application request may include various services, such as positioning services, image processing services, etc. These services need to be deployed on computing nodes in the micro cloud, so that the computing nodes can provide corresponding services for these services. resource. In this embodiment of the present invention, the process of allocating resources for a service may also be understood as a process of allocating computing nodes for the service.

此外，由于服务之间存在一定的关联关系，因此待分配资源的服务具备分配优先级。例如，服务a需要依赖于服务b，则服务b的分配优先级高于服务a，即只有先为服务b分配计算节点，才能为服务a分配。In addition, since there is a certain relationship between services, the service to which the resource is to be allocated has an allocation priority. For example, if service a needs to depend on service b, the allocation priority of service b is higher than that of service a, that is, service a can be allocated only when computing nodes are allocated to service b first.

S102：确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量。S102: Determine state parameters of the current edge micro-cloud system, where the state parameters include a resource balance evaluation parameter, a response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro-cloud.

本发明实施例中，资源均衡度评估参数表示微云中计算节点的资源分配的均衡度。In the embodiment of the present invention, the resource balance degree evaluation parameter represents the balance degree of resource allocation of computing nodes in the micro cloud.

在本发明的一种实施例中，可以基于如下公式计算资源均衡度评估参数：In an embodiment of the present invention, the resource balance degree evaluation parameter can be calculated based on the following formula:

首先定义资源利用方差如下：First define the resource utilization variance as follows:

其中，

表示第i个微云中第j个计算节点中第d类资源的资源利用率，

表示第i个微云中第j个计算节点中所有种类资源的资源利用率的平均值，X表示资源分配策略。in,

Represents the average resource utilization of all kinds of resources in the j-th computing node in the i-th microcloud, and X represents the resource allocation strategy.

由于资源利用方差无法体现某一特定资源的利用率过高，而其他种类资源的利用率较低的资源利用不均衡的情况，因此，可以按如下公式定义资源均衡率：Since the resource utilization variance cannot reflect the unbalanced utilization of resources in which the utilization rate of a specific resource is too high and the utilization rate of other types of resources is low, the resource balancing rate can be defined as follows:

其中，

表示第i个微云中第j个计算节点的资源均衡率。in,

Represents the resource balance rate of the jth computing node in the ith cloud.

进一步的，对资源均衡率进行归一化，可以按如下公式定义资源利用均衡度：Further, by normalizing the resource balance rate, the resource utilization balance degree can be defined according to the following formula:

其中，

表示第i个微云中第j个计算节点的资源利用均衡度，其值在0到1之间，资源利用均衡度值越大，则资源利用越均衡。in,

Represents the resource utilization balance degree of the jth computing node in the ith microcloud, and its value is between 0 and 1. The larger the resource utilization balance degree value is, the more balanced the resource utilization is.

进而可以确定整个边缘微云系统的资源利用均衡度：Then, the resource utilization balance of the entire edge micro-cloud system can be determined:

其中，RUBD_i表示第i个微云的资源利用均衡度，L_i表示第i个微云中计算节点的总数，RUBD_Total表示边缘微云系统的资源利用均衡度，也就是资源均衡度评估参数，K表示边缘微云系统中微云的总数。Among them, RUBD _i represents the resource utilization balance of the _ith microcloud, Li represents the total number of computing nodes in the ith microcloud, and RUBD _Total represents the resource utilization balance of the edge microcloud system, that is, the resource balance evaluation parameter , K represents the total number of microclouds in the edge microcloud system.

在本发明实施例中，响应延迟评估参数表示边缘微云系统的计算延迟和通信延迟之和，即：In the embodiment of the present invention, the response delay evaluation parameter represents the sum of the computing delay and the communication delay of the edge micro-cloud system, namely:

t_Total＝T_Comp(X)+T_TR(X)t _Total = T _Comp (X)+T _TR (X)

本发明实施例中，可以用

表示第i个微云中第j个计算节点的资源剩余量。In this embodiment of the present invention, the

Represents the resource remaining of the jth computing node in the ith microcloud.

进而，本发明实施例中，可以用集合s表示当前边缘微云系统的状态参数，则：Furthermore, in this embodiment of the present invention, the set s can be used to represent the state parameters of the current edge micro-cloud system, then:

也就是说，边缘微云系统的状态参数可以由资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量组成。That is to say, the state parameters of the edge micro-cloud system can be composed of the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro-cloud.

S103：将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点；第一服务为当前分配优先级最高的服务，资源均衡优化模型是基于深度强化学习训练完成的，其中，深度强化学习的训练集包括：边缘微云系统的样本状态参数。S103: Input the state parameters into the pre-trained resource balancing optimization model to obtain the first target computing node of the first service; the first service is the service with the highest allocation priority currently, and the resource balancing optimization model is completed based on deep reinforcement learning training , where the training set of deep reinforcement learning includes: the sample state parameters of the edge micro-cloud system.

本发明实施例中，控制平台在获取当前边缘微云系统的状态参数后，可以将其输入资源均衡优化模型，由于资源均衡优化模型是根据训练集基于深度强化学习训练完成的，因此可以输出最适合当前状态参数的资源分配策略。In the embodiment of the present invention, after obtaining the state parameters of the current edge micro-cloud system, the control platform can input it into the resource balance optimization model. Since the resource balance optimization model is trained based on the training set based on deep reinforcement learning, it can output the most Resource allocation strategy suitable for current state parameters.

具体的，资源均衡优化模型输出第一服务的第一目标计算节点，其中第一服务为当前分配优先级最高的服务。例如，当前分配优先级最高的服务为定位服务，则资源均衡优化模型输出定位服务的目标计算节点，可以记为

则

表示该定位服务被分配至第i个微云中的第j个节点。Specifically, the resource balance optimization model outputs the first target computing node of the first service, where the first service is the service with the highest currently assigned priority. For example, if the service with the highest allocation priority is the location service, the resource balance optimization model outputs the target computing node of the location service, which can be recorded as

but

Indicates that the location service is allocated to the jth node in the ith microcloud.

其中，资源均衡优化模型的训练过程可以参见下文，此处不赘述。The training process of the resource balancing optimization model can be referred to below, which will not be repeated here.

S104：将第一服务部署于第一目标计算节点。S104: Deploy the first service on the first target computing node.

本发明实施例中，在确定第一服务的第一目标计算节点后，可以将第一服务部署于该第一目标计算节点，则第一服务可以基于该第一目标计算节点中的资源来运行。In this embodiment of the present invention, after the first target computing node of the first service is determined, the first service can be deployed on the first target computing node, and the first service can be run based on the resources in the first target computing node .

S105：更新状态参数，并返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。S105: Update the state parameters, and return to the step of inputting the state parameters into the pre-trained resource balancing optimization model, until the resource allocation is completed for each service of the to-be-allocated resources included in the application request.

本发明实施例中，在为第一服务分配资源后，边缘微云系统的状态参数会发生变化，因此，控制平台重新统计边缘微云系统的当前状态参数，并继续为下一个服务分配资源。In the embodiment of the present invention, after allocating resources for the first service, the state parameters of the edge micro-cloud system will change. Therefore, the control platform re-counts the current state parameters of the edge micro-cloud system and continues to allocate resources for the next service.

控制平台将当前状态参数输入资源均衡优化模型，进而得到当前分配优先级最高的服务的目标计算节点。The control platform inputs the current state parameters into the resource balancing optimization model, and then obtains the target computing node that currently allocates the service with the highest priority.

本发明实施例中，可以循环执行上述步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。In this embodiment of the present invention, the above steps may be performed cyclically until resource allocation is completed for each service of to-be-allocated resources included in the application request.

可见，本发明实施例提供的基于深度强化学习的资源分配方法，可以确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点，将第一服务部署于第一目标计算节点；更新状态参数，返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。从而综合考虑了响应延迟，资源分配均衡度，并采用深度强化学习的方式训练网络模型，相比于传统的资源分配方法，既能够满足通信延迟需求，又能达到较高的资源利用均衡度。It can be seen that the resource allocation method based on deep reinforcement learning provided by the embodiment of the present invention can determine a variety of services to be allocated resources included in the user's application request, and the allocation priority of each service; determine the current edge micro cloud system The state parameters include the resource balance evaluation parameters, the response delay evaluation parameters, and the remaining resources of each computing node in each microcloud; input the state parameters into the pre-trained resource balance optimization model to obtain the first service A target computing node, deploying the first service on the first target computing node; updating the state parameters, and returning to the step of inputting the state parameters into the pre-trained resource balancing optimization model, until each resource to be allocated contained in the application request is All services complete resource allocation. Therefore, the response delay and resource allocation balance are comprehensively considered, and the deep reinforcement learning method is used to train the network model. Compared with the traditional resource allocation method, it can not only meet the communication delay requirements, but also achieve a higher resource utilization balance.

本发明实施例中，可以基于深度强化学习来训练资源均衡优化模型。深度强化学习是强化学习和深度学习的结合。In this embodiment of the present invention, a resource balancing optimization model can be trained based on deep reinforcement learning. Deep reinforcement learning is a combination of reinforcement learning and deep learning.

为了便于理解，下面先对强化学习进行简要介绍。For ease of understanding, a brief introduction to reinforcement learning is given below.

强化学习是机器学习的一种类型，其基本思想是根据场景状态生成动作，再通过接收环境对动作的奖励获得学习信息并更新模型参数，最终能够实现在特定场景状态下做出最优动作。Reinforcement learning is a type of machine learning. Its basic idea is to generate actions according to the scene state, and then obtain learning information and update the model parameters by receiving the reward of the environment for the action, and finally realize the optimal action in a specific scene state.

本发明实施例中，资源分配过程可以建模为强化学习模型，其中，场景状态为边缘微云系统的状态参数，动作为针对某一服务的资源分配策略。从而，训练完成的资源均衡优化模型可以根据边缘微云系统的状态参数，输出针对某一服务的最佳资源分配策略。In this embodiment of the present invention, the resource allocation process can be modeled as a reinforcement learning model, where the scene state is a state parameter of the edge micro-cloud system, and the action is a resource allocation strategy for a certain service. Therefore, the trained resource balance optimization model can output the optimal resource allocation strategy for a certain service according to the state parameters of the edge micro-cloud system.

在本发明的一种实施例中，参见图2，可以采用如下步骤训练资源均衡优化模型：In an embodiment of the present invention, referring to FIG. 2 , the following steps can be used to train the resource balancing optimization model:

S201：获取预设的神经网络模型和训练集。S201: Obtain a preset neural network model and a training set.

本领域技术人员可以理解，强化学习与传统的监督学习相比，在训练时没有标签作为样本，强化学习的训练过程中，仅需要初始的输入状态作为训练集。Those skilled in the art can understand that compared with traditional supervised learning, reinforcement learning does not have labels as samples during training, and in the training process of reinforcement learning, only the initial input state is required as a training set.

本发明实施例中，训练集可以为边缘微云系统的样本状态参数。In the embodiment of the present invention, the training set may be the sample state parameters of the edge micro-cloud system.

S202：将样本状态参数输入神经网络模型，得到服务放置动作；服务放置动作表示为样本服务确定所放置的目标计算节点。S202: Input the sample state parameters into the neural network model to obtain a service placement action; the service placement action represents determining a placed target computing node for the sample service.

本发明实施例中，神经网络模型的输入为样本状态参数，输出为服务放置动作，服务放置动作表示为样本服务确定所放置的目标计算节点。样本状态参数可以用s表示，同上文，

表示资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量的集合。服务放置动作可以用a表示，

表示服务被分配至第i个微云中的第j个节点。In the embodiment of the present invention, the input of the neural network model is the sample state parameter, the output is the service placement action, and the service placement action is represented as the target computing node where the sample service is determined to be placed. The sample state parameter can be represented by s, same as above,

Represents the resource balance evaluation parameter, the response delay evaluation parameter, and the set of resource remaining amount of each computing node in each microcloud. A service placement action can be represented by a,

Indicates that the service is assigned to the jth node in the ith microcloud.

S203：基于服务放置动作，对样本状态参数进行更新，得到更新后的样本状态参数。S203: Based on the service placement action, update the sample state parameter to obtain the updated sample state parameter.

在生成服务放置动作后，样本状态参数会进行更新，得到更新后的样本状态参数，作为下一轮训练的输入。After the service placement action is generated, the sample state parameters will be updated, and the updated sample state parameters will be obtained as the input for the next round of training.

S204：基于样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，以及更新后的样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，计算本次服务放置动作的奖励值。S204: Calculate the reward value of the current service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter included in the sample state parameter, and the resource balance degree evaluation parameter and response delay evaluation parameter included in the updated sample state parameter .

本发明实施例中，在每轮迭代之后，可以计算本轮迭代中服务放置动作的奖励值，容易理解的，若本轮放置动作之后资源均衡度评估参数显示资源分配的越均衡，响应延迟评估参数显示响应延迟越低，则本轮放置动作的奖励值也就越高。In this embodiment of the present invention, after each round of iterations, the reward value of the service placement action in the current iteration can be calculated. It is easy to understand that if the resource balance evaluation parameter shows that the resource allocation is more balanced after the placement action in this round, the response delay is evaluated. The parameter shows that the lower the response delay, the higher the reward value of the placement action in this round.

具体的，本发明的一种实施例中，可以基于如下公式计算服务放置动作的奖励值：Specifically, in an embodiment of the present invention, the reward value of the service placement action can be calculated based on the following formula:

其中，r_n表示第n轮迭代的奖励值，

表示第n-1轮迭代后的资源均衡度评估参数，

表示第n轮迭代后的资源均衡度评估参数，

表示第n轮迭代后的服务延迟，L_ave表示平均服务延迟约束。Among them, rn represents the reward value of the _nth iteration,

Represents the resource balance evaluation parameter after the n-1 round of iteration,

Represents the resource balance evaluation parameter after the nth round of iteration,

represents the service delay after the nth round of iteration, and L _ave represents the average service delay constraint.

上述公式仅为计算奖励值的一种方式，本发明实施例并不限于采用以上公式计算奖励值。The above formula is only one way to calculate the reward value, and the embodiment of the present invention is not limited to using the above formula to calculate the reward value.

S205：将样本状态参数，更新后的样本状态参数，本次服务放置动作，以及本次服务放置动作的奖励值，代入预设的损失函数，计算本次服务放置动作的损失值。S205: Substitute the sample state parameter, the updated sample state parameter, the current service placement action, and the reward value of this service placement action into a preset loss function to calculate the loss value of this service placement action.

为了便于理解，以第n轮迭代为例进行说明，第n-1轮迭代后的样本状态参数为s_n-1，第n轮迭代后的样本状态参数为s_n，第n轮迭代输出的服务放置动作为a_n，第n轮迭代的奖励值为r_n，则可以基于s_n-1，s_n，a_n，r_n，以及预设的损失函数，计算第n轮迭代中服务放置动作的损失值。For ease of understanding, the nth round of iteration is taken as an example for illustration. The sample state parameter after the n-1th round of iteration is _sn-1 , the sample state parameter after the _nth round of iteration is sn , and the output of the nth round of iteration is The service placement action is an _n , and the reward value of the nth iteration is r _n , then the service placement in the _nth iteration can be calculated based on s _n _-1 , s _n , an , rn , and the preset loss function The loss value of the action.

本领域技术人员可以理解，在深度强化学习领域，每轮迭代都可以得到一个新的Q值，Q值是与状态s和动作a相关的函数值，表示在某一状态s下采取动作a所能获得的期望收益。Q值通常通过目标网络和估计网络来确定，其中目标网络用Q_target表示，估计网络用Q_eva表示，目标网络Q_target输出的Q值是迭代更新后的，估计网络Q_eva输出的Q值是迭代更新前的。Those skilled in the art can understand that in the field of deep reinforcement learning, a new Q value can be obtained in each round of iteration. expected return. The Q value is usually determined by the target network and the estimation network. The target network is represented by Q _target and the estimation network is represented by Q _eva . The Q value output by the target network Q _target is iteratively updated, and the Q value output by the estimation network Q _eva is Before iterative update.

本发明实施例中，可以基于迭代前后的样本状态参数，以及相应的服务放置动作，确定目标网络和估计网络的输出Q值，再结合奖励值构建损失函数。In the embodiment of the present invention, the output Q value of the target network and the estimated network can be determined based on the sample state parameters before and after the iteration and the corresponding service placement actions, and a loss function can be constructed in combination with the reward value.

具体的，本发明的一种实施例中，预设的损失函数可以为：Specifically, in an embodiment of the present invention, the preset loss function may be:

Loss＝E[(r_n+1+γQ_target(s_n+1,argmax_a′(Q_eva(s_n+1,a′)))-Q_eva(s_n,a_n))²]Loss=E[(r _n+1 +γQ _target (s _n+1 ,argmax _a′ (Q _eva (s _n+1 ,a′)))-Q _eva (s _n ,a _n )) ² ]

其中，E[]表示数学函数，r_n+1表示第n+1轮迭代的奖励值，γ表示衰减因子，Q_target表示目标网络，Q_eva表示估计网络，s_n表示第n轮迭代后的样本状态参数，a_n表示第n轮迭代输出的服务放置动作，a′表示使估计网络输出最大值的服务放置动作。Among them, E[] represents the mathematical function, r _n+1 represents the reward value of the n+1 round of iteration, γ represents the decay factor, Q _target represents the target network, Q _eva represents the estimation network, and s _n represents the nth round of iterations. Sample state parameters, an _n represents the service placement action output by the nth iteration, and a′ represents the service placement action that maximizes the estimated network output.

本发明实施例中，每轮迭代前后的样本状态参数，服务放置动作以及奖励值作为损失函数中的变量，进而训练神经网络模型，为了加快收敛速度，提高网络模型的准确度，在后续迭代过程中，可以联合之前多步的数据进行训练。例如，针对第三轮迭代，可以将第一轮迭代的样本状态参数，服务放置动作，奖励值以及第二轮迭代的样本状态参数，服务放置动作，奖励值均作为训练数据。即损失函数中可以综合考虑前几轮迭代的数据。In the embodiment of the present invention, the sample state parameters before and after each round of iteration, the service placement action and the reward value are used as variables in the loss function, and then the neural network model is trained. In order to speed up the convergence speed and improve the accuracy of the network model, in the subsequent iteration process , the data from previous steps can be combined for training. For example, for the third iteration, the sample state parameters, service placement actions, and reward values of the first iteration, and the sample state parameters, service placement actions, and reward values of the second iteration can be used as training data. That is, the data of the previous iterations can be comprehensively considered in the loss function.

此外，由于在每轮迭代中，目标网络和估计网络输出的Q值的差值能够反映该轮迭代中的数据作为训练数据的参考度，即差值越大，说明该轮迭代的数据更值得训练，因此可以为其设置较大的取样权重。举例来讲，若当前是第5轮迭代，若选取第2-4轮迭代的数据进行训练，针对第2-4轮迭代，若第3轮迭代中，目标网络和估计网络输出的Q值的差值最大，则可以为第3轮迭代的数据设置较大的取样权重。In addition, because in each round of iteration, the difference between the Q values output by the target network and the estimated network can reflect the data in this round of iterations as a reference for training data, that is, the larger the difference, the more worthy the data in this round of iterations. training, so you can set larger sampling weights for it. For example, if the current iteration is the 5th round, if the data from the 2nd to 4th iterations are selected for training, for the 2nd to 4th iterations, if in the 3rd iteration, the difference between the Q value output by the target network and the estimated network is If the difference is the largest, a larger sampling weight can be set for the data of the third iteration.

本发明实施例中，可以基于多步联合训练，以及设置取样权重两个方面来改进损失函数，具体的，基于上述两个方面进行改进后的损失函数可以为：In the embodiment of the present invention, the loss function can be improved based on two aspects of multi-step joint training and setting sampling weights. Specifically, the improved loss function based on the above two aspects can be:

其中，L表示改进后的损失函数，E[]表示数学期望，n表示每次迭代所参考的历史迭代数据的组数，t表示时刻，

表示t时刻之后n组历史迭代数据的优先级权重，

可以根据历史迭代数据中目标网络和估计网络输出的Q值的差值来确定，即Q值的差值越大，则该轮迭代中历史迭代数据的优先级权重越大，

表示t时刻后n次迭代的奖励值之和，

表示针对t时刻后n次迭代奖励值的衰减因子，

的值可以根据实际情况进行设定，Q_target表示目标网络，Q_eva表示估计网络，s_t表示时刻t的样本状态参数，a_t表示时刻t的服务放置动作，s_t+n表示迭代n次后的样本状态参数，a′表示使估计网络输出最大值的服务放置动作，k表示迭代序号，

表示针对t时刻后第k次迭代奖励值的衰减因子，r_t+k+1表示t时刻后第k次迭代的奖励值。Among them, L represents the improved loss function, E[] represents the mathematical expectation, n represents the number of groups of historical iteration data referenced in each iteration, t represents the time,

It can be determined according to the difference between the Q value output by the target network and the estimated network in the historical iteration data, that is, the greater the difference in the Q value, the greater the priority weight of the historical iteration data in this round of iteration.

represents the sum of the reward values of n iterations after time t,

represents the decay factor of the reward value for n iterations after time t,

The value of can be set according to the actual situation, Q _target represents the target network, Q _eva represents the estimated network, s _t represents the sample state parameter at time t, at _t represents the service placement action at time t, and s _t+n represents n iterations After the sample state parameters, a' represents the service placement action that maximizes the estimated network output, k represents the iteration number,

可见，改进后的损失函数考虑了先前迭代产生的多组历史迭代数据，并基于先前每轮迭代中Q值的差值来设定优先级权重，能够更针对性的对神经网络进行训练，从而加快收敛速度，并提高网络模型的准确度。It can be seen that the improved loss function considers multiple sets of historical iteration data generated by previous iterations, and sets the priority weight based on the difference between the Q values in each previous iteration, which can train the neural network in a more targeted manner. Accelerates convergence and improves the accuracy of network models.

S206：根据损失值确定神经网络模型是否收敛，否则执行S207，是则执行S208。S206: Determine whether the neural network model converges according to the loss value, otherwise, perform S207, and if yes, perform S208.

当损失值不超过预设的损失阈值时，可以认为神经网络模型已收敛。此外，也可以预设最大迭代次数，当达到最大迭代次数时，也可以认为神经网络模型已收敛，对此不做限定。When the loss value does not exceed the preset loss threshold, the neural network model can be considered to have converged. In addition, the maximum number of iterations may also be preset, and when the maximum number of iterations is reached, the neural network model may also be considered to have converged, which is not limited.

S207：调整神经网络模型中的参数值，返回执行步骤S202。S207: Adjust the parameter values in the neural network model, and return to step S202.

当损失值显示神经网络模型未收敛时，则调整参数值，返回步骤S202，开始新一轮的迭代训练。When the loss value shows that the neural network model has not converged, adjust the parameter value, return to step S202, and start a new round of iterative training.

S208：将当前的神经网络模型确定为资源均衡优化模型。S208: Determine the current neural network model as the resource balance optimization model.

基于相同的发明构思，根据上述基于深度强化学习的资源分配方法实施例，本发明实施例还提供了一种基于深度强化学习的资源分配方法，参见图3，可以包括以下模块：Based on the same inventive concept, according to the above-mentioned embodiment of the resource allocation method based on deep reinforcement learning, the embodiment of the present invention also provides a resource allocation method based on deep reinforcement learning, referring to FIG. 3 , which may include the following modules:

第一确定模块301，用于确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；A first determining module 301, configured to determine a variety of services of resources to be allocated included in the application request of the user, and the allocation priority of each service;

第二确定模块302，用于确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；The second determination module 302 is configured to determine the state parameters of the current edge micro-cloud system, where the state parameters include a resource balance evaluation parameter, a response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro-cloud;

输入模块303，用于将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点；第一服务为当前分配优先级最高的服务；资源均衡优化模型是基于深度强化学习训练完成的，其中，深度强化学习的训练集包括：边缘微云系统的样本状态参数；The input module 303 is used to input the state parameters into the pre-trained resource balancing optimization model to obtain the first target computing node of the first service; the first service is the service with the highest allocation priority currently; the resource balancing optimization model is based on deep reinforcement The learning and training are completed, wherein the training set of deep reinforcement learning includes: the sample state parameters of the edge micro-cloud system;

部署模块304，用于将第一服务部署于第一目标计算节点；a deployment module 304, configured to deploy the first service on the first target computing node;

更新模块305，用于更新状态参数，并触发输入模块，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。The updating module 305 is configured to update the state parameters and trigger the input module until the resource allocation is completed for each service of the resources to be allocated included in the application request.

在本发明的一种实施例中，在图3所示装置基础上，还可以包括计算模块，计算模块用于基于如下公式计算资源均衡度评估参数：In an embodiment of the present invention, on the basis of the device shown in FIG. 3 , a calculation module may be further included, and the calculation module is configured to calculate the resource balance degree evaluation parameter based on the following formula:

其中，

表示第i个微云中第j个计算节点中第d类资源的资源利用率，

表示第i个微云中第j个计算节点的资源均衡率，RUBD_i表示第i个微云的资源利用均衡度，L_i表示第i个微云中计算节点的总数，RUBD_Total表示边缘微云系统的资源均衡度评估参数，K表示边缘微云系统中微云的总数；in,

Represents the resource balance rate of the jth computing node in the ith microcloud, RUBD _i represents the resource utilization balance of the _ith microcloud, Li represents the total number of computing nodes in the ith microcloud, and RUBD _Total represents the edge microcloud The resource balance evaluation parameter of the cloud system, K represents the total number of micro-clouds in the edge micro-cloud system;

基于如下公式计算响应延迟评估参数：The response delay evaluation parameter is calculated based on the following formula:

t_Total＝T_Comp(X)+T_TR(X)t _Total = T _Comp (X)+T _TR (X)

在本发明的一种实施例中，在图3所示装置的基础上，还可以包括训练模块，训练模块用于按照如下步骤训练资源均衡优化模型：In an embodiment of the present invention, on the basis of the device shown in FIG. 3 , a training module may also be included, and the training module is used to train the resource balancing optimization model according to the following steps:

获取预设的神经网络模型和训练集；Obtain preset neural network models and training sets;

将样本状态参数输入神经网络模型，得到服务放置动作；服务放置动作表示为样本服务确定所放置的目标计算节点；Input the sample state parameters into the neural network model to obtain the service placement action; the service placement action represents the target computing node to be placed for the sample service;

基于服务放置动作，对样本状态参数进行更新，得到更新后的样本状态参数；Based on the service placement action, update the sample state parameters to obtain the updated sample state parameters;

基于样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，以及更新后的样本状态参数中包含的资源均衡度评估参数，响应延迟评估参数，计算本次服务放置动作的奖励值；Based on the resource balance evaluation parameters and response delay evaluation parameters included in the sample state parameters, and the resource balance evaluation parameters and response delay evaluation parameters included in the updated sample state parameters, calculate the reward value of this service placement action;

将样本状态参数，更新后的样本状态参数，本次服务放置动作，以及本次服务放置动作的奖励值，代入预设的损失函数，计算本次服务放置动作的损失值；Substitute the sample state parameters, the updated sample state parameters, the current service placement action, and the reward value of this service placement action into the preset loss function to calculate the loss value of this service placement action;

根据损失值确定神经网络模型是否收敛；Determine whether the neural network model converges according to the loss value;

若否，则调整神经网络模型中的参数值，并返回将更新后的样本状态参数输入神经网络模型，得到服务放置动作的步骤；If not, adjust the parameter values in the neural network model, and return to the steps of inputting the updated sample state parameters into the neural network model to obtain the service placement action;

在本发明的一种实施例中，损失函数可以为：In an embodiment of the present invention, the loss function may be:

表示t时刻之后n组历史迭代数据的优先级权重，

表示t时刻后n次迭代的奖励值之和，

represents the sum of the reward values of n iterations after time t,

应用本发明实施例提供的基于深度强化学习的资源分配装置，可以确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点，将第一服务部署于第一目标计算节点；更新状态参数，返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。从而综合考虑了响应延迟，资源分配均衡度，并采用深度强化学习的方式训练网络模型，相比于传统的资源分配方法，既能够满足通信延迟需求，又能达到较高的资源利用均衡度。By applying the resource allocation device based on deep reinforcement learning provided by the embodiment of the present invention, it is possible to determine a variety of services of resources to be allocated contained in the user's application request, and the allocation priority of each service; determine the current state of the edge micro cloud system Parameters, state parameters include resource balance evaluation parameters, response delay evaluation parameters, and the remaining resources of each computing node in each micro cloud; input the state parameters into the pre-trained resource balance optimization model to get the first service of the first service. The target computing node, deploying the first service on the first target computing node; updating the state parameter, and returning to the step of inputting the state parameter into the pre-trained resource balancing optimization model, until the service of each resource to be allocated included in the application request resource allocation has been completed. Therefore, the response delay and resource allocation balance are comprehensively considered, and the deep reinforcement learning method is used to train the network model. Compared with the traditional resource allocation method, it can not only meet the communication delay requirements, but also achieve a higher resource utilization balance.

基于相同的发明构思，根据上述基于深度强化学习的资源分配方法实施，本发明实施例提供了一种电子设备，如图4所示，包括处理器401、通信接口402、存储器403和通信总线404，其中，处理器401，通信接口402，存储器403通过通信总线404完成相互间的通信，Based on the same inventive concept, according to the implementation of the above-mentioned resource allocation method based on deep reinforcement learning, an embodiment of the present invention provides an electronic device, as shown in FIG. 4 , including a processor 401 , a communication interface 402 , a memory 403 and a communication bus 404 , wherein, the processor 401, the communication interface 402, and the memory 403 complete the communication with each other through the communication bus 404,

存储器403，用于存放计算机程序；a memory 403 for storing computer programs;

处理器401，用于执行存储器403上所存放的程序时，实现如下步骤：When the processor 401 is used to execute the program stored in the memory 403, the following steps are implemented:

确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；Determine the status parameters of the current edge micro-cloud system, the status parameters include resource balance evaluation parameters, response delay evaluation parameters, and the remaining resources of each computing node in each micro-cloud;

将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点；第一服务为当前分配优先级最高的服务；资源均衡优化模型是基于深度强化学习训练完成的，其中，深度强化学习的训练集包括：边缘微云系统的样本状态参数；Input the state parameters into the pre-trained resource balancing optimization model to obtain the first target computing node of the first service; the first service is the service with the highest priority currently assigned; the resource balancing optimization model is based on deep reinforcement learning training, where , the training set of deep reinforcement learning includes: the sample state parameters of the edge micro-cloud system;

将第一服务部署于第一目标计算节点；deploying the first service on the first target computing node;

更新状态参数，并返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。Update the state parameter, and return to the step of inputting the state parameter into the pre-trained resource balancing optimization model, until the resource allocation is completed for each service of the to-be-allocated resources included in the application request.

上述电子设备提到的通信总线可以是外设部件互连标准(Peripheral ComponentInterconnect，PCI)总线或扩展工业标准结构(Extended Industry StandardArchitecture，EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示，图4中仅用一条粗线表示，但并不表示仅有一根总线或一种类型的总线。The communication bus mentioned in the above electronic device may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an Extended Industry Standard Architecture (Extended Industry Standard Architecture, EISA) bus or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 4, but it does not mean that there is only one bus or one type of bus.

通信接口用于上述电子设备与其他设备之间的通信。The communication interface is used for communication between the above electronic device and other devices.

存储器可以包括随机存取存储器(Random Access Memory，RAM)，也可以包括非易失性存储器(Non-Volatile Memory，NVM)，例如至少一个磁盘存储器。可选的，存储器还可以是至少一个位于远离前述处理器的存储装置。The memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.

上述的处理器可以是通用处理器，包括中央处理器(Central Processing Unit，CPU)、网络处理器(Network Processor，NP)等；还可以是数字信号处理器(Digital SignalProcessing，DSP)、专用集成电路(Application Specific Integrated Circuit，ASIC)、现场可编程门阵列(Field-Programmable Gate Array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above-mentioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; may also be a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.

应用本发明实施例提供的电子设备，可以确定用户的应用程序请求中包含的多种待分配资源的服务，以及各服务的分配优先级；确定当前边缘微云系统的状态参数，状态参数包括资源均衡度评估参数，响应延迟评估参数，以及每个微云中各个计算节点的资源剩余量；将状态参数输入预先训练完成的资源均衡优化模型，得到第一服务的第一目标计算节点，将第一服务部署于第一目标计算节点；更新状态参数，返回将状态参数输入预先训练完成的资源均衡优化模型的步骤，直到应用程序请求中包含的每种待分配资源的服务均完成资源分配。从而综合考虑了响应延迟，资源分配均衡度，并采用深度强化学习的方式训练网络模型，相比于传统的资源分配方法，既能够满足通信延迟需求，又能达到较高的资源利用均衡度。By applying the electronic device provided by the embodiment of the present invention, it is possible to determine a variety of services to be allocated resources included in the user's application program request, and the allocation priority of each service; determine the state parameters of the current edge micro-cloud system, and the state parameters include resources The balance degree evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro cloud; input the status parameter into the pre-trained resource balance optimization model to obtain the first target computing node of the first service, and put the first target computing node of the first service. A service is deployed on the first target computing node; the state parameter is updated, and the step of inputting the state parameter into the pre-trained resource balancing optimization model is returned, until the resource allocation is completed for each service of the to-be-allocated resources included in the application request. Therefore, the response delay and resource allocation balance are comprehensively considered, and the deep reinforcement learning method is used to train the network model. Compared with the traditional resource allocation method, it can not only meet the communication delay requirements, but also achieve a higher resource utilization balance.

基于相同的发明构思，根据上述基于深度强化学习的资源分配方法实施，在本发明提供的又一实施例中，还提供了一种计算机可读存储介质，该计算机可读存储介质内存储有计算机程序，所述计算机程序被处理器执行时实现上述任一基于深度强化学习的资源分配方法步骤。Based on the same inventive concept, according to the implementation of the above-mentioned resource allocation method based on deep reinforcement learning, in another embodiment provided by the present invention, a computer-readable storage medium is also provided, in which a computer-readable storage medium is stored A program, when the computer program is executed by the processor, implements any of the above-mentioned steps of the resource allocation method based on deep reinforcement learning.

需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any relationship between these entities or operations. any such actual relationship or sequence exists. Moreover, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device that includes a list of elements includes not only those elements, but also includes not explicitly listed or other elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

本说明书中的各个实施例均采用相关的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于上述基于深度强化学习的资源分配装置实施例、电子设备实施例以及计算机可读存储介质实施例而言，由于其基本相似于上述基于深度强化学习的资源分配方法实施例，所以描述的比较简单，相关之处参见上述基于深度强化学习的资源分配方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the above-mentioned embodiments of the resource allocation apparatus based on deep reinforcement learning, the electronic device embodiments, and the computer-readable storage medium embodiments, since they are basically similar to the above-mentioned embodiments of the resource allocation method based on deep reinforcement learning, the description It is relatively simple, and for relevant details, please refer to the partial description of the above-mentioned embodiment of the resource allocation method based on deep reinforcement learning.

以上所述仅为本发明的较佳实施例而已，并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等，均包含在本发明的保护范围内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A resource allocation method based on deep reinforcement learning is applied to a control platform of an edge micro-cloud system, wherein the edge micro-cloud system further comprises a plurality of micro-clouds, each micro-cloud comprises a plurality of computing nodes, and the method comprises the following steps:

determining services of various resources to be allocated contained in an application program request of a user and allocation priority of each service;

determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;

inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;

deploying the first service to the first target computing node;

and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until the resource allocation of each service of the resources to be allocated contained in the application program request is completed.

2. The method of claim 1, wherein the resource balance evaluation parameter is calculated based on the following formula:

wherein, the RUV_i ^jRepresenting the resource utilization variance of the jth computing node in the ith cloudlet, D representing the number of classes of resources,

indicating the resource utilization rate of the d-th type resource in the j-th computing node in the ith micro cloud,

represents the average value of the resource utilization rate of all kinds of resources in the jth computing node in the ith micro cloud, X represents the resource allocation strategy,

representing the resource balance rate, RUBD, of the jth computing node in the ith clout_iIndicating resource utilization balance of the ith cloudlet, L_iRepresenting the total number of compute nodes, RUBD, in the ith clout_TotalRepresenting a resource balance evaluation parameter of the edge micro cloud system, wherein K represents the total number of micro clouds in the edge micro cloud system;

calculating the response delay evaluation parameter based on the following formula:

t_Total＝T_Comp(X)+T_TR(X)

t_Totalindicating a response delay evaluation parameter, T_Comp(X) represents a calculation delay, T_TR(X) represents a transmission delay.

3. The method of claim 1, wherein the resource balancing optimization model is trained by:

acquiring a preset neural network model and the training set;

inputting the sample state parameters into the neural network model to obtain service placement actions; the service placement action represents determining a placed target compute node for a sample service;

updating the sample state parameters based on the service placement action to obtain updated sample state parameters;

calculating an incentive value of the service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the sample state parameter and the resource balance degree evaluation parameter contained in the updated sample state parameter and the response delay evaluation parameter;

substituting the sample state parameter, the updated sample state parameter, the service placing action and the reward value of the service placing action into a preset loss function, and calculating the loss value of the service placing action;

determining whether the neural network model converges according to the loss value;

if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placing action;

and if so, determining the current neural network model as a resource balance optimization model.

4. The method of claim 3, wherein the loss function is:

wherein L represents a loss function, E [ [ alpha ] ]]Representing a mathematical expectation, n representing the number of sets of historical iteration data referenced per iteration, t representing the time of day,

representing the priority weight, r, of n sets of historical iterative data after time t_t ⁽ⁿ⁾Representing the sum of the prize values for n iterations after time t,

representing a decay factor, Q, for a reward value of n iterations after time t_targetRepresenting the target network, Q_evaRepresenting an estimated network, s_tA sample state parameter representing the time t, a_tService Placement action, s, representing time t_t+nRepresents the state parameters of the samples after iteration n times, a' represents the service placement action for making the estimated network output the maximum value, k represents the iteration number,

representing a decay factor, r, for the reward value of the kth iteration after time t_t+k+1Representing the prize value of the kth iteration after time t.

5. A resource allocation device based on deep reinforcement learning is applied to a control platform of an edge micro-cloud system, wherein the edge micro-cloud system further comprises a plurality of micro-clouds, each micro-cloud comprises a plurality of computing nodes, and the device comprises:

the first determining module is used for determining services of various resources to be allocated contained in an application program request of a user and the allocation priority of each service;

the second determining module is used for determining state parameters of the current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;

the input module is used for inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;

a deployment module to deploy the first service to the first target computing node;

and the updating module is used for updating the state parameters and triggering the input module until each service of the resources to be allocated contained in the application program request completes resource allocation.

6. The apparatus of claim 5, further comprising: a calculating module, configured to calculate the resource balance evaluation parameter based on the following formula:

t_Total＝T_Comp(X)+T_TR(X)

7. The apparatus of claim 5, further comprising: a training module, configured to train the resource balancing optimization model according to the following steps:

acquiring a preset neural network model and the training set;

8. The apparatus of claim 7, wherein the loss function is:

9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor and the communication interface are used for realizing mutual communication by the memory through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.