CN118567807A - Edge server task cache model training and caching method and device - Google Patents
- Publication number
- CN118567807A CN118567807A CN202410703429.0A CN202410703429A CN118567807A CN 118567807 A CN118567807 A CN 118567807A CN 202410703429 A CN202410703429 A CN 202410703429A CN 118567807 A CN118567807 A CN 118567807A
- Authority
- CN
- China
- Prior art keywords
- edge server
- network
- value
- experience
- target area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a training and caching method and a device for an edge server task cache model, which relate to the technical field of data processing and comprise the following steps: acquiring an edge server state sample of a target area; for any edge server state sample, inputting the sample into a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after the action sample information is executed and a next state sample of the target area; taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; and training the value network in the preset edge server task cache model according to experience quadruples randomly sampled from the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for training and caching a task cache model of an edge server.
Background
With the continuous development of digital technology, cloud computing services have been widely adopted. In the related art, an edge server executes all types of computing tasks offloaded by users without considering whether the corresponding tasks are cached on the edge server; because the memory capacity and computing capacity of an edge server are limited, the tasks of all users cannot be cached. If redundant tasks are cached on the edge server, its memory resources and computing resources are wasted, which degrades the user experience.
Therefore, how to reasonably perform task buffering of the edge server has become a problem to be solved in the industry.
Disclosure of Invention
The invention provides a training and caching method and device for an edge server task cache model, which are used for solving the problem of how to reasonably cache tasks of an edge server in the prior art.
The invention provides a training method for an edge server task cache model, which comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting the sample into a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after executing the action sample information and a next state sample of the target area;
Taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; wherein the experience playback pool also stores experience quadruples of other areas adjacent to the target area;
Training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
According to the training method for the edge server task cache model provided by the invention, the value network in the preset edge server task cache model is trained according to the experience quadruple randomly sampled in the experience playback pool, and the training method comprises the following steps:
A first value sub-network in the value network performs a first value evaluation on the action of the experience quadruple to obtain a first target Q value that minimizes the Bellman error;
performing a second value evaluation on the action of the experience quadruple by a second value sub-network in the value network to obtain a second target Q value that minimizes the Bellman error;
Calculating a first loss value of the value network according to the first target Q value, and calculating a second loss value of the value network according to the second target Q value;
Respectively carrying out back propagation on a first value sub-network and a second value sub-network in the value network by using the first loss value and the second loss value, and updating network parameters in the first value sub-network and the second value sub-network;
traversing each experience quadruple in the experience playback pool until a second preset iteration stop condition is met, and stopping training of the value network to obtain a trained value network;
Wherein the first and second value subnetworks are two independent Q value networks; and the first value sub-network and the second value sub-network after updating the network parameters are used for evaluating the action sample information in the strategy network in the preset edge server task cache model.
According to the training method for the edge server task cache model provided by the invention, the training method for the strategy network in the preset edge server task cache model comprises the following steps:
Inputting the edge server state sample into a strategy network in a preset edge server task cache model, and outputting action sample information corresponding to the edge server state sample;
evaluating the action sample information through a first value sub-network and a second value sub-network after updating network parameters to obtain a first evaluation Q value and a second evaluation Q value;
Calculating a third loss value of the strategy network according to the minimum value of the first evaluation Q value and the second evaluation Q value and the evaluation value of the edge server state sample;
And under the condition that the third loss value is converged, stopping training by the strategy network to obtain a trained strategy network.
According to the training method for the edge server task cache model provided by the invention, the first preset training condition comprises at least one of the following: the strategy network stops training, the value network stops training, a preset number of training iterations is reached, and a preset training duration is reached.
According to the training method for the edge server task cache model provided by the invention, before the step of training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool, the training method further comprises the following steps:
Acquiring other areas whose distance from the target area is smaller than a preset distance, according to the positioning information of the target area;
and acquiring experience quadruples corresponding to the edge server state samples of the other areas, and storing the experience quadruples of the other areas in the experience playback pool.
According to the method for training the edge server task cache model provided by the invention, before the step of inputting the strategy network in the preset edge server task cache model for any edge server state sample, the method further comprises the following steps:
Initializing network parameters of the strategy network, and initializing the first value sub-network, the second value sub-network and the corresponding first target network and second target network.
The invention also provides a method for caching the edge server task, which comprises the following steps:
acquiring the state information of an edge server of a current area;
Inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting the sample into a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after executing the action sample information and a next state sample of the target area;
Taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; wherein the experience playback pool also stores experience quadruples of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to experience quadruples randomly sampled from the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
The invention provides an edge server task cache model training device, which comprises:
the first acquisition module is used for acquiring an edge server state sample of the target area;
The first output module is used for inputting, for any one of the edge server state samples, the sample into a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after the action sample information is executed and a next state sample of the target area;
The storage module is used for taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; wherein the experience playback pool also stores experience quadruples of other areas adjacent to the target area;
The training module is used for training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
The invention provides an edge server task caching device, which comprises:
the second acquisition module is used for acquiring the state information of the edge server of the current area;
The second output module is used for inputting the state information of the edge server of the current area into the trained edge server task cache model and outputting the edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting the sample into a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after executing the action sample information and a next state sample of the target area;
Taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; wherein the experience playback pool also stores experience quadruples of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to experience quadruples randomly sampled from the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes any one of the training method of the edge server task cache model or the edge server task cache method when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the edge server task cache model training methods or the edge server task cache methods described above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements any one of the edge server task cache model training methods or the edge server task cache methods described above.
According to the edge server task cache model training and caching method and device, the strategy network and the value network in the preset edge server task cache model are trained alternately: the strategy network generates new actions, the value network evaluates the value of these actions and provides feedback to improve the strategy network. In this way, effective strategies can be learned in complex environments while maintaining a balance between exploration and exploitation, and more adequate training samples can be obtained by introducing experience quadruples of other regions adjacent to the target region into the experience playback pool.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method for an edge server task cache model according to an embodiment of the present application;
FIG. 2 is a diagram of an algorithm framework provided by an embodiment of the present application;
FIG. 3 is a flowchart of an algorithm provided in an embodiment of the present application;
Fig. 4 is a schematic flow chart of a task caching method of an edge server according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an edge server task cache model training device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an edge server task caching device according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of a training method for an edge server task cache model according to an embodiment of the present application, as shown in fig. 1, including:
step 110, obtaining an edge server state sample of a target area;
The target area described in the embodiment of the present application may be a specific area defined in advance, and may be an area determined according to a distribution condition of an edge server, or an area divided according to a geographic location. Status samples are collected from edge servers of the target area. These samples reflect the current state of the server, possibly including resource usage, task queue status, network conditions, etc.
In an alternative embodiment, the edge server state samples may be samples carrying edge server task cache policy tags.
Step 120, for any of the edge server state samples, inputting the sample into a policy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after executing the action sample information and a next state sample of the target area;
In the embodiment of the application, each edge server state sample is transmitted as input to a preset policy network π(a|s; φ) with parameters φ. Through interaction with the environment, the task request of each time slot and the cache decision of the previous time slot are collected as the system input s_t, and the policy network outputs the task cache decision a_t.
The policy network outputs action sample information according to the input state sample. In the edge server task cache model, actions may involve deciding which tasks should be cached, which tasks should be pushed to the cloud or other servers, or which tasks should be prioritized.
The action output by the policy network is performed in the actual edge server environment. This may involve actual task scheduling, resource allocation or cache management. After the action is performed, reward information related to the action is collected. Rewards are typically based on the result of performing the action, such as reduced task completion time, increased resource utilization, or reduced energy consumption. After performing the action and collecting the reward, the next state sample of the target area is obtained. This may involve updated server state, including new task arrivals, task completions, or resource state changes.
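As a purely illustrative sketch — the reward terms, function name, and weights below are hypothetical assumptions, not specified by the patent — a reward of this kind might combine latency savings, resource utilization, and energy consumption:

```python
def cache_reward(latency_saved, utilization, energy_used,
                 w_lat=1.0, w_util=0.5, w_energy=0.3):
    # hypothetical weighted reward: higher for latency savings and
    # resource utilization, lower for energy consumption; all weights assumed
    return w_lat * latency_saved + w_util * utilization - w_energy * energy_used

# illustrative per-time-slot measurements, normalized to [0, 1]
r = cache_reward(latency_saved=0.8, utilization=0.6, energy_used=0.5)
```

Any real deployment would define these terms from the measurements the environment actually exposes.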
Step 130, taking the edge server state sample, the action sample information, the reward information and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience playback pool; wherein the experience playback pool also stores experience quadruples of other areas adjacent to the target area;
in embodiments of the present application, an experience playback pool (Experience Replay Buffer, ERB) is an important component that allows agents to store and reuse past experiences.
The current state sample, the action performed, the reward collected, and the next state sample are combined into an experience quadruple (s_t, a_t, r_t, s_{t+1}), where s_t is the state sample at time step t, a_t is the action taken in state s_t, r_t is the reward obtained after taking action a_t, and s_{t+1} is the next state sample at time step t+1.
The experience quadruple (s_t, a_t, r_t, s_{t+1}) is stored in the experience playback pool. Experience playback pools typically have a maximum capacity; when the capacity limit is reached, the oldest experience is replaced by the new experience.
The experience playback pool stores experience quadruples not only of the target area but also of other areas adjacent to the target area. The purpose of this is to allow the agent to learn the interactions and synergies between the different regions.
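As an illustrative sketch (the class name and capacity value are assumptions, not taken from the patent), the experience playback pool described above can be implemented as a fixed-capacity buffer that evicts the oldest quadruple when full and accepts quadruples from the target area and adjacent areas alike:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience playback pool shared across regions."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest experience automatically
        # once the capacity limit is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # store one experience quadruple (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation
        # of consecutive transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# quadruples from the target area and from adjacent areas go into the
# same pool, so the value network can sample across regions
pool = ReplayBuffer(capacity=3)
for t in range(5):
    pool.push(f"s{t}", f"a{t}", float(t), f"s{t + 1}")
batch = pool.sample(2)
```

With `capacity=3`, pushing five quadruples leaves only the three newest in the pool, matching the replace-oldest behavior described above.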
Step 140, training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
In an embodiment of the application, when training the value network, a batch of experience quadruples is randomly sampled from the experience playback pool. Random sampling helps break the correlation of the time-series data and improves the stability of the learning process.
The value network is trained with the sampled experience quadruples, and the trained value network in turn participates in training the policy network; through continuous iteration, the training of the edge server task cache model is effectively realized.
In an embodiment of the application, the policy network and the value network in the preset edge server task cache model are trained alternately: the policy network generates new actions, the value network evaluates the value of these actions and provides feedback to improve the policy network. In this way, effective strategies can be learned in complex environments while maintaining a balance between exploration and exploitation, and more adequate training samples can be obtained by introducing experience quadruples of other regions adjacent to the target region into the experience playback pool.
Optionally, training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool includes:
A first value sub-network in the value network performs a first value evaluation on the action of the experience quadruple to obtain a first target Q value that minimizes the Bellman error;
performing a second value evaluation on the action of the experience quadruple by a second value sub-network in the value network to obtain a second target Q value that minimizes the Bellman error;
Calculating a first loss value of the value network according to the first target Q value, and calculating a second loss value of the value network according to the second target Q value;
Respectively carrying out back propagation on a first value sub-network and a second value sub-network in the value network by using the first loss value and the second loss value, and updating network parameters in the first value sub-network and the second value sub-network;
traversing each experience quadruple in the experience playback pool until a second preset iteration stop condition is met, and stopping training of the value network to obtain a trained value network;
Wherein the first and second value subnetworks are two independent Q value networks; and the first value sub-network and the second value sub-network after updating the network parameters are used for evaluating the action sample information in the strategy network in the preset edge server task cache model.
In the present embodiment, the value network (the Critic part) comprises two independent Q-value networks, generally denoted as a first value sub-network Q1 and a second value sub-network Q2. The two networks learn the value functions of the state-action pairs, namely Q1(s, a; θ1) and Q2(s, a; θ2), where θ1 and θ2 are the parameters of the two networks, respectively.
To increase the stability of the learning process, each Q-value network has a corresponding target network, denoted as the first target network Q_target,1 and the second target network Q_target,2, respectively. The parameters θ1_target and θ2_target of the target networks are moving averages of the primary network parameters; because they update more slowly, they smooth the learning process.
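The moving-average update of the target parameters can be sketched as follows (the smoothing coefficient tau and the parameter values are illustrative; the patent does not specify them):

```python
def soft_update(theta_main, theta_target, tau):
    # theta_target <- tau * theta_main + (1 - tau) * theta_target
    # a small tau makes the target network trail the main network slowly,
    # which stabilizes the Bellman targets used during training
    return [tau * m + (1.0 - tau) * t
            for m, t in zip(theta_main, theta_target)]

theta1 = [1.0, -2.0]          # main Q-network parameters (illustrative)
theta1_target = [0.0, 0.0]    # corresponding target-network parameters
theta1_target = soft_update(theta1, theta1_target, tau=0.5)
```

In practice tau is chosen much smaller (e.g. on the order of 0.005) so that the target network changes only gradually between updates.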
The goal is to train the Q-value network parameters to minimize the Bellman error J_Q(θ):

J_Q(θ) = E_{(s_t, a_t)∼D}[ (1/2) ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ·V_θ(s_{t+1}) ) )² ]

where E represents the expectation, and the state value function V_θ(s_t) is expressed as:

V_θ(s_t) = π(s_t)^T [ Q(s_t) − α·ln π(s_t) ]

where π(s_t) represents the vector of probabilities of each action in state s_t, and Q(s_t) represents the vector of Q values of each action in state s_t, so that the value of V_θ(s_t) can be calculated directly.
The Critic part has two independent Q-value networks Q1(s, a; θ1) and Q2(s, a; θ2), and each Q-value network corresponds to a target Q-value network Q_target,1 and Q_target,2, whose role is to stabilize the learning of the Q-value networks. The minimum of the outputs of the two Q-value networks is used as the Q-value output of the Critic part, giving the target Q value:

Y = r_t + γ·π(s_{t+1})^T [ min( Q_target,1(s_{t+1}), Q_target,2(s_{t+1}) ) − α·ln π(s_{t+1}) ]
Next, this target Q value Y is used to calculate the loss function of the Critic network, typically using the mean square error (MSE) as the loss function:

L = (1/2) ( Q1(s, a; θ1) − Y )² + (1/2) ( Q2(s, a; θ2) − Y )²

This loss function L measures the difference between the predictions of the two Q-value networks and the target Q value Y.
The network parameters in the first value sub-network and the second value sub-network are then updated according to the loss values, and new experience quadruples are continuously sampled at random from the experience playback pool for training, until a preset number of training iterations is reached or the model converges, yielding a trained value network.
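Under the discrete-action formulation above, the target Q value Y and the Critic loss L can be computed as sketched below; the two-action probabilities, Q values, and coefficients are illustrative numbers, not values from the patent:

```python
import math

def state_value(pi, q_min, alpha):
    # V(s) = pi(s)^T [ Q(s) - alpha * ln pi(s) ], with Q(s) taken as the
    # elementwise minimum of the two target Q-value networks
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_min))

def critic_loss(q1, q2, y):
    # L = 1/2 (Q1(s,a) - Y)^2 + 1/2 (Q2(s,a) - Y)^2
    return 0.5 * (q1 - y) ** 2 + 0.5 * (q2 - y) ** 2

pi_next = [0.5, 0.5]        # pi(.|s_{t+1}) from the policy network
q_min_next = [1.0, 3.0]     # per-action min of the two target Q networks
alpha, gamma, r_t = 0.2, 0.99, 1.0

v_next = state_value(pi_next, q_min_next, alpha)
y = r_t + gamma * v_next    # target Q value Y
loss = critic_loss(q1=2.0, q2=2.5, y=y)
```

Each sub-network's loss term is then back-propagated to update θ1 and θ2, as described above.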
In the embodiment of the application, the two value sub-networks Q1 and Q2 learn independently and complement each other, which helps reduce the overestimation problem and improves the stability of the value estimates. Finally, the trained value network can provide accurate action-value evaluations for the policy network, thereby guiding the policy network to output a better task caching strategy.
Optionally, the training method of the policy network in the preset edge server task cache model includes:
Inputting the edge server state sample into a strategy network in a preset edge server task cache model, and outputting action sample information corresponding to the edge server state sample;
evaluating the action sample information through a first value sub-network and a second value sub-network after updating network parameters to obtain a first evaluation Q value and a second evaluation Q value;
Calculating a third loss value of the strategy network according to the minimum value of the first evaluation Q value and the second evaluation Q value and the evaluation value of the edge server state sample;
And under the condition that the third loss value is converged, stopping training by the strategy network to obtain a trained strategy network.
In the embodiment of the application, the goal of algorithm optimization is to find the optimal edge server task cache policy π*:

π* = argmax_π E_{τ∼τ_π}[ Σ_{t=0}^{T} γ^t ( r(s_t, a_t) + α·H(π(·|s_t)) ) ]

where E represents the expectation, T is the number of time steps, r(s_t, a_t) is the reward obtained by the agent for executing action a_t in state s_t, γ ∈ [0, 1] is the discount factor, and s_t, a_t are the agent's state and action at time t, respectively. τ_π is the trajectory distribution under policy π. α is the temperature coefficient, which determines the importance of the entropy term relative to the reward. H(π(·|s_t)) = E_{a∼π}[ −ln π(a|s_t) ] represents the entropy of the policy in state s_t.
In the embodiment of the application, the policy network (Actor part) outputs the action sample information a_t after observing the edge server state sample s_t; after executing a_t, it collects the reward r_t and observes the next state s_{t+1}. The experience quadruple (s_t, a_t, r_t, s_{t+1}) is then collected and stored in the experience playback pool D.
For each state-action pair (s_t, a_t), Q1 and Q2 calculate a first evaluation Q value Q1(s_t, a_t; θ1) and a second evaluation Q value Q2(s_t, a_t; θ2), respectively.
The minimum of the two evaluation Q values is selected as the evaluation value of the action, which increases stability and reduces estimation error: Q_min = min( Q1(s_t, a_t; θ1), Q2(s_t, a_t; θ2) ). The evaluation value of the edge server state sample may be the action probability distribution determined by the policy network π(a|s; φ).
Combining the minimum evaluation Q value with the evaluation value of the state sample, the third loss value L_Actor of the policy network is calculated. In the SAC algorithm, this loss function typically includes an entropy term to encourage exploration, as follows: L_Actor = α·log π(a_t|s_t; φ) − Qmin(s_t, a_t), where α is a coefficient that adjusts the importance of the entropy term.
The third loss value L_Actor is back-propagated to update the parameters φ of the policy network. The policy network continues to be updated, and after each iteration it is checked whether the third loss value has converged, i.e. whether the loss value is stable within a small range. Training of the policy network stops when the third loss value converges or another preset stopping condition is met.
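The actor update described above can be sketched as follows. This is a minimal illustration, not the patent's concrete implementation; the batch layout and the value of α are assumptions.

```python
import numpy as np

def actor_loss(log_prob, q1, q2, alpha=0.2):
    """SAC-style actor loss: alpha * log pi(a|s) - min(Q1, Q2), averaged over a batch."""
    q_min = np.minimum(q1, q2)  # clipped double-Q evaluation of the sampled actions
    return float(np.mean(alpha * log_prob - q_min))
```

Minimizing this loss pushes the policy toward actions with high Qmin while the entropy term keeps the policy stochastic, which is the exploration incentive mentioned above.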
Fig. 2 is an algorithm framework diagram provided by the embodiment of the present application. As shown in Fig. 2, the framework includes a policy network (Actor), a value network (Critic), and an experience playback pool. After the policy network obtains the environmental status information, an experience quadruple is generated and stored in the experience playback pool, from which the value network randomly samples; adjacent edge servers also share the experience quadruples in the experience playback pool.
FIG. 3 is a flowchart of an algorithm provided in an embodiment of the present application. As shown in FIG. 3, the training processes of the policy network (Actor) and the value network (Critic) are performed alternately to form an iterative optimization loop:
Initializing: the policy network π(a|s; φ) is initialized. The two value networks Q1(s, a; θ1) and Q2(s, a; θ2) are initialized, together with their corresponding target networks Q_target,1 and Q_target,2.
Collecting data: the agent performs an action in the environment according to the current policy network pi, collecting empirical data (s, a, r, s'). The experience data is stored in an experience playback pool.
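A minimal experience playback pool matching this description might look like the following; the capacity and the quadruple layout are illustrative assumptions.

```python
import random
from collections import deque

class ReplayPool:
    """Experience playback pool storing (s, a, r, s_next) quadruples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted when full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation between experiences
        return random.sample(list(self.buffer), batch_size)
```

Neighboring edge servers sharing a pool would simply call `store` on the same instance, which is one way to realize the shared experience quadruples described above.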
Training value network (Critic): a batch of experience quadruples is randomly sampled from an experience playback pool. And calculating target values Y of the two value networks by using the sampled data, and taking the minimum value output by the two target networks. A loss function (e.g., mean square error) of the two value networks is calculated. Back propagation is performed, updating the network parameters θ1 and θ2 of Q1 and Q2.
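The target value Y, taking the minimum output of the two target networks, can be sketched as below; the γ and α defaults and the entropy-regularized (soft) Bellman form are assumptions in line with the SAC description above.

```python
import numpy as np

def critic_target(r, q1_next, q2_next, log_prob_next, done, gamma=0.99, alpha=0.2):
    """Soft Bellman target: y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    q_next = np.minimum(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * (1.0 - done) * q_next
```

Both Q1 and Q2 are then regressed toward this same target y, e.g. with a mean-square-error loss as stated above.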
Updating the target network: the parameters of the target network are updated using a slow update rule, such as an exponential moving average: θ1,target ← τ·θ1 + (1 − τ)·θ1,target, where τ is a small positive number, such as 0.005.
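The exponential-moving-average (Polyak) update can be sketched as follows, with flat lists of floats standing in for network weight tensors:

```python
def soft_update(theta_target, theta, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(theta_target, theta)]
```

Because τ is small, the target network tracks the online network slowly, which stabilizes the regression targets used in the critic step.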
Training the policy network (Actor): new actions are sampled using the current policy network π and executed to obtain new empirical data. The value of each sampled action is evaluated using the updated value networks Q1 and Q2. The loss function of the policy network, which includes an entropy term to encourage exploration, is calculated. Back propagation is performed, updating the parameters φ of the policy network.
The steps of training the value network, updating the target network, and training the policy network are repeated, with the value network and the policy network trained alternately. In each iteration, the policy network attempts to find better action policies, while the value network provides an assessment of these policies.
The performance of the agent is evaluated periodically during the training process. And adjusting the learning rate, the temperature parameter alpha or other super parameters according to the performance feedback.
By this way of alternating training, a balance can be struck between exploration and utilization, so that effective strategies are learned in complex environments. This process is adaptive and the agent's strategy will gradually improve as empirical data accumulates. Therefore, the task cache strategy of the edge server is effectively realized.
Optionally, before the step of training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool, the method further includes:
Acquiring other areas with the distance smaller than a preset distance from the target area according to the positioning information of the target area;
and acquiring experience quaternions corresponding to the edge server state samples of the other areas, and storing the experience quaternions of the other areas in the experience playback pool.
In the embodiment of the application, firstly, the geographic position or the network topology position of the target area is determined, and all other areas with the distance smaller than the preset distance with the target area are searched and identified according to the positioning information of the target area. The preset distance may be a geographic distance, or may be other relevant measures such as network delay.
From these neighboring areas, a state sample of the edge server is obtained. The samples may include information about resource usage of the server, task queue status, network conditions, etc., and for each neighboring state sample, actions are performed and corresponding rewards information is collected and the next state sample is collected to construct an empirical quadruple.
The experience quadruples of these neighboring regions are stored in an experience playback pool. By doing so, the agent can learn the interaction and synergistic effect between different areas.
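Selecting the neighboring regions by a preset distance could be sketched as follows; the Euclidean metric and the region layout are illustrative assumptions (as noted above, the measure may equally be network delay or another metric):

```python
import math

def neighbor_regions(target_pos, region_positions, max_dist):
    """Return ids of regions whose distance to the target is below max_dist; their
    experience quadruples would then be merged into the shared experience playback pool."""
    tx, ty = target_pos
    return [rid for rid, (x, y) in region_positions.items()
            if math.hypot(x - tx, y - ty) < max_dist]
```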
Fig. 4 is a flow chart of a task buffering method of an edge server according to an embodiment of the present application, as shown in fig. 4, including:
step 410, obtaining the state information of the edge server in the current area;
Step 420, inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
In the embodiment of the application, the preprocessed state information of the edge server of the current area is used as input and provided for a trained task cache model of the edge server.
And the model outputs the task caching strategy of the edge server of the current area by using the learned strategy according to the input state information.
The edge server task caching strategy in the embodiment of the application includes specific strategies for task caching, offloading and migration.
And applying the task cache strategy output by the model to actual edge server management, and executing corresponding task scheduling and resource allocation.
More specifically, the training steps of the edge server task cache model in the embodiments of the present application are detailed in the above embodiments, and are not described herein.
The edge server task caching device and the edge server task caching model training device provided by the invention are described below, and the edge server task caching device and the edge server task caching model training device described below and the edge server task caching method and the edge server task caching model training method described above can be correspondingly referred to each other.
Fig. 5 is a schematic structural diagram of an edge server task cache model training device according to an embodiment of the present application, where, as shown in fig. 5, the training device includes:
The first obtaining module 510 is configured to obtain an edge server state sample of a target area;
The first output module 520 is configured to input, for any one of the edge server state samples, a policy network in a preset edge server task cache model, output action sample information corresponding to the edge server state sample, and collect rewarding information obtained after the action sample information is executed and a next state sample of a target area;
The storage module 530 is configured to store the edge server state sample, the action sample information, the rewards information, and the next state sample of the target area as an experience quadruple of the target area, and store the experience quadruple of the target area in an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
The training module 540 is configured to train the value network in the preset edge server task cache model according to the experience quadruple that randomly samples in the experience playback pool until a first preset training condition is satisfied, and stop training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
Fig. 6 is a schematic structural diagram of an edge server task buffering device according to an embodiment of the present application, as shown in fig. 6, including:
The second obtaining module 610 is configured to obtain edge server status information of the current area;
The second output module 620 is configured to input the edge server state information of the current area into a trained edge server task cache model, and output an edge server task cache policy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
In the embodiment of the application, the strategy network generates new actions through the alternate training of the strategy network and the value network in the preset edge server task cache model, and the value network evaluates the values of the actions and provides feedback to improve the strategy network. In this way, effective strategies can be learned in complex environments while maintaining a balance between exploration and utilization, and more adequate training samples can be obtained by introducing empirical quadruples of other regions adjacent to the target region in the empirical playback pool.
Fig. 7 is a schematic structural diagram of an electronic device according to the present invention, and as shown in fig. 7, the electronic device may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform an edge server task cache model training method or an edge server task cache method, the edge server task cache model training method comprising: acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
The edge server task caching method comprises the following steps:
acquiring the state information of an edge server of a current area;
Inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, where the computer program, when executed by a processor, can perform the edge server task caching method provided by the methods above, and the edge server task cache model training method includes: acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
The edge server task caching method comprises the following steps:
acquiring the state information of an edge server of a current area;
Inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
In still another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the edge server task cache method provided by the above methods, the edge server task cache model training method includes: acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
The edge server task caching method comprises the following steps:
acquiring the state information of an edge server of a current area;
Inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (11)
1. The edge server task cache model training method is characterized by comprising the following steps of:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training a value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, and stopping training to obtain a trained edge server task cache model;
The trained edge server task cache model is used for outputting an edge server task cache strategy of the target area according to the edge server state of the target area.
2. The edge server task cache model training method according to claim 1, wherein training the value network in the preset edge server task cache model according to the experience quadruple that randomly samples in the experience playback pool comprises:
a first value sub-network in the value network performs a first value evaluation on the action of the experience quadruple to obtain a first target Q value which minimizes the Bellman error;
a second value sub-network in the value network performs a second value evaluation on the action of the experience quadruple to obtain a second target Q value which minimizes the Bellman error;
Calculating a first loss value of the value network according to the first target Q value, and calculating a second loss value of the value network according to the second target Q value;
Respectively carrying out back propagation on a first value sub-network and a second value sub-network in the value network by using the first loss value and the second loss value, and updating network parameters in the first value sub-network and the second value sub-network;
traversing each experience quadruple in the experience playback pool until a second preset iteration stop condition is met, and stopping training of the value network to obtain a trained value network;
Wherein the first and second value subnetworks are two independent Q value networks; and the first value sub-network and the second value sub-network after updating the network parameters are used for evaluating the action sample information in the strategy network in the preset edge server task cache model.
3. The training method of the edge server task cache model according to claim 2, wherein the training method of the policy network in the preset edge server task cache model comprises the following steps:
Inputting the edge server state sample into a strategy network in a preset edge server task cache model, and outputting action sample information corresponding to the edge server state sample;
evaluating the action sample information through a first value sub-network and a second value sub-network after updating network parameters to obtain a first evaluation Q value and a second evaluation Q value;
Calculating a third loss value of the strategy network according to the minimum value of the first evaluation Q value and the second evaluation Q value and the evaluation value of the edge server state sample;
And under the condition that the third loss value is converged, stopping training by the strategy network to obtain a trained strategy network.
4. The edge server task cache model training method of claim 3, wherein the first preset training conditions include at least one of: and stopping training by the strategy network, stopping training by the value network, and meeting the preset training times and the preset training duration.
5. The edge server task cache model training method according to claim 1, further comprising, prior to the step of training the value network in the preset edge server task cache model based on the empirical quadruple of random sampling in the empirical playback pool:
Acquiring other areas with the distance smaller than a preset distance from the target area according to the positioning information of the target area;
and acquiring experience quaternions corresponding to the edge server state samples of the other areas, and storing the experience quaternions of the other areas in the experience playback pool.
6. The method for training an edge server task cache model according to claim 2, further comprising, before said step of inputting a policy network in a preset edge server task cache model for any one of said edge server state samples:
Initializing network parameters of the strategy network, and initializing the first value sub-network, the second value sub-network, and the corresponding first target network and second target network.
7. The edge server task caching method is characterized by comprising the following steps of:
acquiring the state information of an edge server of a current area;
Inputting the state information of the edge server of the current area into a trained edge server task cache model, and outputting an edge server task cache strategy of the current area;
the training method of the trained edge server task cache model comprises the following steps:
Acquiring an edge server state sample of a target area;
for any edge server state sample, inputting a strategy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting rewarding information obtained after executing the action sample information and a next state sample of a target area;
Taking the edge server state sample, the action sample information, the rewarding information and the next state sample of the target area as experience quaternions of the target area, and storing the experience quaternions of the target area into an experience playback pool; wherein the experience playback pool also stores experience quaternions of other areas adjacent to the target area;
Training the value network in the preset edge server task cache model according to the experience quadruple randomly sampled in the experience playback pool until a first preset training condition is met, stopping training, and obtaining a trained edge server task cache model.
8. An edge server task cache model training apparatus, comprising:
a first acquisition module, configured to acquire edge server state samples of a target area;
a first output module, configured to, for any edge server state sample, input the sample into a policy network in a preset edge server task cache model, output action sample information corresponding to the edge server state sample, and collect reward information obtained after the action sample information is executed and a next state sample of the target area;
a storage module, configured to take the edge server state sample, the action sample information, the reward information, and the next state sample of the target area as an experience quadruple of the target area, and store the experience quadruple of the target area into an experience replay pool, wherein the experience replay pool also stores experience quadruples of other areas adjacent to the target area;
a training module, configured to train a value network in the preset edge server task cache model according to experience quadruples randomly sampled from the experience replay pool, and stop training when a first preset training condition is met, to obtain a trained edge server task cache model;
wherein the trained edge server task cache model is used for outputting an edge server task caching strategy for the target area according to the edge server state of the target area.
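The deployment step the apparatus claims describe — mapping an edge-server state to a caching action through the trained policy network — can be illustrated with a minimal linear-softmax policy. The state features, weights, and action set here are assumptions for illustration, not the patented model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cache_policy(state, weights, actions):
    """Score each caching action for the given state vector and
    return the highest-probability action."""
    scores = [sum(w * s for w, s in zip(row, state)) for row in weights]
    probs = softmax(scores)
    return actions[max(range(len(actions)), key=probs.__getitem__)]

# Hypothetical state: [cache occupancy, request rate, backhaul load]
state = [0.4, 0.9, 0.2]
weights = [[1.0, 2.0, -1.0],   # scores the "cache_task" action
           [-1.0, -2.0, 1.0]]  # scores the "skip" action
action = cache_policy(state, weights, ["cache_task", "skip"])
```

With these example weights, a high request rate pushes the policy toward caching the task; the trained model in the claims would learn such weights from the replayed experience rather than having them hand-set.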
9. An edge server task caching apparatus, comprising:
a second acquisition module, configured to acquire state information of an edge server in a current area;
a second output module, configured to input the state information of the edge server in the current area into a trained edge server task cache model and output an edge server task caching strategy for the current area;
wherein the trained edge server task cache model is trained by:
acquiring edge server state samples of a target area;
for any edge server state sample, inputting the sample into a policy network in a preset edge server task cache model, outputting action sample information corresponding to the edge server state sample, and collecting reward information obtained after the action sample information is executed and a next state sample of the target area;
taking the edge server state sample, the action sample information, the reward information, and the next state sample of the target area as an experience quadruple of the target area, and storing the experience quadruple of the target area into an experience replay pool, wherein the experience replay pool also stores experience quadruples of other areas adjacent to the target area;
training a value network in the preset edge server task cache model according to experience quadruples randomly sampled from the experience replay pool, and stopping training when a first preset training condition is met, to obtain the trained edge server task cache model.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the edge server task cache model training method of any one of claims 1 to 6 or the edge server task caching method of claim 7.
11. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the edge server task cache model training method of any one of claims 1 to 6 or the edge server task caching method of claim 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410703429.0A CN118567807A (en) | 2024-05-31 | 2024-05-31 | Edge server task cache model training and caching method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410703429.0A CN118567807A (en) | 2024-05-31 | 2024-05-31 | Edge server task cache model training and caching method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118567807A true CN118567807A (en) | 2024-08-30 |
Family
ID=92474679
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410703429.0A Pending CN118567807A (en) | 2024-05-31 | 2024-05-31 | Edge server task cache model training and caching method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118567807A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119544744A (en) * | 2024-11-15 | 2025-02-28 | 北京邮电大学 | Internet of Things edge server service caching method and device |
| CN119544744B (en) * | 2024-11-15 | 2025-11-04 | 北京邮电大学 | IoT Edge Server Service Caching Method and Device |
| CN120278215A (en) * | 2025-06-12 | 2025-07-08 | 之江实验室 | Training method and device for policy model, computer equipment and storage medium |
| CN120278215B (en) * | 2025-06-12 | 2025-09-02 | 之江实验室 | Training method, device, computer equipment and storage medium for strategy model |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113568727B (en) | Mobile edge computing task allocation method based on deep reinforcement learning | |
| CN113225377B (en) | IoT edge task offloading method and device | |
| CN118567807A (en) | Edge server task cache model training and caching method and device | |
| CN112579194B (en) | Block chain consensus task unloading method and device based on time delay and transaction throughput | |
| CN110958135A (en) | Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning | |
| CN114090108B (en) | Computing task execution methods, devices, electronic equipment and storage media | |
| CN112954736A (en) | Policy-based computation offload of wireless energy-carrying internet-of-things equipment | |
| CN111669291A (en) | Deployment method of virtualized network service function chain based on deep reinforcement learning | |
| Tao et al. | DRL-driven digital twin function virtualization for adaptive service response in 6G networks | |
| CN114528081B (en) | Task unloading optimization method for mobile edge computing user privacy protection | |
| CN112911647A (en) | Calculation unloading and resource allocation method based on deep reinforcement learning | |
| CN119110348B (en) | A collaborative task offloading and block mining method for mobile edge computing | |
| CN117539648A (en) | Service quality management method and device for electronic government cloud platform | |
| CN118170524B (en) | Task scheduling method, device, equipment, medium and product based on reinforcement learning | |
| Safavifar et al. | Adaptive workload orchestration in pure edge computing: A reinforcement-learning model | |
| CN115942498A (en) | Routing distribution method, device, equipment and storage medium for computing power-aware network | |
| CN116647880B (en) | Base station cooperation edge computing and unloading method and device for differentiated power service | |
| CN119718595A (en) | Cloud computing task scheduling optimization method based on deep reinforcement learning | |
| CN118642778A (en) | A method and device for computing offloading processing | |
| CN118093054A (en) | Edge computing task offloading method and device | |
| CN117829254A (en) | A reinforcement learning method and device for transferring offline strategy to online learning | |
| CN119645510A (en) | Dependency task unloading method, terminal equipment and storage medium | |
| Lu et al. | Caching for edge inference at scale: A mean field multi-agent reinforcement learning approach | |
| CN110312272B (en) | Network service block resource allocation method and storage medium | |
| CN115220818A (en) | Real-time dependency task unloading method based on deep reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||