CN115054923B - Multi-agent model and training system and method - Google Patents
Multi-agent model and training system and method
- Publication number
- CN115054923B (application number CN202210693621.7A)
- Authority
- CN
- China
- Prior art keywords
- agent
- module
- strategy
- sequence
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6027—Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a multi-agent model and a training system and method. An improved MADDPG model is used as the multi-agent model, and a simulation module and an expert module are introduced into the model training process. The multi-agent model outputs a strategy number and a next action sequence, and the expert module corrects and updates the action sequence. This assists flexible communication among the multiple agents and the consideration of global information, makes good use of prior expert experience, improves the training speed, and at the same time yields more stable agents from training.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and relates to agent training, and in particular to a multi-agent model and a training system and method thereof.
Background
Most existing agent training follows the conventional approach: the current observation sequence is input, the next action sequence is output, and the action sequence is then executed in the game script to generate the next state. For example, application number CN201811492348.1 discloses an optimization method, device, terminal equipment and storage medium for training a game agent, based on an action network and a critic network: the model takes the current observation sequence as input to obtain an action sequence, obtains a new observation sequence and the current return after execution in the script, and ends training when the return exceeds a target value. However, most traditional agent training concerns a single agent, whose trained network is relatively stable. The observation space of a single agent is smaller than that of multiple agents, so training is easier, and the single-agent return design is simpler and not suitable for direct migration to multi-agent training.
Improvements to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) training do exist, such as adding separate modules to the MADDPG model to ensure communication between agents. For example, application number CN202111240522.5 discloses a multi-agent deep reinforcement learning algorithm in which a leader network module is added on the basis of the MADDPG model, so that communication between agents is enhanced and training stability and result robustness are increased.
However, existing multi-agent settings have large action and observation spaces, most agents interact with one another to some degree, and the return design is complex and sparse; at the same time, the network must consider both the state of each agent and its communication with the other agents, so the training effect of existing multi-agent methods is far from satisfactory. In addition, because multi-agent training faces larger action and observation spaces, it relies only on large amounts of hardware resources and massive training to learn agent actions, and existing game strategies and expert knowledge are not well used to help network training.
Disclosure of Invention
In view of the current difficulty of effectively training multiple agents, the invention aims to provide a novel multi-agent model with global situation estimation capability.
Another object of the present invention is to provide a multi-agent training system and method for the multi-agent model, which introduces expert experience and voting mechanism based on the improved MADDPG model, and improves the training speed and simultaneously makes the training more stable.
In order to achieve the above purpose, the present invention is realized by adopting the following technical scheme.
The multi-agent model provided by the invention comprises a plurality of agents arranged in parallel. The multi-agent model is based on the traditional MADDPG model (see Ryan Lowe et al., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", arXiv:1706.02275v4 [cs.LG], 14 Mar 2020) and differs from it in the Actor policy network and the Critic evaluation network of each agent: the Actor policy network and the Critic evaluation network of each agent are identical in structure, comprising six fully connected layers with a residual layer introduced after every two fully connected layers, and a ReLU activation function is arranged after each fully connected layer and each residual layer.
The training system of the multi-agent model comprises a simulation module, a multi-agent training module and an expert module which are communicatively connected, wherein the plurality of agents in the multi-agent model form the own side and are added into a confrontation scene of the simulation module;
The simulation module is used for generating a current observation sequence, current environmental returns and accumulated returns of each intelligent agent;
The multi-agent training module is used for outputting a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module;
The expert module is used for selecting a mapping strategy corresponding to one strategy number from the strategy number sequences from the multi-agent training module through a voting strategy as an execution strategy, obtaining a second action sequence according to the execution strategy and the observation sequences of the agents from the simulation module, carrying out weighted average on the first action sequence and the second action sequence to obtain a next action sequence of the multi-agent, and feeding back the next action sequence to the simulation module.
In the invention, a conventional Unity simulation platform (such as the Unity 3D engine) is used as the simulation module. The simulation module comprises a virtual confrontation scene built by imitating a real scene (such as a mountain road, a valley-road map and the like). The plurality of agents in the multi-agent model form the own side and are added into the virtual confrontation scene; the simulation module can set enemy behavior through an automatic control script, both sides can move freely in the scene, and the task goal of the own side is to destroy the enemy through team cooperation, thereby forming a simulated confrontation environment.
The observation sequence of each agent comprises the state information and position information of the current agent, the state information and relative position information of the other own-side agents, the state information and relative position information of the enemy detected by the current agent, the state information and relative position information of the enemy detected by the other own-side agents, and the like (recorded as the observation content). The state information includes speed, blood volume, shell stock and the like. The state information and position information of each agent can be obtained directly from the simulation module.
The environmental return refers to the reward value obtained by each own-side agent during the game between the own side and the enemy in the simulation environment. The current environmental return is the reward value obtained by the current-frame action of each own-side agent, and the accumulated return is the accumulation of the reward values obtained by the historical-frame actions of each own-side agent.
The simulation module further comprises a reward feedback sub-module and a splicing sub-module. The reward feedback sub-module calculates the current environmental return and accumulated return of each own-side agent according to the reward values assigned in advance to designed targets (such as the reward obtained by hitting the enemy). The splicing sub-module splices the observation content of each agent to obtain the observation sequence of each agent.
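As a concrete illustration, the sketch below shows how a splicing sub-module of this kind might concatenate one agent's observation content into a flat observation sequence. The field layout and the NumPy representation are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch of observation splicing, assuming each piece of observation
# content is already a numeric vector (field layout is illustrative).
import numpy as np

def splice_observation(self_state, self_pos, ally_states, ally_rel_pos,
                       enemy_states, enemy_rel_pos):
    """Concatenate the observation content of one agent into a flat sequence."""
    parts = [self_state, self_pos]
    parts += list(ally_states) + list(ally_rel_pos)
    parts += list(enemy_states) + list(enemy_rel_pos)
    return np.concatenate([np.asarray(p, dtype=np.float32) for p in parts])

# e.g. state = (speed, blood volume, shell stock), position = (x, y)
obs_i = splice_observation([3.0, 100.0, 50.0], [120.0, 80.0],
                           [[2.5, 90.0, 40.0]], [[15.0, -10.0]],
                           [[4.0, 60.0, 30.0]], [[300.0, 250.0]])
```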
The multi-agent training module includes training sub-modules corresponding to the plurality of agents. Each training sub-module network is the corresponding agent network, and the Actor policy network of the training sub-module outputs a strategy number corresponding to a formulated strategy while outputting the next action. Each training sub-module takes an observation sequence as input and a strategy number and next action as output; the strategy numbers of all agents are spliced to form the strategy number sequence, and the next actions of all agents are spliced to form the first action sequence. The strategies in the invention are obtained by collecting, analyzing and summarizing existing confrontation experience, situation analysis of military strategy, game strategy analysis and the like, yielding several general strategy rules for directing the agents to cooperate, and a unique strategy number is assigned to each strategy. The training sub-module introduces the residual-network idea to connect the features of every two layers, which reduces the loss of initial information and prevents network degradation, thereby giving the agents stronger classification capability and stability.
In the invention, when the multi-agent system comprises four firepower vehicles and one support vehicle, the strategies are formulated as follows:
(1) Three own-side firepower vehicles form a vertical single-file (long-snake) formation;
(2) Three own-side firepower vehicles form a horizontal single-file (long-snake) formation;
(3) Four own-side firepower vehicles form a vertical single-file (long-snake) formation;
(4) Four own-side firepower vehicles form a horizontal single-file (long-snake) formation;
(5) The own-side firepower vehicles close in on one another to encircle and protect the support vehicle;
(6) The own-side firepower vehicles scout in the four directions east, west, south and north;
(7) Random walk.
The strategy numbers and the corresponding strategies are stored in a storage module. The storage module also stores the observation-sequence threshold range associated with each strategy and the next action of each agent under each strategy, and constructs a list of (strategy number, strategy, observation-sequence threshold range associated with the strategy, next action of each agent under the strategy).
The expert module comprises a strategy formulation sub-module and an action calculation sub-module. The strategy formulation sub-module selects, through a voting strategy, the mapping of one strategy number (i.e., the strategy represented by that strategy number) from the strategy number sequence produced by the multi-agent training module as the execution strategy. The voting strategy uses the conventional Boyer-Moore majority vote algorithm (see Robert S. Boyer and J Strother Moore, "MJRTY—A Fast Majority Vote Algorithm", 1982). Further, the strategy formulation sub-module uses the current observation sequence of each agent from the simulation module to obtain a corresponding strategy number based on the constructed list and adds it to the strategy number sequence output by the multi-agent training module, forming the candidate strategy set. The action calculation sub-module determines the next action of each agent according to the current observation sequence from the simulation module combined with the execution strategy selected by voting, forming the second action sequence, and then performs a weighted average of the first action sequence and the second action sequence to obtain the next action sequence of each agent. In one implementation, the action calculation sub-module averages the corresponding terms of the first action sequence and the second action sequence and takes the term-wise averages as the next action sequence of the multiple agents.
After receiving the next action sequence of the multiple agents, the simulation module executes it on the basis of the current observation sequence to obtain the new observation sequence, environmental return and accumulated return of the next round.
The simulation module, the multi-agent training module and the expert module are communicated through socket processes, so that distributed training can be realized.
The invention further provides a training method of the multi-agent model, which is implemented by using the training system according to the following steps:
s1, initializing a multi-agent training module;
s2, initializing a simulation module;
S3, the simulation module generates a current observation sequence, current environmental returns and accumulated returns of each agent and sends the current observation sequence, the current environmental returns and the accumulated returns to the multi-agent training module and the expert module;
S4, the multi-agent training module outputs a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module and sends the strategy number sequence and the first action sequence to the expert module;
S5, selecting a mapping of one strategy number by the expert module according to the strategy number sequence from the multi-agent training module through a voting strategy as an execution strategy, and simultaneously obtaining a second action sequence according to the execution strategy and the observation sequence of each agent from the simulation module;
Steps S3-S5 are repeated until the accumulated return of any agent exceeds a set threshold or the own side wins outright (i.e., the enemy is destroyed); this round of training then ends, the procedure returns to step S2, each agent returns to its initial state, and training continues until the own side's win rate over the most recent 100 rounds reaches 95%.
In step S1, initializing the multi-agent training module mainly refers to initializing the network parameters of each training sub-module, which can be done by random assignment.
In step S2, initializing the simulation module mainly refers to initializing the state information and position information of each agent, including the initial position, initial direction, initial speed, initial blood volume, initial number of shells and the like of each agent. During simulation, a human-machine interaction interface can be used to visually display the simulation environment and the state and position information of each simulated agent.
In summary, the multi-agent reinforcement learning training method provided by the invention retains the advantages of the traditional MADDPG method, namely centralized training with decentralized execution, and adds a designed expert module, so that agent actions are guided by existing tactical strategies, prior knowledge is used better and faster, and the training network learns beneficial strategies as early as possible.
Compared with the prior art, the invention has the following beneficial effects:
1) The training system and method for the multi-agent model retain the advantages of existing multi-agent training, support centralized training with decentralized execution, assist flexible communication among the agents and the consideration of global information, make good use of existing expert experience, and show good stability while training quickly.
2) The multi-agent training method simplifies the output of the agent training network, using a strategy number and a reduced action sequence as the network output. The strategies come from prior expert experience, so the final action sequence not only contains the actions of individual agents but also guides the overall coordination and macroscopic joint actions of the multiple agents.
Drawings
Fig. 1 is a schematic diagram of a training system of a multi-agent model according to embodiment 2 of the present invention.
FIG. 2 is a schematic diagram of a multi-agent training module.
Fig. 3 is a flowchart of a training method of the multi-agent model provided in embodiment 3 of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, embodiments of the present invention. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
Example 1
The multi-agent model provided in this embodiment includes a plurality of agents arranged in parallel, and the plurality of agents constitute the own side. The multi-agent model of the present invention is based on the conventional MADDPG model (see Ryan Lowe et al., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", arXiv:1706.02275v4 [cs.LG], 14 Mar 2020) and is essentially the same as the conventional MADDPG model, except for the Actor policy network and Critic evaluation network in each agent.
In this embodiment, the Actor policy network and the Critic evaluation network of each agent have the same structure: six fully connected layers and two residual layers, with one residual layer introduced after every two fully connected layers and a ReLU activation function set after each fully connected layer and each residual layer. In this embodiment a convolution layer is used as the residual layer, and the sum of the residual layer's input and output is used as the input of the next fully connected layer. Each agent takes its current observation sequence as input and the next action corresponding to that agent as output.
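The following PyTorch sketch illustrates one reading of this structure (six fully connected layers, two convolutional residual layers inserted after the first two pairs, ReLU after every layer). The layer widths, the 1x1 convolution, and the two-head Actor output for the action and the policy number are assumptions for illustration, not the patent's exact network.

```python
import torch
import torch.nn as nn

class ResidualConv(nn.Module):
    """Residual layer realized with a 1x1 convolution: output = ReLU(conv(x) + x)."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                       # x: (batch, dim)
        y = self.conv(x.unsqueeze(-1)).squeeze(-1)
        return torch.relu(y + x)                # sum of input and output feeds the next FC layer

class ActorNet(nn.Module):
    """Actor policy network: observation sequence -> (next action, policy-number logits)."""
    def __init__(self, obs_dim, act_dim, num_policies=7, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            ResidualConv(hidden),                       # residual layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            ResidualConv(hidden),                       # residual layer 2
        )
        self.action_head = nn.Linear(hidden, act_dim)       # next action
        self.policy_head = nn.Linear(hidden, num_policies)  # policy-number logits

    def forward(self, obs):
        h = self.backbone(obs)
        return torch.tanh(self.action_head(h)), self.policy_head(h)
```

Under this reading, the Critic evaluation network would share the same backbone but take the observation sequence concatenated with the action as input and output a single evaluation value.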
The current observation sequence includes state information and position information of the agent in the current frame, state information and relative position information of other agents on the own side, state information and relative position information of the enemy detected by the current agent, state information and relative position information of the enemy detected by other agents on the own side, and the like (recorded as observation contents). The status information includes velocity, blood volume, projectile inventory, etc.
In this embodiment, the agents are several kinds of designed vehicles (firepower vehicles, support vehicles, etc.). The multi-agent model comprises four firepower vehicles and one support vehicle: the firepower vehicle has a detection range of 1600 m, an ammunition capacity of 50 and a communication range of 3000 m; the support vehicle has a detection range of 1400 m, an ammunition capacity of 500 and a communication range of 3000 m; the initial blood volume of both the firepower vehicle and the support vehicle is 100, and the speed range is 0-5 m/s.
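These parameters can be collected into a small configuration object, as in the sketch below; the class and field names are assumptions, while the values are those of this embodiment.

```python
from dataclasses import dataclass

@dataclass
class VehicleConfig:
    detect_range_m: float
    ammo_capacity: int
    comm_range_m: float
    init_blood: float = 100.0
    max_speed_mps: float = 5.0

FIRE_VEHICLE = VehicleConfig(detect_range_m=1600, ammo_capacity=50, comm_range_m=3000)
SUPPORT_VEHICLE = VehicleConfig(detect_range_m=1400, ammo_capacity=500, comm_range_m=3000)
```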
Example 2
As shown in fig. 1, the training system for a multi-agent model provided in this embodiment includes a simulation module, a multi-agent training module, a storage module (not shown in the figure), and an expert module that are communicatively connected.
1. Simulation module
The simulation module is used for generating a current observation sequence, current environment returns and accumulated returns of each agent.
In this embodiment, a conventional Unity simulation platform (here, unity 3D engine) is used as the simulation module. The simulation module is connected with the multi-agent training module and the expert module through the Unity simulation platform interface, the multi-agent training module, the expert module and the simulation module are provided with communication sub-modules, and the modules are communicated through socket processes, so that distributed training can be realized.
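A minimal sketch of socket-based messaging of the kind described here (simulation module, training module and expert module exchanging observations, returns and actions) is given below. The length-prefixed JSON message format, host and port are assumptions, not the patent's actual protocol.

```python
import json
import socket

def send_message(sock, payload: dict):
    """Send a length-prefixed JSON message over a connected TCP socket."""
    data = json.dumps(payload).encode("utf-8")
    sock.sendall(len(data).to_bytes(4, "big") + data)

def recv_message(sock) -> dict:
    """Receive one length-prefixed JSON message."""
    length = int.from_bytes(sock.recv(4), "big")
    buf = b""
    while len(buf) < length:
        buf += sock.recv(length - len(buf))
    return json.loads(buf.decode("utf-8"))

def push_observation(host, port, obs, reward, cumulative):
    """E.g. the simulation module pushing one frame to the training module."""
    with socket.create_connection((host, port)) as sock:
        send_message(sock, {"obs": obs, "reward": reward, "cumulative": cumulative})
```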
The simulation module comprises a virtual confrontation scene built by imitating a real scene (such as a mountain road, a valley-road map and the like), into which a plurality of agents are added. In this embodiment, the plurality of agents in the multi-agent model form the own side and are added into the virtual confrontation scene; the simulation module can set enemy behavior through its built-in automatic control script (e.g., a Unity random-walk script), both sides can move freely in the scene, and the task goal of the own side is to destroy the enemy through team cooperation, thereby forming a simulated confrontation environment.
The observation content of each intelligent agent comprises state information and position information of the current intelligent agent, state information and relative position information of other intelligent agents of the own party, state information and relative position information of the enemy intelligent agent detected by the current intelligent agent, state information and relative position information of the enemy intelligent agent detected by the other intelligent agents of the own party and the like. The status information includes velocity, blood volume, projectile inventory, etc.
The environmental return refers to the reward value obtained by each own-side agent during the game between the own side and the enemy in the simulation environment. The current environmental return is the reward value obtained by the current-frame action of each own-side agent, and the accumulated return is the accumulation of the reward values obtained by the historical-frame actions of each own-side agent: R_i = Σ_t γ^t · r_i^t, where R_i represents the accumulated return of the i-th agent, γ represents the discount coefficient for superposing returns at time t, and r_i^t denotes the environmental return of the i-th agent at time t.
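The sketch below computes this discounted accumulation for one agent; the reward values are illustrative.

```python
def cumulative_return(rewards, gamma=0.99):
    """R_i = sum_t gamma^t * r_i^t over one agent's per-frame environmental returns."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(cumulative_return([0.0, 1.0, 0.0, 5.0]))  # 0.99*1 + 0.99**3*5
```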
The simulation module further comprises a reward feedback sub-module and a splicing sub-module.
The reward feedback sub-module calculates the current environmental return and accumulated return of each own-side agent according to the reward values assigned in advance to designed targets (such as the reward obtained by hitting the enemy). The splicing sub-module splices the observation content of each agent to obtain the observation sequence of each agent.
2. Multi-agent training module
The multi-agent training module is used for outputting a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module.
The multi-agent training module comprises training sub-modules corresponding to the plurality of agents, and each training sub-module network is the corresponding agent network. The structures of the Actor policy network and the Critic evaluation network are shown in Fig. 2: each comprises six fully connected layers and two residual layers, with one residual layer introduced after every two fully connected layers and a ReLU activation function set after each fully connected layer and each residual layer. In this embodiment, a convolution layer is used as the residual layer, and the sum of the residual layer's input and output is used as the input of the next fully connected layer. The Actor policy network of the training sub-module outputs a strategy number while outputting the next action. The training sub-module network also comprises a target policy network and a target evaluation network, which are used for calculating the policy loss during training and updating the network parameters.
Each training sub-module takes the current observation sequence of each agent as input, takes strategy numbers and next actions as output, and the strategy numbers of all agents are spliced to form a strategy number sequence, and the next actions of all agents are spliced to form a first action sequence.
The strategies in the invention are obtained by collecting, analyzing and summarizing existing confrontation experience, situation analysis of military strategy, game strategy analysis and the like, yielding several general strategy rules for directing the agents to cooperate, and a unique strategy number is assigned to each strategy.
The strategies formulated in this embodiment are as follows:
(1) Three own-side firepower vehicles form a vertical single-file (long-snake) formation;
(2) Three own-side firepower vehicles form a horizontal single-file (long-snake) formation;
(3) Four own-side firepower vehicles form a vertical single-file (long-snake) formation;
(4) Four own-side firepower vehicles form a horizontal single-file (long-snake) formation;
(5) The own-side firepower vehicles close in on one another to encircle and protect the support vehicle;
(6) The own-side firepower vehicles scout in the four directions east, west, south and north;
(7) Random walk.
As shown in FIG. 2, during training, for the i-th agent, its observation sequence s_i is input into the Actor policy network to obtain the next action a_i and the strategy number p_i; the next action a_i and the observation sequence s_i are then input together into the Critic evaluation network to obtain the action evaluation q_i; the policy loss is calculated using the next action a_i and the action evaluation q_i together with the target action of the target policy network and the target action evaluation of the target evaluation network, and the policy loss is then used to correct the parameters of the Actor policy network and the Critic evaluation network.
3. Storage module
The strategy numbers and the corresponding strategies are stored in the storage module. The storage module also stores the observation-sequence threshold range associated with each strategy and the next action of each agent under each strategy, and constructs a list of (strategy number, strategy, observation-sequence threshold range associated with the strategy, next action of each agent under the strategy).
Table 1 policy, observation sequence and action list
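The contents of Table 1 are not reproduced in this text; the sketch below only illustrates the list structure described above (strategy number, strategy, observation-sequence threshold range, next action of each agent). The threshold ranges and action names are hypothetical placeholders, not values from the patent.

```python
# Hypothetical lookup structure for the storage module's list.
POLICY_TABLE = {
    1: {"policy": "three firepower vehicles in a vertical single-file formation",
        "obs_threshold_range": (0.0, 0.3),          # placeholder range
        "next_actions": {"fire_1": "move_north", "fire_2": "follow", "fire_3": "follow"}},
    7: {"policy": "random walk",
        "obs_threshold_range": (0.9, 1.0),          # placeholder range
        "next_actions": {"all": "random_step"}},
}
```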
4. Expert module
The expert module selects, through a voting strategy and according to the strategy number sequence from the multi-agent training module, the mapping strategy corresponding to one strategy number as the execution strategy, obtains a second action sequence according to the execution strategy and the observation sequence of each agent from the simulation module, performs a weighted average of the first action sequence and the second action sequence to obtain the next action sequence of the multiple agents, and feeds it back to the simulation module.
The expert module comprises a strategy formulation sub-module and an action calculation sub-module. The strategy formulation sub-module uses the current observation sequence of each agent from the simulation module to obtain a corresponding strategy number based on the constructed list, adds it to the strategy number sequence output by the multi-agent training module to form the candidate strategy set, and then selects by voting the mapping of one strategy number (i.e., the strategy represented by that strategy number) from the candidate set as the execution strategy. The voting strategy uses the conventional Boyer-Moore majority vote algorithm (see Robert S. Boyer and J Strother Moore, "MJRTY—A Fast Majority Vote Algorithm", 1982).
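A sketch of this vote over the candidate policy numbers (the first pass of the Boyer-Moore algorithm) follows; the candidate values are illustrative.

```python
def boyer_moore_majority(policy_numbers):
    """First pass of the Boyer-Moore majority vote: find the surviving candidate."""
    candidate, count = None, 0
    for p in policy_numbers:
        if count == 0:
            candidate, count = p, 1
        elif p == candidate:
            count += 1
        else:
            count -= 1
    return candidate  # a second pass can verify it is a strict majority

candidates = [3, 3, 5, 3, 7, 3]                      # agents' policy numbers + expert lookup
execution_policy = boyer_moore_majority(candidates)  # -> 3
```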
And the action calculation sub-module is used for determining the next action of each intelligent agent based on the constructed list according to the current observation sequence from the simulation module and combining with the execution strategy of voting selection, forming a second action sequence, and then carrying out weighted average on the first action sequence and the second action sequence to obtain the next action sequence of each intelligent agent. In one implementation, the action computation sub-module averages corresponding terms in the first action sequence and the second action sequence, and takes the average value of the terms as a multi-agent next action sequence.
After receiving the next action sequence of the multiple agents, the simulation module executes it on the basis of the current observation sequence to obtain the new observation sequence, environmental return and accumulated return of the next round.
The simulation module, the multi-agent training module and the expert module are communicated through socket processes based on communication sub-modules respectively arranged, and distributed training can be achieved.
Example 3
As shown in fig. 3, the training method of the multi-agent model provided in this embodiment is implemented by using the training system provided in embodiment 2 according to the following steps:
S1, initializing a multi-agent training module.
Initializing the multi-agent training module mainly refers to initializing the network parameters of each training sub-module, which can be done by random assignment.
S2, initializing a simulation module.
The initialization of the simulation module mainly refers to the initialization of the state information and the position information of each intelligent agent.
After receiving the initialization signal (e.g., "reset": 1), the simulation module initializes all the agents, including initial positions, initial directions, initial speeds, initial blood volume, initial number of shells, etc. of the respective agents.
During simulation, a human-machine interaction interface can be used to visually display the simulation environment and the state and position information of each simulated agent, and the current environmental return and accumulated return are reset to zero.
S3, the simulation module generates a current observation sequence of each agent, current environmental returns and accumulated returns of the two parties respectively, and sends the current observation sequence, the current environmental returns and the accumulated returns to the multi-agent training module and the expert module.
After the simulation module receives the initialization signal, the simulation module takes the initialization data of the state information and the position information of each agent as a first current observation sequence, and returns the current environment and the accumulated returns to zero.
In the subsequent cyclic training process, based on the current actions of each agent, a current observation sequence, a current environmental return and an accumulated return of each agent are generated.
The simulation module sends the generated current observation sequence, current environment returns and accumulated returns of each intelligent agent to the multi-intelligent-agent training module and the expert module through the communication submodule.
S4, the multi-agent training module outputs a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module and sends the strategy number sequence and the first action sequence to the expert module;
In this step, the current observation sequence (s_1, s_2, …, s_i, …, s_N) of each agent is taken as input, and the next action sequence (a_1, a_2, …, a_i, …, a_N, i.e., the first action sequence) and the strategy number sequence (p_1, p_2, …, p_i, …, p_N) are taken as output and sent to the expert module through the communication sub-module.
The process of updating network parameters in an agent will be described below using the i-th agent as an example.
The gradient of the i-th agent's objective, obtained through the Critic evaluation network, can be expressed as:
∇_{θ_i} J(μ_i) = E[ ∇_{θ_i} μ_i(a_i|s_i) · ∇_{a_i} q_i ],
where J(μ_i) represents R_i, i.e., J(μ_i) = E[R_i], E[·] denotes the expected value, μ_i(a_i|s_i) represents the Actor policy network of the i-th agent, θ_i represents the Actor policy network parameters of the i-th agent that are optimized for this objective, and q_i denotes the action evaluation obtained by the Critic evaluation network of the i-th agent.
The Critic evaluation network loss for the i-th agent is:
L(θ_i) = E[ (q_i − (r_i + γ · q'_i))² ],
where r_i represents the current environmental return of the i-th agent (the time t is omitted here) and γ represents the return-superposition discount coefficient (time t likewise omitted); q'_i is the target action evaluation output by the target evaluation network of the i-th agent, obtained as follows: after every M iteration steps of the Actor policy network, the observation sequence s'_i at that step is input into the target policy network to obtain the target action a'_i, and s'_i and a'_i are then input together into the target evaluation network to obtain the target action evaluation q'_i (the prime notation distinguishes the target networks from the Actor policy network and Critic evaluation network). Thus, the target action of the target policy network and the target action evaluation of the target evaluation network are updated once every M iteration steps of the Actor policy network.
The network parameter ω_i of the Critic evaluation network is updated by conventional back-propagation according to L(θ_i).
Since a larger value from the Critic evaluation network is better, the Actor policy network of the i-th agent is updated along the direction in which the Critic value increases fastest, i.e., the negative gradient direction of its loss:
the loss function of the Actor policy network is Loss = −L(θ_i).
Accordingly, the network parameter θ_i of the Actor policy network is updated by conventional back-propagation according to Loss.
The network parameter θ'_i of the target policy network is updated as follows: after every M iteration steps of the Actor policy network, θ'_i is updated from the latest θ_i according to θ'_i ← τ·θ_i + (1−τ)·θ'_i, where τ represents the network-parameter update discount factor.
The network parameter ω'_i of the target evaluation network is updated as follows: after every M iteration steps of the Critic evaluation network, ω'_i is updated from the latest ω_i according to ω'_i ← τ·ω_i + (1−τ)·ω'_i, where τ represents the network-parameter update discount factor.
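The sketch below expresses these updates in PyTorch under the usual MADDPG reading: a TD loss for the Critic, an Actor loss whose minimization makes the Critic's evaluation larger, and soft updates for the target networks. Tensor shapes, the use of the negative mean Critic value as the Actor loss, and the optimizer wiring are assumptions for illustration, not the patent's exact implementation.

```python
import torch

def critic_loss(q_i, r_i, q_target_i, gamma=0.99):
    """L(theta_i) = E[(q_i - (r_i + gamma * q'_i))^2]."""
    target = r_i + gamma * q_target_i.detach()
    return torch.mean((q_i - target) ** 2)

def actor_loss(q_i):
    """Drive the Actor so that the Critic's evaluation q_i becomes larger."""
    return -torch.mean(q_i)

@torch.no_grad()
def soft_update(target_net, net, tau=0.01):
    """theta'_i <- tau * theta_i + (1 - tau) * theta'_i (likewise for omega'_i)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```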
S5, the expert module selects mapping of one strategy number as an execution strategy according to the strategy number sequence from the multi-agent training module through a voting strategy, obtains a second action sequence according to the execution strategy and the observation sequence of each agent from the simulation platform, and then performs weighted average on the first action sequence and the second action sequence to obtain a next action sequence of the multi-agent and feeds the next action sequence back to the simulation module.
In this step, the strategy formulation sub-module first uses the current observation sequence of each agent from the simulation module to obtain a corresponding strategy number based on the observation-sequence threshold ranges and strategy mappings in the constructed list, adds it to the strategy number sequence output by the multi-agent training module to form the candidate strategy set, and then selects by the Boyer-Moore majority vote algorithm the mapping of one strategy number (i.e., the strategy represented by that strategy number) from the candidate set as the execution strategy.
And the action calculation sub-module is used for determining the next action of each agent based on the constructed list according to the current observation sequence from the simulation module and combining with the execution strategy of voting selection, and forming a second action sequence. The method comprises the steps of determining the current position of each intelligent agent according to the current observation sequence, determining the target position of each intelligent agent according to the next action requirement corresponding to an execution strategy, and determining the next action of each intelligent agent according to the current position and the target position of each intelligent agent, wherein the next actions of all intelligent agents form a second action sequence.
And then, the action calculation sub-module averages corresponding items in the first action sequence and the second action sequence, takes the average value of each item as the next action sequence of the multi-agent, and feeds back the next action sequence to the simulation module.
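The final merging step can be written as an element-wise weighted average, as in the sketch below; the equal weights correspond to the term-wise averaging described above, and the action values are illustrative.

```python
import numpy as np

def merge_action_sequences(first_actions, second_actions, w=0.5):
    """Weighted average of the network's and the expert's action sequences."""
    first, second = np.asarray(first_actions), np.asarray(second_actions)
    return w * first + (1.0 - w) * second   # w = 0.5 gives the plain average

next_actions = merge_action_sequences([[0.2, 1.0], [0.4, -0.5]],
                                      [[0.0, 0.8], [0.6, -0.3]])
```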
Steps S3-S5 are repeated until the accumulated return of any agent exceeds a set threshold or the own side wins outright (i.e., the enemy is destroyed), and this round of training ends. The procedure then returns to step S2, each agent returns to its initial state, and training resumes until the own side's win rate over the most recent 100 rounds reaches 95%.
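Putting steps S2-S5 and this stopping rule together, a high-level loop might look like the sketch below; the module objects and their method names (sim.observe, trainer.act, expert.decide, ...) are assumptions for illustration only.

```python
from collections import deque

def train(sim, trainer, expert, return_threshold, max_rounds=100000):
    recent_wins = deque(maxlen=100)
    for _ in range(max_rounds):
        sim.reset()                                                 # S2
        while True:
            obs, rewards, cum_returns = sim.observe()               # S3
            policy_numbers, first_actions = trainer.act(obs)        # S4
            next_actions = expert.decide(obs, policy_numbers, first_actions)  # S5
            sim.step(next_actions)
            if max(cum_returns) > return_threshold or sim.enemy_destroyed():
                break                                               # end of this round
        recent_wins.append(1 if sim.enemy_destroyed() else 0)
        if len(recent_wins) == 100 and sum(recent_wins) / 100 >= 0.95:
            break                                                   # training converged
```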
The invention provides a new multi-agent training optimization method and system. The current observation sequence is acquired from the Unity simulation platform; based on the improved MADDPG model network, a strategy number sequence and a predicted next action sequence are output and returned to the expert module; the expert module votes, according to the current observation sequence and the strategy number sequence obtained by the training module, to select a final overall strategy, calculates an updated next action sequence from the overall strategy and the training module's predicted action sequence, and sends it to the simulation module; the simulation module acts according to the next action sequence given by the expert module to obtain a new current observation sequence and current environmental return, and the current environmental return is added to the accumulated return. When the accumulated return exceeds a set threshold or the own side wins outright, the current round of training ends and the system returns to the initial state to continue training. The training system not only takes the action sequences of the multiple agents as the training network's output but also takes the agent strategy numbers as a network output, so that the training network has global situation estimation capability and the training efficiency is improved.
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.
Claims (8)
1. The training system of the multi-agent model is characterized by comprising a simulation module, a multi-agent training module and an expert module which are in communication connection, wherein the multi-agent model comprises a plurality of agents which are arranged in parallel, each agent comprises an Actor strategy network and a Critic evaluation network which are the same in structure, the network structure comprises six full-connection layers, a residual layer is introduced after every two full-connection networks, and a ReLU activation function is arranged after each full-connection layer and each residual layer;
The simulation module is used for generating a current observation sequence, a current environment report and an accumulated report of each intelligent agent, wherein the observation sequence of each intelligent agent comprises state information and position information of the current intelligent agent, state information and relative position information of other intelligent agents of the own party, state information and relative position information of enemies detected by the current intelligent agent, and state information and relative position information of the enemies detected by the other intelligent agents of the own party;
The multi-agent training module is used for outputting a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module;
The expert module is used for selecting a mapping strategy corresponding to one strategy number from the strategy number sequences from the multi-agent training module through a voting strategy as an execution strategy, obtaining a second action sequence according to the execution strategy and the observation sequences of the agents from the simulation module, carrying out weighted average on the first action sequence and the second action sequence to obtain a next action sequence of the multi-agent, and feeding back the next action sequence to the simulation module.
2. The training system of the multi-agent model according to claim 1, wherein the simulation module further comprises a reward feedback sub-module and a splicing sub-module, the reward feedback sub-module is used for calculating current environmental returns and accumulated returns of each agent on the own side according to a reward value corresponding to a pre-designed target, and the splicing sub-module is used for splicing observation contents of each agent to obtain observation sequences of each agent.
3. The system of claim 1, wherein the multi-agent training module comprises a training sub-module corresponding to a plurality of agents, wherein an Actor policy network of the training sub-module outputs a policy number corresponding to a formulated policy while outputting a next action, wherein the policy numbers of all agents are spliced to form a policy number sequence, and the next predicted actions of all agents are spliced to form a first action sequence.
4. The training system of the multi-agent model of claim 3, wherein the policy numbers and corresponding policies are stored in a memory module, wherein the memory module further stores a range of observed sequence thresholds associated with the policies and a list of next actions of each agent associated with the policies and builds the policy numbers, the policies, the range of observed sequence thresholds associated with the policies and the next actions of each agent associated with the policies.
5. The system of claim 4, wherein the expert module comprises a strategy generation sub-module and an action computation sub-module, the strategy generation sub-module is configured to select a mapping of one strategy number from the strategy number sequences from the multi-agent training module as an execution strategy by voting, and the action computation sub-module is configured to determine a next action of each agent according to a current observation sequence from the simulation module and simultaneously in combination with the selected execution strategy by voting, and to form a second action sequence.
6. The system of claim 5, wherein the policy generation sub-module further obtains a corresponding policy number based on the constructed list using the current observation sequence of each agent from the simulation module, and adds the corresponding policy number to the policy number sequence output by the multi-agent training module as the policy set to be selected.
7. The training system of a multi-agent model of claim 5 or 6, wherein the action computation sub-module performs a weighted average of the first and second sequences of actions to obtain a next sequence of actions for each agent.
8. A method of training a multi-agent model, characterized in that the training system according to any one of claims 1 to 7 is used to perform the following steps:
s1, initializing a multi-agent training module;
s2, initializing a simulation module;
S3, generating a current observation sequence, current environment returns and accumulated returns of each intelligent agent, and sending the current observation sequence, the current environment returns and the accumulated returns to the multi-intelligent-agent training module and the expert module, wherein the observation sequence of each intelligent agent comprises state information and position information of the current intelligent agent spliced together, state information and relative position information of other intelligent agents of the own party, state information and relative position information of enemy detected by the current intelligent agent, and state information and relative position information of the intelligent agents of the enemy detected by the other intelligent agents of the own party;
S4, the multi-agent training module outputs a strategy number sequence and a first action sequence according to the current observation sequence of each agent from the simulation module and sends the strategy number sequence and the first action sequence to the expert module;
S5, selecting a mapping strategy corresponding to one strategy number as an execution strategy by the expert module according to the strategy number sequence from the multi-agent training module through a voting strategy, and obtaining a second action sequence according to the execution strategy and the observation sequence of each agent from the simulation module;
and repeating the steps S3-S5 until the accumulated return of any agent exceeds a set threshold value or the own agent completely obtains victory, ending the round of training, and returning to the step S2, and returning each agent to the initial state to resume training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210693621.7A CN115054923B (en) | 2022-06-18 | 2022-06-18 | Multi-agent model and training system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210693621.7A CN115054923B (en) | 2022-06-18 | 2022-06-18 | Multi-agent model and training system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115054923A CN115054923A (en) | 2022-09-16 |
CN115054923B true CN115054923B (en) | 2025-08-19 |
Family
ID=83202397
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210693621.7A Active CN115054923B (en) | 2022-06-18 | 2022-06-18 | Multi-agent model and training system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115054923B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105034A (en) * | 2019-12-24 | 2020-05-05 | 中国科学院自动化研究所 | Multi-agent deep reinforcement learning method and system based on counterfactual reward |
CN113159432A (en) * | 2021-04-28 | 2021-07-23 | 杭州电子科技大学 | Multi-agent path planning method based on deep reinforcement learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPR464601A0 (en) * | 2001-04-30 | 2001-05-24 | Commonwealth Of Australia, The | Shapes vector |
US9679258B2 (en) * | 2013-10-08 | 2017-06-13 | Google Inc. | Methods and apparatus for reinforcement learning |
CN113341958B (en) * | 2021-05-21 | 2022-02-25 | 西北工业大学 | A Mixed Experience Multi-Agent Reinforcement Learning Motion Planning Method |
Also Published As
Publication number | Publication date |
---|---|
CN115054923A (en) | 2022-09-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |