
CN112434791A - Multi-agent strong countermeasure simulation method and device and electronic equipment - Google Patents

Multi-agent strong countermeasure simulation method and device and electronic equipment

Info

Publication number
CN112434791A
Authority
CN
China
Prior art keywords: confrontation, network, agent, strong, loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011270335.7A
Other languages
Chinese (zh)
Inventor
白桦
王群勇
孙旭朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Original Assignee
BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE filed Critical BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Priority to CN202011270335.7A priority Critical patent/CN112434791A/en
Publication of CN112434791A publication Critical patent/CN112434791A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-agent strong confrontation simulation method and device and an electronic device, wherein the method comprises the following steps: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model using generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model, thereby completing the strong confrontation simulation of the multi-agent. By learning from historical data, the training of the multi-agent strong confrontation model is accelerated, which effectively improves operating efficiency and saves computing resources.

Description

Multi-agent strong countermeasure simulation method and device and electronic equipment
Technical Field
The invention relates to the technical field of system simulation, in particular to a multi-agent strong countermeasure simulation method and device and electronic equipment.
Background
Multi-agent modeling is grounded in the model theory of artificial intelligence and organizational behavior, and combines Multi-Agent Systems (MAS) with research on mathematical models in specific fields. It already covers many traditional and emerging scientific areas, such as bionic optimization algorithms, computational economics, artificial societies, knowledge propagation engineering, and complex systems of war and politics.
The existing deep reinforcement learning framework built around the Deep Q-Network (DQN) is one of the main methods for establishing a multi-agent strong countermeasure model. However, in multi-agent strong countermeasure applications the continuous, time-sequenced action space has an enormous dimension, so the number of DQN model parameters is also enormous. If these parameters are trained from their initial values, a large amount of training time is consumed before satisfactory results are obtained, and efficiency is low.
Disclosure of Invention
The invention provides a multi-agent strong countermeasure simulation method and device and an electronic device, which are intended to overcome the low operating efficiency of the prior art and to effectively improve operating efficiency.
The invention provides a multi-agent strong confrontation simulation method, which comprises the following steps:
acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model using generative adversarial network technology;
and simulating the decision process of the multi-agent in the strong confrontation process by using the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, the neural network strategy model comprises a discrimination network and a policy network;
wherein the discrimination network is used to classify input confrontation data, and the output of the discrimination network indicates whether the input confrontation data conforms to the demonstration confrontation strategy;
the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, before the training and obtaining the neural network strategy model, the method further comprises the following steps:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
According to the multi-agent strong confrontation simulation method of one embodiment of the invention, before determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, the method further comprises:
calculating the cross entropy according to the following formulas:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, before the training and obtaining the neural network strategy model, the method further comprises the following steps:
determining the reward function of the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure BDA0002777479480000031
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
According to the multi-agent strong confrontation simulation method, the decision process of the multi-agent in the strong confrontation process is simulated by utilizing the neural network strategy model, and the method comprises the following steps:
constructing the countermeasure command parameter probability distribution based on the output of the policy network, and sampling and acquiring countermeasure command parameters from the countermeasure command parameter probability distribution;
converting the countermeasure command parameters into a countermeasure command list according to an interface format required by the countermeasure simulation engine, and inputting the countermeasure command list into the countermeasure simulation engine.
According to the multi-agent strong countermeasure simulation method, the discrimination network is specifically a binary classification neural network, the input of the binary classification neural network is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar within [0, 1].
The invention also provides a multi-agent strong confrontation simulation device, which comprises:
the training module is used for acquiring multiple rounds of demonstration confrontation playback data from the confrontation simulation engine and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology;
and the simulation module is used for simulating the decision process of the multi-agent in the strong confrontation process by utilizing the neural network strategy model so as to complete the strong confrontation simulation of the multi-agent.
The invention also provides an electronic device, which comprises a memory, a processor and a program or an instruction which is stored on the memory and can run on the processor, wherein when the processor executes the program or the instruction, the steps of the multi-agent strong countermeasure simulation method are realized.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a program or instructions which, when executed by a computer, implement the steps of the multi-agent strong confrontation simulation method according to any one of the above.
According to the multi-agent strong-confrontation simulation method, the multi-agent strong-confrontation simulation device and the electronic equipment, the training speed of the multi-agent strong-confrontation model can be accelerated by learning historical data, so that the operation efficiency is effectively improved, and the computing resources are effectively saved.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed for the description are briefly introduced below. The drawings described below show some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of the overall system structure in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 2 is a schematic flow chart of a multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 3 is a schematic flow chart of data acquisition of a demonstration countermeasure playback in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 4 is a schematic diagram of a data structure in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 5 is a schematic diagram of a reinforcement learning control loop in the multi-agent strong confrontation simulation method provided by the present invention;
FIG. 6 is a schematic diagram of a DQN behavior value function approximation network in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 7 is a schematic flow chart of a neural network strategy model training in the multi-agent strong confrontation simulation method provided by the present invention;
FIG. 8 is a schematic structural diagram of a multi-agent strong countermeasure simulation apparatus provided in the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To address the low operating efficiency of multi-agent strong confrontation simulation in the prior art, the invention accelerates the training of the multi-agent strong confrontation model by learning from historical data, thereby effectively improving operating efficiency and saving computing resources. The present invention is described and explained below through several embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the overall system structure in the multi-agent strong countermeasure simulation method provided by the present invention. To quickly establish a highly intelligent neural network confrontation strategy, generative adversarial network technology is first adopted: existing high-level confrontation replay data are used to rapidly optimize a neural network strategy model, so that the model imitates the confrontation strategies adopted in the replays and reaches the same level of intelligence. The neural network strategy model generated by this training can be used directly for intelligent confrontation simulation, and can be further optimized and improved through reinforcement learning to reach an even higher level of intelligence.
Fig. 2 is a schematic flow chart of a multi-agent strong confrontation simulation method provided by the present invention, as shown in fig. 2, the method includes:
s201, acquiring multi-round demonstration countermeasure playback data from an countermeasure simulation engine, and training and acquiring a neural network strategy model by adopting a countermeasure network generation technology based on the countermeasure playback data.
It can be understood that, as shown in fig. 3, which is a schematic flow chart of demonstration confrontation playback data acquisition in the multi-agent strong confrontation simulation method provided by the present invention, the invention first needs to obtain multiple rounds of demonstration confrontation playback data from the simulation engine. Optionally, the confrontation playback data may be saved in a playback buffer.
As shown in fig. 4, which is a schematic diagram of the data structure in the multi-agent strong confrontation simulation method provided by the present invention, each sample point in the playback buffer is the data of one confrontation step and includes a joint confrontation situation s and a presenter confrontation command list a.
The collected confrontation playback data can be generated by the manual operation of high-level human players or by a highly optimized automated confrontation rule program written by professional engineers. The confrontation playback records only need to be stored by the simulation engine; no additional manual labeling is required.
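For illustration only, the following Python sketch shows one possible (assumed) structure for such a playback buffer, with each sample pairing a joint confrontation situation s with the presenter confrontation command list a; the class and field names are not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Any, List
import random

@dataclass
class DemoSample:
    situation: Any   # tensor-encoded joint confrontation situation s
    commands: Any    # presenter confrontation command list a

class PlaybackBuffer:
    def __init__(self) -> None:
        self.samples: List[DemoSample] = []

    def add(self, situation: Any, commands: Any) -> None:
        # one sample point corresponds to one confrontation step
        self.samples.append(DemoSample(situation, commands))

    def sample_batch(self, batch_size: int) -> List[DemoSample]:
        # random sampling, used later when training the networks
        return random.sample(self.samples, min(batch_size, len(self.samples)))
```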
After the confrontation playback data have been collected, generative adversarial network technology can be adopted to train the neural network strategy model, so that the model learns the confrontation strategy adopted by the demonstrator.
S202, simulating a decision process of the multi-agent in the strong confrontation process by using the neural network strategy model, and completing the strong confrontation simulation of the multi-agent.
It can be understood that, after the neural network strategy model has been trained according to the above steps, the multi-agent strong confrontation simulation test can be carried out, and the decision process of the multi-agent in the strong confrontation process can be simulated with the neural network strategy model. For example, the imitator in fig. 1 can be made to mimic the decision process of the presenter when interacting with the confrontation simulation engine.
It should be understood that a single agent is often unable to describe and solve real-world complex, large-scale problems, so an application system usually contains multiple agents. The agents not only have their own problem-solving abilities and behavioral goals, but can also cooperate with one another to achieve a common overall goal; such a system is an MAS. An MAS has the following properties: each agent has incomplete information or limited ability to solve the problem; data are stored and processed in a decentralized way, without a system-level centralized data processing structure; there is interactivity inside the system and encapsulation of the system as a whole; and computation is asynchronous, so some shared resources need to be locked.
Multi-agent simulation uses systems theory and multi-agent system modeling methods to establish a high-level model of the system, and realizes the simulation with a system computation model built on agent-model-based simulation software and hardware support technologies.
According to the multi-agent strong confrontation simulation method provided by the invention, when historical experience data are available, learning from the historical data accelerates the training of the multi-agent strong confrontation model, thereby effectively improving operating efficiency and saving computing resources. In the military field, if combat data of a virtual adversary exist, the method can quickly establish a confrontation model of the virtual adversary and simulate the adversary's operational behavior, which can then be used for the simulation training of one's own commanders and fighters.
It should be understood that Reinforcement Learning (RL) modeling studies how an agent interacts with its environment to optimize an objective. Reinforcement learning is then formalized as a Markov decision process, which is the theoretical basis of reinforcement learning.
Next, three main functions that an agent can learn are introduced:
strategy → value function → model
Reinforcement learning is related to solving sequential decision problems, and many real-world problems, such as video game play, sports, driving, etc., can be solved in this manner.
Each of these problems has an objective or purpose, such as winning the game, reaching the destination safely, or minimizing the cost of manufacturing a product. The agent takes actions and receives feedback from the world about how close it is to the goal (the current score, the distance to the destination, or the unit price). Achieving a goal typically requires taking many actions in turn, each of which changes the surrounding world. These changes in the world and the feedback received are observed before further actions are taken in response.
The reinforcement learning problem can be represented as a system consisting of an agent and an environment. The environment generates information describing the state of the system, referred to as the state. The agent interacts with the environment by observing the state and using this information to select an action. The environment accepts the action and transitions to the next state, then returns the next state and a reward to the agent. One completed cycle of (state → action → reward) is one step. This cycle repeats until the environment terminates (e.g., when the problem is solved). Fig. 5 is a schematic diagram of the reinforcement learning control loop in the multi-agent strong confrontation simulation method provided by the present invention; it depicts the whole process of this loop. A minimal code sketch of this loop is given below.
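The sketch below illustrates the (state → action → reward) control loop described above; Env and Agent are generic placeholders rather than components defined by the invention.

```python
def run_episode(env, agent, max_steps=1000):
    """One reinforcement learning episode: repeat (state -> action -> reward) until termination."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # observe the state and select an action
        next_state, reward, done = env.step(action)       # the environment transitions and returns feedback
        agent.observe(state, action, reward, next_state)  # feedback available for learning
        total_reward += reward
        state = next_state
        if done:                                          # the environment terminated
            break
    return total_reward
```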
Consider how the environment transitions from one state to the next, through what is called a transition function. In reinforcement learning the transition function is formulated as a Markov Decision Process (MDP), a mathematical framework for modeling sequential decisions. To see how the transition function is expressed as an MDP, consider the following equation:
s_{t+1} ~ P(s_{t+1} | (s_0, a_0), (s_1, a_1), ..., (s_t, a_t));
where at time step t the next state s_{t+1} is sampled from a probability distribution P conditioned on the entire history, i.e., the environment's transition from state s_t to s_{t+1} depends on all previous states s and actions a.
To make the transition function more practical, it is converted into an MDP by adding the following assumption: the next state s_{t+1} depends only on the previous state s_t and action a_t; this is called the Markov property. Under this assumption, the transition function becomes:
s_{t+1} ~ P(s_{t+1} | s_t, a_t);
The above formula states that the next state s_{t+1} is sampled from the probability distribution P(s_{t+1} | s_t, a_t). This is a simpler form of the original transition function. The Markov property means that the current state and action at time step t contain enough information to fully determine the transition probability of the next state at t + 1.
Combining the concepts of reinforcement learning with deep neural network technology yields deep reinforcement learning methods such as the Deep Q-Network (DQN), in which a deep neural network is constructed. Fig. 6 is a schematic diagram of the DQN behavior value function approximation network in the multi-agent strong confrontation simulation method provided by the present invention: the input is the environment variable, the output is the action variable, and the neural network is trained with the objective of maximizing the return value.
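As an illustrative sketch only (layer sizes and dimensions are assumptions, not taken from the original), such a value-approximation network could be written in PyTorch as follows: the input is the environment state and the output is one estimated return per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style behavior value function approximation network (cf. Fig. 6)."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one estimated return per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Acting greedily with respect to the estimated returns:
# q = QNetwork(state_dim=32, num_actions=8)
# action = q(torch.randn(1, 32)).argmax(dim=-1)
```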
Optionally, in the multi-agent strong countermeasure simulation method provided by the above embodiments, the neural network policy model includes a discrimination network and a policy network.
The discrimination network is used to classify input confrontation data, and its output indicates whether the input confrontation data conforms to the demonstration confrontation strategy; the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
It can be understood that, as shown in fig. 1, the neural network policy model of the present invention consists of a discrimination network D and a policy network A. The discrimination network D classifies the input confrontation data and outputs a scalar value between 0 and 1 that judges whether the input data conforms to the demonstration confrontation strategy: 0 means fully conforming and 1 means not conforming at all. The optimization goal of the discrimination network D is therefore to judge all data as accurately as possible.
The policy network A reads the confrontation situation (environment) data and generates the confrontation commands to be taken in that situation. Its aim is to imitate the demonstrated confrontation as accurately as possible, which also means fooling the discrimination network D so that it cannot distinguish whether the confrontation data were generated by the demonstration player or by the policy network. The discrimination network D and the policy network A therefore form an adversarial relationship and are trained alternately. When the two networks reach equilibrium, the discrimination network D scores the demonstration confrontation data and the confrontation data generated by the policy network with nearly equal probability (i.e., it can no longer effectively distinguish between them; ideally the value approaches 0.5, meaning the discrimination network cannot distinguish them at all), and at this point the policy network A has learned a confrontation strategy close to that of the demonstration player.
Thus, optionally, the processing procedure for training the neural network policy model is shown in fig. 7, a schematic flow chart of neural network strategy model training in the multi-agent strong confrontation simulation method provided by the present invention. It includes the following processing steps (a minimal code sketch of this loop is given after the list):
(1) randomly sampling a batch of samples from the demonstration confrontation playback buffer;
(2) each batch sample contains a joint confrontation situation and a presenter confrontation command list, so the batch can be used directly as the presenter batch samples;
(3) generating the imitator batch samples, which comprises:
(3.1) taking the joint confrontation situation samples from the batch;
(3.2) inputting the joint confrontation situation samples into policy network A to generate its output;
(3.3) generating the imitator confrontation command lists from the output of policy network A;
(3.4) combining the joint confrontation situations with the corresponding imitator confrontation command lists to form the imitator batch samples;
(4) inputting the presenter batch samples and the imitator batch samples together into discrimination network D, computing the loss function of discrimination network D, and performing one round of optimization training on discrimination network D;
(5) using discrimination network D to judge the imitator batch samples and produce its output;
(6) computing the loss of policy network A from the judgment result of discrimination network D and performing one round of optimization training on policy network A.
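The following Python (PyTorch-style) sketch illustrates one such training iteration under stated assumptions: D and A stand for the discrimination and policy networks, encode and build_distribution are placeholder helpers, and the exact form of the policy loss is an assumed combination of the reward and entropy terms described later, not the patent's own formula.

```python
import torch
import torch.nn.functional as F

def train_iteration(buffer, D, A, d_opt, a_opt, encode, build_distribution,
                    batch_size=64, beta=0.01):
    batch = buffer.sample_batch(batch_size)                          # (1) random batch of demo samples
    situations = torch.stack([encode(s.situation) for s in batch])
    expert_cmds = torch.stack([encode(s.commands) for s in batch])   # (2) presenter batch samples

    pd = build_distribution(A(situations))                           # (3) policy output -> distribution pd
    learner_cmds = pd.sample()                                       #     imitator confrontation commands

    # (4) discrimination network update: demonstration samples -> 0, imitation samples -> 1
    d_expert = D(situations, expert_cmds)
    d_learner = D(situations, learner_cmds.detach())
    d_loss = F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert)) + \
             F.binary_cross_entropy(d_learner, torch.ones_like(d_learner))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # (5)-(6) policy network update: Reward = -log D(imitation); entropy bonus weighted by beta (assumed form)
    reward = -torch.log(D(situations, learner_cmds) + 1e-8).detach()
    a_loss = -(pd.log_prob(learner_cmds) * reward.squeeze(-1)).mean() - beta * pd.entropy().mean()
    a_opt.zero_grad()
    a_loss.backward()
    a_opt.step()
    return d_loss.item(), a_loss.item()
```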
Optionally, the discrimination network is specifically a binary classification neural network, the input of which is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of which is a binary classification scalar within [0, 1].
Specifically, the discrimination network D of the present invention may be a typical binary classification neural network whose input is the tensor encoding of the joint confrontation situation together with the confrontation command list, and whose output is a binary classification scalar between 0 and 1.
Optionally, the network structure and scale of the binary classification neural network can be chosen in view of the characteristics of the input data; for example, a convolutional network (CNN) or a multilayer perceptron (MLP) is typically adopted, and the parameter dimensions and network depth are adjusted according to the number of input data attributes and the complexity of their relationships.
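By way of illustration, a minimal MLP realization of such a discrimination network D might look as follows (dimensions and layer sizes are assumptions): the input is the tensor encoding of the joint confrontation state concatenated with the confrontation command list, and the output is a binary classification scalar in [0, 1].

```python
import torch
import torch.nn as nn

class DiscriminationNetwork(nn.Module):
    def __init__(self, state_dim: int, command_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + command_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # binary classification scalar in [0, 1]
        )

    def forward(self, state: torch.Tensor, commands: torch.Tensor) -> torch.Tensor:
        # concatenate the joint confrontation state encoding with the command list encoding
        return self.net(torch.cat([state, commands], dim=-1))
```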
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before the training to obtain the neural network policy model, the method further includes:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
Specifically, the loss of the discrimination network D in the present invention is the sum of the discrimination losses on the demonstration samples and the imitation samples:
D_loss = D_loss-expert + D_loss-learner;
where the discrimination losses D_loss-expert and D_loss-learner are the cross entropies between the actual output and the expected output of the discrimination network D on the demonstration samples and the imitation samples, respectively. A demonstration sample should be judged as fully conforming to the demonstration confrontation strategy, so its expected output is 0; an imitation sample should be judged as not conforming to the demonstration confrontation strategy, so its expected output is 1.
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, the method further includes:
calculating the cross entropy according to the following formulas:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
Thus, the loss calculation function of the discrimination network D can be expressed as:
D_loss = BCELoss(D(Π_E), 0) + BCELoss(D(Π_L), 1);
where BCELoss is the binary cross entropy, Π_E is a demonstration sample, and Π_L is an imitation sample.
The optimization goal of the discrimination network D is to minimize the total discrimination loss.
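A minimal sketch of this loss computation, assuming a discrimination network of the form sketched earlier and PyTorch's nn.BCELoss as the binary cross entropy, is:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discrimination_loss(D, expert_states, expert_cmds, learner_states, learner_cmds):
    d_expert = D(expert_states, expert_cmds)     # expected output 0: conforms to the demonstration
    d_learner = D(learner_states, learner_cmds)  # expected output 1: does not conform
    d_loss_expert = bce(d_expert, torch.zeros_like(d_expert))
    d_loss_learner = bce(d_learner, torch.ones_like(d_learner))
    return d_loss_expert + d_loss_learner        # D_loss = D_loss-expert + D_loss-learner
```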
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before the training to obtain the neural network policy model, the method further includes:
determining a reward function for the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure BDA0002777479480000121
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
Specifically, the structural design of the policy network A is similar to that of a policy network in reinforcement learning; a convolutional network (CNN), a multilayer perceptron (MLP), or the like can be chosen according to the input and output characteristics, and parameters such as the input and output dimensions and the network depth are selected and adjusted in view of the characteristics of the simulation data.
The technical framework of the invention is applicable to different types of agents, denoted by subscript i, and different numbers of agents of the same type, denoted by subscript j.
Then, the reward calculation formula of policy network A is:
Reward = -log(D(Π_L));
The optimization goal of policy network A is to maximize this return.
The loss calculation function of policy network A is:
Figure BDA0002777479480000122
Here pd is the confrontation command parameter probability distribution constructed from the parameters output by policy network A; the type of probability distribution can be chosen according to the characteristics of each parameter: discrete parameters such as the command type can use a Categorical distribution, while continuous parameters such as the coordinates x and y can use a Normal distribution. action is a command parameter value sampled from the constructed probability distribution; log_prob is the log probability density of the probability distribution at the sampled action value; entropy is the entropy of the probability distribution; and beta is a hyperparameter that controls the weight of the maximum-entropy objective in the policy network loss and is adjusted during training according to the training situation.
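For illustration, the following sketch constructs pd with a Categorical distribution for the discrete command type and a Normal distribution for continuous coordinates, and assembles a loss from the terms listed above. The layout of the policy network output and the exact combination of log_prob, Reward, and entropy are assumptions, since the original gives the loss only as an image.

```python
import torch
from torch.distributions import Categorical, Normal

def policy_loss(policy_output, reward, beta=0.01):
    # assumed layout of the policy network output (not specified by the original)
    cmd_logits, xy_mean, xy_std = policy_output
    cmd_pd = Categorical(logits=cmd_logits)           # discrete parameter: command type
    xy_pd = Normal(xy_mean, xy_std.clamp(min=1e-3))   # continuous parameters: coordinates x, y

    cmd = cmd_pd.sample()                             # action: sampled command parameters
    xy = xy_pd.sample()
    log_prob = cmd_pd.log_prob(cmd) + xy_pd.log_prob(xy).sum(dim=-1)
    entropy = cmd_pd.entropy() + xy_pd.entropy().sum(dim=-1)

    # assumed form: maximize the log_prob-weighted Reward plus a beta-weighted entropy bonus
    return -(log_prob * reward.detach()).mean() - beta * entropy.mean()
```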
Optionally, in the multi-agent strong confrontation simulation method provided by the above embodiments, simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model includes: constructing the confrontation command parameter probability distribution based on the output of the policy network, and sampling confrontation command parameters from it; converting the confrontation command parameters into a confrontation command list according to the interface format required by the confrontation simulation engine, and inputting the confrontation command list into the confrontation simulation engine.
Specifically, the policy network A of the present invention is similar to a policy network in reinforcement learning: its input is the tensor encoding of the joint confrontation situation, and its output consists of probability distribution parameters that can be used to construct the confrontation command list. The automatic confrontation program constructs the confrontation command parameter probability distribution pd from the output of policy network A, samples pd to obtain confrontation command parameters, and finally converts these parameters into a confrontation command list in the interface format required by the confrontation simulation engine and inputs the list into the engine.
Based on the same inventive concept, the invention provides, according to the above embodiments, a multi-agent strong countermeasure simulation apparatus for realizing the strong countermeasure simulation of multiple agents in the above embodiments. The descriptions and definitions in the multi-agent strong confrontation simulation method of the above embodiments can therefore be used to understand the execution modules of the multi-agent strong confrontation simulation apparatus of the invention; reference may be made to the above embodiments, and details are not repeated here.
According to an embodiment of the present invention, the structure of the multi-agent strong countermeasure simulation apparatus is shown in fig. 8, which is a schematic structural diagram of the multi-agent strong countermeasure simulation apparatus provided by the present invention, the apparatus can be used for the strong countermeasure simulation of the multi-agent, the apparatus includes: a training module 801 and a simulation module 802.
The training module 801 is configured to acquire multiple rounds of demonstration countermeasure playback data from a countermeasure simulation engine and, based on the countermeasure playback data, train a neural network strategy model using generative adversarial network technology; the simulation module 802 is configured to simulate the decision process of the multi-agent in the strong countermeasure process with the neural network policy model, so as to complete the strong countermeasure simulation of the multi-agent.
Specifically, the training module 801 first obtains multiple rounds of demonstration confrontation playback data from the simulation engine. After the confrontation playback data have been collected, the training module 801 can train the neural network strategy model using generative adversarial network technology so that the model learns the confrontation strategy adopted by the presenter.
Then, the simulation module 802 can carry out the multi-agent strong confrontation simulation test and simulate the decision process of the multi-agent in the strong confrontation process with the neural network policy model; for example, the imitator in fig. 1 can be made to mimic the decision process of the presenter when interacting with the confrontation simulation engine.
The multi-agent strong-confrontation simulation device provided by the invention can accelerate the training speed of the multi-agent strong-confrontation model by learning historical data, thereby effectively improving the operation efficiency and effectively saving the computing resources.
Optionally, the neural network policy model includes a discrimination network and a policy network;
wherein the discrimination network is used to classify input confrontation data, and the output of the discrimination network indicates whether the input confrontation data conforms to the demonstration confrontation strategy;
the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
Further, the training module is further configured to:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
Further, the training module is further configured to:
the cross entropy is calculated as follows:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
Further, the training module is further configured to:
determining the reward function of the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure FDA0002777479470000021
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
Optionally, the simulation module is configured to:
constructing the countermeasure command parameter probability distribution based on the output of the policy network, and sampling and acquiring countermeasure command parameters from the countermeasure command parameter probability distribution;
converting the countermeasure command parameters into a countermeasure command list according to an interface format required by the countermeasure simulation engine, and inputting the countermeasure command list into the countermeasure simulation engine.
Optionally, the discrimination network is specifically a binary classification neural network, the input of the binary classification neural network is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar within [0, 1].
It is understood that the relevant program modules in the apparatus of the above embodiments can be implemented by a hardware processor in the present invention. Moreover, the multi-agent strong countermeasure simulation apparatus of the present invention can implement the multi-agent strong countermeasure simulation flow of each of the above method embodiments by using the above program modules; when used for the strong countermeasure simulation of the multi-agent in each of the above method embodiments, the apparatus produces the same beneficial effects as the corresponding method embodiments, so reference may be made to the above method embodiments and details are not repeated here.
As a further aspect of the present invention, the present invention provides an electronic device according to the above embodiments, the electronic device includes a memory, a processor and a program or instructions stored in the memory and executable on the processor, and the processor executes the program or instructions to implement the steps of the multi-agent robust simulation method according to the above embodiments.
Further, the electronic device of the present invention may further include a communication interface and a bus. Referring to fig. 9, a schematic structural diagram of an electronic device provided in the present invention includes: at least one memory 901, at least one processor 902, a communication interface 903, and a bus 904.
Wherein, the memory 901, the processor 902 and the communication interface 903 are communicated with each other through the bus 904, and the communication interface 903 is used for information transmission between the electronic equipment and the countermeasure data equipment; the memory 901 stores a program or instructions that can be executed on the processor 902, and when the processor 902 executes the program or instructions, the steps of the multi-agent warfare simulation method as described in the above embodiments are implemented.
It is understood that the electronic device at least comprises a memory 901, a processor 902, a communication interface 903 and a bus 904, and the memory 901, the processor 902 and the communication interface 903 form a communication connection with each other through the bus 904, and can complete the communication with each other, for example, the processor 902 reads program instructions of the multi-agent robust simulation method from the memory 901. In addition, the communication interface 903 can also realize communication connection between the electronic device and the countermeasure data device, and can complete mutual information transmission, such as reading of the countermeasure playback data through the communication interface 903.
When the electronic device is running, the processor 902 invokes the program instructions in the memory 901 to perform the methods provided by the above method embodiments, for example: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
The program instructions in the memory 901 may be implemented in the form of software functional units and stored in a computer readable storage medium when the program instructions are sold or used as independent products. Alternatively, all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, where the program may be stored in a computer-readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present invention also provides, according to the above embodiments, a non-transitory computer readable storage medium on which a program or instructions are stored; when executed by a computer, the program or instructions implement the steps of the multi-agent strong confrontation simulation method of the above embodiments, for example: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
As a further aspect, the present invention also provides, according to the above embodiments, a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multi-agent strong countermeasure simulation method provided by the above method embodiments, the method comprising: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
According to the electronic device, the non-transitory computer readable storage medium and the computer program product provided by the invention, by executing the steps of the multi-agent strong confrontation simulation method described in each embodiment, the training speed of the multi-agent strong confrontation model can be accelerated by learning historical data, so that the operation efficiency is effectively improved, and the calculation resources are effectively saved.
It is to be understood that the above-described embodiments of the apparatus, the electronic device and the storage medium are merely illustrative, and that elements described as separate components may or may not be physically separate, may be located in one place, or may be distributed on different network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions mentioned above may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a usb disk, a removable hard disk, a ROM, a RAM, a magnetic or optical disk, etc., and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device, etc.) to execute the methods described in the method embodiments or some parts of the method embodiments.
In addition, it should be understood by those skilled in the art that in the specification of the present invention the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
In the description of the present invention, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1.一种多智能体强对抗仿真方法,其特征在于,包括:1. a multi-agent strong confrontation simulation method, is characterized in that, comprises: 从对抗仿真引擎获取多轮演示对抗回放数据,并基于所述对抗回放数据,采用生成对抗网络技术,训练获取神经网络策略模型;Obtain multiple rounds of demonstration confrontation playback data from the confrontation simulation engine, and based on the confrontation playback data, use the generative confrontation network technology to train and obtain a neural network strategy model; 利用所述神经网络策略模型,模拟所述多智能体在强对抗过程中的决策过程,完成多智能体强对抗仿真。Using the neural network strategy model, the decision process of the multi-agent in the strong confrontation process is simulated, and the multi-agent strong confrontation simulation is completed. 2.根据权利要求1所述的多智能体强对抗仿真方法,其特征在于,所述神经网络策略模型包括判别网络和策略网络;2. The multi-agent strong confrontation simulation method according to claim 1, wherein the neural network strategy model comprises a discriminant network and a strategy network; 其中,所述判别网络用于对输入对抗数据进行分类,所述判别网络的输出用于指示所述输入对抗数据是否符合演示对抗策略;Wherein, the discrimination network is used to classify the input confrontation data, and the output of the discrimination network is used to indicate whether the input confrontation data conforms to the demonstration confrontation strategy; 所述策略网络用于读取所述强对抗过程的状态数据,并基于所述状态数据,产生在所述状态数据下应采取的对抗策略。The strategy network is used to read the state data of the strong confrontation process, and based on the state data, generate a confrontation strategy that should be taken under the state data. 3.根据权利要求2所述的多智能体强对抗仿真方法,其特征在于,在所述训练获取神经网络策略模型之前,还包括:3. The multi-agent strong confrontation simulation method according to claim 2, wherein, before the training obtains the neural network strategy model, further comprising: 确定演示样本与模仿样本的判别损失总和,作为所述判别网络的损失,所述判别网络的损失函数表示如下:Determine the sum of the discriminative loss of the demo sample and the imitation sample as the loss of the discriminant network, and the loss function of the discriminant network is expressed as follows: Dloss=Dloss-expert+Dloss-learnerD loss =D loss-expert +D loss-learner ; 式中,Dloss表示所述判别网络的损失,Dloss-expert表示所述判别网络对所述演示样本的实际输出与预期输出的交叉熵,Dloss-learner表示所述判别网络对所述模仿样本的实际输出与预期输出的交叉熵;In the formula, D loss represents the loss of the discriminant network, D loss-expert represents the cross-entropy between the actual output and the expected output of the discriminant network for the demo sample, and D loss-learner represents the difference between the discriminative network and the imitation The cross entropy between the actual output of the sample and the expected output; 确定所述判别网络的目标为最小化所述判别损失总和。The objective of the discriminative network is determined to minimize the sum of the discriminative losses. 4.根据权利要求3所述的多智能体强对抗仿真方法,其特征在于,在所述确定演示样本与模仿样本的判别损失总和,作为所述判别网络的损失之前,还包括:4. The multi-agent strong confrontation simulation method according to claim 3, characterized in that, before the determination of the sum of the discriminative losses of the demonstration samples and the imitation samples is taken as the loss of the discriminant network, further comprising: 按如下公式计算所述交叉熵,所述如下公式为:The cross entropy is calculated according to the following formula, and the following formula is: l(x,y)=L={l1,...,ln,...,lN}Tl(x, y)=L={l 1 , . . . , ln , . . . 
, 1 N } T ; ln=-wn[yn·logxn+(1-vn)·log(1-xn)];l n =-w n [y n ·logx n +(1-v n )·log(1-x n )]; 式中,l(x,y)表示向量x与y的交叉熵,定义为向量x与y各个分量的交叉熵组成的向量{l1,...,ln,...,lN}T,ln为向量x、y的对应分量xn与yn的交叉熵,wn为分量n的权重,N为向量x、y的维数。In the formula, l(x, y) represents the cross entropy of the vector x and y, which is defined as the vector {l 1 ,...,l n ,...,l N } composed of the cross entropy of each component of the vector x and y T , ln is the cross entropy of the corresponding components x n and y n of the vectors x and y, wn is the weight of the component n , and N is the dimension of the vectors x and y. 5.根据权利要求3或4所述的多智能体强对抗仿真方法,其特征在于,在所述训练获取神经网络策略模型之前,还包括:5. The multi-agent strong confrontation simulation method according to claim 3 or 4, characterized in that, before the training obtains the neural network strategy model, further comprising: 确定所述策略网络的回报函数如下:Determine the reward function for the policy network as follows: Reward=-log(D(ΠL));Reward=-log(D(Π L )); 式中,Reward表示所述策略网络的回报,ПL表示所述模仿样本,D(ПL)表示所述判别网络对所述模仿样本的实际输出与预期输出的交叉熵;In the formula, Reward represents the reward of the strategy network, П L represents the imitation sample, D(П L ) represents the cross entropy between the actual output and the expected output of the discriminant network for the imitation sample; 确定所述策略网络的目标为最大化所述策略网络的回报;determining that the objective of the policy network is to maximize the reward of the policy network; 和/或,确定所述策略网络的损失函数如下:and/or, determine the loss function of the policy network as follows:
[The loss function of the strategy network is given in the original claim only as an embedded formula image (FDA0002777479470000021); its symbols are explained below.]
where pd denotes the probability distribution of the confrontation command parameters constructed from the parameters output by the strategy network, action denotes the command parameter values obtained by sampling from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action, entropy denotes the entropy of the probability distribution, and β denotes a hyperparameter.
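To make the loss structure of claims 3 to 5 concrete, the following is a minimal PyTorch-style sketch, not taken from the patent itself: the discriminant loss is the sum of binary cross-entropies on demonstration (expert) and imitation (learner) samples, and the strategy-network reward is -log D(Π_L). The exact form of the strategy-network loss is only available as an image in the original claim, so the entropy-regularised policy-gradient form below, the label convention, and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_out_expert, d_out_learner, expert_label=0.0, learner_label=1.0):
    # Claim 3: D_loss = D_loss-expert + D_loss-learner, each term a binary
    # cross-entropy between the discriminant network's actual and expected outputs.
    # The label convention is an assumption; it is chosen so that maximising
    # Reward = -log D(learner) pushes the learner toward expert-like behaviour,
    # as in the standard GAIL formulation.
    loss_expert = F.binary_cross_entropy(
        d_out_expert, torch.full_like(d_out_expert, expert_label))
    loss_learner = F.binary_cross_entropy(
        d_out_learner, torch.full_like(d_out_learner, learner_label))
    return loss_expert + loss_learner

def policy_reward(d_out_learner):
    # Claim 5: Reward = -log(D(Π_L)); a small epsilon keeps the log finite.
    return -torch.log(d_out_learner + 1e-8)

def policy_loss(dist, action, reward, beta=0.01):
    # Assumed entropy-regularised policy-gradient loss; the patent's exact
    # formula (image FDA0002777479470000021) may differ.
    log_prob = dist.log_prob(action).sum(-1)   # log density at the sampled command parameters
    entropy = dist.entropy().sum(-1)           # entropy of the command-parameter distribution
    return -(reward.detach() * log_prob + beta * entropy).mean()
```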
6. The multi-agent strong confrontation simulation method according to claim 5, characterized in that using the neural network strategy model to simulate the decision-making process of the multi-agent in the strong confrontation process comprises:
constructing the confrontation command parameter probability distribution based on the output of the strategy network, and sampling confrontation command parameters from the confrontation command parameter probability distribution;
converting the confrontation command parameters into a confrontation command list according to the interface format required by the confrontation simulation engine, and inputting the confrontation command list into the confrontation simulation engine.

7. The multi-agent strong confrontation simulation method according to claim 2, characterized in that the discriminant network is specifically a binary classification neural network, the input of the binary classification neural network is a tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar in [0, 1].

8. A multi-agent strong confrontation simulation apparatus, characterized by comprising:
a training module, configured to obtain multiple rounds of demonstration confrontation playback data from a confrontation simulation engine and, based on the confrontation playback data, train a neural network strategy model using generative adversarial network technology;
a simulation module, configured to use the neural network strategy model to simulate the decision-making process of the multi-agent in the strong confrontation process, thereby completing the multi-agent strong confrontation simulation.

9. An electronic device comprising a memory, a processor, and a program or instruction stored on the memory and executable on the processor, characterized in that, when the processor executes the program or instruction, the steps of the multi-agent strong confrontation simulation method according to any one of claims 1 to 7 are implemented.

10. A non-transitory computer-readable storage medium on which a program or instruction is stored, characterized in that, when the program or instruction is executed by a computer, the steps of the multi-agent strong confrontation simulation method according to any one of claims 1 to 7 are implemented.
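As an illustration of claims 6 and 7, the sketch below shows one plausible way to wire the two networks to the confrontation simulation engine: the discriminant network is a binary classifier over a tensor encoding of the joint confrontation state concatenated with the command list, and the strategy network's outputs parameterise a distribution over command parameters that is sampled and then packed into the engine's command-list format. The layer sizes, the Gaussian parameterisation, the command-list field names, and the `engine.step` interface are hypothetical assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Claim 7 sketch: binary classifier over [state, command] tensor encodings."""
    def __init__(self, state_dim, command_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + command_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # binary classification scalar in [0, 1]
        )

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1)).squeeze(-1)

class StrategyNetwork(nn.Module):
    """Outputs mean/log-std that parameterise the command-parameter distribution (assumed Gaussian)."""
    def __init__(self, state_dim, command_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, command_dim)
        self.log_std = nn.Linear(hidden, command_dim)

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

def to_command_list(params):
    # Claim 6 sketch: convert sampled command parameters into the command-list
    # format required by the simulation engine; the field names are hypothetical.
    return [{"agent_id": i, "command": p.tolist()} for i, p in enumerate(params)]

# Usage sketch for one decision step of the simulated confrontation:
#   state = torch.randn(n_agents, state_dim)
#   dist = strategy_net(state)
#   params = dist.sample()                  # confrontation command parameters
#   engine.step(to_command_list(params))    # hypothetical engine interface
```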
CN202011270335.7A 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment Pending CN112434791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011270335.7A CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011270335.7A CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112434791A true CN112434791A (en) 2021-03-02

Family

ID=74701309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011270335.7A Pending CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112434791A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016464A1 (en) * 2004-07-16 2007-01-18 John Yen Agent-based collaborative recognition-primed decision-making
EP1659468A2 (en) * 2004-11-16 2006-05-24 Rockwell Automation Technologies, Inc. Universal run-time interface for agent-based simulation and control systems
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
US20200090042A1 (en) * 2017-05-19 2020-03-19 Deepmind Technologies Limited Data efficient imitation of diverse behaviors
CN108764453A (en) * 2018-06-08 2018-11-06 中国科学技术大学 The modeling method and action prediction system of game are synchronized towards multiple agent
CN109598342A (en) * 2018-11-23 2019-04-09 中国运载火箭技术研究院 A kind of decision networks model is from game training method and system
CN111507880A (en) * 2020-04-18 2020-08-07 郑州大学 Crowd confrontation simulation method based on emotional infection and deep reinforcement learning
CN111767786A (en) * 2020-05-11 2020-10-13 北京航空航天大学 Adversarial attack method and device based on three-dimensional dynamic interactive scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINYAN GAN et al.: "Multi-Agent Based Hybrid Evolutionary Algorithm", 2011 Seventh International Conference on Natural Computation *
孙旭朋 et al.: "Physics-of-failure-based reliability prediction and approximate modeling method for electronic equipment of underwater vehicles", 《环境技术》 (Environment Technology), no. 5 *
谭浪: "Research on the Application of Reinforcement Learning in Multi-Agent Confrontation", 《中国优秀硕士论文期刊全文数据库(信息科技辑)》 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298260A (en) * 2021-06-11 2021-08-24 中国人民解放军国防科技大学 Confrontation simulation deduction method based on deep reinforcement learning
CN113894780A (en) * 2021-09-27 2022-01-07 中国科学院自动化研究所 Multi-robot cooperative countermeasure method and device, electronic equipment and storage medium
CN114254722A (en) * 2021-11-17 2022-03-29 中国人民解放军军事科学院国防科技创新研究院 Game countermeasure oriented multi-intelligent model fusion method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN118887029A (en) * 2024-10-09 2024-11-01 中国科学院自动化研究所 A social simulation method, device and equipment based on large model intelligent agent
CN119647512A (en) * 2024-10-31 2025-03-18 中国科学院自动化研究所 Decision-making method, device, equipment and medium based on multi-agent

Similar Documents

Publication Publication Date Title
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
Narasimhan et al. Grounding language for transfer in deep reinforcement learning
CN112052948B (en) Network model compression method and device, storage medium and electronic equipment
Noothigattu et al. Interpretable multi-objective reinforcement learning through policy orchestration
CN115185294B (en) QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method
CN114510012B (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN113230650B (en) Data processing method and device and computer readable storage medium
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN112163671A (en) New energy scene generation method and system
Laversanne-Finot et al. Intrinsically motivated exploration of learned goal spaces
Yang et al. Social learning with actor–critic for dynamic grasping of underwater robots via digital twins
CN116523076A (en) A multi-directional curriculum reinforcement learning method and device for multi-agent decision-making
CN116362349A (en) A reinforcement learning method and device based on an environment dynamic model
CN112044082B (en) Information detection method and device and computer readable storage medium
CN120079113A (en) An intelligent war game decision-making method based on sample transfer
CN119514614A (en) A drone capture resource allocation method based on adversarial generative imitation learning in pursuit scenarios
CN119005287A (en) Man-machine reinforcement learning method based on multidimensional human feedback fusion
Mediratta et al. A study of generalization in offline reinforcement learning
CN112933605B (en) Virtual object control and model training method and device and computer equipment
CN111443806B (en) Interactive task control method and device, electronic equipment and storage medium
CN116579231A (en) An Environmental Modeling Method Based on Reinforcement Learning
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
Tanskanen et al. Modeling Risky Choices in Unknown Environments
CN120134328B (en) A robot imitation learning method, control method, device, equipment and medium
CN118045360B (en) Wargame agent training method, prediction method and corresponding system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210302