
CN112434791A - Multi-agent strong countermeasure simulation method and device and electronic equipment - Google Patents

Multi-agent strong countermeasure simulation method and device and electronic equipment

Info

Publication number
CN112434791A
Authority
CN
China
Prior art keywords: confrontation, network, agent, strong, loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011270335.7A
Other languages
Chinese (zh)
Inventor
白桦
王群勇
孙旭朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Original Assignee
BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE filed Critical BEIJING SHENGTAOPING TEST ENGINEERING TECHNOLOGY RESEARCH INSTITUTE
Priority to CN202011270335.7A priority Critical patent/CN112434791A/en
Publication of CN112434791A publication Critical patent/CN112434791A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-agent strong confrontation simulation method and device and an electronic device, wherein the method comprises the following steps: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model using generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model, thereby completing the strong confrontation simulation of the multi-agent. By learning from historical data, the training of the multi-agent strong confrontation model is accelerated, which effectively improves operating efficiency and saves computing resources.

Description

Multi-agent strong countermeasure simulation method and device and electronic equipment
Technical Field
The invention relates to the technical field of system simulation, in particular to a multi-agent strong countermeasure simulation method and device and electronic equipment.
Background
Multi-agent modeling is grounded in the model theory of artificial intelligence and organizational behavior, and combines Multi-Agent Systems (MAS) with research on mathematical models in specific fields. It already covers many traditional and emerging scientific areas, such as bionic optimization algorithms, computational economics, artificial societies, knowledge propagation engineering, and complex systems of war and politics.
The existing deep reinforcement learning framework built around the Deep Q-Network (DQN) is one of the main methods for establishing a multi-agent strong countermeasure model. However, in multi-agent strong countermeasure applications the continuous, time-sequenced action space has an enormous dimension, so the number of DQN model parameters is also enormous. If these parameters are trained from their initial values, a large amount of training time is consumed before satisfactory results are obtained, and efficiency is low.
Disclosure of Invention
The invention provides a multi-agent strong countermeasure simulation method and device and an electronic device, which are intended to overcome the low operating efficiency of the prior art and to effectively improve operating efficiency.
The invention provides a multi-agent strong confrontation simulation method, which comprises the following steps:
acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model using generative adversarial network technology;
and simulating the decision process of the multi-agent in the strong confrontation process by using the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, the neural network strategy model comprises a discrimination network and a policy network;
wherein the discrimination network is used to classify input confrontation data, and the output of the discrimination network indicates whether the input confrontation data conforms to the demonstration confrontation strategy;
the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, before the training and obtaining the neural network strategy model, the method further comprises the following steps:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
According to the multi-agent strong confrontation simulation method of one embodiment of the invention, before determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, the method further comprises:
calculating the cross entropy according to the following formulas:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
According to the multi-agent strong countermeasure simulation method of one embodiment of the invention, before the training and obtaining the neural network strategy model, the method further comprises the following steps:
determining the reward function of the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure BDA0002777479480000031
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
According to the multi-agent strong confrontation simulation method, the decision process of the multi-agent in the strong confrontation process is simulated by utilizing the neural network strategy model, and the method comprises the following steps:
constructing the countermeasure command parameter probability distribution based on the output of the policy network, and sampling and acquiring countermeasure command parameters from the countermeasure command parameter probability distribution;
converting the countermeasure command parameters into a countermeasure command list according to an interface format required by the countermeasure simulation engine, and inputting the countermeasure command list into the countermeasure simulation engine.
According to the multi-agent strong countermeasure simulation method, the discrimination network is specifically a binary classification neural network, the input of the binary classification neural network is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar within [0, 1].
The invention also provides a multi-agent strong confrontation simulation device, which comprises:
the training module is used for acquiring multiple rounds of demonstration confrontation playback data from the confrontation simulation engine and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology;
and the simulation module is used for simulating the decision process of the multi-agent in the strong confrontation process by utilizing the neural network strategy model so as to complete the strong confrontation simulation of the multi-agent.
The invention also provides an electronic device, which comprises a memory, a processor and a program or an instruction which is stored on the memory and can run on the processor, wherein when the processor executes the program or the instruction, the steps of the multi-agent strong countermeasure simulation method are realized.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a program or instructions which, when executed by a computer, implement the steps of the multi-agent strong confrontation simulation method according to any one of the above.
According to the multi-agent strong-confrontation simulation method, the multi-agent strong-confrontation simulation device and the electronic equipment, the training speed of the multi-agent strong-confrontation model can be accelerated by learning historical data, so that the operation efficiency is effectively improved, and the computing resources are effectively saved.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed for the description are briefly introduced below. The drawings described below show some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of the overall system structure in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 2 is a schematic flow chart of a multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 3 is a schematic flow chart of data acquisition of a demonstration countermeasure playback in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 4 is a schematic diagram of a data structure in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 5 is a schematic diagram of a reinforcement learning control loop in the multi-agent strong confrontation simulation method provided by the present invention;
FIG. 6 is a schematic diagram of a DQN behavior value function approximation network in the multi-agent strong countermeasure simulation method provided by the present invention;
FIG. 7 is a schematic flow chart of a neural network strategy model training in the multi-agent strong confrontation simulation method provided by the present invention;
FIG. 8 is a schematic structural diagram of a multi-agent strong countermeasure simulation apparatus provided in the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To address the low operating efficiency of multi-agent strong confrontation simulation in the prior art, the invention accelerates the training of the multi-agent strong confrontation model by learning from historical data, thereby effectively improving operating efficiency and saving computing resources. The present invention is described and explained below through several embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the overall system structure in the multi-agent strong countermeasure simulation method provided by the present invention. To quickly establish a highly intelligent neural network confrontation strategy, generative adversarial network technology is first adopted: existing high-level confrontation replay data are used to rapidly optimize a neural network strategy model, so that the model imitates the confrontation strategies adopted in the replays and reaches the same level of intelligence. The neural network strategy model generated by this training can be used directly for intelligent confrontation simulation, and can be further optimized and improved through reinforcement learning to reach an even higher level of intelligence.
Fig. 2 is a schematic flow chart of a multi-agent strong confrontation simulation method provided by the present invention, as shown in fig. 2, the method includes:
s201, acquiring multi-round demonstration countermeasure playback data from an countermeasure simulation engine, and training and acquiring a neural network strategy model by adopting a countermeasure network generation technology based on the countermeasure playback data.
It can be understood that, as shown in fig. 3, which is a schematic flow chart of demonstration confrontation playback data acquisition in the multi-agent strong confrontation simulation method provided by the present invention, the invention first needs to obtain multiple rounds of demonstration confrontation playback data from the simulation engine. Optionally, the confrontation playback data may be saved in a playback buffer.
As shown in fig. 4, which is a schematic diagram of the data structure in the multi-agent strong confrontation simulation method provided by the present invention, each sample point in the playback buffer is the data of one confrontation step and includes a joint confrontation situation s and a presenter confrontation command list a.
The collected confrontation playback data can be generated by the manual operation of high-level human players or by a highly optimized automated confrontation rule program written by professional engineers. The confrontation playback records only need to be stored by the simulation engine; no additional manual labeling is required.
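For illustration only, the following Python sketch shows one possible (assumed) structure for such a playback buffer, with each sample pairing a joint confrontation situation s with the presenter confrontation command list a; the class and field names are not part of the original disclosure.

```python
from dataclasses import dataclass
from typing import Any, List
import random

@dataclass
class DemoSample:
    situation: Any   # tensor-encoded joint confrontation situation s
    commands: Any    # presenter confrontation command list a

class PlaybackBuffer:
    def __init__(self) -> None:
        self.samples: List[DemoSample] = []

    def add(self, situation: Any, commands: Any) -> None:
        # one sample point corresponds to one confrontation step
        self.samples.append(DemoSample(situation, commands))

    def sample_batch(self, batch_size: int) -> List[DemoSample]:
        # random sampling, used later when training the networks
        return random.sample(self.samples, min(batch_size, len(self.samples)))
```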
After the confrontation playback data have been collected, generative adversarial network technology can be adopted to train the neural network strategy model, so that the model learns the confrontation strategy adopted by the demonstrator.
S202, simulating a decision process of the multi-agent in the strong confrontation process by using the neural network strategy model, and completing the strong confrontation simulation of the multi-agent.
It can be understood that, after the neural network strategy model has been trained according to the above steps, the multi-agent strong confrontation simulation test can be carried out, and the decision process of the multi-agent in the strong confrontation process can be simulated with the neural network strategy model. For example, the imitator in fig. 1 can be made to mimic the decision process of the presenter when interacting with the confrontation simulation engine.
It should be understood that a single agent is often unable to describe and solve real-world complex, large-scale problems, so an application system usually contains multiple agents. The agents not only have their own problem-solving abilities and behavioral goals, but can also cooperate with one another to achieve a common overall goal; such a system is an MAS. An MAS has the following properties: each agent has incomplete information or limited ability to solve the problem; data are stored and processed in a decentralized way, without a system-level centralized data processing structure; there is interactivity inside the system and encapsulation of the system as a whole; and computation is asynchronous, so some shared resources need to be locked.
Multi-agent simulation uses systems theory and multi-agent system modeling methods to establish a high-level model of the system, and realizes the simulation with a system computation model built on agent-model-based simulation software and hardware support technologies.
According to the multi-agent strong confrontation simulation method provided by the invention, when historical experience data are available, learning from the historical data accelerates the training of the multi-agent strong confrontation model, thereby effectively improving operating efficiency and saving computing resources. In the military field, if combat data of a virtual adversary exist, the method can quickly establish a confrontation model of the virtual adversary and simulate the adversary's operational behavior, which can then be used for the simulation training of one's own commanders and fighters.
It should be understood that Reinforcement Learning (RL) modeling studies how an agent interacts with its environment to optimize an objective. Reinforcement learning is then formalized as a Markov decision process, which is the theoretical basis of reinforcement learning.
Next, three main functions that an agent can learn are introduced:
strategy → value function → model
Reinforcement learning is related to solving sequential decision problems, and many real-world problems, such as video game play, sports, driving, etc., can be solved in this manner.
Each of these problems has an objective or purpose, such as winning the game, reaching the destination safely, or minimizing the cost of manufacturing a product. The agent takes actions and receives feedback from the world about how close it is to the goal (the current score, the distance to the destination, or the unit price). Achieving a goal typically requires taking many actions in turn, each of which changes the surrounding world. These changes in the world and the feedback received are observed before further actions are taken in response.
The reinforcement learning problem can be represented as a system consisting of an agent and an environment. The environment generates information describing the state of the system, referred to as the state. The agent interacts with the environment by observing the state and using this information to select an action. The environment accepts the action and transitions to the next state, then returns the next state and a reward to the agent. One completed cycle of (state → action → reward) is one step. This cycle repeats until the environment terminates (e.g., when the problem is solved). Fig. 5 is a schematic diagram of the reinforcement learning control loop in the multi-agent strong confrontation simulation method provided by the present invention; it depicts the whole process of this loop. A minimal code sketch of this loop is given below.
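The sketch below illustrates the (state → action → reward) control loop described above; Env and Agent are generic placeholders rather than components defined by the invention.

```python
def run_episode(env, agent, max_steps=1000):
    """One reinforcement learning episode: repeat (state -> action -> reward) until termination."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # observe the state and select an action
        next_state, reward, done = env.step(action)       # the environment transitions and returns feedback
        agent.observe(state, action, reward, next_state)  # feedback available for learning
        total_reward += reward
        state = next_state
        if done:                                          # the environment terminated
            break
    return total_reward
```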
Consider how the environment transitions from one state to the next, through what is called a transition function. In reinforcement learning the transition function is formulated as a Markov Decision Process (MDP), a mathematical framework for modeling sequential decisions. To see how the transition function is expressed as an MDP, consider the following equation:
s_{t+1} ~ P(s_{t+1} | (s_0, a_0), (s_1, a_1), ..., (s_t, a_t));
where at time step t the next state s_{t+1} is sampled from a probability distribution P conditioned on the entire history, i.e., the environment's transition from state s_t to s_{t+1} depends on all previous states s and actions a.
To make the transition function more practical, it is converted into an MDP by adding the following assumption: the next state s_{t+1} depends only on the previous state s_t and action a_t; this is called the Markov property. Under this assumption, the transition function becomes:
s_{t+1} ~ P(s_{t+1} | s_t, a_t);
The above formula states that the next state s_{t+1} is sampled from the probability distribution P(s_{t+1} | s_t, a_t). This is a simpler form of the original transition function. The Markov property means that the current state and action at time step t contain enough information to fully determine the transition probability of the next state at t + 1.
Combining the concepts of reinforcement learning with deep neural network technology yields deep reinforcement learning methods such as the Deep Q-Network (DQN), in which a deep neural network is constructed. Fig. 6 is a schematic diagram of the DQN behavior value function approximation network in the multi-agent strong confrontation simulation method provided by the present invention: the input is the environment variable, the output is the action variable, and the neural network is trained with the objective of maximizing the return value.
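As an illustrative sketch only (layer sizes and dimensions are assumptions, not taken from the original), such a value-approximation network could be written in PyTorch as follows: the input is the environment state and the output is one estimated return per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style behavior value function approximation network (cf. Fig. 6)."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one estimated return per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Acting greedily with respect to the estimated returns:
# q = QNetwork(state_dim=32, num_actions=8)
# action = q(torch.randn(1, 32)).argmax(dim=-1)
```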
Optionally, in the multi-agent strong countermeasure simulation method provided by the above embodiments, the neural network policy model includes a discrimination network and a policy network.
The discrimination network is used to classify input confrontation data, and its output indicates whether the input confrontation data conforms to the demonstration confrontation strategy; the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
It can be understood that, as shown in fig. 1, the neural network policy model of the present invention consists of a discrimination network D and a policy network A. The discrimination network D classifies the input confrontation data and outputs a scalar value between 0 and 1 that judges whether the input data conforms to the demonstration confrontation strategy: 0 means fully conforming and 1 means not conforming at all. The optimization goal of the discrimination network D is therefore to judge all data as accurately as possible.
The policy network A reads the confrontation situation (environment) data and generates the confrontation commands to be taken in that situation. Its aim is to imitate the demonstrated confrontation as accurately as possible, which also means fooling the discrimination network D so that it cannot distinguish whether the confrontation data were generated by the demonstration player or by the policy network. The discrimination network D and the policy network A therefore form an adversarial relationship and are trained alternately. When the two networks reach equilibrium, the discrimination network D scores the demonstration confrontation data and the confrontation data generated by the policy network with nearly equal probability (i.e., it can no longer effectively distinguish between them; ideally the value approaches 0.5, meaning the discrimination network cannot distinguish them at all), and at this point the policy network A has learned a confrontation strategy close to that of the demonstration player.
Thus, optionally, the processing procedure for training the neural network policy model is shown in fig. 7, a schematic flow chart of neural network strategy model training in the multi-agent strong confrontation simulation method provided by the present invention. It includes the following processing steps (a minimal code sketch of this loop is given after the list):
(1) randomly sampling a batch of samples from the demonstration confrontation playback buffer;
(2) each batch sample contains a joint confrontation situation and a presenter confrontation command list, so the batch can be used directly as the presenter batch samples;
(3) generating the imitator batch samples, which comprises:
(3.1) taking the joint confrontation situation samples from the batch;
(3.2) inputting the joint confrontation situation samples into policy network A to generate its output;
(3.3) generating the imitator confrontation command lists from the output of policy network A;
(3.4) combining the joint confrontation situations with the corresponding imitator confrontation command lists to form the imitator batch samples;
(4) inputting the presenter batch samples and the imitator batch samples together into discrimination network D, computing the loss function of discrimination network D, and performing one round of optimization training on discrimination network D;
(5) using discrimination network D to judge the imitator batch samples and produce its output;
(6) computing the loss of policy network A from the judgment result of discrimination network D and performing one round of optimization training on policy network A.
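The following Python (PyTorch-style) sketch illustrates one such training iteration under stated assumptions: D and A stand for the discrimination and policy networks, encode and build_distribution are placeholder helpers, and the exact form of the policy loss is an assumed combination of the reward and entropy terms described later, not the patent's own formula.

```python
import torch
import torch.nn.functional as F

def train_iteration(buffer, D, A, d_opt, a_opt, encode, build_distribution,
                    batch_size=64, beta=0.01):
    batch = buffer.sample_batch(batch_size)                          # (1) random batch of demo samples
    situations = torch.stack([encode(s.situation) for s in batch])
    expert_cmds = torch.stack([encode(s.commands) for s in batch])   # (2) presenter batch samples

    pd = build_distribution(A(situations))                           # (3) policy output -> distribution pd
    learner_cmds = pd.sample()                                       #     imitator confrontation commands

    # (4) discrimination network update: demonstration samples -> 0, imitation samples -> 1
    d_expert = D(situations, expert_cmds)
    d_learner = D(situations, learner_cmds.detach())
    d_loss = F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert)) + \
             F.binary_cross_entropy(d_learner, torch.ones_like(d_learner))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # (5)-(6) policy network update: Reward = -log D(imitation); entropy bonus weighted by beta (assumed form)
    reward = -torch.log(D(situations, learner_cmds) + 1e-8).detach()
    a_loss = -(pd.log_prob(learner_cmds) * reward.squeeze(-1)).mean() - beta * pd.entropy().mean()
    a_opt.zero_grad()
    a_loss.backward()
    a_opt.step()
    return d_loss.item(), a_loss.item()
```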
Optionally, the discrimination network is specifically a binary classification neural network, the input of which is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of which is a binary classification scalar within [0, 1].
Specifically, the discrimination network D of the present invention may be a typical binary classification neural network whose input is the tensor encoding of the joint confrontation situation together with the confrontation command list, and whose output is a binary classification scalar between 0 and 1.
Optionally, the network structure and scale of the binary classification neural network can be chosen in view of the characteristics of the input data; for example, a convolutional network (CNN) or a multilayer perceptron (MLP) is typically adopted, and the parameter dimensions and network depth are adjusted according to the number of input data attributes and the complexity of their relationships.
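By way of illustration, a minimal MLP realization of such a discrimination network D might look as follows (dimensions and layer sizes are assumptions): the input is the tensor encoding of the joint confrontation state concatenated with the confrontation command list, and the output is a binary classification scalar in [0, 1].

```python
import torch
import torch.nn as nn

class DiscriminationNetwork(nn.Module):
    def __init__(self, state_dim: int, command_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + command_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # binary classification scalar in [0, 1]
        )

    def forward(self, state: torch.Tensor, commands: torch.Tensor) -> torch.Tensor:
        # concatenate the joint confrontation state encoding with the command list encoding
        return self.net(torch.cat([state, commands], dim=-1))
```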
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before the training to obtain the neural network policy model, the method further includes:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
Specifically, the loss of the discrimination network D in the present invention is the sum of the discrimination losses on the demonstration samples and the imitation samples:
D_loss = D_loss-expert + D_loss-learner;
where the discrimination losses D_loss-expert and D_loss-learner are the cross entropies between the actual output and the expected output of the discrimination network D on the demonstration samples and the imitation samples, respectively. A demonstration sample should be judged as fully conforming to the demonstration confrontation strategy, so its expected output is 0; an imitation sample should be judged as not conforming to the demonstration confrontation strategy, so its expected output is 1.
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, the method further includes:
calculating the cross entropy according to the following formulas:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
Thus, the loss calculation function of the discrimination network D can be expressed as:
D_loss = BCELoss(D(Π_E), 0) + BCELoss(D(Π_L), 1);
where BCELoss is the binary cross entropy, Π_E is a demonstration sample, and Π_L is an imitation sample.
The optimization goal of the discrimination network D is to minimize the total discrimination loss.
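A minimal sketch of this loss computation, assuming a discrimination network of the form sketched earlier and PyTorch's nn.BCELoss as the binary cross entropy, is:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discrimination_loss(D, expert_states, expert_cmds, learner_states, learner_cmds):
    d_expert = D(expert_states, expert_cmds)     # expected output 0: conforms to the demonstration
    d_learner = D(learner_states, learner_cmds)  # expected output 1: does not conform
    d_loss_expert = bce(d_expert, torch.zeros_like(d_expert))
    d_loss_learner = bce(d_learner, torch.ones_like(d_learner))
    return d_loss_expert + d_loss_learner        # D_loss = D_loss-expert + D_loss-learner
```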
Further, on the basis of the multi-agent strong confrontation simulation method provided by each of the above embodiments, before the training to obtain the neural network policy model, the method further includes:
determining a reward function for the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure BDA0002777479480000121
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
Specifically, the structural design of the policy network A is similar to that of a policy network in reinforcement learning; a convolutional network (CNN), a multilayer perceptron (MLP), or the like can be chosen according to the input and output characteristics, and parameters such as the input and output dimensions and the network depth are selected and adjusted in view of the characteristics of the simulation data.
The technical framework of the invention is applicable to different types of agents, denoted by subscript i, and different numbers of agents of the same type, denoted by subscript j.
Then, the reward calculation formula of policy network A is:
Reward = -log(D(Π_L));
The optimization goal of policy network A is to maximize this return.
The loss calculation function of policy network A is:
Figure BDA0002777479480000122
Here pd is the confrontation command parameter probability distribution constructed from the parameters output by policy network A; the type of probability distribution can be chosen according to the characteristics of each parameter: discrete parameters such as the command type can use a Categorical distribution, while continuous parameters such as the coordinates x and y can use a Normal distribution. action is a command parameter value sampled from the constructed probability distribution; log_prob is the log probability density of the probability distribution at the sampled action value; entropy is the entropy of the probability distribution; and beta is a hyperparameter that controls the weight of the maximum-entropy objective in the policy network loss and is adjusted during training according to the training situation.
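For illustration, the following sketch constructs pd with a Categorical distribution for the discrete command type and a Normal distribution for continuous coordinates, and assembles a loss from the terms listed above. The layout of the policy network output and the exact combination of log_prob, Reward, and entropy are assumptions, since the original gives the loss only as an image.

```python
import torch
from torch.distributions import Categorical, Normal

def policy_loss(policy_output, reward, beta=0.01):
    # assumed layout of the policy network output (not specified by the original)
    cmd_logits, xy_mean, xy_std = policy_output
    cmd_pd = Categorical(logits=cmd_logits)           # discrete parameter: command type
    xy_pd = Normal(xy_mean, xy_std.clamp(min=1e-3))   # continuous parameters: coordinates x, y

    cmd = cmd_pd.sample()                             # action: sampled command parameters
    xy = xy_pd.sample()
    log_prob = cmd_pd.log_prob(cmd) + xy_pd.log_prob(xy).sum(dim=-1)
    entropy = cmd_pd.entropy() + xy_pd.entropy().sum(dim=-1)

    # assumed form: maximize the log_prob-weighted Reward plus a beta-weighted entropy bonus
    return -(log_prob * reward.detach()).mean() - beta * entropy.mean()
```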
Optionally, in the multi-agent strong confrontation simulation method provided by the above embodiments, simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model includes: constructing the confrontation command parameter probability distribution based on the output of the policy network, and sampling confrontation command parameters from it; converting the confrontation command parameters into a confrontation command list according to the interface format required by the confrontation simulation engine, and inputting the confrontation command list into the confrontation simulation engine.
Specifically, the policy network A of the present invention is similar to a policy network in reinforcement learning: its input is the tensor encoding of the joint confrontation situation, and its output consists of probability distribution parameters that can be used to construct the confrontation command list. The automatic confrontation program constructs the confrontation command parameter probability distribution pd from the output of policy network A, samples pd to obtain confrontation command parameters, and finally converts these parameters into a confrontation command list in the interface format required by the confrontation simulation engine and inputs the list into the engine.
Based on the same inventive concept, the invention provides, according to the above embodiments, a multi-agent strong countermeasure simulation apparatus for realizing the strong countermeasure simulation of multiple agents in the above embodiments. The descriptions and definitions in the multi-agent strong confrontation simulation method of the above embodiments can therefore be used to understand the execution modules of the multi-agent strong confrontation simulation apparatus of the invention; reference may be made to the above embodiments, and details are not repeated here.
According to an embodiment of the present invention, the structure of the multi-agent strong countermeasure simulation apparatus is shown in fig. 8, which is a schematic structural diagram of the multi-agent strong countermeasure simulation apparatus provided by the present invention, the apparatus can be used for the strong countermeasure simulation of the multi-agent, the apparatus includes: a training module 801 and a simulation module 802.
The training module 801 is configured to acquire multiple rounds of demonstration countermeasure playback data from a countermeasure simulation engine and, based on the countermeasure playback data, train a neural network strategy model using generative adversarial network technology; the simulation module 802 is configured to simulate the decision process of the multi-agent in the strong countermeasure process with the neural network policy model, so as to complete the strong countermeasure simulation of the multi-agent.
Specifically, the training module 801 first obtains multiple rounds of demonstration confrontation playback data from the simulation engine. After the confrontation playback data have been collected, the training module 801 can train the neural network strategy model using generative adversarial network technology so that the model learns the confrontation strategy adopted by the presenter.
Then, the simulation module 802 can carry out the multi-agent strong confrontation simulation test and simulate the decision process of the multi-agent in the strong confrontation process with the neural network policy model; for example, the imitator in fig. 1 can be made to mimic the decision process of the presenter when interacting with the confrontation simulation engine.
The multi-agent strong-confrontation simulation device provided by the invention can accelerate the training speed of the multi-agent strong-confrontation model by learning historical data, thereby effectively improving the operation efficiency and effectively saving the computing resources.
Optionally, the neural network policy model includes a discrimination network and a policy network;
wherein the discrimination network is used to classify input confrontation data, and the output of the discrimination network indicates whether the input confrontation data conforms to the demonstration confrontation strategy;
the policy network is used to read the state data of the strong confrontation process and, based on the state data, generate the confrontation strategy to be adopted under that state data.
Further, the training module is further configured to:
determining the sum of the discrimination losses on the demonstration samples and the imitation samples as the loss of the discrimination network, where the loss function of the discrimination network is expressed as follows:
D_loss = D_loss-expert + D_loss-learner;
where D_loss denotes the loss of the discrimination network, D_loss-expert denotes the cross entropy between the actual output and the expected output of the discrimination network on the demonstration samples, and D_loss-learner denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation samples;
the goal of the discrimination network is to minimize this total discrimination loss.
Further, the training module is further configured to:
the cross entropy is calculated as follows:
l(x, y) = L = {l_1, ..., l_n, ..., l_N}^T;
l_n = -w_n [y_n · log x_n + (1 - y_n) · log(1 - x_n)];
where l(x, y) denotes the cross entropy of vectors x and y, defined as the vector {l_1, ..., l_n, ..., l_N}^T formed from the cross entropies of the corresponding components of x and y; l_n is the cross entropy of the corresponding components x_n and y_n; w_n is the weight of component n; and N is the dimension of the vectors x and y.
Further, the training module is further configured to:
determining the reward function of the policy network as follows:
Reward = -log(D(Π_L));
where Reward denotes the return of the policy network, Π_L denotes the imitation sample, and D(Π_L) denotes the cross entropy between the actual output and the expected output of the discrimination network on the imitation sample;
determining the goal of the policy network to be maximizing the return of the policy network;
and/or determining the loss function of the policy network as follows:
Figure FDA0002777479470000021
where pd denotes the confrontation command parameter probability distribution constructed from the parameters output by the policy network, action denotes a command parameter value sampled from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action value, entropy denotes the entropy of the probability distribution, and beta denotes a hyperparameter.
Optionally, the simulation module is configured to:
constructing the countermeasure command parameter probability distribution based on the output of the policy network, and sampling and acquiring countermeasure command parameters from the countermeasure command parameter probability distribution;
converting the countermeasure command parameters into a countermeasure command list according to an interface format required by the countermeasure simulation engine, and inputting the countermeasure command list into the countermeasure simulation engine.
Optionally, the discrimination network is specifically a binary classification neural network, the input of the binary classification neural network is the tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar within [0, 1].
It is understood that the relevant program modules in the apparatus of the above embodiments can be implemented by a hardware processor in the present invention. Moreover, the multi-agent strong countermeasure simulation apparatus of the present invention can implement the multi-agent strong countermeasure simulation flow of each of the above method embodiments by using the above program modules; when used for the strong countermeasure simulation of the multi-agent in each of the above method embodiments, the apparatus produces the same beneficial effects as the corresponding method embodiments, so reference may be made to the above method embodiments and details are not repeated here.
As a further aspect of the present invention, the present invention provides an electronic device according to the above embodiments, the electronic device includes a memory, a processor and a program or instructions stored in the memory and executable on the processor, and the processor executes the program or instructions to implement the steps of the multi-agent robust simulation method according to the above embodiments.
Further, the electronic device of the present invention may further include a communication interface and a bus. Referring to fig. 9, a schematic structural diagram of an electronic device provided in the present invention includes: at least one memory 901, at least one processor 902, a communication interface 903, and a bus 904.
Wherein, the memory 901, the processor 902 and the communication interface 903 are communicated with each other through the bus 904, and the communication interface 903 is used for information transmission between the electronic equipment and the countermeasure data equipment; the memory 901 stores a program or instructions that can be executed on the processor 902, and when the processor 902 executes the program or instructions, the steps of the multi-agent warfare simulation method as described in the above embodiments are implemented.
It is understood that the electronic device at least comprises a memory 901, a processor 902, a communication interface 903 and a bus 904, and the memory 901, the processor 902 and the communication interface 903 form a communication connection with each other through the bus 904, and can complete the communication with each other, for example, the processor 902 reads program instructions of the multi-agent robust simulation method from the memory 901. In addition, the communication interface 903 can also realize communication connection between the electronic device and the countermeasure data device, and can complete mutual information transmission, such as reading of the countermeasure playback data through the communication interface 903.
When the electronic device is running, the processor 902 invokes the program instructions in the memory 901 to perform the methods provided by the above method embodiments, for example: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
The program instructions in the memory 901 may be implemented in the form of software functional units and stored in a computer readable storage medium when the program instructions are sold or used as independent products. Alternatively, all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, where the program may be stored in a computer-readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present invention also provides, according to the above embodiments, a non-transitory computer readable storage medium on which a program or instructions are stored; when executed by a computer, the program or instructions implement the steps of the multi-agent strong confrontation simulation method of the above embodiments, for example: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
As a further aspect, the present invention also provides, according to the above embodiments, a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multi-agent strong countermeasure simulation method provided by the above method embodiments, the method comprising: acquiring multiple rounds of demonstration confrontation playback data from a confrontation simulation engine, and, based on the confrontation playback data, training a neural network strategy model by adopting generative adversarial network technology; and simulating the decision process of the multi-agent in the strong confrontation process with the neural network strategy model to complete the strong confrontation simulation of the multi-agent.
According to the electronic device, the non-transitory computer readable storage medium and the computer program product provided by the invention, by executing the steps of the multi-agent strong confrontation simulation method described in each embodiment, the training speed of the multi-agent strong confrontation model can be accelerated by learning historical data, so that the operation efficiency is effectively improved, and the calculation resources are effectively saved.
It is to be understood that the above-described embodiments of the apparatus, the electronic device and the storage medium are merely illustrative, and that elements described as separate components may or may not be physically separate, may be located in one place, or may be distributed on different network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions mentioned above may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a usb disk, a removable hard disk, a ROM, a RAM, a magnetic or optical disk, etc., and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device, etc.) to execute the methods described in the method embodiments or some parts of the method embodiments.
In addition, it should be understood by those skilled in the art that in the specification of the present invention the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
In the description of the present invention, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1.一种多智能体强对抗仿真方法,其特征在于,包括:1. a multi-agent strong confrontation simulation method, is characterized in that, comprises: 从对抗仿真引擎获取多轮演示对抗回放数据,并基于所述对抗回放数据,采用生成对抗网络技术,训练获取神经网络策略模型;Obtain multiple rounds of demonstration confrontation playback data from the confrontation simulation engine, and based on the confrontation playback data, use the generative confrontation network technology to train and obtain a neural network strategy model; 利用所述神经网络策略模型,模拟所述多智能体在强对抗过程中的决策过程,完成多智能体强对抗仿真。Using the neural network strategy model, the decision process of the multi-agent in the strong confrontation process is simulated, and the multi-agent strong confrontation simulation is completed. 2.根据权利要求1所述的多智能体强对抗仿真方法,其特征在于,所述神经网络策略模型包括判别网络和策略网络;2. The multi-agent strong confrontation simulation method according to claim 1, wherein the neural network strategy model comprises a discriminant network and a strategy network; 其中,所述判别网络用于对输入对抗数据进行分类,所述判别网络的输出用于指示所述输入对抗数据是否符合演示对抗策略;Wherein, the discrimination network is used to classify the input confrontation data, and the output of the discrimination network is used to indicate whether the input confrontation data conforms to the demonstration confrontation strategy; 所述策略网络用于读取所述强对抗过程的状态数据,并基于所述状态数据,产生在所述状态数据下应采取的对抗策略。The strategy network is used to read the state data of the strong confrontation process, and based on the state data, generate a confrontation strategy that should be taken under the state data. 3.根据权利要求2所述的多智能体强对抗仿真方法,其特征在于,在所述训练获取神经网络策略模型之前,还包括:3. The multi-agent strong confrontation simulation method according to claim 2, wherein, before the training obtains the neural network strategy model, further comprising: 确定演示样本与模仿样本的判别损失总和,作为所述判别网络的损失,所述判别网络的损失函数表示如下:Determine the sum of the discriminative loss of the demo sample and the imitation sample as the loss of the discriminant network, and the loss function of the discriminant network is expressed as follows: Dloss=Dloss-expert+Dloss-learnerD loss =D loss-expert +D loss-learner ; 式中,Dloss表示所述判别网络的损失,Dloss-expert表示所述判别网络对所述演示样本的实际输出与预期输出的交叉熵,Dloss-learner表示所述判别网络对所述模仿样本的实际输出与预期输出的交叉熵;In the formula, D loss represents the loss of the discriminant network, D loss-expert represents the cross-entropy between the actual output and the expected output of the discriminant network for the demo sample, and D loss-learner represents the difference between the discriminative network and the imitation The cross entropy between the actual output of the sample and the expected output; 确定所述判别网络的目标为最小化所述判别损失总和。The objective of the discriminative network is determined to minimize the sum of the discriminative losses. 4.根据权利要求3所述的多智能体强对抗仿真方法,其特征在于,在所述确定演示样本与模仿样本的判别损失总和,作为所述判别网络的损失之前,还包括:4. The multi-agent strong confrontation simulation method according to claim 3, characterized in that, before the determination of the sum of the discriminative losses of the demonstration samples and the imitation samples is taken as the loss of the discriminant network, further comprising: 按如下公式计算所述交叉熵,所述如下公式为:The cross entropy is calculated according to the following formula, and the following formula is: l(x,y)=L={l1,...,ln,...,lN}Tl(x, y)=L={l 1 , . . . , ln , . . . 
, 1 N } T ; ln=-wn[yn·logxn+(1-vn)·log(1-xn)];l n =-w n [y n ·logx n +(1-v n )·log(1-x n )]; 式中,l(x,y)表示向量x与y的交叉熵,定义为向量x与y各个分量的交叉熵组成的向量{l1,...,ln,...,lN}T,ln为向量x、y的对应分量xn与yn的交叉熵,wn为分量n的权重,N为向量x、y的维数。In the formula, l(x, y) represents the cross entropy of the vector x and y, which is defined as the vector {l 1 ,...,l n ,...,l N } composed of the cross entropy of each component of the vector x and y T , ln is the cross entropy of the corresponding components x n and y n of the vectors x and y, wn is the weight of the component n , and N is the dimension of the vectors x and y. 5.根据权利要求3或4所述的多智能体强对抗仿真方法,其特征在于,在所述训练获取神经网络策略模型之前,还包括:5. The multi-agent strong confrontation simulation method according to claim 3 or 4, characterized in that, before the training obtains the neural network strategy model, further comprising: 确定所述策略网络的回报函数如下:Determine the reward function for the policy network as follows: Reward=-log(D(ΠL));Reward=-log(D(Π L )); 式中,Reward表示所述策略网络的回报,ПL表示所述模仿样本,D(ПL)表示所述判别网络对所述模仿样本的实际输出与预期输出的交叉熵;In the formula, Reward represents the reward of the strategy network, П L represents the imitation sample, D(П L ) represents the cross entropy between the actual output and the expected output of the discriminant network for the imitation sample; 确定所述策略网络的目标为最大化所述策略网络的回报;determining that the objective of the policy network is to maximize the reward of the policy network; 和/或,确定所述策略网络的损失函数如下:and/or, determine the loss function of the policy network as follows:
[The loss function of the strategy network is given in the original claim only as an embedded formula image (FDA0002777479470000021); its symbols are explained below.]
where pd denotes the probability distribution of the confrontation command parameters constructed from the parameters output by the strategy network, action denotes the command parameter values obtained by sampling from the constructed probability distribution, log_prob denotes the log probability density of the probability distribution at the sampled action, entropy denotes the entropy of the probability distribution, and β denotes a hyperparameter.
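To make the loss structure of claims 3 to 5 concrete, the following is a minimal PyTorch-style sketch, not taken from the patent itself: the discriminant loss is the sum of binary cross-entropies on demonstration (expert) and imitation (learner) samples, and the strategy-network reward is -log D(Π_L). The exact form of the strategy-network loss is only available as an image in the original claim, so the entropy-regularised policy-gradient form below, the label convention, and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_out_expert, d_out_learner, expert_label=0.0, learner_label=1.0):
    # Claim 3: D_loss = D_loss-expert + D_loss-learner, each term a binary
    # cross-entropy between the discriminant network's actual and expected outputs.
    # The label convention is an assumption; it is chosen so that maximising
    # Reward = -log D(learner) pushes the learner toward expert-like behaviour,
    # as in the standard GAIL formulation.
    loss_expert = F.binary_cross_entropy(
        d_out_expert, torch.full_like(d_out_expert, expert_label))
    loss_learner = F.binary_cross_entropy(
        d_out_learner, torch.full_like(d_out_learner, learner_label))
    return loss_expert + loss_learner

def policy_reward(d_out_learner):
    # Claim 5: Reward = -log(D(Π_L)); a small epsilon keeps the log finite.
    return -torch.log(d_out_learner + 1e-8)

def policy_loss(dist, action, reward, beta=0.01):
    # Assumed entropy-regularised policy-gradient loss; the patent's exact
    # formula (image FDA0002777479470000021) may differ.
    log_prob = dist.log_prob(action).sum(-1)   # log density at the sampled command parameters
    entropy = dist.entropy().sum(-1)           # entropy of the command-parameter distribution
    return -(reward.detach() * log_prob + beta * entropy).mean()
```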
6. The multi-agent strong confrontation simulation method according to claim 5, characterized in that using the neural network strategy model to simulate the decision-making process of the multi-agent in the strong confrontation process comprises:
constructing the confrontation command parameter probability distribution based on the output of the strategy network, and sampling confrontation command parameters from the confrontation command parameter probability distribution;
converting the confrontation command parameters into a confrontation command list according to the interface format required by the confrontation simulation engine, and inputting the confrontation command list into the confrontation simulation engine.

7. The multi-agent strong confrontation simulation method according to claim 2, characterized in that the discriminant network is specifically a binary classification neural network, the input of the binary classification neural network is a tensor encoding of the joint confrontation state and the confrontation command list, and the output of the binary classification neural network is a binary classification scalar in [0, 1].

8. A multi-agent strong confrontation simulation apparatus, characterized by comprising:
a training module, configured to obtain multiple rounds of demonstration confrontation playback data from a confrontation simulation engine and, based on the confrontation playback data, train a neural network strategy model using generative adversarial network technology;
a simulation module, configured to use the neural network strategy model to simulate the decision-making process of the multi-agent in the strong confrontation process, thereby completing the multi-agent strong confrontation simulation.

9. An electronic device comprising a memory, a processor, and a program or instruction stored on the memory and executable on the processor, characterized in that, when the processor executes the program or instruction, the steps of the multi-agent strong confrontation simulation method according to any one of claims 1 to 7 are implemented.

10. A non-transitory computer-readable storage medium on which a program or instruction is stored, characterized in that, when the program or instruction is executed by a computer, the steps of the multi-agent strong confrontation simulation method according to any one of claims 1 to 7 are implemented.
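As an illustration of claims 6 and 7, the sketch below shows one plausible way to wire the two networks to the confrontation simulation engine: the discriminant network is a binary classifier over a tensor encoding of the joint confrontation state concatenated with the command list, and the strategy network's outputs parameterise a distribution over command parameters that is sampled and then packed into the engine's command-list format. The layer sizes, the Gaussian parameterisation, the command-list field names, and the `engine.step` interface are hypothetical assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Claim 7 sketch: binary classifier over [state, command] tensor encodings."""
    def __init__(self, state_dim, command_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + command_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # binary classification scalar in [0, 1]
        )

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1)).squeeze(-1)

class StrategyNetwork(nn.Module):
    """Outputs mean/log-std that parameterise the command-parameter distribution (assumed Gaussian)."""
    def __init__(self, state_dim, command_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, command_dim)
        self.log_std = nn.Linear(hidden, command_dim)

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

def to_command_list(params):
    # Claim 6 sketch: convert sampled command parameters into the command-list
    # format required by the simulation engine; the field names are hypothetical.
    return [{"agent_id": i, "command": p.tolist()} for i, p in enumerate(params)]

# Usage sketch for one decision step of the simulated confrontation:
#   state = torch.randn(n_agents, state_dim)
#   dist = strategy_net(state)
#   params = dist.sample()                  # confrontation command parameters
#   engine.step(to_command_list(params))    # hypothetical engine interface
```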
CN202011270335.7A 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment Pending CN112434791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011270335.7A CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011270335.7A CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112434791A true CN112434791A (en) 2021-03-02

Family

ID=74701309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011270335.7A Pending CN112434791A (en) 2020-11-13 2020-11-13 Multi-agent strong countermeasure simulation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112434791A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016464A1 (en) * 2004-07-16 2007-01-18 John Yen Agent-based collaborative recognition-primed decision-making
EP1659468A2 (en) * 2004-11-16 2006-05-24 Rockwell Automation Technologies, Inc. Universal run-time interface for agent-based simulation and control systems
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
US20200090042A1 (en) * 2017-05-19 2020-03-19 Deepmind Technologies Limited Data efficient imitation of diverse behaviors
CN108764453A (en) * 2018-06-08 2018-11-06 中国科学技术大学 The modeling method and action prediction system of game are synchronized towards multiple agent
CN109598342A (en) * 2018-11-23 2019-04-09 中国运载火箭技术研究院 A kind of decision networks model is from game training method and system
CN111507880A (en) * 2020-04-18 2020-08-07 郑州大学 Crowd confrontation simulation method based on emotional infection and deep reinforcement learning
CN111767786A (en) * 2020-05-11 2020-10-13 北京航空航天大学 Adversarial attack method and device based on three-dimensional dynamic interactive scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINYAN GAN et al.: "Multi-Agent Based Hybrid Evolutionary Algorithm", 2011 Seventh International Conference on Natural Computation *
孙旭朋 et al.: "Physics-of-failure-based reliability prediction and approximate modeling method for electronic equipment of underwater vehicles", 《环境技术》 (Environment Technology), no. 5 *
谭浪: "Research on the Application of Reinforcement Learning in Multi-Agent Confrontation", 《中国优秀硕士论文期刊全文数据库(信息科技辑)》 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298260A (en) * 2021-06-11 2021-08-24 中国人民解放军国防科技大学 Confrontation simulation deduction method based on deep reinforcement learning
CN113894780A (en) * 2021-09-27 2022-01-07 中国科学院自动化研究所 Multi-robot cooperative countermeasure method and device, electronic equipment and storage medium
CN114254722A (en) * 2021-11-17 2022-03-29 中国人民解放军军事科学院国防科技创新研究院 Game countermeasure oriented multi-intelligent model fusion method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN118887029A (en) * 2024-10-09 2024-11-01 中国科学院自动化研究所 A social simulation method, device and equipment based on large model intelligent agent
CN119647512A (en) * 2024-10-31 2025-03-18 中国科学院自动化研究所 Decision-making method, device, equipment and medium based on multi-agent

Similar Documents

Publication Publication Date Title
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
Narasimhan et al. Grounding language for transfer in deep reinforcement learning
CN112052948B (en) Network model compression method and device, storage medium and electronic equipment
Noothigattu et al. Interpretable multi-objective reinforcement learning through policy orchestration
CN115185294B (en) QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method
CN114510012B (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN113230650B (en) Data processing method and device and computer readable storage medium
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN112163671A (en) New energy scene generation method and system
Laversanne-Finot et al. Intrinsically motivated exploration of learned goal spaces
Yang et al. Social learning with actor–critic for dynamic grasping of underwater robots via digital twins
CN116523076A (en) A multi-directional curriculum reinforcement learning method and device for multi-agent decision-making
CN116362349A (en) A reinforcement learning method and device based on an environment dynamic model
CN112044082B (en) Information detection method and device and computer readable storage medium
CN120079113A (en) An intelligent war game decision-making method based on sample transfer
CN119514614A (en) A drone capture resource allocation method based on adversarial generative imitation learning in pursuit scenarios
CN119005287A (en) Man-machine reinforcement learning method based on multidimensional human feedback fusion
Mediratta et al. A study of generalization in offline reinforcement learning
CN112933605B (en) Virtual object control and model training method and device and computer equipment
CN111443806B (en) Interactive task control method and device, electronic equipment and storage medium
CN116579231A (en) An Environmental Modeling Method Based on Reinforcement Learning
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
Tanskanen et al. Modeling Risky Choices in Unknown Environments
CN120134328B (en) A robot imitation learning method, control method, device, equipment and medium
CN118045360B (en) Wargame agent training method, prediction method and corresponding system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210302