
CN116306903A - A Robust Adversarial Training Framework for Multi-Agent Reinforcement Learning Energy Systems - Google Patents

A Robust Adversarial Training Framework for Multi-Agent Reinforcement Learning Energy Systems

Info

Publication number
CN116306903A
CN116306903A
Authority
CN
China
Prior art keywords
agent
strategy
attack
training
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211516697.9A
Other languages
Chinese (zh)
Other versions
CN116306903B (en)
Inventor
陈永辉
刘轩驿
林彤
王战
李隆锋
陈双照
朱凌风
翁洪康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zheneng Digital Technology Co ltd
Zhejiang Zheneng Yueqing Power Generation Co ltd
Original Assignee
Zhejiang Zheneng Digital Technology Co ltd
Zhejiang Zheneng Yueqing Power Generation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zheneng Digital Technology Co ltd, Zhejiang Zheneng Yueqing Power Generation Co ltd filed Critical Zhejiang Zheneng Digital Technology Co ltd
Priority to CN202211516697.9A priority Critical patent/CN116306903B/en
Publication of CN116306903A publication Critical patent/CN116306903A/en
Application granted granted Critical
Publication of CN116306903B publication Critical patent/CN116306903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a robust adversarial training framework for a multi-agent reinforcement learning energy system, which comprises: constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game; fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances; and fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training. The beneficial effects of the invention are as follows: the invention models the adversarial attack as an attacking opponent based on single-agent reinforcement learning, which learns the strongest attack strategy subject to attack constraints. Mathematically, the problem is formulated as an adversarial Markov game, and the performance of the integrated energy management system based on multi-agent reinforcement learning is improved by robust adversarial training.

Description

Robust adversarial training framework for a multi-agent reinforcement learning energy system
Technical Field
The invention relates to the field of power system security defense, and in particular to a robust adversarial training framework for a multi-agent reinforcement learning energy system.
Background
With socioeconomic development and growing energy demand, power systems are undergoing a fundamental transition in planning and operation from fossil fuels to clean energy. Against the background of the rapidly developing energy Internet, an integrated energy system that couples and coordinates multiple energy carriers such as electricity, gas, heat and cooling can realize multi-energy complementarity, promote the consumption of renewable energy, improve energy utilization efficiency and relieve supply-demand imbalance. Compared with a traditional power system, the energy flows of an integrated energy system are more complex, and its operation and regulation involve more complex load demands, supply devices and operating modes. The high coupling of energy demand, supply and storage raises the complexity of the system's operating modes and dynamic characteristics, aggravates uncertainty on both the source and load sides, increases the variables and dimensions of the mathematical models used for system simulation, and reduces the safety and stability margin, so that traditional integrated energy management methods based on mechanistic mathematical models can hardly meet the requirements of online assessment and real-time control. Therefore, data-driven integrated energy management methods with multi-agent reinforcement learning at their core have been developed. With the integration of information and communication technologies, the security and vulnerability problems of integrated energy management systems based on multi-agent reinforcement learning have become more prominent. The communication network of an integrated energy management system, including the supervisory control and data acquisition network, smart meters and other devices, is vulnerable to malicious cyber attacks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a robust adversarial training framework for a multi-agent reinforcement learning energy system. The invention enhances the resistance of an integrated energy management system based on multi-agent reinforcement learning to adversarial attacks through robust adversarial training. First, an adversarial agent is constructed whose goal is to launch adversarial attacks that cause the worst-case performance of the control system, and the system is modeled as an adversarial partially observable stochastic game; the adversarial agent is then trained to learn an optimal deterministic adversarial strategy that generates bounded perturbations; finally, robust adversarial training is applied to the compromised multi-agent reinforcement learning integrated energy management system to enhance the robustness of the model.
In a first aspect, a robust adversarial training framework for a multi-agent reinforcement learning energy system is provided, comprising:
step 1, constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
step 2, fixing a pre-trained victim multi-agent strategy, and training an optimal deterministic adversarial strategy to generate bounded disturbances;
step 3, fixing the optimal adversarial attack strategy, and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
Preferably, step 1 includes:
step 1.1, expressing the integrated energy management system based on multi-agent reinforcement learning as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$;

at time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$.
Step 1.2, introducing an opponent agent into the comprehensive energy management system, and modeling the system as a random game problem observable by an antagonism part by generating the worst performance of a model caused by the strongest antagonism attack:
<N,S,A adv ,{A i } i∈N ,P,{R i } i∈N ,R adv ,γ,{O i } i∈N ,Z>
where N is the number of victim agents, S is the environmental state, A adv And R is adv An attacker's action space and rewards function, respectively; a is that i Is the action space of the A-th victim agent, { A i } i∈N Is a joint action space defined as a=a 1 ×…×A N ;P:S×A adv XA X S → delta (S) is a given action at any time t
Figure BDA0003972139850000028
And A adv Lower slave state s t To the next time state s t+1 State transition probabilities of (2);
Figure BDA0003972139850000029
Is the ith agent slave(s) t ,a t ) To the next time state s t+1 Timely feeding back rewards; gamma is the discount factor; o (O) i Is the observation space of the ith agent, and the joint observation space is { O } i } i∈N Defined as o=o 1 ×…×O N The method comprises the steps of carrying out a first treatment on the surface of the Z is S×A → delta (O) is the joint observation O at any time t t E O in arbitrary action a t Under state s t Is a function of the observation probability of (a).
Preferably, step 2 includes:
step 2.1, fixing the pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$, where $\theta_i$ denotes the model parameters of each agent's strategy, and training an adversarial agent strategy $u_\phi$, where $\phi$ is the policy parameter of the attacking agent, to simulate an adversarial attack and threaten one of the agents; the generated attack is:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance; the input of the compromised agent $j$ is expressed as:

$\tilde{o}_t^j = o_t^j + \delta_t$

the victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the multi-agent integrated energy management system after being attacked;

step 2.2, fixing the victim multi-agent system strategy $\pi_\theta$ and defining the attacker's reward function as $R_{adv} = -\sum_i R_i$; its objective function is then:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$ is the victim team's cumulative reward; the attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$.
Preferably, in step 3, the optimal attacker strategy $u_{\phi^*}$ trained in step 2.2 is fixed, where $\phi^*$ is the parameter of the optimal attack strategy; attack vectors are generated by interacting with the environment using this strategy, and the robustness of the victim strategy under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
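Taken together, steps 2 and 3 can be read as alternating optimization of a single zero-sum objective between the attacker and the victim team. The following minimax form is a summary inferred from the definitions above, not an equation stated in the original:

$$\theta^{*} = \arg\max_{\theta}\,\min_{\phi}\, J(\theta,\phi), \qquad J(\theta,\phi) = \sum_{i\in N} R_i,$$

where step 2 updates $\phi$ with $\theta$ frozen, and step 3 updates $\theta$ with $\phi = \phi^{*}$ frozen.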
In a second aspect, a robust adversarial training apparatus for a multi-agent reinforcement learning energy system is provided, configured to execute the robust adversarial training framework for a multi-agent reinforcement learning energy system according to the first aspect, comprising:
a construction module for constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
a first fixing module for fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances;
a second fixing module for fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
The beneficial effects of the invention are as follows: the invention designs a robust adversarial training framework for a multi-agent reinforcement learning energy system to cope with potential adversarial attacks. The adversarial attack is modeled as an attacking opponent based on single-agent reinforcement learning, which learns the strongest attack strategy subject to attack constraints. Mathematically, the problem is formulated as an adversarial Markov game, and the performance of the integrated energy management system based on multi-agent reinforcement learning is improved by robust adversarial training.
Drawings
FIG. 1 is a flow chart of the robust adversarial training framework for a multi-agent reinforcement learning energy system;
FIG. 2 is a schematic structural diagram of the robust adversarial training framework for a multi-agent reinforcement learning energy system.
Detailed Description
The invention is further described below with reference to examples. The following examples are presented only to aid in the understanding of the invention. It should be noted that it will be apparent to those skilled in the art that modifications can be made to the present invention without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
In order to ensure the stable, reliable and efficient operation of the whole integrated energy management system based on multi-agent reinforcement learning, and to improve its robustness against malicious cyber attacks, the invention provides a robust adversarial training framework for a multi-agent reinforcement learning energy system. It enhances resilience through adversarial training and is of great significance for realizing the stable and safe operation of a community integrated energy management system.
In the following, an experiment based on a community-level integrated energy management system comprising nine buildings is taken as an example to describe how robustness enhancement of a multi-agent reinforcement learning integrated energy management system is implemented.
As shown in FIG. 1, the robust adversarial training framework for a multi-agent reinforcement learning energy system of the present invention comprises the following steps:
(1) The integrated energy management system based on multi-agent reinforcement learning is expressed as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$. At time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$.

This process iterates continuously, so that for each agent $i$ a trajectory of observations, actions and rewards is obtained:

$\tau_i = (o_0^i, a_0^i, r_0^i, o_1^i, a_1^i, r_1^i, \dots)$

The purpose of agent $i$ is to obtain a policy $\pi_i$ that maximizes the cumulative discounted return, as shown in the following formula:

$J_i = \mathbb{E}_{\pi_i, \pi_{-i}}\big[\textstyle\sum_t \gamma^t r_t^i\big]$

where $-i$ denotes all agents in the set $N$ except agent $i$. In this cooperative setting, the integrated energy management system based on multi-agent reinforcement learning aims to optimize the agents' strategy parameters $\theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ to maximize the cumulative team reward $J$:

$\theta^* = \arg\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}\big[\textstyle\sum_t \gamma^t \textstyle\sum_{i\in N} r_t^i\big]$
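As a minimal illustration of this interaction loop, the following Python sketch rolls out one episode of the partially observable stochastic game. The environment and policy interfaces (a reset and step returning per-agent observations and rewards, and an act method on each policy) are assumptions made for illustration, not part of the patented method:

    def rollout(env, policies, gamma=0.99, max_steps=24):
        """Roll out one episode and return the discounted team return J."""
        obs = env.reset()                            # one local observation o_t^i per agent
        team_return, discount = 0.0, 1.0
        for _ in range(max_steps):
            # each agent i acts on its own local observation through pi_i(.|o_t^i)
            actions = [pi.act(o) for pi, o in zip(policies, obs)]
            obs, rewards, done = env.step(actions)   # s_{t+1} ~ P(.|s_t, a_t), r_t^i fed back
            team_return += discount * sum(rewards)   # accumulate the team reward sum_i r_t^i
            discount *= gamma
            if done:
                break
        return team_return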
(2) As shown in FIG. 2, an adversary agent is introduced into the integrated energy management system based on multi-agent reinforcement learning, the aim being to model this system as an adversarial partially observable stochastic game in which the strongest adversarial attack induces the worst-case performance of the model:

$\langle N, S, A_{adv}, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, R_{adv}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of victim agents and $S$ is the environment state; $A_{adv}$ and $R_{adv}$ are the attacker's action space and reward function, respectively; $A_i$ is the action space of the $i$-th victim agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A_{adv} \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ and any adversarial action in $A_{adv}$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$. Note that $N$, $S$, $\{A_i\}_{i\in N}$, $\gamma$, $\{O_i\}_{i\in N}$ and $Z$ are consistent with the definitions of the partially observable stochastic game above, but $P$ and $\{R_i\}_{i\in N}$ are influenced by $A_{adv}$.
(3) The pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$ is fixed ($\theta_i$ denotes the model parameters of each agent's strategy), and an adversarial agent strategy $u_\phi$ is trained ($\phi$ is the policy parameter of the attacking agent) to simulate an adversarial attack and threaten one of the agents; the generated attack is:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance. The input of the compromised agent $j$ is then:

$\tilde{o}_t^j = o_t^j + \delta_t$

The victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the attacked multi-agent reinforcement learning integrated energy management system. If the adversarial disturbance stays within the physical constraints on its characteristics and amplitude range, such as steadily increasing inflexible energy demand and energy storage within capacity, the defense mechanism cannot detect it. The adversarial disturbance is therefore limited to $B(o^j)$, and in this way the vulnerabilities of the integrated energy management system based on multi-agent reinforcement learning are discovered.
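The bounded observation attack of this step can be sketched as follows, assuming the disturbance bound B(o^j) is an l-infinity box of radius eps around the true observation; the adversary network u_phi, the bound shape and all names are illustrative assumptions rather than details given in the original:

    import numpy as np

    def attack_observation(u_phi, obs_j, eps=0.05):
        """Perturb victim agent j's observation within the disturbance bound B(o^j)."""
        delta = np.asarray(u_phi(obs_j))        # delta_t = u_phi(o_t^j)
        delta = np.clip(delta, -eps, eps)       # project onto B(o^j), assumed an l-inf box
        return np.asarray(obs_j) + delta        # compromised input o~_t^j = o_t^j + delta_t

The victim team then acts on the perturbed joint observation, e.g. obs[j] = attack_observation(u_phi, obs[j]) before the joint action is computed; because the perturbation stays inside B(o^j), the compromised observation remains physically plausible and hard to detect.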
(4) The victim multi-agent system strategy $\pi_\theta$ is fixed and the attacker's reward function is defined as $R_{adv} = -\sum_i R_i$; its objective function is then:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$. The attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$, where $\phi^*$ is the parameter of the optimal attack strategy. Compared with a random attack strategy that generates random noise, its attack effect on the multi-agent integrated energy management system is as follows:
TABLE 1
(Table 1 is reproduced as an image in the original document; it compares the effect of the optimal learned attack and of a random-noise attack on the load-profile metrics discussed below.)
The cumulative ramp rate, average daily peak and maximum peak are metrics describing the demand-load profile of the multi-agent integrated energy management system. As can be seen from the table, the optimal attack increases the model's cumulative ramp rate, average daily peak and maximum peak by 38.61%, 8.77% and 16.42% respectively, making the load demand curve of the multi-agent integrated energy management system much more volatile; its attack effect is stronger than that of the random attack, and it more fully exposes the vulnerabilities of the integrated energy management system.
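The interactive training of the attacker in this step can be sketched as single-agent reinforcement learning against the frozen victim policies, with reward R_adv = -sum_i R_i. The learner interface (observe/update) is an assumption, any standard algorithm such as an actor-critic method could implement it, and attack_observation is the sketch given above:

    def train_attacker(env, victim_policies, attacker, victim_id, episodes=1000, eps=0.05):
        """Learn u_phi by minimizing the victim team's return; theta stays frozen."""
        for _ in range(episodes):
            obs = env.reset()
            done = False
            while not done:
                clean = obs[victim_id]
                obs[victim_id] = attack_observation(attacker.policy, clean, eps)
                actions = [pi.act(o) for pi, o in zip(victim_policies, obs)]
                obs, rewards, done = env.step(actions)
                attacker.observe(clean, -sum(rewards))  # R_adv = -sum_i R_i as the attacker's reward
            attacker.update()                           # one RL update step on phi
        return attacker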
(5) The optimal attacker strategy $u_{\phi^*}$ obtained by training is fixed and used to generate attack vectors against a given victim agent, and the robustness of the multi-agent integrated energy management system under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
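Robust adversarial training under the fixed optimal attacker can then be sketched as follows; the MARL trainer interface is again an illustrative assumption, and attack_observation is the earlier sketch:

    def robust_train(env, marl_trainer, u_phi_star, victim_id, episodes=1000, eps=0.05):
        """Maximize J(theta, phi*) while the optimal attacker u_phi* stays fixed."""
        for _ in range(episodes):
            obs = env.reset()
            done = False
            while not done:
                obs[victim_id] = attack_observation(u_phi_star, obs[victim_id], eps)
                actions = marl_trainer.act(obs)        # joint action under attack
                obs, rewards, done = env.step(actions)
                marl_trainer.observe(rewards)          # learn from adversarial experience
            marl_trainer.update()                      # one MARL update step on theta
        return marl_trainer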
At the same time, a random attack strategy that injects random noise is also used for comparison training. The performance under adversarial attack of the multi-agent integrated energy management system trained in the different modes is as follows:
TABLE 2
(Table 2 is reproduced as an image in the original document; it compares the performance under attack of models trained without adversarial training, with random-attack training, and with optimal-attack adversarial training.)
As can be seen from the table, compared with no adversarial training, adversarial training against the optimal attack reduces the model's cumulative ramp rate, average daily peak and maximum peak by 13.24%, 4.78% and 6.96% respectively, flattening the load demand curve of the multi-agent integrated energy management system; the system maintains good performance even under attack, and its robustness to adversarial attacks is thus improved. By contrast, the model trained against random attacks does not maintain good performance.
In summary, the invention introduces an attacking opponent based on single-agent reinforcement learning that interacts with the multi-agent reinforcement learning integrated energy management system to generate the strongest attack, realized by perturbing the observations of a given victim agent; the trained optimal adversarial attack strategy is then fixed, and the victim multi-agent reinforcement learning integrated energy management system is adversarially trained, learning from adversarial experience to enhance its resilience to adversarial attacks and thereby produce a robust control strategy.

Claims (5)

1. A robust adversarial training framework for a multi-agent reinforcement learning energy system, comprising:
step 1, constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
step 2, fixing a pre-trained victim multi-agent strategy, and training an optimal deterministic adversarial strategy to generate bounded disturbances;
step 3, fixing the optimal adversarial attack strategy, and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
2. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 1, wherein step 1 comprises:
step 1.1, expressing the integrated energy management system based on multi-agent reinforcement learning as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$;
at time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$;
step 1.2, introducing an adversary agent into the integrated energy management system and modeling the system as an adversarial partially observable stochastic game, in which the strongest adversarial attack induces the worst-case performance of the model:

$\langle N, S, A_{adv}, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, R_{adv}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of victim agents and $S$ is the environment state; $A_{adv}$ and $R_{adv}$ are the attacker's action space and reward function, respectively; $A_i$ is the action space of the $i$-th victim agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A_{adv} \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ and any adversarial action in $A_{adv}$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$.
3. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 2, wherein step 2 comprises:
step 2.1, fixing the pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$, where $\theta_i$ denotes the model parameters of each agent's strategy, and training an adversarial agent strategy $u_\phi$, where $\phi$ is the policy parameter of the attacking agent, to simulate an adversarial attack and threaten one of the agents, the generated attack being:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance; the input of the compromised agent $j$ is expressed as:

$\tilde{o}_t^j = o_t^j + \delta_t$

the victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the multi-agent integrated energy management system after being attacked;
step 2.2, fixing the victim multi-agent system strategy $\pi_\theta$ and defining the attacker's reward function as $R_{adv} = -\sum_i R_i$, its objective function then being:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$; the attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$.
4. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 3, wherein in step 3 the optimal attacker strategy $u_{\phi^*}$ trained in step 2.2 is fixed, where $\phi^*$ is the parameter of the optimal attack strategy; attack vectors are generated by interacting with the environment using this strategy, and the robustness of the victim strategy under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
5. A robust adversarial training apparatus for a multi-agent reinforcement learning energy system, configured to execute the robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 1, comprising:
a construction module for constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
a first fixing module for fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances;
a second fixing module for fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
CN202211516697.9A 2022-11-30 2022-11-30 Robust adversarial training framework for a multi-agent reinforcement learning energy system Active CN116306903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211516697.9A CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211516697.9A CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Publications (2)

Publication Number Publication Date
CN116306903A (en) 2023-06-23
CN116306903B CN116306903B (en) 2025-11-28

Family

ID=86785697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211516697.9A Active CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Country Status (1)

Country Link
CN (1) CN116306903B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118485282A (en) * 2024-07-15 2024-08-13 华北电力大学 Electric vehicle charging scheduling method and system based on robust reinforcement learning
CN119151235A (en) * 2024-11-11 2024-12-17 四川大学 Source-charge double-side energy storage collaborative scheduling method based on multiple agents

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070087756A1 (en) * 2005-10-04 2007-04-19 Hoffberg Steven M Multifactorial optimization system and method
WO2013176784A1 (en) * 2012-05-24 2013-11-28 University Of Southern California Optimal strategies in security games
WO2016065055A1 (en) * 2014-10-21 2016-04-28 Ask Y, Llc Platooning control via accurate synchronization
CN107888412A (en) * 2016-11-08 2018-04-06 清华大学 Multi-agent network finite time contains control method and device
CN108377238A (en) * 2018-02-01 2018-08-07 国网江苏省电力有限公司苏州供电分公司 Information network security of power system policy learning device and method based on Attack Defence
WO2021068638A1 (en) * 2019-10-12 2021-04-15 中国海洋大学 Interactive intenstive learning method that combines tamer framework and facial expression feedback
US20210166123A1 (en) * 2019-11-29 2021-06-03 NavInfo Europe B.V. Method for training a robust deep neural network model
NL2025214B1 (en) * 2019-11-29 2021-08-31 Navinfo Europe B V A method for training a robust deep neural network model
CN111461226A (en) * 2020-04-01 2020-07-28 深圳前海微众银行股份有限公司 Adversarial sample generation method, device, terminal and readable storage medium
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN113031554A (en) * 2021-03-12 2021-06-25 西北工业大学 A fixed-time tracking consistency control method for second-order multi-agent systems
CN113282100A (en) * 2021-04-28 2021-08-20 南京大学 Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN113485313A (en) * 2021-06-25 2021-10-08 杭州玳数科技有限公司 Anti-interference method and device for automatic driving vehicle
CN113822318A (en) * 2021-06-29 2021-12-21 腾讯科技(深圳)有限公司 Adversarial training method, device, computer equipment and storage medium of neural network
CN114358141A (en) * 2021-12-14 2022-04-15 中国运载火箭技术研究院 A multi-agent reinforcement learning method for multi-combat unit collaborative decision-making
CN114638339A (en) * 2022-03-10 2022-06-17 中国人民解放军空军工程大学 Intelligent agent task allocation method based on deep reinforcement learning
CN114925850A (en) * 2022-05-11 2022-08-19 华东师范大学 Deep reinforcement learning confrontation defense method for disturbance reward
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115392432A (en) * 2022-07-21 2022-11-25 华东师范大学 Extensible multi-agent reinforcement learning method in cooperation environment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHAOWEI XIAO ET AL.: "Characterizing Attacks on Deep Reinforcement Learning", ICLR 2019, 31 December 2019 (2019-12-31), pages 1 - 20 *
LERREL PINTO ET AL.: "Robust Adversarial Reinforcement Learning", arXiv, 8 March 2017 (2017-03-08), pages 1 - 10 *
NESHAT ELHAMI FARD ET AL.: "Adversarial Attacks on Heterogeneous Multi-Agent Deep Reinforcement Learning System with Time-Delayed Data Transmission", Journal of Sensor and Actuator Networks, vol. 11, no. 3, 9 August 2022 (2022-08-09), pages 1 - 25 *
XINLEI PAN ET AL.: "Characterizing Attacks on Deep Reinforcement Learning", AAMAS '22: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, 9 May 2022 (2022-05-09), pages 1010 - 1018 *
景栋盛 ET AL.: "A power information network defense strategy learning algorithm based on Q-learning with optimal initial values", Computer and Modernization, no. 11, 15 November 2018 (2018-11-15), pages 18 - 22 *
林彤 ET AL.: "Research on automatic detection of fault information in microcomputer relay protection systems", Electronic Design Engineering, vol. 28, no. 16, 18 August 2020 (2020-08-18), pages 87 - 91 *
王赛男: "Research on robust deep learning and its applications in the field of information security", Intelligent Computer and Applications, vol. 9, no. 06, 1 November 2019 (2019-11-01), pages 111 - 117 *


Also Published As

Publication number Publication date
CN116306903B (en) 2025-11-28

Similar Documents

Publication Publication Date Title
He et al. Three-stage Stackelberg game enabled clustered federated learning in heterogeneous UAV swarms
Zhao et al. Modified cuckoo search algorithm to solve economic power dispatch optimization problems
Cai et al. Chaotic ant swarm optimization to economic dispatch
Feng et al. Robust federated deep reinforcement learning for optimal control in multiple virtual power plants with electric vehicles
CN112862281A (en) Method, device, medium and electronic equipment for constructing scheduling model of comprehensive energy system
CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system
CN101908172B (en) A kind of power market hybrid simulation method adopting multiple intelligent agent algorithms
CN111275174A (en) A Game-Oriented Radar Countermeasure Strategy Generation Method
CN106712075A (en) Peaking strategy optimization method considering safety constraints of wind power integration system
Yang et al. Distributed optimal dispatch of virtual power plant based on ELM transformation.
CN115293052A (en) Power system active power flow online optimization control method, storage medium and device
CN117441168A (en) Methods and apparatus for adversarial attacks in deep reinforcement learning
Niknam et al. New self‐adaptive bat‐inspired algorithm for unit commitment problem
Zhang et al. An improved symbiosis particle swarm optimization for solving economic load dispatch problem
CN113837654B (en) Multi-objective-oriented smart grid hierarchical scheduling method
Hassan et al. Optimal power flow analysis considering renewable energy resources uncertainty based on an improved wild horse optimizer
Ahmadian et al. Price restricted optimal bidding model using derated sensitivity factors by considering risk concept
CN120124859A (en) A dynamic game comprehensive evaluation method for system combat capability based on deep learning
CN117190405A (en) An energy-saving optimization control method for dehumidification unit system based on reinforcement learning
Chen et al. A multi-factor evolutionary algorithm for solving the multi-tasking robust optimization problem on networked systems
Zhi‐gang et al. Robust DED based on bad scenario set considering wind, EV and battery switching station
Lakshminarasimman et al. Water wave optimization algorithm for solving multi-area economic dispatch problem
Zheng et al. A hybrid invasive weed optimization algorithm for the economic load dispatch problem in power systems
Jing et al. An open-ended learning framework for opponent modeling
Yadav Hybridization of particle swarm optimization with differential evolution for solving combined economic emission dispatch model for smart grid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant