
CN116306903A - A Robust Adversarial Training Framework for Multi-Agent Reinforcement Learning Energy Systems - Google Patents

A Robust Adversarial Training Framework for Multi-Agent Reinforcement Learning Energy Systems

Info

Publication number
CN116306903A
CN116306903A
Authority
CN
China
Prior art keywords
agent
strategy
attack
training
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211516697.9A
Other languages
Chinese (zh)
Other versions
CN116306903B (en)
Inventor
陈永辉
刘轩驿
林彤
王战
李隆锋
陈双照
朱凌风
翁洪康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zheneng Digital Technology Co ltd
Zhejiang Zheneng Yueqing Power Generation Co ltd
Original Assignee
Zhejiang Zheneng Digital Technology Co ltd
Zhejiang Zheneng Yueqing Power Generation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zheneng Digital Technology Co ltd, Zhejiang Zheneng Yueqing Power Generation Co ltd filed Critical Zhejiang Zheneng Digital Technology Co ltd
Priority to CN202211516697.9A priority Critical patent/CN116306903B/en
Publication of CN116306903A publication Critical patent/CN116306903A/en
Application granted granted Critical
Publication of CN116306903B publication Critical patent/CN116306903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a robust adversarial training framework for a multi-agent reinforcement learning energy system, which comprises: constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game; fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances; and fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training. The beneficial effects of the invention are as follows: the invention models the adversarial attack as an attacking opponent based on single-agent reinforcement learning, which learns the strongest attack strategy subject to attack constraints. Mathematically, the problem is formulated as an adversarial Markov game, and the performance of the integrated energy management system based on multi-agent reinforcement learning is improved by robust adversarial training.

Description

Robust adversarial training framework for a multi-agent reinforcement learning energy system
Technical Field
The invention relates to the field of power system security defense, and in particular to a robust adversarial training framework for a multi-agent reinforcement learning energy system.
Background
With socioeconomic development and growing energy demand, power systems are undergoing a fundamental transition in planning and operation from fossil fuels to clean energy. Against the background of the rapidly developing energy Internet, an integrated energy system that couples and coordinates multiple energy carriers such as electricity, gas, heat and cooling can realize multi-energy complementarity, promote the consumption of renewable energy, improve energy utilization efficiency and relieve supply-demand imbalance. Compared with a traditional power system, the energy flows of an integrated energy system are more complex, and its operation and regulation involve more complex load demands, supply devices and operating modes. The high coupling of energy demand, supply and storage raises the complexity of the system's operating modes and dynamic characteristics, aggravates uncertainty on both the source and load sides, increases the variables and dimensions of the mathematical models used for system simulation, and reduces the safety and stability margin, so that traditional integrated energy management methods based on mechanistic mathematical models can hardly meet the requirements of online assessment and real-time control. Therefore, data-driven integrated energy management methods with multi-agent reinforcement learning at their core have been developed. With the integration of information and communication technologies, the security and vulnerability problems of integrated energy management systems based on multi-agent reinforcement learning have become more prominent. The communication network of an integrated energy management system, including the supervisory control and data acquisition network, smart meters and other devices, is vulnerable to malicious cyber attacks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a robust adversarial training framework for a multi-agent reinforcement learning energy system. The invention enhances the resistance of an integrated energy management system based on multi-agent reinforcement learning to adversarial attacks through robust adversarial training. First, an adversarial agent is constructed whose goal is to launch adversarial attacks that cause the worst-case performance of the control system, and the system is modeled as an adversarial partially observable stochastic game; the adversarial agent is then trained to learn an optimal deterministic adversarial strategy that generates bounded perturbations; finally, robust adversarial training is applied to the compromised multi-agent reinforcement learning integrated energy management system to enhance the robustness of the model.
In a first aspect, a robust adversarial training framework for a multi-agent reinforcement learning energy system is provided, comprising:
step 1, constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
step 2, fixing a pre-trained victim multi-agent strategy, and training an optimal deterministic adversarial strategy to generate bounded disturbances;
step 3, fixing the optimal adversarial attack strategy, and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
Preferably, step 1 includes:
step 1.1, expressing the integrated energy management system based on multi-agent reinforcement learning as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$;

at time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$.
Step 1.2, introducing an opponent agent into the comprehensive energy management system, and modeling the system as a random game problem observable by an antagonism part by generating the worst performance of a model caused by the strongest antagonism attack:
<N,S,A adv ,{A i } i∈N ,P,{R i } i∈N ,R adv ,γ,{O i } i∈N ,Z>
where N is the number of victim agents, S is the environmental state, A adv And R is adv An attacker's action space and rewards function, respectively; a is that i Is the action space of the A-th victim agent, { A i } i∈N Is a joint action space defined as a=a 1 ×…×A N ;P:S×A adv XA X S → delta (S) is a given action at any time t
Figure BDA0003972139850000028
And A adv Lower slave state s t To the next time state s t+1 State transition probabilities of (2);
Figure BDA0003972139850000029
Is the ith agent slave(s) t ,a t ) To the next time state s t+1 Timely feeding back rewards; gamma is the discount factor; o (O) i Is the observation space of the ith agent, and the joint observation space is { O } i } i∈N Defined as o=o 1 ×…×O N The method comprises the steps of carrying out a first treatment on the surface of the Z is S×A → delta (O) is the joint observation O at any time t t E O in arbitrary action a t Under state s t Is a function of the observation probability of (a).
Preferably, step 2 includes:
step 2.1, fixing the pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$, where $\theta_i$ denotes the model parameters of each agent's strategy, and training an adversarial agent strategy $u_\phi$, where $\phi$ is the policy parameter of the attacking agent, to simulate an adversarial attack and threaten one of the agents; the generated attack is:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance; the input of the compromised agent $j$ is expressed as:

$\tilde{o}_t^j = o_t^j + \delta_t$

the victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the multi-agent integrated energy management system after being attacked;

step 2.2, fixing the victim multi-agent system strategy $\pi_\theta$ and defining the attacker's reward function as $R_{adv} = -\sum_i R_i$; its objective function is then:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$ is the victim team's cumulative reward; the attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$.
Preferably, in step 3, the optimal attacker strategy $u_{\phi^*}$ trained in step 2.2 is fixed, where $\phi^*$ is the parameter of the optimal attack strategy; attack vectors are generated by interacting with the environment using this strategy, and the robustness of the victim strategy under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
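Taken together, steps 2 and 3 can be read as alternating optimization of a single zero-sum objective between the attacker and the victim team. The following minimax form is a summary inferred from the definitions above, not an equation stated in the original:

$$\theta^{*} = \arg\max_{\theta}\,\min_{\phi}\, J(\theta,\phi), \qquad J(\theta,\phi) = \sum_{i\in N} R_i,$$

where step 2 updates $\phi$ with $\theta$ frozen, and step 3 updates $\theta$ with $\phi = \phi^{*}$ frozen.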
In a second aspect, a robust adversarial training apparatus for a multi-agent reinforcement learning energy system is provided, configured to execute the robust adversarial training framework for a multi-agent reinforcement learning energy system according to the first aspect, comprising:
a construction module for constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
a first fixing module for fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances;
a second fixing module for fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
The beneficial effects of the invention are as follows: the invention designs a robust adversarial training framework for a multi-agent reinforcement learning energy system to cope with potential adversarial attacks. The adversarial attack is modeled as an attacking opponent based on single-agent reinforcement learning, which learns the strongest attack strategy subject to attack constraints. Mathematically, the problem is formulated as an adversarial Markov game, and the performance of the integrated energy management system based on multi-agent reinforcement learning is improved by robust adversarial training.
Drawings
FIG. 1 is a flow chart of the robust adversarial training framework for a multi-agent reinforcement learning energy system;
FIG. 2 is a schematic structural diagram of the robust adversarial training framework for a multi-agent reinforcement learning energy system.
Detailed Description
The invention is further described below with reference to examples. The following examples are presented only to aid in the understanding of the invention. It should be noted that it will be apparent to those skilled in the art that modifications can be made to the present invention without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
In order to ensure the stable, reliable and efficient operation of the whole integrated energy management system based on multi-agent reinforcement learning, and to improve its robustness against malicious cyber attacks, the invention provides a robust adversarial training framework for a multi-agent reinforcement learning energy system. It enhances resilience through adversarial training and is of great significance for realizing the stable and safe operation of a community integrated energy management system.
In the following, an experiment based on a community-level integrated energy management system comprising nine buildings is taken as an example to describe how robustness enhancement of a multi-agent reinforcement learning integrated energy management system is implemented.
As shown in FIG. 1, the robust adversarial training framework for a multi-agent reinforcement learning energy system of the present invention comprises the following steps:
(1) The integrated energy management system based on multi-agent reinforcement learning is expressed as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$. At time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$.

This process iterates continuously, so that for each agent $i$ a trajectory of observations, actions and rewards is obtained:

$\tau_i = (o_0^i, a_0^i, r_0^i, o_1^i, a_1^i, r_1^i, \dots)$

The purpose of agent $i$ is to obtain a policy $\pi_i$ that maximizes the cumulative discounted return, as shown in the following formula:

$J_i = \mathbb{E}_{\pi_i, \pi_{-i}}\big[\textstyle\sum_t \gamma^t r_t^i\big]$

where $-i$ denotes all agents in the set $N$ except agent $i$. In this cooperative setting, the integrated energy management system based on multi-agent reinforcement learning aims to optimize the agents' strategy parameters $\theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ to maximize the cumulative team reward $J$:

$\theta^* = \arg\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}\big[\textstyle\sum_t \gamma^t \textstyle\sum_{i\in N} r_t^i\big]$
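As a minimal illustration of this interaction loop, the following Python sketch rolls out one episode of the partially observable stochastic game. The environment and policy interfaces (a reset and step returning per-agent observations and rewards, and an act method on each policy) are assumptions made for illustration, not part of the patented method:

    def rollout(env, policies, gamma=0.99, max_steps=24):
        """Roll out one episode and return the discounted team return J."""
        obs = env.reset()                            # one local observation o_t^i per agent
        team_return, discount = 0.0, 1.0
        for _ in range(max_steps):
            # each agent i acts on its own local observation through pi_i(.|o_t^i)
            actions = [pi.act(o) for pi, o in zip(policies, obs)]
            obs, rewards, done = env.step(actions)   # s_{t+1} ~ P(.|s_t, a_t), r_t^i fed back
            team_return += discount * sum(rewards)   # accumulate the team reward sum_i r_t^i
            discount *= gamma
            if done:
                break
        return team_return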
(2) As shown in FIG. 2, an adversary agent is introduced into the integrated energy management system based on multi-agent reinforcement learning, the aim being to model this system as an adversarial partially observable stochastic game in which the strongest adversarial attack induces the worst-case performance of the model:

$\langle N, S, A_{adv}, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, R_{adv}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of victim agents and $S$ is the environment state; $A_{adv}$ and $R_{adv}$ are the attacker's action space and reward function, respectively; $A_i$ is the action space of the $i$-th victim agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A_{adv} \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ and any adversarial action in $A_{adv}$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$. Note that $N$, $S$, $\{A_i\}_{i\in N}$, $\gamma$, $\{O_i\}_{i\in N}$ and $Z$ are consistent with the definitions of the partially observable stochastic game above, but $P$ and $\{R_i\}_{i\in N}$ are influenced by $A_{adv}$.
(3) The pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$ is fixed ($\theta_i$ denotes the model parameters of each agent's strategy), and an adversarial agent strategy $u_\phi$ is trained ($\phi$ is the policy parameter of the attacking agent) to simulate an adversarial attack and threaten one of the agents; the generated attack is:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance. The input of the compromised agent $j$ is then:

$\tilde{o}_t^j = o_t^j + \delta_t$

The victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the attacked multi-agent reinforcement learning integrated energy management system. If the adversarial disturbance stays within the physical constraints on its characteristics and amplitude range, such as steadily increasing inflexible energy demand and energy storage within capacity, the defense mechanism cannot detect it. The adversarial disturbance is therefore limited to $B(o^j)$, and in this way the vulnerabilities of the integrated energy management system based on multi-agent reinforcement learning are discovered.
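The bounded observation attack of this step can be sketched as follows, assuming the disturbance bound B(o^j) is an l-infinity box of radius eps around the true observation; the adversary network u_phi, the bound shape and all names are illustrative assumptions rather than details given in the original:

    import numpy as np

    def attack_observation(u_phi, obs_j, eps=0.05):
        """Perturb victim agent j's observation within the disturbance bound B(o^j)."""
        delta = np.asarray(u_phi(obs_j))        # delta_t = u_phi(o_t^j)
        delta = np.clip(delta, -eps, eps)       # project onto B(o^j), assumed an l-inf box
        return np.asarray(obs_j) + delta        # compromised input o~_t^j = o_t^j + delta_t

The victim team then acts on the perturbed joint observation, e.g. obs[j] = attack_observation(u_phi, obs[j]) before the joint action is computed; because the perturbation stays inside B(o^j), the compromised observation remains physically plausible and hard to detect.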
(4) The victim multi-agent system strategy $\pi_\theta$ is fixed and the attacker's reward function is defined as $R_{adv} = -\sum_i R_i$; its objective function is then:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$. The attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$, where $\phi^*$ is the parameter of the optimal attack strategy. Compared with a random attack strategy that generates random noise, its attack effect on the multi-agent integrated energy management system is as follows:
TABLE 1
(Table 1 is reproduced as an image in the original document; it compares the effect of the optimal learned attack and of a random-noise attack on the load-profile metrics discussed below.)
The cumulative ramp rate, average daily peak and maximum peak are metrics describing the demand-load profile of the multi-agent integrated energy management system. As can be seen from the table, the optimal attack increases the model's cumulative ramp rate, average daily peak and maximum peak by 38.61%, 8.77% and 16.42% respectively, making the load demand curve of the multi-agent integrated energy management system much more volatile; its attack effect is stronger than that of the random attack, and it more fully exposes the vulnerabilities of the integrated energy management system.
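The interactive training of the attacker in this step can be sketched as single-agent reinforcement learning against the frozen victim policies, with reward R_adv = -sum_i R_i. The learner interface (observe/update) is an assumption, any standard algorithm such as an actor-critic method could implement it, and attack_observation is the sketch given above:

    def train_attacker(env, victim_policies, attacker, victim_id, episodes=1000, eps=0.05):
        """Learn u_phi by minimizing the victim team's return; theta stays frozen."""
        for _ in range(episodes):
            obs = env.reset()
            done = False
            while not done:
                clean = obs[victim_id]
                obs[victim_id] = attack_observation(attacker.policy, clean, eps)
                actions = [pi.act(o) for pi, o in zip(victim_policies, obs)]
                obs, rewards, done = env.step(actions)
                attacker.observe(clean, -sum(rewards))  # R_adv = -sum_i R_i as the attacker's reward
            attacker.update()                           # one RL update step on phi
        return attacker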
(5) The optimal attacker strategy $u_{\phi^*}$ obtained by training is fixed and used to generate attack vectors against a given victim agent, and the robustness of the multi-agent integrated energy management system under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
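Robust adversarial training under the fixed optimal attacker can then be sketched as follows; the MARL trainer interface is again an illustrative assumption, and attack_observation is the earlier sketch:

    def robust_train(env, marl_trainer, u_phi_star, victim_id, episodes=1000, eps=0.05):
        """Maximize J(theta, phi*) while the optimal attacker u_phi* stays fixed."""
        for _ in range(episodes):
            obs = env.reset()
            done = False
            while not done:
                obs[victim_id] = attack_observation(u_phi_star, obs[victim_id], eps)
                actions = marl_trainer.act(obs)        # joint action under attack
                obs, rewards, done = env.step(actions)
                marl_trainer.observe(rewards)          # learn from adversarial experience
            marl_trainer.update()                      # one MARL update step on theta
        return marl_trainer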
At the same time, a random attack strategy that injects random noise is also used for comparison training. The performance under adversarial attack of the multi-agent integrated energy management system trained in the different modes is as follows:
TABLE 2
(Table 2 is reproduced as an image in the original document; it compares the performance under attack of models trained without adversarial training, with random-attack training, and with optimal-attack adversarial training.)
As can be seen from the table, compared with no adversarial training, adversarial training against the optimal attack reduces the model's cumulative ramp rate, average daily peak and maximum peak by 13.24%, 4.78% and 6.96% respectively, flattening the load demand curve of the multi-agent integrated energy management system; the system maintains good performance even under attack, and its robustness to adversarial attacks is thus improved. By contrast, the model trained against random attacks does not maintain good performance.
In summary, the invention introduces an attacking opponent based on single-agent reinforcement learning that interacts with the multi-agent reinforcement learning integrated energy management system to generate the strongest attack, realized by perturbing the observations of a given victim agent; the trained optimal adversarial attack strategy is then fixed, and the victim multi-agent reinforcement learning integrated energy management system is adversarially trained, learning from adversarial experience to enhance its resilience to adversarial attacks and thereby produce a robust control strategy.

Claims (5)

1. A robust adversarial training framework for a multi-agent reinforcement learning energy system, comprising:
step 1, constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
step 2, fixing a pre-trained victim multi-agent strategy, and training an optimal deterministic adversarial strategy to generate bounded disturbances;
step 3, fixing the optimal adversarial attack strategy, and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
2. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 1, wherein step 1 comprises:
step 1.1, expressing the integrated energy management system based on multi-agent reinforcement learning as a partially observable stochastic game, in which each agent controls one building and the cumulative reward of the whole team is maximized by optimizing the strategies of all agents:

$\langle N, S, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of agents and $S$ is the environment state; $A_i$ is the action space of the $i$-th agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$;
at time $t$, each agent $i$ selects an action $a_t^i \in A_i$ through its policy $\pi_i(\cdot \mid o_t^i)$ based on its observation $o_t^i \in O_i$; the environment then moves to the next state according to the state transition probability $P$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; each agent $i$ receives a reward $r_t^i$ and a new local observation $o_{t+1}^i$;
step 1.2, introducing an adversary agent into the integrated energy management system and modeling the system as an adversarial partially observable stochastic game, in which the strongest adversarial attack induces the worst-case performance of the model:

$\langle N, S, A_{adv}, \{A_i\}_{i\in N}, P, \{R_i\}_{i\in N}, R_{adv}, \gamma, \{O_i\}_{i\in N}, Z\rangle$

where $N$ is the number of victim agents and $S$ is the environment state; $A_{adv}$ and $R_{adv}$ are the attacker's action space and reward function, respectively; $A_i$ is the action space of the $i$-th victim agent, and $\{A_i\}_{i\in N}$ defines the joint action space $A = A_1 \times \cdots \times A_N$; $P: S \times A_{adv} \times A \to \Delta(S)$ is the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ given any joint action $a_t \in A$ and any adversarial action in $A_{adv}$ at time $t$; $R_i: S \times A \times S \to \mathbb{R}$ is the reward fed back to the $i$-th agent for the transition from $(s_t, a_t)$ to the next state $s_{t+1}$; $\gamma$ is the discount factor; $O_i$ is the observation space of the $i$-th agent, and $\{O_i\}_{i\in N}$ defines the joint observation space $O = O_1 \times \cdots \times O_N$; $Z: S \times A \to \Delta(O)$ is the probability of the joint observation $o_t \in O$ given any action $a_t$ and state $s_t$ at time $t$.
3. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 2, wherein step 2 comprises:
step 2.1, fixing the pre-trained normal victim multi-agent system strategy $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_N}\}$, where $\theta_i$ denotes the model parameters of each agent's strategy, and training an adversarial agent strategy $u_\phi$, where $\phi$ is the policy parameter of the attacking agent, to simulate an adversarial attack and threaten one of the agents, the generated attack being:

$\delta_t = u_\phi(o_t^j), \quad \delta_t \in B(o^j)$

where $\delta_t$ is the attack vector generated for a particular agent's observation, $o_t^j$ is the observation of the agent under attack, and $B(o^j)$ is the boundary constraint on the disturbance; the input of the compromised agent $j$ is expressed as:

$\tilde{o}_t^j = o_t^j + \delta_t$

the victim policy makes decisions based on the disturbed observations:

$\tilde{a}_t \sim \pi_\theta(\cdot \mid \tilde{o}_t)$

where $\tilde{a}_t$ is the action taken by the multi-agent integrated energy management system after being attacked;
step 2.2, fixing the victim multi-agent system strategy $\pi_\theta$ and defining the attacker's reward function as $R_{adv} = -\sum_i R_i$, its objective function then being:

$\phi^* = \arg\min_\phi J(\theta, \phi)$

where $J(\theta, \phi) = \sum_i R_i$; the attacking agent is trained interactively with the multi-agent integrated energy management system to generate the optimal attack strategy $u_{\phi^*}$.
4. The robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 3, wherein in step 3 the optimal attacker strategy $u_{\phi^*}$ trained in step 2.2 is fixed, where $\phi^*$ is the parameter of the optimal attack strategy; attack vectors are generated by interacting with the environment using this strategy, and the robustness of the victim strategy under the optimal attacker is improved through adversarial training, with the objective function:

$\theta^* = \arg\max_\theta J(\theta, \phi^*)$

where $J(\theta, \phi^*) = \sum_i R_i$.
5. A robust adversarial training apparatus for a multi-agent reinforcement learning energy system, configured to execute the robust adversarial training framework for a multi-agent reinforcement learning energy system of claim 1, comprising:
a construction module for constructing an adversarial agent to generate adversarial attacks and modeling the system as an adversarial partially observable stochastic game;
a first fixing module for fixing the pre-trained victim multi-agent strategy and training an optimal deterministic adversarial strategy to generate bounded disturbances;
a second fixing module for fixing the optimal adversarial attack strategy and improving the robustness of the victim strategy under the optimal attacker through adversarial training.
CN202211516697.9A 2022-11-30 2022-11-30 Robust adversarial training framework for a multi-agent reinforcement learning energy system Active CN116306903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211516697.9A CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211516697.9A CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Publications (2)

Publication Number Publication Date
CN116306903A (en) 2023-06-23
CN116306903B CN116306903B (en) 2025-11-28

Family

ID=86785697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211516697.9A Active CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system

Country Status (1)

Country Link
CN (1) CN116306903B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118485282A (en) * 2024-07-15 2024-08-13 华北电力大学 Electric vehicle charging scheduling method and system based on robust reinforcement learning
CN119151235A (en) * 2024-11-11 2024-12-17 四川大学 Source-charge double-side energy storage collaborative scheduling method based on multiple agents

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070087756A1 (en) * 2005-10-04 2007-04-19 Hoffberg Steven M Multifactorial optimization system and method
WO2013176784A1 (en) * 2012-05-24 2013-11-28 University Of Southern California Optimal strategies in security games
WO2016065055A1 (en) * 2014-10-21 2016-04-28 Ask Y, Llc Platooning control via accurate synchronization
CN107888412A (en) * 2016-11-08 2018-04-06 清华大学 Multi-agent network finite time contains control method and device
CN108377238A (en) * 2018-02-01 2018-08-07 国网江苏省电力有限公司苏州供电分公司 Information network security of power system policy learning device and method based on Attack Defence
WO2021068638A1 (en) * 2019-10-12 2021-04-15 中国海洋大学 Interactive intenstive learning method that combines tamer framework and facial expression feedback
US20210166123A1 (en) * 2019-11-29 2021-06-03 NavInfo Europe B.V. Method for training a robust deep neural network model
NL2025214B1 (en) * 2019-11-29 2021-08-31 Navinfo Europe B V A method for training a robust deep neural network model
CN111461226A (en) * 2020-04-01 2020-07-28 深圳前海微众银行股份有限公司 Adversarial sample generation method, device, terminal and readable storage medium
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN113031554A (en) * 2021-03-12 2021-06-25 西北工业大学 A fixed-time tracking consistency control method for second-order multi-agent systems
CN113282100A (en) * 2021-04-28 2021-08-20 南京大学 Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN113485313A (en) * 2021-06-25 2021-10-08 杭州玳数科技有限公司 Anti-interference method and device for automatic driving vehicle
CN113822318A (en) * 2021-06-29 2021-12-21 腾讯科技(深圳)有限公司 Adversarial training method, device, computer equipment and storage medium of neural network
CN114358141A (en) * 2021-12-14 2022-04-15 中国运载火箭技术研究院 A multi-agent reinforcement learning method for multi-combat unit collaborative decision-making
CN114638339A (en) * 2022-03-10 2022-06-17 中国人民解放军空军工程大学 Intelligent agent task allocation method based on deep reinforcement learning
CN114925850A (en) * 2022-05-11 2022-08-19 华东师范大学 Deep reinforcement learning confrontation defense method for disturbance reward
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115392432A (en) * 2022-07-21 2022-11-25 华东师范大学 Extensible multi-agent reinforcement learning method in cooperation environment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHAOWEI XIAO ET AL.: "Characterizing Attacks on Deep Reinforcement Learning", ICLR 2019, 31 December 2019 (2019-12-31), pages 1 - 20 *
LERREL PINTO ET AL.: "Robust Adversarial Reinforcement Learning", arXiv, 8 March 2017 (2017-03-08), pages 1 - 10 *
NESHAT ELHAMI FARD ET AL.: "Adversarial Attacks on Heterogeneous Multi-Agent Deep Reinforcement Learning System with Time-Delayed Data Transmission", Journal of Sensor and Actuator Networks, vol. 11, no. 3, 9 August 2022 (2022-08-09), pages 1 - 25 *
XINLEI PAN ET AL.: "Characterizing Attacks on Deep Reinforcement Learning", AAMAS '22: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, 9 May 2022 (2022-05-09), pages 1010 - 1018 *
景栋盛 ET AL.: "A power information network defense strategy learning algorithm based on Q-learning with optimal initial values", Computer and Modernization, no. 11, 15 November 2018 (2018-11-15), pages 18 - 22 *
林彤 ET AL.: "Research on automatic detection of fault information in microcomputer relay protection systems", Electronic Design Engineering, vol. 28, no. 16, 18 August 2020 (2020-08-18), pages 87 - 91 *
王赛男: "Research on robust deep learning and its applications in the field of information security", Intelligent Computer and Applications, vol. 9, no. 06, 1 November 2019 (2019-11-01), pages 111 - 117 *


Also Published As

Publication number Publication date
CN116306903B (en) 2025-11-28

Similar Documents

Publication Publication Date Title
He et al. Three-stage Stackelberg game enabled clustered federated learning in heterogeneous UAV swarms
Zhao et al. Modified cuckoo search algorithm to solve economic power dispatch optimization problems
Cai et al. Chaotic ant swarm optimization to economic dispatch
Feng et al. Robust federated deep reinforcement learning for optimal control in multiple virtual power plants with electric vehicles
CN112862281A (en) Method, device, medium and electronic equipment for constructing scheduling model of comprehensive energy system
CN116306903B (en) Robust adversarial training framework for a multi-agent reinforcement learning energy system
CN101908172B (en) A kind of power market hybrid simulation method adopting multiple intelligent agent algorithms
CN111275174A (en) A Game-Oriented Radar Countermeasure Strategy Generation Method
CN106712075A (en) Peaking strategy optimization method considering safety constraints of wind power integration system
Yang et al. Distributed optimal dispatch of virtual power plant based on ELM transformation.
CN115293052A (en) Power system active power flow online optimization control method, storage medium and device
CN117441168A (en) Methods and apparatus for adversarial attacks in deep reinforcement learning
Niknam et al. New self‐adaptive bat‐inspired algorithm for unit commitment problem
Zhang et al. An improved symbiosis particle swarm optimization for solving economic load dispatch problem
CN113837654B (en) Multi-objective-oriented smart grid hierarchical scheduling method
Hassan et al. Optimal power flow analysis considering renewable energy resources uncertainty based on an improved wild horse optimizer
Ahmadian et al. Price restricted optimal bidding model using derated sensitivity factors by considering risk concept
CN120124859A (en) A dynamic game comprehensive evaluation method for system combat capability based on deep learning
CN117190405A (en) An energy-saving optimization control method for dehumidification unit system based on reinforcement learning
Chen et al. A multi-factor evolutionary algorithm for solving the multi-tasking robust optimization problem on networked systems
Zhi‐gang et al. Robust DED based on bad scenario set considering wind, EV and battery switching station
Lakshminarasimman et al. Water wave optimization algorithm for solving multi-area economic dispatch problem
Zheng et al. A hybrid invasive weed optimization algorithm for the economic load dispatch problem in power systems
Jing et al. An open-ended learning framework for opponent modeling
Yadav Hybridization of particle swarm optimization with differential evolution for solving combined economic emission dispatch model for smart grid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant