Disclosure of Invention
In view of the above, the invention provides a fuzzy Q-learning-based duty-cycle self-adaptive adjustment method for energy-harvesting wireless sensors. By adjusting the duty cycle to match the energy harvesting rate, the method addresses the premature damage of harvesting and energy-storage equipment caused by an excessively high harvesting rate, and the energy exhaustion of wireless sensor nodes caused by an excessively low harvesting rate.
The technical scheme adopted by the embodiment of the invention for solving the technical problems is as follows:
An energy harvesting wireless sensor duty cycle self-adaptive adjustment method based on fuzzy Q-learning, comprising:
Step S1, a wireless sensor energy management model <S, A, P_sa, R> is established, wherein S is the state space set, A is the node sleep action space set, P_sa is the set of probability distributions by which each state s_i in S transfers to the next state s_i' through an action a_j, and R is the reward function, with s_i ∈ S, s_i' ∈ S, i ∈ [1, I], a_j ∈ A, j ∈ [1, M];
Step S2, a Q table is established, whose entries are recorded as q(s_ki, a_j), and the Q table is initialized, wherein the Q-learning duration is specified as T_total, the single-round duration as T_episode, and the update interval as ΔT; s_ki denotes the state obtained when s_i is input to the fuzzy inference system and fuzzy rule k is applied;
Step S3, the state vector S_t of the node at time t is obtained, S_t = [E_h(t), S_v(t)], S_t ∈ S, S_t = s_i, wherein E_h(t) represents the energy collected by the energy harvesting unit of the node at time t, and S_v(t) represents the supercapacitor voltage of the wireless sensor at time t;
Step S4, the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated through the fuzzy inference system, wherein k ∈ [1, N];
Step S5, the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy;
Step S6, the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to the environmental reward R(s_i, a_j);
Step S7, the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki;
Step S8, the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1} = [E_h(t+1), S_v(t+1)], S_{t+1} ∈ S, S_{t+1} = s_i';
Step S9, the method returns to step S4, performs the duty cycle adjustment operation with the new state vector S_{t+1} as input, and repeats steps S4-S8 until the learning time reaches the learning duration T_total.
Preferably, the probability elements in P_sa are:
Preferably, step S4 of calculating, through the fuzzy inference system, the trigger strength ω_ki with which S_t triggers fuzzy rule k includes:
Step S41, formulating N fuzzy rules and membership functions, defining E_h(t) in the state vector S_t with a triangular membership function and S_v(t) in the state vector S_t with a trapezoidal membership function, with fuzzy rule k ∈ [1, N];
Step S42, finding in S the state s_i identical to the state vector S_t, s_i = [E_h(s_i), S_v(s_i)], inputting s_i as the input variable to the fuzzy inference system, and calculating the trigger strength ω_ki of fuzzy rule k:
wherein the two terms denote, respectively, the membership value obtained by applying the membership function of fuzzy rule k to E_h(s_i) in the input variable s_i, and the membership value obtained by applying the membership function of fuzzy rule k to S_v(s_i).
Preferably, step S6 of calculating the environmental reward R(s_i, a_j) obtained when S_t executes action a_j based on the reward function R, and further updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j), includes:
Step S61, dividing the supercapacitor voltage into low, medium and high states by a threshold classification method;
Step S62, issuing real-time environmental rewards according to the state of S_v(t): when S_v(t) is in the low state,
when S_v(t) is in the medium state,
when S_v(t) is in the high state,
wherein β and θ are calculation parameters, ENO_c is the energy neutral threshold, ENO_s is the energy neutral state of the node, and ENO_s and ENO_c follow the iterative formulas:
ENO_s(t+1) = ENO_s(t) + E_neu(t)
ENO_c(t+1) = ENO_c(t) + μ × (ENO_ave(t) − ENO_c(t))
E_neu(t) = E_h(t) − E_c(t)
wherein E_c(t) represents the energy consumed by the energy consumption unit of the node at time t, ENO_ave(t) is the average of the energy neutral values over the previous round's time period, and μ is the energy neutral threshold update parameter;
Step S63, updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j):
q(s_ki, a_j) ← q(s_ki, a_j) + α · Δq(s_ki, a_j)
wherein Δq(s_ki, a_j) = R(s_i, a_j) + γ · max_a q(s'_ki, a) − q(s_ki, a_j); q(s_ki, a_j) is the Q value of action a_j executed by s_i under fuzzy rule k, q(s'_ki, a) is the Q value of an action in the next state s_i', max_a q(s'_ki, a) corresponds to the optimal action of the next state s_i', α is the learning rate and γ is the discount factor.
Preferably, the duty cycle replacement value d_c(t) in step S7 is calculated by the following formula:
Preferably, A contains 4 sleep actions of different durations, A = [a_1, a_2, a_3, a_4], wherein a_1 represents sleeping for 15 seconds, a_2 for 60 seconds, a_3 for 300 seconds, and a_4 for 900 seconds.
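The source does not reproduce the formula for d_c(t); in fuzzy inference systems the crisp output is commonly the trigger-strength-weighted average of the per-rule outputs. The sketch below applies only that generic defuzzification pattern to the selected sleep actions — the weighted-average form and the active-time parameter are assumptions, not the patent's actual formula:

```python
def duty_cycle_replacement(strengths, sleep_s, active_s=1.0):
    """Hypothetical d_c(t): the trigger strengths omega_ki weight the sleep
    durations chosen under each fuzzy rule, and the weighted mean sleep time
    is converted into a duty cycle (fraction of time awake). The active time
    active_s and the weighted-average form are illustrative assumptions."""
    mean_sleep = sum(w * s for w, s in zip(strengths, sleep_s)) / sum(strengths)
    return active_s / (active_s + mean_sleep)
```

For example, two rules firing equally with chosen actions a_2 (60 s) and a_3 (300 s) give a mean sleep of 180 s, hence a duty cycle of 1/181 with a 1 s active slot.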
Preferably, β takes the value 4, θ takes the value 2, T_episode is defined as 24 hours, and ΔT is set to 0.25 h.
Preferably, ENO_ave(t) is updated once per T_episode, and its calculation formula is:
ENO_ave(t) = 0, t ∈ T_episode
According to the technical scheme, the fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method provided by the embodiment of the invention proceeds as follows: first, the wireless sensor energy management model <S, A, P_sa, R> is established; then a Q table is established, whose entries are recorded as q(s_ki, a_j); the state vector S_t = [E_h(t), S_v(t)] of the node at time t is acquired; the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated using the fuzzy inference system; the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy; the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to R(s_i, a_j); the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki; the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1}; the duty cycle adjustment operation is then performed with S_{t+1} as input, and the foregoing steps are repeated until the learning time reaches the learning duration T_total. By adjusting the duty cycle to adapt to the energy harvesting rate during Q-learning, the method solves the problems that the harvesting and energy-storage equipment is damaged too quickly when the energy harvesting rate is too high, and that the wireless sensor node exhausts its energy when the energy harvesting rate is too low.
Detailed Description
The technical scheme and technical effects of the present invention are further elaborated below in conjunction with the drawings of the present invention.
Maintaining the wireless sensor node in an energy neutral state over a long period is an effective way to realize sustainable operation of the node. If the energy collected by the node remains greater than or equal to the energy it consumes, the node is said to be in the energy neutral state, as shown in formula (1); if, in the ideal case, the difference between the energy consumed and the energy collected by the node approaches zero, the operation is called energy neutral operation:
E_neu(t) = E_h(t) − E_c(t) (1)
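Formula (1) can be checked with a one-line helper; the function name and the example energy values are illustrative:

```python
def energy_neutral_surplus(e_h, e_c):
    """Formula (1): E_neu(t) = E_h(t) - E_c(t). A non-negative result means
    the node collected at least as much energy as it consumed, i.e. it is
    in the energy neutral state."""
    return e_h - e_c

# e.g. 5 mJ collected vs 3 mJ consumed: surplus of 2 mJ, energy neutral
surplus = energy_neutral_surplus(5.0, 3.0)
```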
As shown in fig. 1 and 2, the invention provides a fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method, which adjusts the duty cycle of the energy-harvesting wireless sensor node's operation to influence the node's energy storage rate, dynamically adjusting that rate so that the node stays in an energy neutral state. The energy-harvesting wireless sensor node mainly comprises an energy harvester, an energy store and an energy consumption unit, wherein E_h(t) represents the energy collected by the energy harvesting unit at time t, E_r(t) represents the energy remaining in the energy storage unit at time t, and E_c(t) represents the energy consumed by the energy consumption unit at time t. The specific implementation steps of the method are as follows:
Step S1, a wireless sensor energy management model <S, A, P_sa, R> is established, wherein S is the state space set, A is the node sleep action space set, P_sa is the set of probability distributions by which each state s_i in S transfers to the next state s_i' through an action a_j, and R is the reward function, with s_i ∈ S, s_i' ∈ S, i ∈ [1, I], a_j ∈ A, j ∈ [1, M];
Step S2, a Q table is established, whose entries are recorded as q(s_ki, a_j), and the Q table is initialized, wherein the Q-learning duration is specified as T_total, the single-round duration as T_episode, and the update interval as ΔT; s_ki denotes the state obtained when s_i is input to the fuzzy inference system and fuzzy rule k is applied;
Step S3, the state vector S_t of the node at time t is obtained, S_t = [E_h(t), S_v(t)], S_t ∈ S, S_t = s_i, wherein S_v(t) represents the supercapacitor voltage of the wireless sensor at time t;
Step S4, the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated using the fuzzy inference system, wherein k ∈ [1, N];
Step S5, the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy; in the current state the agent selects an action from the action space according to the ε-greedy strategy: with probability ε it selects the action with the largest Q value from the Q table, and with probability 1 − ε it randomly selects an action; the action selected by the agent affects the calculated duty cycle output;
Step S6, the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated from S_v(t) based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to the environmental reward R(s_i, a_j);
Step S7, the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki:
Step S8, the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1} = [E_h(t+1), S_v(t+1)], S_{t+1} ∈ S, S_{t+1} = s_i';
Step S9, the method returns to step S4, performs the duty cycle adjustment operation with the new state vector S_{t+1} as input, and repeats steps S4-S8 until the learning time reaches the learning duration T_total.
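The S1–S9 loop above can be condensed into a short sketch. Everything concrete here — the rule count, the mapping from each state to a single triggered rule, the reward table, and the use of the chosen sleep action as a stand-in for d_c(t) — is an illustrative assumption, not the patent's implementation:

```python
import random

ACTIONS = [15, 60, 300, 900]   # sleep durations in seconds (a_1..a_4)
N_RULES = 4                    # assumed number of fuzzy rules

# Step S2: initialize the Q table, one row per fuzzy rule.
q_table = [[0.0] * len(ACTIONS) for _ in range(N_RULES)]

def run_round(rule_sequence, rewards, alpha=0.5, gamma=0.9, epsilon=0.9):
    """One simplified round. rule_sequence[t] is the index of the fuzzy rule
    triggered by S_t (steps S3-S4 collapsed into a lookup); rewards maps
    (rule, action) to the environmental reward R(s_i, a_j) of step S6."""
    chosen_sleeps = []
    for t in range(len(rule_sequence) - 1):
        k = rule_sequence[t]
        # Step S5: epsilon-greedy; as in step S5, epsilon is the probability
        # of exploiting the largest-Q action.
        if random.random() < epsilon:
            a = q_table[k].index(max(q_table[k]))
        else:
            a = random.randrange(len(ACTIONS))
        r = rewards.get((k, a), 0.0)
        k_next = rule_sequence[t + 1]
        # Step S6: temporal-difference update of q(s_ki, a_j).
        dq = r + gamma * max(q_table[k_next]) - q_table[k][a]
        q_table[k][a] += alpha * dq
        # Steps S7-S8: the chosen sleep length stands in for d_c(t) here.
        chosen_sleeps.append(ACTIONS[a])
    return chosen_sleeps
```

Running a round with a reward on one state-action pair drives the corresponding Q entry upward, which is the convergence behaviour the method relies on.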
In the present invention, the state vectors S_t and S_{t+1} can each be matched to a state with the same element values in the state space set S; s_i' represents the next state of s_i, and since S contains a state whose element values are the same as those of S_{t+1}, S_{t+1} can be regarded as s_i'.
In an embodiment, T_episode is defined as 24 hours and ΔT is set to 0.25 h.
Step S1 defines the energy-harvesting wireless sensor energy management model as a quadruple <S, A, P_sa, R> based on a Markov decision process (MDP), wherein:
(1) S denotes the state space; the state of the node at time unit t is defined by the collected energy E_h(t) and the supercapacitor voltage S_v(t), so the state space S is expressed as:
S = [E_h(t), S_v(t)] (3)
(2) A represents the action space; the actions executed in state S_t are designed as node sleep actions of different durations, and in an embodiment the sleep actions are divided into 4 kinds, expressed as follows:
A = [a_1, a_2, a_3, a_4] (4)
For example, a_1 represents sleeping for 15 seconds, a_2 for 60 seconds, a_3 for 300 seconds, and a_4 for 900 seconds;
(3) P_sa represents the probability distribution over the states the node transitions to after taking action a_j in the current state S_t ∈ S; the probability that the node takes action a_j in state S_t and reaches s_i' is represented as
(4) R represents the reward function; in response to the rationality of the node performing action a_j in state S_t, the environment provides a reward scalar R(s_i, a_j) for evaluating the quality of the action. The core idea of the reward function is to constrain the current energy neutral state ENO_s of the node with the energy neutral threshold ENO_c.
The fuzzy inference system provides a mapping from input to output based on a set of fuzzy rules and associated fuzzy membership functions. The rule base of the fuzzy inference system generally consists of a number of preset rules, and the fuzzy inference rule R_k that correlates the state vector with the actions has the following form:
R_k: IF s is in s_k THEN the action is a_1 with q(s_ki, a_1)
or …
or the action is a_j with q(s_ki, a_j)
R_k represents the k-th rule, where q(s_ki, a_j) is the Q value of the state-action pair (s_ki, a_j) in the Q table.
The input linguistic variables of the IF conditional statement are divided into different sets: E_h(t) = {poor, fair, good} and S_v(t) = {low, medium, high}; for example, the "poor" set of E_h(t) represents weaker collected energy, and the "low" set of S_v(t) represents less remaining energy. The membership functions are responsible for fuzzifying the crisp input variables and calculating the membership degrees of the variables in the different sets, producing one membership value under each rule. Thus, the specific implementation of step S4, calculating with the fuzzy inference system the trigger strength ω_ki with which S_t triggers fuzzy rule k, includes:
Step S41, setting N fuzzy rules and membership functions, defining E_h(t) in the state vector S_t with a triangular membership function and S_v(t) in the state vector S_t with a trapezoidal membership function, with fuzzy rule k ∈ [1, N];
Step S42, finding in S the state s_i identical to the state vector S_t, s_i = [E_h(s_i), S_v(s_i)], inputting s_i as the input variable to the fuzzy inference system, and calculating the trigger strength ω_ki of fuzzy rule k:
wherein the two terms denote, respectively, the membership value obtained by applying the membership function of fuzzy rule k to E_h(s_i) in the input variable s_i, and the membership value obtained by applying the membership function of fuzzy rule k to S_v(s_i).
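Step S42's formula for ω_ki is not reproduced in the source; in fuzzy inference, a rule's trigger strength is commonly the product (or the minimum) of the membership values of its inputs. The sketch below adopts the product assumption, with illustrative breakpoints for the triangular and trapezoidal membership functions:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trap(x, a, b, c, d):
    """Trapezoidal membership function with flat top between b and c."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def trigger_strength(e_h, s_v, rule):
    """omega_ki for one rule: product of the E_h and S_v membership values.
    The product t-norm is an assumption; min() is an equally common choice."""
    mu_eh = tri(e_h, *rule["eh"])      # triangular MF for collected energy
    mu_sv = trap(s_v, *rule["sv"])     # trapezoidal MF for capacitor voltage
    return mu_eh * mu_sv

# Illustrative rule: E_h is "fair" (peak at 3 mJ), S_v is "medium" (2.2-2.6 V).
rule_k = {"eh": (1.0, 3.0, 5.0), "sv": (2.0, 2.2, 2.6, 2.8)}
```

With these breakpoints the rule fires fully (ω_ki = 1.0) at E_h = 3.0 and S_v = 2.4, and partially on the slopes of either membership function.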
E_h(t) and S_v(t) serve both as the fuzzy input variables and as the state space of the agent. After the input variables are fuzzified, the agent selects an action a_j under rule k and receives a reward R(s_i, a_j) from the environment at time t+1. The reward function gives negative rewards to actions that deviate from the energy neutral state and positive rewards to actions that meet the energy neutral requirement, so this reward mechanism helps maintain the node's energy neutral state. The specific implementation of step S6, calculating the environmental reward R(s_i, a_j) obtained when S_t executes action a_j and further updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j), includes:
Step S61, dividing the supercapacitor voltage into low, medium and high states by a threshold classification method;
Step S62, issuing real-time environmental rewards according to the state of S_v(t): when S_v(t) is in the low state,
when S_v(t) is in the medium state,
when S_v(t) is in the high state,
In the embodiment, β takes the value 4 and θ takes the value 2; ENO_c is the energy neutral threshold and ENO_s is the energy neutral state of the node, a negative ENO_s indicating that the node currently consumes more energy than it collects; ENO_s and ENO_c follow the iterative formulas:
ENO_s(t+1) = ENO_s(t) + E_neu(t) (10)
ENO_c(t+1) = ENO_c(t) + μ × (ENO_ave(t) − ENO_c(t)) (11)
wherein E_c(t) represents the energy consumed by the energy consumption unit of the node at time t, ENO_ave(t) is the average of the energy neutral values over the previous round's time period, and μ is the energy neutral threshold update parameter. After obtaining the reward, the agent reaches the next new state and updates the value of each action selected under the rules; in the new state, the fuzzy inference system fuzzifies the state vector again, the agent selects an action again, and the cycle continues until the process ends;
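Formulas (10) and (11), together with E_neu(t) from formula (1), can be sketched as a single update step; the default μ value here is illustrative, not the patent's:

```python
def eno_step(eno_s, eno_c, e_h, e_c, eno_ave, mu=0.1):
    """One iteration of formulas (10)-(11):
    ENO_s(t+1) = ENO_s(t) + E_neu(t), with E_neu(t) = E_h(t) - E_c(t);
    ENO_c(t+1) = ENO_c(t) + mu * (ENO_ave(t) - ENO_c(t))."""
    e_neu = e_h - e_c                                # energy-neutral surplus
    return eno_s + e_neu, eno_c + mu * (eno_ave - eno_c)
```

A negative first component of the result signals that the node is consuming more than it collects, which the reward function penalizes.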
Step S63, updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j):
q(s_ki, a_j) ← q(s_ki, a_j) + α · Δq(s_ki, a_j) (12)
wherein Δq(s_ki, a_j) = R(s_i, a_j) + γ · max_a q(s'_ki, a) − q(s_ki, a_j); q(s_ki, a_j) is the Q value of action a_j executed by s_i under fuzzy rule k, q(s'_ki, a) is the Q value of an action in the next state s_i', max_a q(s'_ki, a) corresponds to the optimal action of the next state s_i', α is the learning rate and γ is the discount factor.
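Formula (12) can be sketched with the standard temporal-difference increment; since the source does not reproduce the Δq formula itself, writing Δq as R + γ · max_a q(s'_ki, a) − q(s_ki, a_j) is an assumption consistent with the symbols defined above:

```python
def update_q(q_row, q_next_row, a_j, reward, alpha=0.5, gamma=0.9):
    """Formula (12): q(s_ki, a_j) <- q(s_ki, a_j) + alpha * dq, with the
    assumed increment dq = R(s_i, a_j) + gamma * max_a q(s'_ki, a)
    - q(s_ki, a_j). q_row holds the Q values of rule k; q_next_row those
    of the next state's rule."""
    dq = reward + gamma * max(q_next_row) - q_row[a_j]
    q_row[a_j] += alpha * dq
    return q_row[a_j]
```

With α = 1 and γ = 0.5, a reward of 2.0 against a next-state best value of 1.0 moves an initially zero entry to 2.5.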
In this embodiment, ENO_ave(t) is updated once per T_episode, that is, a new ENO_ave(t) is computed once at the end of each round, with the calculation formula:
ENO_ave(t) = 0, t ∈ T_episode (17)
Through the above scheme, while the energy neutral state of the node is ensured, setting the energy neutral threshold enables the node to operate sustainably.
Referring to fig. 2 together, the system structure shown in fig. 2 may operate as follows:
According to the fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method, the node duty cycle is the specific object of adjustment, so that under the duty-cycle self-adaptive strategy the node's energy neutral state floats around the energy neutral threshold, thereby adapting to the energy harvesting rate, maintaining the node in an energy neutral state, and preserving sustainable operation. The agent in reinforcement learning acts as the director of the node's decision process, deciding the action the node executes in the current state. Because the collected energy changes continuously and is difficult to predict, the agent's action judgment can be disturbed; fuzzy logic is therefore combined with reinforcement learning, and fuzzy rules are formulated to constrain the agent's action selection. Formulating fuzzy rules helps reduce the agent's trial-and-error actions and accelerates the convergence of the node's energy neutral state.
Sustainable operation of the node is realized by setting the energy neutral threshold while the energy neutral state of the node is ensured. The invention has the following two advantages: (1) during sensor operation, node energy consumption and the energy neutral threshold constraint of the energy storage unit are both considered, avoiding node death caused by poor energy neutrality of the node's energy store; (2) the fuzzy inference system is combined with reinforcement learning, providing the agent with prior knowledge during action exploration and so improving the speed at which the node converges to the energy neutral threshold.
The foregoing disclosure is illustrative of the preferred embodiments of the present invention, and is not to be construed as limiting the scope of the invention, as it is understood by those skilled in the art that all or part of the above-described embodiments may be practiced with equivalents thereof, which fall within the scope of the invention as defined by the appended claims.