Disclosure of Invention
In view of the above, the invention provides a fuzzy Q-learning-based duty-cycle self-adaptive adjustment method for energy-harvesting wireless sensors. By adjusting the duty cycle to match the energy harvesting rate, the method addresses the premature damage of harvesting and energy-storage equipment caused by an excessively high harvesting rate, and the energy exhaustion of wireless sensor nodes caused by an excessively low harvesting rate.
The technical scheme adopted by the embodiment of the invention for solving the technical problems is as follows:
An energy harvesting wireless sensor duty cycle self-adaptive adjustment method based on fuzzy Q-learning, comprising:
Step S1, a wireless sensor energy management model <S, A, P_sa, R> is established, wherein S is the state space set, A is the node sleep action space set, P_sa is the set of probability distributions by which each state s_i in S transfers to the next state s_i' through an action a_j, and R is the reward function, with s_i ∈ S, s_i' ∈ S, i ∈ [1, I], a_j ∈ A, j ∈ [1, M];
Step S2, a Q table is established, whose entries are recorded as q(s_ki, a_j), and the Q table is initialized, wherein the Q-learning duration is specified as T_total, the single-round duration as T_episode, and the update interval as ΔT; s_ki denotes the state obtained when s_i is input to the fuzzy inference system and fuzzy rule k is applied;
Step S3, the state vector S_t of the node at time t is obtained, S_t = [E_h(t), S_v(t)], S_t ∈ S, S_t = s_i, wherein E_h(t) represents the energy collected by the energy harvesting unit of the node at time t, and S_v(t) represents the supercapacitor voltage of the wireless sensor at time t;
Step S4, the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated through the fuzzy inference system, wherein k ∈ [1, N];
Step S5, the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy;
Step S6, the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to the environmental reward R(s_i, a_j);
Step S7, the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki;
Step S8, the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1} = [E_h(t+1), S_v(t+1)], S_{t+1} ∈ S, S_{t+1} = s_i';
Step S9, the method returns to step S4, performs the duty cycle adjustment operation with the new state vector S_{t+1} as input, and repeats steps S4-S8 until the learning time reaches the learning duration T_total.
Preferably, the probability elements in P_sa are:
Preferably, step S4 of calculating, through the fuzzy inference system, the trigger strength ω_ki with which S_t triggers fuzzy rule k includes:
Step S41, formulating N fuzzy rules and membership functions, defining E_h(t) in the state vector S_t with a triangular membership function and S_v(t) in the state vector S_t with a trapezoidal membership function, with fuzzy rule k ∈ [1, N];
Step S42, finding in S the state s_i identical to the state vector S_t, s_i = [E_h(s_i), S_v(s_i)], inputting s_i as the input variable to the fuzzy inference system, and calculating the trigger strength ω_ki of fuzzy rule k:
wherein the two terms denote, respectively, the membership value obtained by applying the membership function of fuzzy rule k to E_h(s_i) in the input variable s_i, and the membership value obtained by applying the membership function of fuzzy rule k to S_v(s_i).
Preferably, step S6 of calculating the environmental reward R(s_i, a_j) obtained when S_t executes action a_j based on the reward function R, and further updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j), includes:
Step S61, dividing the supercapacitor voltage into low, medium and high states by a threshold classification method;
Step S62, issuing real-time environmental rewards according to the state of S_v(t): when S_v(t) is in the low state,
when S_v(t) is in the medium state,
when S_v(t) is in the high state,
wherein β and θ are calculation parameters, ENO_c is the energy neutral threshold, ENO_s is the energy neutral state of the node, and ENO_s and ENO_c follow the iterative formulas:
ENO_s(t+1) = ENO_s(t) + E_neu(t)
ENO_c(t+1) = ENO_c(t) + μ × (ENO_ave(t) − ENO_c(t))
E_neu(t) = E_h(t) − E_c(t)
wherein E_c(t) represents the energy consumed by the energy consumption unit of the node at time t, ENO_ave(t) is the average of the energy neutral values over the previous round's time period, and μ is the energy neutral threshold update parameter;
Step S63, updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j):
q(s_ki, a_j) ← q(s_ki, a_j) + α · Δq(s_ki, a_j)
wherein Δq(s_ki, a_j) = R(s_i, a_j) + γ · max_a q(s'_ki, a) − q(s_ki, a_j); q(s_ki, a_j) is the Q value of action a_j executed by s_i under fuzzy rule k, q(s'_ki, a) is the Q value of an action in the next state s_i', max_a q(s'_ki, a) corresponds to the optimal action of the next state s_i', α is the learning rate and γ is the discount factor.
Preferably, the duty cycle replacement value d_c(t) in step S7 is calculated by the following formula:
Preferably, A contains 4 sleep actions of different durations, A = [a_1, a_2, a_3, a_4], wherein a_1 represents sleeping for 15 seconds, a_2 for 60 seconds, a_3 for 300 seconds, and a_4 for 900 seconds.
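The source does not reproduce the formula for d_c(t); in fuzzy inference systems the crisp output is commonly the trigger-strength-weighted average of the per-rule outputs. The sketch below applies only that generic defuzzification pattern to the selected sleep actions — the weighted-average form and the active-time parameter are assumptions, not the patent's actual formula:

```python
def duty_cycle_replacement(strengths, sleep_s, active_s=1.0):
    """Hypothetical d_c(t): the trigger strengths omega_ki weight the sleep
    durations chosen under each fuzzy rule, and the weighted mean sleep time
    is converted into a duty cycle (fraction of time awake). The active time
    active_s and the weighted-average form are illustrative assumptions."""
    mean_sleep = sum(w * s for w, s in zip(strengths, sleep_s)) / sum(strengths)
    return active_s / (active_s + mean_sleep)
```

For example, two rules firing equally with chosen actions a_2 (60 s) and a_3 (300 s) give a mean sleep of 180 s, hence a duty cycle of 1/181 with a 1 s active slot.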
Preferably, β takes the value 4, θ takes the value 2, T_episode is defined as 24 hours, and ΔT is set to 0.25 h.
Preferably, ENO_ave(t) is updated once per T_episode, and its calculation formula is:
ENO_ave(t) = 0, t ∈ T_episode
According to the technical scheme, the fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method provided by the embodiment of the invention proceeds as follows: first, the wireless sensor energy management model <S, A, P_sa, R> is established; then a Q table is established, whose entries are recorded as q(s_ki, a_j); the state vector S_t = [E_h(t), S_v(t)] of the node at time t is acquired; the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated using the fuzzy inference system; the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy; the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to R(s_i, a_j); the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki; the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1}; the duty cycle adjustment operation is then performed with S_{t+1} as input, and the foregoing steps are repeated until the learning time reaches the learning duration T_total. By adjusting the duty cycle to adapt to the energy harvesting rate during Q-learning, the method solves the problems that the harvesting and energy-storage equipment is damaged too quickly when the energy harvesting rate is too high, and that the wireless sensor node exhausts its energy when the energy harvesting rate is too low.
Detailed Description
The technical scheme and technical effects of the present invention are further elaborated below in conjunction with the drawings of the present invention.
Maintaining the wireless sensor node in an energy neutral state over a long period is an effective way to realize sustainable operation of the node. If the energy collected by the node remains greater than or equal to the energy it consumes, the node is said to be in the energy neutral state, as shown in formula (1); if, in the ideal case, the difference between the energy consumed and the energy collected by the node approaches zero, the operation is called energy neutral operation:
E_neu(t) = E_h(t) − E_c(t) (1)
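Formula (1) can be checked with a one-line helper; the function name and the example energy values are illustrative:

```python
def energy_neutral_surplus(e_h, e_c):
    """Formula (1): E_neu(t) = E_h(t) - E_c(t). A non-negative result means
    the node collected at least as much energy as it consumed, i.e. it is
    in the energy neutral state."""
    return e_h - e_c

# e.g. 5 mJ collected vs 3 mJ consumed: surplus of 2 mJ, energy neutral
surplus = energy_neutral_surplus(5.0, 3.0)
```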
As shown in fig. 1 and 2, the invention provides a fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method, which adjusts the duty cycle of the energy-harvesting wireless sensor node's operation to influence the node's energy storage rate, dynamically adjusting that rate so that the node stays in an energy neutral state. The energy-harvesting wireless sensor node mainly comprises an energy harvester, an energy store and an energy consumption unit, wherein E_h(t) represents the energy collected by the energy harvesting unit at time t, E_r(t) represents the energy remaining in the energy storage unit at time t, and E_c(t) represents the energy consumed by the energy consumption unit at time t. The specific implementation steps of the method are as follows:
Step S1, a wireless sensor energy management model <S, A, P_sa, R> is established, wherein S is the state space set, A is the node sleep action space set, P_sa is the set of probability distributions by which each state s_i in S transfers to the next state s_i' through an action a_j, and R is the reward function, with s_i ∈ S, s_i' ∈ S, i ∈ [1, I], a_j ∈ A, j ∈ [1, M];
Step S2, a Q table is established, whose entries are recorded as q(s_ki, a_j), and the Q table is initialized, wherein the Q-learning duration is specified as T_total, the single-round duration as T_episode, and the update interval as ΔT; s_ki denotes the state obtained when s_i is input to the fuzzy inference system and fuzzy rule k is applied;
Step S3, the state vector S_t of the node at time t is obtained, S_t = [E_h(t), S_v(t)], S_t ∈ S, S_t = s_i, wherein S_v(t) represents the supercapacitor voltage of the wireless sensor at time t;
Step S4, the trigger strength ω_ki with which S_t triggers fuzzy rule k is calculated using the fuzzy inference system, wherein k ∈ [1, N];
Step S5, the action a_j activated under fuzzy rule k is selected from A according to an ε-greedy strategy; in the current state the agent selects an action from the action space according to the ε-greedy strategy: with probability ε it selects the action with the largest Q value from the Q table, and with probability 1 − ε it randomly selects an action; the action selected by the agent affects the calculated duty cycle output;
Step S6, the environmental reward R(s_i, a_j) obtained when S_t executes action a_j is calculated from S_v(t) based on the reward function R, and q(s_ki, a_j) in the Q table is updated according to the environmental reward R(s_i, a_j);
Step S7, the duty cycle replacement value d_c(t) of the node at time t is calculated based on a_j and the trigger strength ω_ki:
Step S8, the duty cycle of the node is modified to d_c(t) and time t+1 is entered, obtaining a new state vector S_{t+1} = [E_h(t+1), S_v(t+1)], S_{t+1} ∈ S, S_{t+1} = s_i';
Step S9, the method returns to step S4, performs the duty cycle adjustment operation with the new state vector S_{t+1} as input, and repeats steps S4-S8 until the learning time reaches the learning duration T_total.
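The S1–S9 loop above can be condensed into a short sketch. Everything concrete here — the rule count, the mapping from each state to a single triggered rule, the reward table, and the use of the chosen sleep action as a stand-in for d_c(t) — is an illustrative assumption, not the patent's implementation:

```python
import random

ACTIONS = [15, 60, 300, 900]   # sleep durations in seconds (a_1..a_4)
N_RULES = 4                    # assumed number of fuzzy rules

# Step S2: initialize the Q table, one row per fuzzy rule.
q_table = [[0.0] * len(ACTIONS) for _ in range(N_RULES)]

def run_round(rule_sequence, rewards, alpha=0.5, gamma=0.9, epsilon=0.9):
    """One simplified round. rule_sequence[t] is the index of the fuzzy rule
    triggered by S_t (steps S3-S4 collapsed into a lookup); rewards maps
    (rule, action) to the environmental reward R(s_i, a_j) of step S6."""
    chosen_sleeps = []
    for t in range(len(rule_sequence) - 1):
        k = rule_sequence[t]
        # Step S5: epsilon-greedy; as in step S5, epsilon is the probability
        # of exploiting the largest-Q action.
        if random.random() < epsilon:
            a = q_table[k].index(max(q_table[k]))
        else:
            a = random.randrange(len(ACTIONS))
        r = rewards.get((k, a), 0.0)
        k_next = rule_sequence[t + 1]
        # Step S6: temporal-difference update of q(s_ki, a_j).
        dq = r + gamma * max(q_table[k_next]) - q_table[k][a]
        q_table[k][a] += alpha * dq
        # Steps S7-S8: the chosen sleep length stands in for d_c(t) here.
        chosen_sleeps.append(ACTIONS[a])
    return chosen_sleeps
```

Running a round with a reward on one state-action pair drives the corresponding Q entry upward, which is the convergence behaviour the method relies on.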
In the present invention, the state vectors S_t and S_{t+1} can each be matched to a state with the same element values in the state space set S; s_i' represents the next state of s_i, and since S contains a state whose element values are the same as those of S_{t+1}, S_{t+1} can be regarded as s_i'.
In an embodiment, T_episode is defined as 24 hours and ΔT is set to 0.25 h.
Step S1 defines the energy-harvesting wireless sensor energy management model as a quadruple <S, A, P_sa, R> based on a Markov decision process (MDP), wherein:
(1) S denotes the state space; the state of the node at time unit t is defined by the collected energy E_h(t) and the supercapacitor voltage S_v(t), so the state space S is expressed as:
S = [E_h(t), S_v(t)] (3)
(2) A represents the action space; the actions executed in state S_t are designed as node sleep actions of different durations, and in an embodiment the sleep actions are divided into 4 kinds, expressed as follows:
A = [a_1, a_2, a_3, a_4] (4)
For example, a_1 represents sleeping for 15 seconds, a_2 for 60 seconds, a_3 for 300 seconds, and a_4 for 900 seconds;
(3) P_sa represents the probability distribution over the states the node transitions to after taking action a_j in the current state S_t ∈ S; the probability that the node takes action a_j in state S_t and reaches s_i' is represented as
(4) R represents the reward function; in response to the rationality of the node performing action a_j in state S_t, the environment provides a reward scalar R(s_i, a_j) for evaluating the quality of the action. The core idea of the reward function is to constrain the current energy neutral state ENO_s of the node with the energy neutral threshold ENO_c.
The fuzzy inference system provides a mapping from input to output based on a set of fuzzy rules and associated fuzzy membership functions. The rule base of the fuzzy inference system generally consists of a number of preset rules, and the fuzzy inference rule R_k that correlates the state vector with the actions has the following form:
R_k: IF s is in s_k THEN the action is a_1 with q(s_ki, a_1)
or …
or the action is a_j with q(s_ki, a_j)
R_k represents the k-th rule, where q(s_ki, a_j) is the Q value of the state-action pair (s_ki, a_j) in the Q table.
The input linguistic variables of the IF conditional statement are divided into different sets: E_h(t) = {poor, fair, good} and S_v(t) = {low, medium, high}; for example, the "poor" set of E_h(t) represents weaker collected energy, and the "low" set of S_v(t) represents less remaining energy. The membership functions are responsible for fuzzifying the crisp input variables and calculating the membership degrees of the variables in the different sets, producing one membership value under each rule. Thus, the specific implementation of step S4, calculating with the fuzzy inference system the trigger strength ω_ki with which S_t triggers fuzzy rule k, includes:
Step S41, setting N fuzzy rules and membership functions, defining E_h(t) in the state vector S_t with a triangular membership function and S_v(t) in the state vector S_t with a trapezoidal membership function, with fuzzy rule k ∈ [1, N];
Step S42, finding in S the state s_i identical to the state vector S_t, s_i = [E_h(s_i), S_v(s_i)], inputting s_i as the input variable to the fuzzy inference system, and calculating the trigger strength ω_ki of fuzzy rule k:
wherein the two terms denote, respectively, the membership value obtained by applying the membership function of fuzzy rule k to E_h(s_i) in the input variable s_i, and the membership value obtained by applying the membership function of fuzzy rule k to S_v(s_i).
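Step S42's formula for ω_ki is not reproduced in the source; in fuzzy inference, a rule's trigger strength is commonly the product (or the minimum) of the membership values of its inputs. The sketch below adopts the product assumption, with illustrative breakpoints for the triangular and trapezoidal membership functions:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trap(x, a, b, c, d):
    """Trapezoidal membership function with flat top between b and c."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def trigger_strength(e_h, s_v, rule):
    """omega_ki for one rule: product of the E_h and S_v membership values.
    The product t-norm is an assumption; min() is an equally common choice."""
    mu_eh = tri(e_h, *rule["eh"])      # triangular MF for collected energy
    mu_sv = trap(s_v, *rule["sv"])     # trapezoidal MF for capacitor voltage
    return mu_eh * mu_sv

# Illustrative rule: E_h is "fair" (peak at 3 mJ), S_v is "medium" (2.2-2.6 V).
rule_k = {"eh": (1.0, 3.0, 5.0), "sv": (2.0, 2.2, 2.6, 2.8)}
```

With these breakpoints the rule fires fully (ω_ki = 1.0) at E_h = 3.0 and S_v = 2.4, and partially on the slopes of either membership function.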
E_h(t) and S_v(t) serve both as the fuzzy input variables and as the state space of the agent. After the input variables are fuzzified, the agent selects an action a_j under rule k and receives a reward R(s_i, a_j) from the environment at time t+1. The reward function gives negative rewards to actions that deviate from the energy neutral state and positive rewards to actions that meet the energy neutral requirement, so this reward mechanism helps maintain the node's energy neutral state. The specific implementation of step S6, calculating the environmental reward R(s_i, a_j) obtained when S_t executes action a_j and further updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j), includes:
Step S61, dividing the supercapacitor voltage into low, medium and high states by a threshold classification method;
Step S62, issuing real-time environmental rewards according to the state of S_v(t): when S_v(t) is in the low state,
when S_v(t) is in the medium state,
when S_v(t) is in the high state,
In the embodiment, β takes the value 4 and θ takes the value 2; ENO_c is the energy neutral threshold and ENO_s is the energy neutral state of the node, a negative ENO_s indicating that the node currently consumes more energy than it collects; ENO_s and ENO_c follow the iterative formulas:
ENO_s(t+1) = ENO_s(t) + E_neu(t) (10)
ENO_c(t+1) = ENO_c(t) + μ × (ENO_ave(t) − ENO_c(t)) (11)
wherein E_c(t) represents the energy consumed by the energy consumption unit of the node at time t, ENO_ave(t) is the average of the energy neutral values over the previous round's time period, and μ is the energy neutral threshold update parameter. After obtaining the reward, the agent reaches the next new state and updates the value of each action selected under the rules; in the new state, the fuzzy inference system fuzzifies the state vector again, the agent selects an action again, and the cycle continues until the process ends;
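Formulas (10) and (11), together with E_neu(t) from formula (1), can be sketched as a single update step; the default μ value here is illustrative, not the patent's:

```python
def eno_step(eno_s, eno_c, e_h, e_c, eno_ave, mu=0.1):
    """One iteration of formulas (10)-(11):
    ENO_s(t+1) = ENO_s(t) + E_neu(t), with E_neu(t) = E_h(t) - E_c(t);
    ENO_c(t+1) = ENO_c(t) + mu * (ENO_ave(t) - ENO_c(t))."""
    e_neu = e_h - e_c                                # energy-neutral surplus
    return eno_s + e_neu, eno_c + mu * (eno_ave - eno_c)
```

A negative first component of the result signals that the node is consuming more than it collects, which the reward function penalizes.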
Step S63, updating q(s_ki, a_j) in the Q table according to the environmental reward R(s_i, a_j):
q(s_ki, a_j) ← q(s_ki, a_j) + α · Δq(s_ki, a_j) (12)
wherein Δq(s_ki, a_j) = R(s_i, a_j) + γ · max_a q(s'_ki, a) − q(s_ki, a_j); q(s_ki, a_j) is the Q value of action a_j executed by s_i under fuzzy rule k, q(s'_ki, a) is the Q value of an action in the next state s_i', max_a q(s'_ki, a) corresponds to the optimal action of the next state s_i', α is the learning rate and γ is the discount factor.
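Formula (12) can be sketched with the standard temporal-difference increment; since the source does not reproduce the Δq formula itself, writing Δq as R + γ · max_a q(s'_ki, a) − q(s_ki, a_j) is an assumption consistent with the symbols defined above:

```python
def update_q(q_row, q_next_row, a_j, reward, alpha=0.5, gamma=0.9):
    """Formula (12): q(s_ki, a_j) <- q(s_ki, a_j) + alpha * dq, with the
    assumed increment dq = R(s_i, a_j) + gamma * max_a q(s'_ki, a)
    - q(s_ki, a_j). q_row holds the Q values of rule k; q_next_row those
    of the next state's rule."""
    dq = reward + gamma * max(q_next_row) - q_row[a_j]
    q_row[a_j] += alpha * dq
    return q_row[a_j]
```

With α = 1 and γ = 0.5, a reward of 2.0 against a next-state best value of 1.0 moves an initially zero entry to 2.5.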
In this embodiment, ENO_ave(t) is updated once per T_episode, that is, a new ENO_ave(t) is computed once at the end of each round, with the calculation formula:
ENO_ave(t) = 0, t ∈ T_episode (17)
Through the above scheme, while the energy neutral state of the node is ensured, setting the energy neutral threshold enables the node to operate sustainably.
Referring to fig. 2 together, the system structure shown in fig. 2 may operate as follows:
According to the fuzzy Q-learning-based energy-harvesting wireless sensor duty-cycle self-adaptive adjustment method, the node duty cycle is the specific object of adjustment, so that under the duty-cycle self-adaptive strategy the node's energy neutral state floats around the energy neutral threshold, thereby adapting to the energy harvesting rate, maintaining the node in an energy neutral state, and preserving sustainable operation. The agent in reinforcement learning acts as the director of the node's decision process, deciding the action the node executes in the current state. Because the collected energy changes continuously and is difficult to predict, the agent's action judgment can be disturbed; fuzzy logic is therefore combined with reinforcement learning, and fuzzy rules are formulated to constrain the agent's action selection. Formulating fuzzy rules helps reduce the agent's trial-and-error actions and accelerates the convergence of the node's energy neutral state.
Sustainable operation of the node is realized by setting the energy neutral threshold while the energy neutral state of the node is ensured. The invention has the following two advantages: (1) during sensor operation, node energy consumption and the energy neutral threshold constraint of the energy storage unit are both considered, avoiding node death caused by poor energy neutrality of the node's energy store; (2) the fuzzy inference system is combined with reinforcement learning, providing the agent with prior knowledge during action exploration and so improving the speed at which the node converges to the energy neutral threshold.
The foregoing disclosure is illustrative of the preferred embodiments of the present invention, and is not to be construed as limiting the scope of the invention, as it is understood by those skilled in the art that all or part of the above-described embodiments may be practiced with equivalents thereof, which fall within the scope of the invention as defined by the appended claims.