[ Summary of the Invention ]
The invention aims to provide a resource allocation method for a wireless energy-carrying downlink communication scenario oriented to the pattern division multiple access (PDMA) technology, so as to solve the problem that minimizing the total transmission power at the transmitting end, while satisfying the quality of service of each user at the receiving end, still carries a large computational complexity.
The technical scheme adopted by the invention is a resource allocation method in a wireless energy-carrying downlink communication scenario oriented to the PDMA technology, implemented according to the following steps:
step one, formulating a constrained Markov decision process:
describing the resource allocation problem in the wireless energy-carrying communication scenario oriented to the PDMA technology as a constrained Markov decision process, and converting this problem into an unconstrained Markov decision process using the Lagrangian dual method;
step two, solving the unconstrained Markov decision process of step one by a reinforcement learning method to obtain the optimal resource allocation strategy; the objective of this strategy is to minimize the total transmission power at the transmitting end while satisfying the quality of service of each user at the receiving end.
Further, the wireless energy-carrying downlink communication scenario is constructed as a system model, which specifically includes:
a base station wirelessly transmits data and energy to T users in a specific area over K subcarriers; the transmitting end adopts superposition coding, the receiving end adopts the successive interference cancellation technique, and the base station at the transmitting end and the users at the receiving end are each equipped with a single antenna; the users are randomly distributed within a circle of radius r centered at the base station.
Further, the first step specifically comprises:
1) according to the system model, defining a state space and an action space of the system:
the state space of the system is specifically as follows:
$$s = \left(\mathrm{SINR}_{k,t},\ k = 0, 1, \dots, K,\ t = 0, 1, \dots, T\right) \in \mathcal{S} \qquad (1)$$

where $\mathrm{SINR}_{k,t}$ is the signal-to-interference-plus-noise ratio when the $k$-th subcarrier is loaded to the $t$-th user, and the state set $\mathcal{S}$ is a finite set of SINR values;
the action space of the system is specifically as follows:
$$a = \left(\boldsymbol{\alpha},\ G_{\mathrm{PDMA}},\ P_{\mathrm{PDMA}}\right) \qquad (2)$$

where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_T)$ is the vector of transmission time ratios allocated to information decoding by the $T$ users, $P_{\mathrm{PDMA}}$ is the power allocation matrix, and $G_{\mathrm{PDMA}}$ is the subcarrier mapping matrix; $\boldsymbol{\alpha} \in \mathcal{A}_{\alpha}$, $G_{\mathrm{PDMA}} \in \mathcal{G}$, and $P_{\mathrm{PDMA}} \in \mathcal{P}$ denote that the vector and the matrices belong to finite sets of transmission time ratios allocated to information decoding, subcarrier mappings, and power allocations, respectively;
2) the constrained Markov decision process is detailed as follows:

$$\min_{\Pi}\ P_{\mathrm{total}} \qquad (3)$$
$$\mathrm{s.t.}\quad E_t \geq E_{\mathrm{req}},\ \forall t \qquad (4)$$
$$\qquad\quad R_t \geq R_{\mathrm{req}},\ \forall t \qquad (5)$$

where $P_{\mathrm{total}}$ is the total transmission power at the transmitting end; equations (4) and (5) represent the quality-of-service constraints of each user, i.e., the energy $E_t$ and the data rate $R_t$ received by each user are required to satisfy the minimum energy requirement $E_{\mathrm{req}}$ and the data rate requirement $R_{\mathrm{req}}$, respectively; the Markov decision process is described as adjusting the action $(\boldsymbol{\alpha}, G_{\mathrm{PDMA}}, P_{\mathrm{PDMA}})$ so as to minimize the total transmission power at the transmitting end under the constraint of satisfying the quality of service of each user;
the constrained Markov decision process can be relaxed to an unconstrained Markov decision process, i.e.:

$$\Pi^{*} = \arg\min_{\Pi}\ \max_{\lambda \succeq 0,\ \mu \succeq 0}\ L(\lambda, \mu, \Pi)$$

where $\lambda$ and $\mu$ are two sets of Lagrange multipliers, and $\Pi^{*}$ is the optimal policy; finding the optimal resource allocation strategy is thereby converted into solving for a saddle point of the function $L(\lambda, \mu, \Pi)$.
Further, in step two, the update formula of the Q value in reinforcement learning is specifically:

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \rho \left[ r_{k+1} + \gamma \max_{a'} Q(s_{k+1}, a') - Q(s_k, a_k) \right]$$

where $r_{k+1}$, $\gamma$, and $0 < \rho < 1$ are the reward obtained at time $k+1$, the reward discount coefficient, and the learning rate, respectively;

the optimal policy is expressed as follows:

$$\Pi^{*}(s) = \arg\max_{a \in \mathcal{A}} Q^{*}(s, a)$$

where $Q^{*}(s, a)$ is the Q value given by following the optimal strategy for state $s$ and action $a$.
The beneficial effects of the invention are:
1. The invention provides a resource allocation method for the wireless energy-carrying downlink communication scenario oriented to the PDMA technology. Taking the time-switching receiver as an example, the minimum total transmission power at the transmitting end is obtained by jointly optimizing the time slot ratio that the receiver allocates between energy harvesting and information decoding, the subcarrier mapping matrix, and the power allocation matrix.
2. To overcome the difficulty of solving the constrained Markov decision process directly, Lagrangian dual theory is used to convert it into an unconstrained Markov decision process. Finally, the Q-learning algorithm in reinforcement learning is applied to obtain the optimal strategy of the Markov decision process.
3. The effectiveness of the method is verified through experiments; compared with other methods, the proposed method achieves a lower total transmission power at the transmitting end.
[ Detailed Description of the Embodiments ]
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
To minimize the total transmission power at the transmitting end in the wireless energy-carrying downlink communication scenario oriented to the PDMA technology, the invention studies a resource allocation method based on a constrained Markov decision process. Specifically, the resource allocation problem in the PDMA-based wireless energy-carrying communication scenario is described as a constrained Markov decision process, and the constrained problem is converted into an unconstrained Markov decision process using Lagrangian duality theory. Finally, a Q-learning algorithm is proposed to solve for the optimal solution of the unconstrained Markov decision process. Take the time-switching receiver as an example: the power allocation matrix, the subcarrier mapping matrix, and the slot ratio allocated between information decoding and energy harvesting in the above scenario are adjusted to their optimal values, so as to minimize the total transmission power of the transmitter while satisfying the quality of service of each user.
Step one, constructing a system model: the system model is a wireless energy-carrying downlink communication system model based on the PDMA technology, consisting of a base station and a plurality of users;
The specific implementation of step one is as follows:
as shown in FIG. 1, assume that there is a base station that wirelessly transmits data and energy to T users in a particular area over K subcarriers, where
And
respectively user index and subcarrier index. In addition to this, superposition coding is employed at the transmitter and the subcarrier mapping matrix G is satisfied
PDMA∈N
K×TIn which K is
k={n|g
k,t1 (K ∈ K) and
respectively the set and number of users to which the k-th sub-carrier is mapped. The mapping matrix with 3 sub-carriers and 5 users is shown in fig. 1, where
K 11, 2, 3, 4 and | K
1And 4. In addition, the time-switched receiver is taken as an example to solve the optimizationAnd (4) resource allocation strategies. User U
tBy subcarrier H
kThe received signals are:
wherein h is
k,t=r
k,td
k -βIs through a subcarrier H
kFrom base station to user U
tOf channel gain r
k,tIs a small scale fading that satisfies the rayleigh distribution,
is large scale fading related to the distance between the base station and the user; in addition, P
k,tAnd x
k,tIs to transmit a signal through a subcarrier H
kLoaded to user U
tPower and signal of w
k,t~CN(0,σ
k 2) Is additive white gaussian noise.
The receiving end adopts the successive interference cancellation (SIC) technique, decoding in descending order of the channel-to-noise ratio $\mathrm{CNR}_{k,t} = |h_{k,t}|^2 / \sigma_k^2$; that is, on each subcarrier the CNRs of the mapped users satisfy $\mathrm{CNR}_{k,1} \geq \mathrm{CNR}_{k,2} \geq \dots$, and decoding is performed in that order. Then the normalized interference is:

$$I_{k,t} = \sum_{i \in \mathcal{K}_k,\ \mathrm{CNR}_{k,i} > \mathrm{CNR}_{k,t}} P_{k,i} \qquad (2)$$

Thus, the SINR when the $k$-th subcarrier is loaded to the $t$-th user is:

$$\mathrm{SINR}_{k,t} = \frac{P_{k,t}\, \mathrm{CNR}_{k,t}}{1 + I_{k,t}\, \mathrm{CNR}_{k,t}} \qquad (3)$$

where the power allocation is required to keep the signal decoded at each step stronger than its residual interference, which ensures that the decoding process is not interrupted. The information rate and the energy obtained by user $U_t$ over subcarrier $H_k$ are respectively:

$$R_{k,t} = B_k \log_2\left(1 + \mathrm{SINR}_{k,t}\right) \qquad (4)$$

$$E_{k,t} = \eta\, (1 - \alpha_t)\, |h_{k,t}|^2 \sum_{i \in \mathcal{K}_k} P_{k,i} \qquad (5)$$

where $\eta$ is the energy harvesting efficiency; in addition, $\alpha_t$ and $1 - \alpha_t$ are the transmission slot ratios assigned to information decoding and energy collection, respectively. It can be deduced that the information and the energy collected by each user are:

$$R_t = \alpha_t \sum_{k=1}^{K} g_{k,t} R_{k,t} \qquad (6)$$

$$E_t = \sum_{k=1}^{K} g_{k,t} E_{k,t} \qquad (7)$$
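To make the system model concrete, the following Python sketch evaluates equations (2) through (7) for a given subcarrier mapping, power allocation, and slot-ratio vector. It is a minimal illustration assuming the SIC ordering and harvesting model reconstructed above; the function name and the toy values in the usage example are illustrative, not part of the patent.

```python
import numpy as np

def rates_and_energies(G, P, alpha, h, sigma2, B, eta):
    """Per-user data rate R_t and harvested energy E_t for one frame.

    G     : (K, T) 0/1 subcarrier mapping matrix (g[k, t] = 1 if subcarrier k serves user t)
    P     : (K, T) power allocation matrix (watts loaded to user t on subcarrier k)
    alpha : (T,)   slot ratio each user assigns to information decoding
    h     : (K, T) complex channel gains h_{k,t}
    sigma2: (K,)   noise power per subcarrier
    B     : (K,)   bandwidth per subcarrier
    eta   : energy-harvesting conversion efficiency
    """
    K, T = G.shape
    cnr = np.abs(h) ** 2 / sigma2[:, None]          # CNR_{k,t}
    R, E = np.zeros(T), np.zeros(T)
    for k in range(K):
        users = np.flatnonzero(G[k])                 # set K_k
        p_sum = P[k, users].sum()                    # total superposed power
        for t in users:
            # SIC: signals of users with higher CNR remain as interference, eq. (2)
            stronger = users[cnr[k, users] > cnr[k, t]]
            interf = P[k, stronger].sum()
            sinr = P[k, t] * cnr[k, t] / (1.0 + interf * cnr[k, t])   # eq. (3)
            R[t] += alpha[t] * B[k] * np.log2(1.0 + sinr)             # eqs. (4), (6)
            E[t] += eta * (1.0 - alpha[t]) * np.abs(h[k, t])**2 * p_sum  # eqs. (5), (7)
    return R, E

# Toy usage with assumed values (K = T = 2)
rng = np.random.default_rng(0)
G = np.array([[1, 1], [1, 0]])
P = np.full((2, 2), 0.1) * G
h = (rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))) / np.sqrt(2)
R, E = rates_and_energies(G, P, np.array([0.5, 0.5]), h,
                          sigma2=np.full(2, 0.01), B=np.full(2, 1e6), eta=0.3)
```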
step two, formulation of a constraint Markov decision problem: the resource allocation problem in the wireless energy-carrying communication system is converted into a constraint Markov decision problem, and the constraint Markov decision problem is converted into the unconstrained Markov decision problem by using Lagrangian dual theory.
The specific implementation manner of the second step is as follows:
the decision maker minimizes the total power of transmission at the transmitting end while meeting the energy requirements and data rate requirements received by each user at the receiving end. The resource allocation problem with user quality of service constraints is denoted as a constrained markov decision problem, which provides a corresponding resource allocation policy for each state. Next, the state space, the action space, the targets, and the constraints of the system will be described separately.
1) State space: to characterize the energy and signal received by the user, we define the state space as:
$$s = \left(\mathrm{SINR}_{k,t},\ k = 0, 1, \dots, K,\ t = 0, 1, \dots, T\right) \in \mathcal{S} \qquad (8)$$

where the state set $\mathcal{S}$ is a finite set of SINR values.
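Tabular Q-learning requires a discrete state index, so the continuous SINR values of (8) must be quantized into the finite set $\mathcal{S}$. A minimal sketch, where the bin edges are illustrative assumptions (the patent only states that the state set is finite):

```python
import numpy as np

# Assumed quantization edges; any finite discretization of the SINR works here.
sinr_bins = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

def state_index(sinr_matrix):
    """Map the (K, T) SINR matrix of eq. (8) to one discrete state id by
    quantizing each entry and mixed-radix encoding the bin indices."""
    levels = np.digitize(sinr_matrix.ravel(), sinr_bins)  # each in 0..len(sinr_bins)
    base = len(sinr_bins) + 1
    return int(sum(int(v) * base ** i for i, v in enumerate(levels)))
```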
2) Action space: the transmitter minimizes the total transmission power by controlling the power allocation and the subcarrier mapping, and the receiver does so by controlling the ratio of time slots allocated between information decoding and energy collection. Thus, the action space is:

$$a = \left(\boldsymbol{\alpha},\ G_{\mathrm{PDMA}},\ P_{\mathrm{PDMA}}\right) \qquad (9)$$

where $\boldsymbol{\alpha}$ and $P_{\mathrm{PDMA}}$ are respectively the slot-ratio vector that all user receivers allocate to information decoding and the power allocation matrix. In addition, $\boldsymbol{\alpha} \in \mathcal{A}_{\alpha}$, $G_{\mathrm{PDMA}} \in \mathcal{G}$, and $P_{\mathrm{PDMA}} \in \mathcal{P}$ are discrete in the system, and the sets $\mathcal{A}_{\alpha}$, $\mathcal{G}$, and $\mathcal{P}$ are finite sets of slot ratios allocated to information decoding by all receivers, subcarrier mappings, and power allocations, respectively.
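Because $\boldsymbol{\alpha}$, $G_{\mathrm{PDMA}}$, and $P_{\mathrm{PDMA}}$ are drawn from finite sets, the action space of (9) can be enumerated explicitly, which is what makes tabular Q-learning applicable. A sketch under assumed discretization levels:

```python
from itertools import product
import numpy as np

# Illustrative, assumed candidate sets; the patent fixes only that they are finite.
alpha_levels = [0.2, 0.5, 0.8]             # slot ratios for information decoding
power_levels = [0.05, 0.1, 0.2]            # candidate powers in watts
mapping_candidates = [np.array([[1, 1],    # subcarrier 1 serves users 1 and 2
                                [1, 0]])]  # subcarrier 2 serves user 1 (K = T = 2)

def build_action_space(K, T):
    """Enumerate the finite action set A = {(alpha, G_PDMA, P_PDMA)} of eq. (9)."""
    actions = []
    for alpha in product(alpha_levels, repeat=T):
        for G in mapping_candidates:
            for p in product(power_levels, repeat=K * T):
                P = np.array(p).reshape(K, T) * G   # power only on mapped entries
                actions.append((np.array(alpha), G, P))
    return actions

actions = build_action_space(K=2, T=2)
print(len(actions))   # |A| grows multiplicatively with each discretized dimension
```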
3) Objective and constraints: the goal is to find the optimal strategy $\Pi$ such that the total transmission power $P_{\mathrm{total}}$ at the transmitting end is minimized; the constraints are the minimum energy and data rate requirements of each user. This resource allocation problem can be cast as a constrained Markov decision process, i.e., P1:

$$\mathrm{P1:}\quad \min_{\Pi}\ P_{\mathrm{total}} \qquad (10)$$
$$\mathrm{s.t.}\quad E_t \geq E_{\mathrm{req}},\ \forall t \in \mathcal{T} \qquad (11)$$
$$\qquad\quad R_t \geq R_{\mathrm{req}},\ \forall t \in \mathcal{T} \qquad (12)$$
$$\qquad\quad \boldsymbol{\alpha} \in \mathcal{A}_{\alpha} \qquad (13)$$
$$\qquad\quad G_{\mathrm{PDMA}} \in \mathcal{G} \qquad (14)$$
$$\qquad\quad P_{\mathrm{PDMA}} \in \mathcal{P} \qquad (15)$$

The problem is to adopt a strategy $\Pi$ that adaptively adjusts the time slot ratios that all receivers allocate to information decoding, the subcarrier mapping, and the power allocation of the transmitting end, so as to minimize the total transmission power of the transmitting end while meeting the quality-of-service constraint of each user. To solve the constrained Markov problem, Lagrangian dual theory is used to convert it into an unconstrained Markov process. The generalized Lagrangian function is introduced below:
$$L(\lambda, \mu, \Pi) = P_{\mathrm{total}} + \sum_{t=1}^{T} \lambda_t \left(E_{\mathrm{req}} - E_t\right) + \sum_{t=1}^{T} \mu_t \left(R_{\mathrm{req}} - R_t\right) \qquad (16)$$

where $\lambda = \{\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_T\}$ and $\mu = \{\mu_1, \mu_2, \mu_3, \dots, \mu_T\}$ are sets of Lagrange multipliers, whose elements correspond respectively to the constraints on the energy harvested and the data rate received by each user. Considering $L(\lambda, \mu, \Pi)$ as a function of $\lambda$ and $\mu$, $\theta(\Pi)$ is defined as:

$$\theta(\Pi) = \max_{\lambda \succeq 0,\ \mu \succeq 0} L(\lambda, \mu, \Pi) \qquad (17)$$

The value of $\theta(\Pi)$ is $P_{\mathrm{total}}$ when the receiver satisfies the user quality-of-service constraints. When a constraint is not satisfied, letting the corresponding Lagrange multipliers tend to positive infinity drives the value of $\theta(\Pi)$ to infinity, so that the function has no solution. Thus, the $\theta(\Pi)$ function can be described as:

$$\theta(\Pi) = \begin{cases} P_{\mathrm{total}}, & E_t \geq E_{\mathrm{req}} \text{ and } R_t \geq R_{\mathrm{req}},\ \forall t \\ +\infty, & \text{otherwise} \end{cases} \qquad (18)$$
Thus, the constrained Markov decision process can be relaxed to an unconstrained Markov decision process, i.e., P2:

$$\mathrm{P2:}\quad \min_{\Pi} \theta(\Pi) = \min_{\Pi}\ \max_{\lambda \succeq 0,\ \mu \succeq 0} L(\lambda, \mu, \Pi) \qquad (19)$$

where $\lambda \succeq 0$ and $\mu \succeq 0$, and $\Pi^{*}$ is the optimal strategy. Thus, finding the optimal resource allocation strategy translates into solving for a saddle point of the function $L(\Pi, \lambda, \mu)$; namely, $(\Pi^{*}, \lambda^{*}, \mu^{*})$ should satisfy:

$$L(\Pi, \lambda^{*}, \mu^{*}) \geq L(\Pi^{*}, \lambda^{*}, \mu^{*}) \geq L(\Pi^{*}, \lambda, \mu) \qquad (21)$$
since the channel transition probability is difficult to estimate, a Q learning algorithm is proposed to solve the optimal solution of the unconstrained markov decision process.
Step three, acquiring the optimal resource allocation strategy, based on the constrained Markov decision process, in the PDMA-oriented wireless energy-carrying communication scenario by using the reinforcement learning method.
The specific implementation manner of the third step is as follows:
the reinforcement learning algorithm is widely applied to learning of an optimal control strategy of a model-free MDP problem, which means that environmental models such as channel conversion do not need to be considered. Therefore, the Q learning algorithm in reinforcement learning is proposed to solve the above resource allocation problem. The Q value calculation formula, the update formula, the epsilon-greedy strategy and the reward function of the Q learning algorithm will be given below respectively. For policy π, the Q value calculation formula when action a is performed at state s is:
Qπ(s,a)=Eπ[rk+1+γQπ(sk+1,ak+1)|sk=s,ak=a](22)
where $r_{k+1}$ and $\gamma$ are the reward obtained at time $k+1$ and the reward discount coefficient, respectively. In the Q-learning algorithm, the update formula of the Q value is:

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \rho \left[ r_{k+1} + \gamma \max_{a'} Q(s_{k+1}, a') - Q(s_k, a_k) \right] \qquad (23)$$

where $0 < \rho < 1$ is the learning rate. In state $s$, action $a$ is chosen according to the ε-greedy strategy, in order to balance exploitation against exploration. Thus, the selection of actions follows:

$$a = \begin{cases} \arg\max_{a' \in \mathcal{A}} Q(s, a'), & \text{with probability } 1 - \varepsilon \\ a \sim U(\mathcal{A}), & \text{with probability } \varepsilon \end{cases} \qquad (24)$$

where the $\sim U(\mathcal{A})$ operation chooses an action uniformly at random from the action space. To directly reflect the target value, the reward function is defined as:

$$r = -L(\lambda, \mu, \Pi) \qquad (25)$$
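The ε-greedy rule (24) and the reward (25) can be sketched as follows; taking the reward as the negative Lagrangian (so that maximizing the return minimizes total power plus weighted constraint violations) follows the reconstruction above:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """eq. (24): exploit argmax_a Q(s, a) with probability 1 - eps,
    otherwise explore an action drawn uniformly from A."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # a ~ U(A)
    return int(np.argmax(Q[s]))

def reward(P_total, E, R, lam, mu, E_req, R_req):
    """eq. (25): r = -L(lambda, mu, Pi), i.e. the negative Lagrangian of eq. (16)."""
    return -(P_total + np.sum(lam * (E_req - E)) + np.sum(mu * (R_req - R)))
```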
in addition, the lagrange multiplier is calculated and updated using a secondary gradient method. After the Q value is calculated and updated, the control strategy for the problem (P2) can be described as:
where Q is*(s, a) is the Q value given following the optimal strategy for state s and action a.
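Putting the pieces together, a tabular Q-learning loop with subgradient multiplier updates might look as follows. The environment object `env`, its `reset`/`step` interface, its `E_req`/`R_req` attributes, and the multiplier step size are illustrative assumptions, not part of the patent; `env` stands for any simulator realizing the channel model of FIG. 1:

```python
import numpy as np

def q_learning(env, n_states, n_actions, T, k_max=2500, rho=0.6, gamma=0.8,
               eps=0.1, step=0.01):
    """Tabular Q-learning for problem (P2) with subgradient multiplier updates."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    lam = np.zeros(T)            # multipliers for the energy constraints (11)
    mu = np.zeros(T)             # multipliers for the rate constraints (12)
    s = env.reset()
    for _ in range(k_max):
        # eq. (24): epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, P_total, E, R = env.step(a)
        # eq. (25): reward is the negative Lagrangian
        r = -(P_total + lam @ (env.E_req - E) + mu @ (env.R_req - R))
        # eq. (23): Q-value update
        Q[s, a] += rho * (r + gamma * Q[s_next].max() - Q[s, a])
        # subgradient ascent on the multipliers, projected onto lambda, mu >= 0
        lam = np.maximum(0.0, lam + step * (env.E_req - E))
        mu = np.maximum(0.0, mu + step * (env.R_req - R))
        s = s_next
    policy = Q.argmax(axis=1)    # eq. (26): greedy policy extracted from Q*
    return Q, policy, lam, mu
```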
Embodiment:
the diagrams provided in the following examples and the setting of specific parameter values in the models are mainly for explaining the basic idea of the present invention and performing simulation verification on the present invention, and can be appropriately adjusted according to the actual scene and requirements in the specific application environment.
The invention relates to a wireless energy-carrying communication scenario oriented to the PDMA technology, in which the transmitter and the receivers are each equipped with a single antenna. The effectiveness of the proposed method is demonstrated by simulation from four aspects: (1) the convergence performance of the algorithm under different learning rates is compared; (2) the variation of the total transmission power at the transmitting end with the user's received-energy requirement is compared across algorithms, where the proposed constrained-Markov-process-based Q-learning algorithm is compared with the genetic-algorithm-based DBN algorithm; (3) the variation of the total transmission power at the transmitting end with the user's data rate requirement is compared across the same two algorithms; (4) the variation of the minimum total transmission power at the transmitting end with different user quality-of-service requirements is examined as the number of receiving-end users changes.
In the simulation, we assume that all users are distributed within a circle with a radius of 300 meters centered at the base station, where $d_t$ is the distance from the base station to user $U_t$, and the path loss coefficient is assumed to be $\beta = 3.76$. To meet the energy requirement at the receiving end, the power conversion efficiency of the receiver for energy harvesting is assumed to be $\eta = 30\%$, and the noise power is $\sigma_k^2 = 0.01\,\mathrm{W}$. To learn the Q value, an action set satisfying constraints (13), (14), and (15) is set; thus, the state space is the finite set corresponding to this action space. In addition, the other parameters are set as: $k_{\max} = 2500$, $\varepsilon = 0.1$, and $\gamma = 0.8$. In the simulation, three performance indicators are used: the total transmission power at the transmitting end, the energy harvested at the receiving end, and the data rate. The merits of a resource allocation strategy are characterized by these indicators.
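For reference, the simulation constants stated above can be collected in one configuration object; parameters the text does not state (e.g., subcarrier bandwidth) are deliberately omitted rather than guessed:

```python
# Simulation constants as reconstructed from the text above.
sim_params = {
    "cell_radius_m": 300,      # users uniformly distributed in this disc
    "path_loss_beta": 3.76,
    "eta": 0.30,               # energy-harvesting conversion efficiency
    "sigma2_W": 0.01,          # noise power (reconstructed reading)
    "k_max": 2500,             # learning iterations
    "epsilon": 0.1,            # exploration rate
    "gamma": 0.8,              # reward discount coefficient
}
```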
As shown in FIG. 2, the convergence of the total transmission power under different learning rates is studied to determine a suitable learning rate for the algorithm, where $\rho$ is set to 0.4, 0.5, and 0.6, respectively. In addition, the number of users and the number of subcarriers are both set to 2. Furthermore, the harvested-energy constraint and the data rate constraint are set to $E_{\mathrm{req}} = 0.1\,\mathrm{W}$ and $R_{\mathrm{req}} = 1\,\mathrm{Mbit/s}$, respectively. It can be observed that the total transmission power converges to 0.35 W under all of the learning rates, although the convergence speed and stability differ across them. Considering both convergence speed and stability, a learning rate of 0.6 is adopted. Since the algorithm adopts an ε-greedy strategy, the total transmission power of the constrained-Markov-process-based resource allocation scheme fluctuates slightly as the number of iterations increases, but the overall trend of the total transmission power is not affected.
As shown in FIGS. 3 and 4, the effectiveness of the algorithm is studied by comparing the performance of the proposed Q-learning-based algorithm and the DBN algorithm under different user quality-of-service requirements. In the simulation parameter setting, the number of receiving-end users is set to 3. The results show that the proposed algorithm is effective and can significantly reduce the total transmission power.
Finally, FIG. 5 shows the minimum total transmission power at the transmitting end estimated by the proposed Q-learning algorithm under different numbers of users and different user quality-of-service constraints, where the number of users at the receiver is set to 2, 3, and 4, respectively. As shown in FIG. 5, the total transmission power at the transmitting end tends to increase as the user quality-of-service requirements increase. In addition, as the number of users increases, the minimum total transmission power at the transmitting end shows a gradually increasing trend. The above results verify the effectiveness and rationality of the algorithm.