Joint Beamforming, Power Allocation, and Splitting Control for SWIPT-Enabled IoT Networks with Deep Reinforcement Learning and Game Theory.

JainShing Liu¹, Chun-Hung Richard Lin², Yu-Chen Hu³, Praveen Kumar Donta⁴.

Abstract

Future wireless networks promise immense increases in data rate and energy efficiency while overcoming the difficulty of charging the wireless stations or devices in the Internet of Things (IoT) by means of simultaneous wireless information and power transfer (SWIPT). For such networks, jointly optimizing beamforming, power control, and energy harvesting to enhance the communication performance from the base stations (BSs) (or access points (APs)) to the mobile nodes (MNs) they serve is a real challenge. In this work, we formulate the joint optimization as a mixed integer nonlinear programming (MINLP) problem, which can also be viewed as a complex multiple resource allocation (MRA) optimization problem subject to different allocation constraints. Using deep reinforcement learning to estimate the future rewards of actions from the information reported by the users served by the networks, we introduce single-layer MRA algorithms based on deep Q-learning (DQN) and deep deterministic policy gradient (DDPG), respectively, as the basis for the downlink wireless transmissions. Moreover, by combining the capability of the data-driven DQN technique with the strength of a noncooperative game theory model, we propose a two-layer iterative approach to resolve the NP-hard MRA problem, which can further improve the communication performance in terms of data rate, energy harvesting, and power consumption. For the two-layer approach, we also introduce a pricing strategy for BSs or APs to determine their power costs on the basis of social utility maximization to control the transmit power. Finally, in a simulated environment based on realistic wireless networks, our numerical results show that the proposed two-layer MRA algorithm can achieve up to 2.3 times higher utility than its single-layer counterparts, which represent data-driven deep reinforcement learning-based algorithms extended to resolve the problem, where the utilities are designed to reflect the trade-off among the performance metrics considered.

Keywords: IoT; beamforming; deep reinforcement learning; energy harvesting; game theory; joint optimization; multi-resource allocation; power control

Year: 2022 PMID: 35336499 PMCID: PMC8955841 DOI: 10.3390/s22062328

Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576

1. Introduction

The tremendous growth in wireless data transmission results from the introduction of the fifth generation of wireless communications (5G) and will continue in wireless networks beyond 5G (B5G). In particular, the collaboration between the 5G-enabled Internet of Things (5G-IoT) and wireless sensor networks (WSNs) will extend the connections between the Internet and the real world and widen the scope of IoT services. In such collective networks, mobile edge computing (MEC) has been developed to offload part or all of the computing tasks to the edge, reducing the enormous data traffic and energy consumption brought by a great number of IoT devices and sensors [1,2]. Even so, realizing 5G or B5G IoT networks is still challenging because battery-powered IoT devices have limited energy. To alleviate this problem, simultaneous wireless information and power transfer (SWIPT) has been proposed to effectively and conveniently extend the lifetime of IoT devices, and it is employed in many related works [3,4,5,6,7]. In fact, SWIPT is a key technique in 5G and B5G because power allocation and interference management are still crucial issues to be addressed in communication networks [8,9]. Against this background, power control, beamforming, and interference coordination are usually adopted to strengthen the signal for data transmission and improve the data rates received by end users. However, these techniques by default treat interference as harmful to data transmission and ignore its potential to increase the communication capacity. By contrast, SWIPT opens up this potential by harvesting energy from ambient electromagnetic sources, including the interference signals.
Consequently, devices with SWIPT not only turn interference into a useful resource, but the residual energy of IoT devices can also benefit when SWIPT is combined with a higher signal-to-interference-plus-noise ratio (SINR). In this work, for the scenario in which multiple BSs or APs simultaneously transmit data and energy to their mobile nodes (MNs) at the edge, we further show that, when power control and interference management meet SWIPT, an overall system utility reflecting data rate, energy harvesting, and power consumption at the same time can be constructed to lead the system to an optimal trade-off among these performance metrics. Given that, how to allocate the transmit power, select the beamforming vector, and decide the power splitting ratio for the system is a complex multiple resource allocation (MRA) problem, which can be formulated as a mixed integer nonlinear programming (MINLP) problem or even a non-convex MINLP problem. In general, MINLP problems are NP-hard, and no efficient globally optimal algorithm is available. Thus, apart from traditional optimization programming methods [10,11,12,13,14,15,16,17], research efforts usually resort to game theory [18,19,20,21], graph theory [22,23], and heuristic algorithms [24,25] to reduce the complexity. More recently, inspired by the success of deep reinforcement learning (DRL) [26] in various important fields of computer science, using DRL to solve network problems, such as power control [27,28,29], joint resource allocation [30,31], and energy harvesting [32], has become one of the main trends in the communications community.
Although DRL is a useful tool for these problems, the resulting data-driven approaches usually treat a given resource optimization problem as a black box and learn its input/output relationship via various DRL techniques, without explicitly exploiting the advantages of model-based counterparts such as the game theory, graph theory, and heuristic algorithms mentioned previously. Noting this fact, in this work, we first show how to design DRL-based approaches operating in a single layer to (1) jointly solve for power control, beamforming selection, and power splitting decision, and (2) approach the optimal trade-off among the performance metrics without exhaustive search in the action space. Next, we show how to combine a data-driven DRL-based technique with a model-driven game-theory-based algorithm to form a two-layer iterative approach to resolve the NP-hard MRA problem. By benefiting from both data-driven and model-driven methods, the proposed two-layer MRA approach is shown to outperform the single-layer counterparts, which rely only on data-driven DRL-based algorithms.

1.1. Related Work

As a related work for LTE, the almost blank subframe (ABS) method was proposed in the standard [33] to resolve the co-channel inter-cell interference problem caused by two LTE base stations interfering with each other. Although ABS works well in fixed beam patterns, it was shown in [34] that ABS would be inefficient due to the dynamic nature of beamforming. Apart from the standard’s solution, particular attention has also been paid to the efforts on resolving different resource allocation (RA) problems. In this work, these efforts would be classified into two categories, namely model-driven methods and data-driven methods. According to our subjects, the former includes optimization methods and game theory methods while the latter simply denotes machine learning methods. As expected, a lot of previous works would be classified into the former, including graph theory [35,36], optimization decomposition [10,11,13,14,15,17], and dual Lagrangian method [12,16], in addition to game theory. As a kind of data-driven method in the latter, which requires no model-oriented analysis and design, DRL would play a key role in solving RA problems. For example, the work in [37] proposed an inter-cell interference coordination and cell range expansion technique in heterogeneous networks, wherein dynamic Q-learning-based methods were introduced to improve user throughput. In addition, the previous works [29,38,39] introduced different deep Q-learning-based power control methods to maximize their objectives. Apart from Q-learning, in [40,41], actor–critic reinforcement learning (ACRL) algorithms were developed to reduce energy consumption. Recently, with deep deterministic policy gradient (DDPG), an algorithm was proposed in [32] that can be applicable for continuous states to realize continuous energy management, getting rid of the curse of dimensionality due to discrete action space from Q-learning. Apart from the above, game-theory-based methods also received a lot of attention. 
For example, non-cooperative interference-aware RA has been proposed in [19] to improve the resource utilization efficiency of OFDMA networks. In [42], an interference coordination game was introduced, and the Nash equilibrium was found to reduce its computational complexity. Similarly, a joint transmit power and subchannel allocation problem was considered in [20], and a distributed non-cooperative game-based RA algorithm and a linear pricing technique were introduced therein to find the solutions. In addition, a power control problem for self-organizing small cell networks was formulated as a non-cooperative game in [21], which can then be solved by using the distributed energy efficient power control scheme proposed. Recently, by introducing a time-varying interference pricing with SWIPT, the authors in [18] modeled the power allocation problem as a non-cooperative game, and, by minimizing the total interferences experienced, they modeled the subchannel allocation problem as a non-cooperative potential game. Then, they proposed iterative algorithms to obtain the Nash equilibrium points corresponding to these games for the solutions. More recently, there are different learning-based approaches proposed to resolve various problems in IoT networks. For example, a beamforming design for SWIPT-enabled networks was introduced in [43], where the rate-splitting scheme and the power-splitting energy harvesting receiver are adopted for secure information transfer and energy harvesting, respectively. This work formulates an energy efficiency (EE) maximization problem and properly addresses the beamforming design issue. However, such an issue is not our focus. In [44], an EE maximization problem is considered for the SWIPT enabled heterogeneous networks (HetNets). To resolve this problem, the authors introduced a min-max probability machine and an interactive power allocation/splitting scheme based on convex optimization methods. 
In the latter, the Lagrange multipliers for the optimization problem involved are obtained by using the subgradient method, which could be time-consuming to converge. Despite the different design aim, our work instead develops a game-based interactive method additionally controlled by a threshold to meet our time constraint. In [45], a sum rate maximization problem was formulated for SWIPT enabled HetNets, which jointly optimizes transmit beamforming vectors and power splitting ratios. With the multi-agent DDPG method for the user equipment (UE) without mobility, this work exhibits a notable performance gain when compared with the fixed beamforming design. When UE is mobile and not in the same location vicinity, the wireless channel is not constant and varies with UE’s location. Taking this into account, the work in [46] resolved the dynamic problem with a multi-agent formulation to learn its optimization policy. Specifically, the authors resorted to the majorization–minimization (MM) technique and Dinkelbach algorithm to find the locally optimal solution using the convex optimization method for solving the power and time allocation problem involved. As a complement to these works, our approach considers single agent-based reinforcement learning to comply with the fact noted in [47] that, when a multi-agent setting is modified by the actions of all agents, the environment becomes non-stationary, and the effectiveness of most reinforcement learning algorithms would not hold in non-stationary environments [48]. In addition, by further collaborating with the game-based iterative algorithms, our approach would reduce the overhead resulting from, e.g., the MM approach to resolve a complex optimization problem such as that in [46].

1.2. The Motivations and Characteristics of This Work

In recent years, advances in artificial intelligence have been further helped by neural networks such as generative adversarial networks [49], which use game theory techniques in deep learning and can converge to the Nash equilibrium of the game involved. In general, these advances reflect the notion that a machine (computer) can learn about the outcomes of the game involved and teach itself to do better based on the probabilities, strategies, and previous instances of the game and of the other players, on the grounds of game theory. By extending this notion to the optimization framework, in this work, we further exhibit the possibility of applying learning-based methods, model-based methods, or both to resolve the joint beamforming, power control, and energy harvesting problem in SWIPT-enabled wireless networks, which can alleviate the difficulty of finding an optimal solution with an optimization tool within the required time. In particular, in this scenario, apart from BS i, which serves the user or MN and needs to decide its transmit power, beamforming vector, and power splitting ratio, the other BSs make their own decisions at the same time, which can simultaneously affect the user or MN served by BS i. Here, by leveraging this scenario, we design our approach to make a good trade-off between information decoding and energy harvesting, so that it can be deployed in an actual SWIPT-enabled IoT network as one of the various SWIPT applications surveyed in [50]. Specifically, by using the UE coordinates sent to the BS as in [51], the approach can align with the industry specification [33] through a slight modification that reduces the original signaling overhead of [33], in which the UE must send a channel state information report whose length is at least the number of antenna elements.
As a summary, we list the characteristics of this work as follows:

(1) We introduce two single-layer algorithms based on the conventional DRL models, DQN and DDPG, to solve the joint optimization problem, which is formulated here as a non-convex MINLP problem and realized as an MRA problem subject to different allocation constraints.
(2) We further propose a two-layer iterative approach that combines the capability of the data-driven DQN technique with the strength of a non-cooperative game theory model to resolve the NP-hard MRA problem.
(3) For the two-layer approach, we also introduce a pricing strategy that determines the power costs based on social utility maximization to control the transmit power.
(4) With a simulated environment based on realistic wireless networks, we show that, by means of both learning-based and model-based methods, the proposed two-layer MRA algorithm can outperform the single-layer counterparts, which rely only on data-driven DRL-based models.

The rest of this paper is structured as follows. In Section 2, we introduce the network and channel models for this work. Next, we present the single-layer learning-based approaches in Section 3, followed by the two-layer hybrid approach based on game theory and deep reinforcement learning in Section 4. These approaches are then numerically examined in Section 5 to show their performance differences. Finally, conclusions are drawn in Section 6.

2. Network and Channel Models

2.1. Network Model

As shown in Figure 1, an orthogonal frequency division multiplexing (OFDM) multi-access network with multiple base stations (BSs) (or access points (APs)) is considered for downlink transmission, in which each serving BS is associated with one mobile node (MN). The distance between two neighboring BSs is R, and the cell radius (or transmission range) of each BS is r > R/2 to allow overlap. Here, unlike the conventional coordinated multipoint Tx/Rx (CoMP) system, in which an MN could receive data from multiple BSs, we apply the SWIPT technique to the network so that an MN can simultaneously receive not only wireless information but also energy from different BSs. In addition, although mmWave brings many performance benefits as an essential part of 5G, it is also known to suffer high propagation losses due to the higher frequency bands adopted. Thus, analog beamforming for the downlink transmission is considered to alleviate these losses.

Figure 1

A system model with respect to the joint beamforming, power allocation, and splitting control for SWIPT-enabled IoT networks. In this model, each mobile node has a power split mechanism to split the received signal into two streams, one sent to the energy harvesting circuit for harvesting energy and the other to the communication circuit for decoding information.

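Next, to construct a beam pattern toward the MN more flexibly, each BS adopts a two-dimensional array of $M$ antennas, while each MN has a single antenna. Given that, the received signal at the MN associated with the i-th BS would be

$$y_i = \mathbf{h}_{i,i}^{H}\mathbf{f}_i x_i + \sum_{j \neq i} \mathbf{h}_{j,i}^{H}\mathbf{f}_j x_j + n_i. \tag{1}$$

In the above, $x_i, x_j \in \mathbb{C}$ are the transmitted signals from the i-th and j-th BSs, complying with the power constraints $\mathbb{E}[|x_i|^2] = P_i$ and $\mathbb{E}[|x_j|^2] = P_j$, where $P_i$ and $P_j$ are the transmit powers of the i-th and j-th BSs. In addition, $\mathbf{h}_{i,i}, \mathbf{h}_{j,i} \in \mathbb{C}^{M \times 1}$ are the channel vectors from the i-th and j-th BSs to the MN at the i-th BS, and $\mathbf{f}_i, \mathbf{f}_j \in \mathbb{C}^{M \times 1}$ denote the downlink beamforming vectors adopted at the i-th and j-th BSs, respectively. As the last term, $n_i$ represents the noise at the receiver, sampled from a complex normal distribution with zero mean and variance $\sigma_n^2$.

Beamforming: As mentioned previously, to counter the high propagation loss, analog beamforming vectors are assumed for transmission, and each $\mathbf{f}_i$ consists of the beamforming weights for a two-dimensional (2D) planar array steered towards the MN. More specifically, let each BS have a 2D array of antennas in the x–y plane, in which antenna $m$ is located at $(x_m, y_m)$, with the element positions specified in terms of the wavelength $\lambda$. Given the elevation direction $\vartheta$ and the azimuthal direction $\varphi$, the phased weights for the 2D array steered towards the angle $(\vartheta, \varphi)$ in polar coordinates can be given by

$$w_m(\vartheta,\varphi) = \exp\!\left(-j\,\frac{2\pi}{\lambda}\,\sin\vartheta\,\bigl(x_m\cos\varphi + y_m\sin\varphi\bigr)\right). \tag{2}$$

If the target is located on the x–y plane, $\sin\vartheta$ will be 1 and the weights can be simplified as $w_m(\varphi) = \exp\!\left(-j\,\frac{2\pi}{\lambda}\,(x_m\cos\varphi + y_m\sin\varphi)\right)$. Given that, we consider every beamforming vector to be selected from a steering-based beamforming codebook $\mathcal{F}$ with $|\mathcal{F}|$ elements, wherein the n-th element, or the array steering vector in the direction $\varphi_n$, is given by

$$\mathbf{f}_n = \mathbf{a}(\varphi_n) = \frac{1}{\sqrt{M}}\bigl[w_1(\varphi_n), \ldots, w_M(\varphi_n)\bigr]^{T}. \tag{3}$$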
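As an illustration of the steering-based codebook described above, the following minimal sketch builds unit-norm steering vectors for a planar array with NumPy. The function names, the half-wavelength element spacing, the sign convention of the phase, and the uniform grid of azimuth directions are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np


def planar_positions(n_x, n_y, wavelength, spacing=0.5):
    """Element coordinates (x_m, y_m) of an n_x-by-n_y grid in the x-y plane.

    `spacing` is the element spacing in wavelengths (half-wavelength assumed).
    """
    d = spacing * wavelength
    xs, ys = np.meshgrid(np.arange(n_x) * d, np.arange(n_y) * d, indexing="ij")
    return xs.ravel(), ys.ravel()


def steering_vector(x, y, wavelength, azimuth, elevation=np.pi / 2):
    """Unit-norm phased weights steered towards (elevation, azimuth).

    With elevation = pi/2 the target lies in the x-y plane and sin(elevation) = 1.
    """
    k = 2.0 * np.pi / wavelength
    phase = k * np.sin(elevation) * (x * np.cos(azimuth) + y * np.sin(azimuth))
    return np.exp(-1j * phase) / np.sqrt(x.size)


def build_codebook(x, y, wavelength, n_beams):
    """Codebook of n_beams steering vectors on a uniform azimuth grid (assumed)."""
    azimuths = np.linspace(0.0, 2.0 * np.pi, n_beams, endpoint=False)
    beams = np.stack([steering_vector(x, y, wavelength, a) for a in azimuths])
    return beams, azimuths


# Example: a 4-by-4 array (M = 16) and a codebook with 8 beams.
x, y = planar_positions(4, 4, wavelength=0.01)
codebook, azimuths = build_codebook(x, y, wavelength=0.01, n_beams=8)
```

Each row of `codebook` plays the role of one candidate beamforming vector; selecting a beam then reduces to selecting a row index, which is what makes the beamforming decision discrete.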

2.2. Channel Model

With the beamforming vector introduced above, we consider a narrow-band geometric channel model, which is widely used for mmWave networks [52,53,54]. Specifically, the channel from BS i to the MN in BS j is formulated here as

$$\mathbf{h}_{i,j} = \frac{\sqrt{M}}{\rho_{i,j}} \sum_{p=1}^{N_p} \alpha_{i,j}^{p}\,\mathbf{a}(\varphi_{i,j}^{p}), \tag{4}$$

where $\rho_{i,j}$ represents the path loss between BS i and the MN associated with BS j, $\alpha_{i,j}^{p}$ is the complex path gain, and $\mathbf{a}(\varphi_{i,j}^{p})$ denotes the array response vector with respect to $\varphi_{i,j}^{p}$, which is the angle of departure (AoD) of the p-th path. $N_p$ is the number of channel paths; compared with sub-6 GHz bands, this number is usually small for mmWave [55,56].

Next, let the received power measured by the MN associated with BS i over a set of resource blocks (RBs) on the channel from BS j to the MN be $P_j\,|\mathbf{h}_{j,i}^{H}\mathbf{f}_j|^2$. Given that, the received signal-to-interference-plus-noise ratio (SINR) for the MN associated with BS i can be obtained by

$$\gamma_i = \frac{P_i\,|\mathbf{h}_{i,i}^{H}\mathbf{f}_i|^2}{\sigma_n^2 + \sum_{j \neq i} P_j\,|\mathbf{h}_{j,i}^{H}\mathbf{f}_j|^2}. \tag{5}$$

As shown above, each BS i uses $P_i$ to transmit to its user with beamforming vector $\mathbf{f}_i$. When SWIPT is incorporated into power allocation, the use of beamforming in the mmWave MIMO system provides a new way to resolve both the interference and the energy problems [57,58,59]. To this end, each MN in the network is equipped with a power splitting unit that splits the received signal for simultaneous information decoding and energy harvesting. Given that, beamforming provides a dedicated beam for the MN through which power control and power splitting for energy harvesting can be realized at the same time. More specifically, in the power splitting architecture for the downlink, the signal received at the MN associated with BS i, which transmits with its beamforming vector $\mathbf{f}_i$ and transmit power $P_i$, is split into two separate signal streams according to the power split ratio $\theta_i$, which will be determined in the sequel to maximize the system utility.
In addition, when successive interference cancellation (SIC) is employed to mitigate the interference for data decoding, the stronger signals are decoded first, and the remaining weaker signals contribute to the interference for decoding. With $\mathbf{P}$ and $\boldsymbol{\Theta}$ denoting the sets of transmit powers and power split ratios, respectively, the SINR at the receiving MN i with SWIPT and SIC, $\gamma_i$ in (6), is obtained from (5) by keeping only the fraction of the received signal that the power split ratio $\theta_i$ assigns to data transmission and by counting only the weaker signals as interference. More specifically, with SIC [60], when multiple signals are received concurrently by the MN associated with BS i, any signals from other BSs that are stronger than the desired one are decoded and removed first. The desired signal is then decoded by treating the weaker signals from the other BSs, if they exist, as interference in addition to the noise $\sigma_n^2$.
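To make the interference structure concrete, the sketch below evaluates the plain SINR of each link (without power splitting or SIC) from the transmit powers, channel vectors, and beamforming vectors, and converts it to a spectral efficiency with a logarithm. The array layout, in which `channels[j, i]` holds the channel from BS j to the MN of BS i, and the function names are assumptions of this example; power splitting and SIC are deliberately left out because their exact form depends on the convention adopted for the split ratio.

```python
import numpy as np


def link_sinr(powers, channels, beams, noise_var):
    """SINR of every link from transmit powers, channels, and beams.

    powers:   shape (n,), transmit power of each BS
    channels: shape (n, n, M), channels[j, i] is the channel from BS j to the MN of BS i
    beams:    shape (n, M), beamforming vector used by each BS
    """
    powers = np.asarray(powers, dtype=float)
    # gains[j, i] = |h_{j,i}^H f_j|^2
    gains = np.abs(np.einsum("jim,jm->ji", channels.conj(), beams)) ** 2
    received = powers[:, None] * gains          # received[j, i] = P_j * gains[j, i]
    signal = np.diag(received)                  # desired power of each link
    interference = received.sum(axis=0) - signal
    return signal / (noise_var + interference)


def spectral_efficiency(sinr):
    """Rate per unit bandwidth, log2(1 + SINR)."""
    return np.log2(1.0 + sinr)


# Toy example: two BSs, one antenna each, unit channels and beams.
channels = np.ones((2, 2, 1), dtype=complex)
beams = np.ones((2, 1), dtype=complex)
sinr = link_sinr([2.0, 1.0], channels, beams, noise_var=1.0)
rates = spectral_efficiency(sinr)
```

In this toy setting the first link sees a desired power of 2 against interference 1 and noise 1, so raising one BS's power helps its own link while degrading the other, which is the coupling that the joint optimization has to resolve.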

2.3. Problem Formulation

Given these models, our aim is to jointly optimize the beamforming vectors, transmit powers, and power split ratios at the BSs to make the best trade-off among the data rates, harvested energies, and power consumption of all MNs served in the SWIPT-enabled network with SIC. This is formulated as a complex multiple resource allocation (MRA) optimization problem subject to the different allocation constraints that result from the different types of resources involved, shown as follows:

$$(\mathrm{P1}):\quad \max_{\mathbf{P},\,\boldsymbol{\Theta},\,\mathbf{F}} \;\sum_{i} U_i(\mathbf{P}, \theta_i, \mathbf{F}) \tag{7a}$$

$$\text{s.t.}\quad P_{\min} \le P_i \le P_{\max}, \quad \forall i, \tag{7b}$$

$$0 \le \theta_i \le 1, \quad \forall i, \tag{7c}$$

$$\mathbf{f}_i \in \mathcal{F}, \quad \forall i, \tag{7d}$$

where $U_i$ in (7a) denotes the utility function for the trade-off, to be introduced in (19). Constraint (7b) requires the transmit power $P_i$ to lie between the minimum transmit power $P_{\min}$ and the maximum transmit power $P_{\max}$. Constraint (7c) requires $\theta_i$ to be a nonnegative ratio no larger than 1. Finally, (7d) states that the vector $\mathbf{f}_i$ should be selected from its codebook $\mathcal{F}$. Clearly, if $U_i$ in the objective involves $\gamma_i$ in (6), (P1) will be a mixed integer nonlinear programming (MINLP) problem. It is even a non-convex MINLP problem, due to the non-convexity of the objective function and the allocation constraints involving discrete values, and its solution is hard to find even with an optimization tool. To resolve this hard problem efficiently, we propose two kinds of approaches based on deep reinforcement learning, game theory, or both, resulting in data-driven, model-driven, or hybrid iterative algorithms that operate in a single layer or in two different layers, as introduced in the following. In addition, for clarity, we summarize the important symbols for the approaches to be introduced in Table A1, which is located in Appendix A due to its size.

Table A1

Important symbols for the proposed approaches in this work.

Name	Description	Name	Description
P,Θ,F	sets of transmit powers, splitting ratios, and beamforming vectors, respectively	Pi,θi,fi	transmit power, splitting ratio, and beamforming vector for link i, respectively
L	set of locations	L^	loss function
S˜	a finite set of states, s1,s2,⋯,sm	s(t)	state at time t, denoted by L(t),P(t), Θ(t), F(t)
A˜	a finite set of actions, a1,a2,⋯,an	A^	a set of binary variables, where A^P, A^Θ, and A^F correspond to those for P, Θ, and F, respectively.
R˜	a finite set of rewards, where R˜(s,a,s′) is the function to provide reward r at state s∈S˜, action a∈A˜, and next state s′	P˜	a finite set of transition probabilities, where P˜ss′(a)=p(s′|s,a) is the transition probability at state s taking action a to migrate to state s′
π*(s)	optimal policy at state s	Vπ(s)	value function for the expected value to be obtained by policy π from state s∈S
V*(s)	optimal action at state s	Qπ(s,a)	action–value function representing the expected reward starting from state s and taking action a from policy π
Qπ*	optimal policy for the (optimal) action–value function Q*(s,a)=maxπQπ(s,a)	Q(st,at)	action–value (Q) function at time t
Q^(st,at, ω′)	approximated action–value (Q) function with the weight of DNN, ω′, at time t	F	beamforming codebook
Pi1	a set of transmit powers for link i in PSF1	P2	a set of transmit powers for all links in PSF2
Ui(P(t), θi(t),F(t))	reward for link i at time t, including data rate ri(P(t),θi(t),F(t)) and energy harvest Ei(P(t),θi(t) ,F(t))	(s(t),a(t), r(t),s′)	transition at time t, where r(t)=U(t) that is the system utility at this time step
α, αa, αc	learning rate, the (learning) rate specific to actor network, and the (learning) rate specific to critic network	ϵ,ϵmin	exploration rate (probability) and its minimum requirement
ζ	discount factor	τ	soft update parameter
d	exploration decay rate	D	replay buffer
η	batch size	ϱ	converge threshold for the fixed point iteration
Qa(s;ωa), Qa′(s;ωa′)	output of actor network (online and target, respectively)	Q(s,a;ωc), Q¯(s,a;ωc′)	output of critic network (online and target, respectively)
αi,μi,νi	parameters for data rate, energy harvesting, and power consumption, respectively, for link i	ς1, ς2, ς3	scale factors for normalization of DDPG at time t
a(t)*	deterministic action of DDPG at time t, wherein AP(t), AΘ(t), and AF(t)* correspond to those for transmit power, split ratio and beamforming vector	P˜i(t),θ˜i(t), f˜i(t)	variables for normalization of DDPG at time t
Ri	total received power at link i for the fixed point iteration	R^i	auxiliary variable at link i for the fixed point iteration
Pid	desired transmit power at link i for the pricing strategy	νPid	desired power cost at link i for the pricing strategy

3. Single-Layer Learning-Based Approaches

Determining an exact state transition model for (P1) through a model-based dynamic programming algorithm is challenging because the MRA problem on transmit power, power split ratio, and beamforming vector is location dependent. It is not trivial to list all the state–action pairs to be found in a state transition model predefined. Therefore, we design two single-layer learning-based algorithms derived from Markov decision process (MDP) to resolve this problem.

3.1. Q-Learning Approach

The Q-learning algorithm is based on the MDP, which can be defined as a 4-tuple $\langle \tilde{S}, \tilde{A}, \tilde{R}, \tilde{P} \rangle$, where $\tilde{S}$ is the finite set of states and $\tilde{A}$ is the set of discrete actions. $\tilde{R}(s,a,s')$ is the function providing the reward $r$ defined at state $s \in \tilde{S}$, action $a \in \tilde{A}$, and next state $s'$. $\tilde{P}_{ss'}(a) = p(s' \mid s, a)$ is the transition probability of the agent at state $s$ taking action $a$ to migrate to state $s'$. Given that, reinforcement learning is conducted to find the optimal policy $\pi^{*}$ that maximizes the total expected discounted reward. Among the different approaches to this end, Q-learning is widely considered; it adopts a value function $V^{\pi}(s)$ for the expected value to be obtained by policy $\pi$ from each $s \in \tilde{S}$. Specifically, based on the infinite-horizon discounted MDP, the value function is formulated to show the goodness of $\pi$ as

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \zeta^{t}\, \tilde{R}(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s\right], \tag{8}$$

where $\zeta \in [0,1)$ denotes the discount factor and $\mathbb{E}[\cdot]$ is the expectation operation. Here, the optimal policy is defined to map the states to the optimal actions in order to maximize the expected cumulative reward. In particular, the optimal action at each state $s$ can be obtained with the Bellman equation [61]:

$$V^{*}(s) = \max_{a \in \tilde{A}} \sum_{s'} \tilde{P}_{ss'}(a)\left[\tilde{R}(s,a,s') + \zeta V^{*}(s')\right]. \tag{9}$$

Given that, the action–value function is in fact the expected reward of this model starting from state $s$ and taking action $a$ according to policy $\pi$; that is,

$$Q^{\pi}(s,a) = \sum_{s'} \tilde{P}_{ss'}(a)\left[\tilde{R}(s,a,s') + \zeta V^{\pi}(s')\right]. \tag{10}$$

Let $Q^{\pi^{*}}$ denote the action–value function under the optimal policy, i.e., $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$. Then, we can obtain

$$V^{*}(s) = \max_{a \in \tilde{A}} Q^{*}(s,a). \tag{11}$$

The strength of Q-learning can now be seen: it can learn without knowing the environment dynamics, or $\tilde{P}_{ss'}(a)$, and the agent learns by adjusting the Q value with the following update rule:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \zeta \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\right], \tag{12}$$

where $\alpha$ denotes the learning rate. Despite this strength, the application of Q-learning is limited because the optimal policy can be obtained only when the state–action spaces are discrete and their dimension is relatively small. Fortunately, after considerable investigation of deep learning techniques, reinforcement learning has made significant progress by replacing the Q-table with a neural network, leading to the DQN, which can approximate $Q(s_t,a_t)$.
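For reference, the tabular Q-learning update rule discussed above can be sketched in a few lines. This is a generic illustration of the textbook update with a NumPy array as the Q-table; the function name and the toy state and action indices are assumptions of the example, not part of the proposed algorithms.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state, lr, discount):
    """One tabular Q-learning step: move Q(s, a) towards the bootstrapped target."""
    td_target = reward + discount * np.max(q_table[next_state])
    q_table[state, action] += lr * (td_target - q_table[state, action])
    return q_table


# Toy example with 2 states and 2 actions.
q = np.zeros((2, 2))
q[1] = [1.0, 3.0]  # current estimates at the next state
q = q_learning_update(q, state=0, action=1, reward=1.0, next_state=1,
                      lr=0.5, discount=0.9)
```

The table needs one entry per state–action pair, which is why this form becomes impractical once locations, powers, split ratios, and beam indices of all links are combined into the state.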
In particular, in DQN, the Q value at time t is rewritten as $Q(s_t,a_t;\omega)$, wherein $\omega$ is the weight of a deep neural network (DNN). Given that, the optimal policy in DQN can be represented by $\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a;\omega)$, where $Q^{*}$ denotes the optimal Q value obtained through the DNN. The goal of this approach is then to choose the approximated action, and the approximated Q value is given by

$$y_t \triangleq \hat{Q}(s_t,a_t;\omega') = r_t + \zeta \max_{a'} Q(s_{t+1},a';\omega'). \tag{13}$$

In the above, $\omega$ will be updated by minimizing the loss function

$$\hat{L}(\omega) = \mathbb{E}\!\left[\bigl(y_t - Q(s_t,a_t;\omega)\bigr)^2\right]. \tag{14}$$

Deep Q-learning elements: Following the Q-learning design approach, we next define the state, action, and reward function specific to solving (P1) as follows.

State: First, if there are n links in the network, the state at time t is represented in the sequel by using capital notations for its components and a superscript such as "$(t)$" for the time index, as follows:

$$s^{(t)} = \left\{L^{(t)}, P^{(t)}, \Theta^{(t)}, F^{(t)}\right\}, \tag{15}$$

where $L^{(t)} = \{\ell_1^{(t)}, \ldots, \ell_n^{(t)}\}$, $P^{(t)} = \{P_1^{(t)}, \ldots, P_n^{(t)}\}$, $\Theta^{(t)} = \{\theta_1^{(t)}, \ldots, \theta_n^{(t)}\}$, and $F^{(t)} = \{\mathbf{f}_1^{(t)}, \ldots, \mathbf{f}_n^{(t)}\}$. In the above, $\ell_i^{(t)}$ denotes the Cartesian coordinates of the MN in link i at time t, while the others, i.e., $P_i^{(t)}$, $\theta_i^{(t)}$, and $\mathbf{f}_i^{(t)}$, denote the transmit power, power splitting ratio, and beamforming vector for link i at time t, respectively. Among these variables, the transmit power is usually the only parameter considered in many previous works [27,62]. In the complex MRA problem, which also involves other types of resources, it is still a major factor affecting the system performance, because the SINR in (5) is significantly impacted by the power; thus, we consider two different state formulations for $P_i^{(t)}$ as follows.

Power state formulation 1 (PSF1): First, to align with the industry standard [33], which chooses integers for power increments, we consider a dB offset representation similar to that shown in [51] as the first formulation for the power state. Specifically, given an initial value, the transmit power $P_i$ (regardless of t) will be chosen from the set $\mathcal{P}_i^{1}$, given in (16), whose elements are offset from the initial value in dB steps.

Power state formulation 2 (PSF2): Next, as shown in [27], the performance of a power-controllable network can be improved by quantizing the transmit power with a logarithmic step size instead of a linear step size.
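Given that, the transmit power could be selected from the set $\mathcal{P}^{2}$, given in (17), whose values are quantized with a logarithmic step size. Apart from the above, the other parameters can be chosen similarly: $\theta_i^{(t)}$ can be chosen from the splitting ratio set $\Theta$ with a linear step size, and $\mathbf{f}_i^{(t)}$ can be selected from the predefined codebook $\mathcal{F}$ with a finite number of vectors or elements.

Action: The action of this process at time t, $a^{(t)}$, is selected from a set of binary decisions on the variables,

$$\hat{A} = \left\{\hat{A}_{P}, \hat{A}_{\Theta}, \hat{A}_{F}\right\}, \tag{18}$$

where $\hat{A}_{P}$, $\hat{A}_{\Theta}$, and $\hat{A}_{F}$ denote all the possible binary decisions on the three types of variables involved, respectively. That is, for each link i, the agent can decide to increase or decrease each of the variables to the next quantized value according to $\hat{A}_{P}$, $\hat{A}_{\Theta}$, and $\hat{A}_{F}$ in $\hat{A}$, respectively. Note that, as the number of values of a variable is limited, when the maximum or minimum value is reached with a binary action chosen from $\hat{A}$, a modulo operation is used to decide the index of the next quantized value in the state space. For example, in PSF2, if $P_i^{(t)}$ is already the largest value in $\mathcal{P}^{2}$ and the action chosen for link i is to increase it, then the modulo operation will lead to the smallest value in $\mathcal{P}^{2}$ as $P_i^{(t+1)}$. As another example, with $\mathbf{f}_1$ and $\mathbf{f}_{|\mathcal{F}|}$ denoting the first and the last vector in the codebook $\mathcal{F}$, respectively, the action of increasing or decreasing by 1 will choose the next or the previous vector of $\mathbf{f}_i^{(t)}$ in $\mathcal{F}$ as $\mathbf{f}_i^{(t+1)}$, and a similar modulo operation will be applied to keep $\mathbf{f}_i^{(t+1)}$ within $[\mathbf{f}_1, \mathbf{f}_{|\mathcal{F}|}]$.

Reward: To reduce the power consumption for green communication while maintaining the desired trade-off between the data rate and the energy harvesting, we introduce a reward function that represents a trade-off among the three metrics, properly normalized for link i with parameters $\alpha_i$, $\mu_i$, and $\nu_i$, at time t, as

$$U_i\bigl(\mathbf{P}^{(t)}, \theta_i^{(t)}, \mathbf{F}^{(t)}\bigr) = \alpha_i\, r_i\bigl(\mathbf{P}^{(t)}, \theta_i^{(t)}, \mathbf{F}^{(t)}\bigr) + \mu_i\, E_i\bigl(\mathbf{P}^{(t)}, \theta_i^{(t)}, \mathbf{F}^{(t)}\bigr) - \nu_i P_i^{(t)}, \tag{19}$$

where $r_i$ denotes the data rate of link i obtained at time t, which is given in (20) as a logarithmic function of the SINR in (6). In addition, $E_i$ is the energy harvested at the MN of link i at time t, represented in the log scale in (21), wherein the harvested energy in its raw form, given in (22), is the portion of the received power routed to the energy harvesting circuit scaled by the power conversion efficiency. In the above, $\nu_i$ is the price or cost for the power consumption to be paid for link i's transmission.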
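The binary increase/decrease action with modulo wrap-around described above can be illustrated with a short sketch. The geometric spacing used to build the power levels is only one possible way to realize a logarithmic step size and, like the function names, is an assumption of this example rather than the exact sets used in the paper.

```python
def log_spaced_levels(p_min, p_max, n_levels):
    """Quantized powers between p_min and p_max with a constant ratio between steps."""
    if n_levels < 2:
        raise ValueError("need at least two levels")
    ratio = p_max / p_min
    return [p_min * ratio ** (k / (n_levels - 1)) for k in range(n_levels)]


def apply_binary_action(index, increase, n_levels):
    """Move to the next or previous quantized value, wrapping around at the ends."""
    delta = 1 if increase else -1
    return (index + delta) % n_levels


levels = log_spaced_levels(1.0, 100.0, 3)   # three levels with ratio 10
top = len(levels) - 1
wrapped_up = apply_binary_action(top, increase=True, n_levels=len(levels))
wrapped_down = apply_binary_action(0, increase=False, n_levels=len(levels))
```

Here `wrapped_up` is index 0 and `wrapped_down` is the last index, so an "increase" at the largest power lands on the smallest one. The same index arithmetic applies to the split ratio set and to the beam index in the codebook.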
Note that the log representation is considered here to accommodate a normalization process in deep learning similar to the batch normalization in [63]. Otherwise, the data rate obtained with a log operation and the raw energy harvesting without the log operation would be combined directly in the utility function. With the metric values lying in very different ranges, such a raw representation could cause problems in the training process. Note also that, although the two weights could be set to compensate for the scale differences, a very high energy obtained in certain cases can still vary the utility function significantly and impede the learning process. Taking these into account, the system utility at time t is represented in (23) by the sum of these link rewards.

Policy selection: In general, Q-learning is an off-policy algorithm that can find a suboptimal policy even when its actions are obtained from an arbitrary exploratory selection policy [64]. Following that, we design the DQN-based MRA algorithm with a near-greedy action selection policy, which consists of (1) an exploration mode and (2) an exploitation mode. In exploration mode, the DQN agent randomly tries different actions at every time t to obtain a better state-action or Q value. In exploitation mode, the agent chooses at each time t the action that maximizes the Q value produced by the DNN with its current weight. More specifically, the agent explores with probability ϵ and exploits with probability 1 − ϵ, where ϵ is a hyperparameter that adjusts the trade-off between exploration and exploitation, resulting in an ϵ-greedy selection policy.

Experience replay: This algorithm also includes a buffer memory D as a replay memory to store transitions, each consisting of the current state, the action, the reward obtained by (23) at time t, and the next state. At each learning step, a mini-batch is constructed by randomly sampling the memory pool, and stochastic gradient descent (SGD) is then used to update the DNN weight.
By reusing previous experiences, experience replay exploits the stored samples more efficiently. Furthermore, by randomly sampling the experience buffer, a more independent and identically distributed data set can be obtained for training. To summarize the key points introduced above, we formulate the single-layer DQN-based MRA training algorithm with the pseudocode shown in Algorithm 1 for easy reference.

Algorithm 1
1: (Input) utility parameters, batch size η, learning rate α, minimum exploration rate ϵmin, discount factor ζ, and exploration decay rate d;
2: (Output) Learned DQN to decide the transmit power, power splitting ratio, and beamforming vector of each link i for (7);
3: Initialize the action and the replay buffer D;
4: for episode = 1 to the number of episodes do
5:   Initialize the state;
6:   for each time step t of the episode do
7:     Observe the current state;
8:     Decay the exploration rate ϵ with d, keeping it no lower than ϵmin;
9:     if a random number is below ϵ then
10:      Select an action at random;
11:    else
12:      Select the action with the maximum Q value;
13:    end if
14:    Observe the next state;
15:    Store the transition in D, where the reward is obtained with (23);
16:    Select a mini-batch of randomly stored samples from D for experience;
17:    Obtain the approximated Q value for all j samples with (13);
18:    Perform SGD to minimize the loss in (14) for finding the optimal weight of the DNN;
19:    Update the weight in the DQN and move to the next state;
20:  end for
21: end for
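The ϵ-greedy selection and experience replay steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the multiplicative decay schedule is an assumed reading of "exploration decay rate", and dropping the oldest transition when the buffer is full is borrowed from the description of the DDPG variant.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory D (hypothetical helper); the oldest transition is dropped when full."""

    def __init__(self, capacity: int):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement reduces temporal correlation in a mini-batch.
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))


def decay_epsilon(epsilon: float, d: float, epsilon_min: float) -> float:
    """Assumed schedule: multiply by d at each step, never going below epsilon_min."""
    return max(epsilon * d, epsilon_min)


def select_action(q_values: np.ndarray, epsilon: float, rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise exploit the largest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration mode
    return int(np.argmax(q_values))          # exploitation mode
```

With the values in Table 2 (ϵ = 1.0, d = 0.9995, ϵmin = 0.1), such a schedule would start fully exploratory and settle at a 10% exploration rate.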

3.2. DDPG-Based Approach

Similar to what is found in the literature [28,29], DQN, as a deep reinforcement learning algorithm, is superior to the classical Q-learning algorithm because it can handle problems with high-dimensional state spaces, which the latter can hardly do. However, DQN still works on a discrete action space and suffers from the curse of dimensionality when the action space becomes large. For this, we next develop a deep deterministic policy gradient (DDPG)-based algorithm that can find optimal actions in a continuous space to solve this MRA optimization problem without the action quantization required by the DQN-based algorithm. Specifically, with DDPG, we aim to determine an action a that maximizes the action–value function for a given state s. That is, our goal is to find the maximizing action, as done with the DQN introduced previously. However, unlike DQN, DDPG has two neural networks, namely the actor network and the critic network, and each contains two subnets with the same architecture, namely the online net and the target net. First, the actor network, whose DNN weight is called the “actor parameter”, takes state s and outputs a deterministic action a. Second, the critic network, whose DNN weight is called the “critic parameter”, takes state s and action a as its inputs to produce the action–value function, which simulates a table for Q-learning, or Q-table, while avoiding the curse of dimensionality.

Given that, two key features of DDPG can be summarized as follows.

Exploration: As defined, the actor network provides solutions to the problem, playing a crucial role in DDPG. However, as it is designed to produce only deterministic actions, additional noise, n, is added to its output so that the actor network can explore the solution space.
That is, the exploration action is the actor output plus the noise n.

Updating the networks: Next, with a transition denoting that a reward is obtained by taking action a at state s to migrate to the next state, as in DQN, the update procedures for the critic and actor networks can be summarized as follows. As shown in (24), the actor network is updated by maximizing the action–value function. In terms of the actor and critic parameters, this maximization problem can be rewritten as finding the actor parameter that maximizes the critic output. Here, as the action space is continuous and the action–value function is assumed to be differentiable, the actor parameter is updated with the gradient ascent method. Furthermore, as the gradient depends on the derivative of the objective function with respect to the actor parameter, the chain rule can be applied: the gradient is the product of the derivative of the critic output with respect to the action and the derivative of the actor output with respect to the actor parameter. Then, as the actor network outputs the action adopted by the critic network, the actor parameter can be updated by maximizing the critic network’s output with the action obtained from the actor network, while fixing the critic parameter.

Apart from the actor network, which generates the needed actions, the critic network is also crucial to ensure that the actor network is well trained. To update the critic network, two aspects are considered. First, with the action from the target actor network as an input of the target critic network, the action–value function produces a target value. Second, the output of the critic network can be regarded as another estimate of the action–value function. Based on these aspects, the critic network can be updated by minimizing a loss function on the difference between the two estimates, and the critic parameter is obtained by finding the parameter that minimizes this loss. Finally, the target nets in both the critic and actor networks are updated on their parameters with the soft update parameter, τ, as in (29).

Action representation for the MRA problem: As defined, the actor network outputs a deterministic action.
Because this output is deterministic, a dynamic ϵ-greedy policy is used to determine the action by adding a noise term to explore the action space. Here, as the state in this work involves different types of variables, the action at time t in fact consists of three parts, one for each type of variable. With the corresponding noises added, the exploration action is specified in (30), where the different parts are clipped to intervals according to the different types of variables, and the added noises are drawn from normal distributions, also according to the different types, with parameters that involve the exploration decay rate at time t.

State normalization and quantization: As shown in previous works [32,63,65], normalizing the state to preprocess the training samples leads to a much easier and faster training process. In our work, the three types of variables (shown in vector forms) in the state may have values lying in very different ranges, which could cause problems in the training process. To prevent this, we normalize the coordinates with the cell radius, and these variables with the scale factors ς1, ς2, and ς3, as in (32). In the above, the beamforming variable is an integer rounded from its real counterpart to denote which element in the codebook is to be used, because the output of DDPG is a continuous action. Note that, after the rounding operation in (33) (represented by the floor function), the value may still be out of its feasible range, and thus a modulo operation similar to that for DQN is also applied to keep it within the codebook. For the other types of variables, the corresponding modulo operations are required to keep them in their feasible ranges as well. Still, due to their continuous nature, the rounding operation is avoided for them. Apart from the above, the critic network serves to transfer the gradient in learning and is not involved in action generation.
In particular, the critic network evaluates the current control action based on the performance index (23), while the transmit powers, power splitting ratios, and beamforming vectors used in the utility of (23) are obtained by the actor network. Apart from these networks, the DDPG-based algorithm also includes an experience replay mechanism, as in the DQN counterpart. That is, when the experience buffer D is full, the current transition replaces the oldest one in D, and the algorithm then randomly chooses η stored transitions to form a mini-batch for updating the networks. Given these sampled transitions, the critic network updates its online net by minimizing the loss function in (36), and the actor network updates its online net with the gradient in (37). Finally, we summarize the single-layer DDPG-based MRA training algorithm in Algorithm 2 for easy reference.

Algorithm 2
1: (Input) utility parameters, batch size η, actor learning rate αa, critic learning rate αc, decay rate d, discount factor ζ, and soft update parameter τ;
2: (Output) Learned actor/critic to decide the transmit power, power splitting ratio, and beamforming vector of each link i for (7);
3: Initialize the actor, critic, action, and replay buffer D, and set the initial decay rate;
4: for episode = 1 to the number of episodes do
5:   Initialize the state;
6:   for each time step t of the episode do
7:     Normalize the state with (32);
8:     Execute the action in (30), obtain the reward with (23), and observe the new state;
9:     if replay buffer D is not full then
10:      Store the transition in D;
11:    else
12:      Replace the oldest transition in buffer D with the current one;
13:      Randomly choose η stored transitions from D;
14:      Update the critic online network by minimizing the loss function in (36);
15:      Update the actor online network with the gradient obtained by (37);
16:      Soft update the target networks with their parameters updated by (29);
17:      Update the decay rate with d;
18:    end if
19:  end for
20: end for
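Three of the DDPG building blocks described above (the soft update of the target nets, the noisy and clipped exploration action, and the floor-plus-modulo mapping of a continuous output to a codebook index) can be illustrated as follows. This is a minimal sketch with hypothetical function names; the noise scale `sigma` and the clipping bounds are placeholders rather than the values defined by the paper's equations.

```python
import numpy as np


def soft_update(target: np.ndarray, online: np.ndarray, tau: float) -> np.ndarray:
    """Blend online weights into target weights with soft update parameter tau."""
    return tau * online + (1.0 - tau) * target


def explore(action: np.ndarray, sigma: float, low: float, high: float,
            rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to a deterministic action, then clip to [low, high]."""
    noisy = action + rng.normal(0.0, sigma, size=action.shape)
    return np.clip(noisy, low, high)


def codebook_index(raw_value: float, codebook_size: int) -> int:
    """Round a continuous output down and wrap it into 0 .. codebook_size - 1."""
    return int(np.floor(raw_value)) % codebook_size


rng = np.random.default_rng(0)
noisy_ratio = explore(np.array([0.5, 0.95]), sigma=0.2, low=0.0, high=1.0, rng=rng)
new_target = soft_update(np.zeros(2), np.ones(2), tau=0.01)
```

With τ = 0.01 as in Table 2, each soft update moves the target weights only 1% of the way toward the online weights, which is what keeps the target values slowly varying.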

4. Two-Layer Hybrid Approach Based on Game Theory and Deep Reinforcement Learning

As exhibited above, DDPG can be used for continuous action spaces as well as high-dimensional state spaces, which would overcome the difficulty of DQN which can apply only to discrete action spaces. However, the MRA problem includes both discrete and continuous variables, which requires DDPG to quantize the continuous variables involved to be their discrete counterparts as shown in (33). In addition, as a data driven approach, deep reinforcement learning does not explicitly benefit from an analytic model specific to the problem. To take the advantages from both data-driven and model-driven approaches, we propose in the following a novel approach that consists of two layers, where the lower layer is responsible for the continuous power allocation (PA) and energy harvest splitting (EHS) by using a game-theory-based iterative method, and the upper layer resolves the discrete beam selection problem (BSP) by using a DQN algorithm. That is, if , can be given, PA and EHS on and for each link i could be decomposed from the objective. Then, we could simplify the MRA problem by reducing (P1) to a BSP sub-problem and a PA/EHS sub-problem. Specifically, the latter (PA/EHS) is given by Clearly, if the BSP sub-problem can be solved, the major challenge of this approach would be the PA/EHS sub-problem shown in (P2). Here, even represented by a simpler form, (P2) is still a non-convex problem whose solution for link i will depend on the other links . That is, despite EHS, the PA problem still remains in (P2) that a larger would increase SINR of link i while reducing those of the other links in (6), increase energy harvesting in (22), or both, at the cost for in the objective function.

4.1. Game Model

To overcome this difficulty, we convert (P2) into a non-cooperative game among the multiple links, which can be regarded as self-interested players; finding its Nash equilibrium (NE) is the fundamental issue in this game model. On the one hand, a link i can be seen as a non-cooperative game player that chooses its own transmit power and power splitting ratio to make a trade-off: a larger transmit power leads to a higher SINR value in (6) for data rate, a higher value in (22) for energy harvesting, or both, at the cost of a higher power consumption, and vice versa. On the other hand, the utility given in (19) is designed to reduce the power consumption for green communication while maintaining a desired trade-off between the data rate and the energy harvesting. The game-based pricing strategy is thus designed so that the BS can require its link to pay a certain price for the power consumed by its transmission. For this, the weight on the data rate can be interpreted as the willingness of player i to pay for the data rate, and the weight on the energy harvesting as its willingness to pay for the energy harvesting. Given that, each link or player i determines its transmit power and power splitting ratio based on the price to maximize its own utility. In this maximization, the two weights and the price are predetermined values for player i and are unknown to the other players, which forms the basis of the non-cooperative game.

4.2. Existence of Nash Equilibrium

To ensure that the outcome of the non-cooperative game is effective, we next show that this game has at least one Nash equilibrium. As noted in [66], a Nash equilibrium point represents a situation wherein every player is unilaterally optimal and no player can increase its utility alone by changing its own strategy. Furthermore, according to the game theory fundamentals [66], the non-cooperative game admits at least one Nash equilibrium point if (1) the strategy space is a nonempty, compact, and convex set, and (2) the utility function is continuous and quasiconcave with respect to the action space. In (P2), the utility function can be verified to satisfy the above conditions. Specifically, for the first condition, we note that the transmit power is bounded by Pmin and Pmax, and the power splitting ratio is a real number bounded by 0 and 1. The strategy space for each link i in the proposed game model is then the product of these two intervals, which is a compact (closed and bounded) convex set as required. For the second condition, we derive in (39) the partial derivative of the utility function with respect to the power, which involves the total received power at link i and accommodates the effect of the SIC involved. Similarly, the partial derivative of the utility with respect to the power splitting ratio is given in (41). Furthermore, from (39) and (41), the second derivatives of the utility function with respect to the transmit power and the power splitting ratio, respectively, can be obtained. It is easy to see that both second derivatives are less than or equal to 0, implying that the utility function is concave in each of these variables and hence quasiconcave. In addition, the utility function is continuous in both variables. Consequently, the utility functions of all links satisfy the required conditions for the existence of at least one Nash equilibrium.

4.3. Power Allocation and Energy Harvest Splitting in the Lower Layer

Based on the non-cooperative game model introduced, the associated BS is responsible for deciding the transmit power and the power splitting ratio for link i, given the channel state information and the two weights, which can be done by finding the Nash equilibrium of the game. To see this, we note that, as the utility functions are concave with respect to the transmit power and the power splitting ratio, this decision can be made by solving the system of equations in (43), which sets the partial derivatives to zero for every link, where n denotes the number of links in the network. To solve this system, we propose an iterative algorithm based on the game model, and through a fixed-point iteration process, the system in (43) can be solved numerically. Here, by taking the derivative with respect to the transmit power (resp. the power splitting ratio) and setting the result equal to 0, we transform the system into the fixed-point form in (44) for each link i, which facilitates convergence and involves an auxiliary variable. To show the iterative process more clearly, we index the transmit power, the total received power, the auxiliary variable, and the power splitting ratio for link i by the iteration number k. The iterations on the transmit power and the power splitting ratio are then given by (47) and (48), which relate iterations k and k + 1 and bound the results by the corresponding maximum and minimum values.
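The shape of such a bounded fixed-point iteration can be sketched as below. This is a generic illustration only: `update_p` and `update_rho` are hypothetical stand-ins for the update maps of the paper, whose exact expressions are not reproduced here, and the toy maps in the usage lines are simple contractions chosen so that the iteration provably converges.

```python
from typing import Callable, Tuple


def clip(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def fixed_point(update_p: Callable[[float, float], float],
                update_rho: Callable[[float, float], float],
                p0: float, rho0: float,
                p_min: float, p_max: float,
                tol: float = 1e-6, max_iter: int = 100) -> Tuple[float, float, int]:
    """Iterate both updates, projecting onto [p_min, p_max] and [0, 1], until changes fall below tol."""
    p, rho = p0, rho0
    for k in range(1, max_iter + 1):
        p_next = clip(update_p(p, rho), p_min, p_max)
        rho_next = clip(update_rho(p_next, rho), 0.0, 1.0)
        converged = abs(p_next - p) < tol and abs(rho_next - rho) < tol
        p, rho = p_next, rho_next
        if converged:
            return p, rho, k
    return p, rho, max_iter


# Toy contraction maps (assumptions for illustration): fixed points at p = 10 and rho = 0.4.
p_star, rho_star, iterations = fixed_point(
    lambda p, rho: 0.5 * p + 5.0,
    lambda p, rho: 0.5 * rho + 0.2,
    p0=1.0, rho0=0.5, p_min=1.0, p_max=40.0)

# A map that overshoots the feasible range is held at the upper bound.
p_capped, _, _ = fixed_point(
    lambda p, rho: 100.0,
    lambda p, rho: 0.5 * rho + 0.2,
    p0=1.0, rho0=0.5, p_min=1.0, p_max=40.0)
```

Projecting after every update is what keeps each iterate inside the compact strategy space used in the existence argument.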

4.4. Beam Selection in the Upper Layer and the Overall Algorithm

With the transmit powers and energy splitting ratios obtained from the lower layer at a low cost, the two-layer hybrid approach resolves the remaining beam selection problem with a DQN-based algorithm in the upper layer, which reduces the computational overhead compared with the DQN approach in Section 3.1 and the DDPG-based approach in Section 3.2. In addition, unlike the previous approaches, which consider either a discrete or a continuous action space only, the two-layer approach obtains the variables in their own domains without either approximating the hybrid space by discretization or relaxing it into a continuous set. As a result, the two-layer approach achieves higher utilities than the others, as exemplified in the experiments.

Specifically, we use a DQN-based algorithm in the upper layer to resolve the beam selection problem in its own discrete action space. Compared with that in Section 3.1, this algorithm considers locations and beamforming vectors only, leading to a reduced DQN model whose state at time t contains only these two types of variables, and whose action is selected from a set of decisions on the beamforming vectors only, modified to also cover the case of no change. That is, each decision can now take a third value, 0, in addition to the two binary values, where 0 implies no change to the previous beam selection. Integrating this modification with the lower layer gives the two-layer hybrid MRA training algorithm shown in Algorithm 3, with its flowchart shown in Figure 2. Similar to Algorithms 1 and 2, the training algorithm takes the parameters for the utility, the hyperparameters for the learning algorithm, and the parameters for the game-based method as the input, and produces as the output a learned DQN model that can afterwards decide online the transmit powers, power splitting ratios, and beamforming vectors for the optimization problem in (7). Apart from the input and output, its main steps are summarized as follows:

Figure 2

Flowchart of the two-layer hybrid MRA training algorithm. In the upper half, the input and the state (corresponding to line 1 and line 3 in Algorithm 3) are shown by the first box and the second box, respectively (from left to right). After selection (lines 10–14), the new state is shown by the fourth box. In the bottom half, the lower-layer iterations (lines 17–27) are exhibited with a box showing the equations involved. The two halves then cooperatively produce the reward (line 28) shown in the rightmost side toward the remaining boxes at the top denoting the following steps in Algorithm 3.

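Algorithm 3
1: (Input) utility parameters, batch size η, learning rate α, minimum exploration rate ϵmin, discount factor ζ, exploration decay rate d, and convergence threshold;
2: (Output) Learned DQN to decide the transmit power, power splitting ratio, and beamforming vector of each link i for (7);
3: (Upper-layer DQN-based learning:)
4: Initialize the action and the replay buffer D;
5: for episode = 1 to the number of episodes do
6:   Initialize the state;
7:   for each time step t of the episode do
8:     Observe the current state;
9:     Decay the exploration rate ϵ with d, keeping it no lower than ϵmin;
10:    if a random number is below ϵ then
11:      Select an action from the action set at random;
12:    else
13:      Select the action with the maximum Q value;
14:    end if
15:    Observe the next state;
16:    (Lower-layer game-theory-based iteration:)
17:    for each link i do
18:      for each iteration k up to the maximum number of iterations do
19:        Update the transmit power with (47);
20:        Update the power splitting ratio with (48);
21:        if the change between iterations is below the convergence threshold then
22:          Keep the converged values;
23:          break;
24:        end if
25:      end for
26:      Set the transmit power and the power splitting ratio of link i to the iteration results;
27:    end for
28:    Determine the reward based on the transmit powers and power splitting ratios from the lower layer and the beamforming vectors from the upper layer;
29:    Store the transition in D;
30:    Select a mini-batch of random samples from D;
31:    Calculate the approximated Q value and perform SGD to find the optimal weight of the DNN;
32:    Update the weight for the DQN in the upper layer;
33:    Move to the next state;
34:  end for
35: end for

The main steps are:
(1) Observe the state at time t for beam selection.
(2) Select an optimal action from the action set at time step t.
(3) Given the selected beamforming vectors, obtain the transmit powers and splitting ratios through the game-theory-based iterative method in the lower layer.
(4) Assess the impact on the data rate, energy harvesting, and transmit power for all links i.
(5) Reward the action at time t based on the impact assessed.
(6) Train the DQN with the system utility obtained.

After the training or learning period, say T, the trained DQN from Algorithm 3 is used in the testing process to observe the subsequent states, evaluate the utility with the given utility parameters, and then take actions to decide the transmit powers, power splitting ratios, and beamforming vectors for the system.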

4.5. Time Complexity

Next, we show the time complexity for each of these algorithms before revealing their performance differences in the next section. Specifically, let the number of episodes be , and the number of time-steps per episode be . Assuming that the Q-learning network in Algorithm 1 has J fully connected layers, the time complexity with regard to the number of (floating point) operations in this algorithm would be based on the analysis in [32], where denotes the unit number in the jth layer, and is the input state size. In each time-step of an episode, there may be other operations such as the random selection of an action in line 10 not involving the neural network, which could be ignored when compared with the former for the analysis. Thus, taking the nesting for loops (the outer is episode loop and the inner is time-step loop) into account, we have its worst-case time complexity as . Apart from training, DDPG also involves a normalization process whose time complexity could be denoted by , where is the number of the variables in the state set. In addition, the actor and critic networks of DDPG in Algorithm 2 are assumed to have J and K fully connected layers, respectively. According to [32], the time complexity with respect to these networks in the training algorithm would be , where and denote the unit number in the ith layer with respect to the actor network and the critic network, respectively. Then, by taking the nesting loops into account as well, we have the overall time complexity of this algorithm as . Finally, let the number of links be n and the number of iterations per link be in addition to and given previously. As the two-layer hybrid training algorithm involves the lower-layer game-theory-based iterations, the overall time complexity of Algorithm 3 would be , where denotes the unit number in the jth layer with respect to the DQN neural network in this algorithm. 
Note that, although there are additional iterations for the lower layer, the input state size is that could be much smaller than in the single-layer Algorithm 1 with DQN, while , is considered. In addition, it requires no normalization process and has the computational overhead on its neural network lower than that of on the two different types of neural networks in Algorithm 2.
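The layer-wise operation count behind this kind of analysis can be made concrete with a short calculation. The sketch below counts one multiply-accumulate per weight for a forward pass through fully connected layers and then scales by the two nested loops. The hidden-layer widths (24 and 24) are hypothetical placeholders, while the input and output sizes are the state and action sizes reported in Table 2.

```python
from typing import Sequence


def dense_ops(layer_sizes: Sequence[int]) -> int:
    """Multiply-accumulate count of one forward pass: sum of products of adjacent layer widths."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def training_ops(layer_sizes: Sequence[int], episodes: int, steps: int) -> int:
    """Scale the per-step count by the episode loop and the time-step loop."""
    return episodes * steps * dense_ops(layer_sizes)


single_layer_dqn = [10, 24, 24, 64]  # state size 10, action size 64; hidden widths assumed
two_layer_dqn = [6, 24, 24, 9]       # state size 6, action size 9; hidden widths assumed

ops_single = dense_ops(single_layer_dqn)
ops_two = dense_ops(two_layer_dqn)
total_single = training_ops(single_layer_dqn, episodes=400, steps=500)
```

Under these assumed widths, the reduced input and output sizes of the upper-layer DQN cut the per-step network cost to well under half of the single-layer DQN, before the lower-layer iterations are added.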

5. Numerical Experiments

In this section, we conduct simulation experiments to evaluate the proposed two-layer approach and compare it with the single-layer approaches also introduced. To this end, we first present the simulation setup adopted and the parameters involved. Then, we show the performance differences between the two-layer hybrid MRA algorithm based on game theory and deep reinforcement learning, and the single-layer counterparts based on the conventional deep reinforcement learning models (DQN and DDPG).

5.1. Simulation Setup

With the network model and the channel model introduced in Section 2, we let MNs be uniformly distributed in the simulated cellular network and move at an average speed of v = 2 km/h under log-normal shadow fading as well as small-scale fading. In this environment, the cell radius is set to r^ = 150 m and the distance between sites or BSs is 1.5 r^ (225 m), and MNs experience a line-of-sight probability of Plos = 0.7 on the signals from BSs. For easy reference, the important parameters for the radio environment, including those not mentioned above, are summarized in Table 1.

Table 1

Important radio environment parameters.

Parameter	Value
Maximum transmit power (Pmax)	40 W (46 dBm)
Minimum transmit power (Pmin)	1 W (30 dBm)
Probability of line of sight (Plos)	0.7
Cell radius (r^)	150 m
Distance between sites (BSs)	225 m
Antenna gain	3 dBi
Mobile node (MN) antenna gain	0 dBi
Number of multipaths	4
MN movement speed on average (v)	2 km/h
Number of transmit antennas of BS	4, 8, 16, 32
Downlink frequency band	28 GHz
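The watt and dBm figures in Table 1 can be checked with the standard conversion, 10·log10 of the power in milliwatts. The helper name below is hypothetical.

```python
import math


def watts_to_dbm(power_watts: float) -> float:
    """Convert a power in watts to dBm (decibels relative to 1 mW)."""
    return 10.0 * math.log10(power_watts * 1000.0)


p_max_dbm = watts_to_dbm(40.0)  # maximum transmit power in Table 1
p_min_dbm = watts_to_dbm(1.0)   # minimum transmit power in Table 1
```

Both values agree with the rounded entries of the table (46 dBm and 30 dBm).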

Apart from the radio parameters, a convergence threshold is set for the lower-layer iterations of the two-layer algorithm, and the hyperparameters for the deep reinforcement learning models are tabulated in Table 2. For example, in the DQN for the single-layer approach, the state at time t has 10 components, which corresponds to the size of state listed in this table. In addition, as introduced in Section 3.1, a dB offset representation is considered for PSF1, and the number of power levels is set to 9 for PSF2 to construct their respective power sets. Furthermore, an offset representation and a set of 11 values with a step size of 0.1 are used as the power splitting ratio sets for PSF1 and PSF2, respectively. Nevertheless, the size of action is 64 according to the binary decisions defined in (18), regardless of whether PSF1 or PSF2 is used in DQN. Apart from the above, for the two-layer approach, the DQN for the upper layer considers only the beamforming vectors in addition to the locations, which reduces the size of state to 6. Moreover, as its decisions can also take the value 0 for no change instead of being binary, the size of action becomes 9. Despite these differences, the other hyperparameters of DQN are the same for both the single- and two-layer approaches. Finally, the hyperparameters for DDPG are chosen to reflect its average performance with a reasonable execution time, and codebooks with 4, 8, 16, and 32 elements or vectors, corresponding to the different numbers of antennas in the radio environment, are considered for all the algorithms involved.

Table 2

Reinforcement learning parameters.

Parameter	Value
DQN:
Discount factor (ζ)	0.995
Learning rate (α)	0.01
Initial exploration rate (ϵ)	1.0
Minimum exploration rate (ϵmin)	0.1
Exploration decay rate (d)	0.9995
Size of state (|s|)	10
Size of action (|a|)	64
Replay buffer size (|D|)	2000
Batch size (η)	256
DDPG:
Actor learning rate (αa)	0.001
Critic learning rate (αc)	0.002
Replay buffer size (|D|)	10000
Exploration decay rate (d)	0.9995
Batch size (η)	32
Scale factors (ς1, ς2, ς3)	1
Discount factor (ζ)	0.9
Soft update parameter (τ)	0.01
DQN for two-layer:
Size of state (|s|)	6
Size of action (|a|)	9
The same parameters for the single-layer DQN
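The state and action sizes in Table 2 are consistent with a simple count, sketched below. The number of links (two) is not stated in this section and is an inference from the listed sizes, and representing each beamforming vector by one codebook index is likewise an assumption. The geometric spacing used for the nine PSF2 power levels is one plausible reading of "logarithmic step size", not the paper's exact set.

```python
import numpy as np

N_LINKS = 2  # assumption inferred from the sizes in Table 2

# Single-layer DQN: (x, y) coordinates, power, splitting ratio, and beam index per link.
state_single = N_LINKS * (2 + 3)
# One binary decision per variable type and per link.
actions_single = 2 ** (3 * N_LINKS)

# Upper-layer DQN of the two-layer approach: coordinates and beam index only.
state_two = N_LINKS * (2 + 1)
# Three-valued beam decision per link (increase, decrease, or no change).
actions_two = 3 ** N_LINKS

# PSF2 sets: 11 splitting ratios with step 0.1, and 9 power levels between Pmin and Pmax.
rho_set = np.round(np.linspace(0.0, 1.0, 11), 1)
power_levels = np.geomspace(1.0, 40.0, 9)  # assumed geometric spacing, in watts


def step_index(index: int, decision: int, size: int) -> int:
    """Move to a neighbouring quantized value, wrapping around with a modulo operation."""
    return (index + decision) % size
```

For instance, `step_index(8, +1, 9)` wraps from the highest of nine levels back to index 0, which mirrors the modulo rule used to keep actions inside the quantized sets.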

Given that, we conduct 50 experiments with different seeds for all the algorithms under comparison. Each experiment has 400 training episodes or epochs in total. At the beginning of each episode, MNs are randomly located in the simulated network, and they then move at speed v for 500 time slots per episode. Afterward, with the parameters trained by these algorithms, we conduct another 100 episodes, again with MNs randomly located at the beginning, to obtain the averaged utility, data rate, energy harvesting, and power consumption and thus validate the parameters obtained with the different algorithms. Specifically, the 100 testing episodes of each experiment produce a mean value, and each averaged metric shown in the following figures is the average of these mean values over the 50 experiments. Note that, since DDPG is trained with normalized variables as shown in (32), these inputs also have to be preprocessed in the testing process.
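The two-stage averaging described above (a mean over the 100 testing episodes of each experiment, then a mean over the 50 experiments) can be written compactly. The array below holds synthetic placeholder values purely to exercise the function; it is not data from the paper.

```python
import numpy as np


def averaged_metric(per_episode: np.ndarray) -> float:
    """per_episode has shape (experiments, test_episodes); average episodes first, then experiments."""
    per_experiment_mean = per_episode.mean(axis=1)
    return float(per_experiment_mean.mean())


rng = np.random.default_rng(0)
synthetic_utilities = rng.normal(size=(50, 100))  # placeholder values, not measured results
result = averaged_metric(synthetic_utilities)
```

Because every experiment contributes the same number of testing episodes, this mean of means coincides with the overall mean; keeping the per-experiment means separate would additionally allow the spread across seeds to be reported.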

5.2. Performance Comparison

Given the environment, we compare the proposed two-layer MRA algorithm aided by game theory with the single-layer MRA algorithms based solely on DQN and DDPG also introduced. To see their performance differences, we conduct two sets of experiments from different aspects; the first focuses on the number of antennas, M, and the second on the power cost . Given that, in Figure 3, Figure 4 and Figure 5 to be shown for the comparison results, the legends of “two-layer”, “single-layer with DDPG”, “single-layer with DQN of PSF1”, and “single-layer with DQN of PSF2” exhibited therein represent the two-layer MRA algorithm, the single-layer DDPG-based MRA algorithm, the single-layer DQN-based MRA algorithm with PSF1, and the single-layer DQN-based MRA algorithm with PSF2 introduced in this work, respectively.

Figure 3

Utilities obtained during training periods upon (a) , and (b) .

Figure 4

Impacts of varying the number of antennas (M) upon (a) utility, (b) data rate (bps), (c) energy harvesting (J), and (d) power consumption (W).

Figure 5

Impacts of the pricing strategy upon (a) utility, (b) data rate (bps), (c) energy harvesting (J), and (d) power consumption (W).

5.2.1. Impacts of Antennas

In the first experiment set, four numbers of transmit antennas, , in BS are examined while fixing = 10, = 1, and = 1, . Due to similar trends to be given, in Figure 3, we exemplify the utilities obtained during the training periods in two experiment instances with the highest and the lowest number of transmit antennas, 32 and 4, respectively. It can be seen easily from the two sub-figures that the utility that resulted from the two-layer MRA algorithm is higher than those from the single-layer counterparts during the training periods, despite the number of antennas, on average. In addition, it can also be observed that, with the continuous action space, DDPG could outperform DQN in general, despite the power state formulations (PSF1 and PSF2) of the latter. Finally, we can see that, with a dB offset representation, PSF1 of DQN would result in a greater number of states on the transmit power than PSF2 equipped with a limited number of quantized levels, which could eventually lead to a better performance on the utility in the long term. Next, we show the performance differences among the averaged metrics on utility, data rate, energy harvesting, and power consumption obtained by the testing process on resulting from these algorithms. As shown in Figure 4, the two-layer MRA algorithm outperforms the single-layer counterparts on all the performance metrics except the energy harvesting, despite the number of antennas, M. In particular, in terms of the averaged utilities resulting from all different M, the two-layer MRA algorithm can achieve up to 2.3 times higher value than the single-layer DQN of the PSF2 algorithm. Despite the utility, as the resulting energy harvesting has relatively smaller values to impact the overall utility, a lower (resp. higher) value of this metric represented in the log scale is still possible and its impact would be compensated by a higher (resp. 
lower) value of power consumption, data rate, or both, which eventually leads to the overall utility to increase as M increases. For example, the highest utilities which are obtained by the two-layer MRA algorithm (as shown in Figure 4a) are mainly contributed by the highest data rates (as shown in Figure 4b) and the lowest power consumption (as shown in Figure 4d), which are all resulting from the two-layer algorithm, despite the energy harvesting of this algorithm to be slightly fluctuated as M increases and lower than that from the single-layer counterparts (as shown in Figure 4c). In addition, as no previous works exactly consider the same system formulations and metrics presented here, it is hard to directly compare this work with the others such as [27,51] which consider only , , or both, for their data transmissions without the capability of energy harvesting. However, even without the capability, we could still consider the DRL algorithm in [51] with only to see the possible performance differences between ours and the conventional approaches. Specifically, with , the comparison results are summarized in Table 3. As shown readily, without the power split for energy harvesting, the DRL algorithm can obtain the highest data rate as an upper bound here, as expected. In comparison, the two-layer algorithm can achieve almost the same data rate while harvesting the energy with the lowest power consumption. Similarly, the single-layer algorithms can enjoy the energy harvesting with similar power consumption, but they may obtain lower data rates when splitting their powers to harvest energy and send data simultaneously.

Table 3

Performance comparison with .

Method	Data Rate	Energy Harvesting	Power Consumption
DRL	11.32910	0	22.51510
two-layer	11.26969	8.164853 × 10^−9	16.40005
single-layer with DDPG	10.58339	1.062941 × 10^−5	21.34165
single-layer with DQN of PSF1	9.31607	5.809001 × 10^−8	22.50100
single-layer with DQN of PSF2	8.46842	3.477011 × 10^−8	23.69319
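
The comparison in Table 3 can be made concrete with a little arithmetic. The following minimal sketch transcribes the table values and expresses each method's data rate and power consumption relative to the DRL baseline. The dictionary layout and the function name `relative_to_baseline` are illustrative assumptions, not part of the original work, and energy harvesting is carried along only as a raw value because the baseline harvests none.

```python
# Values transcribed from Table 3:
# (data rate, energy harvesting, power consumption), units as in the table.
TABLE_3 = {
    "DRL": (11.32910, 0.0, 22.51510),
    "two-layer": (11.26969, 8.164853e-9, 16.40005),
    "single-layer with DDPG": (10.58339, 1.062941e-5, 21.34165),
    "single-layer with DQN of PSF1": (9.31607, 5.809001e-8, 22.50100),
    "single-layer with DQN of PSF2": (8.46842, 3.477011e-8, 23.69319),
}


def relative_to_baseline(table, baseline="DRL"):
    """Return each method's data-rate ratio and power change (%) versus the baseline."""
    base_rate, _, base_power = table[baseline]
    result = {}
    for name, (rate, _harvest, power) in table.items():
        result[name] = {
            "rate_ratio": rate / base_rate,
            "power_change_pct": 100.0 * (power - base_power) / base_power,
        }
    return result


if __name__ == "__main__":
    for name, row in relative_to_baseline(TABLE_3).items():
        print(f"{name:32s} rate ratio {row['rate_ratio']:.3f}  "
              f"power change {row['power_change_pct']:+.1f}%")
```

Under these numbers, the two-layer algorithm keeps roughly 99.5% of the baseline data rate while consuming about 27% less power, which is the trade-off the paragraph above describes.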

5.2.2. Impacts of Pricing Strategy

From the utility function defined by (19), we can see that the unit power cost plays a crucial role in the non-cooperative game model and has a strong impact on the performance of the joint optimization and on the Nash equilibrium. Thus, in the final set of experiments, we propose a simple pricing strategy with which the base station determines its power cost on the basis of social utility maximization, so as to keep the transmit power of link i within the feasible range in which this algorithm performs well. Specifically, given a desired transmit power, the fixed-point formulation in (44) yields an expression from which the desired power cost is obtained as in (50). Accordingly, the two-layer hybrid MRA algorithm is slightly modified to adjust the power cost dynamically instead of taking a fixed power cost as an input. The modification is sketched in Algorithm 4, in which the three modified statements, with their calculations (50), (47), and (48), respectively, are highlighted in bold italic font, and the input no longer includes the power cost. For comparison, the pricing strategy is also applied to Algorithms 1 and 2 by replacing the fixed input power cost with one adjusted dynamically by using (50) after the next state is observed in the corresponding steps of these algorithms.

Algorithm 4

(Input), ⋯;
⋯
for episode = 1 to do
    for time to do
        ⋯
        Observe next state ;
        Obtain , by using (50);
        for each link i do
            for iteration to do
                Update by using (47);
                Update by using (48);
                ⋯
            end for
        end for
        ⋯
    end for
end for

Here, following the same setting = 1 W and = 40 W, we sample the feasible range at as to obtain with (50) while fixing , , and , and run these algorithms to obtain the averaged performance metrics for comparison. The results are summarized in Figure 5, which shows that the two-layer algorithm outperforms the others in terms of the utility. In particular, although it may have lower data rates when (denoting obtained by 1 W), and higher power consumption when (denoting with W), the increasing trend of these metrics still leads to a utility higher than that of the others, and the resulting utility increases as increases. Similarly, because the energy harvesting values are relatively small and have little impact on the system, as noted before, their small fluctuations across the different algorithms do not alter the increasing trend of the utility in the final experiment set either.
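
The idea of deriving a power cost from a desired transmit power can be illustrated with a small sketch. Equations (19), (44), and (50) are not reproduced in this section, so the code below does not implement them. Instead it assumes a textbook per-link utility, ln(1 + gain·p/interference) − cost·p, whose best response has a closed form that can be inverted. The names `best_response`, `price_for`, and `sample_prices`, the channel values, and the reading of 1 W and 40 W as the bounds of the feasible range are all assumptions made for illustration only.

```python
P_MIN, P_MAX = 1.0, 40.0  # watts; assumed bounds of the feasible power range


def best_response(cost, gain, interference):
    """Power maximizing ln(1 + gain*p/interference) - cost*p, clipped to [P_MIN, P_MAX].

    Setting the derivative gain/(interference + gain*p) - cost to zero
    gives p = 1/cost - interference/gain.
    """
    p = 1.0 / cost - interference / gain
    return min(max(p, P_MIN), P_MAX)


def price_for(p_desired, gain, interference):
    """Invert the unclipped best response: the unit cost that makes p_desired optimal."""
    return gain / (interference + gain * p_desired)


def sample_prices(n, gain, interference):
    """Sample n >= 2 desired powers evenly over the feasible range; return (power, cost) pairs."""
    step = (P_MAX - P_MIN) / (n - 1)
    powers = [P_MIN + k * step for k in range(n)]
    return [(p, price_for(p, gain, interference)) for p in powers]


if __name__ == "__main__":
    # Hypothetical channel gain and interference-plus-noise level.
    for power, cost in sample_prices(5, gain=0.5, interference=0.1):
        recovered = best_response(cost, gain=0.5, interference=0.1)
        print(f"desired {power:6.2f} W  cost {cost:.4f}  best response {recovered:6.2f} W")
```

In this stand-in model a higher desired power corresponds to a lower unit cost, so sweeping the desired power over the feasible range produces a decreasing sequence of costs. The actual relationship in the paper is governed by (50) and may differ.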

6. Conclusions

In this work, we sought to maximize a utility that makes an optimal trade-off between data rate and energy harvesting while balancing the cost of power consumption in multi-access wireless networks whose base stations have multiple antennas. Given the capability of selecting beamforming vectors from a finite set, adjusting transmit powers, and deciding power splitting ratios for energy harvesting, wireless networks developed toward the future generation (beyond 5G or B5G) are expected to meet extreme performance requirements that only an optimal solution, possibly found through an exhaustive search, could satisfy. To meet this expectation, we have shown how to design DRL-based approaches operating in a single layer to jointly solve for power control, beamforming selection, and power splitting, and to approach the optimal trade-off among the performance metrics without an exhaustive search of the resulting action space. Furthermore, we have shown how to combine a data-driven DRL-based technique with a model-driven game-theory-based algorithm to form a two-layer iterative approach that resolves the NP-hard MRA problem in wireless networks. Specifically, by benefiting from both data-driven and model-driven methods, the proposed two-layer MRA algorithm outperforms the single-layer counterparts, which rely only on data-driven DRL-based algorithms. Here, the single-layer algorithms represent the conventional DRL methods extended with the energy harvesting capability. As the experiments show, the conventional DRL method and the single-layer algorithms do not provide a good performance trade-off on the metrics considered. That is, the overall utilities reflecting the trade-off are lower for the single-layer algorithms than for the two-layer approach.

In contrast, by combining DRL and game theory, the two-layer approach achieves a better trade-off between data rate and energy harvesting while balancing the cost of power consumption, as reflected in the higher utilities obtained. Specifically, in the simulation experiments, we have exemplified the performance differences of these algorithms in terms of data rate, energy harvesting, and power consumption, verified the feasibility of the three parameters in the utility function, and examined the proposed pricing strategy, which dynamically adjusts the transmit power of the link to keep its value within the feasible range in which the two-layer algorithm performs well under social utility maximization. From the viewpoint of social utility maximization, our pricing strategy has been shown to give the system the leverage to select beamforming vectors, transmit powers, and power split ratios by properly adjusting the power costs. Finally, inspired by the related works on multi-agent DRL, our future work aims to develop further collaborative schemes that reduce the overhead caused by the different optimization methods, even in the non-stationary environment brought about by a multi-agent setting.
