JainShing Liu, Chun-Hung Richard Lin, Yu-Chen Hu, Praveen Kumar Donta.
Abstract
Future wireless networks promise immense increases in data rate and energy efficiency while overcoming the difficulty of charging wireless stations or devices in the Internet of Things (IoT) through simultaneous wireless information and power transfer (SWIPT). For such networks, jointly optimizing beamforming, power control, and energy harvesting to enhance the communication performance from the base stations (BSs) (or access points (APs)) to the mobile nodes (MNs) they serve is a real challenge. In this work, we formulate the joint optimization as a mixed integer nonlinear programming (MINLP) problem, which can also be viewed as a complex multiple resource allocation (MRA) optimization problem subject to different allocation constraints. Using deep reinforcement learning to estimate the future rewards of actions from the information reported by the users served by the networks, we introduce single-layer MRA algorithms based on deep Q-learning (DQN) and deep deterministic policy gradient (DDPG), respectively, as the basis for the downlink wireless transmissions. Moreover, by combining the data-driven DQN technique with a noncooperative game theory model, we propose a two-layer iterative approach to resolve the NP-hard MRA problem, which can further improve the communication performance in terms of data rate, energy harvesting, and power consumption. For the two-layer approach, we also introduce a pricing strategy for BSs or APs to determine their power costs on the basis of social utility maximization to control the transmit power. Finally, in a simulated environment based on realistic wireless networks, our numerical results show that the proposed two-layer MRA algorithm can achieve up to 2.3 times higher utility than its single-layer counterparts, which represent data-driven deep reinforcement learning-based algorithms extended to resolve the problem. The utilities are designed to reflect the trade-off among the performance metrics considered.
Keywords: IoT; beamforming; deep reinforcement learning; energy harvesting; game theory; joint optimization; multi-resource allocation; power control
Year: 2022 PMID: 35336499 PMCID: PMC8955841 DOI: 10.3390/s22062328
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. A system model with respect to the joint beamforming, power allocation, and splitting control for SWIPT-enabled IoT networks. In this model, each mobile node has a power split mechanism to split the received signal into two streams, one sent to the energy harvesting circuit for harvesting energy and the other to the communication circuit for decoding information.
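The power split described in the caption can be illustrated with a minimal sketch. It assumes a linear energy harvester and a splitting ratio `rho` that denotes the fraction routed to the communication circuit; the function name, the direction of `rho`, and the efficiency value are hypothetical choices for illustration, not taken from the paper.

```python
def split_received_power(p_rx_watts, rho, eh_efficiency=0.5):
    """Split received power between information decoding and energy harvesting.

    Assumptions (not from the paper): rho is the fraction sent to the
    communication circuit, the remaining (1 - rho) goes to the harvester,
    and the harvester converts power linearly with efficiency eh_efficiency.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if not 0.0 <= eh_efficiency <= 1.0:
        raise ValueError("eh_efficiency must lie in [0, 1]")
    p_info = rho * p_rx_watts                              # stream for decoding
    p_harvested = eh_efficiency * (1.0 - rho) * p_rx_watts  # stream for harvesting
    return p_info, p_harvested


# Example: 1 mW received, a quarter of it used for decoding.
p_info, p_eh = split_received_power(1e-3, rho=0.25, eh_efficiency=0.5)
```

Raising `rho` favors data rate and lowering it favors harvested energy, which is the trade-off the splitting control has to balance.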
Important symbols for the proposed approaches in this work.
| Name | Description | Name | Description |
|---|---|---|---|
| | sets of transmit powers, splitting ratios, and beamforming vectors, respectively | | transmit power, splitting ratio, and beamforming vector for link |
| | set of locations | | loss function |
| | a finite set of states, | | state at time |
| | a finite set of actions, | | a set of binary variables, where |
| | a finite set of rewards, where | | a finite set of transition probabilities, where |
| | optimal policy at state | | value function for the expected value to be obtained by policy |
| | optimal action at state | | action–value function representing the expected reward starting from state |
| | optimal policy for the (optimal) action–value function | | action–value (Q) function at time |
| | approximated action–value (Q) function with the weight of DNN, | | beamforming codebook |
| | a set of transmit powers for link | | a set of transmit powers for all links in PSF2 |
| | reward for link | | transition at time |
| | learning rate, the (learning) rate specific to actor network, and the (learning) rate specific to critic network | | exploration rate (probability) and its minimum requirement |
| | discount factor | | soft update parameter |
| | exploration decay rate | | replay buffer |
| | batch size | | converge threshold for the fixed point iteration |
| | output of actor network (online and target, respectively) | | output of critic network (online and target, respectively) |
| | parameters for data rate, energy harvesting, and power consumption, respectively, for link | | scale factors for normalization of DDPG at time |
| | deterministic action of DDPG at time | | variables for normalization of DDPG at time |
| | total received power at link | | auxiliary variable at link |
| | desired transmit power at link | | desired power cost at link |
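Several of the quantities listed above (state, action, reward, transition, replay buffer, batch size, discount factor) usually interact in DQN-style training as sketched below. This is an illustration under common assumptions, namely a (state, action, reward, next state) transition and the standard one-step target; the class and function names are hypothetical and do not come from the paper.

```python
import random
from collections import deque, namedtuple

# Assumed transition layout: (state, action, reward, next_state).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_values)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1)
batch = buffer.sample(batch_size=2)
```

In a full DQN loop, the loss would compare the network's Q-value for each sampled transition against `td_target`, evaluated with the target network's outputs.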
Figure 2. Flowchart of the two-layer hybrid MRA training algorithm. In the upper half, the input and the state (corresponding to line 1 and line 3 in Algorithm 3) are shown by the first and second boxes, respectively (from left to right). After selection (lines 10–14), the new state is shown by the fourth box. In the bottom half, the lower-layer iterations (lines 17–27) are shown in a box with the equations involved. The two halves then jointly produce the reward (line 28), shown on the rightmost side, which feeds the remaining boxes at the top denoting the subsequent steps of Algorithm 3.
Important radio environment parameters.
| Parameter | Value |
|---|---|
| Maximum transmit power | 40 W (46 dBm) |
| Minimum transmit power | 1 W (30 dBm) |
| Probability of line of sight | 0.7 |
| Cell radius | 150 m |
| Distance between sites (BSs) | 225 m |
| Antenna gain | 3 dBi |
| Mobile node (MN) antenna gain | 0 dBi |
| Number of multipaths | 4 |
| MN movement speed on average | 2 km/h |
| Number of transmit antennas of BS | |
| Downlink frequency band | 28 GHz |
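The watt and dBm figures in the table are related by the standard conversion dBm = 10 log10(P / 1 mW). A short sketch of that conversion, with hypothetical helper names, confirms the pairing of 1 W with 30 dBm and 40 W with roughly 46 dBm.

```python
import math


def watts_to_dbm(power_watts):
    """Convert power in watts to dBm (decibels relative to 1 mW)."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return 10.0 * math.log10(power_watts * 1000.0)


def dbm_to_watts(power_dbm):
    """Convert power in dBm back to watts."""
    return 10.0 ** (power_dbm / 10.0) / 1000.0


p_min_dbm = watts_to_dbm(1.0)   # minimum transmit power from the table
p_max_dbm = watts_to_dbm(40.0)  # maximum transmit power from the table
```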
Reinforcement learning parameters.
| Parameter | Value |
|---|---|
| Discount factor | 0.995 |
| Learning rate | 0.01 |
| Initial exploration rate | 1.0 |
| Minimum exploration rate | 0.1 |
| Exploration decay rate | 0.9995 |
| Size of state | 10 |
| Size of action | 64 |
| Replay buffer size | 2000 |
| Batch size | 256 |
| Actor learning rate | 0.001 |
| Critic learning rate | 0.002 |
| Replay buffer size | 10000 |
| Exploration decay rate | 0.9995 |
| Batch size | 32 |
| Scale factors | 1 |
| Discount factor | 0.9 |
| Soft update parameter | 0.01 |
| Size of state | 6 |
| Size of action | 9 |
| The same parameters for the single-layer DQN |
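To give a feel for two of the listed values, the sketch below assumes that the exploration rate decays multiplicatively once per training step and that the soft update takes the usual DDPG form, target = tau * online + (1 - tau) * target. Both are common conventions and are assumptions here; the paper's exact schedule and update rule are not stated in this section, and the function names are hypothetical.

```python
def steps_to_min_epsilon(eps_init, decay, eps_min):
    """Count multiplicative decay steps until epsilon reaches eps_min."""
    steps = 0
    eps = eps_init
    while eps > eps_min:
        eps *= decay  # assumed schedule: eps <- eps * decay at every step
        steps += 1
    return steps


def soft_update(target_weights, online_weights, tau):
    """Blend online weights into target weights (assumed Polyak averaging)."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


# Values from the table: initial rate 1.0, decay 0.9995, minimum 0.1.
n_steps = steps_to_min_epsilon(1.0, 0.9995, 0.1)

# Soft update parameter 0.01 applied to toy weight lists.
new_target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.01)
```

Under these assumptions, exploration would reach its floor after roughly 4600 steps, and each soft update would move the target network 1% of the way toward the online network.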
Figure 3. Utilities obtained during training periods upon (a) , and (b) .
Figure 4. Impacts of varying the number of antennas (M) upon (a) utility, (b) data rate (bps), (c) energy harvesting (J), and (d) power consumption (W).
Figure 5. Impacts of the pricing strategy upon (a) utility, (b) data rate (bps), (c) energy harvesting (J), and (d) power consumption (W).
Performance comparison with .
| Method | Data Rate | Energy Harvesting | Power Consumption |
|---|---|---|---|
| DRL | 11.32910 | 0 | 22.51510 |
| two-layer | 11.26969 | | 16.40005 |
| single-layer with DDPG | 10.58339 | | 21.34165 |
| single-layer with DQN of PSF1 | 9.31607 | | 22.50100 |
| single-layer with DQN of PSF2 | 8.46842 | | 23.69319 |