Kok-Lim Alvin Yau, Geong-Sen Poh, Su Fong Chien, Hasan A. A. Al-Rawi.
Abstract
Cognitive radio (CR) enables unlicensed users to exploit underutilized portions of the licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), an artificial intelligence approach, has been applied to enable each unlicensed user to observe its operating environment and carry out optimal actions for performance enhancement in a wide range of CR schemes, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review of how most schemes have been approached using traditional and enhanced RL algorithms through state, action, and reward representations. Examples of enhancements to RL that do not appear in the traditional RL approach are rules and cooperative learning. This paper also reviews the performance enhancements brought about by the RL algorithms, as well as open issues. It aims to establish a foundation in order to spark new research interests in this area. Our discussion is presented in a tutorial manner so that it is comprehensible to readers outside the specialties of RL and CR.
Year: 2014 PMID: 24995352 PMCID: PMC4068054 DOI: 10.1155/2014/209810
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. A simplified RL model.
Figure 2. A SU exploits white spaces across various channels.
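To make the agent-environment loop of Figure 1 concrete, the following is a minimal sketch of tabular Q-learning, the update that most of the reviewed schemes build on. The state/action sets, constants, and the placeholder reward are illustrative assumptions, not values from the paper.

```python
import random

# Illustrative constants: learning rate, discount factor, exploration rate.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
states, actions = range(3), range(2)   # assumed small state/action spaces
Q = {(s, a): 0.0 for s in states for a in actions}

def choose_action(s):
    """Epsilon-greedy selection: explore occasionally, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])

def update(s, a, r, s_next):
    """One-step Q-learning update toward the delayed reward."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# One agent-environment exchange, as in Figure 1 (reward is a placeholder).
s = 0
a = choose_action(s)
r, s_next = 1.0, random.choice(list(states))
update(s, a, r, s_next)
```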
RL models with direct application of the traditional RL approach for various schemes in CR networks.

| References | Purpose | State | Action | Reward/cost |
|---|---|---|---|---|
| (A1) Dynamic channel selection (DCS) | | | | |
| Tang et al. | Each SU (agent) selects the operating channel with the least channel utilization level by PUs in order to improve throughput and to reduce end-to-end delay and the number of channel switches | — | Selecting an available channel for data transmission | Fixed positive (negative) value as reward (punishment) for a successful (unsuccessful) transmission |
| Li | Each SU (agent) selects an operating channel different from those of the other SUs in order to reduce channel contention | — | Selecting an available channel for data transmission | Amount of successful data packet transmission |
| Yao and Feng | The SU base station (agent) selects an available channel and a power level for data transmission in order to improve its SNR and to increase the packet delivery rate | Three-tuple information | Selecting a set of actions | SNR level |
| Li et al. | Each SU link (agent) aims to maximize its individual SNR level. Note that the agent is a SU link, rather than the SU itself as in the other schemes | The availability of a channel for data transmission | Selecting an available channel for data transmission | SNR level, which takes into account the interference from neighboring SUs |
| (A2) Channel sensing | | | | |
| Lo and Akyildiz | Each SU (agent) selects SU neighbor nodes to perform cooperative channel sensing | A set of SU neighbor nodes that may cooperate with the SU agent to perform cooperative channel sensing | Selecting SU neighbor nodes that may cooperate with the SU agent; the selected nodes cooperate by sending their respective local sensing outcomes to the SU agent | The reward (or cost) depends on the reporting delay, which is the time between the SU agent requesting cooperation from a SU neighbor node and the arrival of that node's sensing outcome |
| (A4) Energy efficiency enhancement | | | | |
| Zheng and Li | Each SU (agent) selects a suitable action (transmit, idle, sleep, or sense channel) whenever it does not have any packets to send in order to reduce energy consumption | Four-tuple information | Selecting an action: transmit, idle, sleep, or sense channel | Amount of energy consumption for each operation mode throughout the duration of that mode |
| (A7) Routing | | | | |
| Peng et al. | Each SU (agent) selects a SU neighbor node (next hop) for data transmission to the SU destination node in order to reduce end-to-end delay and energy consumption | A set of SU next hops | Selecting a SU next hop | Ratio of the residual energy of the SU next hop to the energy consumption incurred by sending, receiving, encoding, and decoding data while transmitting to the SU next hop |
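As a worked illustration of the first row above (Tang et al.), here is a hedged sketch of stateless Q-learning for dynamic channel selection with a fixed reward/punishment per transmission outcome; `transmit()` and its channel-utilization model are hypothetical stand-ins, not the authors' simulation.

```python
import random

# One Q-value per channel; no state, since the reward is per-transmission.
NUM_CHANNELS, ALPHA, EPSILON = 5, 0.2, 0.1
REWARD_SUCCESS, PUNISH_FAILURE = 1.0, -1.0     # fixed values, as in the table
Q = [0.0] * NUM_CHANNELS

def transmit(channel):
    """Placeholder PHY: success is more likely on lightly utilized channels."""
    pu_utilization = channel / NUM_CHANNELS
    return random.random() > pu_utilization

for t in range(1000):
    if random.random() < EPSILON:
        ch = random.randrange(NUM_CHANNELS)                   # explore
    else:
        ch = max(range(NUM_CHANNELS), key=Q.__getitem__)      # exploit
    r = REWARD_SUCCESS if transmit(ch) else PUNISH_FAILURE
    Q[ch] += ALPHA * (r - Q[ch])     # stateless update (no next state)
```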
Summary of RL models and algorithms for various schemes in CR networks.

| Model | Purpose | References |
|---|---|---|
| Model with … | This model uses … | Li et al. |
| Model with a set of Q-functions | This model uses a set of distinctive Q-functions | Di Felice et al. |
| Dual Q-functions | This model updates two Q-functions | Xia et al. |
| Partially observable model | This model computes the belief state, which is the probability of the environment operating in a particular state, in a dynamic and uncertain operating environment | Bkassiny et al. |
| Actor-critic model | This model adjusts the delayed reward value using reward corrections in order to expedite the learning process | Vucevic et al. |
| Auction model | This model allows agents to place bids during auctions conducted by a centralized entity so that the winning agents receive rewards | Chen and Qiu |
| Internal self-learning model | This model enables an agent to exchange its virtual actions continuously with rewards generated by a simulated internal environment within the agent itself, in order to expedite the learning process | Bernardo et al. |
| Collaborative model | This model enables an agent to collaborate with its neighbor agents and subsequently make local decisions independently in distributed networks. A local decision is part of an optimal joint action, which comprises the actions taken by all the agents in a network | Lundén et al. |
| Competitive model | This model enables an agent to compete with its neighbor agents and subsequently make local decisions independently in worst-case scenarios, that is, in the presence of competitor agents that attempt to minimize the agent's accumulated rewards | Wang et al. |
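For one of the enhanced models above, the actor-critic entry, a minimal textbook-style sketch follows: the critic's temporal-difference error serves as the reward correction that adjusts the actor's action preferences. All names, constants, and the placeholder reward are illustrative, not the formulation of Vucevic et al.

```python
import math, random

ALPHA_CRITIC, ALPHA_ACTOR, GAMMA = 0.1, 0.05, 0.9
states, actions = range(3), range(2)
V = {s: 0.0 for s in states}                        # critic: state values
H = {(s, a): 0.0 for s in states for a in actions}  # actor: preferences

def policy(s):
    """Softmax over the actor's preferences."""
    exps = {a: math.exp(H[(s, a)]) for a in actions}
    pick = random.random() * sum(exps.values())
    for a, w in exps.items():
        pick -= w
        if pick <= 0:
            return a
    return a

def step(s, a, r, s_next):
    td_error = r + GAMMA * V[s_next] - V[s]   # reward-correction signal
    V[s] += ALPHA_CRITIC * td_error           # critic update
    H[(s, a)] += ALPHA_ACTOR * td_error       # actor update

s = 0
for t in range(100):
    a = policy(s)
    s_next = random.choice(list(states))
    r = 1.0 if a == 0 else 0.0                # placeholder reward
    step(s, a, r, s_next)
    s = s_next
```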
RL model for joint dynamic channel selection and channel sensing [10] (action and reward definitions).
Algorithm 1. RL algorithm for joint DCS and channel sensing [10].
Algorithm 2. RL algorithm for joint DCS and channel sensing [11].
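Algorithms 1 and 2 are specified in the original paper; as rough orientation, a generic sense-then-transmit Q-learning loop for joint DCS and channel sensing might look like the sketch below. This is a plain reconstruction under assumed helpers `sense()` and `transmit()`, not the authors' Algorithm 1 or 2.

```python
import random

# Per-channel Q-values; the SU senses its chosen channel before transmitting.
NUM_CHANNELS, ALPHA, EPSILON = 5, 0.2, 0.1
Q = [0.0] * NUM_CHANNELS

def sense(ch):
    return random.random() > 0.3     # hypothetical: True if channel is idle

def transmit(ch):
    return random.random() > 0.1     # hypothetical: True if ACK received

for t in range(1000):
    ch = (random.randrange(NUM_CHANNELS) if random.random() < EPSILON
          else max(range(NUM_CHANNELS), key=Q.__getitem__))
    if not sense(ch):                # PU active: punish and back off
        r = -1.0
    else:
        r = 1.0 if transmit(ch) else -0.5
    Q[ch] += ALPHA * (r - Q[ch])
```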
RL model for joint dynamic channel selection and channel sensing [11] (state, action, and reward definitions).
RL model for the routing scheme [33] (state, action, and reward definitions).
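In the spirit of the Peng et al. routing row in the first table, the following hedged sketch selects a next hop by Q-learning and uses the ratio of the next hop's residual energy to the energy consumed per transmission as the reward; node names and energy figures are synthetic placeholders.

```python
import random

ALPHA, EPSILON = 0.2, 0.1
next_hops = ["n1", "n2", "n3"]
Q = {n: 0.0 for n in next_hops}
residual = {n: random.uniform(50, 100) for n in next_hops}  # made-up joules

def send_via(n):
    """Hypothetical transmission: returns energy spent at the next hop."""
    cost = random.uniform(1.0, 3.0)
    residual[n] = max(residual[n] - cost, 0.0)
    return cost

for t in range(500):
    n = (random.choice(next_hops) if random.random() < EPSILON
         else max(next_hops, key=Q.get))
    cost = send_via(n)
    r = residual[n] / cost           # energy-aware reward from the table
    Q[n] += ALPHA * (r - Q[n])
```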
RL model for joint DCS and channel sensing [34] (state, action, and reward definitions).
Algorithm 3. RL algorithm for joint dynamic channel selection and channel sensing [34].
RL model for security enhancement [13] (action and reward definitions).
RL model for the channel auction scheme [16] (state, action, and reward definitions).
RL model for the channel auction scheme [36] (reward definition).
RL model for the channel auction scheme [37] (state, action, and reward definitions).
Algorithm 4. RL algorithm for the channel auction scheme [37].
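As an illustration of the auction model summarized earlier (agents place bids at auctions run by a centralized entity, and the winners receive rewards), here is a minimal sketch in which each agent learns a Q-value per bid level under an assumed first-price rule; the bid levels, valuations, and pricing rule are made-up assumptions, not the scheme of [37].

```python
import random

ALPHA, EPSILON = 0.2, 0.1
BIDS = [1, 2, 3, 4]                  # discrete bid levels (assumed)
agents = {a: {b: 0.0 for b in BIDS} for a in ("SU1", "SU2", "SU3")}
valuation = {"SU1": 4.0, "SU2": 3.0, "SU3": 5.0}   # hypothetical valuations

for round_ in range(1000):
    offers = {}
    for a, Q in agents.items():      # each agent bids epsilon-greedily
        offers[a] = (random.choice(BIDS) if random.random() < EPSILON
                     else max(BIDS, key=Q.get))
    winner = max(offers, key=offers.get)    # first-price auction (assumed)
    for a, b in offers.items():
        r = (valuation[a] - b) if a == winner else 0.0
        agents[a][b] += ALPHA * (r - agents[a][b])
```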
RL model for a power control scheme [38] (action and reward definitions).
RL model for the DCS scheme [27] (action and reward definitions).
Figure 3. Internal self-learning model.
Algorithm 5. RL algorithm for RL-DCS [27].
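The internal self-learning model of Figure 3 generates virtual actions and rewards from a simulated internal environment; a Dyna-style sketch of that idea follows, where the internal model is simply a running mean reward per channel. This is an assumption-laden reconstruction, not the authors' Algorithm 5.

```python
import random

NUM_CHANNELS, ALPHA, VIRTUAL_STEPS = 5, 0.2, 10
Q = [0.0] * NUM_CHANNELS
model = [(0.0, 0)] * NUM_CHANNELS    # (mean reward, sample count) per channel

def real_reward(ch):
    """Placeholder radio environment."""
    return random.gauss(1.0 - ch / NUM_CHANNELS, 0.1)

for t in range(200):
    ch = random.randrange(NUM_CHANNELS)      # one real interaction
    r = real_reward(ch)
    Q[ch] += ALPHA * (r - Q[ch])
    mean, n = model[ch]                      # refine the internal model
    model[ch] = (mean + (r - mean) / (n + 1), n + 1)
    for _ in range(VIRTUAL_STEPS):           # many virtual interactions
        v = random.randrange(NUM_CHANNELS)
        mean, n = model[v]
        if n:                                # replay the modeled reward
            Q[v] += ALPHA * (mean - Q[v])
```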
RL model for the channel sensing scheme [20] (state, action, and reward definitions).
Algorithm 6. RL algorithm for the channel sensing scheme [20].
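Echoing the Lo and Akyildiz cooperative-sensing row in the first table, this hedged sketch learns which SU neighbors to ask for sensing reports, treating the reporting delay as a cost; the neighbor names and delay model are synthetic.

```python
import random

ALPHA, EPSILON = 0.2, 0.1
neighbors = ["n1", "n2", "n3", "n4"]
Q = {n: 0.0 for n in neighbors}      # higher Q = lower expected delay

def reporting_delay(n):
    """Hypothetical: time from cooperation request to sensing outcome."""
    base = {"n1": 5, "n2": 2, "n3": 8, "n4": 3}[n]
    return base + random.random()

for t in range(500):
    n = (random.choice(neighbors) if random.random() < EPSILON
         else max(neighbors, key=Q.get))
    cost = reporting_delay(n)
    Q[n] += ALPHA * (-cost - Q[n])   # negative delay as the reward
```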
RL model for the channel hopping scheme [14] (state, action, and reward definitions).
Algorithm 7. RL algorithm for the channel hopping scheme [14].
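Finally, as rough orientation for learned channel hopping (not the paper's Algorithm 7), here is a sketch with per-time-slot channel Q-values, where the SU hops to the channel expected to be idle in the current slot; the slot count and PU idle pattern are assumptions.

```python
import random

NUM_CHANNELS, NUM_SLOTS, ALPHA, EPSILON = 4, 8, 0.2, 0.1
Q = [[0.0] * NUM_CHANNELS for _ in range(NUM_SLOTS)]

def idle(slot, ch):
    """Placeholder PU activity: one channel per slot is busy."""
    return ch != slot % NUM_CHANNELS

for t in range(2000):
    slot = t % NUM_SLOTS
    ch = (random.randrange(NUM_CHANNELS) if random.random() < EPSILON
          else max(range(NUM_CHANNELS), key=Q[slot].__getitem__))
    r = 1.0 if idle(slot, ch) else -1.0
    Q[slot][ch] += ALPHA * (r - Q[slot][ch])
```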
Performance enhancements achieved by the RL-based schemes in CR networks. The performance metrics are (P1) higher throughput, (P2) lower end-to-end delay or link delay, (P3) lower level of interference to PUs, (P4) lower number of sensing channels, (P5) higher overall spectrum utilization, (P6) lower number of channel switches, (P7) lower energy consumption, (P8) lower probability of false alarm, (P9) higher probability of PU detection, (P10) higher number of channels sensed idle, and (P11) higher accumulated rewards.

| Application scheme | References | RL model | Number of enhancements (of P1–P11) |
|---|---|---|---|
| (A1) Dynamic channel selection | Bkassiny et al. | Partially observable | 2 |
| | Tang et al. | Traditional | 3 |
| | Yao and Feng | Traditional | 1 |
| | Chen et al. | Model with … | 2 |
| | Jiang et al. | Model with … | 2 |
| | Liu et al. | Collaborative | 1 |
| | Yau et al. | Collaborative | 2 |
| | Bernardo et al. | Internal self-learning | 2 |
| (A2) Channel sensing | Di Felice et al. | Set of Q-functions | 3 |
| | Li et al. | Model with … | 2 |
| | Lo and Akyildiz | Traditional | 2 |
| | Chowdhury et al. | Collaborative | 4 |
| | Lundén et al. | Collaborative | 1 |
| (A3) Security enhancement | Wang et al. | Competitive | 1 |
| | Vucevic et al. | Actor-critic | 1 |
| (A4) Energy efficiency enhancement | Zheng and Li | Traditional | 1 |
| (A5) Auction mechanism | Jayaweera et al. | Auction | 2 |
| | Fu and van der Schaar | Auction | 1 |
| (A6) Medium access control | Li et al. | Model with … | 1 |
| (A7) Routing | Peng et al. | Traditional | 2 |
| | Xia et al. | Dual Q-functions | 1 |
| (A8) Power control | Xiao et al. | Auction | 1 |