Literature DB >> 35716038

Deep reinforcement learning for personalized treatment recommendation.

Mingyang Liu¹, Xiaotong Shen¹, Wei Pan².

Abstract

In precision medicine, the ultimate goal is to recommend the most effective treatment to an individual patient based on patient-specific molecular and clinical profiles, possibly high-dimensional. To advance cancer treatment, large-scale screenings of cancer cell lines against chemical compounds have been performed to help better understand the relationship between genomic features and drug response; existing machine learning approaches use exclusively supervised learning, including penalized regression and recommender systems. However, it would be more efficient to apply reinforcement learning to sequentially learn as data accrue, including selecting the most promising therapy for a patient given individual molecular and clinical features and then collecting and learning from the corresponding data. In this article, we propose a novel personalized ranking system called Proximal Policy Optimization Ranking (PPORank), which ranks the drugs based on their predicted effects per cell line (or patient) in the framework of deep reinforcement learning (DRL). Modeled as a Markov decision process, the proposed method learns to recommend the most suitable drugs sequentially and continuously over time. As a proof-of-concept, we conduct experiments on two large-scale cancer cell line data sets in addition to simulated data. The results demonstrate that the proposed DRL-based PPORank outperforms the state-of-the-art competitors based on supervised learning. Taken together, we conclude that novel methods in the framework of DRL have great potential for precision medicine and should be further studied.

Entities: Chemical

Keywords: Proximal Policy Optimization; actor-critic methods; deep learning; precision medicine; recommender systems

Mesh：

Year: 2022 PMID： 35716038 PMCID： PMC9427729 DOI： 10.1002/sim.9491

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.497

INTRODUCTION

One main challenge in precision medicine is to understand common disease at the molecular level to recommend individualized therapies to patients, permitting high efficacy for different and possibly unknown disease subtypes. Another challenge in clinical practice is how to adapt treatment assignments to possible changes in patients' health states and preferences, including previous treatment history. Optimal treatment is determined by maximizing an evaluation signal indicating a long‐term patient outcome, accounting for possibly delayed treatment effects and influences on future treatment choices. An important resource for advancing precision medicine is through drug discovery via computational prediction of drug sensitivity for various cell lines based on their high‐dimensional genomic features as wells as the chemical structures of drugs. For such a purpose, large‐scale screenings of cancer cell lines have been performed with detailed molecular profiles. For example, the Cancer Cell‐Line Encyclopedia (CCLE) provides large‐scale molecular profiles including genomic (eg, genetic mutations), transcriptomic (ie, gene expression), and epigenomic (eg, DNA methylation) data. Similarly, the Genomics of Drug Sensitivity in Cancer (GDSC) project has been carried out for investigating drug responses of numerous cancer cell lines, which are measured by each drug's half‐maximal inhibitory concentration (); the smaller , the more effective (potentially) the drug. GDSC also provides multi‐omics data, including whole‐exome sequencing (WES), copy number variation (CNV), and DNA methylation (MET) data. Whereas it is challenging to examine the patient's response to different clinical options, the Cancer Genome Atlas (TCGA) collected follow‐up records of around 10k pan‐cancer patients for survival/recurrence in response to given treatments. We will use CCLE and GDSC to develop and (internally) evaluate our proposed and other existing learning methods before using a TCGA dataset for external validation. To advance drug discovery, various approaches have been proposed for predicting drug responses using genomic features. , , , , , For example, some References 3, 4, 6, 7, 9 have proposed linear regression models to predict drug sensitivity (as measured by ) or the area under the fitted dose response curve (AUC). Some non‐linear models such as random forests, kernel regression, and neural networks have also appeared. , , These models are drug‐specific as being trained for each drug separately and independently. A drug‐specific model is limited by the small number of cell lines available with a given drug. To increase the sample size, others have proposed a Bayesian multitask multiple kernel learning (BMTMKL) approach, achieving the best performance in a DREAM challenge for drug response prediction. This method has shown the importance of information sharing among the drugs. Besides, there is a matrix factorization approach based on cell line and drug similarities (SRMF), which, however, could not be used for previously unseen cell lines. In a similar framework of matrix factorization, a recommender system (CaDRReS) to map the drugs and cell lines into a latent “pharmacogenomic” space before predicting drug responses has been proposed. However, most of these methods may not be most clinically relevant: they directly predict drug responses, differing from directly ranking drugs. Accordingly, kernelized rank learning (KRL) has been proposed to use a ranking metric called normalized discounted cumulative gain () to recommend the top most sensitive drugs. Similarly, CaDRReS is another ranking procedure. However, both KRL and CaDRReS optimize either an approximate evaluation metric or the mean square error (MSE) without directly optimizing the ranking evaluation metric NDCG. Recently, various deep learning (DL) models have emerged, showing that simple neural networks such as multi‐layer perceptrons (MLPs) could be effective competitors to lasso or elastic net. CDRscan aims to explore various architectures of MLPs to predict responses from cell line genomic features and drug fingerprints. However, there are potential limitations with these methods. Importantly, all the methods are supervised, including regression. In practice, it would be more efficient for a learning system to learn sequentially and continuously as the data accrue, rather than collecting all the data first before starting learning. Accordingly, we propose a deep reinforcement learning (DRL) based approach, called PPORank. PPORank treats ranking as a sequential decision‐making process. We adopt a ranking evaluation metric, called NDCG, as the long‐term reward, and take both the long‐term effects of the ranking policy and the expected value of each ranking position into account. Thus we utilize more information from a ranking. One novel component of PPORank is our adaptation of a model‐free “actor‐critic” scheme. We directly optimize the non‐differentiable evaluation measure NDCG through policy gradient and achieve both stability and sample efficiency via the PPO optimization method. We design a state representation module using a deep neural network from a policy network, which is extensible to integrating heterogeneous data sources and is capable to capture complex non‐linear relationships between cell lines/patients and drugs. We have carried out some experiments on cancer drug screening data, which are high‐dimensional in the state and action spaces with only limited and sparse data. The results demonstrate that the proposed method outperforms the state‐of‐the‐art competitors. Our proposed PPORank and its DRL framework are related to, but differ from, the existing literature on estimating optimal dynamic treatment regimes (DTRs), , , in which often only a small number of treatments or time points (called finite time horizons) are considered, , , , , , , though a few exceptions with infinite time horizons are emerging. The methods in the first category are mainly based on Q‐learning and A‐learning, often based on parametric/linear regression models for the value functions. When there is only one time point, it reduces to individualized treatment selection, which is done in the framework of supervised learning, such as the outcome weighted learning, ‐learning, and residual weighted learning. Two representatives for infinite time horizons have appeared recently and are based on some postulated linear models for the Q‐value function and the V‐value function respectively; see Reference 1 for a nice and up‐to‐date review. Our proposed method in the framework of DRL not only is more flexible by allowing both nonparametric modeling with DL and infinite time horizons, but also takes a state‐of‐the‐art actor‐critic approach based on the PPO algorithm. In addition, many other methods require that the data have been collected (from either randomized experiments or observational studies) at the time of their application, while RL can learn to both recommend treatments and thus collect data simultaneously, which might be more efficient and timely in applications such as mobile health (mHealth) for continuously learning and thus recommending better and better behavioral interventions, though this will not be further explored here.

Related work

In recent years, RL has been adapted to develop powerful solutions in a variety of healthcare domains characterized as sequential procedures. , , , , , For example, Dynamic Treatment Regimes (DTR) are tailored for generating new scientific hypotheses and developing optimal treatments across or within groups of patients by utilizing data generated, for instance, from the Sequential Multiple Assignment Randomized Trials (SMART); the design of DTR can be viewed as a sequential decision making problem that fits into the RL framework. The domains of applying RL can be classified into two main categories: chronic diseases and critical care. Due to the long term treatment of a chronic disease, RL has been utilized to automate the discovery and generation of optimal DTRs in a variety of chronic diseases, including cancer, diabetes, anemia, HIV, and mental illnesses. For example, References 21, 22, 41 applied RL to determine optimal chemotherapy dosages for cancer treatment, and References 42, 43 adopted RL in decision making for optimal scheduling of radiation therapy for cancer. In the critical care domain, RL has been applied, for example, to treatment selection for sepsis, , medication recommendations in critical care units, and management of invasive mechanical ventilation in intensive care units. Nevertheless, how to tailor the existing RL methods to deal with limited biomedical data and to take into consideration of risk, safety, robustness, and efficiency in healthcare systems remain challenging and to be addressed when applying RL. The rest of the article is structured as follows. In Section 2, we introduce the main components of the proposed DRL‐based method. Section 3 presents the experimental details and the results. We end in Section 4 with some main conclusions and a short discussion for future work.

METHODS

This section formulates personalized ranking via reinforcement learning. For reinforcement learning, we propose an actor‐critic network based on a hidden state representation as input to learning to rank. In particular, we leverage an evaluation signal to reduce variance for more efficient sampling.

Personalized ranking

Given cell lines (or patients), we observe , with each row and column representing one cell line and one feature, for example, one gene's expression, respectively; cell line‐specific responses, denoted as , are their responses to (maximum) distinct drugs. Each of the cell lines is paired with its drug response vector , where is the number of candidate drugs for cell line . For each cell line , together with , where is the corresponding cell line feature vector, is a set of the drugs tested on cell line; each drug with denotes a drug feature vector, either a vector representing the drug structure or only an indicator for the drug's index when the drug feature vector is unused/unknown. In the ranking process, we use as the remaining candidate drugs to be ranked at time step for cell line , so is a subset of . The notations are summarized in Table A1 in the Appendix.

TABLE A1

Mathematical notations

Meaning	Notation
Cell‐line	c, or c(i) ^a
Number of training cell lines	N
Maximum number of drugs for each cell line	M
Feature vector of cell line c(i)	X(i) or X
Drug u is more sensitive than drug v	du≻dv
Number of drugs associated with cell line c(i)	Mi
Feature vector or binary features of a drug j associated with cell line c(i)	dj(i) or dj
Embedding vector of a drug j associated with cell line c(i)	dembed,j(i) or dembed,j
Feature vectors or binary indexes of drugs associated with cell line c(i)	D(i)={dj(i)}j=1Mi or D
Remaining drugs associated with cell line c(i) at time step t	Dt(i) or Dt
Ground truth response score of a drug j associated with cell line c(i)	yj(i) or yj
Ground truth permutation (ranking list) for drugs associate with cell line c	π∗
Original drug index of the jth element in permutation π	π−1(j)
Ranking position of drug j in permutation π	π(j)

The superscript indicates the associate cell line , if it is missing, it can be applied to general case that is associated with cell line .

We formulate the ranking problem as learning a scoring function that scores each candidate drug at each ranking position . The scores usually represent some relevance, inducing an ordering of the drugs by sorting them in descending order of relevance to form a ranked list at each position. When moving to the next position, according to the dynamic change of the candidate drugs, the context environment, and the corresponding feature vectors, the scores for the remaining drugs may change under the scoring function. As such, the goal of the ranking is to build a parameterized ranking function that minimizes the empirical loss across all the ranking positions : where is a specified local loss function. In ranking problems, is usually derived from the utility of interest such as the discounted cumulative gain (DCG) metric, and its normalized form called normalized discounted cumulative gain (NDCG). Formally, given a ranking list of the drugs for a cell line, say , the DCG at position is defined: where denotes the drug ranked at position by the ranking list , is the response score of a drug, usually set as with being the response score learned from the ranking function at position in the ranking list , and is a position discount factor, usually set as . In the end, represents the cumulative gain of information from position one to position with discounts on the positions. gives a higher weight to a drug with a higher sensitivity score, while the discount factor penalizes the incorrect ordering of these rankings; the gains can be maximized when the most effective drugs are the top recommended ones. As an evaluation criterion, the NDCG is more widely used. It is defined as , where is a normalizing factor calculated at the perfect ranking list based on the true response scores; ranges from 0 to 1, with 1 indicating a perfectly predicted ranking. The final evaluation metric for top k positions is When it comes to rank at all the positions, the criterion is where is the number of all the candidates (drugs) to be ranked (for a cell line), and is the normalizing factor. Given samples , (minus) the loss function to be maximized for a given ranking function across all ranking positions is where is the number of candidate drugs for cell line . Note that in general contains missing values since a cell line may be given only a (proper) subset of all candidate drugs; that is, may differ across the cell lines.

Reinforcement learning for ranking

We formulate a personalized ranking process as a Markov decision process (MDP). For each cell line, the ranking process on the candidate drugs is considered as a sequence of discrete‐time steps; each time step corresponds to a ranking position, for example, ranking position is considered as the same as the time step . As the candidate drug number varies across the cell lines, the length for each ranking process is determined by the number of available drugs. At each time step, the agent receives the environment's state and chooses an action based on the state. The state is constructed by the remaining candidate drugs, the cell line, and a list of the previously ranked drugs, including the candidate drug and cell line features, drug‐drug, and cell line‐drug interactions. The action of drug selection only depends on a policy, which is a function mapping from the current state to a probability distribution of selecting possible drugs. After one time step, the environment transits to a new state according to the current state with the cell line evaluation signal and the chosen action. The ranking problem modeled as a MDP in the framework of RL can be represented by , composed of states, actions, transition, reward, and policy as defined below: is a set of states, each describing the environment. In the ranking problem, the agent should be aware of the current ranking position as well as the candidate drug set that the agent can choose from. For simplicity, we only describe the notation under one cell line; if it comes to batch computation, we just need to include multiple cell lines. Thus, at time step , we have observation , and the state is defined as where stands for a state representation module, is the cell line, is the remaining drugs to be ranked, and is a hidden state learned from previous positive evaluation signals; for example, if the cell line provides a positive response (ie, the immediate reward is positive), then the next hidden state is updated to ; otherwise, . The initial state can be written as , where is the set of all the candidate drugs, and the hidden state is initialized to be . is a discrete set of actions that the agent can take. The set of possible actions depends on the state , denoted as . At time step , is the selection of a drug for the ranking position , where indexes the drug selected by . is a function , mapping a state to a new state according to action . When choosing a drug by action , it means removing the drug from the candidate set, then . is the reward function . As our target in ranking is to maximize NDCG, the reward should be able to reflect the contribution to the evaluation metric of selecting the drug. Thus we define the reward received corresponding to choosing action as the promotion of the DCG: where is the drug sensitivity score for the selected drug at position (in our two cell line data examples). Here we see the immediate reward at each time step is just the promotion of DCG at each ranking position , which enables the learning process to utilize DCG values at each time step. Furthermore, for each cell line, the cumulative rewards along a trajectory will be , which we aim to maximize. describes the behaviors of the agent, which is a probabilistic distribution over the possible actions for a given state. Specially, we use the softmax function to parametrize the policy, outputting the probability of selecting each of the drugs for the current ranking position: where is the policy parameters in the actor network, is the feature vector decided by the current state and action, and denotes the transpose. More details can be found in Section 2.4.6. , , defines the discount factor when we measure the present value for future reward. In particular, when , the agent only considers the immediate reward; alternatively, when , all future rewards can be counted fully into that of the current action. It is typical to have . Without loss of generality, a list‐wise ranking procedure is a ‐time step trajectory as In this way, we define one sample as a tuple of . Although we may only observe one trajectory for one cell line, we can simulate multiple and different trajectories for each cell line.

Some preliminaries for DRL

The goal of RL is to maximize the expected long‐term return from an initial state in terms of : where is the distribution over the trajectory, which only relates with the parameters in policy. is the long‐term return of the simulated ranking list, which is defined as the discounted sum of the rewards: where is the immediate reward the agent receives at time step after taking action , and is the discount factor. For our task, the maximum time step for each trajectory is the number of the total candidate drugs, . Note that when the discount factor , is identical to NDCG. Differentiating with respect to , the gradient becomes which can be approximated by sampling trajectory : where is the immediate reward of the th trajectory at time step and can be calculated from (6). To reduce the variance of the gradients, at time step , we only consider the cumulative rewards received after , as the action taken at time cannot affect the reward at for all . Second we subtract a baseline from the total rewards. A commonly used baseline is the value function, , which is the expected return starting in and following policy thereafter. Here we use the superscript denotes that the agent will follow policy . The state‐action value , which is the expected return of taking action at state following policy thereafter. And by subtracting this baseline from , we are essentially calculating the advantage, . Instead of estimating both the state‐action value function and value function , we approximate based on Then we build the critic network (also called the value network) to estimate the non‐linear approximation of the value function . In our DNN, limited by the sample size, the critic network will share the parameters with the policy network, so we will also only use for the state value output from the critic network. This gives the actor‐critic network's structure, which, with the critic, improves over the policy gradient. Denote by . Then, the gradient of the policy parameters becomes: and serves as an unbiased estimate of the advantage function. The actor‐critic network combines the advantages of value‐based learning and policy gradient to achieve accelerated and stable learning. It consists of two components: an actor to optimize the policy in the direction of gradient using (10), and a critic to estimate a state‐value function with the state output from the actor network. Finally, we obtain the policy gradient in (12).

Actor network

We describe the architecture of the actor network, inspired by the deep cross network. A main motivation is for flexible modeling of possibly complex relationships between high‐dimensional cell line and drug features. Previous studies , have shown the importance of such flexible modeling. This network starts with an embedding layer and a stacking layer, followed by a cross network and a deep network. In the concatenation layer, we combine the output from the two networks and that from a gated recurrent unit (GRU) layer. The concatenated vector from the combination layer is defined as the state in (5). Before building the embedding layer, we have pre‐trained a matrix factorization (MF) model to initialize the MF layer. The structure is displayed in Figure 1. Next, we describe the actor network under the setting for the cancer drug response ranking problem.

FIGURE 1

The actor network

Matrix factorization (MF) layer

Due to the heterogeneous data sources, if we model the interactions between cell lines and drugs, we may first project them into the spaces of the same dimension. The easiest way of incorporating drug features is to consider the input as one‐hot encoding. Cell‐line features can be projected into the same dense embedding layer with drug embedding vector for further computation. We have tried initializing the weight matrix for projecting cell line features randomly but noted that it converges slowly. Inspired by Zhang et al and Guo et al, we decide to use an MF layer for initialization. The input data include dense features such as the cell line gene expression and binary encoding for drugs. The drug response matrix can be decomposed into a bias matrix and some latent factors for cell lines and drugs: where , is the bias term for drug ; , and are the weight matrices projecting the cell line and drug features into the corresponding latent spaces with dimension respectively, and are both trainable in a neural network, which is the identity matrix. Thus denotes the cell lines preference matrix, denotes the drugs' preference matrix. Therefore, the output off the MF layer is the projected dense features for each cell line and the embedding matrix for the each drug.

Embedding and stacking layer

The input of this layer comes from the output of the MF layer. Then we stack the embedding vectors for each drug along with the dense features of cell lines, together with the cosine similarity score between the cell line's and drugs' latent vectors, into one vector: where is projected dense features, is from (13) and cos is the cosine similarity measure between the cell line and the drug. Then we have , , and is the th element of . Compared to the Word2Vec embedding, we explicitly use the response matrix and the cell line features, which not only generates an embedding for the drug and cell line in the same latent space but also compresses the cell line features. Importantly, we use the stacking of all the drugs' embedding vector, which is used as the initial state . At time step , if drug is removed from the candidate drugs, we will use a zero vector to replace . It is noted that, although we only use the one‐hot encoding vector for a drug as the sparse vector, sometimes it may not be sufficient, and we can use other side information for drugs. For example, some drug similarity scores from PubChem may be used as additional dense features and be projected onto the lower‐dimensional dense layer.

Cross network

For MF‐based ranking methods, it is possible to capture the linear relationship between the latent vectors, but not higher‐order cell line‐drug interactions based on the MF latent factor representation. To capture a higher‐order interactions, we add additional layers in a network to represent interactions as cross‐term features. We denoted the cross term (monomial) as , with degree . To capture higher order interactions, we have the cross network composed of cross layers, where the input is from the stacking and embedding layer in (14), and with each layer having the following formula: where is the output from (14), are the column vectors as the outputs from the th and th cross layers, respectively; are the weight and bias parameters in the th layer. By Wang et al, the highest degree of cross terms grows with the layer depth. It is noted that the cross network is a generalization of MF and so‐called factorization machines by allowing cross terms of any degrees. In addition, by Wang et al the number of parameters in the cross network is only , where is the number of cross layers. And the output of the cross network is . Accordingly, when we integrate multi‐omic and other high‐dimensional features, the number of parameters in a network model only grows linearly in the number of layers.

Deep network

The deep network is a fully connected feed‐forward neural network, which is good at generalization. Its input is the same as the cross network, which is also from the stacking and embedding layer in (14), and each layer is represented as where and are the th and th hidden layers respectively; and are the weight matrix and bias vector (as the parameters) for the th layer; and is the ReLU function. Suppose we have layers for the deep network, then the output is with the initial input . In this fashion, the cross and deep components share the same input, leading to more efficient training, compare.

GRU layer

In clinical settings, patients' preference or conditions (not captured by given covariates) may change over time, thus we propose using evaluation signals from the ranking process to learn the dynamic hidden preference. The positive drugs represent key information about patients' preferences, that is, which drugs the patients prefer (eg, without minimal side effects). Thus we propose using a GRU layer to memorize the previous evaluation signals, allowing it to remember values over multiple iterations in the sequence. So if the evaluation signal is positive, we can update the GRU layer as the following: where is the update vector, is the reset gate vector, is the concatenated output from the deep network and cross network at time step t, is the hidden state at time step t, is the Hadamard product and the matrices , and vectors 's are the unknown weights to be optimized, is the ReLU activation function. The initial hidden state is set as . Then if the current evaluation signal is positive, the output of the hidden state is incorporated into the next combination output layer. We tried to use a pooling layer to represent the dynamic evaluation signal, which gives equal weight to the cell line evaluation signals in the previous time steps; but with the large action space, we noticed that the top 1 drug would dominate the historical evaluation signals, and if the top 1 recommendation was far away from the truth, the actor network would have no chance to adjust the learning process.

Concatenation layer

The concatenation layer concatenates the outputs from the two networks and the GRU layer. We first output the concatenated vector as a state representation to input the critic network, and second apply a standard softmax activation function, yielding the output for policy: where is the concatenated vector, are the outputs from the cross network and deep network, and is the evaluation signal output from the GRU layer. is the drug index, is the weight vector for the output layer. During the training phase of PPORank (as many RL algorithms), the action is often selected according to the current policy estimate; however, to explore (and learn) other potentially more rewarding new actions, it selects an action randomly from the estimated policy distribution. On the other hand, in the testing phase, we just rank the drugs by their predicted scores and select the best one.

Critic network

The critic network is designed to reduce the variance of the gradients in the policy gradient method, which leverages a deep neural network to approximate the true state value function . The input of the critic network is just the state generated from the actor network in (17), and its output is the state value function The critic network is jointly learned with the actor network, with the shared state representation as the input from the output of the actor network. We have tried a separate network design, resulting in the divergence of the policy with too many parameters to learn; it is a good way to stabilize training with the shared parameters. The critic is then used to evaluate the policy generated by the actor network and guides the policy gradient. The structure can be found in Figure 2.

FIGURE 2

The actor and critic networks

Learning to rank with Proximal Policy Optimization

The actor‐critic network combines the advantages of value‐based methods and policy gradient to achieve accelerated and stable learning. Yet it is technical and thus is relegated to the Supplementary.

Top ranking

In drug recommendation, often only the top one or few matters (as opposed to predicting the exact value of a patient's responses to all drugs). Thus we may be more interested in ranking the top drugs, instead of ranking all drugs, though the latter may provide more information about the relationships between the drugs and cell lines. The details of our proposed methods are in the Supplementary.

EXPERIMENTS

We first conducted our experiments on the cancer screening data sets of GDSC and CCLE. Then we demonstrated the effectiveness of PPORank using simulation data.

Cancer drug screening data

Datasets

The drug‐screening data for cancer cell lines are obtained from Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). The GDSC database provides a large‐scale drug screening dataset with each value for a drug‐cancer cell line pair. The CCLE database provides genomic, transcriptomic, and epigenomic profiles for more than a thousand cancer cell lines. The GDSC has released a panel of 1001 cancer cell lines (covering 31 cancer types) as well as 265 pharmacological compounds/drugs. The cell lines were mostly characterized using gene expression, whole‐exome sequencing, copy number variation, and DNA methylation. As pointed by Coronato et al, gene expression data is most informative. Then, we will put more emphasis on our analysis based on gene expression data. To this end, we encode each of the 983 profiled cell lines with the RMA‐normalized (robust multi‐array average) basal expression of 17 737 genes. To extract cell line features from the gene expression profiles, we adopt the method of Reference 13, where we normalize the baseline gene expression values for each gene by computing the fold‐changes compared to the median value across cell lines. According to Reference 12, 1856 essential genes are selected for dimension reduction for each cell line. Then we calculate Pearson's correlation for every pair of cell lines using the expression fold‐changes of these essential genes, among which we select 983 cell line features for the full GDSC dataset. For the CCLE dataset, 24 drugs and a panel of 1036 human cancer cell lines are available, and we select 491 cancer cell lines for which both drug sensitivity measures and gene expression profile data are available. Both GDSC and CCLE report drug sensitivity as the logarithm of half maximal inhibitory concentration of a drug on a cell line, , which denotes the concentration of the drug/compound required to inhibit the cell growth at 50; the smaller , the more sensitive the cell line to the drug. But the measured values are not comparable across different compounds. calculated drug‐specific sensitivity (binarization) thresholds for each of the 265 tested drugs. These thresholds were determined using a heuristic outlier detection procedure following a previous observation that the majority of cell lines are typically resistant to a given drug. Thus, we use these drug‐specific sensitivity thresholds to normalize the data set. Furthermore, as the ranking metric NDCG is only bounded with non‐negative scores, we normalize the sensitivity scores as , where is the sensitivity threshold of the given drug and is the maximum normalized score across all drugs. After deleting toxic drugs, we have 223 drugs for GDSC and 19 drugs for CCLE. For drug features, chemical structural information for each drug is available in PubChem. We exclude drug samples without PubChem ID in the GDSC database, and we also exclude 15 drugs with the same PubChem IDs, and finally 223 drugs remained. But at the end, for fair comparisons with other baseline methods, drug features are not used. We employ The Cancer Genome Atlas (TCGA) breast cancer (BRCA) sub‐cohort data to further evaluate whether/how the cell line‐based GDSC results of PPORank generalize to cancer patients. We download the Firehose data run from. We use the immunohistochemistry annotations to identify HER2 over‐expressed () patients () and triple‐negative breast cancers (TNBCs) patients (). Furthermore, we define the BRCA1/2‐mutant (mBRCA) patients as the 37 patients carrying germline loss‐of‐function BRCA1/2 mutations. Finally, we use the pipeline proposed by Geeleher et al to harmonize the Level 3 Illumina HiSeq RNA‐seq v2 data (1080 patients) and the GDSC gene expression data.

Baseline methods and evaluations

As described in Section 1, our method (PPORank) is motivated by more efficient healthcare applications: (i) in clinical settings, it is more important to recommend the most sensitive drugs in ranked order, instead of numerically predicting the sensitivity score of a given drug; (ii) a treatment regime is usually characterized after a prolonged and sequential procedure; (iii) currently, many cancer‐related resources covering genotypes, phenotypes, and drug information are available, calling for integrating these multiple data sources for model building; (iv) often only one or few top recommendations are needed (as opposed to all the drugs). To address the above issues, we will conduct experiments from the following aspects: for completeness, we evaluate our method on the full GDSC and CCLE data sets; for a fair comparison, we use the same data sources including only cell line features and drug response matrix and evaluate with NDCG to show long‐term effects, addressing the above issues (i) and (ii); to integrate multiple data resources, we also include drug similarity data into the training dataset along with other molecular data: whole‐exome sequencing (WES), copy number variation (CNV), and DNA methylation (MET) data, without any changes to the model structure, addressing issue (iii); to address the top ranking problem of issue (iv), we also evaluate the methods with the ranking metrics for different values of . We compare the methods with five‐fold cross‐validation, where each fold of the data is used once for testing while the other four folds are used for training. For each fold, we test completely on unseen cell‐lines. We compare our method, PPORank, against several state‐of‐the‐art and representative ones: a method based on the elastic net regression model ElasticNet (EN), , kernel ridge regression (KRR), similarity‐regularized matrix factorization (SRMF), Cancer Drug Response prediction using a Recommender System (CaDRRes), and Kernelized Ranking Learning(KRL). For EN, the model is trained for each drug as described previously , using the Elastic Net library in scikit‐learn ( ratio =0.5). KRR has been recommended as one of the best‐performing algorithms in a systematic survey and was also implemented using the machine learning library‐ scikit‐learn in Python. For SRMF, it goes beyond matrix factorization by also considering drug similarity obtained from chemical structural data and cell line similarity based on their gene expression profiles as regularization terms to avoid over‐fitting; but as described in the article, drug similarity did not improve the prediction performance, so we also ignore drug similarity here. Furthermore, SRMF cannot make predictions on unseen cell lines or patients. For CaDRReS, we set the latent space dimension to be 10. For KRL, we adopt the radial basis function (RBF) kernel, which performs better than the linear kernel, and instead of using cell line similarity features, according to the original paper, we use the RMA‐normalized (robust multi‐array average) basal expression of 17 737 genes as the input cell line features.

Hyper‐parameter tunning

For the baseline methods, the tunning parameters are tuned over grids with five‐fold cross‐validation. For PPORank, we have the following hyper‐parameters, some of which are fixed while others are tuned through the same cross‐validation procedure as for the baseline methods. The cross‐part of the deep‐ and cross network has two cross layers; the deep part of the Deep Cross network has three layers with 128, 64, and 32 nodes respectively. For the Deep Cross network, we tried some deeper or shallower structures but found that with more layers, the performance did not improve much while a higher dropout rate was needed; to reduce the computational complexity, we ended with the above network structure for our agent policy. For the PPO part, there are two main issues: how many experience episodes () should the agent gather before updating the policy and how to update an old policy to a new policy. For each trajectory, the cell lines have at most 24 and 265 drugs for the two datasets respectively, so is set to be 24 for CCLE and 265 for GDSC; for a fair comparison, we do not use the drug fingerprint features here; the numbers of the paralleled actors are 8 for CCLE and 16 for GDSC, mini‐batch size is 16 for both, PPO epoch number for the surrogate loss (in the Supplementary equation (5)) is 8 for both. For other hyper‐parameters in the clipping loss (in the Supplementary equation (1), which controls the new policy from moving into collapse), the clipping range is tunned among , and the discounting factors used in the advantage function (in the Supplementary equation (3)) are tuned in grids of , but when evaluated on the whole trajectory's ranking, they are all fixed at the value of 0.95; we only need to tune their values when the evaluation metrics are . The coefficients with the value function and entropy(in the Supplementary equation (5)) are fixed at and .

Prediction with the full dataset

We use the full GDSC and CCLE data sets with only cell line features to compare the performance of different methods, then we report the average performance based on NDCG through five‐fold cross‐validation. The results are shown in Figure 3. For the CCLE dataset, we can see that our method PPORank performs marginally better than other methods: it gives a mean NDCG of 0.7611, slightly higher than 0.7432 and 0.7468 from linear methods CaDRRes and EN. Note that the total number of the drugs is only 24, so even with unseen cell lines, the candidate drugs have appeared frequently enough in the training set, leading to good performance by both regression methods and static ranking methods. However, for the GDSC dataset, as shown in Figure 3B, the improvement of our PPORank over other methods is much higher. PPORank shows a consistent improvement over all the other methods with the smallest SD (0.003); as a comparison, the second‐best method, CaDRRes, has a much larger SD of 0.011, highlighting the robustness of our proposed method. Note that CaDRRes also uses the NDCG as a loss function, which may explain why it performs second best. To further compare the performance, we will use NDCG to show the top selected drugs by each method as to be discussed in Section 3.3 (where it is shown that PPORank demonstrates larger performance gains over CaDRRes and other methods).

FIGURE 3

Mean NDCG values by five‐fold cross‐validation (with 1 SD as error bars). (A) NDCG values using full CCLE data set; (B) NDCG values using full GDSC data set

Mean NDCG values by five‐fold cross‐validation (with 1 SD as error bars). (A) NDCG values using full CCLE data set; (B) NDCG values using full GDSC data set To compare the performance between our proposed RL and DNN‐based supervised learning, we also developed and trained a DNN using an approximate NDCG as the loss function based on the GDSC data. Figure 4 shows that after 600 epochs, PPORank approaches the performance of the DNN. It suggests that the improvement of PPORank over other methods perhaps mainly comes from two sources: its highly non‐linear model and its direct use of NDCG as its target loss function.

FIGURE 4

DNN and PPORank's performance using on the GDSC data

Prediction with other types of cell line features

We use the Gaussian kernel to summarize side information from other omic data sources: for each cell line , , where and denote the feature vectors of two cell lines in the form of (i) whole‐exome sequencing (WES), (ii) copy number variation (CNV), or (iii) DNA methylation (MET) data, and is the kernel width of the Gaussian kernel. For simplicity, we heuristically determine the kernel width using the “median trick,” setting the width as the reciprocal of the median squared Euclidean distance among all training sample pairs. In this fashion, we generate an kernel matrix for each additional data type, where is the number of cell lines. We treat each row of as a feature vector of length for the corresponding omic data and cell‐line, which can be combined with the original cell‐line features to incorporate the use of other omic data. Since our method is based on DNNs, it is quite flexible to accommodate other data sources: we add an (MLP) subnetwork to project other high‐dimensional features to a low‐dimension space before feeding its output to the stacking layer. As gene expression (GEX) data is known to be informative, we use the following cell line features (1) GEX+WES, (2) GEX+CNV, (3) GEX+MET respectively to show how performance may change; more details about these features can be found in Appendix Table A1. Figure 5 compares the performance in NDCG across using these three additional types of features. In agreement with the findings from a drug sensitivity collaborative competition, gene expression is confirmed to be the most predictive data type regardless of the method being employed, since it shows almost no improvement with the additional CNV or WES data and only slight improvement with the MET data; these results are also consistent with some previous studies supporting that MET data are informative to cancer.

FIGURE 5

Mean NDCG values by five‐fold cross‐validation on the GDSC data with different types of omic data features

Sequential learning and prediction

One distinct feature of RL is its sequential learning. Although all the data from GDSC are available at the time of training, we mimic a realistic situation by starting RL with a subset of the available data. We begin with 50 of all the cell lines in the GDSC data at start time . As the model uses all the drugs for embedding learning in our deep cross net, we consider all the drugs at time ; we evaluate the method at each following time point using the independent test data. Specifically, at time point , we add 20 cell lines as new data and run the RL algorithm based on the feedback of the newly added data; after the convergence, we evaluate its performance based on the (unseen) test data. Then we add another 20 cell lines as new data, and repeat the process. To save time, at each time point we start RL with the estimated model (ie, its estimated parameters) obtained from the previous run (or time point), then update the model with the new incoming cell lines. Figure 6 displays the results of applying RL on the GDSC test data, starting at with only 50% of the whole training data and subsequently adding 20 more cell lines at each time point until we use up all the GDSC training data. As expected, as more data are added, overall the performance of RL improves. It is confirmed that RL performs well in the realistic set‐up for sequential learning.

FIGURE 6

PPORank performance on GDSC with sequential data

Prediction with top ranking

To show the ability of the PPORank model for ranking the top most sensitive drugs across cell lines, we use the full CCLE and GDSC data sets respectively to compared the methods in terms of with ; with , one is to select the most sensitive drug. As shown in Figure 7, PPORank performs best consistently; its improvements over other methods are more obvious for the GDSC data. From Section 2.1, where is a normalized factor that . So, when is larger than , increases in . In practice, depending on data, an increasing trend may not be seen in all circumstances for a top ranking problem.

FIGURE 7

values with the full CCLE or GDSC data for different ; the error bars show one SD based on three‐fold cross‐validation. (A) with the CCLE data; (B) with the GDSC data

External validation with TCGA breast cancer cohort

To show the generalization ability of PPORank, we show that drug ranks predicted with PPORank trained on cell lines from GDSC are correlate with patient response to treatment in a TCGA breast cancer cohort. We follow the data preprocessing of Reference 63 to harmonize the gene expression profiles measured by RNA‐seq in TCGA and by micro‐arrays in GDSC. In this way, they were able to train ridge regression models on the GDSC cell lines and use these models to “impute” drug response for the TCGA patients. They also showed that TCGA‐BRCA patients were predicted to be more sensitive to lapatinib, a first‐line therapy for patients, than other patients. Here we follow the idea of using molecular subtypes to evaluate PPORank's recommendations. We compare lapatinib and four PARP 1/2 inhibitors (PARPi), veliparib, olaparib, talazoparib, and rucaparib, some of which have shown promising therapeutic response in BRCA1/2‐mutant (mBRCA) tumors. In our experiments, lapatinib is ranked higher than all four PARPi for 150 (92%) of the 163 patients; in contrast, for mBRCA TNBC , lapatinib is ranked higher than PARPi only 22. Table 1 shows how lapatinib is recommended as compared to the overall and individual PARPi treatments. To show the robustness of our proposed method, we retrain the model 10 times with a subset of 100 drugs (but with lapatinib and PARPi always included) randomly selected from the 200+ available drugs in the GDSC data. It is confirmed that PPORank's recommendations are indeed statistically different from that of a random recommendation (the unpaired ‐test with for and for mBRCA TNBC patients). Furthermore, when we apply the unpaired ‐test to compare random recommendations with each of the four individual PARPi treatments, the results were statistically significant () except for rucaparib (). This suggests no recommendation of rucaparib to mBRCA TNBC patients, which is in agreement with a recent clinical trial's conclusion that the efficacy of rucaparib was not established and it required further investigation.

TABLE 1

PPORank's recommendation rates of lapatinib for the TCGA patients with two different breast cancer subtypes

Recommendation	HER2+ (n=163)	mBRCA (n=9)
lapatinib > PARPi	0.92	0.22
lapatinib > veliparib	0.95	0.22
lapatinib > olaparib	0.94	0.44
lapatinib > talazoparib	0.92	0.22
lapatinib > rucaparib	0.95	0.44

PPORank's recommendation rates of lapatinib for the TCGA patients with two different breast cancer subtypes

Simulations

Primary simulations

To compare our method with other baseline methods, we first construct the response matrix with information from the cell lines. Now consider 100 drugs and 1000 cell lines with 2000 features for cell lines thus . As pointed out by Cortes‐Ciriano et al and Seashore‐Ludlow et al drug pairs within the same cluster of chemical fingerprints can show similar inhibitory effects on the same cell line. So we generate with 2000 features from 10 clusters with a cluster size of ; the features belong to the same cluster are correlated. Each feature is generated from a uniform distribution. The correlation matrix of can be seen from Figure 8A. We design a weight matrix with 10 clusters each with an equal size of 10; for each cluster, the weight is generated from a standard normal distribution as shown in Figure 8B. For each simulated cell line, a maximum number of 100 drugs need to be ranked.

FIGURE 8

Simulation setups: (A) The correlation matrix of the simulated cell lines features with 10 clusters; (B) the weight matrix with 10 clusters

Simulation setups: (A) The correlation matrix of the simulated cell lines features with 10 clusters; (B) the weight matrix with 10 clusters We generate the response matrix from one of the three models: representing linear, non‐linear (cubic) and highly non‐linear (exponential) scenarios, and is a Gaussian noise. Here, instead of using the original Pearson correlation matrix as in the CaDRRes paper, we use the original cell line features from . We define the th element of as the true mean response for the th cell line given the th drug. As NDCG is commutable only when the relevance label is non‐negative, we set a response value as missing if it is negative. We also set 50 of the remaining response values as missing. To distinguish between sensitive and nonsensitive drugs, we use the median response of all the cell lines to a specific drug as the threshold. If a response value is larger than the threshold, we take it as sensitive/positive; otherwise, it is nonsensitive/negative. During training, if we have a nonsensitive drug in a ranking position while there are other sensitive candidate drugs to be selected, we take it as a negative evaluation signal and otherwise as a positive evaluation signal. We also compare DNN with PPORank in the linear and exponential scenarios with a sample size of 1000. From Figure 9a, we can see that in the linear scenario, DNN reaches the stable and perhaps optimal performance sooner, while PPORank fluctuates but reaches a relatively stable plateau after 1000 epochs. As our sample size is only 1000, while the number of the parameters in PRORank is almost 10 times the sample size, it needs to sample more trajectories in a long run to reach and even slightly exceeds the DNN's performance. As expected, both DNN and PPORank perform worse than the linear method of “CaDDRes.” On the other hand, for the exponential scenario, Figure 9B shows faster convergence by both DNN and PPORank than that in the linear scenario, presumably because we are using nonlinear networks to describe the true nonlinear relationship.

FIGURE 9

Performance of DNN and PPORank in terms of in two simulation scenarios. (A) Simulation scenario (1); (B) simulation scenario (3)

Performance of DNN and PPORank in terms of in two simulation scenarios. (A) Simulation scenario (1); (B) simulation scenario (3) Then we compare our PPORank with and without positive evaluation signals (PPO vs PPO‐w/o), respectively, and other baseline methods, using simulated data as shown in Table 2. The sample size for each scenario is 1000 or 10 000. We do not include SRMF as it requires drug similarity, though the GDSC data application shows that drug similarity does not improve performance much. From the NDCG values in scenario (1), where the cell lines' features have a linear effect on drug response, MF‐based methods can capture the linear relationship and perform well. After increasing the sample size to 10 000, PPORank performs best, presumably because it is the only one targeting NDCG, the evaluation criterion being used, as its loss function. For scenario (2), we reach the same conclusion as for scenario (1). In contrast, for scenario (3) with a highly non‐linear true model, PPORank performs best with either sample size (Table 2).

TABLE 2

Mean NDCG (SD) in the primary simulations

	Scenario 1		Scenario 2		Scenario 3
	n=1000	n=10 000	n=1000	n=10 000	n=1000	n=10 000
EN	0.793 (0.03)	0.801 (0.02)	0.593 (0.05)	0.621 (0.04)	0.890 (0.04)	0.905 (0.03)
KRR	0.774 (0.04)	0.769 (0.07)	0.532 (0.05)	0.582 (0.05)	0.882 (0.06)	0.901 (0.04)
KRL	0.783 (0.02)	0.793 (0.01)	0.625 (0.02)	0.647 (0.03)	0.920 (0.03)	0.927 (0.02)
CaDRRes	0.809 (0.02)	0.824 (0.02)	0.632 (0.03)	0.682 (0.02)	0.922 (0.02)	0.931 (0.02)
ppo‐w/o	0.798 (0.07)	0.811 (0.10)	0.670 (0.04)	0.693 (0.01)	0.940 (0.03)	0.948 (0.04)
PPORank	0.790 (0.06)	0.832 (0.04)	0.651 (0.02)	0.705 (0.05)	0.941 (0.05)	0.959 (0.03)

Note: The boldfaced values show the best performances.

Mean NDCG (SD) in the primary simulations Note: The boldfaced values show the best performances.

Secondary simulations

To consider more complex relationships between cell lines and drugs, we design the following simulation setup. Suppose we have 40 drugs and 1000 cell lines, the cell line feature matrix is the same as in the previous static simulation setup. The weight matrix is also the same as before. However, the drug response varies with the drug. For the first 20, next 10, and final 10 drugs, we have , , and , respectively. We evaluate the performances with the sample size and . From Table 3 we can see that PPORank performs best.

TABLE 3

Mean NDCG (SD) for the secondary simulations

	n=1000	n=10000
EN	0.645 (0.08)	0.679 (0.07)
KRR	0.641 (0.11)	0.650 (0.10)
KRL	0.689 (0.09)	0.707 (0.08)
CaDRRes	0.647 (0.11)	0.689 (0.09)
ppo‐w/o	0.701 (0.11)	0.713 (0.12)
PPORank	0.721 (0.11)	0.732 (0.09)

Note: The boldfaced values show the best performances.

Mean NDCG (SD) for the secondary simulations Note: The boldfaced values show the best performances.

CONCLUSIONS AND DISCUSSION

In this article, we have proposed a novel personalized ranking system called Proximal Policy Optimization Ranking (PPORank), which ranks drugs/treatments based on their predicted effects per cell line (or patient). It makes recommendations directly based on the ranking of the drugs, instead of on predicting some drug‐specific responses per se, for example, their ) values, as adopted by other existing methods. Hence this policy‐based learning framework directly optimizes the target evaluation metric using the policy gradient algorithm. Furthermore, the evaluation metric NDCG is optimized by leveraging information from all the ranks. In the implementation, we have adopted some state‐of‐the‐art techniques in DRL as briefly discussed below. By using a DeepCross network to parametrize the policy, we can learn the interactions between cell lines and drugs efficiently, and it can be extended to integrate other data sources. In particular, we use a GRU layer to incorporate the dynamic evaluation signal from cell lines and reinforce the policy with evaluation signals. Variance reduction is realized by using the actor‐critic framework, where the actor network learns the states from the input and feeds it to the critic network to evaluate the current policy. To handle the sample inefficiency problem of policy gradient methods, we use the PPO algorithm with multiple epochs of stochastic gradient ascent to update the policy. As proofs‐of‐concept, we have applied our method to two large‐scale cancer drug screening datasets for personalized drug discovery. With the limited data and high dimensions of the states and actions, our method has shown superior performance over other methods. In particular, our method demonstrates its promising and clinically meaningful performance when the learning algorithm trained on the cell line drug screening data is applied to a TCGA breast cancer cohort. Most importantly, our method, as any RL method, can be more efficiently applied in practice: it learns sequentially and continuously as the data accrue; in contrast, existing supervised learning and many DTR methods would require finishing collecting data (from either a randomized experiment or an observational study) first before learning personalized drug ranking, which would be more resource‐ and time‐consuming. This feature of our method would be most useful for more efficient drug discovery as for our real data applications, and possibly for behavioral interventions in mHealth or selecting experimental drugs for patients with incurable or terminal diseases. In addition, through transfer learning, our proposed RL method can take advantage of existing data and be trained a priori before being applied to and fine‐tuned by future clinical data. Finally, an important aspect of our proposed RL method is its active learning and dynamic interaction with individuals: the learning algorithm would suggest the best treatment/intervention based on its current ranking to an individual, then learn from the outcome, for example, in mHealth applications. Due to lack of data, we have not explored this aspect. This is a distinct and attractive feature unique to RL, but not to supervised learning. On the other hand, there are some challenges remaining before DRL being applied more generally for health care in practice. First, due to the “data‐hungry” nature of both DL and RL (and thus DRL), more data efficient algorithms are urgently needed. Transfer learning mentioned earlier can also help. Second, the interpretability of a DL model as a black‐box needs to be improved. Both topics have generated extensive interests and intensive research in the DL community, which will hopefully lead to some breakthroughs in a near future. In summary, our proposed and possibly other DRL‐based methods are promising for precision medicine, and worth further investigation. Data S1: Supplementary material Click here for additional data file.

48 in total

1. The Cancer Genome Atlas Pan-Cancer analysis project.

Authors: John N Weinstein; Eric A Collisson; Gordon B Mills; Kenna R Mills Shaw; Brad A Ozenberger; Kyle Ellrott; Ilya Shmulevich; Chris Sander; Joshua M Stuart
Journal: Nat Genet Date: 2013-10 Impact factor: 38.330

2. Oral poly(ADP-ribose) polymerase inhibitor olaparib in patients with BRCA1 or BRCA2 mutations and advanced breast cancer: a proof-of-concept trial.

Authors: Andrew Tutt; Mark Robson; Judy E Garber; Susan M Domchek; M William Audeh; Jeffrey N Weitzel; Michael Friedlander; Banu Arun; Niklas Loman; Rita K Schmutzler; Andrew Wardley; Gillian Mitchell; Helena Earl; Mark Wickens; James Carmichael
Journal: Lancet Date: 2010-07-06 Impact factor: 79.321

3. Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy.

Authors: Theo A Knijnenburg; Gunnar W Klau; Francesco Iorio; Mathew J Garnett; Ultan McDermott; Ilya Shmulevich; Lodewyk F A Wessels
Journal: Sci Rep Date: 2016-11-23 Impact factor: 4.379

4. Discovering novel pharmacogenomic biomarkers by imputing drug response in cancer patients from large genomics studies.

Authors: Paul Geeleher; Zhenyu Zhang; Fan Wang; Robert F Gruener; Aritro Nath; Gladys Morrison; Steven Bhutra; Robert L Grossman; R Stephanie Huang
Journal: Genome Res Date: 2017-08-28 Impact factor: 9.043

5. Kernelized rank learning for personalized drug recommendation.

Authors: Xiao He; Lukas Folkman; Karsten Borgwardt
Journal: Bioinformatics Date: 2018-08-15 Impact factor: 6.937

6. Deep reinforcement learning for personalized treatment recommendation.

Authors: Mingyang Liu; Xiaotong Shen; Wei Pan
Journal: Stat Med Date: 2022-06-18 Impact factor: 2.497

7. Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning.

Authors: Daniel J Luckett; Eric B Laber; Anna R Kahkoska; David M Maahs; Elizabeth Mayer-Davis; Michael R Kosorok
Journal: J Am Stat Assoc Date: 2019-04-17 Impact factor: 5.033

8. Comprehensive molecular portraits of human breast tumours.

Authors:
Journal: Nature Date: 2012-09-23 Impact factor: 49.962

9. Phase 2 multicentre trial investigating intermittent and continuous dosing schedules of the poly(ADP-ribose) polymerase inhibitor rucaparib in germline BRCA mutation carriers with advanced ovarian and breast cancer.

Authors: Yvette Drew; Jonathan Ledermann; Geoff Hall; Daniel Rea; Ros Glasspool; Martin Highley; Gordon Jayson; Julieann Sludden; James Murray; David Jamieson; Sarah Halford; Gary Acton; Zoe Backholer; Raffaella Mangano; Alan Boddy; Nicola Curtin; Ruth Plummer
Journal: Br J Cancer Date: 2016-03-22 Impact factor: 7.640

1 in total

1. Deep reinforcement learning for personalized treatment recommendation.

Authors: Mingyang Liu; Xiaotong Shen; Wei Pan
Journal: Stat Med Date: 2022-06-18 Impact factor: 2.497

1 in total