
Discriminatory Gleason grade group signatures of prostate cancer: An application of machine learning methods.

Mpho Mokoatle¹, Darlington Mapiye², Vukosi Marivate¹,³, Vanessa M Hayes³,⁴, Riana Bornman⁴.

Abstract

One of the most precise methods of detecting prostate cancer is the evaluation of a stained biopsy by a pathologist under a microscope. Regions of the tissue are assessed and graded according to the observed histological pattern. However, this is not only laborious, but also relies on the experience of the pathologist and tends to suffer from poor reproducibility of biopsy outcomes across pathologists. As a result, computational approaches are being sought, and machine learning has been gaining momentum in the prediction of the Gleason grade group. To date, the machine learning literature has addressed this problem by using features from magnetic resonance images, whole slide images, tissue microarrays, gene expression data, and clinical features. However, there is a gap with regard to predicting the Gleason grade group using DNA sequences as the only input source to the machine learning models. In this work, using whole genome sequence data from South African prostate cancer patients, machine learning and biological experiments were combined to understand the challenges associated with the prediction of the Gleason grade group. A series of binary machine learning classifiers (XGBoost, LSTM, GRU, LR, RF) was created, relying only on DNA sequences as input features. None of the models was able to adequately discriminate between the DNA sequences of the studied Gleason grade groups (Gleason grade group 1 and 5). The models were therefore further evaluated on the prediction of tumor DNA sequences from matched-normal DNA sequences, again given DNA sequences as the only input source. In this new problem, the models performed better than before, with the XGBoost model achieving the highest accuracy of 74 ± 0.1%, an F1 score of 79 ± 0.1%, a recall of 99 ± 0.0%, and a precision of 66 ± 0.1%.

Year: 2022 PMID: 35679280 PMCID: PMC9182297 DOI: 10.1371/journal.pone.0267714

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

1 Introduction

Prostate cancer is the leading male cancer in South Africa and is the second most frequently diagnosed cancer among men globally [1]. As men live longer, there is an increase in the occurrence and mortality of the disease [2]. Apart from age, the main risk factor is heredity. Other factors such as race, a high-calorie diet, and exposure to heavy metals have a significant impact on the risk of developing the disease [3, 4]. When it comes to the diagnosis of prostate cancer, a prostate biopsy procedure is common [5]. This procedure involves the extraction of tissue samples from the prostate by using specialised biopsy needles. It is typically performed by using an ultrasound probe that is placed in the rectum, which then produces a real-time image of the prostate. The samples produced from this procedure are then taken to a pathologist for evaluation and grading [6, 7]. The Gleason grade group system is the most reliable method and criterion for selection of therapy. In 2014, the International Society of Urological Pathology (ISUP) [8] released supplementary guidance on an improved prostate cancer grading system called the ISUP-Grade Group. This system is simpler, with just five grades, 1 to 5, to describe the growth of the tumor. Grade 1 refers to the least aggressive growth of the tumor, and grade 5 refers to the most aggressive growth [9]. Due to the difficulty and inherent subjectivity of this system, Gleason grading is affected by large discordance rates among pathologists (30-50%) [10-15]. Moreover, grades provided by experts with many years of experience are more accurate and precise than grades provided by pathologists with only a few years of experience [16-19], indicating the need to improve the clinical usefulness of the system by reducing grading discordance and improving accuracy [20]. In this work, the DNA sequences obtained from patients who present with a Gleason grade group of 1 and 5 are studied.
The objective of this work is to find discriminatory features within the DNA sequences, and map them to their correct Gleason grade group using machine learning. Two key cancer genes are investigated: BRCA 1 and BRCA 2. These genes have been key genes of interest in prostate cancer [21]. Studies that interrogated these two genes suggest that men who harbor a disease-associated BRCA 2 allele have an increased predisposition to prostate cancer (a 2 to 5-fold increased risk). This finding suggests that deleterious mutations in BRCA 2 play a significant role in the susceptibility to prostate cancer [22, 23]. In contrast to BRCA 2 mutations, mutations in BRCA 1 have been inconsistently correlated with the risk of prostate cancer. Studies that have evaluated prostate cancer risk in men who carry BRCA 1 mutations have reported an increase in risk that is small, but not insignificant [24, 25]. The contributions of this work are summarised as follows: this study specifically compares two extremes of the Gleason grade group (Gleason grade group 1 and 5); and, while previous studies have used medical images and clinical features [26-31] as input to their Gleason grade group predictor models, this study explores the challenges that are encountered when blood DNA sequences are used as the only input source to the machine learning models. This work is organised as follows: first, a literature review is given that highlights the gap in the prediction of the Gleason grade group in the context of machine learning. Second, the data and the methods are described. Finally, the results, discussion, and conclusion sections follow.

2 Literature survey

Recently, deep learning has emerged as a powerful tool to automate the Gleason grading system. Deep learning systems make use of multi-layered neural networks that are able to extract complex features from data. Recent work [26] designed a Gleason score annotator by using a convolutional neural network (MobileNet) on tissue microarray images. The final output layer of this architecture produced a probability distribution over four possible Gleason classes. A key limitation of this work is that the training, testing, and validation sets were too small, which led to some bias in the predictions produced by the model. A similar recent study [27] also used a convolutional neural network (Inception V3) to develop a Gleason score annotator using whole slide images. In addition to predicting a Gleason pattern, this architecture first provided a probability distribution over an image being benign or malignant. In contrast to the above work, a study [28] applied a convolutional neural network to multi-parametric magnetic resonance imaging (mpMRI) images of prostate cancer patients to extract deep entropy features. The features extracted from the convolutional neural network were then used as input to a Random Forest model for prediction of the Gleason grade group. The training data was small, and the performance measure would have been more reliable if the models had been cross-validated. Biopsy images of patients who underwent a prostate biopsy following suspicion of prostate cancer have also been used as input to convolutional neural networks (U-Net and an Inception-v3 network) for the prediction of the Gleason grade group and cancer detection [29, 30]. To validate the performance of the deep learning systems, the predictions from the models were compared with those of pathologists, and a high agreement was found between the deep learning systems and the pathologists.
Rather than using convolutional neural networks for the prediction of the Gleason grade group, a study [31] developed a machine learning assisted model that predicts the probability of a patient having a Gleason grade upgrade before treatment. The inputs to the machine learning models (Logistic Regression, Random Forest, Support Vector Machine) were clinical features such as age, prostate-specific antigen (PSA) level, and the clinical stage. Overall, much emphasis has been placed on creating machine learning models that predict the Gleason grade group from medical images and clinical data. To the best of our knowledge, this is the first study that focuses on DNA sequences as the only input source to a Gleason grade group prediction model. This work explores the challenges that are associated with finding discriminatory signatures within the DNA sequences of patients who present with a Gleason grade group of 1 and 5.

3 Data description, data representation methods, machine learning algorithms, and sequence similarity

3.1 Data description

Patients were recruited and consented according to approval granted from the University of Pretoria Faculty of Health Sciences Research Ethics Committee 43/2010 (South Africa); DNA sequencing data were generated under approval granted from the St. Vincent’s Hospital Human Research Ethics Committee (HREC) SVH/15/227 in Sydney (Australia), and this study was approved by the Faculty of Engineering, Built Environment & IT (Ethics Reference No: 43/2010; 11 August 2020). The data was fully anonymized before analysis. The DNA sequences of twelve patients with a histopathological ISUP-GG of 1 (low-risk prostate cancer) and 5 (high-risk prostate cancer) were selected for analysis. The DNA sequences were aligned using the BWA-MEM aligner [32] to produce BAM files. The BAM files were converted to FASTA files using samtools [33], and an in-house Python script was used for pre-processing and removing IDs from the blood DNA sequences. The blood DNA sequences were then truncated into k-mers. k-mers are defined as all the possible substrings of length k that are contained in a sequence [34]. The classification problem in this work is defined as follows: given a DNA sequence $x$ that consists of k-mers of size 63, can a machine learning function $f$ learn the correct mapping from the input $x$ to the outcome variable $y$ (Gleason grade group of 5 or 1)?

$$f : x \mapsto y$$

After preprocessing, the data was transformed into the data structure shown in Fig 1.

Fig 1

Blood DNA sequences x transformed into k-mers with their corresponding Gleason grade group y.
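The k-mer definition above can be illustrated with a short Python sketch. The function name `kmers` and the toy sequence are illustrative assumptions, not the authors' in-house script; the study used k = 63, while k = 3 is used here only to keep the example readable.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, in order of position."""
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy example (hypothetical sequence, k chosen small for readability).
print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Each sequence would then be paired with the Gleason grade group label of the patient it came from, as in Fig 1.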

3.2 Data representation methods

To vectorize the k-mers, the Term Frequency-Inverse Document Frequency (TF-IDF) [35] algorithm was used. TF-IDF is a statistical method that calculates how significant a token or word is to a document in a set of documents. Two measures are used to calculate the TF-IDF score: term frequency (TF), which measures how many times a token appears in a document, and inverse document frequency (IDF), which measures how frequent or rare a token is in the entire document set. Multiplying these two measures produces the TF-IDF score of each word in the document [36]. The main disadvantage of TF-IDF is that it produces extremely high dimensional vectors [37]. To overcome this, Principal Component Analysis (PCA) [38] was used as a data reduction technique to transform the high dimensional vectors into 2-dimensional (2-d) vectors. The other vectorization method that was used was the Skip-gram method from the word2vec algorithm. This vectorization method was chosen as it has been found to be robust with regard to transforming DNA or genomic data into dense vector representations in preparation for machine learning [39-42]. In the context of this work, the usefulness of the Skip-gram model lies in determining k-mers that are important in predicting the surrounding k-mers in a DNA sequence. Precisely, given a sequence of training k-mers $w_1, w_2, w_3, \ldots, w_T$, the training objective of the Skip-gram model is to maximise the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the context window of k-mers around each training k-mer. In this work, each Skip-gram k-mer token was represented by a continuous vector of size 100, and the vectors of all k-mers in a sequence were summed to give a single continuous vector that represents the entire sequence.
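As a rough illustration of the TF-IDF scoring described above, the following Python sketch treats each DNA sequence as a "document" and each k-mer as a "token". It uses the plain variant tf = count / document length and idf = log(N / df); this choice, the function name `tf_idf`, and the toy k-mers are assumptions for illustration, and library implementations often use smoothed variants instead.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) of TF-IDF scores for tokenised documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    vocabulary = sorted(doc_freq)

    matrix = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        row = [
            (counts[token] / length) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


# Two toy "sequences", already split into 3-mers (hypothetical data).
docs = [["ACG", "CGT", "ACG"], ["ACG", "GTA"]]
vocab, scores = tf_idf(docs)
print(vocab)      # ['ACG', 'CGT', 'GTA']
print(scores[0])  # 'ACG' occurs in every document, so its score is 0.0
```

Under this variant, a k-mer that appears in every sequence receives a score of zero, which gives some intuition for why highly similar sequences leave few informative features.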

3.3 Machine learning algorithms

After obtaining the 2-d TF-IDF vectors from PCA, they were used as features for several machine learning models: eXtreme Gradient Boosting (XGBoost), a gradient boosting algorithm; Long Short-Term Memory (LSTM); Gated Recurrent Unit (GRU); and Random Forest (RF). XGBoost is an ensemble boosting method that makes use of several learners to make predictions. This method differs from other ensemble methods in that it builds a sequence of initially weak models into progressively more powerful models, where the errors made by previous models are corrected in subsequent models [43]. The steps involved in the ensemble technique are as follows: first, an initial model $F_0$ is initialised to predict the target variable $y$. This model produces a residual error $(y - F_0)$. Next, an additive learner $h_1$ is fit to the residuals from the previous step. Then, $F_0$ and $h_1$ are summed to produce $F_1$, which is the boosted version of $F_0$. The residual error of $F_1$ will be lower than the residual error of $F_0$:

$$F_1(x) = F_0(x) + h_1(x)$$

To improve the performance of $F_1$, the residuals of $F_1$ can be modeled to create a new model $F_2$:

$$F_2(x) = F_1(x) + h_2(x)$$

This procedure can be repeated for $m$ iterations until the residual errors have been minimised as much as possible:

$$F_m(x) = F_{m-1}(x) + h_m(x)$$

Instead of fitting the additive learners $h_m(x)$ to the residuals, fitting them to the gradient of the loss function makes this process more generic and applicable across all loss functions. Hence, XGBoost uses the gradient descent algorithm to minimise the loss [43, 44].

The LSTM and GRU are variants of Recurrent Neural Networks (RNNs) that regulate the flow of information through the network by using several gates. The gates learn which time steps are important to keep or discard [45]. In an LSTM cell (Fig 2), the sigmoid layer called the forget gate is responsible for deciding which information will be discarded from the cell state.
This gate takes as input $x_t$ and the previous hidden state $h_{t-1}$, then outputs a value between 0 and 1 for each value in the cell state $C_{t-1}$. If the value is 1, the corresponding information from the previous cell state will be kept, and if the value is 0, it will be discarded [46-48]:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
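The residual-fitting procedure described earlier in this section for gradient boosting can be sketched in a few lines of Python. This is a minimal sketch of plain gradient boosting with squared loss (where the negative gradient equals the residual), not of XGBoost itself, which adds regularisation and second-order information. The one-split regression stump, the function names, and the toy data are all assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump h(x) to the residuals by least squares."""
    best = None
    # Candidate thresholds exclude the largest x so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant learner
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=3):
    """Build F_m = F_{m-1} + h_m and record the mean squared error per round."""
    f0 = sum(ys) / len(ys)  # F_0: constant initial model
    learners = []           # h_1, ..., h_m

    def predict(x):
        return f0 + sum(h(x) for h in learners)

    def mse():
        return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

    history = [mse()]
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # y - F_{m-1}(x)
        learners.append(fit_stump(xs, residuals))             # fit h_m
        history.append(mse())
    return predict, history


# Toy one-dimensional regression data (hypothetical).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model, errors = boost(xs, ys)
print(errors)  # training error after F_0, F_1, F_2, F_3
```

On the training data the recorded error does not increase from one round to the next, which mirrors the statement that each boosted model has lower residual error than its predecessor.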

Fig 2

Architecture of an LSTM unit [59].

Next, the input gate $i_t$ determines which new information will be added to the cell state. Then, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that will be added to the cell state:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

To update the old cell state $C_{t-1}$ into the new cell state $C_t$, the old state is multiplied by $f_t$. Next, $i_t * \tilde{C}_t$ is added, which are the new candidate values scaled by the input gate:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

Finally, the output, which is based on the cell state, is computed. The cell state is put through a tanh function and multiplied by the output of the sigmoid output gate [46-48]:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t * \tanh(C_t)$$

A three-layer LSTM architecture was selected with a total of 224 hidden units. The output layer consisted of a sigmoid activation function that provides the probability of a sequence belonging to a patient with a Gleason grade group of 1 or 5. The training dataset was divided over 50 batches and trained over 5 epochs. A dropout rate of 60% was used to control overfitting of the model. GRUs (Fig 3) are similar to LSTMs in that they both use gates to regulate the flow of information. GRUs are faster to train than LSTMs, and also have a simpler architecture [49-51].
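The LSTM gate computations described above can be sketched as a single time-step update in plain Python. This is a minimal sketch under the standard LSTM formulation, with hypothetical parameter names (`W_f`, `b_f`, and so on) and list-based vectors; it is not the three-layer model trained in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps hypothetical names to weights and biases."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]          # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]          # input gate
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # candidates
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]          # output gate
    # New cell state: keep part of the old state, add part of the candidates.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # New hidden state: squashed cell state, filtered by the output gate.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# One hidden unit, one input feature, all weights zero (toy parameters).
zero = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([1.0], [0.0], [2.0], zero)
print(c)  # [1.0]: every gate is 0.5 and the candidate is 0, so half of C_{t-1} survives
```

With a large positive forget-gate bias and a large negative input-gate bias, the same function leaves the cell state almost unchanged, which is the "keep" behaviour described for a forget-gate value of 1.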

Fig 3

Architecture of a GRU unit [59].

Inside a GRU cell, at each time step $t$, the cell takes an input $x_t$ and the hidden state $h_{t-1}$ from the previous time step. The cell then outputs a new hidden state $h_t$, which is fed as input to the next time step. Unlike the LSTM, which has three gates, the GRU has two gates: the update gate and the reset gate. The reset gate $r_t$ is in charge of the short-term memory of the network. It is responsible for deciding which information from previous time steps to discard [49-51]:

$$r_t = \sigma(U_r x_t + W_r h_{t-1})$$

Because of the sigmoid function, $r_t$ takes values between 0 and 1. As with the LSTM gates, an output value of 1 means that the information from the previous hidden state $h_{t-1}$ is kept, and an output value of 0 means that it is discarded [49-51]. To generate the hidden state of a GRU cell, a two-step process is followed. First, a candidate hidden state is generated: the hidden state from the previous time step $h_{t-1}$ is multiplied element-wise by the output of the reset gate $r_t$ and combined with the input $x_t$. This is then passed to a tanh function, which outputs the candidate hidden state $\hat{h}_t$:

$$\hat{h}_t = \tanh(U_h x_t + W_h (r_t \odot h_{t-1}))$$

This equation shows how the value of the reset gate controls how much influence the previous hidden state has on the candidate state [49-51]. Similarly, the GRU cell also has an update gate, which is responsible for determining how much past information needs to be kept:

$$u_t = \sigma(U_u x_t + W_u h_{t-1})$$

This equation is similar to the one used by the reset gate; the only key differences are the new weight matrices $U_u$ and $W_u$ [49-51]. Second, the new hidden state interpolates between the previous hidden state and the candidate, weighted by the update gate: $h_t = u_t \odot h_{t-1} + (1 - u_t) \odot \hat{h}_t$.

The GRU models were configured with a stack of four hidden layers and a total of 240 hidden units. The output layer was also a dense layer with a sigmoid activation function, and the model was trained over 5 epochs with the training set divided over 50 batches. Dropout (at 60%) was also used to control overfitting. All the machine learning models were validated via repeated k-fold cross-validation (5 folds, 5 repeats).
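The GRU update described above can be sketched in the same style as a single time step. This is a minimal sketch, assuming the common convention in which the update gate weights the previous hidden state (some references swap the roles of $u_t$ and $1 - u_t$) and omitting bias terms; the parameter names and toy values are hypothetical and do not reproduce the four-layer model used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Return M v for a list-of-rows matrix M and a list vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x_t, h_prev, p):
    """One GRU time step; p maps hypothetical names to weight matrices."""
    # Reset gate: how much of the previous hidden state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["U_r"], x_t), matvec(p["W_r"], h_prev))]
    # Update gate: how much of the previous hidden state is carried forward.
    u = [sigmoid(a + b) for a, b in zip(matvec(p["U_u"], x_t), matvec(p["W_u"], h_prev))]
    # Candidate hidden state built from the input and the reset-gated past state.
    gated = [rk * hk for rk, hk in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["U_h"], x_t), matvec(p["W_h"], gated))]
    # New hidden state: interpolation between past state and candidate.
    return [uk * hk + (1.0 - uk) * ck for uk, hk, ck in zip(u, h_prev, cand)]


# Two hidden units, one input feature, all weights zero (toy parameters).
params = {
    "U_r": [[0.0], [0.0]], "W_r": [[0.0, 0.0], [0.0, 0.0]],
    "U_u": [[0.0], [0.0]], "W_u": [[0.0, 0.0], [0.0, 0.0]],
    "U_h": [[0.0], [0.0]], "W_h": [[0.0, 0.0], [0.0, 0.0]],
}
print(gru_step([1.0], [0.8, -0.4], params))  # [0.4, -0.2]: both gates are 0.5, candidate is 0
```

Compared with the LSTM sketch, there is no separate cell state: the hidden state itself carries the memory, which is one reason the GRU has fewer parameters.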
The experiments in this work were conducted on an NVIDIA Tesla P100 GPU virtual machine with 100 GB of memory.

RF was also used to find discriminatory signatures between Gleason grade group 1 and 5 blood DNA sequences. In RF, several decision trees are created simultaneously. In the final prediction, the multiple decision trees are merged in order to determine the final answer, which will be the average of all the decision trees [52]. To decide how the nodes of the decision trees would branch, the default Gini index was used:

$$\text{Gini} = 1 - \sum_{i=1}^{c} p_i^2$$

where $p_i$ is the relative frequency of class $i$ and $c$ represents the number of classes. This equation makes use of the class and probability to determine the Gini of each branch on a node [52].

Another binary machine learning model that was used was Logistic Regression (LR). The Skip-gram k-mer features were used as input to the LR model. A logistic regression model is a machine learning model that uses a decision boundary to separate a set of data points into their distinct classes. Logistic regression is comparable to linear regression; the key difference between them is that logistic regression is used when the target variable is categorical, while linear regression is used when the target variable is continuous. In this study, the target variable is categorical (1 = Gleason grade group of 5, 0 = Gleason grade group of 1). Logistic regression uses a Sigmoid function to map the linear score $z$ to a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function maps −∞, 0 and +∞ to 0, 0.5, and 1 respectively. If the score $z$ for a data point is close to +∞, this is an indication that the data point is above the decision boundary, hence it will belong to the positive class. In contrast, if the score $z$ for a data point is close to −∞, it means that the data point is below the decision boundary, meaning it belongs to the negative class.
If the data point is predicted to be on the decision boundary, the value of z is 0, and the Sigmoid function will transform it to 0.5, meaning that it has a 50% probability of belonging to the positive class [53, 54].
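The two scalar formulas used in this section, the Gini index for a tree node and the Sigmoid function for logistic regression, can be checked numerically with a small Python sketch. The function names are illustrative assumptions, and the class counts in the usage lines are toy values.

```python
import math


def gini_impurity(class_counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


def sigmoid(z):
    """Numerically stable logistic function mapping a score z to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


print(gini_impurity([5, 5]))   # 0.5: a perfectly mixed two-class node
print(gini_impurity([10, 0]))  # 0.0: a pure node
print(sigmoid(0.0))            # 0.5: a point exactly on the decision boundary
```

A split is preferred when it lowers the weighted Gini index of the child nodes, and a predicted probability above 0.5 corresponds to a positive score, that is, a point on the positive side of the decision boundary.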

3.4 Sequence similarity

Multicollinearity is a problem in machine learning where two or more predictor variables are highly correlated with each other [55]. This presents a problem because the individual effects of the predictor variables on the target variable cannot be distinguished. One of the methods applied to deal with multicollinearity in machine learning is to remove the collinear variables. In the context of this work, removing collinear k-mers would result in a completely new set of DNA sequences, since the sequences would have to be truncated at the beginning, in the middle, or at the end. Multicollinearity can also be equated to sequence similarity in genomics. Sequence similarity is an important concept in genomics that refers to the degree of similarity between sequences [56]. This is often indicated as a percentage of identical bases over a given length of the alignment. The Basic Local Alignment Search Tool (BLAST) was used to evaluate the similarity between blood DNA sequences [57]. When a sequence similarity test is performed between a pair of sequences, several attributes are returned, such as the E value, query cover, and percent identity. In this work, only the percent identity is reported. The percent identity refers to how similar the query sequence is to the subject sequence. Specifically, it describes the proportion of bases that are identical in the aligned sequences. A perfect match is 100% [58]. A figure (Fig 4) has been generated to provide an overview of all the methods that were used in this work.

Fig 4

This figure represents the summary of all the methods that were executed in this work.
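The percent identity measure described in Section 3.4 can be illustrated with a simplified Python sketch. BLAST computes percent identity over local alignments that may contain gaps; the sketch below instead assumes two sequences that are already aligned, ungapped, and of equal length, so it only illustrates the definition and does not reproduce BLAST output. The function name and toy sequences are hypothetical.

```python
def percent_identity(query, subject):
    """Percentage of positions with identical bases in two pre-aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    if not query:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return 100.0 * matches / len(query)


# Two toy 10-base sequences that differ only at the last position.
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))  # 90.0
```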

4 Results and discussion

4.1 Sequence similarity results and TF-IDF Visualizations

For both BRCA 1 and BRCA 2, the results (Tables 1 and 2) illustrate that most sequences are highly similar, with a percent identity of 90-100%. The lowest percent identity across the sequences is 70-80%, which is still high. This indicates that blood DNA sequences derived from patients who present with a Gleason grade group of 5 are not very different from those of patients who present with a Gleason grade group of 1. There might exist a small region of dissimilarity; however, at this stage, the number of sequences available for this experiment is inadequate to capture it. It is probable that hundreds of thousands of DNA sequences are required to capture this region.

Table 1

Sequence similarity within a Gleason grade group of 5 and 1 for BRCA 1 blood DNA sequences.

	Grouped by percentage of identical matches	Total no. of local alignments
Gleason grade group 5	90-100	7170891
	80-90	3685304
	70-80	62500
Gleason grade group 1	90-100	7270628
	80-90	3732281
	70-80	56560

Table 2

Sequence similarity within a Gleason grade group of 5 and 1 for BRCA 2 blood DNA sequences.

	Grouped by percentage of identical matches	Total no. of local alignments
Gleason grade group 5	90-100	6256450
	80-90	910123
	70-80	17970
Gleason grade group 1	90-100	6510144
	80-90	932427
	70-80	16167

Next, the impact of this high similarity on the machine learning models is investigated to determine whether discriminatory signatures (regions of dissimilarity) within the DNA sequences can be detected and mapped to their correct Gleason grade group. To ensure that the machine learning models were trained on distinct sequences, highly similar sequences were removed using BLAST. Blood DNA sequences that shared more than 25 bases of homology were considered similar and were thus removed. Before the removal of highly similar sequences, the total number of blood DNA sequences from the BRCA 1 gene was 235 711. For BRCA 2, the total number of sequences was 243 822. Table 3 shows the new data distribution and the total number of sequences in each class after the removal of highly similar sequences.

Table 3

Data count and distribution of classes after the removal of highly similar DNA sequences.

	Gleason grade group 5	Gleason grade group 1
BRCA 1	3111 ∼ 58%	2210 ∼ 42%
BRCA 2	3108 ∼ 62%	1941 ∼ 38%
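The idea of discarding sequences that share more than a given number of bases with a sequence already kept can be sketched in Python. This is only a crude stand-in for the BLAST-based filtering used in this work: it measures the longest exactly shared substring rather than alignment-based homology, it keeps sequences greedily in input order, and the function names and the small threshold in the usage example are assumptions for illustration.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous run of bases shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for base_a in a:
        cur = [0] * (len(b) + 1)
        for j, base_b in enumerate(b, start=1):
            if base_a == base_b:
                cur[j] = prev[j - 1] + 1  # extend the run ending at the previous bases
                best = max(best, cur[j])
        prev = cur
    return best


def drop_similar(sequences, max_shared=25):
    """Greedily keep sequences sharing at most max_shared bases with any kept one."""
    kept = []
    for seq in sequences:
        if all(longest_common_substring(seq, other) <= max_shared for other in kept):
            kept.append(seq)
    return kept


# Toy example with a threshold of 3 bases instead of 25 (hypothetical data).
toy = ["ACGTAC", "TTACGTTT", "GGGGGG"]
print(drop_similar(toy, max_shared=3))  # ['ACGTAC', 'GGGGGG']
```

The pairwise comparison is quadratic in the number of sequences, so a filter of this kind would not scale to the hundreds of thousands of sequences in the original data without an indexed search such as the one BLAST provides.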

In keeping with the high sequence similarity observed amongst the blood DNA sequences, the TF-IDF visualisation of the k-mers (Figs 5 and 6) also shows that there is a great overlap between the k-mer features of the two Gleason grade groups, as no separable clusters were detected.

Fig 5

Visualisation of TF-IDF k-mers for BRCA 1.

Fig 6

Visualisation of TF-IDF k-mers for BRCA 2.

4.2 Machine learning results

The RF model achieved the highest accuracy (Table 4). However, its recall was close to 100% while its precision was only 59%. This indicates that the majority of the DNA sequences were predicted as positive (Gleason grade group 5), with very few true negatives (Fig 7). The same trend was observed with the other models, which indicates that not enough learning was achieved.

Table 4

This table shows the results of the machine learning models using data from the BRCA 1 gene.

	Acc (%)	F1 (%)	Recall (%)	Precision (%)
XGBoost	57 ± 1.6	69 ± 1.3	85 ± 2.0	58 ± 1.8
LSTM	58 ± 1.5	74 ± 1.3	100 ± 0.0	58 ± 1.5
GRU	58 ± 1.1	74 ± 0.9	100 ± 0.0	58 ± 1.1
LR	58 ± 1.7	73 ± 1.3	98 ± 0.7	58 ± 1.6
Random Forest	59 ± 1.7	74 ± 1.4	98 ± 0.8	59 ± 1.7

Fig 7

Confusion matrix of the Random Forest model for BRCA 1.
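To make the link between near-100% recall and weak learning concrete, the following Python sketch computes the four reported metrics from a confusion matrix. As a worked example it assumes a hypothetical classifier that labels every BRCA 1 sequence as positive, using the class counts from Table 3 as the whole evaluation set; the reported results come from repeated cross-validation, so this is an illustration rather than a reproduction.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical "always positive" classifier on the BRCA 1 counts of Table 3:
# 3111 Gleason grade group 5 (positive) and 2210 Gleason grade group 1 (negative).
always_positive = binary_metrics(tp=3111, fp=2210, fn=0, tn=0)
for name, value in always_positive.items():
    print(f"{name}: {100 * value:.0f}%")
# accuracy: 58%
# precision: 58%
# recall: 100%
# f1: 74%
```

These values coincide with the LSTM and GRU rows of Table 4 after rounding, which is consistent with the interpretation that those models assigned nearly all sequences to the positive class.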

Considering the results for the BRCA 2 gene (Table 5), the LR and GRU models achieved the highest accuracy, and their recalls of 99-100% again indicate that a large number of sequences were predicted as positive. The confusion matrix of the GRU model is shown in Fig 8.

Table 5

This table shows the results of the machine learning models using data from the BRCA 2 gene.

	Acc (%)	F1 (%)	Recall (%)	Precision (%)
LSTM	58 ± 1.5	73 ± 1.3	100 ± 0	58 ± 1.6
XGBoost	61 ± 1.3	74 ± 1	93 ± 1.3	62 ± 1.4
Random Forest	61 ± 0.1	75 ± 0.8	99 ± 0.6	61 ± 1.1
LR	62 ± 1.3	76 ± 0.1	99 ± 0.2	62 ± 1.3
GRU	62 ± 1.2	77 ± 0.9	100 ± 0	62 ± 1.2

Fig 8

Confusion matrix of the GRU model for BRCA 2.

While some of the machine learning models achieved just above average performance, they all classified most blood DNA sequences as positive (Gleason grade group 5), which suggests that no discriminatory signatures were discovered within the blood DNA sequences of patients who present with a Gleason grade group of 5 and a Gleason grade group of 1. This finding further indicates that there are still many opportunities for improvement with regard to designing more robust data representation methods and machine learning classifiers that are sensitive enough to detect discriminatory Gleason grade group signatures in DNA sequences.

4.3 Prediction of tumor DNA sequences

Having observed that the above machine learning models were not able to adequately find discriminatory signatures in the DNA sequences of the two Gleason grade groups, a new classification question was formulated: given tumor and matched-normal DNA sequences, can the models predict tumor DNA sequences? This new problem was formulated to further assess the usefulness of the machine learning models and to determine whether other classification problems can be learned using DNA sequences as the only input source to the models. In addition, a bigger dataset was used, containing 304 450 tumor DNA sequences and 305 214 matched-normal DNA sequences from the APC gene of colorectal cancer patients. Three machine learning models (LR, RF, and XGBoost) were evaluated to establish whether they can distinguish tumor DNA sequences from normal DNA sequences. The results (Table 6) show an overall improvement in the performance of the models compared to the results for the prediction of the Gleason grade group in the previous section. There, the models struggled to predict the Gleason grade group given DNA sequences; here, although there is plenty of room for improvement, the models were able to satisfactorily separate tumor DNA sequences from matched-normal DNA sequences. The confusion matrix of the highest performing model (XGBoost) is shown in Fig 9.

Table 6

This table shows the results of the machine learning models using data from the APC gene.

	Acc (%)	F1 (%)	Recall (%)	Precision (%)
LR	65 ± 0.1	67 ± 0.1	71 ± 0.1	63 ± 0.1
Random Forest	71 ± 0.1	75 ± 0.3	87 ± 0.3	66 ± 0.3
XGBoost	74 ± 0.1	79 ± 0.1	99 ± 0.0	66 ± 0.1

Fig 9

Confusion matrix of the XGBoost model for the APC gene.

The main limitation of this work is the use of a small sample size, particularly for the BRCA 1 and BRCA 2 DNA sequences. For this reason, the machine learning models were not able to competently distinguish Gleason grade group 5 DNA sequences from Gleason grade group 1 DNA sequences. Another limitation is the lack of sufficient prior research on this topic, particularly research that has used DNA sequences as the only input source to machine learning or deep learning classifiers for the prediction of the Gleason grade group. Consequently, it was difficult to benchmark the results of this work against those in the literature.

5 Conclusion

The goal of this work was to apply machine learning algorithms to the prediction of the Gleason grade group from blood DNA sequences of high-risk and low-risk prostate cancer patients. The machine learning models were not able to sufficiently discriminate between Gleason grade group 5 DNA sequences and Gleason grade group 1 DNA sequences. This was a result of having a large number of sequences that share a substantial amount of sequence homology. Even though this was mitigated by removing highly similar sequences, it was still not sufficient, as the machine learning classifiers still produced a high number of false positives and a negligible number of true negatives. Since the machine learning models were not able to discriminate between the DNA sequences of the two Gleason grade groups, they were further evaluated to determine their usefulness in the prediction of tumor DNA sequences from matched-normal DNA sequences. In this new problem, the models performed better than before. Future work involves the design of better data representation techniques that are sensitive enough to discover discriminatory signatures in small sample sizes of DNA sequences. These techniques should be generic in that they should not only be sensitive towards Gleason grade groups, but should extend to other prediction problems that are important in machine learning and cancer research.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This article focuses on Gleason grade group in blood DNA sequences of high-risk and low-risk prostate cancer prediction using machine learning algorithms. It primarily focuses on similarity index and robust data representation methods, which the majority of existing algorithms lack. The article is partially novel, whereas the contributions are very limited for a journal article. The decision on this manuscript may be considered after addressing the points below:

1. There are several machine learning algorithms in the literature, and it is unclear which machine learning algorithms were considered by the authors. It is recommended to name the specific machine learning algorithms used in the paper, instead of referring to machine learning in general.
2. It is recommended to summarize the list of contributions of the article in the introduction.
3. The literature review of the article is very poor. It is recommended to consider the recent literature on the specified topic. Several articles have been published recently on this topic; consider the papers published in the last three years.
4. It is recommended that the authors summarize the details of each studied paper and list its limitations. The limitations of the existing papers that are addressed in the proposed work should be mentioned.
5. Section 3 is too small. It may be written as a subsection under section 4. It is recommended to provide the citations for the datasets considered for this work.
6. The name of Section 4 (Methods) may be replaced with the actual name of the proposed work.
7. The name of subsection 4.3 must be changed.
8. The proposed method is not clear. It is recommended to explain the proposed work through an illustrative example.
9. The problem formulation and objective are not clear. It is recommended to provide them through theoretical discussion.
10. Discuss the computational complexity of the proposed model.
11. It is not clear how the machine learning algorithms address the challenges discussed in the problem.
12. The limitations of the proposed work must be discussed.
13. The experimental results are not enough to justify the performance of the proposed work. It is recommended to consider more metrics with different datasets to justify the performance of the proposed work.
14. Add the future scope of the work in the conclusion.

Reviewer #2:

1. The abstract must conclude the findings with quantitative results.
2. Experimental analysis with other machine learning methods is required.
3. An architecture diagram depicting LSTM and GRU needs to be included.
4. Statistical validation of the results is required.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/.
PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

6 Apr 2022

Dear Editors,

We would like to thank you for your generous time and comments in reviewing the manuscript, which we have revised to address your concerns. We believe that the manuscript is now in a suitable position for publication.

Reviewer #1:

Comment 1: There are several machine learning algorithms in the literature, and it is unclear which machine learning algorithms were considered by the authors. It is recommended to name the specific machine learning algorithms used in the paper, instead of referring to machine learning in general.
Response: Initially, the machine learning models were only briefly described. In the revised version, all the machine learning models, and how they work, are described (see section 3.3).

Comment 2: It is recommended to summarize the list of contributions of the article in the introduction.
Response: Agreed. The key contributions are now summarised in bullet form at the end of the introduction section.

Comment 3: The literature review of the article is very poor. It is recommended to consider the recent literature on the specified topic. Several articles have been published recently on this topic; consider the papers published in the last three years.
Response: The literature review has been improved and includes recent literature [2018-2020].

Comment 4: It is recommended that the authors summarize the details of each studied paper and list its limitations. The limitations of the existing papers that are addressed in the proposed work should be mentioned.
Response: The studied papers have been summarised and the limitations have been stated where necessary. The relationship between the literature and the problem formulated in this paper has also been stated at the end of the literature review section.

Comment 5: Section 3 is too small. It may be written as a subsection under section 4. It is recommended to provide the citations for the datasets considered for this work.
Response: Done. Section 3 has been moved under section 4. At this point, it is not possible to provide citations for the datasets considered in this work as they have not been published or made freely available to the public. However, access to the data can be granted by the authors, provided that all ethical procedures are followed.

Comment 6: The name of Section 4 (Methods) may be replaced with the actual name of the proposed work.
Response: Section 4 (now section 3) includes all the data preparation steps and machine learning algorithms used in this paper. It cannot be renamed to the actual name of the proposed work as no new algorithm has been proposed by this work. All the techniques used in this work already exist in the literature; it is their application that is unique.

Comment 7: The name of subsection 4.3 must be changed.
Response: Subsection 4.3 (now subsection 3.3) cannot be renamed as it describes all the machine learning algorithms that were used in this work.

Comment 8: The proposed method is not clear. It is recommended to explain the proposed work through an illustrative example.
Response: Figure 4 has been generated to provide a more detailed overview of all the methods followed in the paper, from the preprocessing steps to the machine learning algorithms.

Comment 9: The problem formulation and objective are not clear. It is recommended to provide them through theoretical discussion.
Response: An attempt to clarify the problem and objective has been made by highlighting the gaps in the literature at the end of the literature review, and by summarising the contributions of this work at the end of the introduction section.

Comment 10: Discuss the computational complexity of the proposed model.
Response: The models used in this work and how they work have been described in section 3.

Comment 11: It is not clear how the machine learning algorithms address the challenges discussed in the problem.
Response: The machine learning models were used to find discriminatory features within the DNA sequences and to try to map them to their correct phenotype (Gleason grade group of 1 or 5). To do this, binary classification models were created.

Comment 12: The limitations of the proposed work must be discussed.
Response: The limitations of this work are summarised at the end of the Results and discussion section (section 4).

Comment 13: The experimental results are not enough to justify the performance of the proposed work. It is recommended to consider more metrics with different datasets to justify the performance of the proposed work.
Response: More experiments have been done on a new and bigger dataset (APC gene), and an additional machine learning model (Random forest) was added to the experiments.

Comment 14: Add the future scope of the work in the conclusion.
Response: Done. The future work has been outlined towards the end of the conclusion section.

Reviewer #2:

Comment 1: The abstract must conclude the findings with quantitative results.
Response: Done. The abstract now includes a summary of the results.

Comment 2: Experimental analysis with other machine learning methods is required.
Response: An additional machine learning algorithm was used (Random forest). Also, the models were run on a bigger dataset (APC gene).

Comment 3: An architecture diagram depicting LSTM and GRU needs to be included.
Response: Done. Figures 2 and 3 depict an LSTM and a GRU unit.

Comment 4: Statistical validation of the results is required.
Response: The machine learning models were rerun and validated with Repeated K-fold cross-validation. This validation method returned the average accuracy, F1 score, recall, and precision, together with the average standard deviation, on the held-out data from the folds.

Submitted filename: Response to Reviewers.pdf

14 Apr 2022

Discriminatory Gleason grade group signatures of prostate cancer: An application of machine learning methods
PONE-D-21-39391R1

Dear Dr. Mokoatle,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Sathishkumar V E
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: No

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The authors extended the paper and addressed all the recommended comments, and this version is well improved. This paper can be considered for publication in this journal.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No

12 May 2022

PONE-D-21-39391R1
Discriminatory Gleason grade group signatures of prostate cancer: An application of machine learning methods

Dear Dr. Mokoatle:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Sathishkumar V E
Academic Editor
PLOS ONE
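
The Repeated K-fold validation described in the response to Reviewer #2 (Comment 4) can be illustrated with a minimal sketch. It is not the authors' code: the synthetic sequences, the 3-mer frequency features, and the nearest-centroid classifier are all assumptions chosen so that the example runs with the Python standard library alone, standing in for the whole genome data and the XGBoost, LSTM, GRU, LR, and RF models of the paper. Only the reporting scheme (mean and standard deviation of accuracy, F1 score, recall, and precision over the held-out folds) follows the correspondence.

```python
import random
import statistics
from collections import Counter
from itertools import product

K = 3  # k-mer length (assumed for illustration only)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_vector(seq):
    """Return normalised k-mer frequencies for one DNA sequence."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts.values())
    return [counts[k] / total for k in KMERS]


def make_toy_data(n_per_class, length, rng):
    """Synthetic stand-in data: class 1 is GC-rich, class 0 is uniform."""
    X, y = [], []
    for label, weights in ((0, [1, 1, 1, 1]), (1, [1, 3, 3, 1])):
        for _ in range(n_per_class):
            seq = "".join(rng.choices("ACGT", weights=weights, k=length))
            X.append(kmer_vector(seq))
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def scores(y_true, y_pred):
    """Accuracy, F1, recall and precision for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "f1": f1, "recall": recall, "precision": precision}


def repeated_kfold(n_samples, n_splits, n_repeats, rng):
    """Yield (train_idx, test_idx) for every fold of every reshuffled repeat."""
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def evaluate(X, y, n_splits=5, n_repeats=3, seed=0):
    """Return {metric: (mean, std)} over all held-out folds."""
    rng = random.Random(seed)
    per_fold = []
    for train, test in repeated_kfold(len(X), n_splits, n_repeats, rng):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in test]
        per_fold.append(scores([y[i] for i in test], preds))
    return {m: (statistics.mean([s[m] for s in per_fold]),
                statistics.stdev([s[m] for s in per_fold]))
            for m in per_fold[0]}


X, y = make_toy_data(n_per_class=40, length=200, rng=random.Random(42))
results = evaluate(X, y)
for metric, (mean, std) in results.items():
    print(f"{metric}: {mean:.2f} ± {std:.2f}")
```

Reshuffling before each repeat means every sample is held out once per repeat under a different partition, so the reported standard deviation reflects variation across both folds and partitions. The split here is not stratified; with small or imbalanced cohorts a stratified variant would likely be preferable.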

34 in total

1. Intraobserver and interobserver reproducibility of WHO and Gleason histologic grading systems in prostatic adenocarcinomas.

Authors: S O Ozdamar; S Sarikaya; L Yildiz; M K Atilla; B Kandemir; S Yildiz
Journal: Int Urol Nephrol Date: 1996 Impact factor: 2.370

Review 2. Sequence Similarity Searching.

Authors: Gang Hu; Lukasz Kurgan
Journal: Curr Protoc Protein Sci Date: 2018-08-13

3. A UK-based investigation of inter- and intra-observer reproducibility of Gleason grading of prostatic biopsies.

Authors: J Melia; R Moseley; R Y Ball; D F R Griffiths; K Grigor; P Harnden; M Jarmulowicz; L J McWilliam; R Montironi; M Waller; S Moss; M C Parkinson
Journal: Histopathology Date: 2006-05 Impact factor: 5.087

Review 4. Risk factors for prostate cancer.

Authors: K J Pienta; P S Esper
Journal: Ann Intern Med Date: 1993-05-15 Impact factor: 25.391

5. Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading.

Authors: Ellery Wulczyn; Kunal Nagpal; Matthew Symonds; Melissa Moran; Markus Plass; Robert Reihs; Farah Nader; Fraser Tan; Yuannan Cai; Trissia Brown; Isabelle Flament-Auvigne; Mahul B Amin; Martin C Stumpe; Heimo Müller; Peter Regitnig; Andreas Holzinger; Greg S Corrado; Lily H Peng; Po-Hsuan Cameron Chen; David F Steiner; Kurt Zatloukal; Yun Liu; Craig H Mermel
Journal: Commun Med (Lond) Date: 2021-06-30

6. Phase 3 study of adjuvant radiotherapy versus wait and see in pT3 prostate cancer: impact of pathology review on analysis.

Authors: Dirk Bottke; Reinhard Golz; Stephan Störkel; Axel Hinke; Alessandra Siegmann; Lothar Hertle; Kurt Miller; Wolfgang Hinkelbein; Thomas Wiegel
Journal: Eur Urol Date: 2013-03-17 Impact factor: 20.096

7. Associations of high-grade prostate cancer with BRCA1 and BRCA2 founder mutations.

Authors: Ilir Agalliu; Robert Gern; Suzanne Leanza; Robert D Burk
Journal: Clin Cancer Res Date: 2009-02-01 Impact factor: 12.531

8. Predicting Prostate Cancer Upgrading of Biopsy Gleason Grade Group at Radical Prostatectomy Using Machine Learning-Assisted Decision-Support Models.

Authors: Hailang Liu; Kun Tang; Ejun Peng; Liang Wang; Ding Xia; Zhiqiang Chen
Journal: Cancer Manag Res Date: 2020-12-22 Impact factor: 3.989

9. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

Authors: Ehsaneddin Asgari; Mohammad R K Mofrad
Journal: PLoS One Date: 2015-11-10 Impact factor: 3.240

10. Prostate Cancer Risks for Male BRCA1 and BRCA2 Mutation Carriers: A Prospective Cohort Study.

Authors: Tommy Nyberg; Debra Frost; Daniel Barrowdale; D Gareth Evans; Elizabeth Bancroft; Julian Adlard; Munaza Ahmed; Julian Barwell; Angela F Brady; Carole Brewer; Jackie Cook; Rosemarie Davidson; Alan Donaldson; Jacqueline Eason; Helen Gregory; Alex Henderson; Louise Izatt; M John Kennedy; Claire Miller; Patrick J Morrison; Alex Murray; Kai-Ren Ong; Mary Porteous; Caroline Pottinger; Mark T Rogers; Lucy Side; Katie Snape; Lisa Walker; Marc Tischkowitz; Rosalind Eeles; Douglas F Easton; Antonis C Antoniou
Journal: Eur Urol Date: 2019-09-06 Impact factor: 20.096