Mpho Mokoatle, Darlington Mapiye, Vukosi Marivate, Vanessa M Hayes, Riana Bornman.
Abstract
One of the most precise methods for detecting prostate cancer is the evaluation of a stained biopsy by a pathologist under a microscope. Regions of the tissue are assessed and graded according to the observed histological pattern. However, this process is laborious, depends on the experience of the pathologist, and suffers from poor reproducibility of biopsy outcomes across pathologists. As a result, computational approaches are being sought, and machine learning has been gaining momentum in the prediction of the Gleason grade group. To date, the machine learning literature has addressed this problem using features from magnetic resonance images, whole slide images, tissue microarrays, gene expression data, and clinical features. However, there is a gap with regard to predicting the Gleason grade group using DNA sequences as the only input to the machine learning models. In this work, using whole genome sequence data from South African prostate cancer patients, machine learning and biological experiments were combined to understand the challenges associated with predicting the Gleason grade group. A series of machine learning binary classifiers (XGBoost, LSTM, GRU, LR, RF) were built that rely only on DNA sequence input features. None of the models was able to adequately discriminate between the DNA sequences of the studied Gleason grade groups (Gleason grade groups 1 and 5). The models were then evaluated on a second task: distinguishing tumor DNA sequences from matched-normal DNA sequences, again with DNA sequences as the only input. On this task the models performed better, with the XGBoost model achieving the highest accuracy of 74 ± 0.1, an F1 score of 79 ± 0.1, a recall of 99 ± 0.0, and a precision of 66 ± 0.1.
Year: 2022 PMID: 35679280 PMCID: PMC9182297 DOI: 10.1371/journal.pone.0267714
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Blood DNA sequences x transformed into k-mers with their corresponding Gleason grade group y.
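The transformation in Fig 1 can be illustrated with a minimal Python sketch. The function name `kmers`, the choice of k = 3, the stride of 1 (overlapping k-mers), and the example sequence are all assumptions made for illustration; the record does not state the values the authors used.

```python
from typing import List


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical labelled example: sequence x with Gleason grade group y.
x, y = "ATGCGA", 5
tokens = kmers(x, k=3)
print(tokens, y)
```

Each sequence then becomes a list of short tokens that can be fed to a text-style model, paired with its label y.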
Fig 2. Architecture of an LSTM unit [59].
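For reference, an LSTM unit is commonly written as the following update equations. This is the standard textbook formulation, not necessarily the notation of [59]: the symbols $W$, $U$, $b$ (weights and biases), $\sigma$ (logistic sigmoid), and $\odot$ (element-wise product) are assumptions introduced here.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```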
Fig 3. Architecture of a GRU unit [59].
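A GRU unit is commonly written with two gates and no separate cell state. As with any such summary, this is one standard convention and may differ from [59]; some references swap the roles of $z_t$ and $1 - z_t$ in the last line. The symbols $W$, $U$, $b$, $\sigma$, and $\odot$ are assumptions introduced here.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
```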
Fig 4. Summary of all the methods executed in this work.
Sequence similarity within Gleason grade groups 5 and 1 for BRCA 1 blood DNA sequences.
| Percentage of identical matches | Total no. of local alignments |
|---|---|
| | |
| 90-100 | 7170891 |
| 80-90 | 3685304 |
| 70-80 | 62500 |
| | |
| 90-100 | 7270628 |
| 80-90 | 3732281 |
| 70-80 | 56560 |
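The grouping used in this table, counting local alignments by percentage of identical matches, can be sketched as follows. The function names, the input list of percent identities, and the treatment of bin edges (a value of exactly 90 goes into 90-100) are assumptions; the record does not say how boundary values were assigned or which alignment tool produced them.

```python
from typing import Dict, Iterable, Optional

# Bins as they appear in the table, highest first.
BINS = [(90, 100), (80, 90), (70, 80)]


def bin_label(percent_identity: float) -> Optional[str]:
    """Return the table bin for one alignment, or None if below 70%."""
    for low, high in BINS:
        # Assumption: bins are half-open [low, high), except that 100 is
        # included in the top bin.
        if low <= percent_identity < high or (high == 100 and percent_identity == 100):
            return f"{low}-{high}"
    return None


def count_by_identity(identities: Iterable[float]) -> Dict[str, int]:
    """Count alignments per percent-identity bin."""
    counts = {f"{low}-{high}": 0 for low, high in BINS}
    for value in identities:
        label = bin_label(value)
        if label is not None:
            counts[label] += 1
    return counts


# Hypothetical percent identities for six local alignments.
print(count_by_identity([100.0, 95.2, 90.0, 89.9, 75.0, 60.0]))
```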
Sequence similarity within Gleason grade groups 5 and 1 for BRCA 2 blood DNA sequences.
| Percentage of identical matches | Total no. of local alignments |
|---|---|
| | |
| 90-100 | 6256450 |
| 80-90 | 910123 |
| 70-80 | 17970 |
| | |
| 90-100 | 6510144 |
| 80-90 | 932427 |
| 70-80 | 16167 |
Data count and distribution of classes after the removal of highly similar DNA sequences.
| | Gleason grade group 5 | Gleason grade group 1 |
|---|---|---|
| | 3111 (∼58%) | 2210 (∼42%) |
| | 3108 (∼62%) | 1941 (∼38%) |
Fig 5. Visualisation of TF-IDF k-mers for BRCA 1.
Fig 6. Visualisation of TF-IDF k-mers for BRCA 2.
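As a rough illustration of how k-mers can be weighted with TF-IDF before visualisation or modelling, the sketch below uses the plain textbook definition, tf × log(N / df). This is an assumption: the record does not state which TF-IDF variant or library was used, and libraries such as scikit-learn apply smoothing and normalisation by default, so their values would differ. The function name `tfidf` and the example k-mer lists are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many sequences each k-mer occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Two hypothetical sequences already split into 3-mers.
docs = [["ATG", "TGC", "GCA"], ["ATG", "TGA", "GAT"]]
weights = tfidf(docs)
print(weights[0])
```

Under this definition a k-mer that occurs in every sequence (here `ATG`) gets weight 0, which is one reason highly similar sequences give a classifier little to separate.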
This table shows the results of the machine learning models using data from the BRCA 1 gene.
| | Acc (%) | F1 (%) | Recall (%) | Precision (%) |
|---|---|---|---|---|
| | 57 ± 1.6 | 69 ± 1.3 | 85 ± 2.0 | 58 ± 1.8 |
| | 58 ± 1.5 | 74 ± 1.3 | 100 ± 0.0 | 58 ± 1.5 |
| | 58 ± 1.1 | 74 ± 0.9 | 100 ± 0.0 | 58 ± 1.1 |
| | 58 ± 1.7 | 73 ± 1.3 | 98 ± 0.7 | 58 ± 1.6 |
| | 59 ± 1.7 | 74 ± 1.4 | 98 ± 0.8 | 59 ± 1.7 |
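One way to read rows with 100% recall and accuracy equal to precision is to compare them with a classifier that labels every sequence as the positive class. The sketch below computes the metrics of such a baseline. It rests on several assumptions: the counts 3111 and 2210 are taken from the first data row of the class-distribution table, that row is treated as the BRCA 1 data, Gleason grade group 5 is treated as the positive class, and the evaluation data are assumed to share that class balance. None of these is confirmed by the record.

```python
from typing import Dict


def all_positive_metrics(n_positive: int, n_negative: int) -> Dict[str, float]:
    """Metrics of a baseline that predicts the positive class for every sample."""
    # Every positive is a true positive; every negative is a false positive.
    tp, fp, fn, tn = n_positive, n_negative, 0, 0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed class counts (see the assumptions stated above).
baseline = all_positive_metrics(n_positive=3111, n_negative=2210)
print({name: round(100 * value) for name, value in baseline.items()})
```

Under these assumptions the baseline gives roughly 58% accuracy, 58% precision, 100% recall, and 74% F1, which is close to several rows of the table and fits the abstract's statement that the models could not adequately discriminate between the two grade groups.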
Fig 7. Confusion matrix of the Random Forest model for BRCA 1.
This table shows the results of the machine learning models using data from the BRCA 2 gene.
| | Acc (%) | F1 (%) | Recall (%) | Precision (%) |
|---|---|---|---|---|
| | 58 ± 1.5 | 73 ± 1.3 | 100 ± 0 | 58 ± 1.6 |
| | 61 ± 1.3 | 74 ± 1 | 93 ± 1.3 | 62 ± 1.4 |
| | 61 ± 0.1 | 75 ± 0.8 | 99 ± 0.6 | 61 ± 1.1 |
| | 62 ± 1.3 | 76 ± 0.1 | 99 ± 0.2 | 62 ± 1.3 |
| | 62 ± 1.2 | 77 ± 0.9 | 100 ± 0 | 62 ± 1.2 |
Fig 8. Confusion matrix of the GRU model for BRCA 2.
This table shows the results of the machine learning models using data from the APC gene.
| | Acc (%) | F1 (%) | Recall (%) | Precision (%) |
|---|---|---|---|---|
| | 65 ± 0.1 | 67 ± 0.1 | 71 ± 0.1 | 63 ± 0.1 |
| | 71 ± 0.1 | 75 ± 0.3 | 87 ± 0.3 | 66 ± 0.3 |
| | 74 ± 0.1 | 79 ± 0.1 | 99 ± 0.0 | 66 ± 0.1 |
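The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below does this for the mean values in the APC table; the function name `f1_score` is introduced here for illustration, and the check ignores the reported ± spread.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs from the three rows of the APC table, as fractions.
apc_rows = [(0.63, 0.71), (0.66, 0.87), (0.66, 0.99)]
recomputed = [round(100 * f1_score(p, r)) for p, r in apc_rows]
print(recomputed)
```

The recomputed values agree with the reported F1 means of 67, 75, and 79.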
Fig 9. Confusion matrix of the XGBoost model for the APC gene.