
Genome-Wide DNA Methylation Model for the Diagnosis of Prostate Cancer.

Abstract

Prostate cancer is the most prevalent and the second most lethal malignancy among males in the United States of America. Its diagnosis is almost entirely predicated upon histopathological analysis of the biopsied tissue, and it is associated with a substantial average error. Using genome-wide DNA methylation data derived from 469 prostatic tumor tissue samples and 50 normal prostatic tissue samples and interrogating over 485 000 CpG sites per sample (spanning across gene promoters, CpG islands, shores, shelves, gene bodies, and intergenic and other areas), we were able to develop a mathematical model that classified with a high accuracy (overall sensitivity = 95.31% and overall specificity = 94.00%) tumor tissue versus normal tissue. The methylation β values of five CpG sites, corresponding to the genes LINC01091, RPS15, SNORA10, and two unknown DNA areas in chromosome 1, provided the input to the model. The model was validated with unknown samples, as well as with a sixfold cross-validation and a leave-one-out cross-validation. This study presents a novel genomic model based on genome-wide DNA methylation analysis of biopsied prostatic tissue that could aid in the diagnosis of prostate cancer and help advance the transition to genomic medicine.


Year: 2019 PMID: 31552329 PMCID: PMC6751714 DOI: 10.1021/acsomega.9b01613

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Prostate Cancer and Its Diagnosis

Among males in the United States of America, prostate cancer is the most prevalent (approximately 165 000 new cases every year) and the second most lethal malignancy (approximately 29 000 deaths every year).[1] The pathological diagnosis of prostate cancer via a biopsy is associated with an average error of 25–30% in the case of underdetection and an average error of 1.3–7.1% in the case of overdetection,[2,3] and in some cases, the underdetection error is considerably higher.[4] Furthermore, the accuracy of a Gleason score—based on a scale used by pathologists to grade biopsied prostate tumors—is estimated to be 61%.[5] All those different types of error associated with the current method of prostate cancer diagnosis, in all likelihood, stem from the observation that histopathological analysis of tissue (performed at the tissue level) does not have the requisite resolution to discern those molecular processes and mechanisms, such as epigenetic modifications, that initiate tumorigenesis and drive its early evolution (occurring at the molecular level).

Epigenetic Modifications

Epigenetic modifications are molecular mechanisms that provide cells with the means to regulate gene expression without altering the DNA, and they include DNA methylation, chromatin remodeling, microRNAs (miRNAs), long intergenic nonprotein coding RNAs (lincRNAs), small nucleolar RNAs (snoRNAs), etc. Chromatin remodeling pertains to covalent modifications of histones, proteins that act as spools around which DNA is wound and compacted. Chromatin-remodeling mechanisms include histone methylation, acetylation, phosphorylation, ubiquitination, etc., and some of them, such as H3K4me3 and H3K9ac, have gene-activating functions, whereas others have gene-inactivating functions, such as H3K9me3 and H3K27me3.[6,7] miRNAs regulate messenger RNA (mRNA) and can suppress gene expression either by degrading mRNA or by inhibiting translation.[8] In addition to regulating mRNA, lincRNAs can activate or inactivate transcription-regulating genes and homeobox genes, and they are involved in transcription, translation, splicing, genomic imprinting, X-chromosome inactivation, etc.[9−11] Dysregulation and aberrant expression of lincRNAs have been observed in the pathogenesis of numerous types of cancer;[10,12,13] lincRNAs have been associated with various cancer subtypes;[12] and overexpression of lincRNAs has been observed to promote epithelial-to-mesenchymal transition and metastasis, both of which are characteristic of the most aggressive tumors.[13] In this study, the lincRNA LINC01091, which was found to be significantly unmethylated in the tumor cells as compared with normal cells, is one of the input variables to the model. 
snoRNAs modify mainly ribosomal RNA (rRNA), which is essential for protein synthesis, as well as transfer RNA (tRNA), small nuclear RNA (snRNA), and mRNA, and they are involved in ribosome biogenesis, splicing, and telomere maintenance.[14−16] Just as in the case of lincRNAs, dysregulation of snoRNAs has been linked to various types of cancer,[17,18] and the overexpression of some of them has been observed to promote tumorigenesis and to be inversely related to patient survival time.[18] In this study, one of the five input variables to the model is the snoRNA SNORA10, which was found to be significantly unmethylated in the tumor cells, and another input variable is the ribosomal protein RPS15, which was also found to be significantly unmethylated in the tumor cells.

DNA Methylation

One of the most common and important epigenetic modifications is DNA methylation, which is used by cells for gene-inactivation purposes. It involves the addition of a methyl group (−CH3) at specific locations of the DNA molecule, called CpG sites, where a cytosine nucleotide is bound to, and immediately followed by, a guanine nucleotide. CpG sites are highly concentrated in certain DNA areas, called CpG islands, where the vast majority of gene promoters lie; methylation of CpG islands results in stable gene inactivation.[19] Within a distance of about 2 000 base pairs away from the CpG islands, and extending in both directions, there are DNA areas with a lower concentration of CpG sites, called CpG island shores.[20] Extending farther, within a distance of about 2 000 base pairs away from the CpG island shores and in both directions, there are DNA areas with even lower concentration of CpG sites, called CpG island shelves.[21] Finally, in the DNA area which corresponds to the gene body, and which extends beyond the first exon, there are also CpG sites.[19] In normal cells, DNA methylation plays a most important role in embryonic stem cell differentiation, embryonic development, genomic imprinting, X-chromosome inactivation, transposable element inactivation, etc. In the case of tumor cells, on the other hand, the process of DNA methylation, like other epigenetic modifications and many other cellular processes, is hijacked, and it is used to activate (demethylate) genes that are vital or beneficial to them and to inactivate (methylate) genes that are detrimental to them. 
Aberrant alterations of DNA methylation, which include, among others, demethylation of genes that promote tumorigenesis and methylation of genes that suppress tumorigenesis, have been observed from CpG islands and the gene promoters to CpG island shelves and the CpG sites in gene bodies in many types of cancer.[20,22,23] It is interesting to point out here that the average levels of genome-wide DNA methylation begin to decline years before cancer diagnosis.[23]

Hypothesis

The hypothesis of this study is that tumor cells have effected significant changes in the methylomic profile of a number of genes and that, consequently, these methylomic changes can discriminate between tumor cells and normal cells.

Model Overview

In this study, we were able to develop a multivariable function, whose five input variables were the methylation β values of five CpG sites, corresponding to the genes LINC01091, RPS15, SNORA10, and two unknown DNA areas in chromosome 1 (genomic coordinates 201509316 and 220697615). The model (multivariable function) was developed during the training phase using approximately 70% of all available samples, that is, 329 tumor samples and 35 normal samples, and it was based on the methylomic analysis of over 485 000 CpG sites per sample (spanning across gene promoters, CpG islands, shores, shelves, gene bodies, and intergenic and other areas). Following its development, the model was validated with unknown samples (approximately 30% of all available samples, i.e., 140 tumor samples and 15 normal samples), which had been randomly preallocated at the beginning of the study and used only for testing purposes. Subsequently, the model was further tested with a sixfold cross-validation and a leave-one-out cross-validation. The overall performance of the model, including the training and the validation phases, was sensitivity = (447/469) = 95.31% and specificity = (47/50) = 94.00%.

Methods

Data

All data used in this study were the normalized data obtained from 469 prostatic tumor tissue samples and 50 normal prostatic tissue samples, generated from DNA methylation analysis of the tissue using the Infinium Human Methylation 450 Bead Chip by Illumina, and downloaded from The Cancer Genome Atlas of the National Cancer Institute under the category PRAD.

Clinical Methods

All tumor samples used in this study were primary tumors obtained from a biopsy used for the initial diagnosis and prior to the administration of any treatment. Furthermore, in this study, tumor samples were included only if they had been obtained from subjects that did not have a history of prior and/or other malignancy. The range of the Gleason scores of all 469 tumor samples was from 6 to 10 (Table S1). The Gleason grading system, with a possible score between 2 and 10, is currently used for the grading of prostate cancer and for the clinical prognosis. Patients with tumors with a Gleason score ≤ 6 are considered to have good prognoses (typically, no treatment is recommended),[24] whereas patients with tumors with the highest scores are considered to have the worst prognoses.

Statistical Methods

Using only the data from the training set, a methylomic analysis was performed. Given that the total number of variables (CpG sites) per sample was 485 577 and using the Bonferroni correction for multiple tests, the statistical significance level for the methylomic analysis was set at α = (0.05/485 577) = 1.03 × 10^–7 (two-tailed). Therefore, in order for any variable to be deemed statistically significant, it must have a P value < 1.03 × 10^–7. The calculation of the P value of a given variable was done in accordance with the individual conditions imposed by various statistical tests. More specifically, for those variables that were normally distributed with respect to both groups and met the equality of variance condition, the independent t-test was used to calculate their P value. For those variables that were normally distributed with respect to both groups and failed to meet the equality of variance condition, the Welch unequal-variance t-test was used to calculate their P value. Finally, for those variables that were not normally distributed with respect to both groups, the Mann–Whitney U test was used to calculate their P value. Moreover, in the case of the Mann–Whitney U test, although there were no ties in any of the variables, the approximate method with correction was used because of large group sample sizes. The Anderson–Darling test was used to assess normality, and the Levene absolute test for equal variances was used to assess the equality of variance throughout this study. The sixfold cross-validation and the leave-one-out cross-validation were performed using the same methods as in one of our previous studies.[25] Briefly, in the case of the former, six rounds (six folds) of training and testing were performed. In each round, approximately 5/6 of the total number of samples were randomly selected and used for training purposes, whereas the remaining approximately 1/6 of samples were used for testing.
At the end, the misclassification rate and the mean-squared error were calculated based on the total number of misclassifications. In the case of the leave-one-out cross-validation, the number of rounds was equal to the total number of samples (519). In each round, one sample was randomly left out and used for testing, and it was excluded from being selected again in any of the remaining rounds. At the end, the misclassification rate and the mean-squared error were calculated based on the total number of misclassifications. In addition to the aforementioned criterion of statistical significance (P < 1.03 × 10^–7), a criterion of biological significance was imposed. In order for a CpG site to be deemed significantly differentially methylated by the two groups, namely, T (tumor samples) and N (normal samples), the mean β value of one group must be β ≥ 0.60 and, at the same time, the mean β value of the other group must be β ≤ 0.40. This would ensure with great confidence that a particular CpG site was methylated (β ≥ 0.60) by one group and concurrently unmethylated (β ≤ 0.40) by the other group. In summary, both the statistical significance criterion and the biological significance criterion had to be met in order for a particular CpG variable to be deemed significant. The rationale for this follows next.
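The two criteria described above can be illustrated with a short sketch. It is not the authors' code: the function names and the default thresholds passed as parameters are illustrative assumptions, and only the numbers (0.05, 485 577, 0.60, 0.40) come from the text.

```python
# Illustrative sketch of the combined significance filter (names are hypothetical).
N_CPG_SITES = 485_577   # CpG variables per sample, as stated in the text
ALPHA_FAMILY = 0.05     # family-wise significance level, as stated in the text


def bonferroni_alpha(alpha=ALPHA_FAMILY, n_tests=N_CPG_SITES):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, mean_beta_tumor, mean_beta_normal,
                   alpha=None, high=0.60, low=0.40):
    """Return True only if both criteria hold.

    Statistical criterion: p_value below the Bonferroni-corrected level.
    Biological criterion: one group mean >= high while the other is <= low.
    """
    if alpha is None:
        alpha = bonferroni_alpha()
    statistical = p_value < alpha
    biological = (
        (mean_beta_tumor >= high and mean_beta_normal <= low)
        or (mean_beta_normal >= high and mean_beta_tumor <= low)
    )
    return statistical and biological


if __name__ == "__main__":
    print(f"per-test alpha = {bonferroni_alpha():.3e}")
    # A site with tumor mean 0.38 and normal mean 0.71 passes both criteria
    # when its P value is far below the corrected level.
    print(is_significant(1e-19, 0.38, 0.71))
```

A site that is highly significant statistically but whose group means both fall between 0.40 and 0.60 is rejected by this filter, which is the point of the second criterion.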

β Value

The methylation β value for a CpG site is defined by Illumina[21] as follows:

\[ \beta = \frac{M}{M + U + C} \qquad (1) \]

In eq 1, M is the methylated signal detected by probes with respect to both alleles at a particular CpG site; U is the unmethylated signal detected by probes with respect to both alleles at a particular CpG site; and C is a normalization constant (C > 0). The scale of the β value is continuous, and its range is the interval [0, 1). For all practical purposes, β may be considered as the methylation percentage of a particular CpG site with respect to both alleles expressed in the decimal form. A β value equal to zero indicates that a CpG site is unmethylated (0% methylated) with respect to both alleles (in this case, M = 0 and U = 1), whereas a β value very close to one indicates that a CpG site is 100% methylated with respect to both alleles (in this case, M = 1 and U = 0). In this study, we were solely interested in those CpG sites that were methylated by one group (either T or N) and, at the same time, were unmethylated by the other group. To state it differently and more broadly, the goal of this study was (a) to identify all the CpG sites in the human genome in which prostatic tumor cells effected significant methylomic alterations and (b) to use those CpG sites to develop a genomic model that could identify with a high accuracy tumor tissue versus normal tissue. To that end, we imposed the aforementioned stringent biological criterion of significance. More specifically, we did not follow the widely used β = 0.50 point as the criterion of methylation, such that if β > 0.50, a particular CpG site is considered methylated, and if β < 0.50, that CpG site is considered unmethylated. In our view, that criterion is very precarious and could easily be unreliable and misleading. The technology is not perfect, and the wrong signal by only a few probes could lead to β = 0.51 instead of β = 0.49. In this case, the particular CpG site is erroneously assumed to be methylated.
To avoid this type of error and to increase the confidence, we imposed the following and very conservative criterion of methylation: in order for a CpG site to be deemed methylated, it should have β ≥ 0.60, and conversely, in order for a CpG site to be deemed unmethylated, it should have β ≤ 0.40. This further means that in order to arrive at either a false-positive or a false-negative methylation conclusion, more than 20% of the probe signals would have to be erroneous, something which is not likely. Since, as was mentioned above, in this study, we were exclusively interested in those CpG sites that were methylated by one group (either T or N) and, at the same time, were unmethylated by the other group, we imposed the aforementioned stringent biological criterion of significance, whereby in order for a CpG site to be deemed biologically significant, that CpG site had to be methylated by one group (with a β ≥ 0.60) and, at the same time, had to be unmethylated by the other group (with a β ≤ 0.40).
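As a minimal sketch of the β value and the conservative methylation call described above, the snippet below assumes the usual form β = M / (M + U + C) with C > 0. The default offset of 100 and the label "indeterminate" for the band between 0.40 and 0.60 are assumptions made for illustration, not values or terms taken from this article.

```python
def beta_value(m, u, c=100.0):
    """Methylation beta value, assuming beta = M / (M + U + C) with C > 0.

    m: methylated signal intensity, u: unmethylated signal intensity.
    The default c=100 is an assumed offset, not a value from the article.
    """
    if m < 0 or u < 0 or c <= 0:
        raise ValueError("signals must be non-negative and c must be positive")
    return m / (m + u + c)


def methylation_call(beta, high=0.60, low=0.40):
    """Conservative call: methylated if beta >= high, unmethylated if beta <= low."""
    if beta >= high:
        return "methylated"
    if beta <= low:
        return "unmethylated"
    return "indeterminate"  # hypothetical label for the 0.40-0.60 band


if __name__ == "__main__":
    for m, u in [(0, 5000), (9900, 0), (2600, 2400)]:
        b = beta_value(m, u)
        print(m, u, round(b, 3), methylation_call(b))
```

Under the simple 0.50 rule, values such as 0.49 and 0.51 would receive opposite calls; here both fall in the uncalled band.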

Development of the Model

Approximately 70% of the total number of samples from each group were randomly selected for the training set, and the remaining approximately 30% of the samples were assigned to the validation set. Subsequently, the random selection of the tumor samples was modified in order for the following condition to be fulfilled. The training set had to comprise approximately 70% of the tumor samples with a particular Gleason score. To that end, approximately 70% of the tumor samples with a particular Gleason score were randomly selected and assigned to the training set, and the remaining approximately 30% were assigned to the validation set. Therefore, the training set comprised 329/469 tumor and 35/50 normal samples. In the 329 tumor samples, there were 30 with a Gleason score GS = 6; 163 with GS = 7; 42 with GS = 8; 91 with GS = 9; and 3 with GS = 10 (Table S2). The validation set comprised 140/469 tumor and 15/50 normal samples. In the 140 tumor samples, there were 13 with GS = 6; 69 with GS = 7; 18 with GS = 8; 39 with GS = 9; and 1 with GS = 10 (Table S3). Of the 485 577 variables interrogated during the methylomic analysis using the data from the training set, 2 913 met both of the aforementioned criteria of significance and provided the variable pool for the development of the model. Using the same methodology as in our previous studies,[25−28] we were able to develop the following multivariable function (eq 2). Briefly, multiple functions, with at most ten input variables for each one of them, were generated and tested with computer simulations using the training set. Equation 2 exhibited the best performance (highest sensitivity and highest specificity) in the training phase and was selected for validation.
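The Gleason-stratified 70/30 allocation described above might be sketched as follows. This is an assumption-laden illustration: the function name, the sample identifiers, the fixed seed, and the use of simple rounding within each stratum are all hypothetical, so the resulting counts can differ by a sample or so from the ones reported. Only the per-score totals (training plus validation counts from the text) are taken from the article.

```python
import random
from collections import defaultdict


def stratified_split(samples, stratum_of, train_fraction=0.70, seed=0):
    """Randomly split samples so each stratum contributes ~train_fraction to training."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for sample in samples:
        by_stratum[stratum_of(sample)].append(sample)

    train, validation = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        # Rounding rule is an assumption; the article only says "approximately 70%".
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        validation.extend(members[n_train:])
    return train, validation


# Tumor samples per Gleason score (training + validation counts given in the text).
gleason_counts = {6: 43, 7: 232, 8: 60, 9: 130, 10: 4}
# Hypothetical sample records: (sample_id, gleason_score).
tumor_samples = [(f"GS{gs}-{i}", gs) for gs, n in gleason_counts.items() for i in range(n)]

train, validation = stratified_split(tumor_samples, stratum_of=lambda s: s[1])

if __name__ == "__main__":
    print(len(train), len(validation))
```

Splitting within each Gleason score, rather than over the pooled tumor samples, keeps rare grades such as GS = 10 represented in both sets.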
In eq 2, X1 corresponds to the gene RPS15, and X1 = β1 × 100 (i.e., the β value of X1 expressed in the percentage form); X2 corresponds to the snoRNA SNORA10, and X2 = β2 × 100; X3 corresponds to the lincRNA LINC01091, and X3 = β3 × 100; X4 corresponds to the unknown DNA area in chromosome 1 with the genomic coordinate of 220697615, and X4 = β4 × 100; and X5 corresponds to the unknown DNA area in chromosome 1 with the genomic coordinate of 201509316, and X5 = β5 × 100. C is a constant and is equal to 10^3. Since X1–X5 are the respective β values of the aforementioned genes expressed in the percentage form and since the value of each one of them belongs in the continuous interval [0, 100) (eq 1), the domain of F (eq 2) is the interval [0, 100). Moreover, since X5 ≥ 0, the function F is continuous and defined throughout its domain [0, 100). From eq 2, one can see that F attains its minimum value when X1 = X2 = X3 = X4 = 0 and X5 ≈ 100, and in that case, Fmin ≈ 3.22. Similarly, when X1 = X2 = X3 = X4 ≈ 100 and X5 = 0, F attains its maximum value, and in that case, Fmax ≈ 29.01. Therefore, the range of the function F is the interval (3.22, 29.01). Based on all 519 samples used in this study, the observed range of F was [10.146, 17.127] (Table S1). The cutoff point (COP) that demarcates a score of a tumor sample from that of a normal sample was determined to be COP = 15.258 using a receiver operating characteristic (ROC) curve analysis on the results of the F function in the training phase, as well as taking into account the respective variances of the two groups (T vs N). An F score < COP signifies a tumor sample, whereas an F score ≥ COP signifies a normal sample. The COP can be adjusted in the future within the constraints of ROC curve analysis in order to optimize the performance of the model in a subsequent clinical trial study.
The methylomic analysis revealed that in the case of tumor cells, X1–X4 were significantly unmethylated (low β values) and X5 was significantly methylated (high β value) as compared with normal cells (Table 1). Taking those results into consideration, as well as the COP, one can see that the Fmin represents, at least theoretically, the worst possible methylomic state in which tumor cells could be. This makes biological sense because in that case, X1–X4 are completely unmethylated, while X5 is completely methylated, and this methylomic state is the farthest from the normal one. In the opposite direction, the Fmax represents theoretically the most extreme methylomic state in which normal cells could be: X1–X4 are completely methylated, while X5 is completely unmethylated. There are two more interesting possibilities that need to be considered here. In the case where X1 = X2 = X3 = X4 = X5 = 0, eq 2 yields F = 6.91. According to the COP, this score, which is very low, corresponds to a tumor sample, and this makes biological sense because four out of five markers (X1–X4) exhibit the worst possible methylomic profile that tumor cells could have: they are all completely unmethylated. Finally, in the case where X1 = X2 = X3 = X4 = X5 ≈ 100, eq 2 yields F ≈ 25.33. This score, which is very high, corresponds to a normal sample, and from a biology perspective, it makes sense because four out of five markers (X1–X4) exhibit the most extreme methylomic profile that normal cells could have: they are all completely methylated.
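The decision rule built on the cutoff point can be written down in a few lines. The sketch below takes an already computed F score as input (the function F itself is not reimplemented here); the function name is hypothetical, while the COP value and the four theoretical scores are the ones quoted in the text.

```python
COP = 15.258  # cutoff point reported in the text


def classify(f_score, cop=COP):
    """Apply the stated rule: F < COP means tumor, F >= COP means normal."""
    return "tumor" if f_score < cop else "normal"


if __name__ == "__main__":
    # Theoretical scores discussed in the text.
    for label, f in [("Fmin", 3.22), ("all markers at 0", 6.91),
                     ("all markers near 100", 25.33), ("Fmax", 29.01)]:
        print(label, f, classify(f))
```

Keeping the cutoff as a parameter mirrors the remark that the COP may later be adjusted within the constraints of ROC curve analysis.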

Table 1

DNA Methylation Analysis Results of the Five Input Variables of the Model Based on the Data of the Training Set

var.	gene symbol	gene ID	gen. coord.	M_T	SD_T	M_N	SD_N	FC	P
X₁	RPS15	6209	1440293	0.3830	0.1344	0.7145	0.0958	–1.8654	2.41 × 10^–19
X₂	SNORA10	574042	2012763	0.3970	0.1302	0.6984	0.0960	–1.8046	1.28 × 10^–19
X₃	LINC01091	285419	124694137	0.3255	0.1494	0.6525	0.0905	–2.0048	4.74 × 10^–19
X₄	unknown		220697615	0.2809	0.1618	0.6273	0.1007	–2.2332	1.03 × 10^–18
X₅	unknown		201509316	0.6082	0.2435	0.0794	0.1274	7.6578	9.10 × 10^–19

The FC was calculated as follows: R = MT/MN. If R ≥ 1, FC = R; if R < 1, FC = −1/R.
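The signed fold-change convention in the table footnote can be checked with a short sketch. The function name is hypothetical; the group means are taken from three rows of Table 1, and small differences from the tabulated FC values are expected because the means shown are rounded to four decimals.

```python
def fold_change(mean_tumor, mean_normal):
    """Signed fold change: R = MT/MN; FC = R if R >= 1, otherwise FC = -1/R."""
    r = mean_tumor / mean_normal
    return r if r >= 1 else -1.0 / r


if __name__ == "__main__":
    # (M_T, M_N) pairs from Table 1.
    rows = {"X1": (0.3830, 0.7145), "X4": (0.2809, 0.6273), "X5": (0.6082, 0.0794)}
    for name, (m_t, m_n) in rows.items():
        print(name, round(fold_change(m_t, m_n), 4))
```

A negative FC therefore indicates lower methylation in tumor than in normal tissue, and its magnitude is always at least 1.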

Principal Component Analysis

Following the same methodology as the one in one of our previous studies[29] and using the five variables (X1–X5), we performed a principal component analysis (PCA). The scores of the five principal components (PC1–PC5) are listed in Table S4.

Results

Methylomic Analysis

The methylomic results in connection with the five input variables of the model based on the data of the training set appear in Table 1, wherein the variable name, the gene symbol, the NCBI gene ID, the genomic coordinate, the mean β value (MT) of the T (tumor) group with its standard deviation (SDT), the mean β value (MN) of the N (normal) group with its standard deviation (SDN), the fold change (FC), and the probability of significance (P) for all five variables are listed.

Training

In the training phase, the model (eq 2) correctly identified 312/329 tumor samples and 33/35 normal samples (Table S2). The sensitivity, therefore, was 94.83%, and the specificity was 94.29%. The ROC AUC statistics were as follows: AUC = 0.97647, its standard error was AUC SE = 0.00964, z-value = 49.41, and the 95% confidence interval of the AUC value was [0.94773, 0.98949] (Figure 1A). The T group had a mean F score and a standard deviation of FT = 13.5061 ± 1.2254, and its median value was 13.5457. Using 100 000 Monte Carlo simulations and 100 000 Bootstrap simulations, the 99% confidence interval of the FT was calculated to be [13.3315, 13.6799]. The N group had a mean F score and a standard deviation of FN = 16.2719 ± 0.5735, and its median value was 16.3988. Using 100 000 Monte Carlo simulations and 100 000 Bootstrap simulations, the 99% confidence interval of the FN was calculated to be [16.0590, 16.5467]. The Mann–Whitney U test statistics were as follows: UT = 271, UN = 11 244, z-value = 9.2698, and the approximate probability of significance with correction was P = 1.87 × 10^–20 (Figure 1B).
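The training-phase Mann–Whitney figures above can be related to one another with a small sketch. It assumes the standard normal approximation with a continuity correction and no tie correction (the text states there were no ties), a two-sided P value, and the textbook identity AUC = U / (n1 × n2). The function names are hypothetical, and this is not necessarily the software procedure the authors used.

```python
import math


def mann_whitney_z(u, n1, n2, continuity=True):
    """z statistic from the normal approximation to U (no tie correction)."""
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    diff = abs(u - mean_u)
    if continuity:
        diff -= 0.5
    return diff / sd_u


def two_sided_p(z):
    """Two-sided P value for a standard normal z statistic."""
    return math.erfc(z / math.sqrt(2.0))


def auc_from_u(u, n1, n2):
    """AUC as the fraction of (group 1, group 2) pairs counted by U."""
    return u / (n1 * n2)


n_t, n_n = 329, 35        # training tumor and normal sample counts
u_t, u_n = 271, 11_244    # reported U statistics

z = mann_whitney_z(u_t, n_t, n_n)
p = two_sided_p(z)
auc = auc_from_u(u_n, n_t, n_n)

if __name__ == "__main__":
    print(u_t + u_n == n_t * n_n)   # the two U values sum to n1 * n2
    print(round(z, 4), p, round(auc, 5))
```

Under these assumptions the sketch comes out close to the reported z-value, P value, and AUC, which suggests the three statistics are mutually consistent.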

Figure 1

(A) ROC AUC curve of the model F in the training phase. (B) Combination graph (box plot, density plot, and dot plot) of the model F in the training phase. Red circles denote statistical outliers.

Validation with Unknown Samples

Using the 155 preallocated unknown samples (140 tumor and 15 normal samples), the model (eq 2) correctly identified 135/140 tumor samples and 14/15 normal samples (Table S3). The sensitivity, therefore, was 96.43%, and the specificity was 93.33%. The unknown T samples had a mean F score and a standard deviation of FT = 13.5759 ± 1.1354, and their median value was 13.7448. The unknown N samples had a mean F score and a standard deviation of FN = 16.1719 ± 0.8585, and their median value was 16.3709. The Mann–Whitney U test statistics for the two groups of unknown samples in connection with their F scores were as follows: UT = 91, UN = 2009, z-value = 5.8011, and the approximate probability of significance with correction was P = 6.59 × 10^–9.

Overall Performance

Combining the 364 samples (329 tumor and 35 normal) from the training phase with the 155 unknown samples (140 tumor and 15 normal) from the validation phase, the model’s overall performance was as follows: sensitivity = (447/469) = 95.31% and specificity = (47/50) = 94.00% (Figure 2 and Table S1). The ROC AUC statistics were as follows: AUC = 0.97053, its standard error was AUC SE = 0.01281, z-value = 36.73, and the 95% confidence interval of the AUC value was [0.93142, 0.98748] (Figure 3A). The T group had a mean F score and a standard deviation of FT = 13.5269 ± 1.1984, and its median value was 13.6241. Using 100 000 Monte Carlo simulations and 100 000 Bootstrap simulations, the 99% confidence interval of the FT was calculated to be [13.3848, 13.6706]. The N group had a mean F score and a standard deviation of FN = 16.2419 ± 0.6640, and its median value was 16.3939. Using 100 000 Monte Carlo simulations and 100 000 Bootstrap simulations, the 99% confidence interval of the FN was calculated to be [16.0343, 16.5067]. The Mann–Whitney U test statistics were as follows: UT = 691, UN = 22 759, z-value = 10.9454, and the approximate probability of significance with correction was P = 6.99 × 10^–28. The range of all 519 F scores was [10.146, 17.127] (Figure 3B). Finally, Table 2 shows the DNA methylation results of the five input variables of the model based on the combined data of the training and the validation sets. The variable name, the gene symbol, the NCBI gene ID, the genomic coordinate, the mean β value (MT) of the T (tumor) group with its standard deviation (SDT), the mean β value (MN) of the N (normal) group with its standard deviation (SDN), the FC, and the probability of significance (P) for all five variables are listed.
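The pooled sensitivity and specificity follow directly from the per-phase counts reported in the training and validation sections. The sketch below is only a worked restatement of that arithmetic; the dictionary layout and function names are illustrative, and tumor is treated as the positive class.

```python
def sensitivity(tp, fn):
    """Fraction of tumor samples classified as tumor."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of normal samples classified as normal."""
    return tn / (tn + fp)


# Counts from the text: correctly identified / total, per phase.
phases = {
    "training":   {"tp": 312, "fn": 329 - 312, "tn": 33, "fp": 35 - 33},
    "validation": {"tp": 135, "fn": 140 - 135, "tn": 14, "fp": 15 - 14},
}

# Pool the confusion counts over both phases.
overall = {key: sum(phase[key] for phase in phases.values())
           for key in ("tp", "fn", "tn", "fp")}

if __name__ == "__main__":
    for name, c in {**phases, "overall": overall}.items():
        print(name,
              f"sensitivity = {sensitivity(c['tp'], c['fn']):.2%}",
              f"specificity = {specificity(c['tn'], c['fp']):.2%}")
```

Pooling the raw counts, rather than averaging the two percentages, is what gives 447/469 and 47/50.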

Figure 2

3D scatter plot illustrating the overall performance of the model F.

Figure 3

(A) ROC AUC curve of the model F (overall performance). (B) Combination graph (box plot, density plot, and dot plot) of the model F (overall performance). Red circles denote statistical outliers.

Table 2

DNA Methylation Analysis Results of the Five Input Variables of the Model Based on the Combined Data of the Training and the Validation Sets

var.	gene symbol	gene ID	gen. coord.	M_T	SD_T	M_N	SD_N	FC	P
X₁	RPS15	6209	1440293	0.3865	0.1315	0.7077	0.1038	–1.8308	2.79 × 10^–26
X₂	SNORA10	574042	2012763	0.3928	0.1298	0.6908	0.1045	–1.7586	7.15 × 10^–26
X₃	LINC01091	285419	124694137	0.3293	0.1495	0.6534	0.1053	–1.9843	7.54 × 10^–26
X₄	unknown		220697615	0.2821	0.1588	0.6171	0.1218	–2.1876	6.45 × 10^–25
X₅	unknown		201509316	0.6055	0.2369	0.0766	0.1194	7.9034	6.26 × 10^–27

The FC was calculated as follows: R = MT/MN. If R ≥ 1, FC = R; if R < 1, FC = −1/R.


Cross-Validations

The sixfold cross-validation yielded a misclassification rate of 9.63% (50/519) and a mean-squared error of 0.096. The sensitivity was 0.900, and the specificity was 0.940. The Matthews correlation coefficient (MCC) was 0.643. The training and testing sets for each of the six rounds, along with the confusion matrix and other statistical information, appear in Table S5. Similarly, the leave-one-out cross-validation yielded a misclassification rate of 9.63% (50/519), a mean-squared error of 0.096, a sensitivity of 0.900, a specificity of 0.940, and an MCC of 0.643. The results of all 519 rounds, along with the confusion matrix and other statistical information, appear in Table S6.
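The cross-validation summary statistics can be reproduced from a single confusion matrix. The counts used below (422 true positives, 47 false negatives, 47 true negatives, 3 false positives, with tumor as the positive class) are not listed in the text; they are inferred here from the reported 50/519 misclassifications together with the reported sensitivity and specificity, so treat them as an assumption. The function name is hypothetical.

```python
import math


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Inferred counts (assumption): 47 tumor and 3 normal samples misclassified.
tp, fn, tn, fp = 422, 47, 47, 3
total = tp + fn + tn + fp

misclassification_rate = (fn + fp) / total
sens = tp / (tp + fn)
spec = tn / (tn + fp)
coefficient = mcc(tp, fn, tn, fp)

if __name__ == "__main__":
    print(total, round(misclassification_rate, 4),
          round(sens, 3), round(spec, 3), round(coefficient, 3))
```

With 0/1 class labels, the mean-squared error of the predicted labels equals the misclassification rate, which is consistent with the reported 0.096.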

Discussion

Recent, rapid advances in biotechnology have ushered in a genomic era. For the first time, researchers have acquired the capability to probe and monitor cellular events at the molecular level with unprecedented resolution and accuracy. Considering that the onset of any dysregulation of any cellular function results immediately in molecular changes and considering that dysregulation of any cellular function might take years to manifest itself at the tissue level, the importance of the newly acquired genomic capabilities becomes obvious, and so does the need to transition to genomic medicine. Utilizing one of those recent advances in biotechnology, namely, genome-wide DNA methylation, we developed and presented in this study a novel genomic test, operating at the molecular level, that could be used in the clinic to aid in the diagnosis of prostate cancer. A further clinical validation with a retrospective study would lead to the accomplishment of that goal, and it would also provide more information about the interesting subject of the few statistical outliers in this study. For example, in the N group, there were three samples with the lowest F scores for that group (from 14.669 to 13.398) (Figure 3B and Table S1), ranging from 2.37 to 4.28 standard deviations below the mean value of that group. Those three samples were determined to be statistical outliers, and they constituted the only three misclassifications of the model with respect to the N group. If they were to be removed from this study as extreme outliers, then the performance of the model would increase considerably (specificity = 100%). An interesting question that arises here is the following: were those three samples, deemed to be normal cells by the pathology analysis, in actuality tumor cells, or were they simply misclassified by the model? To further investigate this, we performed a PCA using the five variables (X1–X5).
The graph in Figure 4 shows the scores of the first principal component (PC1) versus the scores of PC2. Using a completely different method from that of our model, PCA, likewise, classified those three normal samples as extreme outliers; they are the three green spheres well above all other green spheres and embedded (though still visible) in the midst of the red spheres (tumor samples). This provides further and independent evidence about the actual nature of those three normal samples. Furthermore, similar questions could be posed regarding the T group. There were eight tumor samples with F scores higher than the mean F score of the N group (Table S1, Figures 2 and 3B).
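The standard-deviation distances quoted above for the normal-group outliers follow from the overall N-group mean and standard deviation (16.2419 ± 0.6640). The sketch below restates that arithmetic for the two extreme scores given in the text; the function name is hypothetical.

```python
def sd_below_mean(score, mean, sd):
    """Number of standard deviations by which a score lies below the group mean."""
    return (mean - score) / sd


MEAN_N, SD_N = 16.2419, 0.6640   # overall N-group F score statistics from the text

if __name__ == "__main__":
    for f_score in (14.669, 13.398):
        print(f_score, round(sd_below_mean(f_score, MEAN_N, SD_N), 2))
```

Both scores also lie below the cutoff point of 15.258, which is why these samples were classified as tumor by the model.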

Figure 4

PCA graph (PC1 scores vs PC2 scores).

In addition to its clinical applications, this study has research implications. Of the five input variables to the model, two are unknown DNA areas. It would be of great interest to elucidate the reason(s) that tumor cells repress (methylate) one of them (X5), whereas they derepress (demethylate) the other one (X4). Any gain in knowledge pertaining to the possible role of those two DNA areas in any cellular functions might lead to new understanding and strategies against cancer. Regarding two other variables, namely, the snoRNA SNORA10 (X2) and the lincRNA LINC01091 (X3), there exists very little research, even though the overexpression of other members of those two RNA families, as was mentioned earlier, has been associated with tumorigenesis. Discovering any advantages that the overexpression of those two RNAs confers to tumor cells would also be of great interest. Finally, regarding the gene RPS15 (X1), there is more research available, albeit to a limited extent. Upregulation of RPS15 has been observed to endow esophageal and gastric tumor cells with resistance to powerful chemotherapy agents.[30,31] Moreover, mutations in RPS15 at highly conserved sites resulting in gain of function have been identified as one of the drivers of chronic lymphocytic leukemia and have been associated with aggressive phenotype and shorter survival.[32,33] Finally, of the three known DNA areas used in this study (variables X1–X3), only one corresponds to a protein-coding gene, namely, RPS15. According to the results of our transcriptomic analysis, which will be presented in a future study, RPS15 was significantly and highly overexpressed in the tumor cells compared with the normal ones.
