Literature DB >> 31956794

Whole Exome and Transcriptome RNA-Sequencing Model for the Diagnosis of Prostate Cancer.

Jason B Nikas¹, Nikos T Mitanis², Emily G Nikas³.

Abstract

In our previous study, we developed a genome-wide DNA methylation model for the diagnosis of prostate cancer, and we pointed out that a considerable average error is associated with the current method for the diagnosis of prostate cancer, which is predicated on pathological assessment of biopsied tissue. In this study, we utilized whole exome and transcriptome RNA-sequencing (RNA-seq) data that were derived from 468 tumor samples and 51 normal samples of prostatic tissue, and we analyzed over 20,000 genes per sample. We were able to develop a mathematical model that classified tumor tissue versus normal tissue with a high accuracy. The overall sensitivity was 97.01%, and the overall specificity was 94.12%. The input variables to the model were the mRNA expression values of the following nine genes: ANGPT1, MED21, AOX1, PLP2, HPN, HPN-AS1, EPHA10, NKX2-3, and LRFN1. The model was validated with unknown samples, with a 10-fold cross-validation, and a leave-one-out cross-validation. We present here a genomic model, based on a whole exome and transcriptome RNA-seq analysis of biopsied prostatic tissue, that could be utilized in the diagnosis of prostate cancer.

Entities: CellLine Chemical Disease Gene Species

Year: 2019 PMID： 31956794 PMCID： PMC6964263 DOI： 10.1021/acsomega.9b02995

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Prostate Cancer

In our previous study,[1] we pointed out that the current diagnosis of prostate cancer, based on pathological analysis of biopsied tissue, is associated with an average error of 25–30% in connection with underdetection and an average error of 1.3–7.1% in connection with overdetection.[2,3] In some cases, the underdetection error is substantially higher.[4] In addition, the error of a Gleason score—a pathological assessment scale used for grading biopsied prostate tumors—is estimated to be 39%.[5] Those sources of error associated with the current method of prostate cancer diagnosis most likely stem from the fact that pathological assessment of tissue does not have the necessary resolution to detect—much less assess and monitor—those molecular processes, such as gene expression alterations, which initiate tumorigenesis and drive its early evolution and which occur at the molecular level. Furthermore, those molecular processes that initiate tumorigenesis and drive its early evolution begin to perturb a host of other molecular processes and mechanisms, such as DNA methylation and other gene expression alterations, years before cancer diagnosis,[6] that is to say, years before cancer manifests itself at the tissue level.

Gene Expression and Cancer

Gene expression, the most fundamental cellular process, is driven by the nature and the state of cells and, equally importantly, by the environment in which they find themselves. That means that cell function and survival are the most dominant determinants of gene expression. Transcriptomics, the study of gene expression that relies on measurement of RNA molecules, is the most powerful and the most well-established method of assessing and monitoring the biology of cells.[7] Moreover, transcriptomics affords the most expeditious means of observing changes imposed onto cells by their state or their environment. In the case of tumor cells, gene expression, like many other cellular processes, is appropriated, and it is used to activate or overexpress genes that are vital or beneficial to them and to suppress or silence genes that are detrimental to their survival. Following the advent of next-generation sequencing technologies, RNA sequencing (RNA-seq), with its unprecedented accuracy and sensitivity, quickly became the dominant technology for gene expression measurements.[8] In addition to accuracy, sensitivity, and high resolution at the molecular level, RNA-seq revolutionized the entire field of transcriptomics by vastly expanding it so that whole exome and transcriptome sequencing, alternative splicing, RNA editing, fusion transcripts, allele-specific expression, post-transcriptional modifications, and many noncoding RNAs became essential components of it.[7,9] As a result, cancer transcriptome profiling has started being introduced into the clinic to improve diagnosis, prognosis, and treatment.[7]

Hypothesis

The hypothesis of this study is that tumor cells have induced significant alterations in the gene expression of a number of genes and that, as a result of this, the transcriptomic profile of the tumor cells is significantly different from that of the normal cells.

Overview of the Model

We developed a multivariable function, whose nine input variables were the mRNA expression values of the following nine genes: ANGPT1, MED21, AOX1, PLP2, HPN, HPN-AS1, EPHA10, NKX2-3, and LRFN1. The model, this multivariable function, was developed during the training phase using approximately 70% of all available samples (328 tumor samples and 36 normal samples), and it was based on the transcriptomic analysis of over 20,000 genes per sample. After its development, the model was first validated with unknown samples [approximately 30% of all available samples (140 tumor samples and 15 normal samples)], which had been randomly preallocated at the beginning of the study and used only for testing purposes. Following its validation with unknown samples, the model was further tested with a 10-fold cross-validation and a leave-one-out cross-validation. The overall performance of the model, including the training and validation phases, was sensitivity = (454/468) = 97.01% and specificity = (48/51) = 94.12%.

Methods

Data

The data used in this study were the normalized data obtained from 468 prostatic tumor tissue samples and 51 normal prostatic tissue samples, generated from whole exome and transcriptome RNA-seq analysis of tissue using the Illumina HiSeq 2000 sequencer and downloaded from The Cancer Genome Atlas of the National Cancer Institute under the category PRAD.

Clinical Methods

All clinical methods used in this study were the same as the ones we employed in our previous study.[1] Briefly, all tumor samples used in this study were primary tumors and had been obtained from subjects that did not have a history of prior and/or other malignancy. The Gleason scores of all 468 tumor samples ranged from 6 to 10 (Table S1).

Statistical Methods

The statistical methods used in this study were the same as the ones we used in our previous study and are presented in great detail there.[1] Briefly, using only the data from the training set, a transcriptomic analysis was performed. Taking into account that the total number of variables (transcripts) per sample was 20,531 and utilizing the Bonferroni correction for multiple tests, the statistical significance level for the transcriptomic analysis was set at α = (0.05/20,531) = 2.43 × 10–6 (two-tailed). Therefore, in order for any variable to be deemed statistically significant, it must have a P value of < 2.43 × 10–6.

Development of the Model

A random sample selection was performed to generate the training and validation sets so that the training set comprised approximately 70% of the total number of samples from each group and the validation set comprised the remaining approximately 30% of the samples. Furthermore, in accordance with the same methods as in our previous study,[1] the random selection of the tumor samples was modified in order for the training set to comprise approximately 70% of the tumor samples with a particular Gleason score and in order for the validation set to comprise approximately 30% of the tumor samples with a particular Gleason score (Tables S2 and S3). Of the 20,531 variables (transcripts) assessed during the transcriptomic analysis using the data from the training set, 4484 met the above-mentioned criterion of significance and provided the variable pool for the development of the model. Employing the same methodology as in our previous studies,[1,10−13] we were able to develop the following multivariable function (eq ). Equation exhibited the best performance in terms of sensitivity and specificity in the training phase and was selected for validation.In eq , X1 represents the mRNA expression value of the gene ANGPT1, X2 the gene MED21, X3 the gene AOX1, X4 the gene PLP2, X5 the gene HPN, X6 the gene HPN-AS1, X7 the gene EPHA10, X8 the gene NKX2-3, and X9 the gene LRFN1. Since X1 to X9 are the respective mRNA expression values of the aforementioned nine genes and since, theoretically, mRNA expression values belong in the continuous interval [0, +∞), the theoretical domain of F (eq ) is the interval [0, +∞). Moreover, since X5, X6, X7, X8, and X9 are all ≥0, the function F is continuous and is defined throughout its domain [0, +∞). From eq , it can be seen that F attains its minimum value when X1 = X2 = X3 = X4 = 0 and when, concomitantly, X5, X6, X7, X8, and X9 are very large. In this case, F → 0. Conversely, when X1, X2, X3, and X4 are very large and when, concomitantly, X5 = X6 = X7 = X8 = X9 = 0, F attains its maximum value. In this case, F → +∞. Therefore, the theoretical range of F is the interval (0, +∞). In practice, however, a very large value of mRNA expression may be in the order of 106. In fact, the largest expression value observed in this study was 6.9 × 106. Assuming, therefore, a very large expression value to be 107, we can observe the behavior of F in the two aforementioned extreme cases. In the case where X1 = X2 = X3 = X4 = 0 and X5 = X6 = X7 = X8 = X9 = 107, Fmin = 0.047, which is close to zero, the theoretical absolute minimum value to which F approaches. In the case where X1 = X2 = X3 = X4 = 107 and X5 = X6 = X7 = X8 = X9 = 0, Fmax = 6259.289. The transcriptomic analysis revealed that in the case of tumor cells, X1 to X4 were significantly underexpressed, whereas X5 to X9 were significantly overexpressed as compared with normal cells (Table ). In view of this finding, the Fmin represents theoretically the worst possible transcriptomic state in which tumor cells could be. From a biological perspective, that makes sense because in that case, X1 to X4 are completely suppressed, while X5 to X9 are completely upregulated, and this transcriptomic state is the farthest from the normal one. Conversely, the Fmax represents theoretically the most extreme transcriptomic state in which normal cells could be: X1 to X4 are completely upregulated, while X5 to X9 are completely suppressed, a state that is the farthest from that of the tumor cells. The observed range of F, based on all 519 samples of this study, was [4.251, 44.127] (Table S1).

Table 1

mRNA Expression Analysis Results of the Nine Input Variables of the Model Based on the Data of the Training Seta

var.	gene symbol	gene ID	M_T	SD_T	M_N	SD_N	FC	P
X₁	ANGPT1	284	237.857	214.325	953.763	487.196	–4.010	2.68 × 10^–19
X₂	MED21	9412	428.052	115.507	702.998	167.748	–1.642	4.97 × 10^–19
X₃	AOX1	316	407.088	436.165	2187.538	2056.420	–5.374	2.90 × 10^–17
X₄	PLP2	5355	751.555	429.088	1648.100	510.636	–2.193	5.34 × 10^–17
X₅	HPN	3249	5840.868	3401.767	1087.377	1031.123	5.372	2.24 × 10^–18
X₆	HPN-AS1	100,128,675	15.413	14.192	1.533	1.506	10.057	7.68 × 10^–18
X₇	EPHA10	284,656	236.742	129.399	53.527	31.985	4.423	6.04 × 10^–19
X₈	NKX2-3	159,296	36.437	39.944	1.935	4.304	18.831	1.24 × 10^–17
X₉	LRFN1	57,622	199.837	92.912	71.144	37.270	2.809	9.02 × 10^–18
	TBPb	6908	246.777	44.032	250.043	37.409	–1.013	0.950

The fold change was calculated as follows: R = MT/MN. If R ≥ 1, then FC = R; if R < 1, then FC = −1/R.

Control gene.

The fold change was calculated as follows: R = MT/MN. If R ≥ 1, then FC = R; if R < 1, then FC = −1/R. Control gene. Using a receiver operating characteristic (ROC) curve analysis on the results of the F function in the training phase and taking into account the respective variances of the two groups (T vs N), the cut-off point (COP) that demarcates a score of a tumor sample from that of a normal sample was determined to be COP = 23.162. An F score less than the COP designates a tumor sample, whereas an F score greater than or equal to the COP designates a normal sample. Using an ROC curve analysis in a future clinical trial validation study, the COP can be adjusted to optimize the performance of the model.

Results

Transcriptomic Analysis

The transcriptomic results in connection with the nine input variables of the model based on the data of the training set appear in Table . The variable name, the gene symbol, the NCBI gene ID, the mean mRNA expression value (MT) of the T (tumor) group with its standard deviation (SDT), the mean mRNA expression value (MN) of the N (normal) group with its standard deviation (SDN), the fold change (FC) (representing the mRNA expression change in the tumor cells as compared with that in the normal cells), and the probability of significance (P) for all nine variables are listed.

Training

In the training phase, the model (eq ) identified correctly 316/328 tumor samples and 34/36 normal samples (Table S2). Therefore, the sensitivity was 96.34%, and the specificity was 94.44%. The ROC area-under-the-curve (AUC) statistics were as follows: AUC = 0.98044, its standard error was AUC SE = 0.00718, z-value = 66.94, and the 95% confidence interval of the AUC value was [0.95998, 0.99049] (Figure A). The T group had a mean F score and standard deviation of FT = 13.0445 ± 4.9146, and its median value was 12.4581. Based on 100,000 Monte Carlo and bootstrap simulations, the 99% confidence interval of the FT was determined to be [12.3270, 13.7350]. The N group had a mean F score and standard deviation of FN = 28.7701 ± 5.1239, and its median value was 29.3081. Similarly, based on 100,000 Monte Carlo and bootstrap simulations, the 99% confidence interval of the FN was determined to be [26.5718, 30.9196]. The Mann–Whitney U test statistics were as follows: U = 231, U = 11,577, z-value = 9.4652, and the approximate probability of significance with correction was P = 2.93 × 10–21 (Figure A,B).

Figure 1

(A) ROC AUC curve of the model F in the training phase. (B) ROC AUC curve of the model F (overall performance).

Figure 2

(A) Combination graph (box plot, density plot, and dot plot) of the model F in the training phase. Red circles denote statistical outliers. (B) Comparative histogram of the model F in the training phase.

(A) ROC AUC curve of the model F in the training phase. (B) ROC AUC curve of the model F (overall performance). (A) Combination graph (box plot, density plot, and dot plot) of the model F in the training phase. Red circles denote statistical outliers. (B) Comparative histogram of the model F in the training phase.

Validation with Unknown Samples

In the validation phase, using the 155 preallocated unknown samples (140 tumor and 15 normal samples), the model (eq ) identified correctly 138/140 tumor samples and 14/15 normal samples (Table S3). Therefore, the sensitivity was 98.57%, and the specificity was 93.33%. The unknown T samples had a mean F score and standard deviation of FT = 13.4263 ± 4.2591, and their median value was 12.8706. The unknown N samples had a mean F score and standard deviation of FN = 30.2478 ± 7.2911, and their median value was 30.5702. The Welch unequal-variance t-test statistics for the two groups of unknown samples in connection with their F scores were as follows: t = 8.7764, degrees of freedom = 15.04, and the probability of significance was P = 2.65 × 10–07.

Overall Performance

When we pooled together the 364 samples (328 tumor and 36 normal) from the training phase with the 155 unknown samples (140 tumor and 15 normal) from the validation phase, the model’s overall performance was as follows: sensitivity = (454/468) = 97.01% and specificity = (48/51) = 94.12% (Figure and Table S1). The ROC AUC statistics were as follows: AUC = 0.97796, its standard error was AUC SE = 0.00901, z-value = 53.06, and the 95% confidence interval of the AUC value was [0.95113, 0.99014] (Figure B). The T group had a mean F score and standard deviation of FT = 13.1587 ± 4.7268, and its median value was 12.5619. Based on 100,000 Monte Carlo and bootstrap simulations, the 99% confidence interval of the FT was calculated to be [12.5949, 13.7157]. The N group had a mean F score and standard deviation of FN = 29.2047 ± 5.8073, and its median value was 29.3748. Similarly, based on 100,000 Monte Carlo and bootstrap simulations, the 99% confidence interval of the FN was calculated to be [27.1356, 31.2805]. The Mann–Whitney U test statistics were as follows: UT = 526, UN = 23,342, z-value = 11.2169, and the approximate probability of significance with correction was P = 3.37 × 10–29. The range of all 519 F scores was [4.251, 44.217] (Figure A,B). Table shows the mRNA expression results of the nine input variables of the model based on the combined data of the training and validation sets. The variable name, the gene symbol, the NCBI gene ID, the mean mRNA expression value (MT) of the T (tumor) group with its standard deviation (SDT), the mean mRNA expression value (MN) of the N (normal) group with its standard deviation (SDN), the FC, and the probability of significance (P) for all nine variables are listed.

Figure 3

Figure 4

(A) Combination graph (box plot, density plot, and dot plot) of the model F (overall performance). Red circles denote statistical outliers. (B) Comparative histogram of the model F (overall performance).

Table 2

mRNA Expression Results of the Nine Input Variables of the Model Based on the Combined Data of the Training and Validation Setsa

var.	gene symbol	gene ID	M_T	SD_T	M_N	SD_N	FC	P
X₁	ANGPT1	284	243.843	208.138	994.899	542.859	–4.080	5.38 × 10^–25
X₂	MED21	9412	429.729	111.912	709.904	164.239	–1.652	2.59 × 10^–26
X₃	AOX1	316	414.447	428.663	2216.079	1938.192	–5.347	4.18 × 10^–24
X₄	PLP2	5355	743.570	411.813	1673.877	548.394	–2.251	2.19 × 10^–23
X₅	HPN	3249	5721.015	3497.136	1066.070	1031.282	5.366	1.73 × 10^–25
X₆	HPN-AS1	100,128,675	15.033	13.945	1.555	1.710	9.670	2.19 × 10^–24
X₇	EPHA10	284,656	233.751	124.943	51.056	32.721	4.578	8.99 × 10^–27
X₈	NKX2-3	159,296	35.180	38.607	1.871	4.075	18.799	1.63 × 10^–24
X₉	LRFN1	57,622	197.464	93.427	69.171	36.344	2.855	5.60 × 10^–25
	TBPb	6908	246.959	44.381	251.222	38.559	–1.017	0.732

The fold change was calculated as follows: R = MT/MN. If R ≥ 1, then FC = R; if R < 1, then FC = −1/R.

Control gene.

3D scatter plot illustrating the overall performance of the model F. The blue transparent plane is parallel to the x–y plane and intersects the z-axis (F score) at z = 23.162, which is the cut-off point (COP) used by the model such that an F score less than the COP signifies a tumor sample, whereas an F score greater than or equal to the COP signifies a normal sample. Therefore, the blue transparent plane represents the cut-off plane in the 3D space of all 519 samples. (A) Combination graph (box plot, density plot, and dot plot) of the model F (overall performance). Red circles denote statistical outliers. (B) Comparative histogram of the model F (overall performance). The fold change was calculated as follows: R = MT/MN. If R ≥ 1, then FC = R; if R < 1, then FC = −1/R. Control gene.

Cross-Validations

The 10-fold cross-validation yielded a misclassification rate of 5.59% (29/519) and a mean-squared error of 0.0559. The sensitivity was 94.44%, and the specificity was 94.12%. The Matthews correlation coefficient (MCC) was 0.7540. The training and testing sets for each of the 10 rounds, as well as the confusion matrix and other statistical information, appear in Table S4. Similarly, the leave-one-out cross-validation yielded a misclassification rate of 5.59% (29/519) and a mean-squared error of 0.0559. The sensitivity was 94.44%, the specificity was 94.12%, and the MCC was 0.7540. The results of all 519 rounds, as well as the confusion matrix and other statistical information, appear in Table S5.

Discussion

In this study, using whole exome and transcriptome RNA-seq, we developed and presented a genomic model that could be utilized in the diagnosis of prostate cancer. A subsequent clinical validation with a larger, multicenter retrospective study would constitute the next step for this model and would advance its application closer to the clinic. Such a multicenter study would be able to minimize the innate error of the current diagnostic method (reliance on morphology, that is, visual inspection of the tissue) to classify samples, and it would also provide more information about the few statistical outliers in this study. For example, in the T group, there were eight samples with the highest F scores for that group (from 26.323 to 31.580) (Figures and 4A and Table S1), ranging from 2.79 to 3.90 standard deviations above the mean value of that group. All of those eight samples were determined to be statistical outliers, and they were misclassified by the model. If they were to be removed from the study as extreme outliers, then the performance of the model would increase considerably [sensitivity = (462/468) = 98.72%]. We should point out here that seven of those eight statistical outliers were also misclassified by the DNA methylation model in our previous study.[1] This provides independent evidence and raises questions about the actual nature of those seven samples, all of which were determined to be tumor samples by the pathology report. From the different perspectives of two different molecular analyses (gene expression and DNA methylation), those seven samples should be classified in the heart of the N group. Furthermore, in the N group, there were three samples with the lowest F scores for that group (19.352, 17.211, and 13.886) (Figures and 4A and Table S1), being 1.70, 2.07, and 2.64 standard deviations below the mean value of that group, respectively. Those samples constituted the only three misclassifications of the model with respect to the N group (Table S1). Interestingly, two out of those three samples, the ones with the two lowest F scores (17.211 and 13.886), were also misclassified by the DNA methylation model in our previous study.[1] This suggests that those two samples, which were deemed normal by the pathology report, should be classified as tumor samples according to both the gene expression analysis and the DNA methylation analysis. The results of this study, as well as those of the previous one,[1] demonstrate the capability of genomic methods and modeling to improve current cancer diagnosis. Furthermore, genomic methods and modeling, based on molecular analyses, can be used to improve cancer prognosis as well; genomic models can be developed and trained to detect a characteristic signature of the most aggressive tumor cells and to make accurate predictions about the survival of patients and about the future course of the disease.[12,13] According to the results of our transcriptomic analysis (Tables and 2), the genes ANGPT1, MED21, AOX1, and PLP2 (variables X1 to X4) were significantly suppressed in the tumor cells compared with normal cells. Suppression of ANGPT1 has been observed to correlate with progression and metastasis of prostate cancer, as well as with patients’ reduced relapse-free survival.[14] Suppression of AOX1 has been observed in prostate cancer and has been associated with reduced patients’ survival and disease recurrence.[15] Our transcriptomic analysis (Tables and 2) further revealed that the genes HPN, HPN-AS1, EPHA10, NKX2-3, and LRFN1 (variables X5 to X9) were significantly upregulated in the tumor cells compared with normal cells. A number of studies have descried that HPN is an oncogene, the overexpression of which promotes tumorigenesis and invasiveness of tumor cells in various different types of cancer,[16,17] particularly prostate cancer.[18]EPHA10 has been found to be overexpressed in breast, colon, and prostate cancers,[19] and its overexpression has been associated with stage progression and metastasis of tumor cells in breast cancer.[20] Overexpression of the homeobox gene NKX2-3 has been linked to tumorigenesis in marginal-zone B-cell lymphomas.[21] Finally, LRFN1 has been reported to be overexpressed in prostate cancer,[22] and its suppression has been observed to decrease cell viability and survival of pancreatic tumor cells.[23]

4 in total

1. A transcriptome-wide association study identifies novel candidate susceptibility genes for prostate cancer risk.

Authors: Duo Liu; Jingjing Zhu; Dan Zhou; Emily G Nikas; Nikos T Mitanis; Yanfa Sun; Chong Wu; Nicholas Mancuso; Nancy J Cox; Liang Wang; Stephen J Freedland; Christopher A Haiman; Eric R Gamazon; Jason B Nikas; Lang Wu
Journal: Int J Cancer Date: 2021-09-25 Impact factor: 7.396

2. Rapid Detection Method for Pathogenic Candida Captured by Magnetic Nanoparticles and Identified Using SERS via AgNPs^.

Authors: Shan Hu; Haiquan Kang; Feng Gu; Chongwen Wang; Siyun Cheng; Wenjing Gong; Liping Wang; Bing Gu; Ying Yang
Journal: Int J Nanomedicine Date: 2021-02-11

3. A novel tumor mutational burden-based risk model predicts prognosis and correlates with immune infiltration in ovarian cancer.

Authors: Haoyu Wang; Jingchun Liu; Jiang Yang; Zhi Wang; Zihui Zhang; Jiaxin Peng; Ying Wang; Li Hong
Journal: Front Immunol Date: 2022-08-08 Impact factor: 8.786

4. Streamlining Quantitative Analysis of Long RNA Sequencing Reads.

Authors: Sebastian Oeck; Alicia I Tüns; Sebastian Hurst; Alexander Schramm
Journal: Int J Mol Sci Date: 2020-10-01 Impact factor: 5.923