Swathi Ramachandra Upadhya, Colm J Ryan.
Abstract
Large-scale studies of human proteomes have revealed only a moderate correlation between mRNA and protein abundances. It is unclear to what extent this moderate correlation reflects post-transcriptional regulation and to what extent it reflects measurement error. Here, by analyzing replicate profiles of tumors and cell lines, we show that there is considerable variation in the reproducibility of measurements of transcripts and proteins from individual genes. Proteins with more reproducible measurements tend to have a higher mRNA-protein correlation, suggesting that measurement reproducibility accounts for a substantial fraction of the unexplained variation between mRNA and protein abundances. The reproducibility of individual proteins is somewhat consistent across studies, and we exploit this to develop an aggregate reproducibility score that explains a substantial amount of the variation in mRNA-protein correlations across multiple studies. Finally, we show that pathways previously reported to have a higher-than-average mRNA-protein correlation may simply contain members that can be more reproducibly quantified.
Keywords: cancer; gene expression; machine learning; post-transcriptional regulation; proteogenomics; proteomics; reproducibility; transcriptomics
Year: 2022 PMID: 36160043 PMCID: PMC9499981 DOI: 10.1016/j.crmeth.2022.100288
Source DB: PubMed Journal: Cell Rep Methods ISSN: 2667-2375
Analysis of mRNA-protein correlation using a standardized pipeline
| Data | Published year | Reported correlation | Protein inclusion criterion in reported correlation | Computed median Spearman correlation | Computed median Pearson correlation |
|---|---|---|---|---|---|
| GTEx 32 healthy tissues (GTEx) | 2020 | 0.46 | <5 tissues with missing values for both protein and RNA measurements | 0.51 | 0.59 |
| Cancer Cell Line Encyclopedia (CCLE) | 2020 | 0.48 | quantified in at least one ten-plex (9 cell lines) | 0.46 | 0.48 |
| NCI-60 cancer cell lines (NCI60) | 2019 | not reported | – | 0.36 | 0.40 |
| Glioblastoma (GBM) | 2021 | not reported | – | 0.50 | 0.51 |
| Head and neck squamous cell carcinoma (HNSCC) | 2021 | 0.52 | <50% missing values | 0.54 | 0.56 |
| Lung adenocarcinoma (LUAD) | 2020 | 0.53 | <50% missing values | 0.55 | 0.56 |
| Endometrial cancer (EC) | 2020 | 0.48 | contain mRNA and protein measurements across all patients | 0.48 | 0.51 |
| Breast cancer (BrCa 2020) | 2020 | 0.41 | contain mRNA and protein measurements (proteins <70% missing values) | 0.44 | 0.43 |
| Clear cell renal carcinoma (ccRCC) | 2019 | 0.43 | contain mRNA and protein measurements across all patients | 0.41 | 0.42 |
| Colon cancer (colon) | 2019 | 0.48 | top 10% most variably expressed proteins quantified in both platforms | 0.27 | 0.28 |
| Ovarian cancer (ovarian) | 2016 | 0.45 | contain mRNA and protein measurements across all patients | 0.41 | 0.41 |
| Breast cancer (BrCa 2016) | 2016 | 0.39 | contain mRNA and protein measurements across all patients passing quality control checks | 0.42 | 0.42 |
| Colon and rectal cancer (CRC 2014) | 2014 | 0.23 | protein measurement with average spectral count across all patients ≥1.4 | 0.21 | 0.22 |
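The "computed" columns above summarize gene-wise correlations between mRNA and protein abundance across samples. A minimal sketch of such a calculation follows, assuming two gene-by-sample matrices with shared gene and sample labels. The function name `genewise_correlation` and the `min_samples` cutoff are hypothetical and do not reproduce the inclusion criteria of the standardized pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats


def genewise_correlation(mrna, protein, min_samples=10):
    """Correlate mRNA and protein abundance per gene across shared samples.

    mrna, protein: DataFrames indexed by gene, one column per sample.
    min_samples: hypothetical minimum number of paired, non-missing values.
    """
    genes = mrna.index.intersection(protein.index)
    samples = mrna.columns.intersection(protein.columns)
    rows = {}
    for gene in genes:
        # Keep only samples where both measurements are present.
        pair = pd.DataFrame({
            "mrna": mrna.loc[gene, samples],
            "protein": protein.loc[gene, samples],
        }).dropna()
        if len(pair) < min_samples:
            continue
        rows[gene] = {
            "spearman": stats.spearmanr(pair["mrna"], pair["protein"])[0],
            "pearson": stats.pearsonr(pair["mrna"], pair["protein"])[0],
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(30)]
genes = [f"g{i}" for i in range(50)]
mrna = pd.DataFrame(rng.normal(size=(50, 30)), index=genes, columns=samples)
protein = mrna + rng.normal(scale=1.0, size=mrna.shape)

corr = genewise_correlation(mrna, protein)
print(corr.median())  # median Spearman and Pearson across genes
```

Taking the median over genes gives one summary value per study, which is the form in which the table reports both coefficients.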
Figure 1. Protein-protein reproducibility across replicates is moderate and variable
(A) Overview of the replicates available for the three different proteomic studies.
(B) For each study, we calculate the Spearman correlation for individual proteins across the proteomic replicates. The distribution of the protein-protein reproducibility is shown in the histogram for all measured proteins. For each study, the black dashed line represents the median.
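The per-protein reproducibility described in (B) can be sketched as a Spearman correlation between two replicate profiles of the same samples. The snippet below is an illustration under assumptions: it takes exactly two replicate matrices (proteins by samples) with matching labels, and the name `protein_reproducibility` is hypothetical.

```python
import numpy as np
import pandas as pd


def protein_reproducibility(rep_a, rep_b):
    """Spearman correlation per protein between two replicate matrices.

    rep_a, rep_b: DataFrames indexed by protein, one column per sample.
    """
    proteins = rep_a.index.intersection(rep_b.index)
    samples = rep_a.columns.intersection(rep_b.columns)
    # Transpose so that each protein becomes a column for corrwith.
    a = rep_a.loc[proteins, samples].T
    b = rep_b.loc[proteins, samples].T
    return a.corrwith(b, method="spearman")


# Synthetic replicates: a shared signal plus independent noise.
rng = np.random.default_rng(1)
proteins = [f"p{i}" for i in range(100)]
samples = [f"s{i}" for i in range(20)]
truth = rng.normal(size=(100, 20))
rep_a = pd.DataFrame(truth + rng.normal(scale=0.8, size=truth.shape),
                     index=proteins, columns=samples)
rep_b = pd.DataFrame(truth + rng.normal(scale=0.8, size=truth.shape),
                     index=proteins, columns=samples)

repro = protein_reproducibility(rep_a, rep_b)
print(repro.median())  # corresponds to the dashed median line in a histogram
```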
Figure 2. Proteins with higher reproducibility have higher mRNA-protein correlation
(A–C) Boxplots showing the distribution of mRNA-protein correlation for proteins binned according to their protein-protein reproducibility in the colon (A), ovarian (B), and CCLE (C) studies. The total number of proteins considered for each plot is indicated at the top right corner. The bins are deciles, each containing ∼10% of the proteins. The decile is indicated on the x axis along with the highest correlation between experimental replicates present within that decile. For each box plot, the black central line represents the median, the bottom and top edges of the box represent the 1st and 3rd quartiles, and the whiskers extend to 1.5 times the interquartile range past the box. Outliers are not shown. The median of each decile is indicated above/below the black central line for each box plot. The median mRNA-protein correlation across all proteins for each study is indicated as a dotted gray line in each plot. The R² obtained from regressing the mRNA-protein correlation on protein-protein reproducibility is in the bottom right corner.
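The binning and regression behind these panels can be sketched as follows. This is an illustrative reconstruction rather than the analysis code: `decile_summary` is a hypothetical name, and ties are broken by rank before binning so that every decile holds roughly the same number of proteins.

```python
import numpy as np
import pandas as pd
from scipy import stats


def decile_summary(reproducibility, mrna_protein_corr):
    """Median mRNA-protein correlation per reproducibility decile, plus R^2."""
    df = pd.DataFrame({"repro": reproducibility,
                       "corr": mrna_protein_corr}).dropna()
    # Rank first so tied values cannot create duplicate bin edges.
    ranks = df["repro"].rank(method="first")
    df["decile"] = pd.qcut(ranks, 10, labels=False) + 1
    medians = df.groupby("decile")["corr"].median()
    # R^2 from a simple linear regression of correlation on reproducibility.
    fit = stats.linregress(df["repro"], df["corr"])
    return medians, fit.rvalue ** 2


# Synthetic example: correlation rises with reproducibility, plus noise.
rng = np.random.default_rng(2)
repro = rng.uniform(-0.2, 0.9, size=500)
corr = 0.5 * repro + rng.normal(scale=0.2, size=500)

medians, r2 = decile_summary(repro, corr)
print(medians)
print(r2)
```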
Figure 3. Proteins with high reproducibility in one study are also highly reproducible in other studies
(A–C) Binned heatmaps showing the relationship between the protein-protein reproducibility calculated in different studies. Each heatmap shows the relationship between two studies, indicated on the x and y axes. The regions of the heatmaps are colored according to the number of proteins present in the region as indicated in the color bar. The number of proteins in common and Spearman correlation between the two studies, with the associated p value, are specified in the box for each of the plots.
(D and E) For each study with experimental protein replicates, scatter plots illustrating the relationship between protein-protein reproducibility are shown for a protein with high reproducibility, GBP1 (D), and a protein with low reproducibility, RPS29 (E). For each scatter plot, the Spearman correlation coefficient of the protein-protein reproducibility and the associated p value is indicated at the bottom.
Figure 4. Aggregated protein reproducibility ranks partially explain the variable mRNA-protein correlation in 10 additional studies
(A–J) For studies without experimental proteomic replicates, boxplots showing the distributions of mRNA-protein correlation for proteins in each decile of the aggregated protein reproducibility ranks. (A)–(H) are the CPTAC tumor studies; (I) is the NCI-60 cancer cell lines study wherein protein quantification, used for computing the mRNA-protein correlation, is obtained from data-independent acquisition-based untargeted proteomics (SWATH-MS); and (J) is the healthy tissues study from the GTEx Consortium. Box plot details as in Figure 2.
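The caption does not spell out how the aggregated ranks are built. One plausible construction, shown here purely as an assumption, converts each study's reproducibility values to percentile ranks and averages them over the studies in which a protein was measured. The `min_studies` threshold and the study column names are hypothetical.

```python
import numpy as np
import pandas as pd


def aggregate_reproducibility(per_study, min_studies=2):
    """Average within-study percentile ranks of protein reproducibility.

    per_study: DataFrame indexed by protein, one column per study,
               NaN where the protein was not measured.
    min_studies: assumed minimum number of studies needed for a score.
    """
    ranks = per_study.rank(axis=0, pct=True)  # NaN entries stay NaN
    n_studies = ranks.notna().sum(axis=1)
    return ranks.mean(axis=1)[n_studies >= min_studies]


# Toy reproducibility values for four proteins in three studies.
per_study = pd.DataFrame(
    {
        "colon": [0.1, 0.5, 0.9, np.nan],
        "ovarian": [0.2, 0.4, 0.8, 0.6],
        "ccle": [np.nan, 0.3, 0.7, np.nan],
    },
    index=["A", "B", "C", "D"],
)

aggregated = aggregate_reproducibility(per_study)
print(aggregated.sort_values())
```

Working on ranks rather than raw correlations keeps studies with different overall reproducibility on a common scale before averaging.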
Figure 5. Protein reproducibility is mainly influenced by abundance, variance, and unique peptides and not protein half-lives
(A–C) Boxplots showing the distribution of aggregated protein reproducibility ranks for proteins binned according to protein abundance (A), variance (B), and number of unique peptides (C). Box plot details as in Figure 2.
(D) Boxplot showing the distribution of aggregated protein reproducibility ranks for proteins with short and long protein half-lives.
Figure 6. Transcriptomic reproducibility contributes to the variance in mRNA-protein correlation
(A) Histogram showing the distribution of the gene-wise correlation between experimental transcriptomic replicates of 382 cancer cell lines. The black line represents the median.
(B) For each of the 13 studies analyzed here, the R-squared obtained by regressing mRNA-protein correlation on transcriptomic reproducibility and aggregated protein reproducibility scores individually and in combination over the same set of proteins is shown in the dot plot. The number of proteins analyzed for each study is indicated in brackets below the study on the y axis.
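The comparison in (B) rests on fitting the same response with each factor alone and with both together. A minimal sketch using ordinary least squares in NumPy follows; the helper name `r_squared` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def r_squared(y, *predictors):
    """R^2 of an ordinary least squares fit of y on the given predictors."""
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column followed by the predictors.
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic per-protein values for one hypothetical study.
rng = np.random.default_rng(3)
n = 1000
protein_repro = rng.uniform(0, 1, size=n)
rna_repro = rng.uniform(0, 1, size=n)
mrna_protein_corr = (0.5 * protein_repro + 0.3 * rna_repro
                     + rng.normal(scale=0.2, size=n))

r2_protein = r_squared(mrna_protein_corr, protein_repro)
r2_rna = r_squared(mrna_protein_corr, rna_repro)
r2_both = r_squared(mrna_protein_corr, protein_repro, rna_repro)
print(r2_protein, r2_rna, r2_both)
```

Because the single-factor models are nested in the combined one, the combined R² can never be lower than either individual value when all three are fitted on the same set of proteins.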
Figure 7. Metabolic pathways with higher-than-average mRNA-protein correlations may reflect differential reproducibility
Bar charts displaying the KEGG pathway enrichment analysis of the CCLE mRNA-protein correlation before (left) and after (right) accounting for protein-protein and mRNA-mRNA reproducibility. The −log10 of Benjamini-Hochberg false discovery rate (FDR)-corrected p values calculated using Mann-Whitney U test is used to assess enrichment for the pathway. For each bar chart, the gray line indicates the threshold considered for significant enrichment (FDR < 0.05). If the enrichment is below the threshold, then it is not considered significant. The bars are colored orange if the median mRNA-protein correlation of genes within the pathway is greater than the median mRNA-protein correlation of genes not in the pathway; otherwise, the bars are colored blue.
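The enrichment test described in this caption can be sketched as below: a two-sided Mann-Whitney U test comparing genes inside and outside each pathway, followed by Benjamini-Hochberg correction. The function names, the pathway names, and the synthetic data are hypothetical, and the step that accounts for reproducibility is not reproduced here.

```python
import numpy as np
import pandas as pd
from scipy import stats


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)
    return out


def pathway_enrichment(corr, pathways):
    """corr: Series mapping gene to mRNA-protein correlation.
    pathways: dict mapping pathway name to a collection of gene names."""
    rows = []
    for name, members in pathways.items():
        in_pathway = corr.index.isin(list(members))
        inside, outside = corr[in_pathway], corr[~in_pathway]
        if len(inside) == 0 or len(outside) == 0:
            continue
        p = stats.mannwhitneyu(inside, outside, alternative="two-sided").pvalue
        rows.append({"pathway": name, "p": p,
                     "higher": inside.median() > outside.median()})
    res = pd.DataFrame(rows).set_index("pathway")
    res["fdr"] = benjamini_hochberg(res["p"])
    res["neg_log10_fdr"] = -np.log10(res["fdr"])
    return res


# Synthetic correlations; the first 30 genes are shifted upwards.
rng = np.random.default_rng(4)
genes = [f"g{i}" for i in range(200)]
values = rng.normal(0.4, 0.15, size=200)
values[:30] += 0.3
corr = pd.Series(values, index=genes)
pathways = {"shifted": genes[:30], "unshifted": genes[100:130]}

res = pathway_enrichment(corr, pathways)
print(res)
```

In the bar charts, the `higher` flag would determine the bar color (orange if true, blue otherwise) and `neg_log10_fdr` the bar length.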
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| cBioPortal | ||
| Cancer Dependency Map (DepMap) 20Q4 | ||
| LinkedOmics | ||
| CPTAC Python API | ||
| CORUM 3.0 | ||
| KEGG Pathway | ||
| Colorectal cancer transcriptomics | ||
| Colorectal cancer proteomics | Published supplemental Table S4 | |
| Ovarian cancer transcriptomics | ||
| Ovarian cancer proteomics | Published supplemental Table S2 | |
| Breast Cancer (2016) transcriptomics | ||
| Breast Cancer (2016) proteomics | Published supplemental Table S3 | |
| Colon Cancer | ||
| Clear cell renal carcinoma | ||
| Breast Cancer (2020) | ||
| Endometrial Cancer | ||
| Lung Adenocarcinoma | ||
| Head and Neck Squamous Cell Carcinoma | ||
| Glioblastoma | ||
| NCI60 cancer cell lines | Published supplemental Tables S6 and S1 | |
| Cancer Cell Line Encyclopedia (CCLE) transcriptomics | ||
| CCLE proteomics | Published supplemental Tables S2 and S3 | |
| GTEx healthy tissues | Published supplemental Tables S3 and S4 | |
| RNA-seq of 675 commonly used human cancer cell lines | ||
| Protein half-life | Published supplemental Table S3 | |
| NCI CPTAC DREAM Proteogenomics challenge prediction scores of the best performing model (Team Guan) | ||
| All analysis code | This study | |
| Python version 3.8 | Python Software Foundation | |
| Pandas 1.2.5 | ||
| Numpy 1.20.2 | ||
| StatsModels 0.12.2 | ||
| SciPy 1.7.1 | ||
| Matplotlib 3.3.4 | ||
| Seaborn 0.11.0 | ||