Behrooz Darbani, Charles Neal Stewart.
Abstract
BACKGROUND: Reliability and reproducibility are key metrics for gene expression assays. This report assesses the utility of the correlation coefficient in the analysis of reproducibility and reliability of gene expression data.
Keywords: Gene expression; Reliability; Reproducibility
Year: 2014 PMID: 25984486 PMCID: PMC4376515 DOI: 10.1186/2241-5793-21-3
Source DB: PubMed Journal: J Biol Res (Thessalon) ISSN: 1790-045X Impact factor: 1.889
Figure 1. Correlation-based reproducibility and reliability assays on real gene expression data. A publicly available RNA-Seq dataset from rice, including two root-sample replicates and two shoot-sample replicates, is used as an example to illustrate the power of the correlation coefficient in the reproducibility assay of read counts as gene expression data. Briefly, the reads were mapped to the rice genome, and the mapped total gene, unigene, and total exon reads were used (A1, A2 and B1, B2). The large reduction in bias, from ~65% in A1 and B1 to ~12% in A2 and B2, after both reference gene-based correction (A) and read count-based correction (B) did not increase the correlation coefficient. (C) As the level of noise increases, the correlation becomes weaker. The slopes are nearly the same and approach 1, but the lower correlation results from the higher level of noise in C compared to A2 and B2. The level of noise in the data and the slope of the inter-replicate regression line are helpful, but they might not be precise enough for a reproducibility assay. After data correction (in A2 and B2), the correlation was unchanged but the slope improved, i.e. its deviation from 1 decreased. However, 12% of the genes still show 50% or greater inter-replicate variation in expression, which could not be detected even by taking both the correlation and the slope into account. (A1α, B1α) Logarithmic transformation of the data can change the difference between sample replicates, which appears as a changed slope deviation on the scatter plots. (A2α, B2α) Narrow-interval scatter plots allow us to observe noise in the dataset. The scatter plots show an almost clear ±0.5x variation. 350, 354: rice untreated root sample replicates; 349, 353: rice untreated shoot sample replicates. We evaluated the expression of 24122 to 24701 genes, depending on the type of reads and samples.
The log10 transformed data were used to calculate the Pearson correlation. All correlations were significant (P = 0.000) by a two-tailed test.
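The caption's central observation, that removing a systematic bias improves the slope but leaves the correlation coefficient untouched, can be reproduced on synthetic data. The following is a minimal sketch, not the authors' pipeline: the expression values, the 1.65x bias, the noise level, the total-based rescaling, and the definition of "50% or greater variation" as a replicate ratio deviating from 1 by at least 0.5 are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def fraction_over_50(x, y):
    """Share of genes whose replicate ratio deviates from 1 by 50% or more
    (an assumed definition of inter-replicate variation)."""
    return sum(abs(b / a - 1) >= 0.5 for a, b in zip(x, y)) / len(x)


random.seed(0)
# Synthetic expression values; these are not the rice data from the figure.
rep1 = [10 ** random.uniform(0, 4) for _ in range(5000)]
# Replicate 2: assumed 1.65x systematic bias plus multiplicative noise.
rep2 = [1.65 * v * math.exp(random.gauss(0, 0.3)) for v in rep1]

# Assumed read count-based correction: rescale so the totals match.
scale = sum(rep1) / sum(rep2)
rep2_corrected = [v * scale for v in rep2]

r_before, r_after = pearson_r(rep1, rep2), pearson_r(rep1, rep2_corrected)
slope_before = ols_slope(rep1, rep2)
slope_after = ols_slope(rep1, rep2_corrected)
still_variable = fraction_over_50(rep1, rep2_corrected)

print(f"r: {r_before:.4f} -> {r_after:.4f}")
print(f"slope: {slope_before:.3f} -> {slope_after:.3f}")
print(f"genes with >=50% variation after correction: {still_variable:.1%}")
```

Because the Pearson coefficient is invariant to rescaling one of the variables, the correction cannot change it. Only the slope and the per-gene ratios respond to the correction.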
Reproducibility of the microarray data
| Experiment | Uncorrected: average number of genes with >50% inter-replicate variation | Uncorrected: r | Uncorrected: slope | Uncorrected: A.D (%) | Quantile-corrected: average number of genes with >50% inter-replicate variation | Quantile-corrected: r | Quantile-corrected: slope | Quantile-corrected: A.D (%) |
|---|---|---|---|---|---|---|---|---|
| | 8488/22810 (37%) | 0.975 | 0.804 | 91% | 2000/22810 (8.8%) | 0.980 | 0.982 | 26% |
| | 6433/22840 (28%) | 0.992 | 1.22 | 34% | 519/22840 (2.3%) | 0.994 | 0.992 | 8.2% |
| | 1645/10928 (15%) | 0.997 | 1.257 | 26% | 396/10928 (3.6%) | 0.997 | 0.980 | 12% |
| | 1059/10208 (10%) | 0.996 | 1.032 | 28% | 299/10208 (2.9%) | 0.997 | 0.981 | 10% |
| | 837/10208 (8.2%) | 0.995 | 0.999 | 22% | 300/10208 (2.9%) | 0.996 | 0.990 | 15% |
Several publicly available microarray datasets were analyzed using the Affymetrix Expression Console build 1.3.1.187 following the 3′ Expression Arrays-RMA protocol. Both the corrected and uncorrected data were extracted. The Pearson correlations between the replicates were calculated on the log2-transformed data. Following our method, the average of the ratios' deviations was also calculated to evaluate its usefulness compared with the correlation coefficient.
Experiment: the experiments are available at http://www.ebi.ac.uk/arrayexpress/experiments/browse.html. Correlation (r): the average of the Pearson correlations between the replicates in each experiment. Slope: the average of the slopes of the inter-replicate regression lines in each experiment. A.D: the average of the ratios' deviations.
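To make the comparison between the correlation coefficient and a ratio-based measure concrete, here is a small hedged sketch. The source does not spell out how the average of the ratios' deviations is computed, so the definition used here (the mean of |replicate 2 / replicate 1 − 1|, expressed as a percentage) is an assumption, and the five probe intensities are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical signal intensities for five probes in two replicates.
rep1 = [100.0, 200.0, 400.0, 800.0, 1600.0]
rep2 = [110.0, 180.0, 400.0, 1000.0, 640.0]

# Pearson correlation on log2-transformed data, as in the table.
log_r = pearson_r([math.log2(v) for v in rep1],
                  [math.log2(v) for v in rep2])

# Assumed definition: deviation of each inter-replicate ratio from 1.
deviations = [abs(b / a - 1) for a, b in zip(rep1, rep2)]
average_deviation_pct = 100 * sum(deviations) / len(deviations)

# Probes with more than 50% inter-replicate variation.
n_over_50 = sum(d > 0.5 for d in deviations)

print(f"Pearson r (log2): {log_r:.3f}")
print(f"average ratio deviation: {average_deviation_pct:.1f}%")
print(f"probes with >50% variation: {n_over_50}/{len(rep1)}")
```

With these made-up numbers the average deviation is 21% and one of the five probes exceeds 50% variation. The ratio-based measure is expressed directly in units of expression change, which the correlation coefficient is not.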
Pure post-transcriptional regulation effects can be examined after correction of the protein quantities
| Gene | TF | ORF | PF | CF | CPF |
|---|---|---|---|---|---|
| 1 | 2 | 1000 | 10 | 1 | 10 |
| 2 | 2 | 2000 | 9 | 2 | 18 |
| 3 | 2 | 3000 | 8 | 3 | 24 |
| 4 | 2 | 4000 | 7 | 4 | 28 |
| 5 | 2 | 5000 | 6 | 5 | 30 |
| 6 | 2 | 6000 | 5 | 6 | 30 |
| 7 | 2 | 7000 | 4 | 7 | 28 |
| 8 | 2 | 8000 | 3 | 8 | 24 |
The table presents an artificial example in which all genes have the same transcript fold-change but different lengths and different protein fold-changes.
TF: observed transcript fold-change; ORF: coding sequence length; PF: observed protein fold-change; CF: calculated correction factor (length/1000); CPF: corrected PF (PF × CF).
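The correction in the table above is plain arithmetic and can be reproduced in a few lines. This sketch copies the table's artificial values. The function names and the `unit_length` parameter are hypothetical, chosen only for illustration.

```python
# (gene, TF, ORF length, PF) rows copied from the artificial example table.
rows = [
    (1, 2, 1000, 10),
    (2, 2, 2000, 9),
    (3, 2, 3000, 8),
    (4, 2, 4000, 7),
    (5, 2, 5000, 6),
    (6, 2, 6000, 5),
    (7, 2, 7000, 4),
    (8, 2, 8000, 3),
]


def correction_factor(orf_length, unit_length=1000):
    """CF = coding sequence length / 1000, as defined in the table notes."""
    return orf_length / unit_length


def corrected_protein_fold_change(pf, orf_length):
    """CPF = PF x CF."""
    return pf * correction_factor(orf_length)


cpf = [corrected_protein_fold_change(pf, orf) for _, _, orf, pf in rows]

for (gene, tf, orf, pf), value in zip(rows, cpf):
    print(f"gene {gene}: TF={tf} ORF={orf} PF={pf} "
          f"CF={correction_factor(orf):g} CPF={value:g}")
```

The computed CPF values match the last column of the table: 10, 18, 24, 28, 30, 30, 28, 24.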