Sreelatha Meleth, Jessy Deshane, Helen Kim.
Abstract
BACKGROUND: The proteomics literature has seen a proliferation of publications that seek to apply the rapidly improving technology of 2D gels to study various biological systems. However, there is a dearth of systematic studies that have investigated appropriate statistical approaches to analyse the data from these experiments.
Year: 2005 PMID: 15707480 PMCID: PMC553976 DOI: 10.1186/1472-6750-5-7
Source DB: PubMed Journal: BMC Biotechnol ISSN: 1472-6750 Impact factor: 2.563
Figure 1. Statistical protocol for 2D gels. The first protocol that we followed in the statistical analysis of data from 2D gel experiments is demonstrated in the flowchart in Figure 1. This consists of: 1) testing for differences between the groups with respect to total protein expression; 2) normalizing protein intensities on a gel to the mean total intensity of its group (e.g. treatment or control); 3) expressing each normalized intensity as a fraction of the total protein intensity in the experiment in order to make fold change comparisons meaningful; 4) testing the distribution of normalized intensities and using appropriate transformations if necessary to convert the distribution to a normal distribution; 5) selecting the subset of protein spots to analyse; 6) imputing values for missing spot intensities; 7) using 2-sample t-tests and F-tests to identify protein spots that can be used to build classifiers; 8) building a linear (or quadratic) discriminant classifier; and 9) using Principal Components Analysis plots to demonstrate the separation between groups visually.
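Steps 2 to 4 of the protocol can be sketched in a few lines of Python. This is only one possible reading of the description, not the authors' code: it assumes that "normalizing to the mean total intensity of its group" means rescaling each gel so that its total equals its group's mean total, and that the transformation in step 4 is a natural log. The names `normalize_gels` and `log_transform`, the gel and spot labels, and the intensities are all hypothetical.

```python
import math

def normalize_gels(gels, groups):
    """gels: gel id -> {spot id -> raw intensity}; groups: gel id -> group label."""
    totals = {gel: sum(spots.values()) for gel, spots in gels.items()}
    # Mean total intensity of the gels in each group.
    by_group = {}
    for gel, total in totals.items():
        by_group.setdefault(groups[gel], []).append(total)
    group_mean = {label: sum(t) / len(t) for label, t in by_group.items()}
    # Assumed step 2: rescale each gel so its total equals its group's mean total.
    scaled = {
        gel: {spot: x * group_mean[groups[gel]] / totals[gel] for spot, x in spots.items()}
        for gel, spots in gels.items()
    }
    # Step 3: express each intensity as a fraction of the experiment-wide total.
    experiment_total = sum(sum(spots.values()) for spots in scaled.values())
    return {
        gel: {spot: x / experiment_total for spot, x in spots.items()}
        for gel, spots in scaled.items()
    }

def log_transform(gels):
    """Assumed step 4: natural log of every (strictly positive) intensity."""
    return {gel: {spot: math.log(x) for spot, x in spots.items()} for gel, spots in gels.items()}

# Hypothetical toy data: two treated gels and one control gel, two spots each.
gels = {
    "g1": {"a": 10.0, "b": 30.0},
    "g2": {"a": 20.0, "b": 60.0},
    "c1": {"a": 5.0, "b": 5.0},
}
groups = {"g1": "GSE", "g2": "GSE", "c1": "CONT"}

fractions = normalize_gels(gels, groups)
logged = log_transform(fractions)
print(fractions["g1"], logged["c1"])
```

In this toy data the two treated gels differ only by a constant factor, so they become identical after normalization, which is the behaviour the rescaling is meant to produce.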
The number of protein spots resolved in replicate gels of the same biological sample.
| 3 | 4064 (100) | 309 (8) | 563 (14) | 3192 (78) | - | |
| 4 | 546 (100) | 169 (31) | 130 (26) | 97 (18) | 150 (27) | |
| 4 | 954 (100) | 186 (19) | 120 (12) | 105 (11) | 543 (57) | |
| 2 | 904 (100) | 342 (39) | 562 (62) | - | - | |
| 2 | 396 (100) | 229 (58) | 167 (42) | - | - | |
| 4 | 924 (100) | 234 (25) | 89 (10) | 109 (12) | 492 (53) | |
| 4 | 950 (100) | 161 (17) | 151 (16) | 102 (11) | 536 (59) | |
| 2 | 879 (100) | 312 (35) | 567 (65) | - | - | |
| 3 | 957 (100) | 272 (28) | 117 (12) | 568 (59) | - | |
| 2 | 432 (100) | 183 (42) | 249 (58) | - | - | |
The variability in the resolution of protein spots in technical replicates in a 2D gel experiment is one of the causes of the large number of missing spot intensities. The variability in the table above demonstrates the need for technical replicates as a quality control measure to identify the spots that are most 'reliable' and common, and therefore most useful for generalizing to a larger population.
Highest and lowest correlation among spot intensities between technical replicates in a 2D gel experiment; highest and lowest Kappa coefficients between technical replicates
| 6 (GSE) | 3 | 0.774 (60%) | 0.547 (30%) | 0.4962 (-0.1039,1.000) | 0.1476 (-0.1374,0.4326) |
| 7 (GSE) | 4 | 0.907 (82%) | 0.589 (35%) | 0.4437 (0.1004,0.7870) | -0.0767 (-0.114,-0.0350) |
| 8 (GSE) | 4 | 0.932 (87%) | 0.617 (38%) | 0.4445 (0.1027,0.7864) | 0.0099 (-0.0855,0.1052) |
| 9 (GSE) | 2 | 0.747 (56%) | 0.747 (56%) | 0.1299 (-0.1384,0.3982) | 0.1299 (-0.1384,0.3982) |
| 10 (GSE) | 2 | 0.837 (70%) | 0.837 (70%) | 0.1231 (-0.0486,0.2948) | 0.1231 (-0.0486,0.2948) |
| 22 (CONT) | 4 | 0.805 (65%) | 0.467 (22%) | 0.6538 (0.3705, 0.9372) | 0.2599 (-0.0566,0.5765) |
| 23 (CONT) | 4 | 0.845 (71%) | 0.632 (40%) | 0.3269 (0.0304,0.6235) | 0.0322 (-0.0296,0.0941) |
| 24 (CONT) | 2 | 0.711 (50%) | 0.711 (50%) | 0.0743 (-0.1284,0.2770) | 0.0743 (-0.1284,0.2770) |
| 25 (CONT) | 3 | 0.837 (70%) | 0.524 (27%) | 0.2843 (0.0073,0.5613) | 0.2384 (-0.0629,0.5397) |
| 26 (CONT) | 2 | 0.578 (33%) | 0.578 (33%) | 0.1946 (0.0531,0.3361) | 0.1946 (0.0531,0.3361) |
The Pearson correlation coefficient is a measure of the linear relationship between two variables. R-squared, the square of Pearson's correlation, is a measure of how much of the variability in one variable is explained by the variability in the other. Since technical replicates are expected to be identical, the r-squared values are expected to be very high, at least 0.95. The table demonstrates the degree of variability between technical replicates after normalization. The Kappa coefficients with their 95% confidence intervals confirm this. Ten out of sixteen confidence intervals span zero, indicating no agreement between technical replicates of the same sample in those cases.
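The two agreement measures in the table can be illustrated with a minimal sketch. It assumes that the correlation is computed on paired spot intensities from two replicate gels, and that kappa is Cohen's kappa on whether each spot was detected in each replicate. The source does not say how kappa was set up, so that part is an assumption, and the function names and toy data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of intensities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of 0/1 detection flags.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(a)
    p_obs = sum(1 for u, v in zip(a, b) if u == v) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical intensities for five spots seen in both replicates.
rep1 = [1.0, 2.0, 3.0, 4.0, 5.0]
rep2 = [1.2, 1.9, 3.4, 3.8, 5.1]
r = pearson_r(rep1, rep2)
print(r, r ** 2)

# Hypothetical detection flags for eight spots in two replicates.
seen1 = [1, 1, 1, 0, 0, 1, 0, 1]
seen2 = [1, 0, 1, 0, 1, 1, 0, 0]
print(cohen_kappa(seen1, seen2))
```

The percentages in the table are consistent with r-squared: a correlation of 0.907 squares to about 0.82, reported as 82%.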
The effect of log transformation using non-normalized data.
| SSP 4438 | |
| SSP 4519 | |
| SSP 4724 | |
| SSP 6314 | SSP 6314 |
Column 1 has spots that have significantly different intensities (p = 0.05) before normalizing and log transforming data. Column 2 has spots that are significantly different in intensity before normalizing data, but after using a log transformation. Spots in bold were later identified by MALDI-TOF. These were all spots that were biologically relevant to the system being studied. The percentages in parentheses in the header indicate how many of the ten proteins known to be different were identified before log transformation.
Comparing variances of spot intensities when using multiple values to impute versus a single value imputation.
| 0.767 | 0.663 | 7.05 | 8.39 | ||
| Range = (3.16) | Range = (2.63) | Range = (7.34) | Range = (6.56) | ||
| 0.356 | 1.4249 | 0.356 | 1.4249 | ||
| Range = (2.23) | Range = (3.74) | Range = (2.23) | Range = (3.74) | ||
| 1.2055 | 1.178 | 3.41 | 1.17 | ||
| Range = (3.94) | Range = (3.51) | Range = (7.15) | Range = (3.51) | ||
| 0.547 | 0.564 | 0.547 | 0 | ||
| Range = (2.79) | Range = (2.14) | Range = (2.79) | Range = (0) | ||
The variances for spots with missing values are either underestimated or overestimated with single imputation values. The ranges and variance values of intensities are closer to those of SSP 1733 (spot with no missing intensities) in the case of multiple value imputation.
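A small sketch shows why a single fill value distorts the variance. In the extreme case where a spot is missing from every gel in a group, filling with one constant gives a variance and range of exactly zero, as in the last row of the table, whereas drawing from several low values keeps some spread. The value -17.28 is the lowest log intensity quoted elsewhere in the source. The other pool values and the helper names `spread` and `fill_missing` are hypothetical.

```python
import random
import statistics

def spread(values):
    """Return (sample variance, range) for a list of log intensities."""
    return statistics.variance(values), max(values) - min(values)

def fill_missing(values, pool, rng):
    """Replace each None with a value drawn at random from pool."""
    return [rng.choice(pool) if v is None else v for v in values]

rng = random.Random(0)

# A spot that was not detected in any of the five gels of one group.
spot = [None, None, None, None, None]

# Single-value imputation: the pool contains only the lowest intensity.
single = fill_missing(spot, [-17.28], rng)

# Multiple-value imputation: hypothetical pool of several low intensities.
pool = [-17.28, -16.9, -16.4, -15.7, -15.1]
multiple = fill_missing(spot, pool, rng)

print(spread(single))    # variance and range are both zero
print(spread(multiple))
```

When only some values are missing, a constant far below the observed intensities has the opposite effect and inflates the variance, which matches the mixed under- and overestimation described above.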
Figure 3. Intensity plots. (a) Plot of raw intensities before log-transformation and normalization. Probability plot of raw intensities of 201 spots in the final data set; a normally distributed variable is expected to plot a line close to the straight line; the intensities are very skewed. (b) Plot of normalized spot intensities. Probability plot of normalized spot intensities of 201 spots in the final data set. Comparison of 3a and 3b demonstrates that the normalization does not alter the distribution of the spot intensities. (c) QQ plot of log transformed intensities. Figure 3c demonstrates that the log transformation successfully transforms the highly skewed distribution of spots into a normal distribution.
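The effect shown in panels (a) and (c) can be checked numerically with a skewness statistic in place of a probability plot. The sketch below uses 201 synthetic intensities, not the study data, built so that their logs are evenly spaced and therefore symmetric. The `skewness` helper is a hypothetical name for the standard moment-based estimate.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic example: 201 values whose logs are evenly spaced on [-2, 2].
log_scale = [-2 + 4 * i / 200 for i in range(201)]
raw = [math.exp(z) for z in log_scale]
logged = [math.log(x) for x in raw]

print(skewness(raw))     # clearly positive: long right tail
print(skewness(logged))  # essentially zero after the log transform
```

A skewness near zero does not by itself establish normality, so a probability or QQ plot such as the ones in Figure 3 remains the more informative check.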
The effect of log transformation using non-normalized data.
| SSP 1134 | |
| SSP 4519 | |
| SSP 4724 | |
| SSP 5309 | |
| SSP 5329 | |
| SSP 6228 | |
| SSP 6304 | SSP 6304 |
| SSP 6314 | SSP 6314 |
| SSP 6321 | |
| SSP 6349 | |
| SSP 6443 | |
| SSP 7223 | SSP 7223 |
| SSP 7334 | |
| SSP 7750 | SSP 7750 |
| SSP 8613 |
Column 1 has spots that have significantly different intensities (p = 0.05) after normalizing and log transforming data. Column 2 has spots that are significantly different in intensity after normalizing data (using normalization 2), but after log transformation. The percentages in parentheses in the header measure how many of the ten proteins known to be different were identified after log transformation and normalization.
The effect of log transformation using non-normalized data.
| SSP 0312¹ | |||
| SSP 1112¹ | |||
| SSP 1321¹ | |||
| SSP 1331¹ | |||
| SSP 3234¹ | |||
| SSP 3437¹ | |||
| SSP 3523¹ | |||
| SSP 4438 | SSP 4438 | SSP 4438² | |
| SSP 4517¹ | |||
| SSP 4519 | SSP 4519 | SSP 4519 | SSP 4519² |
| SSP 4724 | SSP 4724 | SSP 4724 | |
| SSP 4735¹ | |||
| SSP 5011¹ | |||
| SSP 5309 | |||
| SSP 5329 | |||
| SSP 6205 | |||
| SSP 6304 | |||
| SSP 6314 | SSP 6314 | SSP 6314 | |
| SSP 6321 | |||
| SSP 6349 | |||
| SSP 6443 | |||
| SSP 7027 | |||
| SSP 7231 | |||
| SSP 7223 | |||
| SSP 7334 | |||
| SSP 7413¹ | |||
| SSP 7750 | |||
| SSP 8613 |
¹ These are spots that were present in a very small number of gels, and therefore did not meet our criteria to be included.
² These spots have highly skewed distributions or were very poor quality spots. Log transformation made the distribution closer to normal and p-values were no longer significant.
Column 1 has spots that have significantly different intensities (p = 0.05) normalizing and log transforming data. Column 2 has spots that are significantly different in intensity after using normalization 1, but before using a log transformation. Column 3 has spots that are significantly different in intensity after using normalization 2, but before using a log transformation. Column 4 has the results from the image analysis software PDQUEST, which has an option for normalizing but no log transformation. Columns 1 and 2 are subsets of the 201 spots in the final data set that met our criteria for inclusion. Column 3 is a subset of all possible spots in the experiment. Spots in bold were later identified by MALDI-TOF. These were all spots that were biologically relevant to the system being studied. The percentages in parentheses in the header measure how many of the ten proteins known to be different were identified after the different normalization techniques.
T-Test results of log transformed intensities pre- and post-normalization – Imputation Method 1
| SSP 1134 | |
| SSP 4724 | SSP 4724 |
| SSP 6304 | |
| SSP 6314 | SSP 6314 |
| SSP 7223 | |
| SSP 7750 |
The table shows the results of two-sample t-tests on log-transformed intensities, pre- and post-normalization, when all missing intensities were replaced by the lowest intensity value in the experiment (-17.28). Spots in bold were later identified by MALDI-TOF. These were all spots that were biologically relevant to the system being studied.
T-Test results of log transformed intensities pre- and post-normalization – Imputation Method 2
| SSP 1134 | |
| SSP 4724 | SSP 4724 |
| SSP 6228 | |
| SSP 6304 | |
| SSP 6314 | SSP 6314 |
| SSP 7223 | |
| SSP 7334 | |
| SSP 7750 |
The table shows the results of two-sample t-tests on log-transformed intensities, pre- and post-normalization, when each missing intensity in a GSE (control) gel was replaced by randomly selecting one of the 15 lowest spot intensity values from the 15 gels in the GSE (control) group. Spots in bold were later identified by MALDI-TOF. These were all spots that were biologically relevant to the system being studied.
T-Test results of log transformed intensities pre- and post-normalization – Imputation Method 3
| SSP 1134 | |
| SSP 4724 | SSP 4724 |
| SSP 6228 | |
| SSP 6304 | |
| SSP 6314 | SSP 6314 |
| SSP 7223 | |
| SSP 7334 | |
| SSP 7750 |
The table shows the results of two-sample t-tests on log-transformed intensities, pre- and post-normalization, when each missing intensity in a GSE or control gel was replaced by randomly selecting one of the 30 lowest spot intensity values from the 30 gels in the experiment. Spots in bold were later identified by MALDI-TOF. These were all spots that were biologically relevant to the system being studied.
Comparing t-test results for the three different imputation methods
| SSP 1134 | SSP 1134 | |
| SSP 4724 | SSP 4724 | |
| SSP 6228 | SSP 6228 | |
| SSP 6304 | SSP 6304 | |
| SSP 6314 | SSP 6314 | |
| SSP 7223 | SSP 7223 | |
| SSP 7334 | SSP 7334 | |
| SSP 7750 | SSP 7750 |
In column 1, missing values were replaced with the lowest intensity value in the experiment; in column 2, values to replace missing intensities were randomly chosen from the 15 lowest intensity values within a treatment group; in column 3, values to replace missing intensities were randomly chosen from the 30 lowest intensity values without regard to treatment group.
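The three imputation methods differ only in the pool of low values that replacements are drawn from, which the following sketch makes explicit. It assumes that "the k lowest intensity values" means the k smallest observed intensities across the relevant gels. The source could also be read as one lowest value per gel, so this is a labeled assumption. The function names, method labels and toy data are hypothetical.

```python
import random

def lowest_values(gels, k):
    """Return the k smallest observed (non-missing) intensities across gels."""
    observed = [v for gel in gels for v in gel.values() if v is not None]
    return sorted(observed)[:k]

def impute(gels, groups, method, k, rng):
    """gels: list of {spot -> intensity or None}; groups: group label per gel."""
    labels = set(groups)
    if method == "single":          # method 1: lowest value in the experiment
        pools = {g: lowest_values(gels, 1) for g in labels}
    elif method == "within_group":  # method 2: k lowest values in the gel's own group
        pools = {
            g: lowest_values([gel for gel, lab in zip(gels, groups) if lab == g], k)
            for g in labels
        }
    elif method == "pooled":        # method 3: k lowest values, ignoring group
        pools = {g: lowest_values(gels, k) for g in labels}
    else:
        raise ValueError(f"unknown method: {method}")
    return [
        {spot: (rng.choice(pools[lab]) if v is None else v) for spot, v in gel.items()}
        for gel, lab in zip(gels, groups)
    ]

# Hypothetical toy experiment: two gels per group, spot "b" missing twice.
gels = [
    {"a": 1.0, "b": None},
    {"a": 2.0, "b": 5.0},
    {"a": 10.0, "b": None},
    {"a": 12.0, "b": 11.0},
]
groups = ["GSE", "GSE", "CONT", "CONT"]
rng = random.Random(0)

for method in ("single", "within_group", "pooled"):
    print(method, impute(gels, groups, method, k=2, rng=rng))
```

In the study the pool sizes were 15 within a group and 30 across the experiment; `k=2` here only keeps the toy example readable.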
Comparing averaging across replicates versus not.
| SSP 1134 | SSP 1134 |
| SSP 3222 | |
| SSP 4724 | |
| SSP 6236 | |
| SSP 6304 | |
| SSP 6314 | SSP 6314 |
| SSP 6349 | |
| SSP 7144 | |
| SSP 7223 | |
| SSP 7439 | |
| SSP 7750 | SSP 7750 |
Column 1 has the results of two-sample t-tests when the intensity values were averaged across replicates. Column 2 represents results of two-sample t-tests when replicates were treated as independent observations.
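The difference between the two columns comes down to what counts as an observation in the t-test. The sketch below computes a pooled-variance two-sample t statistic both ways. The source does not state which t-test variant was used, so the pooled form is an assumption, and the helper names and log intensities are hypothetical. Converting t to a p-value needs the t distribution, which the Python standard library does not provide, so only t and the degrees of freedom are reported.

```python
import math
import statistics

def average_by_sample(values, sample_ids):
    """Average technical replicates that share a sample id (order preserved)."""
    grouped = {}
    for value, sample in zip(values, sample_ids):
        grouped.setdefault(sample, []).append(value)
    return [statistics.mean(v) for v in grouped.values()]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

# Hypothetical log intensities of one spot: 3 samples per group, 2 replicate gels each.
ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
gse = [-9.0, -9.2, -8.6, -8.8, -9.4, -9.1]
cont = [-8.1, -8.3, -8.0, -7.7, -8.4, -8.2]

# Replicates treated as independent observations: 6 values per group.
print(pooled_t(gse, cont))

# Replicates averaged first: 3 values per group.
print(pooled_t(average_by_sample(gse, ids), average_by_sample(cont, ids)))
```

Treating replicates as independent raises the degrees of freedom from 4 to 10 in this toy case without adding biological samples, which is one reason the two columns can list different spots.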
Figure 2. Statistical protocol for 2D gels. The second protocol that we followed in the statistical analysis of data from 2D gel experiments is demonstrated in the flowchart in Figure 2. This consists of: 1) testing for differences between the groups with respect to total protein expression; 2) normalizing protein intensities on a gel to the mean total intensity of its group (e.g. treatment or control); expressing each normalized intensity as a fraction of the total protein intensity in the experiment in order to make fold change comparisons meaningful; 3) selecting the subset of protein spots to analyse; 4) averaging spot intensities across gels from the same sample; 5) imputing values for missing spot intensities; 6) using 2-sample t-tests and F-tests to identify protein spots that can be used to build classifiers; 7) building a linear (or quadratic) discriminant classifier; and 8) using Principal Components Analysis plots to demonstrate the separation between groups visually.