Kai Wang, Jian Li, Shengting Li, Lars Bolund, Carsten Wiuf.
Abstract
BACKGROUND: Array-based comparative genomic hybridization (CGH) is a commonly used approach to detect DNA copy number variation in genome-wide screens. Several statistical methods have been proposed to define genomic segments with different copy numbers in cancer tumors. However, most tumors are heterogeneous and show variation in DNA copy numbers across tumor cells. The challenge is to reveal the copy number profiles of the subpopulations in a tumor and to estimate the percentage of each subpopulation.
Year: 2009 PMID: 19134174 PMCID: PMC2640360 DOI: 10.1186/1471-2105-10-12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Subpopulation summary of the 29 pairs of primary and metastasis samples
| Sample | # | p_1 (%) | p_2 (%) | p_3 (%) | AI_1 | AI_2 | AI_3 | Total | Pure |
| T-2 | 15 | 25.3 (8.8) | - | - | 1.13 (0.76) | - | - | 0.25 (0.13) | 1.13 (0.76) |
| T-3 | 13 | 33.4 (9.4) | 14.2 (5.1) | - | 0.34 (0.13) | 1.18 (0.26) | - | 0.28 (0.10) | 0.59 (0.20) |
| T-4 | 1 | 23 (-) | 34 (-) | 8 (-) | 0.25 (-) | 0.53 (-) | 1.40 (-) | 0.35 (-) | 0.54 (-) |
| M-2 | 16 | 24.8 (8.3) | - | - | 1.07 (1.00) | - | - | 0.21 (0.11) | 1.07 (1.00) |
| M-3 | 10 | 32.5 (8.7) | 15.1 (6.1) | - | 0.45 (0.23) | 1.33 (0.41) | - | 0.33 (0.12) | 0.71 (0.30) |
| M-4 | 3 | 16.7 (3.1) | 31 (6) | 8.7 (2.1) | 0.18 (0.04) | 0.48 (0.11) | 1.33 (0.17) | 0.29 (0.05) | 0.52 (0.09) |
The table summarizes the analysis of the 29 pairs of samples [see Additional file 2]. Shown are mean values with standard deviations in parentheses. T-i: primary tumor with i subpopulations; M-i: metastasis with i subpopulations; #: number of samples; p_k: percentage of (abnormal) subpopulation k ≥ 1; AI_k: Aberration Index for subpopulation k; Total: weighted sum of AI, Σ p_k AI_k; Pure: AI•, the normalized weighted sum of AI, cf. equation (2).
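The two summary columns can be illustrated with a short Python sketch. It assumes that "Total" is Σ p_k AI_k with p_k expressed as fractions, and that "Pure" divides this sum by Σ p_k; equation (2) is not reproduced in this record, so the normalization is an inference that merely agrees with the tabulated values. The function name `aberration_summary` is hypothetical.

```python
def aberration_summary(percentages, aberration_indices):
    """Return (total, pure) for the abnormal subpopulations of one sample.

    percentages: p_k in percent; aberration_indices: AI_k.
    Assumption: Pure = Total / sum(p_k), inferred from the table caption.
    """
    fractions = [p / 100.0 for p in percentages]
    total = sum(f * ai for f, ai in zip(fractions, aberration_indices))
    pure = total / sum(fractions)
    return total, pure


# Mean values from the T-3 row, used here only as illustrative input.
total, pure = aberration_summary([33.4, 14.2], [0.34, 1.18])
print(f"Total = {total:.2f}, Pure = {pure:.2f}")
```

Because the table reports means over samples, feeding row means into the formula only approximates the tabulated Total and Pure; the agreement for T-3 (0.28 and 0.59) is close but need not be exact for other rows.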
Figure 1. Cluster diagram of the 29 pairs of tumors. The 29 pairs of primary and metastasis samples were divided into 89 subpopulations using the method described in the paper. For two leaves with the same ID (e.g. T53), P1 refers to the abnormal subpopulation with the least aberration, P2 (if it exists) refers to the subpopulation with more aberrations than P1, and P3 (if it exists) to the one with the most aberrations. The percentage of each subpopulation is also included. The cluster diagram was generated using average linkage clustering based on the estimated copy numbers for all 3340 clones. Reducing the number of clones produces very similar results.
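As a rough illustration of average linkage clustering on copy number profiles, the following pure-Python sketch merges the two clusters with the smallest mean pairwise distance until one cluster remains. The Euclidean distance, the toy five-clone profiles, and all leaf names other than the T53 example are assumptions; the record does not state which distance measure was used for the figure.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two copy number profiles (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_distance(profiles, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(profiles, pair[0], pair[1]),
        )
        merges.append((a, b, average_distance(profiles, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy profiles over five clones (hypothetical values, not from the study).
profiles = {
    "T53_P1": [2, 2, 3, 2, 1],
    "T53_P2": [2, 3, 3, 2, 1],
    "M53_P1": [2, 2, 3, 2, 1],
    "T07_P1": [4, 4, 2, 1, 1],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.2f}")
```

The quadratic search over cluster pairs is adequate for 89 leaves; a library implementation would be preferable for larger inputs.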
Primary tumor vs. metastasis
| Primary \ Metastasis | 2 | 3 | 4 |
| 2 | 9 | 4 | 2 |
| 3 | 7 | 5 | 1 |
| 4 | 0 | 1 | 0 |
Shown is the estimated number of subpopulations in the primary tumor and the corresponding lymph node metastasis.
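The cross-tabulation can be checked against the sample counts in the subpopulation summary with a few lines of Python. The sketch below only re-adds the published counts; the variable names are illustrative.

```python
# Rows: subpopulations in the primary tumor; columns: in the metastasis.
table = {
    2: {2: 9, 3: 4, 4: 2},
    3: {2: 7, 3: 5, 4: 1},
    4: {2: 0, 3: 1, 4: 0},
}

primary_totals = {p: sum(row.values()) for p, row in table.items()}
metastasis_totals = {m: sum(table[p][m] for p in table) for m in (2, 3, 4)}
n_pairs = sum(primary_totals.values())

# Pairs in which primary tumor and metastasis have the same estimated number.
concordant = sum(table[k][k] for k in table)

print(primary_totals, metastasis_totals, n_pairs, concordant)
```

The row and column sums reproduce the # column of the summary table (15, 13, 1 primary tumors and 16, 10, 3 metastases), and 14 of the 29 pairs lie on the diagonal.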
Figure 2. Subpopulation development. Here we show possible subpopulation development in two paired tumor samples. The dashed lines connect the subpopulations from the same sample. The solid and dashed lines represent the most likely and the least likely development path, respectively. From top to bottom, the subpopulations contain more and more genomic aberrations.
Prediction accuracy for simulated samples
| Real | Simulated | Predicted as 2 | Predicted as 3 | Predicted as 4 | Correct (fraction) |
| 2 | 2 | 115 | 9 | 0 | 0.93 |
| 2 | 3 | 61 | 59 | 4 | 0.48 |
| 2 | 4 | 57 | 48 | 19 | 0.15 |
| 3 | 2 | 80 | 11 | 1 | 0.87 |
| 3 | 3 | 8 | 78 | 6 | 0.85 |
| 3 | 4 | 6 | 60 | 26 | 0.28 |
| 4 | 2 | 15 | 1 | 0 | 0.94 |
| 4 | 3 | 0 | 13 | 3 | 0.81 |
| 4 | 4 | 0 | 1 | 15 | 0.94 |
For each real sample four simulated samples were created; in total 174 samples. The real sample was used as template for the simulated samples. In the table, the simulation results are shown according to the estimated number of subpopulations in the real samples. Real: Estimated number of subpopulations in the real sample, Simulated: The number of subpopulations in the simulated sample, Predicted: The predicted number of subpopulations in the simulated sample.
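The "Correct" column can be recomputed from the counts as the share of simulated samples whose predicted number of subpopulations equals the simulated number. The sketch below is a plain recalculation under that reading; the dictionary layout and function name are illustrative.

```python
# Keys: (real, simulated); values: counts predicted as 2, 3 and 4 subpopulations.
counts = {
    (2, 2): (115, 9, 0), (2, 3): (61, 59, 4), (2, 4): (57, 48, 19),
    (3, 2): (80, 11, 1), (3, 3): (8, 78, 6), (3, 4): (6, 60, 26),
    (4, 2): (15, 1, 0), (4, 3): (0, 13, 3), (4, 4): (0, 1, 15),
}


def fraction_correct(simulated, predicted_counts):
    """Share of samples predicted to have the simulated number of subpopulations."""
    return predicted_counts[simulated - 2] / sum(predicted_counts)


accuracy = {
    key: fraction_correct(key[1], row) for key, row in counts.items()
}
for (real, simulated), value in accuracy.items():
    print(f"real={real} simulated={simulated} correct={value:.3f}")
```

The recomputed values agree with the tabulated ones to within rounding, which indicates that the column holds fractions rather than percentages.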
Accuracy of copy numbers and percentages
| #Subpopulations | 2 | 3 | 4 |
| A (in %) | 2.05 (2.26) | 2.76 (2.50) | 4.78 (4.01) |
| B (in %) | 89.5 (13.6) | 82.8 (10.0) | 80.5 (9.0) |
The table shows the accuracy of the estimated copy numbers and subpopulation percentages when the number of subpopulations is correctly predicted. A) The average absolute difference between the estimated and true percentages in the simulated samples. B) The average percentage of copy numbers predicted correctly, excluding the normal subpopulation. Standard deviations in parentheses.
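A minimal sketch of the two accuracy measures for a single simulated sample is given below. The input values are invented for illustration, and how the study averages across subpopulations and samples is not specified in this record, so the per-sample definitions here are assumptions.

```python
def mean_abs_percentage_difference(estimated, true):
    """Measure A: mean absolute difference between estimated and true percentages."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


def percent_correct_copy_numbers(estimated_cn, true_cn):
    """Measure B: percentage of positions whose copy number is predicted exactly."""
    hits = sum(e == t for e, t in zip(estimated_cn, true_cn))
    return 100.0 * hits / len(true_cn)


# Hypothetical sample: percentages of three subpopulations, copy numbers of four clones.
measure_a = mean_abs_percentage_difference([62, 24, 14], [60, 25, 15])
measure_b = percent_correct_copy_numbers([2, 3, 3, 1], [2, 3, 2, 1])
print(f"A = {measure_a:.2f} percentage points, B = {measure_b:.1f}%")
```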
Robustness of the method
| | 3 subpopulations | | | 4 subpopulations | | |
| λ | 1 | 0.5 | 0.25 | 1 | 0.5 | 0.25 |
| Correct | 27 | 27 | 33 | 27 | 21 | 28 |
| Incorrect | 19 | 19 | 13 | 13 | 19 | 12 |
| A (in %) | 4.74 (4.68) | 4.15 (3.62) | 3.52 (3.70) | 7.70 (7.95) | 5.43 (4.23) | 5.59 (5.03) |
| B (in %) | 5.37 (6.89) | 5.59 (6.19) | 5.43 (7.13) | 7.43 (9.64) | 5.65 (5.35) | 6.93 (6.98) |
| Segments (in %) | 37.0 | 24.4 | 11.5 | 65.1 | 45.1 | 25.6 |
The table shows results for simulated data not fulfilling the assumption of sequential tumor evolution. With increasing λ, an increasing number of samples are incorrectly classified. A) The average absolute difference between the estimated and true percentages in the simulated samples when the number of subpopulations is predicted correctly. B) The average absolute difference between the estimated and true percentages of the normal subpopulation. Standard deviations in parentheses. 'Segments' is the percentage of segments violating the model of sequential tumor evolution.
Validation experiment
| Sample | Experiment | Estimated from X (X = S1, S2) | Estimated from X with 15% | Estimated from X with 30% |
| S1 | 76,24 | - | 79,21 | 77,23 |
| S1 with 15% | 82,18 | 80,20 | - | 81,19 |
| S1 with 30% | 84,16 | 83,17 | 85,15 | - |
| S1 | 60,29,11 | - | 72,20,8 | 70,21,9 |
| S1 with 15% | 76,17,7 | 66,25,9 | - | 75,18,7 |
| S1 with 30% | 79,15,6 | 72,20,8 | 80,14,6 | - |
| S2 | 62,24,14 | - | 49,33,18 | 50,31,19 |
| S2 with 15% | 57,28,15 | 68,20,12 | - | 57,27,16 |
| S2 with 30% | 65,22,13 | 73,17,10 | 65,23,12 | - |
In the "Experiment" column, the estimated subpopulation percentages from the 6 experiments are shown: two pure tumor samples (S1 and S2), and four samples with tumor mixed with 15 or 30% normal cells. The best fit for S1 is three subpopulations, whereas it is two for "S1 with 15%" and "S1 with 30%"; therefore we show results for both two and three subpopulations to facilitate comparison. The remaining three columns contain percentages estimated from the Experiment column; e.g. to estimate the percentages of the sample with 85% malignant cells and 15% normal cells ("S2 with 15%") from the sample S2, compute (0.62·85 + 15, 0.24·85, 0.14·85) = (68, 20, 12).
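The dilution arithmetic in the caption can be written as a small Python function. It assumes, as the worked example suggests, that the first percentage in each tuple belongs to the normal subpopulation, so the added normal cells are credited to it; the function name `dilute` is hypothetical.

```python
def dilute(percentages, normal_percent):
    """Expected percentages after mixing a sample with extra normal cells.

    percentages: estimated subpopulation percentages of the undiluted sample,
    first entry assumed to be the normal subpopulation.
    normal_percent: percentage of added normal cells in the mixture.
    """
    tumor_share = 100.0 - normal_percent
    scaled = [p / 100.0 * tumor_share for p in percentages]
    scaled[0] += normal_percent
    return scaled


s2_with_15 = [round(x) for x in dilute([62, 24, 14], 15)]
s1_with_30 = [round(x) for x in dilute([76, 24], 30)]
print(s2_with_15, s1_with_30)
```

The first call reproduces the caption's example (68, 20, 12), and the second matches the "S1 with 30%" entry estimated from S1 (83, 17).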
Figure 3. Classifier. Here we show three examples of classification of real samples. From top to bottom, the three samples are classified as 2, 3, and 4 subpopulations, respectively. Each subplot shows two empirical distributions and a blue line representing the NLS of the query sample. In the first column, the black curve is the smoothed empirical distribution (SED) of NLS (simulated as two subpopulations and fitted as two) and the red curve is the SED of NLS (simulated as three and fitted as two). In the second column, the red curve is the SED of NLS (simulated as three and fitted as three) and the green curve is the SED of NLS (simulated as four and fitted as three). Finally, in the last column, the green curve is the SED of NLS (simulated as four and fitted as four) and the yellow curve is the SED of NLS (simulated as five and fitted as four). The number in each subplot shows the percentage of samples in the left distribution that obtain a value greater than the value indicated by the blue line.
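The number printed in each subplot is an empirical tail percentage, which the following Python sketch computes from raw simulated values. The NLS statistic itself is not defined in this record, so the values below are placeholders, and the figure uses smoothed distributions whereas this sketch works on the unsmoothed sample.

```python
def percent_above(reference_values, query_value):
    """Percentage of reference values strictly greater than the query value."""
    above = sum(v > query_value for v in reference_values)
    return 100.0 * above / len(reference_values)


# Placeholder NLS values for simulated samples and a query sample.
simulated_nls = [0.8, 1.0, 1.2, 1.5, 2.0]
query_nls = 1.1
tail = percent_above(simulated_nls, query_nls)
print(f"{tail:.0f}% of simulated samples exceed the query value")
```

How these percentages are turned into a decision on the number of subpopulations is described in the paper's methods and is not reconstructed here.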