Ao Li, Zongzhi Liu, Kimberly Lezon-Geyda, Sudipa Sarkar, Donald Lannin, Vincent Schulz, Ian Krop, Eric Winer, Lyndsay Harris, David Tuck.
Abstract
There is increasing interest in using single nucleotide polymorphism (SNP) genotyping arrays for profiling chromosomal rearrangements in tumors, as they allow simultaneous detection of copy number and loss of heterozygosity (LOH) with high resolution. Critical issues such as signal baseline shift due to aneuploidy, normal cell contamination, and GC content bias have been reported to dramatically alter SNP array signals and complicate accurate identification of aberrations in cancer genomes. To address these issues, we propose a novel Global Parameter Hidden Markov Model (GPHMM) to unravel tangled genotyping data generated from tumor samples. In contrast to other HMM methods, a distinct feature of GPHMM is that the issues mentioned above are quantitatively modeled by global parameters and integrated within the statistical framework. We developed an efficient EM algorithm for parameter estimation. We evaluated performance on three data sets and show that GPHMM can correctly identify chromosomal aberrations in tumor samples containing as few as 10% cancer cells. Furthermore, we demonstrate that the global parameters estimated by GPHMM provide information about the biological characteristics of tumor samples and the quality of the genotyping signal from SNP array experiments, which is helpful for data quality control and outlier detection in cohort studies.
Year: 2011 PMID: 21398628 PMCID: PMC3130254 DOI: 10.1093/nar/gkr014
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Definition of hidden states in GPHMM
| State | Copy number | Allelic information | Copy number alteration status | (Tumor genotype, normal cell genotype) |
|---|---|---|---|---|
| 0 | N/A | N/A | Fluctuation effect | ( |
| 1 | 0 | Deletion | Deletion of two copies | ( |
| 2 | 1 | LOH | Deletion of one copy | ( |
| 3 | 2 | Heterozygous | Normal | ( |
| 4 | 2 | LOH | Copy neutral with LOH | ( |
| 5 | 3 | Heterozygous | Three copies with duplication of one allele | ( |
| 6 | 3 | LOH | Three copies with LOH | ( |
| 7 | 4 | Heterozygous | Four copies with duplication of one allele | ( |
| 8 | 4 | Heterozygous | Four copies with duplication of both alleles | ( |
| 9 | 4 | LOH | Four copies with LOH | ( |
| 10 | 5 | Heterozygous | Five copies with duplication of one allele | ( |
| 11 | 5 | Heterozygous | Five copies with duplication of both alleles | ( |
| 12 | 5 | LOH | Five copies with LOH | ( |
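The effect of normal cell contamination on a hidden state can be illustrated with a simple linear mixture of allele copies. The sketch below is not the GPHMM emission model itself; it assumes a heterozygous (AB) normal genotype, and the function name `expected_signals` and its parameters are hypothetical names chosen for illustration.

```python
import math


def expected_signals(tumor_total, tumor_b, normal_prop, normal_b=1):
    """Return (mean copy number, expected BAF) for a tumor/normal mixture.

    tumor_total: total copies in tumor cells (e.g. 2 for copy-neutral LOH)
    tumor_b:     copies of the B allele in tumor cells
    normal_prop: fraction of normal cells in the sample, between 0 and 1
    normal_b:    copies of the B allele in normal cells (assumed 1, i.e. AB)
    """
    if not 0.0 <= normal_prop <= 1.0:
        raise ValueError("normal_prop must lie in [0, 1]")
    if not 0 <= tumor_b <= tumor_total:
        raise ValueError("tumor_b must lie between 0 and tumor_total")

    # Normal cells are assumed diploid, so they contribute 2 copies each.
    total = normal_prop * 2 + (1 - normal_prop) * tumor_total
    b_copies = normal_prop * normal_b + (1 - normal_prop) * tumor_b

    if total == 0:
        # Homozygous deletion in a pure tumor sample: BAF is undefined.
        return 0.0, math.nan
    return total, b_copies / total


# Copy-neutral LOH (tumor genotype BB) in a pure sample and at 50% normal cells.
pure = expected_signals(tumor_total=2, tumor_b=2, normal_prop=0.0)
mixed = expected_signals(tumor_total=2, tumor_b=2, normal_prop=0.5)
print(pure, mixed)
```

Under these assumptions, contamination pulls the BAF of an LOH region away from 0 or 1 and toward 0.5, which is why the same state can look very different across samples with different normal cell proportions.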
Comparison of normal DNA proportions estimated by different methods on dilution series data
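| Sample | GPHMM: LRR baseline shift | GPHMM: GC coefficient h | GPHMM: SD of LRR signal | GPHMM: SD of BAF signal | GPHMM: normal cell proportion | GAP: normal cell proportion | Normal DNA proportion |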
|---|---|---|---|---|---|---|---|
| CRL2324_10pc_Tum | 0.011 | 0.027 | 0.20 | 0.02 | 0.90 | 0.01 | 0.90 |
| CRL2324_14pc_Tum | −0.009 | 0.019 | 0.19 | 0.02 | 0.88 | 0.01 | 0.86 |
| CRL2324_21pc_Tum | −0.016 | 0.023 | 0.20 | 0.03 | 0.81 | 0.84 | 0.79 |
| CRL2324_23pc_Tum | −0.067 | 0.023 | 0.23 | 0.03 | 0.69 | 0.73 | 0.77 |
| CRL2324_30pc_Tum | −0.046 | 0.022 | 0.18 | 0.03 | 0.72 | 0.75 | 0.70 |
| CRL2324_34pc_Tum | −0.058 | 0.026 | 0.23 | 0.03 | 0.68 | 0.72 | 0.66 |
| CRL2324_45pc_Tum | −0.069 | 0.016 | 0.22 | 0.03 | 0.63 | 0.66 | 0.55 |
| CRL2324_47pc_Tum | −0.102 | 0.042 | 0.22 | 0.03 | 0.55 | 0.58 | 0.53 |
| CRL2324_50pc_Tum | −0.102 | 0.031 | 0.25 | 0.03 | 0.57 | 0.59 | 0.50 |
| CRL2324_79pc_Tum | −0.189 | 0.032 | 0.24 | 0.03 | 0.19 | 0.20 | 0.21 |
| CRL2324 | −0.283 | 0.024 | 0.24 | 0.02 | 0.02 | 0.00 | 0.00 |
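As a rough way to summarize the table above, the mean absolute error of each method against the known normal DNA proportion can be computed directly from the listed values. This sketch assumes that the fifth numeric column is the GPHMM estimate of normal cell proportion, the sixth is the GAP estimate, and the last is the known proportion; that reading is an interpretation of the flattened header, and the helper `mean_abs_error` is a hypothetical name.

```python
# Values transcribed from the dilution series table, in row order.
# Column roles are an assumption (see the note above the code).
known = [0.90, 0.86, 0.79, 0.77, 0.70, 0.66, 0.55, 0.53, 0.50, 0.21, 0.00]
gphmm = [0.90, 0.88, 0.81, 0.69, 0.72, 0.68, 0.63, 0.55, 0.57, 0.19, 0.02]
gap = [0.01, 0.01, 0.84, 0.73, 0.75, 0.72, 0.66, 0.58, 0.59, 0.20, 0.00]


def mean_abs_error(estimates, truth):
    """Average absolute difference between estimates and known values."""
    if len(estimates) != len(truth):
        raise ValueError("estimates and truth must have the same length")
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)


mae_gphmm = mean_abs_error(gphmm, known)
mae_gap = mean_abs_error(gap, known)

# The first two rows are the most diluted samples (10% and 14% tumor DNA).
mae_gap_diluted = mean_abs_error(gap[:2], known[:2])

print(f"GPHMM MAE: {mae_gphmm:.3f}")
print(f"GAP MAE:   {mae_gap:.3f}")
print(f"GAP MAE on the two most diluted samples: {mae_gap_diluted:.3f}")
```

Under this reading of the columns, most of the GAP error comes from the two most diluted samples, where its estimate is close to zero while the known normal DNA proportion is close to 0.9.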
Figure 1. Strong correlation between the proportion of normal cells and the LRR signal shift in the dilution series data. The empirical regression function is also shown in the figure.
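The relationship described in this caption can be approximated from the dilution series table. The sketch below fits a straight line by ordinary least squares, taking the first numeric column of that table as the LRR signal shift and the last column as the known normal DNA proportion; both column roles are assumptions, and a straight line may differ from the empirical regression function used in the paper.

```python
import math

# Values transcribed from the dilution series table, in row order.
normal_prop = [0.90, 0.86, 0.79, 0.77, 0.70, 0.66, 0.55, 0.53, 0.50, 0.21, 0.00]
lrr_shift = [0.011, -0.009, -0.016, -0.067, -0.046, -0.058,
             -0.069, -0.102, -0.102, -0.189, -0.283]


def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept, plus Pearson r."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(normal_prop, lrr_shift)
print(f"shift = {slope:.3f} * normal_prop + {intercept:.3f} (r = {r:.3f})")
```

A positive slope with a negative intercept matches the pattern in the table: the shift is most negative for the pure cell line and approaches zero as the normal DNA proportion increases.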
Figure 2. Comparison of the self-consistency percentages for different methods. (a) Self-consistency percentages based on LOH status. (b) Self-consistency percentages based on copy number state. (c) Self-consistency percentages based on both copy number and LOH states.
Figure 3. Plots of LOH regions on chromosome 17 and the results of GPHMM for dilution series data. (a) Plot of sample ‘CRL2324’ (100% cancer cell DNA). Typical LOH patterns are observed in this pure cancer cell line, and there is a significant difference in LRR signals between the two LOH regions. (b) Plot of sample ‘CRL2324-50pc-Tum’ (50% cancer cell DNA). Due to normal cell contamination, two additional BAF bands appear and the difference in LRR signals is reduced, whereas the results of GPHMM remain the same. (c) Plot of sample ‘CRL2324-14pc-Tum’ (14% cancer cell DNA). The results of GPHMM remain unchanged as the normal cell proportion increases. (d) Plot of sample ‘CRL2324-10pc-Tum’ (10% cancer cell DNA). With 90% normal cells, the patterns of BAF and LRR signals are barely discernible. However, GPHMM can still accurately identify these two LOH regions.
Comparison of tumor DNA indices estimated by different methods on GAP data
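| Sample | GPHMM: LRR baseline shift | GPHMM: GC coefficient h | GPHMM: SD of LRR signal | GPHMM: SD of BAF signal | GPHMM: normal cell proportion | GPHMM: DNA index | GAP: normal cell proportion | GAP: DNA index | FCM: DNA index |
|---|---|---|---|---|---|---|---|---|---|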
| BLC_B1_T14 | −0.38 | 0.005 | 0.42 | 0.06 | 0.15 | 1.61 | 0.15 | 0.85 | 1.14 |
| BLC_B1_T17 | 0.04 | 0.080 | 0.65 | 0.06 | 0.30 | 0.84 | 0.23 | 0.82 | 0.84 |
| BLC_B1_T19 | −0.18 | −0.013 | 0.18 | 0.03 | 0.55 | 1.56 | 0.60 | 1.63 | 1.60 |
| BLC_B1_T20 | −0.11 | 0.003 | 0.18 | 0.03 | 0.59 | 1.39 | 0.60 | 1.48 | 1.41 |
| BLC_B1_T22 | 0.07 | 0.047 | 0.46 | 0.05 | 0.09 | 0.94 | 0.13 | 0.94 | 1.98 |
| BLC_T07 | −0.15 | 0.012 | 0.18 | 0.03 | 0.56 | 1.45 | 0.56 | 1.49 | 1.68 |
| BLC_T09 | −0.40 | 0.006 | 0.22 | 0.03 | 0.02 | 1.70 | 0.08 | 1.85 | 2.02 |
| BLC_T10 | −0.45 | 0.008 | 0.18 | 0.03 | 0.04 | 1.81 | 0.05 | 1.90 | 1.88 |
| BLC_T12 | −0.20 | −0.003 | 0.19 | 0.03 | 0.35 | 1.48 | 0.35 | 1.54 | 1.51 |
| BLC_T15 | −0.26 | −0.019 | 0.19 | 0.03 | 0.42 | 1.68 | 0.26 | 0.89 | 1.11 |
| BLC_T23 | −0.09 | 0.029 | 0.21 | 0.03 | 0.57 | 1.34 | 0.59 | 1.39 | 1.32 |
| BLC_T31 | −0.38 | −0.011 | 0.23 | 0.04 | 0.07 | 1.72 | 0.16 | 1.84 | 1.91 |
| BLC_T34 | 0.08 | 0.003 | 0.24 | 0.03 | 0.09 | 0.98 | 0.13 | 0.99 | 1.55 |
| BLC_T37 | −0.23 | −0.051 | 0.26 | 0.04 | 0.08 | 1.44 | 0.11 | 1.53 | 1.51 |
| L_B1_T24B | −0.18 | −0.028 | 0.21 | 0.03 | 0.42 | 1.50 | 0.41 | 1.64 | 1.84 |
| L_B1_T25A | 0.00 | −0.032 | 0.17 | 0.03 | 0.58 | 1.00 | 0.61 | 1.04 | 1.00 |
| L_B1_T30 | −0.39 | −0.005 | 0.22 | 0.04 | 0.17 | 1.76 | 0.22 | 1.83 | 1.84 |
| L_B1_T47 | 0.01 | −0.022 | 0.19 | 0.03 | 0.54 | 1.00 | 0.55 | 1.03 | 1.00 |
Figure 4. Histograms of estimated global parameters for HER2-positive breast cancer data. (a) Top left: histogram of the GC coefficient h. (b) Top right: histogram of the normal cell proportion. (c) Bottom left: histogram of the standard deviation of the LRR signal. (d) Bottom right: histogram of the standard deviation of the BAF signal.
Figure 5. Identification of HER2 amplification in HER2-positive breast cancer data. (a) Pie chart of the maximal copy numbers of the HER2 region estimated by GPHMM. CN < 2: maximal copy number less than 2; CN = 2: maximal copy number equal to 2; CN = 3: maximal copy number equal to 3; CN = 4: maximal copy number equal to 4; CN ≥ 5: maximal copy number greater than or equal to 5. (b) Different genomic patterns of HER2 amplification identified in HER2-positive breast cancer data; arrows indicate the HER2 locus on chromosome 17.
Figure 6. Validation of GPHMM results in a HER2-positive breast cancer sample. (a) FISH image of HER2 (green), TOP2A (red) and CEP 17 (aqua) probe signals in tumor sample nuclei. The HER2 locus is highly amplified (average copy number 23.1). (b) FISH image of CCND1 (red) and CEP 11 (green) probe signals in tumor nuclei. Yellow and green arrows show two different tumor subpopulations. (c) FISH image of MYC (green), LPL (red) and CEP 8 (aqua) probe signals in tumor sample nuclei. Yellow and green arrows show two different tumor subpopulations. (d) Comparison of the copy numbers estimated from FISH probes with the results of GPHMM using SNP array data.