Hyoyoung Choo-Wosoba, Paul S Albert, Bin Zhu.
Abstract
BACKGROUND: Somatic copy number alteration (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. Allele-specific SCNA analysis of a tumor sample aims to identify the copy numbers of both alleles, adjusting for ploidy and tumor purity. Next-generation sequencing platforms produce abundant read counts at base-pair resolution across the exome or whole genome, but analyses of these data are susceptible to hypersegmentation, a phenomenon in which numerous very short regions are falsely identified as SCNA.
Keywords: Allele-specific somatic copy number alteration; Hidden Markov model; Hypersegmentation; Next-generation sequencing; The Cancer Genome Atlas study
Year: 2018 PMID: 30428830 PMCID: PMC6236906 DOI: 10.1186/s12859-018-2412-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of tumor genotype states and corresponding genotype of total copy number: homozygous deletion (HOMD), hemizygous deletion LOH (DLOH), copy neutral LOH (NLOH), diploid heterozygous (HET), gain of 1 allele (GAIN), amplified LOH (ALOH), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), and unbalanced copy number amplification (UBCNA)
| State | Genotype | Copy number | Allelic information |
|---|---|---|---|
| 1 | 0 | 0 | HOMD |
| 2 | A | 1 | DLOH |
| 3 | AA | 2 | NLOH |
| 4 | AB | 2 | HET |
| 5 | AAB | 3 | GAIN |
| 6 | AAA | 3 | ALOH |
| 7 | AAAB | 4 | ASCNA |
| 8 | AABB | 4 | BCNA |
| 9 | AAAA | 4 | ALOH |
| 10 | AAAAB | 5 | ASCNA |
| 11 | AAABB | 5 | UBCNA |
| 12 | AAAAA | 5 | ALOH |
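The state table above can be encoded as a small lookup structure. The following is a minimal sketch, not the authors' implementation: the tuple layout, the function names, and the use of an empty string for the zero-copy genotype (shown as `0` in the table) are all assumptions made for illustration. It shows that the total copy number is the length of the genotype string and that the LOH labels correspond to genotypes with no B allele.

```python
# States transcribed from the table above as (state, genotype, label).
# Assumption: the zero-copy genotype (printed as "0") is stored as "".
STATES = [
    (1, "", "HOMD"),
    (2, "A", "DLOH"),
    (3, "AA", "NLOH"),
    (4, "AB", "HET"),
    (5, "AAB", "GAIN"),
    (6, "AAA", "ALOH"),
    (7, "AAAB", "ASCNA"),
    (8, "AABB", "BCNA"),
    (9, "AAAA", "ALOH"),
    (10, "AAAAB", "ASCNA"),
    (11, "AAABB", "UBCNA"),
    (12, "AAAAA", "ALOH"),
]


def total_copy_number(genotype: str) -> int:
    """Total copy number is the number of allele copies in the genotype."""
    return len(genotype)


def allele_counts(genotype: str) -> tuple[int, int]:
    """Return (major, minor) allele copy numbers."""
    a, b = genotype.count("A"), genotype.count("B")
    return max(a, b), min(a, b)


def is_loh(genotype: str) -> bool:
    """LOH here means at least one copy is present and the minor allele is absent."""
    major, minor = allele_counts(genotype)
    return major > 0 and minor == 0


if __name__ == "__main__":
    for state, genotype, label in STATES:
        print(state, genotype or "0", total_copy_number(genotype),
              allele_counts(genotype), label, is_loh(genotype))
```

Storing only the genotype string keeps the copy number and allelic split derivable rather than duplicated, which makes transcription errors easier to spot.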
Fig. 1. Probability of identification for all genotype states from 500 simulated datasets with logR generated from the t-distribution. Blue and red lines indicate the probabilities of identification based on the hsegHMM-N and hsegHMM-T models, respectively. Each dataset consists of 10,000 observations of logR and logOR.
Summary of simulation studies with hsegHMM-N and hsegHMM-T models based on 500 simulated datasets
| Simulation | Parameter | True | hsegHMM-N Est | hsegHMM-N SE_s | hsegHMM-N SE_H | hsegHMM-T Est | hsegHMM-T SE_s | hsegHMM-T SE_H |
|---|---|---|---|---|---|---|---|---|
| 1 | ψ | 1.6 | 1.61 | 0.018 △ | 0.009 | 1.60 | 0.014 | 0.007 |
| 1 | α | 0.9 | 0.90 | 0.004 | 0.003 | 0.90 | 0.003 | 0.003 |
| 1 | κ² | 0.3 | N/A | N/A | N/A | 0.30 | 0.007 | 0.007 |
| 1 | V(W) | 0.6 | 0.55 | 0.039 | 0.008 | 0.61 a | 0.017 | 0.020 b |
| 1 | τ² | 0.5 | 0.50 | 0.033 | 0.025 | 0.50 | 0.028 | 0.024 |
| 1 | Degrees of freedom | 4 | N/A | N/A | N/A | 3.91 | 0.150 | 0.159 |
| 2 | ψ | 1.6 | 1.62 | 0.028 | 0.015 | 1.60 | 0.013 | 0.011 |
| 2 | α | 0.9 | 0.90 | 0.003 | 0.003 | 0.90 | 0.003 | 0.003 |
| 2 | κ² | N/A | N/A | N/A | N/A | 0.64 | 0.017 | 0.018 |
| 2 | V(W) | 0.65 | 1.46 | 0.055 | 0.023 | 2.28 a | 0.133 | 0.159 b |
| 2 | τ² | 0.5 | 0.48 | 0.026 | 0.024 | 0.49 | 0.025 | 0.024 |
| 2 | Degrees of freedom | N/A | N/A | N/A | N/A | 2.79 | 0.076 | 0.093 |
Simulation 1 and Simulation 2 are the t-distribution-based and the normal-mixture-based studies, respectively. Each dataset consists of 10,000 observations of logR and logOR. Est is the average estimate over the 500 datasets; ψ is the ploidy; α is the tumor purity; κ² is the variance component of logR in hsegHMM-T; V(W) and τ² are the variances of logR and logOR, respectively, in both models; SE_s is the Monte Carlo standard error calculated from the 500 datasets; SE_H is the average asymptotic standard error of the estimates based on the Hessian matrices.
∗ The average asymptotic standard errors for the hsegHMM-N model are based on 486 datasets, because 2.8% of the 500 datasets did not produce invertible Hessian matrices owing to numerical problems.
a V(W) = E(V(W|u)) + V(E(W|u))
b The asymptotic standard error of V(W) under hsegHMM-T is calculated using the delta method.
△ The distribution of the ploidy estimates is skewed, so the SE_s of the ploidy appears larger than SE_H. The scaled MAD (median absolute deviation) gives a value (0.008) closer to SE_H; the scaled MAD is a constant multiple of median_m |ψ̂_m − ψ̃|, where ψ̂_m is the estimate for the m-th dataset and ψ̃ is the median calculated from the 500 simulated datasets.
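The computations named in footnotes a, b, and △ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code. It assumes that logR follows a t-distribution written as a normal scale mixture, so that E(W|u) is constant and the decomposition in footnote a reduces to V(W) = κ²·ν/(ν − 2), where ν (a symbol introduced here) is the degrees of freedom; this agrees with the Simulation 1 true values (0.3 × 4 / 2 = 0.6). The scaling constant 1.4826 is the usual normal-consistency factor and is also an assumption, as are the function names and the diagonal covariance used in the demonstration.

```python
import math
import statistics


def t_variance(kappa2: float, nu: float) -> float:
    """V(W) for a scaled t-distribution; assumes V(W) = kappa2 * nu / (nu - 2)."""
    if nu <= 2:
        raise ValueError("the variance is finite only for nu > 2")
    return kappa2 * nu / (nu - 2.0)


def delta_method_se(kappa2: float, nu: float, cov) -> float:
    """Delta-method SE of V(W); cov is the 2x2 covariance of (kappa2, nu) estimates."""
    # Gradient of kappa2 * nu / (nu - 2) with respect to (kappa2, nu).
    grad = (nu / (nu - 2.0), -2.0 * kappa2 / (nu - 2.0) ** 2)
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(2) for j in range(2))
    return math.sqrt(var)


def scaled_mad(values, scale: float = 1.4826) -> float:
    """Scaled median absolute deviation about the median (scale is an assumption)."""
    values = list(values)
    med = statistics.median(values)
    return scale * statistics.median([abs(v - med) for v in values])


if __name__ == "__main__":
    print(t_variance(0.3, 4.0))
    # Hypothetical covariance: reported SEs on the diagonal, zero off-diagonal.
    cov = [[0.007 ** 2, 0.0], [0.0, 0.159 ** 2]]
    print(delta_method_se(0.3, 4.0, cov))
    # A skewed toy sample: the outlier inflates the SD but not the scaled MAD.
    sample = [1.59, 1.60, 1.60, 1.61, 1.61, 1.62, 1.90]
    print(statistics.stdev(sample), scaled_mad(sample))
```

The delta-method result depends on the full covariance between the two estimates, so the zero off-diagonal used here should not be expected to reproduce the tabulated standard error.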
Fig. 2. Probability of identification for all genotype states from 500 simulated datasets with logR generated from the normal-mixture distribution. Blue and red lines indicate the probabilities of identification based on the hsegHMM-N and hsegHMM-T models, respectively. Each dataset consists of 10,000 observations of logR and logOR.
Fig. 3. Probability of identification for all genotype states from 500 simulated datasets based on creating read counts for normal and tumor cells. Green, blue, and red lines indicate the probabilities of identification based on the FACETS, hsegHMM-N, and hsegHMM-T models, respectively. Each dataset consists of 4,942 observations of logR and logOR.
Fig. 4. Allele-specific SCNA analysis based on the hsegHMM-N model of a renal cell carcinoma sample from a TCGA project (TCGA-KL-1883). Blue dots are observed values and red bars are estimates. The first two panels show the profiles of logR and logOR across all chromosomes; the last two panels show the estimated copy number and genotype for each sequence across all chromosomes.
Fig. 5. Allele-specific SCNA analysis based on (a) the hsegHMM-T model and (b) the same model with the A and AB state space, for a renal cell carcinoma sample from a TCGA project (TCGA-KL-1883). Blue dots are observed values and red bars are estimates. The first two panels show the profiles of logR and logOR across all chromosomes; the last two panels show the estimated copy number and genotype for each sequence across all chromosomes.
Fig. 6. Allele-specific SCNA analysis based on the FACETS model of a renal cell carcinoma sample from a TCGA project (TCGA-KL-1883). The first two panels show the profiles of logR and logOR across all chromosomes; the last panel shows the estimated copy numbers of the total and minor alleles (black and red lines, respectively) for each sequence across all chromosomes.
Summary of the hsegHMM-N, hsegHMM-T, and hsegHMM-T A/AB models for a renal cell carcinoma sample. hsegHMM-T A/AB indicates hsegHMM-T with the A and AB state space; Est and logL represent the parameter estimates and the log-likelihood value at those estimates, respectively; ψ is the ploidy; α is the tumor purity; κ² is the variance component of logR in hsegHMM-T; V(W) and τ² are the variances of logR and logOR, respectively, in both models; SE_H indicates the asymptotic standard errors of the estimates based on the Hessian matrices.
| Parameter | hsegHMM-N Est | hsegHMM-N SE_H | hsegHMM-T Est | hsegHMM-T SE_H | hsegHMM-T A/AB Est | hsegHMM-T A/AB SE_H |
|---|---|---|---|---|---|---|
| ψ | 1.62 | 0.003 | 1.61 | 0.003 | 1.60 | 0.003 |
| α | 0.87 | 0.002 | 0.88 | 0.002 | 0.88 | 0.002 |
| κ² | N/A | N/A | 0.16 | 0.002 | 0.17 | 0.003 |
| V(W) | 0.25 | 0.002 | 0.26 | 0.003 | 0.27 | 0.003 |
| τ² | 0.57 | 0.012 | 0.58 | 0.012 | 0.57 | 0.014 |
| Degrees of freedom | N/A | N/A | 5.50 | 0.185 | 5.48 | 0.218 |
| AIC | 64682.17 | | 63120.48 | | 62923.90 | |
| BIC | 65934.07 | | 64380.89 | | 62992.03 | |
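The model comparison in the table above can be reproduced with the standard definitions AIC = 2k − 2·logL and BIC = k·ln(n) − 2·logL, where k is the number of free parameters and n the number of observations. The sketch below is illustrative only: the function names and dictionary layout are assumptions, and because the source does not report k, n, or logL for these fits, the comparison uses the tabulated AIC and BIC values directly.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 logL."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 logL."""
    return n_params * math.log(n_obs) - 2 * log_lik


# (AIC, BIC) pairs copied from the table above.
REPORTED = {
    "hsegHMM-N": (64682.17, 65934.07),
    "hsegHMM-T": (63120.48, 64380.89),
    "hsegHMM-T A/AB": (62923.90, 62992.03),
}


def best_model(scores: dict, index: int) -> str:
    """Return the model with the smallest criterion (index 0 = AIC, 1 = BIC)."""
    return min(scores, key=lambda name: scores[name][index])


if __name__ == "__main__":
    print("lowest AIC:", best_model(REPORTED, 0))
    print("lowest BIC:", best_model(REPORTED, 1))
    for name, (a, b) in REPORTED.items():
        # BIC - AIC equals k * (ln(n) - 2), so it grows with the parameter count.
        print(name, round(b - a, 2))
```

Under both criteria the smallest value belongs to hsegHMM-T A/AB. Its much smaller gap between BIC and AIC is consistent with a model that has far fewer free parameters, as expected from a reduced state space fitted to the same data.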