Ming Li, Yalu Wen, Wenjiang Fu.
Abstract
Accumulating evidence has shown that structural variations, due to insertions, deletions, and inversions of DNA, may contribute considerably to the development of complex human diseases, such as breast cancer. High-throughput genotyping technologies, such as Affymetrix high density single-nucleotide polymorphism (SNP) arrays, have produced large amounts of genetic data for genome-wide SNP genotype calling and copy number estimation. Meanwhile, there is a great need for accurate and efficient statistical methods to detect copy number variants. In this article, we introduce a hidden-Markov-model (HMM)-based method, referred to as PICR-CNV, for copy number inference. The proposed method first estimates copy number abundance for each SNP on a single array based on the raw fluorescence values, and then standardizes the estimated copy number abundance to achieve equal footing among multiple arrays. This method requires no between-array normalization and thus maintains data integrity and independence of samples among individual subjects. In addition to our efforts to apply new statistical technology to raw fluorescence values, the HMM has been applied to the standardized copy number abundance in order to reduce experimental noise. Through simulations, we show that our refined method is able to infer copy number variants accurately. Application of the proposed method to a breast cancer dataset helps to identify genomic regions significantly associated with the disease.
Keywords: Affymetrix high density SNP array; breast cancer; copy number standardization; copy number variants; hidden Markov model
Year: 2015 PMID: 26279618 PMCID: PMC4519351 DOI: 10.4137/CIN.S15203
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Distribution of the size of identified CNVs based on BRCA GWAS data.
Configuration of five possible copy number states.
| STATE | COPY NUMBER | POSSIBLE GENOTYPES | EXPECTED | EXPECTED |
|---|---|---|---|---|
| 1 | 0 | − (Deletion) | log2(0) = −∞ | 0 |
| 2 | 1 | | log2(1/2) = −1 0 | 0 |
| 3 | 2 | | log2(1) = 0 | 0 |
| 4 | 3 | | log2(3/2) = 0.585 | 0 |
| 5 | 4 | | log2(2) = 1 | 0 |
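The logarithmic values in the table are consistent with comparing each copy number against the normal diploid count of 2 on a log2 scale. A minimal Python sketch of that arithmetic, assuming a reference copy number of 2 (the function name `expected_log2_ratio` is illustrative and does not come from the source):

```python
import math

def expected_log2_ratio(copy_number, reference=2):
    """Return log2(copy_number / reference); -inf for a complete deletion."""
    if copy_number < 0:
        raise ValueError("copy number must be non-negative")
    if copy_number == 0:
        return -math.inf  # log2(0) diverges to minus infinity
    return math.log2(copy_number / reference)

# HMM states 1..5 correspond to copy numbers 0..4 in the table above.
for state, cn in enumerate(range(5), start=1):
    print(state, cn, round(expected_log2_ratio(cn), 3))
```

Treating copy number 0 as a special case avoids a domain error from `math.log2(0)`; in practice a model would likely need a finite stand-in for that state.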
Error rate for inference of copy number states with correctly and incorrectly specified expected length of copy number states.
| HMM STATE | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|
| AVERAGE NO. OF SNPs WITH COPY NUMBER STATE IN EACH SUBJECT | 557 | 163 | 8,875 | 185 | 220 | 10,000 |
| λTrue | 5.92e–04 | 1.53e–04 | 2.37e–05 | 1.40e–04 | 1.32e–03 | 1.34e–04 |
| 2λTrue | 3.97e–03 | 4.91e–04 | 1.69e–05 | 7.01e–04 | 4.46e–03 | 3.55e–04 |
| 5λTrue | 4.18e–03 | 6.13e–04 | 1.80e–05 | 7.01e–04 | 4.51e–03 | 3.71e–04 |
| 10λTrue | 4.38e–03 | 9.20e–04 | 1.80e–05 | 1.08e–03 | 4.87e–03 | 4.02e–04 |
| 0.5λTrue | 9.69e–04 | 1.53e–04 | 3.27e–05 | 1.56e–04 | 1.32e–03 | 1.66e–04 |
Note: 2λTrue means that the model-specified λ is 2 times the true λ.
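To build intuition for what specifying an expected state length means in an HMM, it helps to relate run length to the self-transition probability. The sketch below assumes a standard first-order HMM, in which the number of consecutive SNPs spent in a state is geometrically distributed, so a self-transition probability p gives a mean run length of 1/(1 − p). How the source parameterizes λ is not shown here, so any mapping from λ to p is an assumption, and the function names and example lengths are hypothetical.

```python
def self_transition_prob(expected_length):
    """Stay-probability that yields the requested mean run length (in SNPs)."""
    if expected_length < 1:
        raise ValueError("expected length must be at least 1 SNP")
    return 1.0 - 1.0 / expected_length

def expected_run_length(p_stay):
    """Mean number of consecutive SNPs spent in a state with stay-probability p_stay."""
    if not 0.0 <= p_stay < 1.0:
        raise ValueError("p_stay must be in [0, 1)")
    return 1.0 / (1.0 - p_stay)

# Hypothetical example: a mean run of 50 SNPs versus a model assuming twice that.
for length in (50, 100):
    print(length, self_transition_prob(length))
```

Under this assumption, doubling the specified length moves the stay-probability closer to 1, which makes the model more reluctant to switch states.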
Regions showing significant copy number variation in phase III data and their replication in phase I data.
| CHRO. | CYTOBAND | PHYSICAL LOCATION | NO. OF SNPs | P-VALUE (PHASE III) | P-VALUE (PHASE I) |
|---|---|---|---|---|---|
| 1 | p21.1 | 102622376–102640646 | 7 | 2.62e–13 | 0.954 |
| 1 | p12 | 120292824–120312909 | 3 | 7.62e–14 | 0.999 |
| 1 | q22 | 154077091–154106555 | 3 | 2.453e–11 | 0.999 |
| | | 45759616–45760637 | 3 | 1.106e–08 | 0.014 |
| 2 | p12 | 81196767–81197522 | 3 | 7.232e–09 | 0.977 |
| 2 | q21.1 | 131925407–131955270 | 3 | 4.872e–13 | 0.999 |
| 3 | p14.3 | 57706175–57839689 | 3 | 1.228e–09 | 0.116 |
| 4 | q26 | 117544365–117576957 | 3 | 4.577e–11 | 0.138 |
| | | 148668320–148697327 | 10 | 9.43e–15 | 7.56e–05 |
| 4 | q32.3 | 166885930–166957371 | 5 | 6.664e–11 | 0.189 |
| 5 | q14.3 | 84350898–84398999 | 5 | 4.330e–14 | 0.720 |
| 5 | q22.3 | 115145252–115178424 | 4 | 2.220e–16 | 0.893 |
| | | 75247853–75311831 | 5 | 5.218e–15 | 0.034 |
| 6 | q22.33 | 128476625–128533696 | 6 | 2.409e–13 | 0.806 |
| 6 | q23.2 | 134651674–134672863 | 5 | 3.722e–10 | 0.999 |
| 6 | q27 | 165234976–165247908 | 6 | 1.752e–09 | 0.996 |
| 7 | q22.1 | 98318717–98361309 | 4 | 4.727e–11 | 0.103 |
| 7 | q31.31 | 118754169–118754169 | 5 | 1e–17 | 0.524 |
| 8 | q11.22 | 52786953–52796842 | 3 | 4.550e–10 | 0.840 |
| 8 | q21.3 | 90963387–90964181 | 3 | 2.862e–08 | 0.772 |
| 8 | q24.13 | 125649171–139914783 | 3 | 2.30e–08 | 0.973 |
| | | 145891814–145948840 | 4 | 3.220e–15 | 7.96e–04 |
| | | 22270796–22294230 | 5 | 6.249e–09 | 3.33e–03 |
| 10 | q21.1 | 56853055–74432554 | 7 | 1.084e–09 | 0.998 |
| 11 | p13 | 36306019–36366302 | 3 | 8.95e–11 | 0.223 |
| 11 | p12 | 37905557–37916354 | 6 | 2.627e–09 | 0.968 |
| 11 | q22.3 | 104741435–104806689 | 5 | 4.152e–14 | 0.999 |
| | | 94977527–95052366 | 4 | 1.11e–16 | 1.07e–04 |
| 13 | q13.3 | 34828145–34846106 | 4 | 8.975e–10 | 0.428 |
| | | 51036156–51071687 | 4 | 5.268e–12 | 6.83e–09 |
| 13 | q33.1 | 103334252–103344370 | 5 | 1.589e–09 | 0.964 |
| 14 | q23.1 | 60136001–60140123 | 5 | 1.843e–12 | 0.996 |
| 18 | p11.31 | 3597746–3635894 | 3 | 4.268e–10 | 0.417 |
| X | q27.3 | 146596395–146646974 | 4 | 5.873e–14 | 0.086 |
Notes:
Physical location is based on Human Genome Assembly NCBI build 36.1.
Phase III included 243 cases and 187 controls.
Phase I included 249 cases and 299 controls.
Figure 2. Raw and standardized copy number abundance for a randomly selected HapMap sample (NA12892).