Mamoru Kato, Seungtai Yoon, Naoya Hosono, Anthony Leotta, Jonathan Sebat, Tatsuhiko Tsunoda, Michael Q Zhang.
Abstract
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals' diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1-2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12-18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
Keywords: EM algorithm; copy number variation; haplotype inference; phasing
Year: 2011 PMID: 22384316 PMCID: PMC3276117 DOI: 10.1534/g3.111.000174
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1 Illustration of ICN, SNP, and SNVC sites. (Upper part) “ICN”, “SNP”, and “SNVC” represent integer copy number, single nucleotide polymorphism, and single nucleotide variation in a CNV, respectively. A “copy unit” represents the unit of DNA sequence that is duplicated in a CNV region (Kato ). In this illustration, most invariant bases in copy units are omitted for simplicity. (Lower part) High-throughput experimental technologies give data on the total number of copies over a diplotype or an unphased sequence genotype.
Figure 2 Illustration of simulation. “Ind.” is short for individuals. The symbols “-”, “/”, and “,” in the top left table represent a deletion, the separator between haplotypes, and the separator between copy units in a duplication. (1) Using known diplotypes (Sachse ), we made unphased genotypes. (2) We randomly generated signal intensities for the unphased genotypes, using normal distributions with the means and variances taken from real microarray data (Korn ). (3) We calculated the likelihoods of the signal intensities for all possible unphased genotypes (or total copy numbers), based on the normal distributions above.
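Steps (1) to (3) of the simulation can be sketched for a single ICN site as follows. This is only an illustration under stated assumptions: the means and standard deviations in `INTENSITY_MODEL` are made-up placeholders (the study took them from real microarray data), and the function names are hypothetical rather than taken from the authors' software.

```python
import math
import random

# Hypothetical intensity model: total copy number -> (mean, standard deviation).
# These values are invented for illustration, not taken from real microarray data.
INTENSITY_MODEL = {
    0: (0.2, 0.15),
    1: (1.0, 0.20),
    2: (1.8, 0.25),
    3: (2.5, 0.30),
    4: (3.1, 0.35),
}

def unphase(diplotype):
    """Step 1: collapse a diplotype (pair of allelic copy numbers) to a total copy number."""
    return diplotype[0] + diplotype[1]

def simulate_intensity(total_copies, rng):
    """Step 2: draw a signal intensity from the normal distribution of that copy number."""
    mean, sd = INTENSITY_MODEL[total_copies]
    return rng.gauss(mean, sd)

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def copy_number_likelihoods(intensity):
    """Step 3: likelihood of the intensity under every possible total copy number."""
    return {c: normal_pdf(intensity, m, s) for c, (m, s) in INTENSITY_MODEL.items()}

rng = random.Random(0)
known_diplotypes = [(1, 1), (0, 1), (1, 2), (1, 1)]  # toy "known" diplotypes
for diplotype in known_diplotypes:
    total = unphase(diplotype)
    intensity = simulate_intensity(total, rng)
    likelihoods = copy_number_likelihoods(intensity)
    most_likely = max(likelihoods, key=likelihoods.get)
    print(diplotype, total, round(intensity, 3), most_likely)
```

Keeping the whole likelihood dictionary, rather than only `most_likely`, is what lets a probabilistic method recover from intensities that fall between two copy-number clusters.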
Examples of haplotype frequency estimation
| Variation Type | Haplotype | Known Frequency | Estimated Frequency |
|---|---|---|---|
| One ICN site | 1 copy | 0.9609 | 0.9607 |
| | 0 copies | 0.0196 | 0.0199 |
| | 2 copies | 0.0196 | 0.0194 |
| | 3 copies | NA | 1.0 × 10^−10 |
| One SNVC site | A | 0.9600 | 0.9598 |
| | - | 0.0196 | 0.0199 |
| | A, A | 0.0196 | 0.0194 |
| | B | 0.0009 | 0.0009 |
| | A, B | NA | 2.7 × 10^−10 |
| Two SNVC sites | -A | 0.9592 | 0.9589 |
| | -- | 0.0196 | 0.0199 |
| | -A, -A | 0.0196 | 0.0194 |
| | -B | 0.0009 | 0.0009 |
| | AA | 0.0009 | 0.0009 |
| | BA | NA | 3.6 × 10^−5 |
NA indicates that the corresponding haplotypes did not exist in the known dataset.
The symbols A and B represent different nucleotide bases, and “-” and “,” represent a deletion and the separator between copies in a duplicated region, respectively. We omitted haplotypes with estimated frequencies of less than 10^−10 (none of these haplotypes existed in the known dataset).
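To make the estimation concrete, here is a minimal sketch of an EM iteration for haplotype frequencies at one ICN site, using per-individual likelihoods of each total copy number instead of a single hard call. It is not the authors' implementation: it assumes random mating (diplotype probability equal to the product of the two haplotype frequencies), handles only a single site, and every function name and input value is hypothetical.

```python
from itertools import combinations_with_replacement

def em_haplotype_frequencies(likelihoods, alleles=(0, 1, 2), init=None, n_iter=100):
    """Estimate allelic copy-number frequencies by EM.

    likelihoods: one dict per individual, mapping total copy number to
    P(signal intensity | total copy number). Assumes random mating.
    """
    # A uniform start can stall at a symmetric fixed point, so callers may pass `init`.
    freqs = dict(init) if init else {a: 1.0 / len(alleles) for a in alleles}
    for _ in range(n_iter):
        counts = {a: 0.0 for a in alleles}
        for lik in likelihoods:
            # E-step: posterior weight of each ordered haplotype pair (a, b).
            weights = {(a, b): freqs[a] * freqs[b] * lik.get(a + b, 0.0)
                       for a in alleles for b in alleles}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # intensity incompatible with every pair; skip individual
            for (a, b), w in weights.items():
                counts[a] += w / norm
                counts[b] += w / norm
        # M-step: renormalize expected haplotype counts into frequencies.
        total = sum(counts.values())
        freqs = {a: c / total for a, c in counts.items()}
    return freqs

def most_likely_diplotype(lik, freqs):
    """Return the unordered haplotype pair with the largest posterior weight."""
    best, best_weight = None, -1.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        weight = freqs[a] * freqs[b] * lik.get(a + b, 0.0)
        if a != b:
            weight *= 2.0  # (a, b) and (b, a) are the same diplotype
        if weight > best_weight:
            best, best_weight = (a, b), weight
    return best

# Toy noisy input: most individuals look like total copy number 2.
noisy = ([{1: 0.02, 2: 0.95, 3: 0.03}] * 96
         + [{1: 0.90, 2: 0.10}] * 2
         + [{2: 0.20, 3: 0.80}] * 2)
freqs = em_haplotype_frequencies(noisy, init={0: 0.1, 1: 0.8, 2: 0.1})
print(freqs)
print(most_likely_diplotype({1: 0.4, 2: 0.6}, freqs))
```

The starting point matters because a total copy number of 2 can arise from either 1 + 1 or 0 + 2, so the sketch starts from frequencies that favor the one-copy allele.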
Error rate of haplotype frequencies estimated by the current and previous algorithms
| Variation Type | TV, Current Algorithm | TV, Previous Algorithm |
|---|---|---|
| One ICN site | 0.019 ± 0.036 | 0.123 ± 0.122 |
| One SNVC site | 0.008 ± 0.010 | 0.118 ± 0.117 |
| Two SNVC sites | 0.022 ± 0.029 | 0.176 ± 0.109 |
TV measures the deviation of estimated frequencies from answer frequencies. The numbers are the mean ± SD across the 14 sets.
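The text here does not spell out the formula for TV. Assuming it denotes the total variation distance (half the sum of absolute differences between two frequency vectors), it could be computed as in the sketch below, which uses the one-ICN-site example frequencies from the haplotype frequency table and treats a haplotype absent from the known dataset as having frequency 0. The function name is hypothetical.

```python
def total_variation(known, estimated):
    """Half the L1 distance between two haplotype frequency dictionaries.

    Haplotypes missing from either dictionary are treated as frequency 0.
    """
    haplotypes = set(known) | set(estimated)
    return 0.5 * sum(abs(known.get(h, 0.0) - estimated.get(h, 0.0)) for h in haplotypes)

# Example frequencies for one ICN site (known vs. estimated).
known = {"1 copy": 0.9609, "0 copies": 0.0196, "2 copies": 0.0196}
estimated = {"1 copy": 0.9607, "0 copies": 0.0199, "2 copies": 0.0194, "3 copies": 1.0e-10}

print(total_variation(known, estimated))
```

Under this assumed definition, TV ranges from 0 (identical frequencies) to 1 (no shared haplotypes).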
Inference accuracy of individuals’ diplotypes and total copy numbers or unphased genotypes
| Variation Type | Accuracy of Diplotypes by the Current Algorithm | Accuracy of the Most Likely Unphased Copy Numbers or Genotypes | Correction | Corruption |
|---|---|---|---|---|
| One ICN site | 94.9 ± 5.4% | 78.1 ± 20.6% | 19.2 ± 17.9% | 2.5 ± 2.9% |
| One SNVC site | 95.6 ± 4.3% | 78.6 ± 20.0% | 19.1 ± 17.7% | 2.2 ± 2.3% |
| Two SNVC sites | 94.9 ± 6.1% | 66.0 ± 17.4% | 29.4 ± 14.7% | 0.5 ± 0.4% |
Accuracy of Diplotypes by the Current Algorithm: proportion of diplotypes (and also total copy numbers or unphased genotypes) that were correctly inferred by the current algorithm.
Accuracy of the Most Likely Unphased Copy Numbers or Genotypes: proportion of the most likely total copy numbers or unphased genotypes that were correct. The most likely one is the one with the largest likelihood value among all possible total copy numbers or unphased genotypes, and it was therefore used as the input of the previous algorithm. As with diplotypes, a call counted as correct only if it agreed with the answer over all sites.
Correction: proportion of total copy numbers or unphased genotypes that were incorrectly determined with the largest likelihood value but were correctly inferred by the current algorithm.
Corruption: proportion of total copy numbers or unphased genotypes that were correctly determined with the largest likelihood value but were incorrectly inferred by the current algorithm.
All numbers are the mean ± SD across the 14 sets.
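The four columns are linked by simple bookkeeping: calls that were right by largest likelihood, plus calls the current algorithm corrected, minus calls it corrupted, should give the current algorithm's accuracy. The short check below applies this to the mean values in the table. It is a reader's consistency check, not part of the original analysis, and small gaps are expected from rounding.

```python
# Mean values (in percent) from the table:
# (current accuracy, most-likely-call accuracy, correction, corruption)
rows = {
    "One ICN site": (94.9, 78.1, 19.2, 2.5),
    "One SNVC site": (95.6, 78.6, 19.1, 2.2),
    "Two SNVC sites": (94.9, 66.0, 29.4, 0.5),
}

def reconstructed_accuracy(baseline, correction, corruption):
    """Accuracy implied by the baseline calls plus corrections minus corruptions."""
    return baseline + correction - corruption

for name, (reported, baseline, correction, corruption) in rows.items():
    implied = reconstructed_accuracy(baseline, correction, corruption)
    print(f"{name}: reported {reported:.1f}%, implied {implied:.1f}%")
```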
Figure 3 Influence of sample size on performance. The left y-axis represents the deviation (the TV index) of estimated haplotype frequencies from answer frequencies for the current (red line with circles) and the previous (red line with crosses) algorithms. The right y-axis represents the proportion of diplotypes correctly inferred by the current algorithm (blue line with circles) or unphased genotypes correctly determined with the largest likelihood value (blue line with crosses). Each point at a given sample size is the mean of values over 10 different answer sets × 14 different site sets. (A) For one SNVC site. (B) For two SNVC sites. The result for one ICN site is not shown because it was similar to that for one SNVC site.
Figure 4 Frequency spectra. The width of each bin is 2%. Alleles with a very small [<1 / (2 × the number of individuals)] or very large [>1 − 1 / (2 × the number of individuals)] frequency are excluded from the counts. (A) The frequency spectrum of alleles. A_n (n = 0, 1, 2) in the box represents an allelic copy number. (B) The frequency spectrum of total copy numbers derived from the allelic copy numbers. T_n (n = 0, 1, ..., 4) in the box represents a total copy number.
Examples of CNV regions that overlapped with genes and that had substantial amounts of estimated frequencies for both zero-copy and two-copy alleles
| Chromosome | From | To | Haplotype | Frequency | Overlapping Gene | Synopsis of the Function |
|---|---|---|---|---|---|---|
| 1 | 54,858,310 | 54,866,895 | 0 copies | 0.025 | | Associated with obesity in mouse |
| | | | 1 copy | 0.920 | | |
| | | | 2 copies | 0.055 | | |
| 20 | 45,247,180 | 45,253,878 | 0 copies | 0.073 | | Possible involvement in eye development |
| | | | 1 copy | 0.905 | | |
| | | | 2 copies | 0.023 | | |
| 11 | 106,745,435 | 106,757,969 | 0 copies | 0.038 | | Unknown ( |
| | | | 1 copy | 0.696 | | |
| | | | 2 copies | 0.266 | | |
The “Chromosome”, “From”, and “To” columns indicate the chromosomal locations of CNV regions (hg18).
We omitted haplotypes with estimated frequencies of less than 10^−3.