Christian Rödelsperger, Peter Krawitz, Sebastian Bauer, Jochen Hecht, Abigail W Bigham, Michael Bamshad, Birgit Jonske de Condor, Michal R Schweiger, Peter N Robinson.
Abstract
MOTIVATION: Next-generation sequencing and exome-capture technologies are currently revolutionizing the way geneticists screen for disease-causing mutations in rare Mendelian disorders. However, the identification of causal mutations is challenging due to the sheer number of variants that are identified in individual exomes. Although databases such as dbSNP or HapMap can be used to reduce the plethora of candidate genes by filtering out common variants, the set of genes that remains is still on the order of dozens.
Year: 2011 PMID: 21278187 PMCID: PMC3051326 DOI: 10.1093/bioinformatics/btr022
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. HMM to identify regions identical by descent in exome sequencing data. (A) In siblings affected with an autosomal recessive disorder, both the maternal and the paternal haplotypes surrounding the disease gene are identical by descent (IBD = 2). It is not possible to measure the IBD = 2 state directly, but only whether each sibling was called to the same homozygous or heterozygous genotype (referred to as IBS*). In this model, every genetic locus is either IBD = 2 or IBD ≠ 2, and the transition probabilities between these two states are defined by locus-specific transition rates, dd, nd, dn, nn. According to the HMM, these states emit genotypes that are IBS* or not, according to the appropriate probability distributions. Note that genotypes in IBD ≠ 2 may be IBS* by chance, and genotypes in IBD = 2 may not be IBS* due to calling errors (displayed with outlined letters). The HMM and the observed exome sequence are used by the IBD = 2 classifying algorithm to identify regions of the genome that are IBD = 2. The disease gene must be located in such an IBD = 2 region. (B) Exome variant data of chromosome 1. Any chromosomal position that was called to a different genotype in at least one of three sibs on chromosome 1 with respect to the haploid reference sequence hg18 is depicted as a colored vertical line. In the upper panel, green indicates a genomic position that is IBS* in all three sibs, and red indicates ¬IBS* (non-IBS*). In the lower panel, green indicates genomic positions classified as IBD = 2, whereas red indicates IBD ≠ 2.
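The two-state model in panel A can be illustrated with a small Viterbi decoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `viterbi_ibd2` is hypothetical, the constant self-transition probability `stay` replaces the locus-specific transition rates described in the caption, and `e11` and `e01` are read here as P(IBS* | IBD = 2) and P(IBS* | IBD ≠ 2), with default values taken from the Fig. 2 caption.

```python
import math

def viterbi_ibd2(ibs_star, e11=0.75, e01=0.26, stay=0.99, prior_ibd2=1 / 16):
    """Most likely IBD=2 / IBD!=2 path for a sequence of IBS* observations.

    ibs_star:   list of bools, True where all sibs share the same called genotype.
    e11, e01:   assumed to be P(IBS* | IBD=2) and P(IBS* | IBD!=2).
    stay:       constant self-transition probability (a simplification; the
                model in the caption uses locus-specific rates).
    prior_ibd2: initial probability of IBD=2 (1/16 is the three-sib expectation).
    Returns a list of bools, True where the site is decoded as IBD=2.
    """
    if not ibs_star:
        return []
    # State 0 = IBD!=2, state 1 = IBD=2.
    log_init = [math.log(1 - prior_ibd2), math.log(prior_ibd2)]
    log_trans = [[math.log(stay), math.log(1 - stay)],
                 [math.log(1 - stay), math.log(stay)]]

    def log_emit(state, obs):
        p = e11 if state == 1 else e01
        return math.log(p if obs else 1 - p)

    score = [log_init[s] + log_emit(s, ibs_star[0]) for s in (0, 1)]
    back = []  # back[i][s] = best predecessor of state s at site i + 1
    for obs in ibs_star[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[r] + log_trans[r][s] for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + log_emit(s, obs))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s == 1 for s in path]
```

With these toy parameters, a single ¬IBS* site inside a long IBS* run stays decoded as IBD = 2, which mirrors the caption's point that calling errors can occur inside true IBD = 2 regions.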
Fig. 2. (A and B) Distribution of the IBD = 2 ratio and the number of IBD = 2 intervals. Exome datasets were simulated for 25 000 families consisting of n = 2 and n = 3 siblings using HapMap variant frequency data (International HapMap Consortium, 2007). (A) The mean proportion of the genome that is IBD = 2 is μ = 1/4 for n = 2 and μ = 1/16 for n = 3. (B) The mean number of intervals that are IBD = 2 is 38 in two and 14 in three siblings. (C) Robustness of IBD = 2 classification. The in silico exomes were simulated with a sequencing accuracy of 0.999 and a variant calling error rate of ε = 0.05. This yielded emission probabilities of e11 = 0.77 and e01 = 0.28. Using these emission probabilities as HMM parameters in the IBD = 2 classifier, the simulated 3 sib exome data could be classified with a false negative rate of fnr = 0.016 and a false positive rate of fpr = 0.095 (triangle). Decreasing the emission probabilities increases sensitivity but lowers specificity. As default parameters for the classification of real exome datasets of three siblings, the emission probabilities were set to e11 = 0.75 and e01 = 0.26 to increase sensitivity above 99% for the expected error rates (filled circle). (D) Posterior probabilities of IBD = 2 classification. The logarithmic ratio of the posterior probabilities of being IBD = 2 versus IBD ≠ 2 is plotted for all classified variant positions on chromosome 1. A disease-causing mutation (red star) was identified in an IBD = 2 region of high posterior probability (Krawitz ).
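The per-site log ratio of posterior probabilities plotted in panel D can be obtained from a two-state HMM with the forward-backward algorithm. The following is a hedged sketch rather than the published classifier: `ibd2_log_odds` is a hypothetical name, the constant `stay` probability stands in for locus-specific transition rates, the natural logarithm is an arbitrary choice because the caption does not state the base, and `e11`/`e01` are again read as P(IBS* | IBD = 2) and P(IBS* | IBD ≠ 2).

```python
import math

def ibd2_log_odds(ibs_star, e11=0.75, e01=0.26, stay=0.99, prior_ibd2=1 / 16):
    """Per-site natural-log ratio of P(IBD=2 | data) to P(IBD!=2 | data).

    All parameters are illustrative assumptions; see the lead-in text.
    """
    n = len(ibs_star)
    if n == 0:
        return []
    # State 0 = IBD!=2, state 1 = IBD=2.
    trans = [[stay, 1 - stay], [1 - stay, stay]]
    init = [1 - prior_ibd2, prior_ibd2]

    def emit(state, obs):
        p = e11 if state == 1 else e01
        return p if obs else 1 - p

    def normalise(pair):
        total = sum(pair)
        return [x / total for x in pair]

    # Forward pass, rescaled at every site to avoid numerical underflow.
    fwd = [normalise([init[s] * emit(s, ibs_star[0]) for s in (0, 1)])]
    for obs in ibs_star[1:]:
        prev = fwd[-1]
        cur = [emit(s, obs) * sum(prev[r] * trans[r][s] for r in (0, 1))
               for s in (0, 1)]
        fwd.append(normalise(cur))

    # Backward pass, rescaled the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        cur = [sum(trans[s][r] * emit(r, ibs_star[i + 1]) * nxt[r] for r in (0, 1))
               for s in (0, 1)]
        bwd[i] = normalise(cur)

    # The posterior is proportional to fwd * bwd; scaling cancels in the ratio.
    return [math.log((f[1] * b[1]) / (f[0] * b[0])) for f, b in zip(fwd, bwd)]
```

Positive values favour IBD = 2 and negative values favour IBD ≠ 2, so a threshold on this quantity trades sensitivity against specificity in the way panel C describes.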
IBD = 2 classification of simulated datasets
| | Total (2 sibs) | Percentage (2 sibs) | Total (3 sibs) | Percentage (3 sibs) |
|---|---|---|---|---|
| Variant sites | 21 150 (±1414) | 100 | 22 034 (±1687) | 100 |
| IBS* | 6615 (±867) | 31 | 6842 (±735) | 31 |
| True IBD = 2 | 5260 (±553) | 25 | 1494 (±457) | 6.5 |
| Classified IBD = 2 | 7754 (±1332) | 36 | 2198 (±683) | 10 |
The mean number (±1 SD) of variants called on a CCDS exome of European ethnicity, as expected from HapMap data, is on the order of 20K. The fraction of exome positions that are expected to show identical genotypes (IBS*) is 0.31. The mean fraction of the exome that is IBD = 2 is 1/4 in 2 sibs and 1/16 in 3 sibs. With a false negative rate of fnr < 0.01, about a third of the exome is classified IBD = 2 for 2 sibs and about a tenth for 3 sibs. Chromosomal positions classified as IBD = 2 but that are not IBS* are either misclassifications or are ¬IBS* due to calling errors. Therefore, besides reducing the search space, IBD = 2 classification can help identify calling errors.
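The expected IBD = 2 fractions quoted above follow from a short calculation: each additional full sibling matches both parental haplotypes of the first sibling with probability 1/4, so n siblings are jointly IBD = 2 over a fraction (1/4)^(n − 1) of the genome. The sketch below reproduces those numbers and, as a plausibility check only, mixes the simulated three-sib emission probabilities (e11 = 0.77, e01 = 0.28) to recover the reported IBS* fraction. Both function names are hypothetical, and reading e11 and e01 as P(IBS* | IBD = 2) and P(IBS* | IBD ≠ 2) is an assumption.

```python
from fractions import Fraction

def expected_ibd2_fraction(n_sibs):
    """Expected genome fraction that is IBD=2 across n full siblings."""
    if n_sibs < 2:
        raise ValueError("need at least two siblings")
    # Each sibling beyond the first shares both haplotypes with probability 1/4.
    return Fraction(1, 4) ** (n_sibs - 1)

def expected_ibs_star_fraction(n_sibs, e11, e01):
    """Overall IBS* fraction as a mixture over the two hidden states.

    Assumes e11 = P(IBS* | IBD=2) and e01 = P(IBS* | IBD!=2).
    """
    pi = float(expected_ibd2_fraction(n_sibs))
    return pi * e11 + (1 - pi) * e01

print(expected_ibd2_fraction(2))                              # 1/4
print(expected_ibd2_fraction(3))                              # 1/16
print(round(expected_ibs_star_fraction(3, 0.77, 0.28), 2))    # 0.31
```

Under this reading, the three-sib mixture gives 0.0625 × 0.77 + 0.9375 × 0.28 ≈ 0.31, in line with the IBS* percentage in the table.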
Fig. 3. Filtering of exome variant calls. Variants relative to hg18 were called in the exomes of all three sibs of families A, B, and C. Subsequently, every position was classified as IBD = 2 or IBD ≠ 2 and as a common or rare variant. Variants classified as IBD = 2 and rare represent the set of candidate mutations for rare monogenic diseases. Variants that are classified as IBD = 2 but are ¬IBS* are either false IBD = 2 classifications or have been called to wrong genotypes in at least one of the probed samples.
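The filtering logic in this caption amounts to two set selections over annotated variants. The sketch below is illustrative only: the `Variant` record, its field names, and the toy positions are hypothetical and do not reflect the authors' data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ibd2: bool      # site classified as IBD=2 by the HMM
    ibs_star: bool  # all sibs called to the same genotype
    common: bool    # listed as a common variant (e.g. in dbSNP or HapMap)

def candidate_mutations(variants):
    """Variants that are IBD=2 and rare: the candidate set for the disease."""
    return [v for v in variants if v.ibd2 and not v.common]

def suspect_calls(variants):
    """IBD=2 sites that are not IBS*: misclassified or miscalled genotypes."""
    return [v for v in variants if v.ibd2 and not v.ibs_star]

# Toy records with made-up positions, for illustration only.
toy = [
    Variant("chr1", 1000, ibd2=True, ibs_star=True, common=False),
    Variant("chr1", 2000, ibd2=True, ibs_star=True, common=True),
    Variant("chr1", 3000, ibd2=True, ibs_star=False, common=False),
    Variant("chr1", 4000, ibd2=False, ibs_star=True, common=False),
]
print([v.pos for v in candidate_mutations(toy)])  # [1000, 3000]
print([v.pos for v in suspect_calls(toy)])        # [3000]
```

Keeping the two selections separate lets the suspect calls be reviewed or re-genotyped before they are removed from the candidate list.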