Varsha Dhankani1, David L Gibbs1, Theo Knijnenburg1, Roger Kramer1, Joseph Vockley2, John Niederhuber3, Ilya Shmulevich1, Brady Bernard1.
Abstract
Most currently available family-based association tests are designed to account only for nuclear families with complete genotypes for parents as well as offspring. As whole genome sequencing becomes increasingly less expensive, genetic studies are able to collect data for more families and from large family cohorts with the goal of improving statistical power. However, due to missing genotypes, many families are not included in the family-based association tests, negating the benefits of large-scale sequencing data. Here, we present the CIFBAT method, which uses incomplete families in the Family Based Association Test (FBAT) to evaluate robustness against missing data. CIFBAT computes quantile intervals of the FBAT statistic by randomly choosing valid completions of incomplete family genotypes based on Mendelian inheritance rules. By considering all valid completions equally likely and computing quantile intervals over many randomized iterations, CIFBAT avoids assuming a homogeneous population structure or any particular missingness pattern in the data. Using simulated data, we show that the quantile intervals computed by CIFBAT are useful in validating robustness of the FBAT statistic against missing data and in identifying genomic markers with higher precision. We also propose a novel set of candidate genomic markers for uterine-related abnormalities from analysis of familial whole genome sequences, and provide validation for a previously established set of candidate markers for Type 1 diabetes. We have provided a software package that incorporates TDT, robustTDT, FBAT, and CIFBAT. The data format proposed for the software uses half the memory space that standard FBAT format (PED) files use, making it efficient for large-scale genome-wide association studies.
Keywords: family based association tests; memory efficient data format; missing genotypes; population stratification; quantile intervals; randomized imputation; whole genome analysis
Year: 2016 PMID: 27047537 PMCID: PMC4796035 DOI: 10.3389/fgene.2016.00034
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Comparison of features of TDT, robustTDT, FBAT, and CIFBAT.
| Unaffected offspring | ✓ | ✓ | ✓ | ||||
| Incomplete trios | ✓ | ✓ | ✓ | ||||
| Support for ChrX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Genetic models | A,D,R | A,D,R | A,D,R | A,D,R | A | A,D,R | A |
| Memory efficient data format | ✓ | ✓ | ✓ | ✓ |
Original implementations of TDT, robustTDT and FBAT are compared with their implementations in FamSuite (For genetic models, A, additive; D, dominant; R, recessive).
Figure 1. Examples of informative complete trios. (A) Autosomal chromosomes. (B) X chromosome; trio with female offspring. (C) X chromosome; trio with male offspring.
Figure 2. Examples of admissible incomplete trios. (A) Autosomal chromosomes. (B) X chromosome (female offspring). (C) X chromosome (male offspring). CIFBAT considers all valid completions of incomplete trios in the data as equally likely. Using randomly selected completions over several repetitions, CIFBAT computes a quantile interval of the FBAT statistic.
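The completion-and-quantile idea in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the FamSuite implementation: it covers autosomal biallelic markers only, assumes every offspring is affected, substitutes the classic TDT chi-square for the FBAT statistic, and uses 2.5% and 97.5% quantile levels that are an assumption rather than values stated in the source. All function names are hypothetical; the default of 100 trials follows the simulation parameters.

```python
import random
from itertools import product

GENOTYPES = ("AA", "AB", "BB")  # canonical unordered genotypes; "BA" is not used


def mendelian_ok(father, mother, child):
    """True if the child can inherit one allele from each parent."""
    return any("".join(sorted(f + m)) == child for f in father for m in mother)


def valid_completions(father, mother, child):
    """All fully genotyped trios consistent with the observed members.

    A missing genotype is passed as None.
    """
    options = [GENOTYPES if g is None else (g,) for g in (father, mother, child)]
    return [trio for trio in product(*options) if mendelian_ok(*trio)]


def tdt_statistic(trios):
    """TDT chi-square for allele A (a stand-in here, NOT the FBAT statistic)."""
    b = c = 0  # b: A transmitted by heterozygous parents; c: B transmitted
    for father, mother, child in trios:
        parents = (father, mother)
        het = sum(p == "AB" for p in parents)
        # A alleles in the child not explained by homozygous AA parents
        a_from_het = child.count("A") - sum(p == "AA" for p in parents)
        b += a_from_het
        c += het - a_from_het
    return (b - c) ** 2 / (b + c) if b + c else 0.0


def quantile_interval(complete, incomplete, trials=100, lo=0.025, hi=0.975, seed=0):
    """Quantile interval of the statistic over random, equally likely completions."""
    rng = random.Random(seed)
    # Trios with no Mendelian-consistent completion are dropped.
    choices = [c for c in (valid_completions(*t) for t in incomplete) if c]
    stats = sorted(
        tdt_statistic(list(complete) + [rng.choice(c) for c in choices])
        for _ in range(trials)
    )
    pick = lambda q: stats[min(len(stats) - 1, int(q * len(stats)))]
    return pick(lo), pick(hi)
```

For example, `valid_completions("AA", None, "AB")` keeps only the two maternal genotypes that can transmit a B allele. A narrow interval from `quantile_interval` suggests the statistic is insensitive to how the missing genotypes are filled in; a wide one suggests it is not.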
Simulation scenarios for comparison of FBAT and CIFBAT.
| Missingness pattern | Split variable |
| MCAR | Random |
| MAR | Small Pop. |
| MAR | Large Pop. |
| MAR | Males |
| MAR | Females |
| MNAR | Cases |
| MNAR | Controls |
| MNAR | Heterozygotes |
| MNAR | Homozygotes |
| Missing data was split by the above variables 80/20 | |
Different missingness patterns (MCAR, MAR, MNAR) were simulated for comparing FBAT and CIFBAT. Population, gender, affectation status and zygosity were used to specify distribution of missing data for MAR and MNAR. In all, 9 scenarios were simulated at 0, 1, 5, 10% missing rate each.
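One way to read the 80/20 split is that 80% of the missing genotypes fall in the named group and 20% in everyone else. The sketch below masks individuals under that reading; the interpretation, the function name `mask_genotypes`, and the dictionary layout of an individual are all assumptions made for illustration.

```python
import random


def mask_genotypes(individuals, rate, in_group=None, share=0.8, seed=0):
    """Return the indices of individuals whose genotype is set to missing.

    individuals: list of dicts describing each person (hypothetical layout).
    rate: overall fraction of individuals to mask, e.g. 0.05 for 5%.
    in_group: predicate selecting the group that receives `share` of the
        missing genotypes; None masks completely at random (MCAR).
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(individuals))
    indices = list(range(len(individuals)))
    if in_group is None:
        return set(rng.sample(indices, n_missing))
    favored = [i for i in indices if in_group(individuals[i])]
    others = [i for i in indices if not in_group(individuals[i])]
    n_favored = min(len(favored), round(share * n_missing))
    n_others = min(len(others), n_missing - n_favored)
    return set(rng.sample(favored, n_favored)) | set(rng.sample(others, n_others))


# Usage: 10% missingness concentrated in cases.
people = [{"affected": i % 2 == 0} for i in range(1000)]
masked = mask_genotypes(people, rate=0.10, in_group=lambda p: p["affected"])
```

The same function covers the other scenarios by swapping the predicate, for instance one that tests population, sex, or zygosity.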
Parameters for simulation of family genotype data.
| Parameter | Value | Description |
| Populations | 2 | Sized 1/3 and 2/3 of individuals |
| Families | 4000 | Each assigned to a population |
| Number affected offspring | 636 (sd 294) | |
| Equal number of controls | Drawn from remainder of all pedigrees | |
| Number of offspring | Uniform random (1,2) | |
| Number of markers | 300 | |
| Number causative markers | 3 | |
| CIFBAT trials | 100 | |
| Penetrance with 2 causative SNPs | f2~ N(0.1, 0.01) | |
| Penetrance with 0 causative SNPs | f0~ N(0.001, 0.001) | |
| Environmental effect | λ | |
| Genomic effect | λ | |
| Marker Frequency | Gamma(shape = 2, scale = 2)/35.0 |
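The marker frequency row can be reproduced with the standard library alone. This is a sketch under the assumption that "Gamma(shape = 2, scale = 2)/35.0" means a Gamma draw divided by 35; the function name is hypothetical and no clipping is applied, although a very large draw could in principle exceed a valid frequency.

```python
import random


def draw_marker_frequencies(n_markers=300, seed=0):
    """Draw one frequency per marker as Gamma(shape=2, scale=2) / 35."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    return [rng.gammavariate(2.0, 2.0) / 35.0 for _ in range(n_markers)]


freqs = draw_marker_frequencies()
mean_freq = sum(freqs) / len(freqs)
```

Since a Gamma distribution with shape 2 and scale 2 has mean 4, the expected frequency is 4/35, roughly 0.11, so most simulated markers are fairly uncommon variants.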
Figure 3. Performance of FBAT and CIFBAT under the Missing At Random (MAR) simulation scenario. Shown are (A) precision, (B) recall, and (C) F-measure for calling the causative variant. Missing data is concentrated within cases or controls, and performance is measured for different missing data rates and FDR thresholds.
Results from analysis of simulated familial genotype data.
| Method | Missingness | Missing rate (%) | Missing data in | Recall | Precision | F-measure |
| CIFBAT | MAR | 0 | Controls | 0.547 | 0.922 | 0.686 |
| CIFBAT | MAR | 1 | Controls | 0.523 | 0.959 | 0.677 |
| CIFBAT | MAR | 5 | Controls | 0.490 | 0.976 | 0.653 |
| CIFBAT | MAR | 10 | Controls | 0.484 | 0.970 | 0.646 |
| CIFBAT | MAR | 0 | Cases | 0.538 | 0.933 | 0.682 |
| CIFBAT | MAR | 1 | Cases | 0.484 | 0.962 | 0.644 |
| CIFBAT | MAR | 5 | Cases | 0.356 | 0.989 | 0.524 |
| CIFBAT | MAR | 10 | Cases | 0.263 | 1.000 | 0.417 |
| FBAT | MAR | 0 | Controls | 0.547 | 0.922 | 0.686 |
| FBAT | MAR | 1 | Controls | 0.540 | 0.936 | 0.685 |
| FBAT | MAR | 5 | Controls | 0.537 | 0.931 | 0.681 |
| FBAT | MAR | 10 | Controls | 0.530 | 0.941 | 0.678 |
| FBAT | MAR | 0 | Cases | 0.538 | 0.933 | 0.682 |
| FBAT | MAR | 1 | Cases | 0.522 | 0.934 | 0.670 |
| FBAT | MAR | 5 | Cases | 0.473 | 0.932 | 0.627 |
| FBAT | MAR | 10 | Cases | 0.404 | 0.934 | 0.564 |
Performance of FBAT and CIFBAT was compared based on recall, precision, and F-measure. CIFBAT tended to trade recall for precision: fewer variants were called significant, but those called were more likely to be true positives. F-measure was comparable between FBAT and CIFBAT over all the simulation scenarios; however, the variance of F-measure was higher for CIFBAT.
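The third numeric column of the results table is consistent with the usual F-measure, the harmonic mean of precision and recall. The check below uses two rows copied from the table; because the formula is symmetric, it does not depend on which of the first two columns is precision. Small differences in the last digit are expected, since the table values were presumably computed before rounding.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first column, second column, reported F-measure) from the results table
rows = [
    (0.547, 0.922, 0.686),  # 0% missing, controls
    (0.263, 1.000, 0.417),  # CIFBAT, 10% missing, cases
]
differences = [abs(f_measure(a, b) - reported) for a, b, reported in rows]
```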
Figure 4. Analysis of familial whole genome sequencing data for uterine anomalies. (A) Number of markers significant exclusively and jointly under FBAT and CIFBAT. Of the 551 markers significant under FBAT, 242 (~44%) were validated and 309 (~56%) were negated by CIFBAT after including incomplete trios in the test. Thirty-nine additional markers were identified as significant exclusively by CIFBAT. (B) An example of a marker that was significant under FBAT and further validated by CIFBAT. (C) An example of a marker that was significant under FBAT but was not validated by CIFBAT upon inclusion of incomplete trios in the test. (D) An example of a marker that was exclusively significant under CIFBAT.
Figure 5. Distribution of trio types within cases and controls for chr7:142008644. (A) Complete trio types: trio type numbers in the legend correspond to those in Figure S1. (B) Incomplete trio types: trio type numbers in the legend correspond to those in Figure S2. Only trio types with non-zero counts are shown.
Detailed results from analysis of candidate markers for Type I Diabetes.
| 11:2137971 | rs3842748 | 4.44E-16 | (<2.22E-16, < 2.22E-16)* | 22.95 | 10.95 |
| 11:2130023 | rs7924316 | 2.32E-07 | (2.18E-09, 9.33E-15)* | 13.25 | 49.12 |
| 11:2157914 | rs11564709 | 2.21E-07 | (<2.22E-16, < 2.22E-16)* | 13.04 | 5.37 |
| 11:2147527 | rs6356 | 9.00E-07 | (3.14E-06, 7.01E-03) | 13.27 | 46.16 |
| 11:2151386 | rs7119275 | 1.32E-06 | (<2.22E-16, 7.99E-15)* | 13.42 | 25.31 |
| 11:2126719 | rs1004446 | 4.11E-06 | (<2.22E-16, 4.44E-15)* | 13.24 | 43.86 |
| 11:2124119 | rs1003483 | 4.65E-06 | (1.51E-04, 2.29E-08)* | 13.44 | 43.17 |
| 11:2152413 | rs10840495 | 6.01E-06 | (<2.22E-16, 4.93E-14)* | 13.08 | 25.46 |
| 11:2119686 | rs4244808 | 7.62E-06 | (1.36E-07, 8.90E-04)* | 15.69 | 41.53 |
| 11:2156905 | rs11564710 | 2.81E-04 | (1.22E-12, 8.48E-08)* | 13.09 | 30.37 |
| 11:2154012 | rs4929966 | 6.55E-04 | (<2.22E-16, < 2.22E-16)* | 13.44 | 17.59 |
| 11:2150966 | rs10840491 | 2.85E-03 | (1.11E-14, 1.26E-08)* | 13.13 | 12.26 |
| 1:114089610 | rs2476601 | 2.40E-14 | (0.65, 1.53E-02) | 13.13 | 12.17 |
| 1:114141503 | rs2358994 | 1.76E-08 | (0.85, 4.48E-02) | 13.09 | 17.70 |
| 1:114127410 | rs2488457 | 1.16E-07 | (0.21, 0.48) | 13.04 | 21.43 |
| 1:114132370 | rs12566340 | 1.02E-07 | (0.16, 0.66) | 13.06 | 23.20 |
| 1:114132504 | rs7529353 | 2.24E-07 | (0.21, 0.56) | 13.07 | 23.40 |
| 1:114086477 | rs1217395 | 4.98E-07 | (0.20, 0.33) | 16.89 | 24.80 |
| 1:114138866 | rs7524200 | 5.81E-07 | (1.72E-04, 4.16E-02) | 13.10 | 32.80 |
| 1:114078476 | rs3789607 | 1.60E-05 | (<2.22E-16, 4.44E-15)* | 13.98 | 39.95 |
| 1:114142398 | rs1539438 | 1.80E-05 | (<2.22E-16, 5.60E-13)* | 13.63 | 23.91 |
| 1:114129479 | rs1235005 | 2.47E-05 | (2.04E-02, 3.28E-05) | 13.22 | 38.32 |
| 1:114131802 | rs1217384 | 4.29E-05 | (1.35E-14, < 2.22E-16)* | 13.07 | 22.29 |
| 1:114145701 | rs1217394 | 4.55E-05 | (<2.22E-16, 3.47E-13)* | 13.07 | 24.08 |
| 1:114129885 | rs6665194 | 6.89E-05 | (2.69E-02, 3.57E-05) | 13.49 | 38.31 |
| 1:114063748 | rs6537798 | 7.51E-05 | (1.35E-02, 2.39E-05) | 13.28 | 39.43 |
| 1:114113273 | rs1217418 | 9.21E-05 | (1.58E-05, 1.68E-02) | 13.09 | 39.50 |
| 1:114081776 | rs2476600 | 1.09E-04 | (1.61E-02, 1.94E-05) | 13.09 | 39.40 |
| 1:114056125 | rs1217379 | 1.73E-04 | (1.83E-02, 3.20E-05) | 13.69 | 39.37 |
| 2:204567056 | rs231727 | 1.62E-03 | (0.24, 0.48) | 13.32 | 48.15 |
| 2:204566672 | rs1427676 | 3.02E-03 | (0.31, 0.37) | 13.32 | 29.98 |
| 16:27281465 | rs1805012 | 5.62E-07 | (<2.22E-16, < 2.22E-16)* | 14.57 | 0.37 |
| 10:6163501 | rs12251307 | 8.15E-04 | (<2.22E-16, < 2.22E-16)* | 13.20 | 7.81 |
| 10:6139051 | rs2104286 | 3.97E-03 | (<2.22E-16, 2.22E-16)* | 13.20 | 18.71 |
| 5:158700244 | rs17056704 | 4.62E-03 | (2.10E-07, 8.96E-13)* | 13.32 | 23.70 |
| 2:162949558 | rs1990760 | 4.32E-03 | (1.31E-06, 1.75E-11)* | 13.32 | 29.97 |
FBAT results were corrected for multiple testing using the Benjamini-Hochberg method with a 10% false discovery rate cut-off. The corresponding p-value cut-off of 1.10e-02 was also used to indicate a significant QI from CIFBAT. Of the 36 markers significant under FBAT, 20 are indicated by an asterisk (*).
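A p-value cut-off of this kind follows from the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest one that lies at or below its rank-scaled threshold, and reject everything up to it. The sketch below is a generic textbook version with a hypothetical function name, not the code used in the study, and the p-values in the usage lines are made up for illustration.

```python
def bh_cutoff(pvalues, fdr=0.10):
    """Largest p-value rejected by Benjamini-Hochberg, or None if none is.

    With m tests, the k-th smallest p-value is compared against fdr * k / m;
    every p-value at or below the returned cut-off is called significant.
    """
    m = len(pvalues)
    cutoff = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= fdr * rank / m:
            cutoff = p  # keep the largest p-value that passes (step-up)
    return cutoff


# Usage with illustrative p-values.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
strict = bh_cutoff(example, fdr=0.05)
loose = bh_cutoff(example, fdr=0.25)
```

Note the step-up behavior in the looser case: a p-value may fail its own threshold and still be rejected because a larger p-value further down the sorted list passes.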
Figure 6. Analysis of candidate set of markers for Type I Diabetes. (A) Example of a marker in the INS gene that was significant under FBAT, and further validated by CIFBAT upon inclusion of incomplete trios in the test. (B) Example of a marker in the INS gene that could not be validated by CIFBAT. (C) Example of a marker in gene PTPN22 which was significant under FBAT, but could not be validated by CIFBAT.