Duncan C Thomas, Zhao Yang, Fan Yang.
Abstract
The cost of next-generation sequencing is now approaching that of early GWAS panels, but it is still out of reach for large epidemiologic studies, and the millions of rare variants expected pose challenges for distinguishing causal from non-causal variants. We review two types of designs for sequencing studies: two-phase designs for targeted follow-up of genome-wide association studies using unrelated individuals; and family-based designs exploiting co-segregation for prioritizing variants and genes. Two-phase designs subsample subjects for sequencing from a larger case-control study jointly on the basis of their disease and carrier status; the discovered variants are then tested for association in the parent study. The analysis combines the full sequence data from the substudy with the more limited SNP data from the main study. We discuss various methods for selecting this subset of variants and describe the expected yield of true positive associations in the context of an ongoing study of second breast cancers following radiotherapy. While the sharing of variants within families means that family-based designs are less efficient for discovery than sequencing unrelated individuals, the ability to exploit co-segregation of variants with disease within families helps distinguish causal from non-causal ones. Furthermore, by enriching for family history, the yield of causal variants can be improved, and use of identity-by-descent information improves imputation of genotypes for other family members. We compare the relative efficiency of these designs with those using unrelated individuals for discovering and prioritizing variants or genes for testing association in larger studies. While associations can be tested with single variants, power is low for rare ones. Recent generalizations of burden or kernel tests for gene-level associations to family-based data are appealing. These approaches are illustrated in the context of a family-based study of colorectal cancer.
Keywords: breast neoplasms; colorectal cancer; family-based study; rare variant association; sequencing; two-phase sampling design
Year: 2013 PMID: 24379824 PMCID: PMC3861783 DOI: 10.3389/fgene.2013.00276
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Parameter estimates (SEs) [Wald .
| Analysis method | Sampling design 1 | Sampling design 2 | Sampling design 3 |
| --- | --- | --- | --- |
| All rare variants included in the index | | | |
| Imputation | 1.69 (0.96) | 1.75 (0.86) | 1.63 (0.79) |
| Weighted likelihood | 1.88 (0.96) | 1.89 (0.91) | 1.72 (1.13) |
| Pseudolikelihood | 1.88 (0.96) | 2.03 (0.97) | 2.22 (1.00) |
| Semiparametric ML | 2.12 (0.98) | 2.22 (0.99) | 2.24 (1.00) |
| Full Study | 1.80 (0.69) | | |
| Only variants discovered in the substudy | | | |
| Average number discovered (causal) | 653 (44) | 719 (45) | 697 (44) |
| Imputation | 1.66 (0.95) | 1.73 (0.86) | 1.64 (0.80) |
| Weighted likelihood | 2.34 (1.01) | 1.87 (0.93) | 1.73 (1.12) |
| Pseudolikelihood | 2.35 (1.02) | 2.01 (1.00) | 2.27 (0.97) |
| Semiparametric ML | 2.56 (1.06) | 2.19 (1.02) | 2.29 (0.97) |
Empirical mean estimates and standard deviations are computed from 1000 replicates, each with 2000 cases and 2000 controls showing association with at least one GWAS SNP, a subsample of 600 subjects, and 50 causal rare variants. These results are contrasted across three sampling designs. Coefficients are in units of log RR per Madsen-Browning rare variant summary index divided by 1000. For consistency across designs, all rare variants are included in the index in the top portion of the table; the bottom portion includes just those discovered in the substudy, so point estimates there are not comparable across sampling methods. All estimates are adjusted for the risk index.
100 subjects from each of the 6 strata.
Numbers of subjects in the subsample are fixed across replicates at (2, 20, 214) cases and (74, 116, 174) controls, stratified into 3 groups of risk index from low to high, based on overall optimization for all replicates combined.
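The footnotes above express coefficients per unit of a Madsen-Browning rare variant summary index. As a rough illustration of how such an index is commonly formed, the sketch below computes a weighted sum of minor-allele counts with weights based on allele frequencies estimated in controls, following the usual published form of the Madsen-Browning weights. The function names, the data layout, and the toy genotypes are assumptions made for illustration; the division by 1000 and any restriction to discovered variants described in the footnotes are not applied here.

```python
import math

def madsen_browning_weights(genotypes, is_control):
    """Per-variant weights w_i = sqrt(n * q_i * (1 - q_i)).

    genotypes  : list of per-subject lists of minor-allele counts (0, 1 or 2)
    is_control : list of booleans, True for controls
    q_i is estimated from controls only, with a +1/+2 adjustment so that
    variants unseen in controls still get a non-zero frequency.
    """
    n_subjects = len(genotypes)
    n_variants = len(genotypes[0])
    n_controls = sum(is_control)
    weights = []
    for i in range(n_variants):
        # Minor-allele count for variant i among controls
        m_u = sum(g[i] for g, c in zip(genotypes, is_control) if c)
        q = (m_u + 1) / (2 * n_controls + 2)
        weights.append(math.sqrt(n_subjects * q * (1 - q)))
    return weights

def madsen_browning_index(genotypes, is_control):
    """Per-subject summary index: sum over variants of allele count / weight."""
    weights = madsen_browning_weights(genotypes, is_control)
    return [sum(g_i / w_i for g_i, w_i in zip(g, weights)) for g in genotypes]

# Hypothetical toy data: 4 subjects (2 controls, 2 cases), 2 rare variants
toy_genotypes = [[0, 0], [1, 0], [0, 1], [1, 1]]
toy_is_control = [True, True, False, False]
toy_index = madsen_browning_index(toy_genotypes, toy_is_control)
```

Rarer variants receive smaller weights and therefore contribute more per copy to the index, which is the intended behaviour of this weighting scheme.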
Expected total number of discovered variants prioritized and expected number of these that are causal, by minimum number of copies in the sequencing sample and maximum number of copies in 1000 Genomes Project data.
| c′ = 0 | 1.5 M | 113 K | 10 K |
| c′ = 1 | 2.6 M | 265 K | 30 K |
| c′ = 2 | 3.4 M | 418 K | 57 K |
| c′ = 0 | 41 | 34 | 27 |
| c′ = 1 | 113 | 97 | 79 |
| c′ = 2 | 192 | 168 | 140 |
| c′ = 0 | 0.7 | 1.0 | 1.6 |
| c′ = 1 | 2.1 | 2.9 | 4.2 |
| c′ = 2 | 3.8 | 5.1 | 7.0 |
Bonferroni corrected α = 0.05 (i.e., in addition to these causal variants, 0.05 non-causal variants are expected to be declared significant).
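The parenthetical in the footnote above follows from simple arithmetic: with a Bonferroni threshold of α divided by the number of tests, the expected number of non-causal variants declared significant is at most α. The sketch below works through one cell of the table, assuming (as an illustration only) that each of the 113 K prioritized variants is tested once and that 34 of them are causal; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests, n_null):
    """Expected number of null (non-causal) tests declared significant."""
    return n_null * bonferroni_threshold(alpha, n_tests)

# Illustrative values read from one table cell: 113 K prioritized, 34 causal
n_tests = 113_000
n_causal = 34
threshold = bonferroni_threshold(0.05, n_tests)
false_positives = expected_false_positives(0.05, n_tests, n_tests - n_causal)
```

Because nearly all tested variants are non-causal, the expected number of false positives is essentially equal to α itself.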
Simulated results of hypotheses tested in the main study for various levels of aggregation in the planned WECARE Study; means over 100 replicate simulations.
| Pathway | 87.4 | 87.3 | 0.2 | 0.00 | 12.6 | 12.6 | 1.1 | 0.6 (4.6%) |
| Gene | 1016 | 1006 | 8.4 | 0.00 | 33.7 | 33.4 | 5.6 | 1.9 (5.7%) |
| Gene-region | 2925 | 2318 | 32.8 | 0.05 | 97.5 | 79.6 | 12.9 | 2.9 (2.9%) |
| Single variant | 31,218 | 6558 | 156 | 3.10 | 273 | 43.0 | 19.4 | 3.1 (1.1%) |
Based on 100 pathways with an average of 10 genes each, each gene having on average 10 exonic variants (r = 1), 20 regulatory variants (r = 2), and 30 other variants surrounding the gene (r = 3), with σ = 1.0, π = 0.125, σ = 0.25, π = 0.25, σ = exp[−η_r − 0.25 ln(q/0.01)] where η_1 = −1.0, η_2 = −1.5, η_3 = −2.0, and logit(π) = ζ_r − 0.25 ln(q/0.01) where ζ_1 = −0.5, ζ_2 = −1.5, ζ_3 = −2.5.
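The variant-level formulas in the footnote above can be evaluated directly. The sketch below reads the three η and ζ values as belonging to the variant classes r = 1, 2, 3, which is an assumption since the subscripts are not fully legible in this record, and takes q to be the allele frequency in the same units as the 0.01 reference value. The function and constant names are illustrative only.

```python
import math

# Class-specific constants from the footnote (r = 1 exonic, 2 regulatory, 3 other)
ETA = {1: -1.0, 2: -1.5, 3: -2.0}
ZETA = {1: -0.5, 2: -1.5, 3: -2.5}

def sigma_variant(r, q):
    """sigma = exp[-eta_r - 0.25 * ln(q / 0.01)]."""
    return math.exp(-ETA[r] - 0.25 * math.log(q / 0.01))

def pi_variant(r, q):
    """Inverse logit of zeta_r - 0.25 * ln(q / 0.01)."""
    logit = ZETA[r] - 0.25 * math.log(q / 0.01)
    return 1.0 / (1.0 + math.exp(-logit))

# Evaluate each class at the reference frequency and at a rarer frequency
for r in (1, 2, 3):
    for q in (0.01, 0.001):
        print(r, q, round(sigma_variant(r, q), 3), round(pi_variant(r, q), 3))
```

Under this reading, both quantities increase as q decreases and are largest for exonic variants.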
Figure 1. Mean scores for causal variants (top panel) and ratio of frequencies of causal to non-causal variants (bottom panel) in simulated 11-member pedigrees with at least 4 affected members. In each panel, results are shown for a design sequencing an affected sib pair and affected cousin by the number of carriers of the variant allele (left) or an affected first cousin pair and an unaffected sib by the number of carriers among cases and controls (right).
Figure 2. Relative probabilities of discovery, prioritization, and both between causal vs. null variants for different criteria for selecting members for sequencing in simulated 11-member pedigrees with at least 4 affected members. Top panel, all designs; bottom panel, detail for designs with only two members sequenced. (Codes for top panel: S, sib; C, cousin; 2, first cousin once removed; U, uncle; G, grandparent; P, parent; upper case, affected; lower case, unaffected; hyphen, affected but not sequenced.)
Figure 3. Receiver operating characteristic (ROC) curves comparing different prioritization schemes.
Some near-optimal multi-stage family-based and case-control designs (The first row of each block is the one with the highest ARCE among those investigated; the second is the one with better power among those with similar costs.)
| 1800 | 30 | 12 | 3.0 | 38 | 3,591 | 17% (9%) | $12.3 | 0.252 |
| 2400 | 30 | 14 | 3.0 | 13 | 3,894 | 21% (13%) | $16.8 | 0.241 |
| 2100 | 50 | 12 | 4.0 | 56 | 346 | 19% (12%) | $13.8 | 0.264 |
| 2400 | 50 | 12 | 4.0 | 59 | 378 | 21% (13%) | $15.8 | 0.260 |
| 7000 | 20 | 12 | 0.001 | 62 | 5,502 | 16% (9%) | $17.8 | 0.171 |
| 9000 | 20 | 14 | 0.001 | 65 | 5,862 | 20% (12%) | $23.1 | 0.164 |
| 6000 | 40 | 16 | 0.0001 | 70 | 823 | 17% (11%) | $8.1 | 0.416 |
| 7000 | 40 | 16 | 0.0001 | 74 | 912 | 20% (13%) | $9.4 | 0.407 |
Number of 22-member pedigrees for family-based designs; number of cases, number of controls for case-control designs.
Minimum score test for family-based designs; minimum p-value for case-control designs.
Assuming 1000 causal variants out of a total of 20 million.
Total number of true positives, inversely weighted by the square root of MAF, divided by total cost.
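The cost-efficiency measure described in the footnote above is a ratio that is easy to state in code. The sketch below is a minimal illustration assuming the true positives are supplied as a list of their minor allele frequencies and the cost is in whatever units the study uses; the function name and the example values are hypothetical.

```python
import math

def weighted_cost_efficiency(true_positive_mafs, total_cost):
    """Sum of 1/sqrt(MAF) over true-positive variants, divided by total cost.

    Weighting by the inverse square root of MAF gives rarer true positives
    more credit than common ones.
    """
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    weighted_count = sum(1.0 / math.sqrt(q) for q in true_positive_mafs)
    return weighted_count / total_cost

# Hypothetical example: two true positives with MAF 0.01 and 0.04, cost 10 units
example_efficiency = weighted_cost_efficiency([0.01, 0.04], 10.0)
```

Here the rarer variant contributes a weight of 10 and the more common one a weight of 5, so the example evaluates to about 1.5 per unit cost.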
Figure 4. Correlation across 32 GWAS SNPs between the statistics computed from the complete genotype data and those computed using only the genotypes for various subsets of members; top left: 5 genotyped CRC and Lynch syndrome cases; top right: 9 cases of any cancer. Bottom: prioritization statistics by degree of relationship for apparently associated or unassociated variants based on the complete data. Data from a single 145-member Australian pedigree with a total of 8 CRC or Lynch syndrome cases and 15 cases of any cancer and a total of 49 subjects genotyped.