Laurent Abel, Aurélie Cobat, Matthieu Bouaziz, Jimmy Mullaert, Benedetta Bigio, Yoann Seeleuthner, Jean-Laurent Casanova, Alexandre Alcais.
Abstract
Population stratification is a confounder of genetic association studies. In analyses of rare variants, corrections based on principal components (PCs) and linear mixed models (LMMs) yield conflicting conclusions. Studies evaluating these approaches have generally focused on limited types of structure and large sample sizes. We investigated the properties of several correction methods through a large simulation study using real exome data and several within- and between-continent stratification scenarios. We considered different sample sizes, including situations with as few as 50 cases, to account for the analysis of rare disorders. In large samples, accounting for stratification was more difficult with a continental structure than with a worldwide one. In samples of 50 cases, type I error inflation was observed with PCs for small numbers of controls (≤ 100) and with LMMs for large numbers of controls (≥ 1000). We also tested a novel local permutation method (LocPerm), which maintained a correct type I error in all situations. Power was equivalent across approaches, indicating that the key issue is proper control of type I errors. Finally, we found that the power of analyses including small numbers of cases can be increased by adding a large panel of external controls, provided an appropriate stratification correction is used.
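To make the idea of a permutation restricted to genetically similar individuals concrete, here is a minimal sketch. It is not the LocPerm algorithm itself, whose details are not given in this record. It assumes a simplified stand-in: a CAST-style carrier indicator per individual, and case/control labels shuffled only within predefined strata (for example, bins in PC space). The function names, the one-sided statistic, and the toy data are all hypothetical.

```python
import random

def cast_statistic(carrier, status):
    """Number of cases carrying at least one qualifying variant (CAST-style collapse)."""
    return sum(c * s for c, s in zip(carrier, status))

def stratified_permutation_pvalue(carrier, status, stratum, n_perm=2000, seed=0):
    """One-sided permutation p-value with labels shuffled only within each stratum.

    Assumption: strata stand in for groups of genetically similar individuals.
    """
    rng = random.Random(seed)
    observed = cast_statistic(carrier, status)
    # Collect the indices of the individuals belonging to each stratum.
    groups = {}
    for i, g in enumerate(stratum):
        groups.setdefault(g, []).append(i)
    permuted = list(status)
    exceed = 0
    for _ in range(n_perm):
        for idx in groups.values():
            labels = [status[i] for i in idx]
            rng.shuffle(labels)  # shuffle case/control labels inside the stratum only
            for i, lab in zip(idx, labels):
                permuted[i] = lab
        if cast_statistic(carrier, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Toy data: carrier status is fully confounded with ancestry group.
carrier = [1] * 20 + [0] * 20
status = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15
strata = ["A"] * 20 + ["B"] * 20
p_stratified = stratified_permutation_pvalue(carrier, status, strata)
p_pooled = stratified_permutation_pvalue(carrier, status, ["all"] * 40)
```

In this toy example the pooled permutation reports a strong association that is entirely due to group membership, whereas the within-stratum permutation does not, because the statistic cannot change when labels move only among individuals with identical carrier status.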
Year: 2021 PMID: 34561511 PMCID: PMC8463695 DOI: 10.1038/s41598-021-98370-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Graphical representation of the European sample. (a) PCA plots of the 4887 samples, comprising the 3104 samples from our in-house cohort HGID and the 1000 Genomes (1KG) individuals, including African (AFR), Admixed American (AMR), East-Asian (EAS), European (EUR) and South-Asian (SAS). Common variants were used to produce these plots. The European reference individual is singled out. (b) The 1523 selected individuals of the final European cohort, with a genetic distance to the European reference sample below the empirically derived threshold. The dashed vertical lines correspond to empirical PC1 thresholds chosen to split the samples into three European subgroups: Northern (n = 127, including 99 1KG FIN and 28 HGID samples), Middle (n = 651, including 99 1KG CEU, 91 1KG GBR and 461 HGID samples), and Southern European ancestry (n = 745, including 107 1KG TSI, 107 1KG IBS and 531 HGID samples). PC1 thresholds were defined based on the 1000 Genomes samples and then applied to the HGID samples (in yellow), so that each was assigned to one and only one subgroup for simulation purposes.
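The selection described in this caption (keep samples close to a reference individual, then split them into subgroups with PC1 cut points) can be sketched as follows. This is an illustration only: the record does not state how genetic distance was computed, so Euclidean distance in PC space is an assumption, and the function names, cut points, bin labels, and coordinates are hypothetical.

```python
import math

def select_near_reference(pcs, reference_id, max_distance):
    """Keep samples whose distance to the reference in PC space is below max_distance.

    Assumption: Euclidean distance on PC coordinates stands in for genetic distance.
    """
    ref = pcs[reference_id]
    return [sid for sid, coords in pcs.items() if math.dist(coords, ref) < max_distance]

def assign_pc1_bin(pc1, cut_low, cut_high):
    """Assign a sample to exactly one of three bins delimited by two PC1 cut points."""
    if pc1 < cut_low:
        return "bin_low"
    if pc1 < cut_high:
        return "bin_mid"
    return "bin_high"

# Toy (PC1, PC2) coordinates; every value here is made up for illustration.
pcs = {"ref": (0.0, 0.0), "s1": (0.01, 0.02), "s2": (-0.03, 0.01), "s3": (0.5, 0.4)}
kept = select_near_reference(pcs, "ref", max_distance=0.1)
bins = {sid: assign_pc1_bin(pcs[sid][0], cut_low=-0.02, cut_high=0.005) for sid in kept}
```

The bin labels are deliberately neutral, since which end of PC1 corresponds to Northern or Southern ancestry depends on the data.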
Figure 2. Graphical representation of the Worldwide sample. (a) PCA plots of the 4887 samples, comprising the 3104 samples from our in-house cohort HGID and the 1000 Genomes (1KG) individuals, including African (AFR), Admixed American (AMR), East-Asian (EAS), European (EUR) and South-Asian (SAS). Common variants were used to produce these plots. Reference individuals are singled out. These samples are then used to establish the final Worldwide cohort by considering all samples with a genetic distance to the references below given thresholds. (b) The selected 1967 individuals with European (n = 700), Middle-Eastern (n = 543), North-African (n = 359) and South-Asian (n = 365) ancestries are colored. The remaining individuals are left in grey.
Type I error rates of the different approaches for the large European sample.
| | CAST | PC3 | LMM | LocPerm |
|---|---|---|---|---|
| RVs | 0.00106 | 0.00108 | 0.00118 | 0.00082 |
| LFVs | 0.0011 | 0.00119 | ||
| CVs | 0.00104 | 0.00118 | ||
| ALLVs | 0.00108 | 0.00116 | ||
| RVs | 0.00117 | 0.00095 | ||
| LFVs | 0.00101 | |||
| CVs | 0.001 | |||
| ALLVs | 0.00102 | 0.00117 | ||
| RVs | 0.00087 | |||
| LFVs | ||||
| CVs | ||||
| ALLVs | ||||
The nominal level α considered is 0.001, and the corresponding 95% PI adjusted for the 10 methods is [0.00079–0.00121]. Type I error rates below the lower bound of the 95% PI are displayed in italic and those above the upper bound in bold.
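As a rough guide to how an interval of this kind can be obtained, the sketch below computes a normal-approximation interval around the nominal level, with a Bonferroni adjustment for the number of methods compared. It assumes α = 0.001 (the midpoint of the quoted interval) and that PI denotes a prediction interval for the observed error rate. The number of tests is not stated here, so `n_tests` is a hypothetical input, and the function name is illustrative rather than taken from the paper.

```python
import math
from statistics import NormalDist

def adjusted_interval(alpha, n_tests, n_methods, coverage=0.95):
    """Normal-approximation interval for an observed type I error rate.

    The two-sided tail probability is divided by n_methods (Bonferroni adjustment).
    """
    tail = (1 - coverage) / (2 * n_methods)
    z = NormalDist().inv_cdf(1 - tail)
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_tests)
    return alpha - half_width, alpha + half_width

# Hypothetical inputs: 180,000 tests under the null and 10 methods compared.
lower, upper = adjusted_interval(alpha=0.001, n_tests=180_000, n_methods=10)
```

An observed rate outside such an interval would be flagged as deflated (below the lower bound) or inflated (above the upper bound).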
Type I error rates of the different approaches for the large Worldwide sample.
| | CAST | PC3 | LMM | LocPerm |
|---|---|---|---|---|
| RVs | 0.00085 | 0.00099 | 0.00093 | 0.00087 |
| LFVs | 0.00099 | 0.00094 | ||
| CVs | 0.00099 | 0.00093 | ||
| ALLVs | 0.00099 | 0.00093 | ||
| RVs | 0.00096 | |||
| LFVs | 0.00109 | |||
| CVs | 0.00105 | 0.00117 | ||
| ALLVs | ||||
| RVs | 0.00113 | |||
| LFVs | 0.0012 | |||
| CVs | 0.00119 | 0.00115 | ||
| ALLVs | ||||
The nominal level α considered is 0.001, and the corresponding 95% PI adjusted for the 10 methods is [0.00079–0.00121]. Type I error rates below the lower bound of the 95% PI are displayed in italic and those above the upper bound in bold.
Figure 3. Histogram of power for methods with a correct type I error rate for the large European sample (n = 1523) at the nominal level α. (a) Without stratification. (b) With moderate stratification. (c) With high stratification. The relative risks considered vary from 2 to 4 on the x-axis.
Figure 4. Histogram of power for methods with a correct type I error rate for the large Worldwide sample (n = 1967) at the nominal level α. (a) Without stratification. (b) With moderate stratification. (c) With high stratification. The relative risks considered vary from 2 to 4 on the x-axis.
Type I error rates of the different approaches for the small sample scenarios.
| Scenario | CAST | PC3CV | LMMCV | LocPerm |
|---|---|---|---|---|
| 50SE-100SE | 0.0012 | 0.0012 | 0.0009 | |
| 50SE-1000E | 0.0012 | 0.0008 | ||
| 50SE-1000W | 0.0011 | 0.0010 | ||
| 50SE-2000W | 0.0010 | 0.0011 | ||
| 50E-100E | 0.0012 | 0.0010 | ||
| 50E-1000E | 0.0010 | 0.0010 | 0.0009 | |
| 50E-1000W | 0.001 | 0.0010 | ||
| 50E-2000W | 0.0009 | 0.0011 | ||
| 50World-100W | 0.0012 | 0.0010 | ||
| 50World-1000E | 0.0010 | |||
| 50World-1000W | 0.0009 | 0.0010 | 0.0009 | |
| 50World-2000W | 0.0009 | 0.0009 | 0.0010 | |
The nominal level α considered is 0.001. Type I error rates below the lower bound of the 95% PI are displayed in italic and those above the upper bound in bold.
Supplementary Table S3 provides the adjusted 95%PI for the different number of genes tested in each scenario.
Figure 5. Power for methods with a correct type I error rate under H0 for the small samples at the nominal level α. (a) Scenarios with 50 cases from Southern Europe. (b) Scenarios with 50 cases from the whole of Europe. (c) Scenarios with 50 cases from the Worldwide sample. The relative risk is fixed at 4.