Juba Nait Saada, Georgios Kalantzis, Derek Shyr, Fergus Cooper, Martin Robinson, Alexander Gusev, Pier Francesco Palamara.
Abstract
Detection of Identical-By-Descent (IBD) segments provides a fundamental measure of genetic relatedness and plays a key role in a wide range of analyses. We develop FastSMC, an IBD detection algorithm that combines a fast heuristic search with accurate coalescent-based likelihood calculations. FastSMC enables biobank-scale detection and dating of IBD segments within several thousands of years in the past. We apply FastSMC to 487,409 UK Biobank samples and detect ~214 billion IBD segments transmitted by shared ancestors within the past 1500 years, obtaining a fine-grained picture of genetic relatedness in the UK. Sharing of common ancestors strongly correlates with geographic distance, enabling the use of genomic data to localize a sample's birth coordinates with a median error of 45 km. We seek evidence of recent positive selection by identifying loci with unusually strong shared ancestry and detect 12 genome-wide significant signals. We devise an IBD-based test for association between phenotype and ultra-rare loss-of-function variation, identifying 29 association signals in 7 blood-related traits.
Year: 2020 PMID: 33257650 PMCID: PMC7704644 DOI: 10.1038/s41467-020-19588-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 FastSMC in coalescent simulations.
a (respectively c). Precision-recall curve randomly sampled from 10 realistic European simulated datasets with 300 haploid samples for IBD segment detection within the past 50 generations (respectively 100 generations), within the recall range where all methods are able to provide predictions. b (respectively d). Running time (CPU seconds) using chromosome 20 of the UK Biobank for IBD segment detection within the past 50 generations (respectively 100 generations). The entire cohort of 487,409 samples (across 7913 SNPs) was randomly downsampled into smaller batches. Only one thread was used for each method and running time trend lines in logarithmic scale are shown, reflecting differences in the quadratic components of each algorithm. Parameters for all methods were optimized to maximize accuracy and used for both accuracy and running time benchmarking (details in Methods section). e Median absolute error of the length-based maximum likelihood age estimate (MLE) on segments detected by GERMLINE (blue) and of our MAP age estimate on segments detected by FastSMC (orange: no filtering on the IBD quality score; dark red: minimum IBD quality score of 0.01) for 300 haploid simulated samples with European ancestry. As a reference, the average TMRCA at a random site for a pair is on the order of several thousands of generations[23]. Only detected IBD segments longer than the minimum length represented on the X-axis were considered. We ran both algorithms with the same minimum length of 0.001 cM and other parameters from grid search results for a time threshold of 50 generations. Data are represented as mean values ± SEM over 10 simulations.
Fig. 2 Fine-scale population structure in the UK.
Hierarchical clustering of 432,866 individuals from the UK Biobank dataset based on the sharing of IBD segments within the past 10 generations. Individuals in clusters with <500 samples are shown in light gray. We observed 24 main clusters across the country (top left) and we refined two regions, corresponding to Newcastle (NE) (top right) and Liverpool (L) (bottom), revealing fine scale population structure. No relationship between clusters is implied by the colors or cluster labeling across different plots (details are provided in Methods section).
Fig. 3 Genetic relatedness and geographic distances in the UK Biobank dataset.
For each of the 432,968 UK Biobank samples with available geographic data, we detected the individual sharing the largest total amount (in cM) of genome IBD within the past 10 generations (referred to as closest individual). a For each value x of total shared genome (in cM) on the X-axis, we report the percentage of UK Biobank samples (Y-axis) that share x or more with their closest individual. b For each value x of total shared genome (in cM) on the X-axis, we report the median distance (km, computed every 10 cM) for all pairs of (sample, closest individual) who shared at least x. Vertical dashed lines indicate the expected value of the total IBD sharing for kth degree cousins, computed using $2G(1/2)^{2(k+1)}$, where G = 7247.14 is the total diploid genome size (in cM) and k represents the degree of cousin relationship (e.g. k = 2 for second degree cousins, separated by 2(k + 1) generations)[11]. The value of 45 km observed when no sharing cutoff is considered (i.e. when the x value approaches 0) reflects the median prediction error for a random individual, regardless of how much IBD they share with the closest individual.
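The cousin-sharing expectation quoted in this caption is easy to evaluate directly. The sketch below is only an illustration of that arithmetic, assuming the exponent is 2(k + 1) as the caption's "separated by 2(k + 1) generations" indicates; the function name `expected_cousin_sharing` is hypothetical and not part of FastSMC.

```python
# Total diploid genome size in cM, as stated in the caption.
G_CM = 7247.14


def expected_cousin_sharing(k: int, genome_cm: float = G_CM) -> float:
    """Expected total IBD sharing (cM) for k-th degree cousins: 2G(1/2)^(2(k+1))."""
    if k < 1:
        raise ValueError("k must be a positive cousin degree")
    return 2 * genome_cm * 0.5 ** (2 * (k + 1))


if __name__ == "__main__":
    # Positions of the vertical dashed lines for first to fifth cousins.
    for k in range(1, 6):
        print(f"k = {k}: {expected_cousin_sharing(k):.2f} cM")
```

Each additional cousin degree adds two generations of separation, so the expected sharing drops by a factor of four per degree.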
Fig. 4 Genome-wide scan for recent positive selection in the UK Biobank dataset.
Manhattan plot with candidate gene labels for 12 loci detected at genome-wide significance (adjusting for multiple testing, approximate 1-sided DRC50 p < 0.05/52,003 = $9.6 \times 10^{-7}$; dashed red line). The DRC50 statistic of shared recent ancestry within the past 50 generations was computed using 487,409 samples within the UK Biobank cohort. FastSMC detected 5 loci known to be under recent positive natural selection (gene labels in black) and 7 novel loci (in red). The corresponding p-values are reported in Supplementary Table 4.
Fig. 5 IBD sharing and rare variant associations.
a Correlation between IBD sharing (average number of IBD segments per pair across UK postcodes in the past 10 generations in the UK Biobank’s 487,409 samples) and ultra-rare variant sharing (average number of $F_N$ mutations per pair across UK postcodes in the UK Biobank 50k Exome Sequencing Data Release for increasing values of N). b Venn diagram representing the sets of exome-wide significant associated loci for 7 blood-related traits using three methods: the WES-based LoF burden test reported by Van Hout et al.[49], a WES-based LoF burden test we performed (WES-LoF burden), and the IBD-based LoF burden test we performed (LoF-segment burden). The corresponding p-values were computed using two-sided t-tests and are reported in Tables 1, 2 and Supplementary Table 8. c Exome-wide Manhattan plot for mean platelet (thrombocyte) volume, after SNP-correction, using 303,125 unrelated UK Biobank samples not included in the exome sequencing cohort. Labeled genes are exome-wide significant after adjusting for multiple testing: p < 0.05/(14,249 × 10) = $3.51 \times 10^{-7}$; dashed red line. Black labels indicate genes that were previously reported by Van Hout et al.[49] (KALRN, GP1BA, and IQGAP2), while red labels indicate novel associations detected by our LoF-segment burden analysis. The corresponding p-values were computed using two-sided t-tests and are reported in Table 2.
Table 1 Comparison between association analyses.
| # | Gene | Trait | Van Hout et al. (p) | WES LoF burden (p) | LoF-segment burden (p) | Phenotypic variation explained (%) |
|---|---|---|---|---|---|---|
| 1 | | Eosinophil count | 3.30E-10 | 2.01E-03 | 8.64E-15 | 72.26 |
| 2 | | Mean platelet (thrombocyte) volume | 6.40E-08 | 8.84E-08 | 1.82E-19 | 32.57 |
| 3 | | Platelet distribution width | 2.50E-23 | 7.34E-18 | 7.38E-12 | 7.25 |
| 4 | | Mean platelet (thrombocyte) volume | 2.40E-08 | 3.01E-07 | 2.15E-03 | 4.11 |
| 5 | | Platelet count | 2.10E-09 | 7.45E-07 | 4.21E-05 | 7.84 |
| 6 | | Red blood cell distribution width | 5.80E-08 | 3.49E-02 | 2.25E-03 | 23.99 |
| 7 | | Red blood cell count | 1.70E-09 | 7.95E-02 | 2.68E-02 | 18.23 |
| 8 | | Red blood cell distribution width | 1.50E-13 | 6.95E-13 | 3.49E-34 | 32.99 |
| 9 | | Mean corpuscular hemoglobin | 1.70E-16 | 9.11E-15 | 6.79E-21 | 16.76 |
| 10 | | Platelet distribution width | 4.70E-09 | 1.44E-06 | 1.6E-01 | 0.98 |
| 11 | | Red blood cell distribution width | 2.40E-11 | 8.23E-04 | 3.2E-01 | 1.03 |
| 12 | | Mean platelet (thrombocyte) volume | 2.70E-23 | 3.85E-18 | 3.79E-12 | 7.33 |
| 13 | | Mean platelet (thrombocyte) volume | 1.10E-19 | 3.72E-15 | 4.40E-34 | 27.43 |
| 14 | | Mean corpuscular hemoglobin | 1.10E-08 | 2.94E-06 | 7.60E-11 | 22.18 |
We report association statistics for 14 loci and 7 traits as detected by Van Hout et al.[49] (obtained using a linear mixed model), our whole-exome sequencing burden analysis (two-sided t-test; labeled as WES LoF burden), and the LoF-segment burden (two-sided t-test). The Bonferroni-corrected exome-wide significance threshold for the first two approaches is $3.4 \times 10^{-6}$, after correcting for multiple testing with ~15k genes, and $3.51 \times 10^{-7}$ for the LoF-segment burden, after adjusting for 14,249 genes and 10 time transformations. We identify 10 genes at exome-wide significance with the WES-LoF burden test, and we replicate 11/14 at p < 0.05/10 = 0.005 (adjusted for testing of 10 transformations) using the LoF-segment association in non-sequenced samples (8 at exome-wide significance). The last column estimates the proportion of the phenotypic variation (in %; Supplementary Table 9) of the sequenced samples that can be explained by the non-sequenced cohort; on average that is 19.64% for all the 14 reported associations, or 27.35% if focusing on the exome-wide significant signals.
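The counts and averages quoted in this caption can be recomputed from the table rows. The following is a minimal sketch of that bookkeeping, not the authors' analysis code; the values are transcribed from the table above and the names `TABLE1` and `summarize` are illustrative assumptions.

```python
# Each row: (WES LoF burden p, LoF-segment burden p, variation explained in %),
# transcribed from the table in row order 1..14.
TABLE1 = [
    (2.01e-03, 8.64e-15, 72.26),
    (8.84e-08, 1.82e-19, 32.57),
    (7.34e-18, 7.38e-12, 7.25),
    (3.01e-07, 2.15e-03, 4.11),
    (7.45e-07, 4.21e-05, 7.84),
    (3.49e-02, 2.25e-03, 23.99),
    (7.95e-02, 2.68e-02, 18.23),
    (6.95e-13, 3.49e-34, 32.99),
    (9.11e-15, 6.79e-21, 16.76),
    (1.44e-06, 0.16, 0.98),
    (8.23e-04, 0.32, 1.03),
    (3.85e-18, 3.79e-12, 7.33),
    (3.72e-15, 4.40e-34, 27.43),
    (2.94e-06, 7.60e-11, 22.18),
]

WES_THRESHOLD = 3.4e-6                   # exome-wide threshold quoted for WES tests
SEGMENT_THRESHOLD = 0.05 / (14_249 * 10)  # 14,249 genes x 10 time transformations
REPLICATION_THRESHOLD = 0.05 / 10         # adjusted for 10 transformations only


def summarize(rows):
    """Recompute the summary figures stated in the table caption."""
    significant = [pct for _, seg_p, pct in rows if seg_p < SEGMENT_THRESHOLD]
    return {
        "wes_significant": sum(wes_p < WES_THRESHOLD for wes_p, _, _ in rows),
        "replicated": sum(seg_p < REPLICATION_THRESHOLD for _, seg_p, _ in rows),
        "segment_significant": len(significant),
        "mean_pct_all": sum(pct for _, _, pct in rows) / len(rows),
        "mean_pct_significant": sum(significant) / len(significant),
    }


if __name__ == "__main__":
    for name, value in summarize(TABLE1).items():
        print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

With the transcribed values this reproduces the caption: 10 WES-significant genes, 11 of 14 replicated, 8 at exome-wide significance, and mean proportions of 19.64% and 27.35%.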
Table 2 Associations detected using LoF-segment burden.
| # | Trait | Chr | Region (Mb) | Min. p-value | Candidate gene(s) |
|---|---|---|---|---|---|
| 1 | Eosinophil count | chr6 | 26.01:31.10 | 1.21E-26 | |
| 2 | Eosinophil count | chr9 | 6.21:6.25 | 8.64E-15 | |
| 3 | Eosinophil count | chr9 | 135.82:135.86 | 1.92E-07 | |
| 4 | Eosinophil count | chr12 | 113.01:113.41 | 7.63E-14 | |
| 5 | Mean corpuscular hemoglobin | chr6 | 16.23:16.29 | 7.60E-11 | |
| 6 | Mean corpuscular hemoglobin | chr6 | 25.72:31.10 | 3.82E-69 | |
| 7 | Mean corpuscular hemoglobin | chr19 | 12.98:12.99 | 6.79E-21 | |
| 8 | Mean corpuscular hemoglobin | chr22 | 29.08:29.13 | 1.43E-07 | |
| 9 | Mean platelet (thrombocyte) volume | chr1 | 247.87:247.88 | 1.44E-08 | |
| 10 | Mean platelet (thrombocyte) volume | chr3 | 123.81:124.44 | 3.79E-12 | |
| 11 | Mean platelet (thrombocyte) volume | chr5 | 74.80:76.00 | 4.40E-34 | |
| 12 | Mean platelet (thrombocyte) volume | chr6 | 26.18:27.92 | 1.46E-08 | |
| 13 | Mean platelet (thrombocyte) volume | chr12 | 122.51:124.49 | 6.29E-10 | |
| 14 | Mean platelet (thrombocyte) volume | chr16 | 90.03:90.03 | 2.61E-07 | |
| 15 | Mean platelet (thrombocyte) volume | chr17 | 4.83:4.83 | 1.82E-19 | |
| 16 | Mean platelet (thrombocyte) volume | chr22 | 29.08:29.13 | 1.93E-07 | |
| 17 | Platelet count | chr1 | 43.80:43.82 | 1.99E-07 | |
| 18 | Platelet count | chr5 | 75.69:76.00 | 6.52E-08 | |
| 19 | Platelet count | chr6 | 26.59:26.60 | 1.1E-07 | |
| 20 | Platelet count | chr12 | 109.88:113.33 | 4.82E-13 | |
| 21 | Platelet count | chr17 | 4.83:4.83 | 1.43E-07 | |
| 22 | Platelet distr. width | chr11 | 116.66:116.66 | 1.94E-08 | |
| 23 | Platelet distr. width | chr17 | 4.83:4.83 | 4.26E-09 | |
| 24 | Platelet distr. width | chr20 | 57.59:57.60 | 7.38E-12 | |
| 25 | Red blood cell count | chr6 | 26.45:31.10 | 1.39E-10 | |
| 26 | Red blood cell distr. width | chr6 | 26.01:28.48 | 3.03E-15 | |
| 27 | Red blood cell distr. width | chr9 | 135.82:135.86 | 8.1E-08 | |
| 28 | Red blood cell distr. width | chr11 | 116.69:116.70 | 3.67E-11 | |
| 29 | Red blood cell distr. width | chr19 | 12.98:12.99 | 3.49E-34 |
Exome-wide significant associations (after adjusting for multiple testing, p < 0.05/(14,249 × 10) = $3.51 \times 10^{-7}$) detected using LoF-segment burden (SNP-adjusted). Associated genes are clustered in 29 loci. For each locus we report the set of associated genes and minimum p-value. The gene corresponding to the minimum p-value is highlighted in bold.
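As a consistency check on this table, the sketch below recomputes the Bonferroni threshold from the caption and tallies loci per trait. It is an illustration under the assumption that the minimum p-values are transcribed correctly from the rows above; `TABLE2` and `loci_per_trait` are hypothetical names.

```python
from collections import Counter

# (trait, minimum p-value) for each of the 29 loci, in table order.
TABLE2 = [
    ("Eosinophil count", 1.21e-26), ("Eosinophil count", 8.64e-15),
    ("Eosinophil count", 1.92e-07), ("Eosinophil count", 7.63e-14),
    ("Mean corpuscular hemoglobin", 7.60e-11), ("Mean corpuscular hemoglobin", 3.82e-69),
    ("Mean corpuscular hemoglobin", 6.79e-21), ("Mean corpuscular hemoglobin", 1.43e-07),
    ("Mean platelet (thrombocyte) volume", 1.44e-08),
    ("Mean platelet (thrombocyte) volume", 3.79e-12),
    ("Mean platelet (thrombocyte) volume", 4.40e-34),
    ("Mean platelet (thrombocyte) volume", 1.46e-08),
    ("Mean platelet (thrombocyte) volume", 6.29e-10),
    ("Mean platelet (thrombocyte) volume", 2.61e-07),
    ("Mean platelet (thrombocyte) volume", 1.82e-19),
    ("Mean platelet (thrombocyte) volume", 1.93e-07),
    ("Platelet count", 1.99e-07), ("Platelet count", 6.52e-08),
    ("Platelet count", 1.1e-07), ("Platelet count", 4.82e-13),
    ("Platelet count", 1.43e-07),
    ("Platelet distr. width", 1.94e-08), ("Platelet distr. width", 4.26e-09),
    ("Platelet distr. width", 7.38e-12),
    ("Red blood cell count", 1.39e-10),
    ("Red blood cell distr. width", 3.03e-15), ("Red blood cell distr. width", 8.1e-08),
    ("Red blood cell distr. width", 3.67e-11), ("Red blood cell distr. width", 3.49e-34),
]

# Bonferroni threshold: alpha / (number of genes x number of time transformations).
THRESHOLD = 0.05 / (14_249 * 10)


def loci_per_trait(rows):
    """Count how many associated loci are reported for each trait."""
    return Counter(trait for trait, _ in rows)


if __name__ == "__main__":
    print(f"threshold = {THRESHOLD:.2e}")
    print("all loci below threshold:", all(p < THRESHOLD for _, p in TABLE2))
    for trait, n in loci_per_trait(TABLE2).items():
        print(f"{trait}: {n}")
```

The tally gives 29 loci spread over 7 traits, matching the abstract, and every listed minimum p-value falls below the stated threshold.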