Jordi Valls-Margarit1, Iván Galván-Femenía2, Daniel Matías-Sánchez1, Natalia Blay2, Montserrat Puiggròs1, Anna Carreras2, Cecilia Salvoro1, Beatriz Cortés2, Ramon Amela1, Xavier Farre2, Jon Lerga-Jaso3, Marta Puig3, Jose Francisco Sánchez-Herrero4, Victor Moreno5,6,7,8, Manuel Perucho9,10, Lauro Sumoy4, Lluís Armengol11, Olivier Delaneau12,13, Mario Cáceres3,14, Rafael de Cid2, David Torrents1,14.
Abstract
The combined analysis of haplotype panels with phenotype clinical cohorts is a common approach to explore the genetic architecture of human diseases. However, genetic studies are mainly based on single nucleotide variants (SNVs) and small insertions and deletions (indels). Here, we contribute to filling this gap by generating a dense haplotype map focused on the identification, characterization, and phasing of structural variants (SVs). By integrating multiple variant identification methods and Logistic Regression Models (LRMs), we present a catalogue of 35 431 441 variants, including 89 178 SVs (≥50 bp), 30 325 064 SNVs and 5 017 199 indels, across 785 Illumina high-coverage (30x) whole genomes from the Iberian GCAT Cohort, containing a median of 3.52M SNVs, 606 336 indels and 6393 SVs per individual. The haplotype panel is able to impute up to 14 360 728 SNVs/indels and 23 179 SVs, showing a 2.7-fold increase for SVs compared with available genetic variation panels. The value of this panel for SV analysis is shown through an imputed rare Alu element located in a new locus associated with mononeuritis of lower limb, a rare neuromuscular disease. This study represents the first deep characterization of genetic variation within the Iberian population and the first operational haplotype panel to systematically include SVs in genome-wide genetic studies.
Year: 2022 PMID: 35176773 PMCID: PMC8934637 DOI: 10.1093/nar/gkac076
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of data and overall strategy. (A) Distribution of genetic data (SNVs) based on principal component analysis (PCA) (adapted from Novembre et al. (45)). The principal components group individuals by geographic location (coloured in grey), placing the individuals of the GCAT cohort (blue dots) with Iberian samples from the 1000G (asterisk) and POPRES (letters) projects in the context of other European samples. (B) Flowchart of the overall strategy followed in this study, from the quality control of the initial data to the final generation of the GCAT haplotype panel, with particular focus on SVs. Overall, the complete strategy consumed ∼3.5 million CPU hours, which highlights part of the computational challenges associated with this type of analysis (Supplementary Table S11) (see also Supplementary Figure S7).
Figure 2. Benchmarking of the structural variant identification and classification pipeline. (A) Structural variant (SV) detection patterns according to the programs used. Lines and dots indicate the programs used, and bars the number of overlapping calls resulting from that combination. The 30 patterns with the most coincident SV calls are shown. The coloured horizontal bars on the right indicate the total number of SVs detected by each caller. Variant callers that detect all SV types and sizes tend to recover more SVs than those that detect specific SV types (e.g. CNVnator) or smaller SVs (e.g. Strelka2). (B) Overview of the detection performance of different strategies and filtering results from multiple variant callers. Each strategy is plotted according to its recall and precision ratios (F = F-score) using the benchmarking dataset. The logistic regression model (LRM), with an F-score of 0.9, outperformed other commonly used strategies that are based on the number of coincident callers (logical rules). The confidence interval for each strategy is represented by its coloured area. (C) Comparison of the performance (F-score) of different merging and filtering strategies according to the size of the structural variant. (D) Comparative overview of the genotype error associated with each strategy for each allelic state. Error values and their intervals were inferred from the benchmarking dataset (see Supplementary Figures S2, S3 and S5 for the information across the different SV types).
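The F-score in panel B summarizes precision and recall in a single number. The following is a minimal sketch of that calculation, assuming the balanced F1 score (the harmonic mean of precision and recall), which the caption does not state explicitly. The function names `precision_recall` and `f_score` are hypothetical and are not taken from the study's pipeline.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 (assumed here) gives the harmonic mean of precision and recall."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0  # no true positives at all
    return (1 + beta**2) * precision * recall / denom


# Toy benchmark counts (illustrative only, not data from the study):
# 90 SV calls confirmed, 10 false calls, 10 true SVs missed.
p, r = precision_recall(tp=90, fp=10, fn=10)
print(round(f_score(p, r), 3))
```

With these toy counts, precision and recall are both 0.9, so the F-score is also 0.9. A strategy that requires many coincident callers typically trades recall for precision, which lowers the harmonic mean.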
Figure 3. Overview of the GCAT variant catalogue. (A) Table with the numbers of identified and accepted variants, according to their class, after applying the following filters: ‘at least two callers detecting the same variant’ for SNVs; the LRM for indels and SVs; Hardy–Weinberg equilibrium; and removal of monomorphic variants and of those with >10% missingness within the GCAT cohort. (B) Overview of the variant distribution within an average individual in the GCAT cohort, according to the observed minor allele frequency (MAF). (C) Distribution of SV types according to their genomic sizes. (D) Comparative overview of the number and distribution of SV types across the GCAT, 1000G, GnomAD and GoNL catalogues.
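The cohort-level filters listed for panel A (Hardy–Weinberg equilibrium, monomorphic variants, >10% missingness) can be sketched for a single variant as follows. This is an illustration under stated assumptions, not the study's implementation: the chi-square test with one degree of freedom, the `hwe_alpha` threshold and both function names are hypothetical choices, and only the 10% missingness cutoff comes from the caption. Genotypes are encoded as alternate-allele counts (0, 1, 2), with `None` for a missing call.

```python
import math
from typing import Optional, Sequence


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) Hardy-Weinberg test from genotype counts (assumed test)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: the test is undefined
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(
    genotypes: Sequence[Optional[int]],
    max_missing: float = 0.10,  # ">10% missingness" cutoff from the caption
    hwe_alpha: float = 1e-6,    # hypothetical threshold, not given in the source
) -> bool:
    """Apply the missingness, monomorphic and HWE filters to one variant."""
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if 1.0 - len(called) / len(genotypes) > max_missing:
        return False
    counts = [called.count(k) for k in (0, 1, 2)]
    alt_alleles = counts[1] + 2 * counts[2]
    if alt_alleles == 0 or alt_alleles == 2 * len(called):
        return False  # monomorphic within the cohort
    return hwe_chi2_pvalue(*counts) >= hwe_alpha


# Toy variant (illustrative only): 50 hom-ref, 40 het, 10 hom-alt calls.
print(passes_qc([0] * 50 + [1] * 40 + [2] * 10))
```

The checks run from cheapest to most expensive, so a variant with too many missing calls is rejected before any test statistic is computed.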
Figure 4. Phasing and imputation performance of the GCAT|Panel. (A) Ternary diagram of the genotype imputation accuracy by variant type and frequency, taking the genotype calling as reference. Three dots evaluate each genotype state per sample. Samples with high concordance between genotype imputation and genotype calling are located at the vertices of the ternary diagram. (B) Number of SNVs and indels imputed (info score ≥ 0.7) using different reference panels and combining their imputation results. More indels were recovered by the GCAT|Panel. (C) Number of SVs imputed (info score ≥ 0.7) using different panels, and combining the imputation results with and without the GCAT|Panel (see also Supplementary Figure S21).
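The counts in panels B and C depend on an info score cutoff of 0.7 and on combining results across reference panels. The sketch below shows one way such a combination could work, assuming that each variant is kept from the panel giving it the highest info score. That rule is an assumption made for illustration, since the caption does not describe how results were combined, and the function names and toy variant identifiers are hypothetical.

```python
from typing import Dict, Set

INFO_THRESHOLD = 0.7  # cutoff quoted in the caption


def well_imputed(info_scores: Dict[str, float],
                 threshold: float = INFO_THRESHOLD) -> Set[str]:
    """Variants from a single panel whose info score meets the cutoff."""
    return {variant for variant, score in info_scores.items() if score >= threshold}


def combine_panels(panels: Dict[str, Dict[str, float]],
                   threshold: float = INFO_THRESHOLD) -> Dict[str, str]:
    """Map each well-imputed variant to the panel with its best info score.

    Assumed combination rule (not from the source): highest info score wins.
    """
    best: Dict[str, tuple] = {}
    for panel, scores in panels.items():
        for variant, score in scores.items():
            if score < threshold:
                continue
            if variant not in best or score > best[variant][1]:
                best[variant] = (panel, score)
    return {variant: panel for variant, (panel, _) in best.items()}


# Toy info scores (illustrative only, not data from the study).
toy = {
    "GCAT": {"sv1": 0.90, "sv2": 0.60},
    "1000G": {"sv1": 0.80, "sv3": 0.75},
}
print(combine_panels(toy))
```

In this toy input, `sv2` falls below the cutoff in the only panel that contains it and is dropped, while `sv1` is attributed to the panel with the higher score.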
Figure 5. Genome-wide association analysis using GCAT|Panel and experimental validation of an AluYa5 element. (A) Locus zoom plot of the locus associated with mononeuritis of lower limb (ICD-9 355) (P-value = 9.84 × 10⁻⁷), showing the lead variant in purple. The AluYa5 element (g.49494276_49494600ins, hs37d5) maps to an enhancer element upstream of DAG1. (B) Experimental validation of the AluYa5 element: agarose e-gel electrophoresis of PCR products after amplification of Alu-insertion-specific DNA fragments from blood DNA. Lanes: 1, 100 bp DNA ladder marker (Life Technologies), with the expected sizes of both states shown to the left; 2–5, Alu carriers (EGA_04200, EGA_01901, EGA_13378, EGA_03940); 6, control individual (EGA_01399). The numbers to the left refer to the size (bp) of marker DNA fragments. Electrophoresis shows two-band amplicons (515 bp and 848 bp) in Alu carriers (lanes 2–5) and a one-band amplicon (515 bp) in the control individual without the Alu allele (lane 6) (see also Supplementary Figure S29).
Figure 6. Structural variant imputation performance using GCAT|Panel across all continents. European and Latin American populations recover more low-frequency and rare SVs at high info scores (≥0.7) than African and Asian populations (see also Supplementary Figures S22 and S23).