Jayne Y Hehir-Kwa, Tobias Marschall, Wigard P Kloosterman, Laurent C Francioli, Jasmijn A Baaijens, Louis J Dijkstra, Abdel Abdellaoui, Vyacheslav Koval, Djie Tjwan Thung, René Wardenaar, Ivo Renkens, Bradley P Coe, Patrick Deelen, Joep de Ligt, Eric-Wubbo Lameijer, Freerk van Dijk, Fereydoun Hormozdiari, André G Uitterlinden, Cornelia M van Duijn, Evan E Eichler, Paul I W de Bakker, Morris A Swertz, Cisca Wijmenga, Gert-Jan B van Ommen, P Eline Slagboom, Dorret I Boomsma, Alexander Schönhuth, Kai Ye, Victor Guryev.
Abstract
Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, the majority of studies neither provide a global view of the full spectrum of these variants nor integrate them into reference panels of genetic variation. Here, we analyse whole genome sequencing data of 769 individuals from 250 Dutch families, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels, and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion are previously underreported variants sized between 21 and 100 bp. We detect 4 megabases of novel sequence, encoding 11 new transcripts. Finally, we show 191 known, trait-associated SNPs to be in strong linkage disequilibrium with SVs and demonstrate that our panel facilitates accurate imputation of SVs in unrelated individuals.
Year: 2016 PMID: 27708267 PMCID: PMC5059695 DOI: 10.1038/ncomms12989
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. Overviews of discovery approach and variant set.
(a) Overview of methods used for SV detection, genotyping and phasing within the GoNL project. (b) Structural variation consensus set, consisting of large duplications (outer ring), deletions larger than 100 bp (light red), chromosomes, insertions (triangles), mid-sized deletions (21–100 bp), small deletions (less than 20 bp) (dark red) and complex indels (purple). Heatmaps display the insertions of Alu, L1 and SVA elements. Inversions are indicated by black arcs in the centre of the plot, and interchromosomal breakpoints are coloured by source chromosome.
Characteristics of the consensus indel and structural variants set.
Figure 2. Number of simple and complex indels, mobile element insertions (MEIs) and deletions (stratified by length).
Grey bars correspond to total counts, whereas coloured (blue to violet) bars give counts stratified into four bins delimited by the allele frequency quartiles (Q1 to Q3).
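The quartile stratification described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `quartile_bins` is hypothetical, and the quartile cut points are computed with Python's standard `statistics.quantiles`, whose interpolation method may differ from whatever was used for the figure.

```python
from statistics import quantiles


def quartile_bins(allele_freqs):
    """Count variants in four bins delimited by the quartiles Q1, Q2 and Q3.

    Hypothetical helper: `allele_freqs` is a list of per-variant allele
    frequencies. Returns ((q1, q2, q3), counts), where counts[i] is the
    number of variants falling into bin i (0 = lowest frequencies).
    """
    q1, q2, q3 = quantiles(allele_freqs, n=4)  # three cut points, four bins
    counts = [0, 0, 0, 0]
    for af in allele_freqs:
        # The bin index is the number of cut points the value exceeds.
        idx = (af > q1) + (af > q2) + (af > q3)
        counts[idx] += 1
    return (q1, q2, q3), counts


if __name__ == "__main__":
    cuts, counts = quartile_bins([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
    print(cuts, counts)
```

Three cut points always yield four bins, which is why the caption names Q1 to Q3 while the bars show four strata.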
Figure 3. Example of a large replacement within the KRBOX4 gene.
The plot depicts the coverage profile of whole genome sequencing reads from a GoNL sample with a homozygous replacement. The lack of coverage in the last exon of KRBOX4 coincides with the position of the replacement. The breakpoint junctions of the replacement are indicated in the panel underneath the coverage plot.
Figure 4. Identification and expression of a novel ZNF gene.
(a) A Geuvadis RNA-sequencing dataset (ERR188316) was mapped to the human reference genome, which was extended with a new genomic segment inserted in chr 19 (bp 21,252,967). The plot shows RNA expression and split-read mappings across the novel ZNF gene present on this new genomic segment. (b) Protein domain structure of the novel ZNF gene as determined using NCBI Conserved Domain Search. (c) Neighbor-joining tree built from alignment of protein sequences homologous to the novel ZNF gene. Values at the nodes indicate bootstrap support of each group. Distances indicate protein sequence divergence at the amino acid level.
Figure 5. Effects of MEIs on gene expression.
(a) Schematic picture indicating an AluYa5 insertion in the promoter region of LCLAT1. (b) LCLAT1 gene expression (log2 of normalized read count) in blood from GoNL individuals who are heterozygous (het) or homozygous (hom) for the AluYa5 insertion. (c) RNA expression effects of an AluYb8 insertion in the last exon of ZNF880. The presence of the AluYb8 element results in spliced transcripts that preferentially contain the last exon, while the penultimate exon is skipped (upper panel). The reverse effect is seen in the absence of the AluYb8 insertion (lower panel).
Figure 6. Schematic overview of the imputation experiment.
Haplotypes are represented by thin grey bars, whereas diploid chromosomes with genotype calls are indicated by thick grey bars. Processing steps are shown in blue and numbered (in black circles) for reference in the main text.
Figure 7. Imputation results for different SV types.
(a) Histogram of the number of gold standard genotype calls per SV class. (b) Relationship between discordance and fraction of missing genotypes when altering the genotype likelihood (GL) threshold used for filtering the imputed genotypes, ranging from 0.33 (no filter) to 0.999 across SV classes. Thresholds used for further analyses, including panels (c,d), are circled in red. Increasing the minimum GL results in fewer discordant genotypes but increases the number of missing genotypes. Imputation of inversions had the highest rate of discordance and missing genotypes, whereas the tandem duplications and deletions had lower rates of discordant and missing genotypes for those events with a high GL. (c) Discordance rates for deletions, complex indels and MEIs stratified by minor allele frequencies for 20 bins (width=0.025). Bin boundaries are indicated by grey lines. The number of calls per bin is shown by dashed lines. (d) Same as (c), but restricted to calls where the gold standard genotype contains at least one copy of the rare allele.
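The trade-off in panel (b) and the binning in panel (c) can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than the authors' code: the function names are hypothetical, each imputed call is taken to carry the likelihood of its most probable genotype, calls below the threshold are counted as missing, and discordance is taken as the fraction of retained calls that disagree with the gold standard.

```python
def evaluate_imputation(imputed, gold, min_gl):
    """Return (discordance, missing_fraction) at a given GL threshold.

    Assumed inputs (hypothetical format):
      imputed: list of (genotype, gl) pairs, where gl is the likelihood of
               the most probable genotype
      gold:    list of gold standard genotypes, in the same order
    """
    kept = 0
    discordant = 0
    for (genotype, gl), truth in zip(imputed, gold):
        if gl < min_gl:
            continue  # filtered out: counted as a missing genotype
        kept += 1
        if genotype != truth:
            discordant += 1
    total = len(gold)
    missing_fraction = (total - kept) / total if total else 0.0
    discordance = discordant / kept if kept else 0.0
    return discordance, missing_fraction


def maf_bin(maf, width=0.025, n_bins=20):
    """Index of the minor allele frequency bin (0 .. n_bins - 1).

    With width 0.025 and 20 bins the range 0 to 0.5 is covered; a MAF of
    exactly 0.5 is placed in the last bin.
    """
    return min(int(maf / width), n_bins - 1)


if __name__ == "__main__":
    imputed = [(0, 0.99), (1, 0.60), (2, 0.95), (1, 0.40), (0, 0.999)]
    gold = [0, 0, 2, 1, 0]
    for threshold in (0.33, 0.9):
        print(threshold, evaluate_imputation(imputed, gold, threshold))
```

In this toy input, raising the threshold from 0.33 to 0.9 removes the single discordant call but also discards two of the five genotypes, which mirrors the relationship described for panel (b).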