Thomas M Keane, Kim Wong, David J Adams, Jonathan Flint, Alexandre Reymond, Binnaz Yalcin.
Abstract
Structural variation is variation in the structure of DNA regions that affects DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variants and their association with human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of the molecular architecture and functional consequences of structural variation.
Keywords: Heterogeneous Stock (HS); Sanger Mouse Genomes Project; array comparative genome hybridization (aCGH); inbred strains of mice; next-generation sequencing (NGS); paired-end mapping (PEM); structural variation (SV)
Year: 2014 PMID: 25071822 PMCID: PMC4079067 DOI: 10.3389/fgene.2014.00192
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of mouse studies reporting genome-wide structural variants.
| Technique | Number of SVs | Number of strains | Reference |
| --- | --- | --- | --- |
| aCGH | 80 | 20 | Graubert et al., |
| aCGH | 2,094 | 42 | Cutler et al., |
| WGS | 10,000 | 4 | Akagi et al., |
| aCGH | 1,300 | 20 | Cahan et al., |
| aCGH | 7,103 | 33 | Henrichsen et al., |
| aCGH | 7,196 | 1 | Quinlan et al., |
| aCGH | 1,976 | 7 | Agam et al., |
| NGS | 711,920 | 17 | Yalcin et al., |
| NGS | 30,048 | 1 | Wong et al., |
| NGS | 43 | 1 | Simon et al., |
Column 1 gives the technique used in the study (aCGH, array comparative genome hybridization; WGS, whole genome sequencing; NGS, next generation sequencing). Column 2 gives the total number of structural variants (SVs) identified, and column 3 the number of laboratory inbred mouse strains used in the study, with the exception of one study that includes 21 wild-caught mice. The reference mouse strain (C57BL/6J) is excluded from the count. Column 4 is the reference to the study.
Figure 1. Comparison between NGS and aCGH in inbred mouse strain DBA/2J. (A) Venn diagram of the number of deletions detected. (B) Boxplot showing the size distribution of deletions.
Figure 2. Read mapping patterns used by computational methods to detect basic structural variation from NGS data. This figure shows the principle of SV identification using (i) read-pair analysis, (ii) split-read mapping, (iii) single-end cluster analysis, and (iv) read depth analysis. Deletions and insertions are represented using red rectangles, and inversions and duplications using light blue arrows. Reads are represented using solid dark blue arrows. The first step consists of sequencing a test genome. Typically, the genomic test DNA is fragmented into chunks of 300–500 bp. Then, reads of 50–250 bp are sequenced from either end of each fragment (these are called paired-end reads). The second step consists of mapping these paired-end reads to the mouse reference genome. A rightward-facing arrow denotes a positive-strand alignment, and a leftward-facing arrow a negative-strand alignment. (i) In the read-pair analysis approach, a deletion or an insertion is suggested when the paired-end reads map in the correct orientation (“+/−” is normal) but at a distance that differs significantly from the average fragment length. Taking the fragment length to be 500 bp, a mapping distance of 1100 bp suggests a deletion of 600 bp, whereas a distance smaller than the fragment length, for example 200 bp, suggests an insertion of 300 bp. When the two sequenced ends of two fragments map back to the reference genome in the wrong orientation (“+/+” and “−/−”), and at a distance that is significantly larger than the size of the fragment itself, this indicates an inversion. Finally, when paired-end reads map with orientation “−/+” at a large distance, this suggests a tandem duplication. (ii) In the split-read approach, one of the paired-end reads maps to the reference genome while its mate contains the structural variant, typically a deletion or an insertion of small length. (iii) In the single-end cluster analysis, one of the paired-end reads maps to the reference while its mate maps to the inserted sequence, which can be either de novo sequence or a repeat element such as a LINE, SINE, or ERV. (iv) Finally, the read depth approach takes advantage of the high coverage of next-generation sequencing, which makes it possible to detect copy number changes. Of note, the coverage drops at insertion and inversion breakpoints, which, when combined with paired-end read analysis, makes the SV call highly reliable.
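The read-pair rules in (i) can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular caller: the `ReadPair` type, the `classify_pair` function, the 500 bp fragment length (implied by the caption's 1100 bp and 200 bp examples), and the 100 bp tolerance are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical structure for this sketch)."""
    left_strand: str   # strand of the leftmost-mapping read: "+" or "-"
    right_strand: str  # strand of the rightmost-mapping read: "+" or "-"
    span: int          # distance between the two mapped ends on the reference (bp)


def classify_pair(pair, fragment_len=500, tolerance=100):
    """Return (label, size_in_bp_or_None) following the caption's read-pair rules.

    fragment_len and tolerance are assumed values; a real caller would
    estimate them from the library's insert-size distribution.
    """
    orientation = f"{pair.left_strand}/{pair.right_strand}"
    delta = pair.span - fragment_len
    if orientation == "+/-":
        # Correct orientation: only the mapping distance is informative.
        if delta > tolerance:
            return ("deletion", delta)
        if delta < -tolerance:
            return ("insertion", -delta)
        return ("concordant", 0)
    if orientation in ("+/+", "-/-") and delta > tolerance:
        return ("inversion", None)
    if orientation == "-/+" and delta > tolerance:
        return ("tandem_duplication", None)
    return ("unclassified", None)


# The two worked numbers from the caption.
print(classify_pair(ReadPair("+", "-", 1100)))  # deletion of 600 bp
print(classify_pair(ReadPair("+", "-", 200)))   # insertion of 300 bp
```

A single discordant pair is weak evidence; in practice, callers cluster many pairs that support the same breakpoint before reporting a variant.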
Algorithms for the detection of structural variation.
| Algorithm | Description and application | Reference | |
| --- | --- | --- | --- |
| BreakDancer | Predicts del, ins, inv, and translocations using PEM. Performance examined in an ind. with acute myeloid leukemia and samples from the 1000 Genomes trio. Compared with VariationHunter and MoDIL | Chen et al., | |
| CNAseg | Identifies CNVs from NGS data. Uses depth of coverage to estimate copy number states in cancer and normal samples | Ivakhno et al., | |
| cnD | HMM that uses read coverage to determine genomic copy number. Tested on short read sequence data generated from re-sequencing chr. 17 of the mouse strains A/J and CAST/EiJ with the Illumina platform | Simpson et al., | |
| cn.MOPS | Mixture Of PoissonS Bayesian approach to detect CNVs. Compared with mrFast, EWT, JointSLM, CNV-Seq, and FREEC using data from a male HapMap individual and high coverage data from the 1000 Genomes Project | Klambauer et al., | |
| CNVer | Method that supplements the depth-of-coverage with PEM information, where mate pairs mapping discordantly to the reference serve to indicate the presence of variation | Medvedev et al., | |
| CNVnator | Method for CNV discovery and genotyping from read-depth analysis of personal genome sequencing | Abyzov et al., | |
| CNV-Seq | Method to detect CNV using shotgun sequencing | Xie and Tammi, | |
| CREST | Clipping Reveals Structure, uses NGS reads with partial alignments to a ref. to map SVs at nucleotide level resolution. Used for 5 pediatric acute lymphoblastic leukemias and a human melanoma cell line | Wang et al., | |
| DELLY | Integrates paired-end and split-read analysis | Rausch et al., | |
| Dindel | Bayesian method to call small indels by realigning reads to candidate haplotypes that represent alternative sequence to the reference, using a split-read approach. Used in the 1000 Genomes Project call sets | Albers et al., | |
| EWT | Event-wise testing, method based on significance testing. Error rate tested using the analysis of chromosome 1 from paired-end shotgun sequence data (30×) on 5 individuals | Yoon et al., | |
| FREEC | Control-FREE Copy number caller that automatically normalizes and segments copy number profiles | Boeva et al., | |
| GASV-PRO | Combines both paired read and read depth signals into a probabilistic model for greater specificity | Sindi et al., | |
| GenomeSTRiP | Genome STRucture In Populations, toolkit for discovering and genotyping structural variations using sequencing data. Twenty to thirty genomes required to get good results | Handsaker et al., | |
| HYDRA | Localizes SV breakpoints by PEM. Uses a similar clustering strategy to VariationHunter. Accuracy evaluated using WGS split-read mappings. Maps repetitive elements such as transposons and SD | Quinlan et al., | |
| inGAP-sv | Scheme that uses abnormally mapped read pairs. Possible to distinguish HOM and HET variants. Compared with VariationHunter, BreakDancer, PEMer, Spanner, Cortex, and Pindel | Qi and Zhao, | |
| JointSLM | Detects common CNVs among individuals using depth of coverage | Magi et al., | |
| MoDIL | Detection of small indels from clone-end sequencing with mixtures of distributions | Lee et al., | |
| mrFast | Allows for the prediction of absolute copy-number variation of duplicated segments and genes | Alkan et al., | |
| PEMer | Compatible with several NGS platforms. Simulation-based error models, yielding confidence-values for each SV | Korbel et al., | |
| Pindel | A pattern growth approach to detect breakpoints of large deletions and medium-sized insertions from PEM reads | Ye et al., | |
| RetroSeq | Detects non-reference mobile elements such as LINE, SINE, and ERV. Accuracy evaluated using a trio from the 1000 Genomes Project | Keane et al., | |
| SoftSearch | Combines three analyses: split-read, read-pair, and single-end cluster. Tested using low coverage HapMap samples and high-coverage 122 gene dataset. Performance compared with SVSeq2, DELLY, BreakDancer, and CREST | Hart et al., | |
| SPANNER | SV detection for the pilot phase of the 1000 Genomes Project using low-coverage WGS of 179 ind. from 4 pop., high-coverage seq. of 2 mother-father-child trios, and exon targeted seq. of 697 ind. from 7 pop | Abecasis et al., | |
| SplazerS | Method for split-read mapping, where a read may be interrupted by a gap in the read-to-reference alignment | Emde et al., | |
| Splitread | Detects SV and indels from 1 bp to 1 Mb in exome data sets. Uses one end-anchored placements to cluster the mappings of subsequences of unanchored ends to identify size, content, and location | Karakoc et al., | |
| SRiC | Split-read identification, calibrated (SRiC). Validated using a representative data from the 1000 Genomes Project | Zhang et al., | |
| SVDetect | Identifies discordant mate-pairs derived from NGS data produced by the Illumina GA and ABI SOLiD platforms | Zeitouni et al., | |
| SVMerge | Pipeline integrating several existing callers followed by | Wong et al., | |
| SVSeq2 | Split-read mapping for low-coverage sequence data | Zhang et al., | |
| VariationHunter | Gives combinatorial formulations for the SV detection between a reference genome sequence and a NG-based, paired-end, whole genome shotgun-sequenced individual | Hormozdiari et al., | |
Column 1 names the algorithm (alphabetical order); column 2 gives a description of the method and its application; column 3 cites the URL for software download and column 4 is the reference to the study. Note that de novo assembly algorithms are not listed in this table. PEM, Paired-End Mapping; CNVs, Copy Number Variants; NGS, Next-Generation Sequencing; SVs, Structural Variants; SD, Segmental Duplication; WGS, Whole Genome Sequencing; pop., population; ind., individual; ref., reference; seq., sequencing; ins, insertion; del, deletion; inv, inversion.
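Several of the tools listed above rely on depth of coverage to estimate copy number. The general idea can be sketched as follows; this is a deliberately naive illustration and not the algorithm of any listed tool. The function name `copy_number_by_window`, the use of the median as the baseline, and the example depths are assumptions made for this sketch.

```python
from statistics import median


def copy_number_by_window(depths, ploidy=2):
    """Estimate an integer copy number for each window from its read depth.

    The median depth across windows is taken as the depth of a normal
    (ploidy-copy) region; each window is scaled against it and rounded.
    """
    baseline = median(depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive to act as a baseline")
    return [round(ploidy * depth / baseline) for depth in depths]


# Hypothetical mean depths for eight consecutive windows.
depths = [30, 31, 29, 0, 15, 30, 60, 30]
print(copy_number_by_window(depths))  # [2, 2, 2, 0, 1, 2, 4, 2]
```

Real read-depth callers additionally correct for GC content and mappability and segment the signal statistically, which this sketch omits.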
Figure 3. Complex rearrangements in mouse genomes. We highlight three examples of complex rearrangements that cause ambiguous signals during their detection (for a full list of complex rearrangements see Yalcin et al., 2012a): (A) a deletion directly flanked by an insertion; (B) an inversion directly flanked by two deletions; and (C) an inversion directly flanked by an insertion. For each complex rearrangement, we provide: (1) a drawing of the paired-end mapping (PEM) pattern, (2) an illustration using the short read visualization tool LookSeq (Manske and Kwiatkowski, 2009), and (3) PCR validation. We draw paired-end reads (black arrows) and how they map to the reference genome (dashed gray lines). Green arrows represent primer pairs used for PCR validation. PCR amplification was carried out across eight inbred strains of mice (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J, and LP/J), which are the parental strains of the Heterogeneous Stock population (Valdar et al., 2006). Hyperladder II is the size marker. Genomic coordinates refer to the mm9 mouse assembly. (A) Deletion of 836 bp directly flanked by an insertion of 1200 bp on mouse chromosome 19 (chr19: 48,061,057–48,061,892 bp) in mouse strains A/J, BALB/cJ, DBA/2J, and LP/J. In LookSeq, the two black arrows show singleton reads suggesting an insertion (their mates are within the inserted sequence). Read depth is null, but paired-end reads in support of the deletion are missing because of the insertion. PCR in four strains (A/J, BALB/cJ, DBA/2J, and LP/J) does not directly show the presence of the 836-bp deletion but instead reveals an insertion of about 400 bp, which is in fact the size difference between the deletion and the insertion. (B) Inversion of 325 bp on mouse chromosome 5 (chr5: 148,925,249–148,925,573 bp), directly flanked on the left by a deletion of 71 bp (chr5: 148,925,178–148,925,248 bp) and on the right by another deletion of 645 bp (chr5: 148,925,574–148,926,218 bp). In LookSeq, the top arrow shows the PEM pattern of the deletion. Normally, the underlying read depth should be null; however, it is null only at the regions shown by the two bottom arrows. This is caused by an intervening inversion. PCR in four strains (A/J, AKR/J, BALB/cJ, and C3H/HeJ) confirms the presence of the two deletions. (C) An inversion of 548 bp on mouse chromosome 8 (chr8: 77,137,213–77,137,760 bp) directly flanked by an insertion of 400 bp in mouse strains BALB/cJ, C3H/HeJ, CBA/J, and DBA/2J. In LookSeq, the bottom arrows show a dip in the coverage; on the right, it is caused by an insertion and on the left by an inversion. The presence of the insertion results in missing “−/−” reads that would otherwise support the inversion. PCR shows an amplification band of about 1400 bp in BALB/cJ, C3H/HeJ, CBA/J, and DBA/2J, whereas, in the remaining strains, the band is at about 1000 bp. This confirms the insertion of 400 bp in BALB/cJ, C3H/HeJ, CBA/J, and DBA/2J.
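The sizes quoted in this caption can be checked with simple arithmetic. The sketch below assumes that the coordinate ranges are inclusive at both ends (an assumption that reproduces every size given in the caption) and that an inversion does not change amplicon length; the helper names `interval_length` and `net_size_change` are hypothetical.

```python
def interval_length(start, end):
    """Length in bp of a coordinate range, assuming both ends are inclusive."""
    return end - start + 1


def net_size_change(deleted_bp=0, inserted_bp=0):
    """Expected change in PCR product size relative to the reference allele.

    Inversions are ignored because they reorient sequence without
    changing its length.
    """
    return inserted_bp - deleted_bp


# (A) chr19 deletion flanked by a 1200-bp insertion.
del_a = interval_length(48_061_057, 48_061_892)
shift_a = net_size_change(deleted_bp=del_a, inserted_bp=1200)
print(del_a, shift_a)  # 836 364, i.e. the band shift of "about 400 bp"

# (B) chr5 inversion flanked by two deletions.
inv_b = interval_length(148_925_249, 148_925_573)
left_del_b = interval_length(148_925_178, 148_925_248)
right_del_b = interval_length(148_925_574, 148_926_218)
shift_b = net_size_change(deleted_bp=left_del_b + right_del_b)
print(inv_b, left_del_b, right_del_b, shift_b)  # 325 71 645 -716

# (C) chr8 inversion flanked by a 400-bp insertion.
inv_c = interval_length(77_137_213, 77_137_760)
shift_c = net_size_change(inserted_bp=400)
print(inv_c, shift_c)  # 548 400, matching bands of about 1400 bp versus 1000 bp
```

The net change of 364 bp in (A) explains why the gel shows an apparent insertion rather than the 836-bp deletion itself.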
Structural variants associated with quantitative traits in outbred mice.
| Chromosome | Start (bp) | End (bp) | Variant type | Position relative to gene | Quantitative trait | |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 175158884 | 175158885 | Ins | Upstream | Mean platelet volume | |
| 2 | 144402760 | 144402971 | SINE Ins | Intron | OFT total activity | |
| 4 | 49690362 | 49690363 | Del | Intron | HP cellular proliferation marker | |
| 4 | 108951263 | 108951264 | IAP Ins | Upstream | Home cage activity | |
| 4 | 130038388 | 130038389 | SINE Ins | Intron | T-cells: %CD3 | |
| 7 | 90731819 | 90731820 | IAP Ins | Upstream | Wound healing | |
| 7 | 111397607 | 111479433 | Ins | Exon | Mean cellular hemoglobin | |
| 7 | 111504989 | 111505193 | Del | UTR | Mean cellular hemoglobin | |
| 8 | 87957244 | 87957245 | LINE Ins | Upstream | Mean cellular volume | |
| 11 | 115106127 | 115106250 | Del | UTR | Serum urea concentration | |
| 13 | 113783196 | 113783359 | Del | Upstream | HP cellular proliferation marker | |
| 17 | 34483681 | 34483682 | Del | Upstream | T-cells: CD4/CD8 ratio |
Columns 1, 2, and 3 give positional information about the structural variant (coordinates refer to the mm9 mouse assembly). Column 4 is the type of the variant. Columns 5 and 6 give information about the underlying gene. Column 7 is the quantitative trait associated with the structural variant. Ins, insertion; Del, deletion; UTR, untranslated region; SINE, short interspersed nuclear element; LINE, long interspersed nuclear element; IAP, intracisternal A-particle; HP, hippocampus; OFT, open field test.
Figure 4. Deletion in H2-Ea. (A) The x-axis is the position along mouse chromosome 17 (Mb). The y-axis shows the significance level of the association between CD4+/CD8+ ratio and a set of bi-allelic markers (represented using polygons) in a population of 200 commercially available outbred mice (CFW mice; Yalcin et al., 2010). Markers with strong association (−log10P > 10) are colored in red. The strongest association is within the promoter region of H2-Ea. (B) PCR image of H2-Ea reveals a 600-bp deletion in 8 CFW mice. (C) Plot of mouse transgenic data with and without the deletion. The x-axis is the CD4+/CD8+ ratio in the spleen and the y-axis the ratio in the thymus. White circles are measures from transgene-negative mice (without the deletion). Black circles are measures from transgene-positive mice (with the deletion). Apart from the deletion, the genetic background of these mice is identical.
Figure 5. Insertion in Eps15. (A) A transposable element (intracisternal A-particle) of 6400 bp has inserted in the promoter region of Eps15. (B) Boxplot showing expression in the brain measured using RNA-Seq in mice with and without the structural variant (RPKM, reads per kilobase per million mapped reads). Expression decreases in the presence of the insertion. (C) Data from the mouse knockout of Eps15, showing a significant decrease in home cage activity compared to matched wildtype mice (*p-value < 0.05).
Structural variants differentiating C57BL/6J and C57BL/6N.
| Chromosome | Start (bp) | End (bp) | Variant type | Gene | Location in gene |
| --- | --- | --- | --- | --- | --- |
| 2 | 70619835 | 70620080 | SINE Ins | Tlk1 | Intron |
| 3 | 60336036 | 60336037 | Del (large) | Mbnl1 | Intron |
| 4 | 101954274 | 101954395 | Del | Pde4b | Intron |
| 4 | 116051393 | 116051799 | MaLR Ins | Mast2 | Intron |
| 6 | 103669536 | 103676487 | LINE Ins | Chl1 | Intron |
| 7 | 92095990 | 92096149 | Del | Vmn2r65 | Exon |
| 7 | 27636128 | 27748456 | Ins | Cyp2a22 | Entire |
| 7 | 139306094 | 139307981 | MaLR Ins | Cpxm2 | Intron |
| 8 | 16716381 | 16716382 | Del (large) | Csmd1 | Intron |
| 9 | 58544415 | 58546304 | MaLR Ins | 2410076I21Rik | Intron |
| 10 | 32536420 | 32543464 | LINE Ins | Nkain2 | Intron |
| 11 | 119560391 | 119566827 | MTA Ins | Rptor | Intron |
| 12 | 42023964 | 42032747 | Del | Immp2l | Intron |
| 13 | 120164268 | 120164269 | Del (large) | Nnt | Intron |
| 19 | 12863187 | 12863188 | Del (1800 bp) | Zfp91 | Intron |
Columns 1, 2, and 3 give positional information about the structural variant (coordinates refer to the mm9 mouse assembly). Column 4 is the type of the variant. Columns 5 and 6 give information about the underlying gene. Ins, insertion; Del, deletion; LINE, Long Interspersed Nuclear Element; IAP, Intracisternal A-particle; SINE, Short Interspersed Nuclear Element; MaLR, Mammalian-Apparent Long-Terminal Repeat Retrotransposon; MTA, Mammalian Transposable Element; VNTR, Variable Number Tandem Repeat.
Figure 6. How to access and query the data automatically and manually. (A) Workflow for automatically querying structural variants. Our work was published relative to the mm9 genome build, but data can also be visualized directly on mm10. A gene name or genomic region can be searched for simple and complex structural variants. Results can be exported in TSV or CSV format. (B) Workflow for manually searching for structural variants. To do this, we use LookSeq (Manske and Kwiatkowski, 2009), a Web-based tool for visualizing paired-end NGS read data. The choice of insert size depends on the size of the underlying structural variant: when the variant is large, the insert size should also be large. Types of structural variants can be recognized using our comprehensive catalog of paired-end mapping (PEM) patterns described in Yalcin et al. (2012a).