
Detection of structural variants and indels within exome data.

Emre Karakoc, Can Alkan, Brian J O'Roak, Megan Y Dennis, Laura Vives, Kenneth Mark, Mark J Rieder, Debbie A Nickerson, Evan E Eichler.

Abstract

We report an algorithm to detect structural variation and indels from 1 base pair (bp) to 1 Mbp within exome sequence data sets. Splitread uses one end-anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with high specificity and sensitivity. The algorithm discovers indels, structural variants, de novo events and copy number-polymorphic processed pseudogenes missed by other methods.

Year:  2011        PMID: 22179552      PMCID: PMC3269549          DOI: 10.1038/nmeth.1810

Source DB:  PubMed          Journal:  Nat Methods        ISSN: 1548-7091            Impact factor:   28.547


Although the proportion of structural variants (SVs) and small insertions and deletions (indels; shorter than 50 bp) detected in sequence databases has increased exponentially[1,2], recent comparisons of both experimental and computational methods suggest that the false negative rate remains high[3,4]. In addition to whole-genome sequencing, the widespread use of exome-capture technologies that target genomic protein-coding regions provides a rich resource to discover potentially impactful SVs and indels associated with disease. The nature of the capture methods, the limited size of coding regions, and the non-uniform distribution of the reads pose significant computational challenges. As a result, variants greater than 15 bp have rarely been reported in exome studies[5,6]. Discovery has been based largely on sequence alignment gaps limited to uniquely mapped regions of the genome (GATK[7] or SAMtools[8]).

Here, we detail a general combinatorial algorithm (Splitread) and validate its utility to discover indels and SVs in exome datasets. We developed Splitread to detect SVs and indels based on the computational prediction of breakpoints (see online Methods and Supplementary Note for details). Similar to Pindel[9], another split-read approach that detects indel breakpoints via a regional search around the anchored reads within the maximum event size, our algorithm searches for clusters of mate pairs in which one end maps to the reference genome but the other end does not, because it traverses a breakpoint and so creates a mapping inconsistency with respect to the reference sequence (Fig. 1a).

We initially map reads using mrsFAST[10], which guarantees all possible placements within a given Hamming distance (reflecting the number of allowed mismatches). Next, we decompose the unmapped end into subsequences of either equal length (balanced splits) or unequal length (unbalanced splits).
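The decomposition of an unmapped end into balanced and unbalanced splits can be illustrated with a minimal Python sketch. This is not Splitread's code: the function names, the `min_len` parameter (a hypothetical lower bound on the length of each subsequence) and the Hamming-distance helper are assumptions introduced here for illustration only.

```python
from typing import List, Tuple


def balanced_split(read: str) -> Tuple[str, str]:
    """Split a read at its midpoint into two (near-)equal subsequences."""
    mid = len(read) // 2
    return read[:mid], read[mid:]


def unbalanced_splits(read: str, min_len: int = 10) -> List[Tuple[str, str]]:
    """Enumerate two-way splits away from the midpoint.

    min_len is a hypothetical parameter: each subsequence must be at
    least this long so that it can still be placed on the reference.
    """
    mid = len(read) // 2
    return [
        (read[:i], read[i:])
        for i in range(min_len, len(read) - min_len + 1)
        if i != mid
    ]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    read = "ACGT" * 19  # a 76 bp toy read
    left, right = balanced_split(read)
    print(len(left), len(right))
    print(len(unbalanced_splits(read)))
```

Each subsequence would then be placed on the reference near the anchored mate, allowing a bounded number of mismatches; that placement step is not shown here.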
Unlike Pindel, which uses pattern growth for optimal matching in the target region, we iteratively search for clusters of split reads using the balanced splits as seeds (Fig. 1a), which refines the location and size of the indel or SV event. We apply a weighted set-cover approximation (Supplementary Note) to minimize the number of possible breakpoints, which essentially provides a maximum parsimony framework for all the mappings at the breakpoints.
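To make the parsimony idea concrete, the following is a minimal sketch of the standard greedy approximation for weighted set cover, written in Python. It is not taken from the Supplementary Note: the weights, the breakpoint labels and the read identifiers are hypothetical, and Splitread's actual weighting scheme is not reproduced here. Each candidate breakpoint carries a positive weight and the set of split reads it would explain; the greedy rule repeatedly picks the candidate with the lowest weight per newly explained read.

```python
from typing import Dict, List, Set, Tuple


def greedy_weighted_set_cover(
    candidates: Dict[str, Tuple[float, Set[str]]],
) -> List[str]:
    """Pick breakpoints that together explain every split read.

    candidates maps a breakpoint label to (weight, reads explained).
    Weights are assumed to be positive.
    """
    # The universe is every read explained by at least one candidate.
    uncovered: Set[str] = set()
    for _, reads in candidates.values():
        uncovered |= reads

    chosen: List[str] = []
    while uncovered:
        # Only candidates that still explain an uncovered read are useful.
        useful = [n for n, (_, reads) in candidates.items() if reads & uncovered]
        # Greedy rule: lowest cost per newly covered read.
        best = min(
            useful,
            key=lambda n: candidates[n][0] / len(candidates[n][1] & uncovered),
        )
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen


if __name__ == "__main__":
    toy = {
        "bp_A": (1.0, {"r1", "r2", "r3"}),
        "bp_B": (1.0, {"r3", "r4"}),
        "bp_C": (1.0, {"r1"}),
        "bp_D": (1.0, {"r2"}),
    }
    print(greedy_weighted_set_cover(toy))
```

In this toy input, two breakpoints suffice to explain all four reads, so the two single-read candidates are never selected.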
Figure 1

Splitread definition and analyses

(A) Schematic diagrams for the mapping of paired-end sequences in cases where an individual has either a deletion (red) or an insertion (blue) with respect to the reference sequence. In each case, one-end anchored sequence is used to map one read in a pair. The second (unmapped) read is then decomposed into either two equal subsequences (balanced split) or two unequal subsequences (unbalanced split). (B) Number of Splitread predictions called by 1000 Genomes plotted against the total number of Splitread predictions using the indicated threshold numbers of balanced and unbalanced reads, respectively. A threshold of two balanced and two unbalanced splits maximizes intersection with 1000 Genomes Project calls without losing any positive predictive value. (C) A Venn diagram comparing variants detected by Splitread exome analysis versus whole-genome sequence analysis of NA12891 (black) or all variants within dbSNP130 (red). In order to intersect, variants must be at the same position and within 10 base pairs of the predicted size. (D) Length distribution of insertions and deletions mapping within the coding region of NA12891 as predicted by Splitread. Events with multiples of three base pairs (red) are compared to those that would disrupt the frame (blue). (E) A Venn diagram comparing Pindel, GATK and Splitread call sets on NA12891. The total number of events (black) is compared to those previously detected (red) as part of dbSNP130 and/or the 1000 Genomes Project.
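The intersection rule stated in the caption (same position, size within 10 bp) can be written as a small predicate. The sketch below is illustrative Python; the `Variant` record, its field names and the `count_shared` helper are assumptions made here and do not correspond to any file format or code used by the authors.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Variant:
    chrom: str  # chromosome name (hypothetical field)
    pos: int    # breakpoint position
    size: int   # predicted event size in bp


def intersects(a: Variant, b: Variant, size_tol: int = 10) -> bool:
    """Same position and predicted sizes within size_tol base pairs."""
    return (
        a.chrom == b.chrom
        and a.pos == b.pos
        and abs(a.size - b.size) <= size_tol
    )


def count_shared(calls: Iterable[Variant], truth: Iterable[Variant]) -> int:
    """Number of calls that intersect at least one variant in truth."""
    truth_list = list(truth)
    return sum(any(intersects(c, t) for t in truth_list) for c in calls)


if __name__ == "__main__":
    calls = [Variant("chr1", 1000, 12), Variant("chr1", 5000, 3)]
    truth = [Variant("chr1", 1000, 20), Variant("chr2", 5000, 3)]
    print(count_shared(calls, truth))
```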

We tested different thresholds for the number of balanced and unbalanced splits required to support a call. For each configuration, we plotted the proportion of events called by the 1000 Genomes Project (http://www.1000genomes.org) that were predicted by Splitread for sample NA12891 (Fig. 1b and Supplementary Table 1). The slope provides the positive predictive value (PPV), and we could maximize sensitivity (the number of corroborated predictions) without any loss of specificity by selecting the local maximum of this line. At a threshold of at least two balanced and two unbalanced splits, we predicted a total of 213 indel events shorter than 50 bp in the NA12891 exome, of which 69% (148) intersect with whole-genome sequence analysis (Fig. 1c) and 72% (154) intersect with dbSNP130[2]. As expected for protein-coding sequence[11], indel sizes were predominantly multiples of three, resulting in no disruption of the protein-coding frame (47% or 100/213; Fig. 1d). If we exclude 1 bp indels, this fraction increases to 78% (100/129). We applied this threshold for the remainder of our analysis when calling the final events.

We identified an additional 63 SV events (>50 bp) after excluding annotated processed pseudogenes (Supplementary Table 2). Although only four of these were predicted by the 1000 Genomes Project, nine of the remaining events intersect with SVs from dbSNP130, with sizes varying from 51 bp to 3,584 bp. We predict that 42 of these variants are common (observed in multiple HapMap samples we analyzed), with only 21 variants being specific to NA12891. Several correspond to genes known to carry complex insertion and deletion polymorphisms or variable number of tandem repeats (VNTRs), such as MUC6, DSPP and MUC16[12].

We compared Splitread with the alternative indel detection methods Pindel[9] and GATK[7] (see Supplementary Note for comparison to CREST).
Seventy percent of Splitread calls are predicted by one of the other methods, but a substantial fraction of calls are unique to each method. As expected, events called by two or more methods show the best corroboration with dbSNP and 1000 Genomes calls (Fig. 1e). We selected 19 events uniquely called by Splitread and not previously reported by dbSNP or 1000 Genomes for PCR-based validation. Thirteen of 19 events were validated (Supplementary Table 1), giving an estimated PPV of 68%. Most map within low-complexity regions and correspond to repeat expansions and deletions (Supplementary Table 1). If we include previously reported events, Splitread accuracy rises to 87% (41/47).

We extended our analyses by generating exome sequence data from 11 HapMap samples whose genomes were sequenced at 3- to 4-fold coverage by the 1000 Genomes Project (Supplementary Table 3). Using Splitread, we observed an average of 325 events for each sample, including 286 indels and 39 SVs (roughly a 7:1 ratio). Approximately 68% and 70% of the calls intersected 1000 Genomes and dbSNP130 predictions, respectively. From the 11 samples, we identified 192 novel SVs, 93 of which were observed two or more times; an average of nine events that disrupt genes are unique to each individual (Supplementary Tables 2,3).

As a final test, we applied Splitread to published exome data from 20 parent-child trios affected with sporadic autism spectrum disorder[6]. We identified an average of 191 indels and 57 SVs in this dataset (Supplementary Table 4). To test the accuracy of our calling method, we randomly selected indels and SVs not found in either dbSNP or the control individuals sequenced as part of the Exome Sequencing Project (http://esp.gs.washington.edu). We confirmed 10/12 events by PCR and sequencing, giving an estimated PPV of 83% (Supplementary Table 5).
This included bona fide variation within repetitive and low-complexity regions, such as a triplet and 12-mer insertion within a low-complexity coding portion of SHROOM4 (Supplementary Fig. 1) that was missed by Pindel[9] and GATK[7].

An important goal of parent-child trio sequencing is to discover potentially disruptive de novo events. This is challenging because a selection of potential de novo events will either be enriched for false positives or contain inherited variants that were not detected (false negatives) in one of the parents. In this study, we were only able to detect and confirm one previously reported de novo variant, in FOXP1[6]. The remaining events were either present in a parent or were false positives (Supplementary Table 1). We sought to increase our confidence in predicting de novo events by filtering via read depth. Because our method uses Hamming distance to align reads, SV and indel breakpoints should cause fewer reads to map in the affected child if the event is truly de novo (Supplementary Note). We added this functionality as a filter that normalizes the read depth of coding regions based on coverage and then compares proband and parents to flag regions of reduced depth. The filter is applied specifically at predicted breakpoints to minimize false positives (Supplementary Fig. 2).

During our analysis of exome datasets, we routinely detected putative deletion events in which an intron was precisely removed such that the flanking exons were perfectly abutting. The structure of these events suggested uncharacterized processed pseudogenes rather than allelic deletions. These arise from retrotransposition of spliced mRNA back into the genome. We discovered 25 such events in the 11 HapMap exomes (Supplementary Table 6), 14 of which could not be identified by BLAST searches against the reference genome (GRCh37).
DNA amplification of flanking exons yielded 16 products consistent with a processed pseudogene in the affected individual, while the other nine appear to be polymorphic in the population (Fig. 2). Since pseudogenes can create potential Splitread artifacts, we created a modified exome reference for mapping that includes known processed pseudogenes, segmental duplications, and copy-number polymorphic pseudogenes. Compared to a whole-genome reference, this modified exome reference increases speed by 10-fold with only a 2% difference in the number of calls. Thus, Splitread can be applied to a large number of exomes in a computationally efficient manner to generate a database of bona fide exonic indels and SVs.
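The signature used above to recognize candidate processed pseudogenes, a deletion whose breakpoints coincide with the boundaries of an intron so that the flanking exons abut, can be sketched as a simple coordinate check. The Python below is illustrative only: the half-open coordinate convention, the `tol` parameter and the function name are assumptions, and the source does not describe how Splitread implements this step.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open interval [start, end) on one chromosome


def is_intron_precise_deletion(
    del_start: int,
    del_end: int,
    exons: List[Exon],
    tol: int = 0,
) -> bool:
    """True if the deletion [del_start, del_end) removes exactly one intron.

    The deletion must begin where one exon ends and end where the next
    exon of the same gene model begins. tol is a hypothetical slack, in
    base pairs, for imprecise breakpoint calls.
    """
    ordered = sorted(exons)
    # Walk over consecutive exon pairs; the gap between them is an intron.
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if abs(del_start - left_end) <= tol and abs(del_end - right_start) <= tol:
            return True
    return False


if __name__ == "__main__":
    gene_model = [(100, 200), (500, 600), (900, 1000)]
    print(is_intron_precise_deletion(200, 500, gene_model))
    print(is_intron_precise_deletion(210, 500, gene_model))
```

Under this sketch, a gene with several introns removed would show up as several such deletions, one per intron, all within the same gene model.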
Figure 2

Validation of processed pseudogenes

Gene models and predicted intron deletions of the processed pseudogenes are shown. Primers (red triangles) are designed in the coding region of the genes, and the expected product sizes for the processed pseudogenes are shown for (A) TMEM5, (B) C13orf3, (C) ATP9B, (D) MFF, and (E) TMEM66. Gel images show the size of the amplified product. We were able to detect the processed version of these genes in our PCR experiments. In D-E we genotyped the processed pseudogenes MFF and TMEM66 within eight HapMap samples and show that each is amplified only in the predicted sample [boxed in red: NA19238 (MFF) and NA12891 (TMEM66)]. All PCRs amplify the normal gene (signal on the top), with only one sample each amplifying the processed gene.

To test the applicability of Splitread to whole-genome datasets, we analyzed the genome of a patient (ND06769) with a hexanucleotide repeat expansion (GGCCCC) in the C9orf72 gene. Renton et al.[13] demonstrated that this is the causal variant of 9p21-associated amyotrophic lateral sclerosis with frontotemporal dementia (ALS-FTD). This repeat expansion was missed by GATK and was discovered only through manual inspection of the read alignments[13]. Although the insertion is too long to be fully characterized by a split-read method (estimated at 1.5 kilobase pairs), our algorithm was able to discover the approximate breakpoint of the expansion and supported the call with read-depth analysis.

Splitread can detect insertions and deletions without any size limitation. The size spectrum of the insertions that can be accurately characterized by Splitread is bounded by the read length; however, it is possible to detect the approximate breakpoints of larger insertions using one-end anchored reads.

Many validated events detected exclusively by Splitread involve microsatellite, low-complexity, or polynucleotide tracts (Supplementary Table 1 and Supplementary Fig. 1). Such regions are subject to higher mutation rates, due in part to their greater potential for replication slippage[14]. Variation of this type, especially within coding regions, has frequently been associated with diseases, including those caused by triplet repeat instability[14]. Our increased PPV for this class of variant stems from the fact that we consider multiple mappings that are frequently discarded by other methods. There is, however, genetic variation that we clearly missed (Fig. 1), emphasizing that no single approach is comprehensive in capturing all genetic variation[3].

One limitation of Splitread is its dependence on balanced splits to seed an event, which in turn depends directly on coverage. Given 76 bp reads, the chance of detecting a heterozygous event is 55% at 20X coverage, but rises to >90% at 60X coverage.
The sensitivity estimate increases from 79% at 20X coverage to 98% at 60X coverage. Such median sequence coverage is not uncommon in many exome sequencing projects.

An unexpected consequence of our exome analysis has been the discovery of a substantial number of processed pseudogenes that are polymorphic but not represented in the human reference genome (Supplementary Table 6). Most of these variants were seen more than once, ranging in frequency from 3% to 72% based on an assessment of 51 exomes (Supplementary Table 6). Using read-pair information, we were able to map the location of all of these polymorphisms using a one-end anchored mapping strategy[15]. A comprehensive catalog of the most common of these will be important for correctly interpreting disease-causing variants discovered in exome studies.

Since different computational methods vary in their sensitivity and specificity depending on the size, class, and context of variants, multiple approaches must be considered to maximize variant discovery. While most efforts are focused on the detection of point mutations within coding sequence, there is an opportunity to explore the landscape of intermediate and larger genetic variation, which is more likely to be gene disruptive. It is critical to include this type of variation in future analyses to correctly interpret the causes of disease. Re-examining exome datasets for larger and more complex variation may be particularly relevant when the causal variants for seemingly Mendelian diseases remain undiscovered.

Supplementary Figure 1: Validation of a complex indel.
Supplementary Figure 2: Read depth filtering for de novo events.
Supplementary Table 1: Splitread Validation for the NA12891 exome.
Supplementary Table 2: The list of all structural variants and the frequency of these variants among 11 samples.
Supplementary Table 3: Summary for Splitread analysis of 11 HapMap Exomes.
Supplementary Table 4: Analysis of the 63 individuals from autism trio data.
Supplementary Table 5: Splitread Validation from Autism Trios.
Supplementary Table 6: Copy-number polymorphic processed pseudogenes.
Supplementary Note
  17 in total

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  A haplotype map of the human genome.

Authors: 
Journal:  Nature       Date:  2005-10-27       Impact factor: 49.962

Review 3.  Repeat instability: mechanisms of dynamic mutations.

Authors:  Christopher E Pearson; Kerrie Nichol Edamura; John D Cleary
Journal:  Nat Rev Genet       Date:  2005-10       Impact factor: 53.242

4.  Short mucin 6 alleles are associated with H pylori infection.

Authors:  Thai V Nguyen; Marcel Janssen; Paulien Gritters; René H M te Morsche; Joost P H Drenth; Henri van Asten; Robert J F Laheij; Jan B M J Jansen
Journal:  World J Gastroenterol       Date:  2006-10-07       Impact factor: 5.742

5.  Characterization of missing human genome sequences and copy-number polymorphic insertions.

Authors:  Jeffrey M Kidd; Nick Sampas; Francesca Antonacci; Tina Graves; Robert Fulton; Hillary S Hayden; Can Alkan; Maika Malig; Mario Ventura; Giuliana Giannuzzi; Joelle Kallicki; Paige Anderson; Anya Tsalenko; N Alice Yamada; Peter Tsang; Rajinder Kaul; Richard K Wilson; Laurakay Bruhn; Evan E Eichler
Journal:  Nat Methods       Date:  2010-05       Impact factor: 28.547

6.  A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD.

Authors:  Alan E Renton; Elisa Majounie; Adrian Waite; Javier Simón-Sánchez; Sara Rollinson; J Raphael Gibbs; Jennifer C Schymick; Hannu Laaksovirta; John C van Swieten; Liisa Myllykangas; Hannu Kalimo; Anders Paetau; Yevgeniya Abramzon; Anne M Remes; Alice Kaganovich; Sonja W Scholz; Jamie Duckworth; Jinhui Ding; Daniel W Harmer; Dena G Hernandez; Janel O Johnson; Kin Mok; Mina Ryten; Danyah Trabzuni; Rita J Guerreiro; Richard W Orrell; James Neal; Alex Murray; Justin Pearson; Iris E Jansen; David Sondervan; Harro Seelaar; Derek Blake; Kate Young; Nicola Halliwell; Janis Bennion Callister; Greg Toulson; Anna Richardson; Alex Gerhard; Julie Snowden; David Mann; David Neary; Michael A Nalls; Terhi Peuralinna; Lilja Jansson; Veli-Matti Isoviita; Anna-Lotta Kaivorinne; Maarit Hölttä-Vuori; Elina Ikonen; Raimo Sulkava; Michael Benatar; Joanne Wuu; Adriano Chiò; Gabriella Restagno; Giuseppe Borghero; Mario Sabatelli; David Heckerman; Ekaterina Rogaeva; Lorne Zinman; Jeffrey D Rothstein; Michael Sendtner; Carsten Drepper; Evan E Eichler; Can Alkan; Ziedulla Abdullaev; Svetlana D Pack; Amalia Dutra; Evgenia Pak; John Hardy; Andrew Singleton; Nigel M Williams; Peter Heutink; Stuart Pickering-Brown; Huw R Morris; Pentti J Tienari; Bryan J Traynor
Journal:  Neuron       Date:  2011-09-21       Impact factor: 17.173

7.  mrsFAST: a cache-oblivious algorithm for short-read mapping.

Authors:  Faraz Hach; Fereydoun Hormozdiari; Can Alkan; Farhad Hormozdiari; Inanc Birol; Evan E Eichler; S Cenk Sahinalp
Journal:  Nat Methods       Date:  2010-08       Impact factor: 28.547

8.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

9.  Targeted capture and massively parallel sequencing of 12 human exomes.

Authors:  Sarah B Ng; Emily H Turner; Peggy D Robertson; Steven D Flygare; Abigail W Bigham; Choli Lee; Tristan Shaffer; Michelle Wong; Arindam Bhattacharjee; Evan E Eichler; Michael Bamshad; Deborah A Nickerson; Jay Shendure
Journal:  Nature       Date:  2009-08-16       Impact factor: 49.962

10.  Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations.

Authors:  Brian J O'Roak; Pelagia Deriziotis; Choli Lee; Laura Vives; Jerrod J Schwartz; Santhosh Girirajan; Emre Karakoc; Alexandra P Mackenzie; Sarah B Ng; Carl Baker; Mark J Rieder; Deborah A Nickerson; Raphael Bernier; Simon E Fisher; Jay Shendure; Evan E Eichler
Journal:  Nat Genet       Date:  2011-05-15       Impact factor: 38.330

