Performance of high-throughput sequencing for the discovery of genetic variation across the complete size spectrum.

Andy Wing Chun Pang, Jeffrey R Macdonald, Ryan K C Yuen, Vanessa M Hayes, Stephen W Scherer.

Abstract

We observed that current high-throughput sequencing approaches only detected a fraction of the full size-spectrum of insertions, deletions, and copy number variants compared with a previously published, Sanger-sequenced human genome. The sensitivity for detection was the lowest in the 100- to 10,000-bp size range, and at DNA repeats, with copy number gains harder to delineate than losses. We discuss strategies for discovering the full spectrum of genetic variation necessary for disease association studies.

Keywords: copy number variation; genome variation annotation; high-throughput sequencing; insertion/deletion

Year: 2014 PMID: 24192839 PMCID: PMC3887540 DOI: 10.1534/g3.113.008797

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

Insertion/deletion (indel, unbalanced change <100 bp) and copy number variation (CNV, unbalanced alteration 100 bp upwards) are increasingly observed to be important in development and disease (Lee and Scherer 2010; Weischenfeldt ). However, in our experience, it has been difficult to detect indels and CNVs, even when the latest high-throughput sequencing (HTS) technologies are used (Pang ). Although the detection of single-nucleotide variation by HTS seems sufficient (Lam ), the short reads of HTS limit the detection of larger and more complex genetic variants, and that limitation can hamper disease studies.

Materials and Methods

To investigate the robustness of indel/CNV calling using HTS, we assessed data from commercial genome sequencing vendors and found that Complete Genomics (CG) (Drmanac ) detected the greatest number of variants and yielded a more consistent and even variant size distribution (Supporting Information, Figure S1 and Table S1). To evaluate the quality of the CG variation (unbalanced genetic variants) profile, we chose to compare the structural variation data from a comprehensively characterized personal genome, namely the HuRef Standard (Levy ; Pang ), to 80 CG-sequenced genomes. One of the 80 genomes was HuRef, herein called HuRef CG (Table S2). The HuRef Standard assembly is of greater quality than HTS-generated genomes, since it was produced from high-accuracy Sanger-based sequencing of long mate-pair clone-end sequences. Using a combination of sequence- and microarray-based strategies, we detected 791,873 gains (insertions: size <100 bp or retrotransposons; duplications: size ≥100 bp) and losses (deletions) in HuRef relative to the National Center for Biotechnology Information reference assembly (Levy ; Pang ) (Table S3). Experimental validation confirmed 88% (184/210) of the variants (Levy ; Pang ). Details can be found in File S1.
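
The size and type definitions used above (indel below 100 bp, CNV from 100 bp upwards; gains split into insertions and duplications, losses called deletions) can be written down as a small classifier. This is only an illustrative sketch: the `Variant` record, the `is_retrotransposon` flag and the function names are hypothetical and do not come from the study's pipeline.

```python
from dataclasses import dataclass

SIZE_CUTOFF_BP = 100  # boundary between indel-sized and CNV-sized variants, per the text


@dataclass(frozen=True)
class Variant:
    kind: str                         # "gain" or "loss" relative to the reference
    size_bp: int                      # length of the unbalanced segment in bp
    is_retrotransposon: bool = False  # hypothetical flag from an upstream annotation step


def classify(v: Variant) -> str:
    """Return 'insertion', 'duplication' or 'deletion' following the text's definitions."""
    if v.size_bp < 1:
        raise ValueError("size_bp must be positive")
    if v.kind == "loss":
        return "deletion"
    if v.kind == "gain":
        # Gains are insertions when <100 bp or retrotransposon-derived,
        # and duplications when >=100 bp otherwise.
        if v.size_bp < SIZE_CUTOFF_BP or v.is_retrotransposon:
            return "insertion"
        return "duplication"
    raise ValueError(f"unknown kind: {v.kind!r}")


def validation_rate(confirmed: int, tested: int) -> float:
    """Fraction of experimentally tested variants that were confirmed."""
    return confirmed / tested


# The validation figure quoted in the text: 184 of 210 tested variants confirmed.
hu_ref_validation = validation_rate(184, 210)
```

With the reported counts, `hu_ref_validation` rounds to the 88% quoted in the text.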

Results and Discussion

First, by comparing the HuRef CG and HuRef Standard variation profiles, we noticed that short-read sequencing detected fewer calls and had substantial drops in discovery along the variation size spectrum (Figure 1, A and B). There were 241,033 gains and 230,737 losses in the HuRef CG data, only a fraction of the 408,403 gains and 383,470 losses in HuRef Standard (Table S3). For losses, HuRef CG detected 60% of the HuRef Standard calls of 1 to 100 bp, 30% of those from 100 bp to 10 kb, and 43% of those >10 kb; for gains, HuRef CG detected 59% of the HuRef Standard calls of 1 to 100 bp, but only 7% of those from 100 bp to 10 kb and 21% of those >10 kb (Figure 1, C and D). CG used three primary approaches to detect gains and losses: split-read, paired-end and read depth (File S2 and Table S3). Unlike the uniform negative slope of the size distribution of variants annotated in the Sanger-based HuRef Standard (Figure 1, A and B), there were notable declines in sensitivity in the CG version, particularly for gains in the paired-end detection range, which spanned from 100 bp to 10 kb (Figure 2). As acknowledged by CG (Support & Community webpage), the paired-end detection approach had difficulty in calling variants at high-identity repeats and in calling novel insertion sequences relative to the National Center for Biotechnology Information reference.
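
The per-size-range proportions quoted above are ratios of call counts between the two profiles. A minimal sketch of that bookkeeping is shown below, assuming each profile is reduced to a plain list of variant sizes in bp. The bin boundaries follow the text, but how the study treated variants falling exactly on a boundary is not stated, so the handling here is an assumption, and the toy size lists are invented for illustration only.

```python
from collections import Counter

BIN_LABELS = ["1-100 bp", "100 bp-10 kb", ">10 kb"]


def size_bin(size_bp: int) -> str:
    """Assign a variant size to one of the three ranges used in the text."""
    if size_bp < 100:         # indel-sized; boundary handling is an assumption
        return BIN_LABELS[0]
    if size_bp <= 10_000:
        return BIN_LABELS[1]
    return BIN_LABELS[2]


def count_ratio_by_bin(test_sizes, standard_sizes):
    """Ratio of test-set to standard-set call counts in each size range.

    This is a ratio of counts, not a per-variant overlap, so it only
    approximates sensitivity when the test set has few false positives.
    """
    test = Counter(size_bin(s) for s in test_sizes)
    std = Counter(size_bin(s) for s in standard_sizes)
    return {label: test[label] / std[label] for label in BIN_LABELS if std[label]}


# Toy, invented sizes (bp) to show the shape of the output.
standard_sizes = [5, 20, 80, 150, 2_000, 9_000, 50_000]
test_sizes = [5, 20, 150, 50_000]
toy_ratios = count_ratio_by_bin(test_sizes, standard_sizes)

# Genome-wide totals reported in the text: (HuRef CG, HuRef Standard).
reported = {"gains": (241_033, 408_403), "losses": (230_737, 383_470)}
overall = {kind: cg / std for kind, (cg, std) in reported.items()}
```

From the reported totals, `overall` comes out at roughly 0.59 for gains and 0.60 for losses, which is dominated by the abundant small variants and hides the much lower ratios in the 100-bp to 10-kb range.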

Figure 1

Variation distribution of genomes sequenced. The size distribution of nonredundant (A) gains and (B) losses detected in the HuRef and 79 other samples. The proportion of nonredundant (C) gains and (D) losses detected in HuRef by CG in comparison with HuRef Standard.

Figure 2

Size distribution of HuRef CG gains and losses detected by each discovery strategy examined: split-read, paired-end mapping and read depth. (A) Gains. (B) Losses.

To estimate false negatives in the CG profiles, we generated a population reference by compiling variation from published studies (File S2, Figure S2, and Table S4). We defined a set of high-confidence calls in the HuRef sample as the HuRef Standard variants that were also detected in this population reference. We then compared the size distribution curves of the HuRef CG variants with those of the confirmed HuRef Standard variants, and we found that the HuRef CG curves were consistently below the curves of the confirmed HuRef Standard. This analysis shows that variants were missing from the HuRef CG profile; undercalling of gains greater than 100 bp was particularly severe (Figure S3). However, we emphasize that other short-read sequencing technologies have similar problems, with large gains missing (Figure S1).

When comparing the HuRef CG data to the HuRef Standard, we determined that some of the missing gains and losses were from regions containing repeats. We found a notable reduction of calls in loci with retrotransposable repeats, tandem repeats and segmental duplications (two-tailed χ² test; P < 2.2e-16) (Figure S4, A-D). It is difficult to align HTS reads to tandem repeat loci, whose length can exceed that of the short reads, and variant detection at these loci is consequently hampered. Similarly, short inserts can prevent the alignment and assembly of paired reads in regions with retrotransposons and segmental duplications.
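
The repeat comparison above is reported as a two-tailed χ² test. The sketch below shows how such a test works on a 2x2 table of detected versus missed variants inside and outside repeats. The counts are invented for illustration, and the text does not state the exact table layout or whether a continuity correction was applied, so both are assumptions here.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability of the chi-square distribution is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT taken from the study:
#                 detected  missed
# in repeats         300      700
# outside repeats    600      400
stat, p_value = chi2_2x2(300, 700, 600, 400)
```

A large statistic with a very small P value indicates that detection status depends on whether a variant lies in a repeat; with equal detection rates in both rows the statistic would be 0 and the P value 1.
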
These observations highlight the importance of having long reads and inserts for alignment and variant calling. As for centromeric and telomeric repeats, both Sanger sequencing and HTS have difficulty with these locations.

We evaluated false-positive results in the HuRef CG profile by comparing this data set to both the HuRef Standard and the profiles from the other 79 CG-sequenced genomes in this study, and we conservatively estimated that 11.4% of the HuRef CG gains and 3.9% of the losses could be false (File S2, Figure S5, and Table S5). Again, detection of gains was worse than that of losses.

From our comparison of the HuRef CG and HuRef Standard datasets, we observed that CG also had notable strengths. First, the HuRef CG loss size distribution was fairly uniform when compared to the expected HuRef Standard (Figure 1B). Second, CG was highly precise in determining variant size, with the exception of overcalling by the read-depth approach (Figure S6). Increasing the sequence coverage and decreasing the bin size may reduce this overestimation. Finally, the HuRef CG variant profiles were similar to the profiles of the other 79 CG genomes, highlighting consistency across experiments (File S2 and Figure 1, A and B).

Taking advantage of the availability of a comprehensive set of variation from a fully sequenced genome, we have analyzed the performance of an HTS technology in detecting insertions and deletions. Overall, we conclude that only a fraction of known variation was captured, with notable shortcomings in detecting insertions and duplications in the 100-bp to 10-kb size range, and at repetitive DNA sequences. Many of these deficiencies are associated with short reads and insert lengths (File S2, Figure 1, A and B, Figure S4 and Figure S7, Table S6 and Table S7). Generating longer reads (Loomis ) or libraries of multiple insert lengths can mitigate these shortcomings. Greater depth of coverage can also partially recover some of the missing calls.
Among our 80 CG-sequenced samples (File S2, Figure S8 and Figure S9), we noticed that sequencing depth and the number of variants reported were positively correlated (gains: R = 0.36, P = 0.00097; losses: R = 0.41, P = 0.00017; Figure S10).

Computationally, one should continue to apply multiple complementary variant detection strategies: split-read, paired-end, read depth, and one-end-anchor approaches (Hajirasouliha ). Moreover, a whole-genome assembly comparison approach should be considered (Khaja ; Levy ), as our analysis has shown that this approach can yield the greatest number, type and size range of variation (Table S3). However, current de novo assembly of short sequences is often restricted by the presence of repeats. A possible solution is a hybrid assembly constructed from a mixture of shallow coverage (~5×) of mate-pair long reads and deeper coverage (~25×) of paired-end short reads (Schatz ; Gnerre ). Alternatively, sequencing can be performed in conjunction with microarray or single-molecule physical mapping (Lam ) to detect larger variation. Physical mapping or other complexity-reduction processes [e.g., Long Fragment Read (Peters )] should improve alignment and the accuracy of variant discovery.

Finally, some common variants (minor allele frequency >5%) that are missed by HTS could be imputed from nearby tag SNPs, although this may not be applicable to some rare variants, as it has been shown that ~20% of biallelic CNVs cannot be readily captured (Mills ). Ultimately, if HTS is to become a primary technology in clinical laboratories, it will need further improvement, particularly in capturing the rare indels, CNVs and more complex rearrangements that are associated with diseases.
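
As a rough plausibility check on the depth-versus-variant-count correlations reported above (R = 0.36 and R = 0.41 across 80 samples), a correlation coefficient and sample size can be converted into an approximate two-sided P value. The sketch below uses the Fisher z-transformation with a normal approximation. This is an assumption made for illustration: the text does not say which correlation coefficient or which test produced the reported P values, so exact agreement is not expected.

```python
import math


def approx_p_from_r(r: float, n: int) -> float:
    """Approximate two-sided P value for a correlation r observed in n samples.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as a
    standard normal deviate under the null hypothesis of zero correlation.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Values quoted in the text; n = 80 CG-sequenced samples.
p_gains = approx_p_from_r(0.36, 80)
p_losses = approx_p_from_r(0.41, 80)
```

Under this approximation both values should fall in the same order of magnitude as the reported P values, with the stronger correlation for losses giving the smaller one.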

References (14 in total)

1. High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Authors: Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe
Journal: Proc Natl Acad Sci U S A Date: 2010-12-27 Impact factor: 11.205

2. Genome assembly comparison identifies structural variants in the human genome.

Authors: Razi Khaja; Junjun Zhang; Jeffrey R MacDonald; Yongshu He; Ann M Joseph-George; John Wei; Muhammad A Rafiq; Cheng Qian; Mary Shago; Lorena Pantano; Hiroyuki Aburatani; Keith Jones; Richard Redon; Matthew Hurles; Lluis Armengol; Xavier Estivill; Richard J Mural; Charles Lee; Stephen W Scherer; Lars Feuk
Journal: Nat Genet Date: 2006-11-22 Impact factor: 38.330

Review 3. Phenotypic impact of genomic structural variation: insights from and for human disease.

Authors: Joachim Weischenfeldt; Orsolya Symmons; François Spitz; Jan O Korbel
Journal: Nat Rev Genet Date: 2013-02 Impact factor: 53.242

4. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing.

Authors: Iman Hajirasouliha; Fereydoun Hormozdiari; Can Alkan; Jeffrey M Kidd; Inanc Birol; Evan E Eichler; S Cenk Sahinalp
Journal: Bioinformatics Date: 2010-04-12 Impact factor: 6.937

Review 5. The clinical context of copy number variation in the human genome.

Authors: Charles Lee; Stephen W Scherer
Journal: Expert Rev Mol Med Date: 2010-03-09 Impact factor: 5.600

6. Towards a comprehensive structural variation map of an individual human genome.

Authors: Andy W Pang; Jeffrey R MacDonald; Dalila Pinto; John Wei; Muhammad A Rafiq; Donald F Conrad; Hansoo Park; Matthew E Hurles; Charles Lee; J Craig Venter; Ewen F Kirkness; Samuel Levy; Lars Feuk; Stephen W Scherer
Journal: Genome Biol Date: 2010-05-19 Impact factor: 13.583

7. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays.

Authors: Radoje Drmanac; Andrew B Sparks; Matthew J Callow; Aaron L Halpern; Norman L Burns; Bahram G Kermani; Paolo Carnevali; Igor Nazarenko; Geoffrey B Nilsen; George Yeung; Fredrik Dahl; Andres Fernandez; Bryan Staker; Krishna P Pant; Jonathan Baccash; Adam P Borcherding; Anushka Brownley; Ryan Cedeno; Linsu Chen; Dan Chernikoff; Alex Cheung; Razvan Chirita; Benjamin Curson; Jessica C Ebert; Coleen R Hacker; Robert Hartlage; Brian Hauser; Steve Huang; Yuan Jiang; Vitali Karpinchyk; Mark Koenig; Calvin Kong; Tom Landers; Catherine Le; Jia Liu; Celeste E McBride; Matt Morenzoni; Robert E Morey; Karl Mutch; Helena Perazich; Kimberly Perry; Brock A Peters; Joe Peterson; Charit L Pethiyagoda; Kaliprasad Pothuraju; Claudia Richter; Abraham M Rosenbaum; Shaunak Roy; Jay Shafto; Uladzislau Sharanhovich; Karen W Shannon; Conrad G Sheppy; Michel Sun; Joseph V Thakuria; Anne Tran; Dylan Vu; Alexander Wait Zaranek; Xiaodi Wu; Snezana Drmanac; Arnold R Oliphant; William C Banyai; Bruce Martin; Dennis G Ballinger; George M Church; Clifford A Reid
Journal: Science Date: 2009-11-05 Impact factor: 47.728

8. Performance comparison of whole-genome sequencing platforms.

Authors: Hugo Y K Lam; Michael J Clark; Rui Chen; Rong Chen; Georges Natsoulis; Maeve O'Huallachain; Frederick E Dewey; Lukas Habegger; Euan A Ashley; Mark B Gerstein; Atul J Butte; Hanlee P Ji; Michael Snyder
Journal: Nat Biotechnol Date: 2011-12-18 Impact factor: 68.164

9. Mapping copy number variation by population-scale genome sequencing.

Authors: Ryan E Mills; Klaudia Walter; Chip Stewart; Robert E Handsaker; Ken Chen; Can Alkan; Alexej Abyzov; Seungtai Chris Yoon; Kai Ye; R Keira Cheetham; Asif Chinwalla; Donald F Conrad; Yutao Fu; Fabian Grubert; Iman Hajirasouliha; Fereydoun Hormozdiari; Lilia M Iakoucheva; Zamin Iqbal; Shuli Kang; Jeffrey M Kidd; Miriam K Konkel; Joshua Korn; Ekta Khurana; Deniz Kural; Hugo Y K Lam; Jing Leng; Ruiqiang Li; Yingrui Li; Chang-Yun Lin; Ruibang Luo; Xinmeng Jasmine Mu; James Nemesh; Heather E Peckham; Tobias Rausch; Aylwyn Scally; Xinghua Shi; Michael P Stromberg; Adrian M Stütz; Alexander Eckehart Urban; Jerilyn A Walker; Jiantao Wu; Yujun Zhang; Zhengdong D Zhang; Mark A Batzer; Li Ding; Gabor T Marth; Gil McVean; Jonathan Sebat; Michael Snyder; Jun Wang; Kenny Ye; Evan E Eichler; Mark B Gerstein; Matthew E Hurles; Charles Lee; Steven A McCarroll; Jan O Korbel
Journal: Nature Date: 2011-02-03 Impact factor: 49.962

10. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029
