
MetaSV: an accurate and integrative structural-variant caller for next generation sequencing.

Marghoob Mohiyuddin1, John C Mu1, Jian Li1, Narges Bani Asadi1, Mark B Gerstein2, Alexej Abyzov3, Wing H Wong4, Hugo Y K Lam1.   

Abstract

Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes.
AVAILABILITY AND IMPLEMENTATION: Code in Python is at http://bioinform.github.io/metasv/.
CONTACT: rd@bina.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25861968      PMCID: PMC4528635          DOI: 10.1093/bioinformatics/btv204

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

SVs have been implicated in contributing to genomic diversity as well as genomic disorders (Stankiewicz and Lupski, 2010). Therefore, a significant amount of work has been done on detecting SVs. Generally, a tool for detecting SVs uses one or more of the following signals from read alignments: split-read [reads with split alignments, e.g. Pindel (Ye et al.)], read-pair [abnormal paired-end alignments, e.g. BreakDancer (Chen et al.)], depth-of-coverage [abnormal coverages, e.g. CNVnator (Abyzov et al.)], junction-mapping [alignments to known SV breakpoints, e.g. BreakSeq2 (Abyzov et al.; Lam et al.)] or assembly around potential breakpoints [e.g. MindTheGap (Rizk et al.)]. However, no single signal comprehensively detects all types of SVs, since each has a niche of SV types and sizes where it works well. This necessitates the development of tools which integrate multiple methods to improve SV detection. Prior work (Lam et al.; Mills et al.) has shown that variant calls made by multiple tools and methods are generally more accurate. For this reason, tools have been developed that employ multiple methods, e.g. DELLY (Rausch et al.) and LUMPY (Layer et al.), as well as tools that merge the results from multiple tools, such as SVMerge (Wong et al.). However, LUMPY and DELLY are unable to detect insertions, and SVMerge ignores the SV resolution of individual tools when merging calls. Our work, therefore, attempts to address the limitations of existing SV merging tools for detecting SVs of different types and sizes with high accuracy and resolution.

2 Methods

MetaSV uses multiple SV-detection methods and tools to find a high-confidence and precise SV callset. The novelty of MetaSV lies in the combination of two key ideas: calls reported by multiple orthogonal methods are generally of better quality, and local assembly with dynamic programming can be used to refine the SV breakpoints.

2.1 Multi-method SV detection

MetaSV proceeds in the following steps (Fig. 1):
1. Intra-tool merging: Potential duplicate calls generated by the same tool are merged here. Two calls are considered duplicates if they have significant overlap.
2. Inter-tool merging: Calls which are generated by multiple tools are merged together. While determining the breakpoints for calls common to multiple tools, priority is given to methods known to be precise, e.g. split-read over read-pair. This method-aware merging is crucial to ensuring that the breakpoints of the SVs reported are as precise as possible.
3. Local assembly: Local assembly is performed on the SV regions reported by the tools to gather additional evidence as well as determine the SV sequences. The SPAdes (Bankevich et al.) assembler is used due to its unique ability to use paired-end information for assembly.
4. Breakpoint resolution: The assembled SV sequences are aligned against the reference to detect or refine the breakpoints using dynamic programming (Abyzov and Gerstein, 2011).
5. Genotyping: Read coverage around the SV breakpoints is used to determine the zygosity of the SVs. The final output is then generated as a VCF file with the genotypes for each SV.
6. Annotation: MetaSV standardizes the inputs as well as the outputs in VCF. Each SV is annotated to indicate the corresponding calls made by the individual tools and to classify its confidence level. SVs which are detected by multiple tools are considered high-confidence.

Fig. 1. High-level view of the MetaSV methodology
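The merging and annotation steps can be illustrated with a minimal Python sketch. It is not MetaSV's actual code: the `Call` record, the `METHOD_PRIORITY` ranking, the 50% reciprocal-overlap threshold standing in for "significant overlap", and the single-pass greedy clustering (which folds intra-tool and inter-tool merging into one step) are all assumptions made for illustration. The only rules taken from the text are that overlapping calls are merged, that breakpoints come from the more precise method (split-read over read-pair), and that calls supported by multiple tools are high-confidence.

```python
from dataclasses import dataclass

# Assumed precision ranking (lower = more precise breakpoints). The text only
# states that split-read outranks read-pair; the third entry is a guess.
METHOD_PRIORITY = {"split-read": 0, "read-pair": 1, "depth-of-coverage": 2}


@dataclass(frozen=True)
class Call:
    """Hypothetical SV call record; not MetaSV's internal representation."""
    chrom: str
    start: int
    end: int
    sv_type: str
    tool: str
    method: str


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls are incomparable."""
    if a.chrom != b.chrom or a.sv_type != b.sv_type:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def merge_calls(calls, min_overlap=0.5):
    """Greedily cluster overlapping calls, then pick breakpoints by method priority."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if any(reciprocal_overlap(call, m) >= min_overlap for m in cluster):
                cluster.append(call)
                break
        else:
            clusters.append([call])

    merged = []
    for cluster in clusters:
        # Breakpoints are taken from the call whose method ranks as most precise.
        best = min(cluster, key=lambda c: METHOD_PRIORITY.get(c.method, len(METHOD_PRIORITY)))
        tools = sorted({c.tool for c in cluster})
        merged.append({
            "chrom": best.chrom,
            "start": best.start,
            "end": best.end,
            "sv_type": best.sv_type,
            "tools": tools,
            "high_confidence": len(tools) > 1,  # supported by more than one tool
        })
    return merged


example_calls = [
    Call("chr1", 1000, 2000, "DEL", "Pindel", "split-read"),
    Call("chr1", 950, 2080, "DEL", "BreakDancer", "read-pair"),
    Call("chr2", 5000, 9000, "DEL", "CNVnator", "depth-of-coverage"),
]
for sv in merge_calls(example_calls):
    print(sv)
```

In this toy input the two chr1 deletions fall into one cluster and inherit the split-read breakpoints, while the chr2 call stays a single-tool call. A production merger would also need to handle zero-length insertion intervals, which this sketch does not.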

2.2 Insertion detection enhancement

When MetaSV simply merges calls from multiple tools, its overall sensitivity is bounded above by the sensitivity of the union of all SVs detected by the individual tools. Therefore, for long insertions, which are underestimated by existing tools due to ascertainment bias, we augmented MetaSV with a soft-clip based method to boost insertion detection sensitivity. Soft-clips in read alignments are used to generate a set of candidate insertion intervals. These intervals are processed during the local assembly step to generate the final set of insertion locations. Even though assembly cannot determine the lengths of long insertions because of the short read length, their locations can still be predicted accurately, which provides valuable information for interpretation. The Supplementary Text describes our method in more detail.
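How soft-clips might be turned into candidate insertion intervals can be sketched as follows. The actual procedure is described in the Supplementary Text, which is not reproduced here, so everything in this example is an assumption: the input format (tuples of chromosome, 1-based leftmost position and CIGAR string), the function names, and the `min_clip`, `max_gap` and `min_support` thresholds. The idea shown is simply to locate the reference position where each soft-clip begins and to group nearby positions that are supported by several reads.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def softclip_breakpoints(pos, cigar, min_clip=20):
    """Return reference positions at which a read's alignment is soft-clipped."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips carry no sequence
    breakpoints = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        breakpoints.append(pos)  # clip on the left edge of the alignment
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append(pos + ref_len)  # first base after the aligned part
    return breakpoints


def candidate_intervals(reads, max_gap=10, min_support=2, min_clip=20):
    """Cluster soft-clip positions into (chrom, start, end, support) intervals."""
    by_chrom = defaultdict(list)
    for chrom, pos, cigar in reads:
        by_chrom[chrom].extend(softclip_breakpoints(pos, cigar, min_clip))

    intervals = []
    for chrom in sorted(by_chrom):
        cluster = []
        for bp in sorted(by_chrom[chrom]):
            if cluster and bp - cluster[-1] > max_gap:
                if len(cluster) >= min_support:
                    intervals.append((chrom, cluster[0], cluster[-1], len(cluster)))
                cluster = []
            cluster.append(bp)
        if len(cluster) >= min_support:
            intervals.append((chrom, cluster[0], cluster[-1], len(cluster)))
    return intervals


example_reads = [
    ("chr1", 1000, "70M30S"),  # clipped on the right, breakpoint at 1070
    ("chr1", 1070, "30S70M"),  # clipped on the left, breakpoint at 1070
    ("chr1", 1005, "68M32S"),  # clipped on the right, breakpoint at 1073
    ("chr1", 5000, "100M"),    # fully aligned, no evidence
    ("chr1", 8000, "95M5S"),   # clip shorter than min_clip, ignored
    ("chr2", 300, "25S75M"),   # only one supporting read, filtered out
]
print(candidate_intervals(example_reads))
```

In practice the read records would come from a BAM file through an alignment library rather than from hand-written tuples, and each interval would then be passed to local assembly, as the text describes.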

3 Results

We demonstrate the effectiveness of MetaSV using the VarSim simulation and validation framework (Mu et al.). Simulated 2 × 100 bp NGS reads were generated at 50× coverage for the NA12878 genome using published variant sets. The reads were aligned using BWA-MEM. For comparing reported SVs against the ground truth, we use a reciprocal overlap of 90% and a wiggle of 100 bp, which captures accuracy at a high breakpoint resolution. The SV size cutoff was set to 100 bp since smaller variants are a target of indel callers such as GATK’s HaplotypeCaller. Our results show that each method has varying performance in different SV size ranges. By integrating multiple methods, MetaSV achieved a steady performance across all sizes (Fig. 2). We report accuracy as F1-score, which is the harmonic mean of sensitivity and precision. For this dataset, MetaSV achieved an F1-score of 96.2% (sensitivity and precision were 93.7 and 98.8%, respectively) for deletions, indicating high accuracy and resolution. For insertions, it achieved an F1-score of 84.7% (sensitivity and precision were 85.3 and 84.1%, respectively), compared with less than 65% for all the individual tools analyzed. Insertion length was omitted from the accuracy analysis since long insertions cannot be assembled completely with NGS reads. Nevertheless, the significantly enhanced detection of insertion events can improve interpretation, since insertions may cause impactful disruption in the genome. Finally, genotyping accuracies of 95.2 and 95.5% were achieved for deletions and insertions, respectively.
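The evaluation criteria above can be made concrete with a short Python sketch. The F1 formula is the standard harmonic mean and reproduces the reported scores from the stated sensitivity and precision. The `is_match` function is only one plausible reading of "a reciprocal overlap of 90% and a wiggle of 100 bp": it assumes a call must satisfy both conditions, whereas how VarSim actually combines them is not stated in the text. Both function names are hypothetical.

```python
def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (same units in and out)."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


def is_match(call, truth, min_reciprocal_overlap=0.9, wiggle=100):
    """Assumed matching rule for interval SVs such as deletions.

    `call` and `truth` are (start, end) tuples on the same chromosome.
    Requiring BOTH criteria is an assumption, not VarSim's documented rule.
    """
    shared = min(call[1], truth[1]) - max(call[0], truth[0])
    if shared <= 0:
        return False
    overlap = min(shared / (call[1] - call[0]), shared / (truth[1] - truth[0]))
    breakpoints_close = (abs(call[0] - truth[0]) <= wiggle
                         and abs(call[1] - truth[1]) <= wiggle)
    return overlap >= min_reciprocal_overlap and breakpoints_close


# Reported sensitivity and precision for deletions and insertions.
print(round(f1_score(93.7, 98.8), 1))  # deletions
print(round(f1_score(85.3, 84.1), 1))  # insertions

print(is_match((1000, 2000), (1010, 2020)))  # close breakpoints, high overlap
print(is_match((1000, 3000), (1000, 2000)))  # call twice as long as the truth
```

The two F1 values round to 96.2 and 84.7, matching the figures quoted in the text.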
Fig. 2. Accuracy comparisons for deletions and insertions. Accuracy metrics are shown on a per size bin basis in the plots. The tables below the plots show the aggregate accuracy scores. If a tool does not support detecting the SV type, an NA is indicated in the table. Each tool name is color coded to match the color code in the plots. DELLY’s suboptimal deletion performance was due to its lower breakpoint resolution. For insertions, although Pindel’s sensitivity was close to MetaSV, it had a significantly lower precision and overall accuracy.

Figure 2 shows detailed accuracy comparisons for both deletion and insertion detection across different SV sizes and tools. For deletions (Fig. 2a), MetaSV performance was the highest across all SV sizes. In most cases, it improves upon the best performing individual tool for a given size. For insertions (Fig. 2b), the improvement due to MetaSV was more significant since all tools (with the exception of Pindel) detect almost no insertions due to inherent limitations of the methods used. Therefore, almost all improvement in accuracy is due to our insertion detection enhancement: our soft-clip based approach is very sensitive, while the assembly step for insertion detection yields high precision, in contrast to Pindel, which had low precision. As with deletions, small insertions are more difficult to detect, in general. Detailed accuracy comparisons for other SV types are discussed in the Supplementary Text.

In order to study the impact on accuracy and runtime of varying coverages, we generated additional simulation datasets with 2 × 100 bp reads at 10× and 30× coverages. We also generated 2 × 250 bp reads at 50× coverage to investigate the impact of increasing read length for the same coverage. Although accuracy dropped for lower coverages, MetaSV was still the most stable and most accurate, with deletion F1-scores of 89.1 and 95.8% for 10× and 30× coverages, respectively. For 250 bp read length, F1-scores of 96.8 and 80.9% were achieved for deletions and insertions, respectively; insertion accuracy dropped slightly relative to 100 bp reads because longer reads mean fewer reads at the same coverage. Furthermore, it took around 25 h to run the four aforementioned tools together with MetaSV itself on a single node with dual hexa-core Intel Xeon X5675 processors for 50× coverage. Because MetaSV is parallelized in all its steps, its speed should scale linearly with the number of available cores.

In addition to simulation, we used the publicly available Illumina Platinum Genomes sequencing data for NA12878 as a real-data testbed. Due to the lack of high-confidence comprehensive SV calls, particularly for insertions, false positive rates cannot be accurately determined using real data. Therefore, only sensitivity for deletions is reported here. A deletion sensitivity of 90.2% was achieved against the Complete Genomics high-confidence callset for NA12878 (Drmanac et al.) generated using their analysis pipeline version 2.0; this sensitivity is similar to our simulation results. Complete Genomics was used since it is an orthogonal sequencing platform, providing a less biased validation.

4 Conclusions

MetaSV significantly improves the accuracy of SV-calling by integrating orthogonal methods and tools. In addition, it is augmented with soft-clip based insertion detection for significantly higher accuracy compared with the state of the art. We consider MetaSV a proof of concept of the effectiveness of using an ensemble approach for calling SVs. The approach is not limited to the four aforementioned tools; it can be easily adapted to use additional tools or even a different set of tools.

Funding

W.H.W. was supported by National Institutes of Health grants [1R01HG007834] and [1R01GM109836].

Conflicts of Interest: W.H.W. and N.B. are co-founders of Bina Technologies. W.H.W. and M.B.G. are scientific advisors for Bina Technologies.
References (16 in total)

Review 1.  Structural variation in the human genome and its role in disease.

Authors:  Paweł Stankiewicz; James R Lupski
Journal:  Annu Rev Med       Date:  2010       Impact factor: 13.739

2.  CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing.

Authors:  Alexej Abyzov; Alexander E Urban; Michael Snyder; Mark Gerstein
Journal:  Genome Res       Date:  2011-02-07       Impact factor: 9.043

3.  Detecting and annotating genetic variations using the HugeSeq pipeline.

Authors:  Hugo Y K Lam; Cuiping Pan; Michael J Clark; Phil Lacroute; Rui Chen; Rajini Haraksingh; Maeve O'Huallachain; Mark B Gerstein; Jeffrey M Kidd; Carlos D Bustamante; Michael Snyder
Journal:  Nat Biotechnol       Date:  2012-03-07       Impact factor: 54.908

4.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.

Authors:  Kai Ye; Marcel H Schulz; Quan Long; Rolf Apweiler; Zemin Ning
Journal:  Bioinformatics       Date:  2009-06-26       Impact factor: 6.937

5.  Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays.

Authors:  Radoje Drmanac; Andrew B Sparks; Matthew J Callow; Aaron L Halpern; Norman L Burns; Bahram G Kermani; Paolo Carnevali; Igor Nazarenko; Geoffrey B Nilsen; George Yeung; Fredrik Dahl; Andres Fernandez; Bryan Staker; Krishna P Pant; Jonathan Baccash; Adam P Borcherding; Anushka Brownley; Ryan Cedeno; Linsu Chen; Dan Chernikoff; Alex Cheung; Razvan Chirita; Benjamin Curson; Jessica C Ebert; Coleen R Hacker; Robert Hartlage; Brian Hauser; Steve Huang; Yuan Jiang; Vitali Karpinchyk; Mark Koenig; Calvin Kong; Tom Landers; Catherine Le; Jia Liu; Celeste E McBride; Matt Morenzoni; Robert E Morey; Karl Mutch; Helena Perazich; Kimberly Perry; Brock A Peters; Joe Peterson; Charit L Pethiyagoda; Kaliprasad Pothuraju; Claudia Richter; Abraham M Rosenbaum; Shaunak Roy; Jay Shafto; Uladzislau Sharanhovich; Karen W Shannon; Conrad G Sheppy; Michel Sun; Joseph V Thakuria; Anne Tran; Dylan Vu; Alexander Wait Zaranek; Xiaodi Wu; Snezana Drmanac; Arnold R Oliphant; William C Banyai; Bruce Martin; Dennis G Ballinger; George M Church; Clifford A Reid
Journal:  Science       Date:  2009-11-05       Impact factor: 47.728

6.  Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library.

Authors:  Hugo Y K Lam; Xinmeng Jasmine Mu; Adrian M Stütz; Andrea Tanzer; Philip D Cayting; Michael Snyder; Philip M Kim; Jan O Korbel; Mark B Gerstein
Journal:  Nat Biotechnol       Date:  2009-12-27       Impact factor: 54.908

7.  BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.

Authors:  Ken Chen; John W Wallis; Michael D McLellan; David E Larson; Joelle M Kalicki; Craig S Pohl; Sean D McGrath; Michael C Wendl; Qunyuan Zhang; Devin P Locke; Xiaoqi Shi; Robert S Fulton; Timothy J Ley; Richard K Wilson; Li Ding; Elaine R Mardis
Journal:  Nat Methods       Date:  2009-08-09       Impact factor: 28.547

8.  AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision.

Authors:  Alexej Abyzov; Mark Gerstein
Journal:  Bioinformatics       Date:  2011-01-13       Impact factor: 6.937

9.  Mapping copy number variation by population-scale genome sequencing.

Authors:  Ryan E Mills; Klaudia Walter; Chip Stewart; Robert E Handsaker; Ken Chen; Can Alkan; Alexej Abyzov; Seungtai Chris Yoon; Kai Ye; R Keira Cheetham; Asif Chinwalla; Donald F Conrad; Yutao Fu; Fabian Grubert; Iman Hajirasouliha; Fereydoun Hormozdiari; Lilia M Iakoucheva; Zamin Iqbal; Shuli Kang; Jeffrey M Kidd; Miriam K Konkel; Joshua Korn; Ekta Khurana; Deniz Kural; Hugo Y K Lam; Jing Leng; Ruiqiang Li; Yingrui Li; Chang-Yun Lin; Ruibang Luo; Xinmeng Jasmine Mu; James Nemesh; Heather E Peckham; Tobias Rausch; Aylwyn Scally; Xinghua Shi; Michael P Stromberg; Adrian M Stütz; Alexander Eckehart Urban; Jerilyn A Walker; Jiantao Wu; Yujun Zhang; Zhengdong D Zhang; Mark A Batzer; Li Ding; Gabor T Marth; Gil McVean; Jonathan Sebat; Michael Snyder; Jun Wang; Kenny Ye; Evan E Eichler; Mark B Gerstein; Matthew E Hurles; Charles Lee; Steven A McCarroll; Jan O Korbel
Journal:  Nature       Date:  2011-02-03       Impact factor: 49.962

10.  Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly.

Authors:  Kim Wong; Thomas M Keane; James Stalker; David J Adams
Journal:  Genome Biol       Date:  2010-12-31       Impact factor: 13.583
