ABRA: improved coding indel detection via assembly-based realignment.

Lisle E Mose, Matthew D Wilkerson, D Neil Hayes, Charles M Perou, Joel S Parker.

Abstract

MOTIVATION: Variant detection from next-generation sequencing (NGS) data is an increasingly vital aspect of disease diagnosis, treatment and research. Commonly used NGS-variant analysis tools generally rely on accurately mapped short reads to identify somatic variants and germ-line genotypes. Existing NGS read mappers have difficulty accurately mapping short reads containing complex variation (i.e. more than a single base change), thus making identification of such variants difficult or impossible. Insertions and deletions (indels) in particular have been an area of great difficulty. Indels are frequent and can have substantial impact on function, which makes their detection all the more imperative.
RESULTS: We present ABRA, an assembly-based realigner, which uses an efficient and flexible localized de novo assembly followed by global realignment to more accurately remap reads. This results in enhanced performance for indel detection as well as improved accuracy in variant allele frequency estimation.
AVAILABILITY AND IMPLEMENTATION: ABRA is implemented in a combination of Java and C/C++ and is freely available for download at https://github.com/mozack/abra.
© The Author 2014. Published by Oxford University Press.

Year: 2014; PMID: 24907369; PMCID: PMC4173014; DOI: 10.1093/bioinformatics/btu376

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


1 INTRODUCTION

A number of realignment or assembly methods have been proposed to overcome the alignment errors and reference bias that limit indel detection. Short read micro aligner (SRMA) locally realigns reads to regionally assembled variant graphs (Homer and Nelson, 2010). Pindel uses a pattern growth approach to detect indels (Ye et al.). Dindel realigns reads to candidate haplotypes and uses a Bayesian method to call indels up to 50 bp in length (Albers et al.). The Genome Analysis Toolkit (GATK)’s IndelRealigner seeks to minimize the number of mismatching bases via local realignment (DePristo et al.). Whole-genome de novo assembly approaches include Fermi (Li, 2012) and Cortex Var (Iqbal et al.). SOAPindel performs localized assembly and calling on regions containing reads where only one half of a paired read is mapped (Li). Clipping REveals STructure (CREST) uses soft-clipped reads and localized assembly to identify somatic structural variants (Wang et al., 2010). Targeted Iterative Graph Routing Assembler (TIGRA) uses targeted assembly to produce contigs from putative breakpoints (Chen). Additional proprietary localized assembly methods have been developed by Complete Genomics (Carnevali et al.) and Foundation Medicine (Frampton).
Our newly developed tool, ABRA, accepts a Sequence Alignment/Map (SAM/BAM) file as input and produces a realigned BAM file as output, allowing flexibility in the selection of variant-calling algorithms and other downstream analyses. Global realignment allows reads that are unaligned or improperly mapped to be moved to a correct location. ABRA detects variation that is not present in the original read alignments and improves allele-frequency estimates for variation that is present. ABRA can be used to enhance both germ-line and somatic variant detection and works with paired-end as well as single-end data.

2 METHODS

The ABRA algorithm consists of localized region assembly, contig building, alignment of assembled contigs and read realignment.
Localized assembly of reads is done on small genomic regions of size ≤2 kb. For exome or targeted sequencing, these regions roughly correspond to capture targets. For each region, a de Bruijn graph of k-mers is assembled from the input reads (Pevzner et al.). K-mers containing low-quality or ambiguous bases are filtered, and k-mers that do not appear in at least two distinct reads are pruned from the graph, reducing the impact of sequencing errors on the assembly process.
After initial pruning, the graph is traversed to build contigs longer than the original read length. There is no smoothing of the graph to remove low-frequency variation, as we are interested in detecting such variation. All non-cyclic paths through the graph are traversed. In cases where a cycle in the graph is observed for a given region, that region is iteratively reassembled using increasing k-mer sizes until the cycle no longer exists or a configurable maximum k-mer size is reached. As currently implemented, detection of local insertions is limited to less than the maximum k-mer size. Larger insertions of sequence from another location in the genome are likely to be aligned elsewhere and not included in local assembly, thus limiting detection of insertions as the size approaches read length.
Assembled contigs for all regions are aligned to the reference genome. We currently use BWA MEM (Li, 2013) for contig alignment. Chimerically aligned contigs are combined when appropriate (in cases of longer indels). Redundant sequence, as well as sequence not varying from the original reference, is removed. The result is used as the basis for an alternate reference. The original reads are mapped to the alternate reference using a non-gapped alignment.
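
The graph construction, pruning and cycle-handling steps described above can be illustrated with a minimal Python sketch. This is not ABRA's Java/C++ implementation: the function names (`build_graph`, `has_cycle`, `contigs`, `assemble_region`), the step size for increasing k, and the simplification that an edge exists whenever two retained k-mers overlap by k-1 bases are all assumptions made for illustration. Base-quality filtering and the contig length filter are omitted; only ambiguous (`N`) bases are filtered.

```python
from collections import defaultdict


def build_graph(reads, k, min_reads=2):
    """Map each retained k-mer to the set of retained k-mers that can follow it."""
    support = defaultdict(set)  # k-mer -> indices of the reads that contain it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers with ambiguous bases
                support[kmer].add(idx)
    # Prune k-mers that are not seen in at least `min_reads` distinct reads.
    kept = {kmer for kmer, ids in support.items() if len(ids) >= min_reads}
    graph = {kmer: set() for kmer in kept}
    for kmer in kept:
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in kept:
                graph[kmer].add(nxt)
    return graph


def has_cycle(graph):
    """Depth-first search with three colours; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


def contigs(graph):
    """Spell out every path from a source k-mer to a sink k-mer (acyclic graphs only)."""
    has_pred = {n for succs in graph.values() for n in succs}
    sources = sorted(n for n in graph if n not in has_pred)
    out = []

    def walk(node, seq):
        if not graph[node]:
            out.append(seq)
            return
        for nxt in sorted(graph[node]):
            walk(nxt, seq + nxt[-1])

    for source in sources:
        walk(source, source)
    return out


def assemble_region(reads, k, max_k, step=2):
    """Reassemble with larger k until the graph is acyclic or max_k is exceeded."""
    while k <= max_k:
        graph = build_graph(reads, k)
        if not has_cycle(graph):
            return k, contigs(graph)
        k += step
    return None, []


reads = ["ACGGTCAT", "CGGTCATT", "GGTCATTG", "GTCATTGC", "TCATTGCA"]
print(assemble_region(reads, k=4, max_k=8))  # (4, ['CGGTCATTGC'])
```

In this toy input the first and last 4-mers occur in only one read each, so they are pruned and the contig is two bases shorter than the underlying sequence. A tandem repeat such as `ACACACAC` produces a cycle at k=4 and k=6 and is only resolved once k reaches 8.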
Reads that unambiguously align more closely to the alternate than the original reference are modified to reflect the updated alignment information in the context of the original reference. Typical ABRA runtime for a human whole exome of depth 150X on a machine with eight cores is roughly 2 h using <16 GB of RAM.
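
The read realignment decision can be sketched in the same spirit. The following Python example assumes that "aligns more closely" means fewer mismatches in the best ungapped placement and that "unambiguously" means a single best placement on the alternate sequence; the text does not spell out either criterion, and the function names are hypothetical. Translating the chosen placement back into coordinates on the original reference is not shown.

```python
def best_ungapped_hits(read, target):
    """Return (fewest mismatches, offsets achieving it) over all ungapped placements."""
    best, offsets = None, []
    for off in range(len(target) - len(read) + 1):
        window = target[off:off + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best:
            best, offsets = mismatches, [off]
        elif mismatches == best:
            offsets.append(off)
    return best, offsets


def prefers_alternate(read, reference, alternate):
    """True if the read has a unique best placement on the alternate sequence
    with strictly fewer mismatches than its best placement on the reference."""
    ref_mm, _ = best_ungapped_hits(read, reference)
    alt_mm, alt_offsets = best_ungapped_hits(read, alternate)
    if alt_mm is None:  # read is longer than the alternate sequence
        return False
    unique = len(alt_offsets) == 1
    return unique and (ref_mm is None or alt_mm < ref_mm)


reference = "ACGTACGGTTCAGGCTA"
alternate = reference[:6] + reference[10:]  # assembled contig carrying a 4 bp deletion

print(prefers_alternate("GTACCAGG", reference, alternate))  # True: read spans the deletion
print(prefers_alternate("ACGTACGG", reference, alternate))  # False: read matches the reference
```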

3 RESULTS

3.1 HapMap trio

ABRA was applied to exome target regions of a CEPH HapMap trio sequenced to 50X as part of the Illumina Platinum Genomes project and aligned using BWA MEM. Variants were called with and without ABRA using FreeBayes (Garrison and Marth, 2012) and UnifiedGenotyper (DePristo et al.). The GATK’s HaplotypeCaller was used to call variants without ABRA, and the GATK’s IndelRealigner was applied to UnifiedGenotyper input. Coding indels with a variant-allele frequency of ≥20% are used in this germ-line evaluation. ABRA increases the number of Mendelian consistent loci (MCL) detected and decreases the Mendelian conflict rate (MCR) with either FreeBayes or UnifiedGenotyper (Fig. 1). The FreeBayes/ABRA combination yields a lower MCR than HaplotypeCaller and remains competitive in the number of MCL detected. Pre-/post-ABRA concordance for Mendelian consistent SNP loci is >99%. Although we anticipate that ABRA will also provide improved performance in non-coding regions, this has not yet been explored.
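
As an illustration of the trio evaluation, the Python sketch below counts Mendelian consistent loci and computes a conflict rate. It assumes diploid genotypes written as allele pairs (0 = reference, 1 = indel) and assumes that MCR is the fraction of evaluated loci in conflict; the text does not give the exact definition, and the function names and example genotypes are hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """A child genotype is consistent if one allele can come from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_summary(trio_calls):
    """Return (number of consistent loci, conflict rate) for (child, mother, father) calls."""
    consistent = sum(is_mendelian_consistent(c, m, f) for c, m, f in trio_calls)
    conflicts = len(trio_calls) - consistent
    rate = conflicts / len(trio_calls) if trio_calls else 0.0
    return consistent, rate


# Made-up genotypes at four loci: (child, mother, father).
calls = [
    ((0, 1), (0, 1), (0, 0)),  # consistent: indel allele inherited from the mother
    ((1, 1), (0, 1), (0, 1)),  # consistent: one indel allele from each parent
    ((0, 1), (0, 0), (0, 0)),  # conflict: neither parent carries the indel
    ((1, 1), (1, 1), (0, 0)),  # conflict: father cannot supply an indel allele
]
print(mendelian_summary(calls))  # (2, 0.5)
```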
Fig. 1.

Mendelian consistent loci and Mendelian conflict rates for FreeBayes and UnifiedGenotyper both pre- and post-ABRA. UnifiedGenotyper results with GATK Local Realignment around Indels as well as HaplotypeCaller results are also shown for comparison. Shapes in this figure represent variant depth, whereas color/shading represents caller and realignment method.

3.2 TCGA tumor and normal data

We applied ABRA to 100 normal exomes from the Breast Invasive Carcinoma (BRCA) cohort of The Cancer Genome Atlas (TCGA) project (The Cancer Genome Atlas Network, 2012) using BWA (Li and Durbin, 2009) for the initial alignments. Germ-line variants were called both with and without ABRA using FreeBayes. We also called germ-line variants using HaplotypeCaller and Pindel for comparison purposes. To evaluate these calls in the absence of ground truth, we assembled predicted calls for all methods using TIGRA and aligned the resulting contigs with the BLAST-like alignment tool (BLAT) (Kent). ABRA increased concordance with the TIGRA/BLAT results and maintained a low discordance rate (Fig. 2). Further, ABRA generated estimated allele frequencies closer to 50 and 100%, which is expected in a diploid individual (see Supplementary Material). We next compared pre- and post-ABRA somatic variant calls on 750 TCGA BRCA normal/tumor exome pairs. Strelka (Saunders) and UNCeqR (Wilkerson et al.) were used for somatic calling. Improved detection of somatic mutations was observed in the post-ABRA calls (see Supplementary Material).
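
The allele-frequency argument can be made concrete with a short Python sketch. Variant allele frequency is taken here as the fraction of reads at a locus that support the variant, and the distance to the nearest diploid expectation (50% for heterozygous, 100% for homozygous) is used as a simple summary. The read counts are invented for illustration and are not results from this study; the function names are hypothetical.

```python
def variant_allele_frequency(alt_reads, ref_reads):
    """Fraction of reads at a locus that support the variant allele."""
    total = alt_reads + ref_reads
    return alt_reads / total if total else 0.0


def distance_to_diploid_expectation(vaf):
    """Distance from the nearest expected germ-line value (0.5 or 1.0)."""
    return min(abs(vaf - 0.5), abs(vaf - 1.0))


# Invented counts for one heterozygous indel: some supporting reads are
# misaligned before realignment and are counted as reference or dropped.
before = variant_allele_frequency(alt_reads=12, ref_reads=28)
after = variant_allele_frequency(alt_reads=19, ref_reads=21)

print(round(before, 3), round(after, 3))
print(distance_to_diploid_expectation(after) < distance_to_diploid_expectation(before))
```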
Fig. 2.

Concordance/discordance with TIGRA assembled contigs for predicted calls from FreeBayes (pre- and post-ABRA), Pindel and HaplotypeCaller. Indels within the ranges enabled by ABRA are evaluated (deletions up to 2000 bp and insertions up to the read length). The numbers in the figure represent a cutoff point for variant quality scores as reported in the respective caller’s VCF output. Only a small number of pre-ABRA deletions >30 bp and no pre-ABRA insertions >30 bp are called. FreeBayes currently does not use reads partially overlapping an insertion as supporting evidence, which may impact post-ABRA sensitivity for longer insertions.

4 CONCLUSION

ABRA improves on next-generation sequencing read alignments, providing enhanced performance in detection of indels as well as greater accuracy in variant allele frequency estimation. ABRA accepts BAM files as input and outputs realigned BAM files, allowing flexibility in downstream analysis. ABRA can be used with a variety of variant callers for both germ-line and somatic variant calling.
Funding: This work was supported in part by the National Cancer Institute Breast SPORE program (P50-CA58223-09A1) and The Cancer Genome Atlas (U24-CA143848-05).
Conflict of Interest: none declared.

REFERENCES

1.  An Eulerian path approach to DNA fragment assembly.

Authors:  P A Pevzner; H Tang; M S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2001-08-14       Impact factor: 11.205

2.  Computational techniques for human genome resequencing using mated gapped reads.

Authors:  Paolo Carnevali; Jonathan Baccash; Aaron L Halpern; Igor Nazarenko; Geoffrey B Nilsen; Krishna P Pant; Jessica C Ebert; Anushka Brownley; Matt Morenzoni; Vitali Karpinchyk; Bruce Martin; Dennis G Ballinger; Radoje Drmanac
Journal:  J Comput Biol       Date:  2011-12-16       Impact factor: 1.479

3.  Dindel: accurate indel calls from short-read data.

Authors:  Cornelis A Albers; Gerton Lunter; Daniel G MacArthur; Gilean McVean; Willem H Ouwehand; Richard Durbin
Journal:  Genome Res       Date:  2010-10-27       Impact factor: 9.043

4.  Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2012-05-07       Impact factor: 6.937

5.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.

Authors:  Kai Ye; Marcel H Schulz; Quan Long; Rolf Apweiler; Zemin Ning
Journal:  Bioinformatics       Date:  2009-06-26       Impact factor: 6.937

6.  Integrated RNA and DNA sequencing improves mutation detection in low purity tumors.

Authors:  Matthew D Wilkerson; Christopher R Cabanski; Wei Sun; Katherine A Hoadley; Vonn Walter; Lisle E Mose; Melissa A Troester; Peter S Hammerman; Joel S Parker; Charles M Perou; D Neil Hayes
Journal:  Nucleic Acids Res       Date:  2014-06-26       Impact factor: 16.971

7.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

8.  De novo assembly and genotyping of variants using colored de Bruijn graphs.

Authors:  Zamin Iqbal; Mario Caccamo; Isaac Turner; Paul Flicek; Gil McVean
Journal:  Nat Genet       Date:  2012-01-08       Impact factor: 38.330

9.  Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA.

Authors:  Nils Homer; Stanley F Nelson
Journal:  Genome Biol       Date:  2010-10-08       Impact factor: 13.583

10.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937
