
BCFtools/csq: haplotype-aware variant consequences.

Abstract

MOTIVATION: Prediction of functional variant consequences is an important part of sequencing pipelines, allowing the categorization and prioritization of genetic variants for follow-up analysis. However, current predictors analyze variants as isolated events, which can lead to incorrect predictions when adjacent variants alter the same codon, or when a frame-shifting indel is followed by a frame-restoring indel. Exploiting known haplotype information when making consequence predictions can resolve these issues.
RESULTS: BCFtools/csq is a fast program for haplotype-aware consequence calling which can take into account known phase. Consequence predictions are changed for 501 of 5019 compound variants found in the 81.7M variants in the 1000 Genomes Project data, with an average of 139 compound variants per haplotype. Predictions match existing tools when run in localized mode, but the program is an order of magnitude faster and requires an order of magnitude less memory.
AVAILABILITY AND IMPLEMENTATION: The program is freely available for commercial and non-commercial use in the BCFtools package, which is available for download from http://samtools.github.io/bcftools.
CONTACT: pd3@sanger.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28205675 PMCID: PMC5870570 DOI: 10.1093/bioinformatics/btx100

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

With the rapidly growing number of sequenced exome and whole-genome samples, it is important to be able to quickly sift through the vast amount of data for the variants of most interest. A key step in this process is to take sequencing variants and provide functional effect annotations. For clinical, evolutionary and genotype-phenotype studies, accurate prediction of functional consequences can be critical to downstream interpretation. There are several popular existing programs for predicting the effect of variants, such as the Ensembl Variant Effect Predictor (VEP) (McLaren), SnpEff (Cingolani) and ANNOVAR (Wang). One significant limitation is that they are single-record based and, as shown in Figure 1, this can produce annotations that turn out to be incorrect once surrounding in-phase variants are taken into account.

Fig. 1

Three types of compound variants that lead to incorrect consequence prediction when handled in a localized manner, i.e. each variant separately rather than jointly. (A) Multiple SNVs in the same codon result in a TAG stop codon rather than an amino acid change. (B) A deletion locally predicted as frame-shifting is followed by a frame-restoring variant. Two amino acids are deleted and one is changed, so the consequence for protein function is likely much less severe. (C) Two SNVs separated by an intron occur within the same codon in the spliced transcript. Unchanged areas are shaded for readability. All three examples were encountered in real data. (Color version of this figure is available at Bioinformatics online.)

With recent experimental and computational advancements, phased haplotypes over tens of kilobases are becoming routinely available through the reduced cost of long-range sequencing technologies (Zheng) and the increased accuracy of statistical phasing algorithms (Loh; Sharp) due to the increased sample cohort sizes (McCarthy). We present a new variant consequence predictor implemented in BCFtools/csq that can exploit this information.
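Case (A) of Figure 1 can be made concrete with a minimal Python sketch. The reference codon and the two SNVs below are hypothetical values chosen for illustration, not taken from the data behind the figure, and the function names are illustrative rather than part of BCFtools/csq. The sketch only contrasts evaluating each SNV alone with evaluating both on the same haplotype.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


def classify(ref_codon, alt_codon):
    """Coarse consequence label for a single codon change."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


REF_CODON = "CAC"            # hypothetical reference codon (His)
SNVS = [(0, "T"), (2, "G")]  # hypothetical in-phase SNVs at codon offsets 0 and 2

# Localized calling: each SNV is evaluated on its own.
localized = [classify(REF_CODON, apply_snvs(REF_CODON, [snv])) for snv in SNVS]
# Haplotype-aware calling: both SNVs are applied to the same codon.
joint = classify(REF_CODON, apply_snvs(REF_CODON, SNVS))

print(localized)  # ['missense', 'missense']
print(joint)      # stop_gained
```

Taken separately, each SNV yields a missense change (CAC to TAC, and CAC to CAG), whereas the two together produce the TAG stop codon.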

2 Materials and methods

For haplotype-aware calling, a phased VCF, a GFF3 file with gene predictions and a reference FASTA file are required. The program begins by parsing gene predictions in the GFF3 file, then streams through the VCF file using a fast region lookup at each site to find overlaps with regions of supported genomic types (exons, CDS, UTRs or general transcripts). Active transcripts that overlap variants being annotated are maintained in a heap data structure.

For each transcript we build a haplotype tree which includes phased genotypes present across all samples. The nodes in this tree correspond to VCF records, each with as many child nodes as there are alleles. Even in the worst case, where every sample has two unique haplotypes, the number of leaves in the haplotype tree does not grow exponentially but stops at the total number of unique haplotypes present in the samples. Thus each internal node of the tree corresponds to a set of haplotypes with the same prefix, and each leaf node corresponds to a haplotype that may be shared by multiple samples. Once all variants from a transcript are retrieved from the VCF, the consequences are determined on a spliced transcript sequence and reported in the VCF.

Representing the consequences is itself a challenge: there can be many samples in the VCF, each with different haplotypes, which makes the prediction non-local. Moreover, diploid samples have two haplotypes and at each position there can be multiple overlapping transcripts. To represent this rich information and keep the output compact, all unique consequences are recorded in a per-site INFO tag with a structure similar to that of existing annotators. Consequences for each haplotype are recorded in a per-sample FORMAT tag as a bitmask of indexes into the list of consequences recorded in the INFO tag. The bitmask interleaves the two haplotypes so that when stored in BCF (binary VCF) format, only 8 bits per sample are required for most sites.
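The interleaved bitmask can be sketched in Python as follows. The exact bit layout is an assumption made for illustration: here bit 2*i + h is set when consequence i of the per-site list applies to haplotype h (0 or 1). The consequence strings are hypothetical placeholders, and the function names are not part of BCFtools.

```python
def encode_bitmask(hap1_indexes, hap2_indexes):
    """Pack per-haplotype consequence indexes into one interleaved integer.

    Assumed layout: bit 2*i + h marks consequence i on haplotype h.
    """
    mask = 0
    for hap, indexes in enumerate((hap1_indexes, hap2_indexes)):
        for i in indexes:
            mask |= 1 << (2 * i + hap)
    return mask


def decode_bitmask(mask, consequences):
    """Translate a bitmask back into the consequences of each haplotype."""
    per_hap = ([], [])
    for i, csq in enumerate(consequences):
        for hap in (0, 1):
            if (mask >> (2 * i + hap)) & 1:
                per_hap[hap].append(csq)
    return per_hap


# Hypothetical per-site list of unique consequences (the INFO tag content).
CONSEQUENCES = ["missense|GENE_A", "stop_gained|GENE_A"]

# Haplotype 1 carries consequence 0; haplotype 2 carries consequences 0 and 1.
mask = encode_bitmask([0], [0, 1])
print(mask)                                # 11 (binary 1011)
print(decode_bitmask(mask, CONSEQUENCES))
```

Under this layout, up to four distinct consequences per site fit into a single byte for a diploid sample, which is consistent with the statement that 8 bits per sample suffice for most sites.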
The bitmask can be translated into a human-readable form using the BCFtools/query command. Consequences of compound variants linking multiple sites are reported at only one of the sites, with the others referencing this record by position.
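The haplotype tree described in this section can be sketched as a prefix tree keyed by allele index. This is a simplified illustration under stated assumptions, not the BCFtools/csq implementation: the `Node` class, the `build_tree` and `leaves` helpers, and the sample genotypes are all hypothetical, and every sample is assumed to be diploid and fully phased.

```python
class Node:
    """One node per VCF record; children are keyed by allele index."""

    def __init__(self):
        self.children = {}  # allele index -> Node
        self.carriers = []  # (sample, haplotype) pairs whose haplotype ends here


def build_tree(phased):
    """Build the tree from {sample: [(allele_hap1, allele_hap2), ...]}.

    Each tuple is the phased genotype at one VCF record within a transcript.
    """
    root = Node()
    for sample, genotypes in phased.items():
        for hap in (0, 1):
            node = root
            for genotype in genotypes:
                node = node.children.setdefault(genotype[hap], Node())
            node.carriers.append((sample, hap))
    return root


def leaves(node, prefix=()):
    """Yield (haplotype, carriers) for every leaf of the tree."""
    if not node.children:
        yield prefix, node.carriers
    for allele, child in sorted(node.children.items()):
        yield from leaves(child, prefix + (allele,))


# Hypothetical phased genotypes at two records of one transcript.
PHASED = {
    "S1": [(0, 1), (1, 1)],
    "S2": [(0, 0), (1, 1)],
    "S3": [(1, 0), (1, 0)],
}

tree = build_tree(PHASED)
unique = dict(leaves(tree))
print(len(unique))  # 3 unique haplotypes among 6 sample haplotypes
```

Because haplotypes that share a prefix share a path, consequences need to be evaluated only once per leaf, and the number of leaves can never exceed the number of sample haplotypes.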

3 Results

3.1 Accuracy

Accuracy was tested by running in localized mode and comparing against one of the existing local consequence callers (VEP) using gold-standard segregation-phased NA12878 data (Cleary). With each site treated independently, we expect good agreement with VEP at all sites. Indeed, only 11 out of 1.6M predictions differed within coding regions. See the Supplement (S2) for further details about these differences. Detailed comparison between VEP and other local callers is a topic that has been discussed elsewhere (McCarthy).

3.2 Performance

Performance was compared to VEP (McLaren), SnpEff (Cingolani) and ANNOVAR (Wang) running on the same NA12878 data. In both localized and haplotype-aware modes, BCFtools/csq was faster by an order of magnitude than the fastest of these programs and required an order of magnitude less memory (see Supplementary Table S1). In contrast to localized calling, the scaling of haplotype-aware calling depends on the number of samples being annotated. In Supplementary Figures S2–S5, we show that memory and time both scale linearly with the number of sites in the transcript buffer and the number of samples.

3.3 Compound variants in 1000 Genomes

Applied to the 1000 Genomes Phase 3 data, haplotype-aware consequence calling modifies the predictions for 501 of 5019 compound variants, summarized in Table 1 and discussed in the Supplement S3. On average, we observe 139.4 compound variants per haplotype (Supplementary Fig. S1), recover 16.4 variants incorrectly predicted as deleterious, and identify 0.8 newly deleterious compound variants per haplotype.

Table 1

Summary of BCFtools/csq consequence type changes from localized (rows) to haplotype-aware (columns) calling in 1000 Genomes data

Note: Blue/orange background indicates a change to a less/more severe prediction in haplotype-aware calling. Only variants with modified predictions are included in the table. (Color version of this table is available at Bioinformatics online.)

To highlight an example, a frame-restoring pair of indels in the DNA-binding protein gene SON was found to be monomorphic across all 1000 Genomes samples. There, a 1-bp insertion followed by a 1-bp deletion (G > GA at 21:34 948 684 and GA > G at 21:34 948 696) are each predicted as frame-shifting, but in reality the combined effect is a substitution of four amino acids. The functional consequence is therefore likely much less severe, consistent with the SON gene being highly intolerant of loss-of-function mutations, as predicted by ExAC (Lek).

In most studies haplotypes have been determined statistically. Given the typical 1% switch error rate, we estimate the compound error rate from the distribution of heterozygous genotypes in compound variants to be 1.1%; see the Supplement S4 for details.
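The effect of a frame-restoring indel pair, as in the SON example, can be illustrated with a short Python sketch. The coding sequence and both indels below are a toy construction, not the SON sequence or its actual variants, and the helper names are illustrative. Variants are given as 0-based (position, ref, alt) tuples in reference coordinates.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_variants(seq, variants):
    """Apply (pos, ref, alt) variants right to left so positions stay valid."""
    for pos, ref, alt in sorted(variants, reverse=True):
        assert seq[pos:pos + len(ref)] == ref, "reference allele mismatch"
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


def translate(seq):
    """Translate complete codons only; trailing bases are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


REF_CDS = "ATGGCTGAAACCGGTCTGAAATTTGATTAA"  # toy coding sequence
INSERTION = (5, "T", "TA")                   # hypothetical 1-bp insertion
DELETION = (14, "TC", "T")                   # hypothetical 1-bp deletion

ref_protein = translate(REF_CDS)
joint_protein = translate(apply_variants(REF_CDS, [INSERTION, DELETION]))
changed = sum(a != b for a, b in zip(ref_protein, joint_protein))

print(ref_protein)    # MAETGLKFD*
print(joint_protein)  # MARNRLKFD*
print(changed)        # 3
```

Each indel alone changes the sequence length by one base and shifts the reading frame, but on the same haplotype the frame is restored after the deletion and only the residues between the two indels are substituted.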

4 Discussion

Correctly classifying the functional consequence of variants in the context of nearby variants in known phase can change the interpretation of their effect. Variants previously flagged as benign or less severe may now be flagged as deleterious and vice versa. In a rare disease sequencing study, for example, this may have a significant impact, as these functional annotations may determine which variants to follow up for further study.

Previous work by Wei does not consider indels or introns occurring within the same codon, and requires access to the BAM alignment files to estimate haplotypes. Our approach starts with phased VCF data, leaving haplotype calling as a problem to be solved by other means, for example by statistical phasing. We focus instead on providing fast consequence prediction that takes into account all variation within a transcript. The standard programs have rich functionality beyond the reporting of variant consequences, and the aim of BCFtools/csq is not to compete with that. Rather, we propose that haplotype-aware calling be included in annotation pipelines for enhanced downstream analysis.

References

1. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors: Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal: Fly (Austin) Date: 2012 Apr-Jun Impact factor: 2.160

2. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data.

Authors: John G Cleary; Ross Braithwaite; Kurt Gaastra; Brian S Hilbush; Stuart Inglis; Sean A Irvine; Alan Jackson; Richard Littin; Sahar Nohzadeh-Malakshah; Mehul Rathod; David Ware; Len Trigg; Francisco M De La Vega
Journal: J Comput Biol Date: 2014-06 Impact factor: 1.479

3. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nucleic Acids Res Date: 2010-07-03 Impact factor: 16.971

4. Phasing for medical sequencing using rare variants and large haplotype reference panels.

Authors: Kevin Sharp; Warren Kretzschmar; Olivier Delaneau; Jonathan Marchini
Journal: Bioinformatics Date: 2016-02-27 Impact factor: 6.937

5. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

6. A reference panel of 64,976 haplotypes for genotype imputation.

Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin
Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330

7. MAC: identifying and correcting annotation for multi-nucleotide variations.

Authors: Lei Wei; Lu T Liu; Jacob R Conroy; Qiang Hu; Jeffrey M Conroy; Carl D Morrison; Candace S Johnson; Jianmin Wang; Song Liu
Journal: BMC Genomics Date: 2015-08-01 Impact factor: 3.969

8. Choice of transcripts and software has a large effect on variant annotation.

Authors: Davis J McCarthy; Peter Humburg; Alexander Kanapin; Manuel A Rivas; Kyle Gaulton; Jean-Baptiste Cazier; Peter Donnelly
Journal: Genome Med Date: 2014-03-31 Impact factor: 11.117

9. The Ensembl Variant Effect Predictor.

Authors: William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal: Genome Biol Date: 2016-06-06 Impact factor: 13.583

10. Reference-based phasing using the Haplotype Reference Consortium panel.

Authors: Po-Ru Loh; Petr Danecek; Pier Francesco Palamara; Christian Fuchsberger; Yakir A Reshef; Hilary K Finucane; Sebastian Schoenherr; Lukas Forer; Shane McCarthy; Goncalo R Abecasis; Richard Durbin; Alkes L Price
Journal: Nat Genet Date: 2016-10-03 Impact factor: 38.330
