
HaploTypo: a variant-calling pipeline for phased genomes.

Cinta Pegueroles¹, Verónica Mixão¹, Laia Carreté¹, Manu Molina¹, Toni Gabaldón^1,2,3.

Abstract

SUMMARY: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.
AVAILABILITY AND IMPLEMENTATION: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo and as a Docker image.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2020 PMID: 31834373 PMCID: PMC7178392 DOI: 10.1093/bioinformatics/btz933

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Motivation

The heterozygosity (i.e. the presence of alternative alleles at the same locus) present in diploid organisms can complicate genome analyses, particularly when the levels of heterozygosity are high. In recent years, several bioinformatics tools have been developed to account for this sequence complexity. These include pipelines and algorithms that assist genome assembly (Pryszcz and Gabaldón, 2016; Safonova et al.), the subsequent phasing of assembled genomes (Chin et al.; Edge et al.; Pan et al.) or allele-specific transcriptomic analysis (Deonovic et al.; Romanel et al.). However, to the best of our knowledge, available variant calling tools do not explicitly account for phased genomes. As a result, the user has to choose between using the combined phased haplotypes as reference, thereby losing heterozygosity information, or using only one of the haplotypes as reference, thereby sacrificing haplotype information. An illustrative example of this problem is provided by studies of the heterozygous yeast pathogen Candida albicans. Although the diploid genome of this pathogen was phased in 2013 (Muzzey et al.), subsequent studies have used only one of the haplotypes (Bensasson et al.; Ropars et al.), thereby losing the valuable haplotype information. Given the increasing number of highly heterozygous genomes, including those from hybrids (Mixão and Gabaldón, 2018), and the relevance of phased information for reconstructing their population structures and evolutionary histories, there is an urgent need for solutions that allow phased genomes to be exploited in genomic variation analysis. To fill this gap, we developed HaploTypo, a Python-based pipeline that, given a phased reference genome, provides detailed genome variation resolved at the haplotype level.
HaploTypo is not a de novo genome phasing tool, but a tool to phase variants in re-sequencing analyses using information from an already phased genome, resulting in a fast and accurate assessment of heterozygosity levels and reconstruction of haplotypes.

2 Implementation

HaploTypo requires as input the phased haplotypes of a diploid genome and either filtered paired-end genomic sequencing reads or their alignments to each of the reference haplotypes. The pipeline is divided into four modules, which can be run together or separately (Fig. 1). The first module aligns the paired-end genomic reads independently to each of the phased haplotypes using BWA-MEM (Li, 2013). The second module performs variant calling on the two resulting alignments using GATK (McKenna et al.), BCFtools (Li, 2011) or FreeBayes (Garrison and Marth, 2012), followed by variant filtration. At this point, variability information is available for each reference haplotype independently. The third module implements a variant phasing algorithm that, based on the comparison of the reference haplotypes and the previously called variants, infers which variants correspond to each haplotype. Phased (and, if required, unphased) genotypes for the two haplotypes are provided as independent VCF files. Additionally, unphased and unsolved positions for each haplotype are reported as BED files. A final module uses the VCF files generated in module 3 to reconstruct the haplotypes and provide them in FASTA format. Detailed information on the implementation of HaploTypo is available in the pipeline’s manual.
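The actual phasing rules are defined in the pipeline’s manual, not in this text. As a rough illustration only, the Python sketch below shows one way the comparison in module 3 could work for SNPs, under the simplifying assumptions that the two haplotypes share coordinates and that the sample bases implied by each VCF genotype have already been extracted. The function name, the status labels and the decision rules are hypothetical and are not taken from the HaploTypo code.

```python
from typing import FrozenSet, Optional, Tuple

Result = Tuple[str, Optional[Tuple[str, str]]]


def phase_site(ref_a: str, ref_b: str,
               alleles_vs_a: FrozenSet[str],
               alleles_vs_b: FrozenSet[str]) -> Result:
    """Classify one SNP position of a diploid sample (illustrative rules only).

    ref_a, ref_b     -- reference base of haplotype A / B at this position
    alleles_vs_a/_b  -- sample bases implied by the genotype called against
                        haplotype A / B (one base if homozygous, two if het)
    Returns (status, (base_on_A, base_on_B)), or (status, None) if no
    assignment is made.
    """
    # The two independent call sets disagree: treat as a likely mapping or
    # variant calling error.
    if alleles_vs_a != alleles_vs_b:
        return "unsolved", None

    alleles = alleles_vs_a
    if len(alleles) == 1:
        # Homozygous site: both haplotypes carry the same base.
        base = next(iter(alleles))
        return "phased", (base, base)
    if len(alleles) != 2:
        return "unsolved", None

    if ref_a == ref_b:
        # Heterozygous site where the references are identical: nothing
        # indicates which haplotype carries which allele.
        return "unphased", None
    if ref_a in alleles and ref_b in alleles:
        # Each sample allele matches one reference haplotype.
        return "phased", (ref_a, ref_b)
    if ref_a in alleles:
        # Assumption of this sketch: the remaining allele sits on haplotype B.
        return "phased", (ref_a, next(iter(alleles - {ref_a})))
    if ref_b in alleles:
        # Assumption of this sketch: the remaining allele sits on haplotype A.
        return "phased", (next(iter(alleles - {ref_b})), ref_b)
    # Neither allele matches either reference (e.g. a 1/2 genotype).
    return "unphased", None


if __name__ == "__main__":
    print(phase_site("A", "G", frozenset("AG"), frozenset("AG")))
    print(phase_site("A", "A", frozenset("AC"), frozenset("AC")))
    print(phase_site("A", "G", frozenset("AG"), frozenset("A")))
```

The three example calls cover a site where each allele matches one reference haplotype, an ambiguous site where both references carry the same base, and a site where the two call sets are incoherent.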

Fig. 1. Schematic representation of the four modules of the HaploTypo pipeline. The steps are described in the main text and in the pipeline’s manual.
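The final module in the schematic can be pictured as substituting phased alleles into a reference haplotype and writing the result as FASTA. The Python sketch below is a minimal illustration that handles SNPs only and assumes 1-based, VCF-style positions; `apply_snps` and `to_fasta` are hypothetical helper names, not functions from HaploTypo.

```python
from typing import Dict


def apply_snps(reference: str, snps: Dict[int, str]) -> str:
    """Return a copy of `reference` with single-base substitutions applied.

    `snps` maps 1-based positions (as in a VCF POS column) to the base that
    this haplotype carries at that position. Indels are not handled.
    """
    seq = list(reference)
    for pos, base in snps.items():
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the sequence")
        if len(base) != 1:
            raise ValueError("only single-base substitutions are supported")
        seq[pos - 1] = base  # convert 1-based position to 0-based index
    return "".join(seq)


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record wrapped at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    hap_a = apply_snps("ACGTACGT", {2: "T", 8: "A"})
    print(to_fasta("sample_hapA", hap_a, width=4), end="")
```

Keeping substitution and formatting separate makes it easy to reconstruct each haplotype from its own phased call set and then write both records to one file.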

3 Validation and results

We validated HaploTypo using simulated phased genome sequences with known variable positions. To explore how divergence between the two phased haplotypes influences the downstream analysis, we simulated diploid reference genomes with haplotypes diverging by 0.5, 1 or 5% at the nucleotide level. These simulated phased reference genomes were derived from C. albicans haplotype A (Muzzey et al.) with fasta2diverged.py [https://github.com/lpryszcz/bin, (Pryszcz et al.)]. The same script was used to simulate diploid strains that differed from the respective reference genome at 1 position per kilobase, referred to as the ‘simple’ dataset. Because most polymorphisms in these simulated strains are of the type 0/1 (where 0 is the reference allele and 1 is the alternative allele), we also simulated divergent strains in which the polymorphisms between the two haplotypes and the reference could be 0/1 (60% of the total variation), 1/1 (38% of the total variation) or 1/2 (2% of the total variation, where 2 is an alternative allele different from 1), referred to as the ‘complex’ dataset. The relative proportions of the different variant types were based on real data from sequenced C. albicans strains (Ropars et al.). We next simulated sequencing reads using wgsim v0.3.1-r13 (https://github.com/lh3/wgsim). Using these simulated references and reads, we compared the performance of the HaploTypo pipeline with that of: (i) mapping libraries to only one of the haplotypes (standard procedure) and (ii) mapping the libraries to the phased genome reference, which combines the two haplotypes (alternative approach). In all cases, we assessed performance with the three variant callers mentioned above. As expected, when using a haploid reference, sensitivity varied between 96.37 and 99.38%, depending on the variant caller and the divergence between haplotypes (Supplementary Table S1).
However, as discussed above, this approach serves to assess heterozygosity levels and the location of heterozygous SNPs, but the resulting information is unphased. When using a diploid reference, the divergence between the two haplotypes strongly influenced the outcome, with better results at higher nucleotide divergence: sensitivity ranged from 6.5 to 19.3% for 0.5% divergence, from 36.7 to 66.4% for 1% divergence and from 98.3 to 98.8% for 5% divergence (Supplementary Table S1). GATK had the poorest results, especially at low divergence levels (Supplementary Table S1). It is worth noting that accuracy and specificity remained high and stable (>99% and 100%, respectively) regardless of the level of divergence and the variant caller used. HaploTypo maps reads independently to the two haploid references of a phased genome (the approach with the best results, as shown above) and outputs the two haplotypes, correctly phasing >99% of the positions regardless of the variant caller used, with few exceptions (Supplementary Table S2). The unphased cases always represented ambiguous situations that cannot be resolved with this type of data (see manual for details), and the user can decide whether to include them in the VCF. In addition, HaploTypo reports positions that have incoherent results in the two haplotypes and are therefore likely to be mapping or variant calling errors (unsolved positions, see Table 1 of the manual for details). HaploTypo benchmarking was performed on a workstation with an Intel(R) Xeon(R) CPU E5-1650 v3 and 64 GB of RAM, using the default number of threads. The total running time ranged from 1 to 15 h, depending on the level of heterozygosity of the dataset and the variant caller (Supplementary Table S3). Hence, HaploTypo is a user-friendly tool that eases variant analyses and allows haplotype-specific information to be incorporated when a phased reference genome is available.
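The text does not spell out how sensitivity, specificity and accuracy were computed. Assuming the standard confusion-matrix definitions, the minimal Python sketch below shows why accuracy and specificity can stay above 99% even when sensitivity is low: with roughly one variant per kilobase, true negatives dominate the counts. The function name and the counts are hypothetical and are not taken from the supplementary tables.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard definitions (an assumption; the paper may define them differently).

    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    accuracy    = (TP + TN) / (TP + FP + TN + FN)
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true variant and one invariant site")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 1,000,000 sites, 1,000 true variants (1 per kilobase),
# of which only 100 are recovered and none is called falsely.
example = confusion_metrics(tp=100, fp=0, tn=999_000, fn=900)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.2%}")
```

In this made-up case only a tenth of the variants are found, yet accuracy barely moves because almost every site is an invariant position that is correctly left uncalled.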

References

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-09-08 Impact factor: 6.937

3. IDP-ASE: haplotyping and quantifying allele-specific expression at the gene and gene isoform level by hybrid sequencing.

Authors: Benjamin Deonovic; Yunhao Wang; Jason Weirather; Xiu-Jie Wang; Kin Fai Au
Journal: Nucleic Acids Res Date: 2017-03-17 Impact factor: 16.971

4. Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure.

Authors: Dale Muzzey; Katja Schwartz; Jonathan S Weissman; Gavin Sherlock
Journal: Genome Biol Date: 2013 Impact factor: 13.583

5. The Genomic Aftermath of Hybridization in the Opportunistic Pathogen Candida metapsilosis.

Authors: Leszek P Pryszcz; Tibor Németh; Ester Saus; Ewa Ksiezopolska; Eva Hegedűsová; Jozef Nosek; Kenneth H Wolfe; Attila Gacser; Toni Gabaldón
Journal: PLoS Genet Date: 2015-10-30 Impact factor: 5.917

6. Hybridization and emergence of virulence in opportunistic human yeast pathogens.

Authors: Verónica Mixão; Toni Gabaldón
Journal: Yeast Date: 2017-09-14 Impact factor: 3.239

7. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies.

Authors: Peter Edge; Vineet Bafna; Vikas Bansal
Journal: Genome Res Date: 2016-12-09 Impact factor: 9.043

8. Gene flow contributes to diversification of the major fungal pathogen Candida albicans.

Authors: Jeanne Ropars; Corinne Maufrais; Dorothée Diogo; Marina Marcet-Houben; Aurélie Perin; Natacha Sertour; Kevin Mosca; Emmanuelle Permal; Guillaume Laval; Christiane Bouchier; Laurence Ma; Katja Schwartz; Kerstin Voelz; Robin C May; Julie Poulain; Christophe Battail; Patrick Wincker; Andrew M Borman; Anuradha Chowdhary; Shangrong Fan; Soo Hyun Kim; Patrice Le Pape; Orazio Romeo; Jong Hee Shin; Toni Gabaldon; Gavin Sherlock; Marie-Elisabeth Bougnoux; Christophe d'Enfert
Journal: Nat Commun Date: 2018-06-08 Impact factor: 14.919

9. Diverse Lineages of Candida albicans Live on Old Oaks.

Authors: Douda Bensasson; Jo Dicks; John M Ludwig; Christopher J Bond; Adam Elliston; Ian N Roberts; Stephen A James
Journal: Genetics Date: 2018-11-21 Impact factor: 4.562

10. WinHAP2: an extremely fast haplotype phasing program for long genotype sequences.

Authors: Weihua Pan; Yanan Zhao; Yun Xu; Fengfeng Zhou
Journal: BMC Bioinformatics Date: 2014-05-30 Impact factor: 3.169
