Cinta Pegueroles1, Verónica Mixão1, Laia Carreté1, Manu Molina1, Toni Gabaldón1,2,3. 1. Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona 08003, Spain. 2. Universitat Pompeu Fabra (UPF), Barcelona, Spain. 3. ICREA, Barcelona 08010, Spain.
Abstract
SUMMARY: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome. AVAILABILITY AND IMPLEMENTATION: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Heterozygosity (i.e. the presence of alternative alleles at the same locus) in diploid organisms can complicate genome analyses, particularly when its levels are high. In recent years, several bioinformatics tools have been developed to account for this sequence complexity. These include pipelines and algorithms that assist during the genome assembly process (Pryszcz and Gabaldón, 2016; Safonova ), in the subsequent phasing of assembled genomes (Chin ; Edge ; Pan ) or in allele-specific transcriptomic analysis (Deonovic ; Romanel ). However, to the best of our knowledge, available variant calling tools do not explicitly account for phased genomes. As a result, the user has to choose between using the combined phased haplotypes as reference, thereby losing heterozygosity information, or using only one of the haplotypes as reference and sacrificing haplotype information. An illustrative example of this problem is the study of the heterozygous yeast pathogen Candida albicans. Although the diploid genome of this pathogen was phased in 2013 (Muzzey ), subsequent studies have used only one of the haplotypes (Bensasson ; Ropars ), thereby losing the valuable haplotype information. Given the increasing number of highly heterozygous genomes, including those from hybrids (Mixão and Gabaldón, 2018), and the relevance of phased information for reconstructing their population structures and evolutionary histories, there is an urgent need for solutions that allow phased genomes to be exploited in genomic variation analysis. To fill this gap, we developed HaploTypo, a Python-based pipeline that, given a phased reference genome, provides detailed genome variation resolved at the haplotype level.
HaploTypo is not a de novo genome phasing tool, but a tool for phasing variants in re-sequencing analyses using information from an already phased genome, which results in a fast and accurate assessment of heterozygosity levels and reconstruction of haplotypes.
2 Implementation
HaploTypo requires as input the phased haplotypes of a diploid genome and filtered genomic paired-end sequencing reads or, alternatively, the alignments of those reads to each of the reference haplotypes. The pipeline is divided into four modules, which can be run together or separately (Fig. 1). The first module aligns the genomic paired-end reads independently to each of the phased haplotypes using BWA-MEM (Li, 2013). The second module performs variant calling on the two resulting alignments using GATK (McKenna ), BCFtools (Li, 2011) or FreeBayes (Garrison and Marth, 2012), followed by variant filtration. At this point, variability information is available for each reference haplotype independently. The third module implements a variant phasing algorithm that, by comparing the reference haplotypes with the previously called variants, infers which variants correspond to each haplotype. Phased (and, if required, unphased) genotypes for the two haplotypes are provided as independent VCF files. Additionally, unphased and unsolved positions for each haplotype are reported as BED files. A final module uses the VCF files generated in module 3 to reconstruct the haplotypes and provides them in FASTA format. Detailed information on the implementation of HaploTypo is available in the pipeline’s manual.
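The idea behind the third module can be illustrated with a minimal Python sketch. This is not HaploTypo's actual algorithm, which is documented in the pipeline's manual; the function `phase_site`, the `SiteCall` type and the decision rules below are hypothetical and assume a single biallelic or triallelic SNP position that is already aligned between the two reference haplotypes. The sketch only shows how comparing the two reference alleles with the genotypes called against each haplotype can yield a phased, an unphased (ambiguous) or an unsolved (incoherent) position.

```python
from typing import NamedTuple, Optional, Tuple


class SiteCall(NamedTuple):
    status: str           # "phased", "unphased" or "unsolved"
    hap_a: Optional[str]  # allele assigned to haplotype A, if any
    hap_b: Optional[str]  # allele assigned to haplotype B, if any


def phase_site(ref_a: str, ref_b: str,
               gt_vs_a: Tuple[str, str],
               gt_vs_b: Tuple[str, str]) -> SiteCall:
    """Assign the two sample alleles at one position to haplotypes A and B.

    ref_a / ref_b: reference base of each phased haplotype at this position.
    gt_vs_a / gt_vs_b: the sample's two alleles (as bases) as called from the
    alignment against haplotype A and against haplotype B, respectively.
    All rules here are illustrative assumptions, not the published method.
    """
    alleles = tuple(sorted(gt_vs_a))
    # Both call sets should describe the same pair of alleles; otherwise the
    # position is incoherent (a likely mapping or calling error).
    if alleles != tuple(sorted(gt_vs_b)):
        return SiteCall("unsolved", None, None)

    first, second = alleles
    if first == second:
        # Homozygous sample: both haplotypes carry the same allele.
        return SiteCall("phased", first, first)

    if ref_a != ref_b:
        # The references differ, so a sample allele matching one reference
        # is assigned to that haplotype (parsimony assumption).
        if ref_a in alleles and ref_b in alleles:
            return SiteCall("phased", ref_a, ref_b)
        if ref_a in alleles:
            other = second if first == ref_a else first
            return SiteCall("phased", ref_a, other)
        if ref_b in alleles:
            other = second if first == ref_b else first
            return SiteCall("phased", other, ref_b)

    # Heterozygous sample with nothing to tie an allele to a haplotype.
    return SiteCall("unphased", None, None)
```

For example, `phase_site("A", "G", ("A", "G"), ("G", "A"))` assigns A to haplotype A and G to haplotype B, whereas a heterozygous call at a position where both references carry the same base is left unphased under these rules.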
Fig. 1.
Schematic representation of the four modules of the HaploTypo pipeline. The steps are described in the main text and in the pipeline’s manual.
3 Validation and results
We validated HaploTypo using simulated phased genome sequences with known variable positions. To explore the influence of divergence between the two phased haplotypes on the downstream analysis, we simulated diploid reference genomes with haplotypes diverging by 0.5, 1 or 5% at the nucleotide level. These simulated phased reference genomes were derived with fasta2diverged.py [https://github.com/lpryszcz/bin, (Pryszcz )] from C. albicans haplotype A (Muzzey ). The same script was used to simulate diploid strains that differed from the respective reference genome at 1 position per kilobase, referred to as the ‘simple’ dataset. Given that, for these simulated strains, most of the polymorphisms are of the type 0/1 (where 0 is the reference allele and 1 is the alternative allele), we also simulated divergent strains in which the polymorphisms between the two haplotypes and the reference could be 0/1 (60% of the total variation), 1/1 (38% of the total variation) or 1/2 (2% of the total variation, where 2 is an alternative allele different from 1), referred to as the ‘complex’ dataset. The relative proportions of the different variant types were based on real data from sequenced C. albicans strains (Ropars ). We next simulated sequencing reads using wgsim v0.3.1-r13 (https://github.com/lh3/wgsim). Using these simulated references and reads, we compared the performance of the HaploTypo pipeline to: (i) mapping libraries to only one of the haplotypes (standard procedure) and (ii) mapping the libraries to the phased genome reference, which combines the two haplotypes (alternative approach). In all cases, we assessed the performance with the three mentioned variant callers.
As expected, when using a haploid reference, sensitivity varied between 96.37 and 99.38%, depending on the variant caller and the divergence between haplotypes (Supplementary Table S1).
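The simulation design described at the start of this section can be illustrated with a small Python sketch. It does not reproduce fasta2diverged.py or wgsim; the helper names `diverge` and `draw_genotype_classes`, the toy sequence length and the fixed random seed are all assumptions made for illustration. It shows how a second haplotype at a chosen nucleotide divergence can be derived from a first one, and how genotype classes can be drawn with the 60/38/2 proportions of the ‘complex’ dataset.

```python
import random

BASES = "ACGT"


def diverge(seq: str, rate: float, rng: random.Random) -> str:
    """Return a copy of seq with round(len(seq) * rate) substituted positions."""
    n_sites = round(len(seq) * rate)
    positions = rng.sample(range(len(seq)), n_sites)
    out = list(seq)
    for pos in positions:
        # Always substitute with a base different from the original one,
        # so every chosen position is a real difference.
        out[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(out)


def draw_genotype_classes(n_variants: int, rng: random.Random) -> list:
    """Draw genotype classes with the proportions of the 'complex' dataset."""
    return rng.choices(["0/1", "1/1", "1/2"], weights=[60, 38, 2], k=n_variants)


rng = random.Random(42)  # fixed seed so the toy example is repeatable
hap_a = "".join(rng.choice(BASES) for _ in range(10_000))
hap_b = diverge(hap_a, 0.01, rng)  # 1% divergence between haplotypes

n_diff = sum(a != b for a, b in zip(hap_a, hap_b))
classes = draw_genotype_classes(1_000, rng)
print(n_diff, {c: classes.count(c) for c in ("0/1", "1/1", "1/2")})
```

Sampling distinct positions, rather than mutating each base with a fixed probability, keeps the realised divergence exactly at the requested level, which is convenient when comparing results across divergence settings.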
However, as discussed above, mapping to a single haplotype serves to assess heterozygosity levels and the location of heterozygous SNPs, but this information is unphased. When using a diploid reference, the divergence between the two haplotypes strongly influenced the outcome, with better results at higher nucleotide divergence: sensitivity ranged from 6.5 to 19.3% at 0.5% divergence, from 36.7 to 66.4% at 1% divergence and from 98.3 to 98.8% at 5% divergence (Supplementary Table S1). GATK gave the poorest results, especially at low divergence levels (Supplementary Table S1). It is worth noting that accuracy and specificity remained high and stable (>99% and 100%, respectively) regardless of the level of divergence and the variant caller used.
HaploTypo maps reads independently to the two haploid references of a phased genome (the approach with the best results, as shown above) and outputs the two haplotypes, correctly phasing >99% of the positions regardless of the variant caller used, with few exceptions (Supplementary Table S2). The unphased cases always represented ambiguous situations that cannot be resolved with this type of data (see manual for details), and the user can decide whether or not to include them in the VCF. In addition, HaploTypo reports positions that have incoherent results in the two haplotypes and are therefore likely to be mapping or variant calling errors (unsolved positions; see Table 1 in the manual for details).
HaploTypo benchmarking was performed on a workstation [Intel(R) Xeon(R) CPU E5-1650 v3] with 64 GB of RAM and the default number of threads. The total running time ranged from 1 to 15 h, depending on the level of heterozygosity of the dataset and the variant caller (Supplementary Table S3). Hence, HaploTypo is a user-friendly tool that eases variant analyses and allows haplotype-specific information to be incorporated when a phased reference genome is available.
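The sensitivity, specificity and accuracy values reported in this section can be related to one another with a short Python sketch. The text does not state exactly how these metrics were computed, so the per-position definitions below, the function name `site_metrics` and the toy numbers are assumptions. The example shows why specificity and accuracy can stay close to 100% even when sensitivity is low: non-variant positions vastly outnumber variant ones.

```python
def site_metrics(true_sites, called_sites, genome_length):
    """Per-position sensitivity, specificity and accuracy (assumed definitions).

    true_sites / called_sites: collections of variant positions.
    genome_length: total number of positions evaluated.
    Assumes at least one true variant and at least one non-variant position.
    """
    true_sites, called_sites = set(true_sites), set(called_sites)
    tp = len(true_sites & called_sites)  # variants correctly called
    fp = len(called_sites - true_sites)  # calls at non-variant positions
    fn = len(true_sites - called_sites)  # variants that were missed
    tn = genome_length - tp - fp - fn    # non-variant positions left uncalled
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / genome_length,
    }


# Toy data: 1 variant per kilobase on a 1 Mb sequence, half of them recovered,
# plus a single false-positive call at position 7.
truth = set(range(0, 1_000_000, 1000))
calls = set(range(0, 500_000, 1000)) | {7}
metrics = site_metrics(truth, calls, 1_000_000)
print(metrics)
```

In this toy case sensitivity is 0.5, while specificity and accuracy both remain above 0.999, the same qualitative pattern as the diploid-reference results at low divergence.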