| Literature DB >> 34809571 |
Abstract
BACKGROUND: Identifying haplotypes is central to sequence analysis in diploid or polyploid genomes. Despite this, there remains a lack of research and tools designed for physical phasing and its downstream analysis.Entities:
Keywords: Batrachochytrium dendrobatidis; Haplotype; Hybridization; Phasing; Recombination; Software
Mesh:
Year: 2021 PMID: 34809571 PMCID: PMC8607637 DOI: 10.1186/s12859-021-04473-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Accuracy of heterozygous variant calling by Pilon was assessed
| Test | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Introduced HET | 23,13,700 | 23,13,700 | 2,31,370 | 2,31,370 | 23,137 | 23,137 |
| Introduced HET (/kb) | 100 | 100 | 10 | 10 | 1 | 1 |
| Read Length (nt, paired) | 100 | 10,000 | 100 | 10,000 | 100 | 10,000 |
| SNP | 86,808 | 1,89,498 | 7,752 | 16,583 | 1,075 | 1,633 |
| HET | 20,90,044 | 17,81,654 | 2,30,380 | 1,91,558 | 37,153 | 22,365 |
| INS | 43 | 66 | 0 | 0 | 0 | 0 |
| DEL | 48 | 69 | 2 | 1 | 2 | 1 |
| AMB | 2 | 92 | 2 | 59 | 2 | 67 |
| TP | 20,27,722 | 17,45,567 | 2,13,362 | 1,87,637 | 21,429 | 18,807 |
| TN | 2,08,11,068 | 1,97,41,846 | 2,28,43,007 | 2,19,30,208 | 2,30,49,406 | 2,20,49,041 |
| FP | 62,322 | 36,087 | 17,018 | 3,921 | 15,724 | 3,558 |
| FN | 98,679 | 1,65,971 | 9,073 | 16,227 | 904 | 1,572 |
| FP other | 86,901 | 1,89,725 | 7,756 | 16,643 | 1,079 | 1,701 |
| TP (%) | 87.64 | 75.44 | 92.22 | 81.10 | 92.62 | 81.29 |
| TN (%) | 89.94 | 85.32 | 98.73 | 94.78 | 99.62 | 95.29 |
| FP (%) | 2.98 | 2.03 | 7.39 | 2.05 | 42.32 | 15.91 |
| FN (%) | 4.26 | 7.17 | 3.92 | 7.01 | 3.91 | 6.79 |
| Sensitivity | 0.95 | 0.91 | 0.96 | 0.92 | 0.96 | 0.92 |
| Specificity | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| Accuracy | 0.99 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 |
Paired reads (100nt or 10 kb) were simulated at 20X depth from reference Bd JEL423 genome that was duplicated to create an in silico diploid. In silico mutations were then randomly introduced throughout (1/kb, 10/kb or 100/kb). Reads were aligned to the original reference sequence (non-duplicated, non-mutated version), and diploid variants called by Pilon. Counts of variants are shown including single nucleotide polymorphisms (SNP), heterozygous positions (HET), insertions (INS), deletions (DEL) and ambiguous (AMB). Accuracy was assessed according to Comparison of FDR tool [28], that calculated TN = true negatives (correct reference bases), TP = true positives (correct HET), FN = false negatives (incorrect reference bases) and FP = false positives (incorrect HET). FP (other) is a count of all additional (non-heterozygous) incorrect bases including SNPs, INS, DEL and AMB. > 99% of FP (other) were SNPs. TP (%) and FN (%) are precents of Introduced HET, FN (%) is a percent of assembly length, and FP (%) is a percent of HETs called. Sensitivity = TP/(TP + FN), Specificity = TN/(TN + FP + FP (other)), Accuracy = (TN + TP)/(TN + TP + FN + FP + FP (other))
Assessment of HaplotypeTools and WhatsHap on simulated paired reads (100nt or 10 kb) from 100/kb, 10/kb and 1/kb heterozygosity levels
| Tool | Param | Het. (/Kb) | Read Length (nt, paired) | Hap n | Genome coverage (%) | Hap Nmax | Hap N50 | SE | SER | QAN50 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HT | default | 100 | 100 | 95,532 | 83.2 | 2867 | 342 | 6248 | 0.0031 | 248 | 27 m41 s |
| HT | default | 10 | 100 | 33,962 | 12.0 | 620 | 109 | 25 | 0.0002 | N.D | 32 m33 s |
| HT | -m 2 | 10 | 100 | 34,423 | 12.1 | 620 | 109 | 32 | 0.0003 | N.D | 30 m14 s |
| HT | default | 1 | 100 | 728 | 0.2 | 211 | 66 | 0 | 0.0000 | N.D | 32 m59 s |
| HT | -m 2 | 1 | 100 | 875 | 0.2 | 211 | 64 | 1 | 0.0005 | N.D | 29 m42 s |
| HT | default | 100 | 10,000 | 61,536 | 73.3 | 6455 | 528 | 4242 | 0.0024 | 267 | 31 m13 s |
| HT | -m 2 | 100 | 10,000 | 56,825 | 80.5 | 6455 | 605 | 4185 | 0.0022 | 353 | 27 m53 s |
| HT | -m 2 -r 100 kb | 100 | 10,000 | 56,903 | 80.3 | 11,694 | 612 | 4185 | 0.0022 | 350 | 23 m13 s |
| HT | default | 10 | 10,000 | 5,965 | 78.6 | 9995 | 6480 | 35 | 0.0002 | 4254 | 29 m14 s |
| HT | default | 1 | 10,000 | 2,492 | 57.7 | 9946 | 7774 | 0 | 0.0000 | 5888 | 27 m |
| WH | default | 100 | 100 | 366 | 95.1 | 8,43,884 | 3,08,400 | 35,461 | 0.0163 | 9,91,296 | 95 m56 s |
| WH | default | 10 | 100 | 2,496 | 92.7 | 74,839 | 17,864 | 1570 | 0.0070 | 19,555 | 14 m29 s |
| WH | default | 1 | 100 | 6,540 | 12.1 | 3932 | 578 | 235 | 0.0192 | N.D | 43 s |
| WH | default | 100 | 10,000 | 128 | 89.1 | 8,43,900 | 3,60,273 | 29,373 | 0.0144 | 9,36,966 | 23 m27 s |
| WH | default | 10 | 10,000 | 130 | 90.0 | 8,74,030 | 4,00,757 | 347 | 0.0016 | 4,64,693 | 3 m11 s |
| WH | default | 1 | 10,000 | 196 | 88.0 | 7,52,451 | 2,03,122 | 18 | 0.0009 | 97,221 | 24 s |
Haplotype (Hap) count (n) and their coverage across the genome assembly is shown as a percent. Haplotype lengths are described by Hap Nmax (the maximum haplotype length found) and Hap N50 (the shortest haplotype length that includes ≥ 50% of haplotype sequence.). Switch Errors (SE), Switch Error Rate (SER), Quality Adjusted N50 (QAN50) and Computational Time were calculated and compared as described in methods (code to calculate these values are part of HaplotypeTools)
Details of haplotypes from phasing a single-isolate VCF from hybrid Bd isolate SA-EC3 using GATK v4 HaplotypeCaller, HaplotypeTools (default settings), and WhatsHap (default settings)
| HaplotypeCaller | HaplotypeTools | WhatsHap | |
|---|---|---|---|
| Number of haplotypes | 5545 | 7975 | 9982 |
| Total phased nucleotides | 1,35,334 | 6,74,408 | 50,02,469 |
| Haplotype Nmax (nt) | 230 | 1083 | 8485 |
| Haplotype N50 (nt) | 26 | 115 | 1107 |
| Haplotype N90 (nt) | 13 | 44 | 314 |
| Overlap with HT (default) (#haps) | 2027 | N/A | 7948 |
| Overlap with HT (nt) | 44,157 | N/A | 6,58,713 |
| Computational time | N/A | 41m56s | 5m54s |
Rows include the number of haplotypes produced by each tool, the total number of nucleotides included in those haplotypes, the maximum haplotype length found (Haplotype Nmax), the Hap N50 and N90 (the shortest haplotype length that includes ≥ 50% and ≥ 90% of haplotype sequence, respectively.). The number of haplotypes (#haps) that overlap on the genome assembly with haplotypes produced by HaplotypeTools (HT), and the number of nucleotides (nt) in those haplotpes. Computational time is also given for HaplotypeTools and Whatshap, but omitted for GATK given its primary role was variant calling, which both HaplotypeTools and WhatsHap were also based on, and the time taken for phasing alone can neither be determined or distinguished from that process
Fig. 1The five isolates representing each of the lineages were assessed for ploidy and aneuploidy (largest five supercontigs presented). A Non-overlapping 10 kb windows of the normalized depth of coverage (normalized by total sequencing depth across percentiles of GC content, and excluding ambiguous sites) shows evidence for aneuploidies (supercontig 2, 3 and 5) among each of the Bd isolates apart from BdCH ACON. B Allele frequencies (percent of reads agreeing with the reference base) are shown from 25% agree to 75% disagree, with red-dotted lines indicating greatest support for bi-allelic/diploidy between 47 and 53% and greatest support for tri-allelic/triploidy between 30–36% and 63–69%
Fig. 2Comparisons of haplotypes for the hybrid Bd isolate SA-EC3, generated by GATK v4 HaplotypeCaller physical phasing, HaplotypeTools and WhatsHap. A The length in nucleotides of haplotype pairs vs the difference between those haplotypes pairs (%). The red line indicates the minimum haplotype length used for all analysis, and the red arrow indicates the haplotype pairs illustrated in part B of this figure. B HaplotypeTools’ utility script HaplotypePlacer constructs haplotype trees with FastTree. Among the longest representative haplotype pairs (shown by the red arrow) are shown including supercontig 1.7 positions 1,335,923–1,336,143 for HaplotypeCaller, supercontig 1.17 positions 206,141–207,223 for HaplotypeTools and supercontig 1.4 positions 319,655–328,139 for WhatsHap. In each of these examples, one haplotype is closest to BdCAPE. For HaplotypeTools and WhatsHap that have several advantages over HaplotypeCaller’s physical phasing presented, the second haplotype is closest to BdGPL
HaplotypeTools’ utility script HaplotypePlacer constructs haplotype trees with FastTree and identifies the closest relative to each
| Lineage | |||||
|---|---|---|---|---|---|
| HaplotypeCaller (nt) | 2556 | 1637 | 5180 | 2418 | 7089 |
| HaplotypeCaller (%) | 14 | 9 | 27 | 13 | 38 |
| HaplotypeTools (nt) | 52,806 | 55,341 | 3,22,642 | 51,476 | 3,14,161 |
| HaplotypeTools (%) | 7 | 7 | 41 | 6 | 39 |
| Whatshap (nt) | 3,86,839 | 2,58,519 | 43,96,845 | 2,61,089 | 44,56,306 |
| Whatshap (%) | 4 | 3 | 45 | 3 | 46 |
| Overlapping phase groups | 889 | 940 | 1344 | 758 | 1018 |
| Overlapping phased positions (OPP) | 1941 | 2131 | 3457 | 1661 | 2487 |
| OPP Same phase (nt) | 1758 | 1922 | 3421 | 1486 | 2467 |
| OPP Same phase (%) | 91 | 90 | 99 | 89 | 99 |
| OPP Cross-over (nt) | 183 | 209 | 36 | 175 | 20 |
| OPP Cross-over (%) | 9 | 10 | 1 | 11 | 1 |
Hybrid Bd isolate SA-EC3 haplotypes from GATK v4 HaplotypeCaller physical phasing, HaplotypeTools and WhatsHap were analysed using HaplotypePlacer, finding that the majority of haplotypes from each of the three tools are closest in those trees to BdGPL (38–46%) and BdCAPE (27–45%). A HaplotypeTools utility script was used to compare phasing between SA-EC3 and each of the lineages. For each comparison, the script identified overlapping phase groups, comprising overlapping phased positions (OPP), which were either in the same phase, or showed evidence of crossovers
Fig. 3HaplotypeTools’ utility script HaplotypePlacer was used to produce 10 kb windows showing the genomic region for each haplotype pairs and colored according to their closest relative (light blue = BdAsia1, blue = BdAsia2, light green = BdCAPE, green = BdCH, red = BdGPL). Windows are for A GATK v4 HaplotypeCaller, B HaplotypeTools and C WhatsHap