David Twesigomwe, Galen E B Wright, Britt I Drögemöller, Jorge da Rocha, Zané Lombard, Scott Hazelhurst.
Abstract
Genetic variation in genes encoding cytochrome P450 enzymes has important clinical implications for drug metabolism. Bioinformatics algorithms for genotyping these highly polymorphic genes using high-throughput sequence data and automating phenotype prediction have recently been developed. The CYP2D6 gene is often used as a model during the validation of these algorithms due to its clinical importance, high polymorphism, and structural variations. However, the validation process is often limited to common star alleles due to scarcity of reference datasets. In addition, there has been no comprehensive benchmark of these algorithms to date. We performed a systematic comparison of three star allele calling algorithms using 4618 simulations as well as 75 whole-genome sequence samples from the GeT-RM project. Overall, we found that Aldy and Astrolabe are better suited to call both common and rare diplotypes compared to Stargazer, which is affected by population structure. Aldy was the best performing algorithm in calling CYP2D6 structural variants followed by Stargazer, whereas Astrolabe had limitations especially in calling hybrid rearrangements. We found that ensemble genotyping, characterised by taking a consensus of genotypes called by all three algorithms, has higher haplotype concordance, but it is prone to ambiguities whenever complete discrepancies between the tools arise. Further, we evaluated the effects of sequencing coverage and indel misalignment on genotyping accuracy. Our account of the strengths and limitations of these algorithms is extremely important to clinicians and researchers in the pharmacogenomics and precision medicine communities looking to haplotype CYP2D6 and other pharmacogenes using high-throughput sequencing data.
Keywords: Genome informatics; Haplotypes; Pharmacogenomics
Year: 2020 PMID: 32789024 PMCID: PMC7398905 DOI: 10.1038/s41525-020-0135-2
Source DB: PubMed Journal: NPJ Genom Med ISSN: 2056-7944 Impact factor: 8.617
Fig. 1 Graphical overview of the highly polymorphic CYP2D6/2D7/2D8 locus, by Twist et al.[15], licensed under Creative Commons CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). No changes have been made to the figure content.
a The relative position of the reference CYP2D6*1 haplotype (white) to two non-functional paralogs, CYP2D7 (red) and CYP2D8 (grey), on the minus strand of Chromosome 22. REP6 and REP7 are paralogous, Alu-containing, 600-bp repetitive sequences found downstream of CYP2D6 and CYP2D7, respectively. The blue boxes indicate identical unique sequences downstream of CYP2D6 and CYP2D7. Notice the “spacer” (1.6-kb) separating REP7 from CYP2D7 but none between CYP2D6 and REP6. b Common CYP2D6 star alleles defined by core single nucleotide variants (SNVs). c Examples of CYP2D6 copy number variations and their functional annotation. d Examples of CYP2D7/2D6 hybrid genes. e Common tandem rearrangements in the CYP2D gene locus. The activity level boxes on the right are colour-coded: red for a non-functional haplotype, orange for decreased activity, green for fully functional reference activity, and blue for increased activity.
Properties of Astrolabe, Aldy, and Stargazer.
| Tool | OS | Language | Central feature | NGS data | Input/ref | Output |
|---|---|---|---|---|---|---|
| Astrolabe | Linux, Mac | Java | Probabilistic scoring system | WGS, PGRNseq^a | VCF, BAM; b37, b38 | Diplotypes, suballeles, phenotype, all novel SNVs |
| Aldy | Linux, Mac | Python | Combinatorial framework | WGS, PGRNseq | BAM; b37 | Diplotypes, suballeles, putative novel core variants |
| Stargazer | Linux, Mac | Python | Statistical phasing | WGS, PGRNseq | VCF, GDF; b37 | Diplotypes, phenotype, all novel SNVs |
^a Not yet optimised for Astrolabe’s CNV calling.
Concordance of Astrolabe, Aldy, and Stargazer on test cases (N = 154) homozygous for SNV-defined star alleles at 30×.
| | Astrolabe (Run 1) | Astrolabe (Run 2) | Aldy (Run 1) | Aldy (Run 2) | Stargazer |
|---|---|---|---|---|---|
| Match | 142 (92%) | 152 (99%) | 148 (96%) | 149 (97%) | 135 (88%) |
| Mismatch | 9 (6%) | 1 (1%) | 4 (3%) | 3 (2%) | 19 (12%) |
| Ungenotyped | 3 (2%) | 1 (1%) | 2 (1%) | 1 (1%) | 0 |
The complete list of genotypes called per sample is provided in Supplementary Data Set 2.
Run 1 and Run 2 indicate results before and after defining missing suballeles in the allele tables, respectively. There was only one run for Stargazer as it does not typically call suballeles.
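The match, mismatch, and ungenotyped counts in a table like this can be tallied with a few lines of Python. The following is a minimal sketch, not the authors' evaluation code: the function names, the order-insensitive comparison of diplotype strings, and the toy sample data are all assumptions made for illustration.

```python
from collections import Counter


def normalise(diplotype):
    """Return an order-insensitive form of a diplotype such as '*2/*1'."""
    return tuple(sorted(diplotype.split("/")))


def tally(truth, calls):
    """Count matches, mismatches, and ungenotyped samples.

    `truth` and `calls` map sample IDs to diplotype strings. A missing or
    None entry in `calls` is treated as "ungenotyped" (an assumption).
    """
    counts = Counter({"Match": 0, "Mismatch": 0, "Ungenotyped": 0})
    for sample, expected in truth.items():
        called = calls.get(sample)
        if called is None:
            counts["Ungenotyped"] += 1
        elif normalise(called) == normalise(expected):
            counts["Match"] += 1
        else:
            counts["Mismatch"] += 1
    return counts


def as_percent(count, total):
    """Whole-number percentage, as reported in the table."""
    return round(100 * count / total)


# Hypothetical toy data, not taken from the study.
truth = {"s1": "*1/*1", "s2": "*4/*4", "s3": "*2/*10", "s4": "*41/*41"}
calls = {"s1": "*1/*1", "s2": "*4/*10", "s3": "*10/*2", "s4": None}
counts = tally(truth, calls)
```

With N = 154, `as_percent(142, 154)` reproduces the 92% reported for Astrolabe Run 1.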
Fig. 2 Concordance of Astrolabe, Aldy, and Stargazer for all theoretically possible CYP2D6 diplotype combinations (homozygous and heterozygous) comprising SNV-defined haplotypes with PharmVar definitive or moderate level of evidence from our starting set.
We left the allele databases of the three tools as is in order to examine the effect of the undefined suballeles on diplotype calling. The sequencing coverage for each test case was 30× and default parameters were used for each tool.
Summary of correctly called structural variations by Astrolabe, Aldy, and Stargazer in set 2.
| Set 2 | Truth | Astrolabe 30× | Astrolabe 60× | Astrolabe 100× | Aldy 30× | Aldy 60× | Aldy 100× | Stargazer 30× | Stargazer 60× | Stargazer 100× |
|---|---|---|---|---|---|---|---|---|---|---|
| Full gene deletion (*5) | 20 | 10 | 5 | 10 | 15 | 14 | 15 | 15 | 15 | 15 |
| Copy number gain^a | 52 | 29 | 25 | 25 | 52 | 50 | 52 | 47 | 45 | 45 |
| Resolved duplicated/multiplicated alleles | 61 | 4^c | 4^c | 4^c | 53 | 49 | 51 | 54 | 51 | 48 |
| Hybrids^b | 107 | 3 | 8 | 8 | 72 | 66 | 67 | 31 | 30 | 28 |
| Non-hybrid tandem | 4 | 0 | 0 | 0 | 4 | 4 | 4 | 0 | 0 | 0 |
| Ungenotyped (defaults) | | 0 | 0 | 0 | 1 | 3 | 2 | 18 | 16 | 19 |
The complete list of genotypes called per sample is provided in Supplementary Data Set 4.
^a Due to gene duplications/multiplications.
^b Collectively representing exon conversions and gene hybrids involving CYP2D6 and CYP2D7.
^c Determined only for homozygous allele cases.
Fig. 3 Performance of Astrolabe, Aldy, and Stargazer on 75 WGS GeT-RM samples.
a Concordance of each algorithm to the GeT-RM consensus calls. b Cases with discordant genotypes. The green colour represents samples “without” CYP2D6 CNVs (most of the samples may have the intron 1 conversion), whereas the blue colour represents samples with allele-defining CYP2D6 CNVs. c Overlap of haplotypes called by each algorithm. As shown, ensemble call sets have high concordance. Haplotypes called by one tool but not confirmed by any of the others have a low true positive rate.
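The idea of an ensemble call can be sketched as a simple vote across the three tools. The snippet below is an illustration under stated assumptions rather than the procedure used in the study: the function name `consensus_diplotype`, the `min_support` parameter, and the rule of accepting a diplotype reported by at least two tools are all hypothetical choices.

```python
from collections import Counter


def consensus_diplotype(calls, min_support=2):
    """Return the diplotype reported by at least `min_support` tools.

    `calls` maps a tool name to its diplotype string, or None for a no-call.
    Returns None when no diplotype reaches the required support, which
    corresponds to a complete discrepancy between the tools.
    """
    # Sort the two haplotypes so that '*4/*1' and '*1/*4' count as the same call.
    normalised = [tuple(sorted(c.split("/"))) for c in calls.values() if c]
    if not normalised:
        return None
    best, support = Counter(normalised).most_common(1)[0]
    return "/".join(best) if support >= min_support else None


# Hypothetical calls for two samples (assumed values for illustration).
agree = {"Astrolabe": "*1/*4", "Aldy": "*4/*1", "Stargazer": "*1/*10"}
disagree = {"Astrolabe": "*1/*4", "Aldy": "*2/*4", "Stargazer": None}

print(consensus_diplotype(agree))     # *1/*4
print(consensus_diplotype(disagree))  # None
```

With three tools and `min_support=2`, at most one diplotype can reach the threshold, so the vote never ties; a `None` result flags the sample for manual review.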
Fig. 4 Comparison between the CYP2D6 diplotype concordance and the phenotype prediction concordance for each algorithm based on 75 GeT-RM samples and the Activity Score system.
All three algorithms and the ensemble approach have relatively high phenotype concordance, underscoring their clinical utility even for some cases with inconsistent diplotype calls.
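Phenotype concordance can exceed diplotype concordance because different diplotypes may sum to the same activity score. The sketch below illustrates this with a toy activity table. The allele values cover only a few well-known normal-function and no-function alleles, and the phenotype cut-offs are assumptions for illustration that should be checked against the current CPIC/PharmVar tables before any real use; gene duplications are not handled.

```python
# Illustrative activity values (assumed): normal function = 1, no function = 0.
ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*5": 0.0}


def activity_score(diplotype, activity=ACTIVITY):
    """Sum the activity values of the two haplotypes in a diplotype."""
    return sum(activity[allele] for allele in diplotype.split("/"))


def phenotype(score):
    """Map an activity score to a metaboliser class (assumed cut-offs)."""
    if score == 0:
        return "poor metaboliser"
    if score < 1.25:
        return "intermediate metaboliser"
    if score <= 2.25:
        return "normal metaboliser"
    return "ultrarapid metaboliser"


# A discordant diplotype call can still give a concordant phenotype.
truth, called = "*1/*2", "*1/*1"
same_diplotype = sorted(truth.split("/")) == sorted(called.split("/"))
same_phenotype = phenotype(activity_score(truth)) == phenotype(activity_score(called))
print(same_diplotype, same_phenotype)  # False True
```

Here the two diplotypes differ, yet both carry two normal-function alleles and therefore map to the same predicted phenotype.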
Pros and cons of current CYP2D6 genotyping algorithms.
| Algorithm | Advantages | Disadvantages |
|---|---|---|
| Astrolabe | High accuracy for calling catalogued SNV-defined alleles. | Lower recall for |
| | Supports hg19 and hg38. | Prone to ambiguous calls and miscalls if alleles/suballeles are not comprehensively defined |
| | Performs variant calling error correction. | Does not discriminate duplicated alleles in a sample with copy number gain (generically reports presence of a duplication) |
| | Easy to run (one-liner command). | Does not distinguish between duplication events and multiplication events |
| | Performs automated phenotype prediction. | |
| Aldy | High accuracy for calling catalogued SNV-defined alleles. | Aldy supports only hg19 as of v2.2.3 |
| | Highest accuracy of the three tools in calling | Prone to ambiguous calls and miscalls if alleles/suballeles are not comprehensively defined |
| | Easiest to run of the three tools as it requires only the BAM file on any system (even a normal laptop). | Prone to some erroneous calls when using default arguments due to sequencing noise |
| Stargazer | Supports various file inputs and imputation. | Low recall for rare alleles especially in heterozygous combinations |
| | Supports user-defined references for phasing and/or phased VCF input. | Affected by linkage disequilibrium as it has to do statistical reference-based phasing |
| | Provides tools for viewing metrics of key variants in a sample. | Prone to “no calls” for complex hybrid arrangements/combinations |
| | Outputs coverage plots for visually examining change-points for samples with CNVs. | Stargazer v1.0.7 supports only hg19 |
| | Performs automated phenotype prediction. | Does not call suballeles as of v1.0.7 |
| | Easy to run (one-liner command). | Has inconsistent reporting of alleles with uncertain function (reports background functionally annotated alleles for these cases as of v1.0.7) |
| Ensemble | Comparably high diplotype/haplotype concordance. | Difficult to resolve complete discrepancies between the genotypes of the three tools |
| | Resolves single tool deficiencies. | Prone to ambiguous calls especially for structural variations |
| | | No automated pipeline as of now as the three algorithms are being updated on a regular basis and the reporting of alleles/suballeles is non-uniform |