Valentina Cipriani, Nikolas Pontikos, Gavin Arno, Panagiotis I Sergouniotis, Eva Lenassi, Penpitcha Thawong, Daniel Danis, Michel Michaelides, Andrew R Webster, Anthony T Moore, Peter N Robinson, Julius O B Jacobsen, Damian Smedley.
Abstract
Next-generation sequencing has revolutionized rare disease diagnostics, but many patients remain without a molecular diagnosis, in part because many candidate variants typically survive even strict filtering. Exomiser was launched in 2014 as a Java tool that performs an integrative analysis of patients' sequencing data and their phenotypes encoded with Human Phenotype Ontology (HPO) terms. It prioritizes variants by leveraging information on variant frequency, predicted pathogenicity, and gene-phenotype associations derived from human diseases, model organisms, and protein-protein interactions. Early published releases of Exomiser were able to prioritize disease-causative variants as top candidates in up to 97% of simulated whole-exomes. The real patient datasets tested and published so far are very limited in size. Here, we present the latest Exomiser version, 12.0.1, with many new features. We assessed its performance using a set of 134 whole-exomes from patients with a range of rare retinal diseases and a known molecular diagnosis. Using default settings, Exomiser ranked the correctly diagnosed variants as the top candidate in 74% of the dataset and within the top 5 in 94%; not using the patients' HPO profiles (i.e., variant-only analysis) decreased the performance to 3% and 27%, respectively. In conclusion, Exomiser is an effective support tool for phenotype-driven variant prioritization in rare Mendelian disease.
Keywords: bioinformatics; human phenotype ontology; inherited retinal disease; phenotypic similarity; rare disease; variant prioritization; whole-exome sequencing; whole-genome sequencing
Year: 2020 PMID: 32340307 PMCID: PMC7230372 DOI: 10.3390/genes11040460
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Overview of the Exomiser analysis workflow: The diagram depicts the two main steps in an Exomiser analysis: (a) variant filtering and (b) variant prioritization. A single-sample variant call format (VCF) file and corresponding list of Human Phenotype Ontology (HPO) terms are mandatory inputs. If a multi-sample VCF file (from a nuclear family) is used in the analysis, the user must provide a corresponding pedigree file. In the filtering step, variants are filtered according to type of variant, allele frequency in selected databases, and mode of inheritance as per user-defined options and values. In the prioritization step, a variant score is calculated based on allele frequency and pathogenicity as predicted by user-defined in silico algorithms, together with a gene-specific phenotype score based on the semantic similarity of the patient’s HPO terms and phenotypic annotation in known human disease gene, mouse, zebrafish, and protein–protein interaction databases. Finally, the Exomiser score is obtained from the variant score and phenotype score within a logistic regression classifier framework and used for variant prioritization. PED: Pedigree, OMIM: Online Mendelian Inheritance in Man, IMPC: International Mouse Phenotyping Consortium, STRING: Search Tool for the Retrieval of Interacting Genes/Proteins.
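The two-step flow in the caption (filter, then rank by a logistic combination of variant and phenotype scores) can be illustrated with a toy sketch. Everything numeric here is an assumption made for illustration: the frequency cutoff `MAX_FREQ`, the coefficients `B0`, `B1`, `B2`, the gene names, and the scores are hypothetical and are not Exomiser's trained values or its API.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    allele_freq: float      # population allele frequency, 0..1
    variant_score: float    # frequency/pathogenicity score, 0..1
    phenotype_score: float  # gene-phenotype similarity score, 0..1


# Hypothetical values chosen only for this illustration.
MAX_FREQ = 0.01
B0, B1, B2 = -6.0, 4.0, 6.0


def combined_score(c: Candidate) -> float:
    """Logistic combination of the variant and phenotype scores."""
    z = B0 + B1 * c.variant_score + B2 * c.phenotype_score
    return 1.0 / (1.0 + math.exp(-z))


def prioritise(candidates: list[Candidate]) -> list[Candidate]:
    """Step (a): drop common variants. Step (b): rank by combined score."""
    kept = [c for c in candidates if c.allele_freq <= MAX_FREQ]
    return sorted(kept, key=combined_score, reverse=True)


candidates = [
    Candidate("GENE_A", 0.0001, 0.95, 0.10),  # damaging, poor phenotype match
    Candidate("GENE_B", 0.0005, 0.80, 0.90),  # damaging, strong phenotype match
    Candidate("GENE_C", 0.2000, 0.99, 0.95),  # too common, removed by the filter
]
ranked = prioritise(candidates)
print([c.gene for c in ranked])
```

With these made-up numbers the phenotype match moves `GENE_B` ahead of `GENE_A` even though its variant score is lower, which is the qualitative effect the phenotype score is meant to have.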
Frequency distribution of the clinical diagnosis in the inherited retinal disease (IRD) patient dataset.
| Clinical Diagnosis a | N | % |
|---|---|---|
| Retinitis pigmentosa (RP) | 36 | 26.9 |
| Leber congenital amaurosis (LCA) | 25 | 18.7 |
| Macular dystrophy (MD) | 16 | 11.9 |
| Cone-rod dystrophy (CRD) | 14 | 10.4 |
| Early onset retinal dystrophy (EORD) | 9 | 6.7 |
| Usher syndrome type II (USH2) | 8 | 6.0 |
| Achromatopsia (ACHM) | 6 | 4.5 |
| Congenital stationary night blindness (CSNB) | 5 | 3.7 |
| Retinal dystrophy (RD) | 3 | 2.2 |
| Usher syndrome type I (USH1) | 2 | 1.5 |
| Stargardt disease (STGD) | 2 | 1.5 |
| Occult macular dystrophy (OCMD) | 1 | 0.7 |
| Benign fleck retina (BFR) | 1 | 0.7 |
| Coloboma (COLOB) | 1 | 0.7 |
| Familial exudative vitreoretinopathy (FEVR) | 1 | 0.7 |
| Foveal hypoplasia (FH) | 1 | 0.7 |
| Myopia and deafness (Stickler syndrome) (STICKL) | 1 | 0.7 |
| Ocular albinism (OALB) | 1 | 0.7 |
| Optic atrophy (OATR) | 1 | 0.7 |
a As assigned by a consultant ophthalmologist before performing whole-exome sequencing.
Frequency distribution of the genotype of the “solved” molecular diagnoses in the IRD patient dataset.
| Genotype | N | % |
|---|---|---|
| Homozygote | 72 | 53.7 |
| Compound heterozygote | 39 | 29.1 |
| Heterozygote | 13 | 9.7 |
| Hemizygote a | 10 | 7.5 |
a Male patients with diagnosed variants on the X chromosome.
Figure 2. HPO graphic visualization for the HPO-encoded clinical diagnosis Leber congenital amaurosis (Retinal dystrophy, HP:0000556; Visual impairment, HP:0007758; Undetectable electroretinogram, HP:0000550; and Nystagmus, HP:0000639).
Exomiser analysis YML files (.yml) using different settings. Bold stands for analysis setting names and italics for functions.
| Analysis YML File | |
|---|---|
LOCAL is University College London exome database (UCLex) [74]. Ensembl transcript annotation was used across all the analysis settings.
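As a rough illustration of how the eight analysis settings differ, the sketch below derives each one from a DEFAULT configuration. The dictionary layout is a simplified stand-in, not the actual schema of an Exomiser analysis file. The keys `pathogenicitySources`, `hiPhivePrioritiser`, and `omimPrioritiser` and the value `CADD` come from the table notes in this article; the source identifiers for the other new algorithms are assumed to match the setting names with hyphens replaced by underscores.

```python
import copy

# Simplified stand-in for the DEFAULT analysis settings (not the real file schema).
DEFAULT = {
    "pathogenicitySources": ["POLYPHEN", "MUTATION_TASTER", "SIFT"],
    "prioritisers": {"hiPhivePrioritiser": {}, "omimPrioritiser": {}},
}

# Setting names as used in the results tables.
SINGLE_SOURCE_SETTINGS = ["CADD", "REVEL", "MPC", "M_CAP", "MVP", "PRIMATE-AI"]


def build_settings() -> dict[str, dict]:
    """Return one configuration per analysis setting, keyed by setting name."""
    settings = {"DEFAULT": copy.deepcopy(DEFAULT)}

    # VAR-ONLY: same variant score, but both phenotype prioritisers disabled.
    var_only = copy.deepcopy(DEFAULT)
    var_only["prioritisers"] = {}
    settings["VAR-ONLY"] = var_only

    # One newly added pathogenicity algorithm at a time, phenotype score kept.
    for name in SINGLE_SOURCE_SETTINGS:
        cfg = copy.deepcopy(DEFAULT)
        # Assumption: identifier is the setting name with '-' replaced by '_'.
        cfg["pathogenicitySources"] = [name.replace("-", "_")]
        settings[name] = cfg
    return settings


settings = build_settings()
for name, cfg in settings.items():
    print(name, cfg["pathogenicitySources"], sorted(cfg["prioritisers"]))
```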
Descriptive statistics of the Exomiser disease-causing variant ranking in the IRD patient dataset using different analysis settings.
| Analysis Setting | Variants Filtered Out, N | Variants Not Prioritized a, N | Mean Rank (SD) | Median Rank | Min Rank | Max Rank | Top Ranked, % |
|---|---|---|---|---|---|---|---|
| 1. DEFAULT | 2 | 1 | 2.1 (5.0) | 1 | 1 | 42 | 73.9 |
| 2. VAR-ONLY | 2 | 1 | 10.8 (9.0) | 9.5 | 1 | 60.5 | 3.0 |
| 3. CADD | 2 | 2 | 2.5 (8.4) | 1 | 1 | 77 | 72.4 |
| 4. REVEL | 2 | 2 | 3.9 (9.2) | 1 | 1 | 78 | 71.6 |
| 5. MPC | 2 | 1 | 10.1 (16.6) | 1 | 1 | 79 | 56.0 |
| 6. M_CAP | 2 | 2 | 6.8 (12.5) | 1 | 1 | 64 | 62.7 |
| 7. MVP | 2 | 2 | 3.1 (10.3) | 1 | 1 | 108 | 76.9 |
| 8. PRIMATE-AI | 2 | 1 | 2.7 (5.0) | 1 | 1 | 33 | 73.1 |
a In the case of patients with the disease-causing variants passing the filtering step but not being prioritized (i.e., not “contributing” to the final Exomiser score), the ranks given to the correct diagnosed gene are indicated in parentheses. The mean/median/min/max ranks in the table refer to the effective N, that is 134 − (# of patients with disease-causing variants filtered out) − (# of patients with disease-causing variants not prioritized).
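The effective N defined in the footnote can be checked with a few lines of arithmetic. The counts below are copied from the table above; the variable names are illustrative only.

```python
TOTAL_PATIENTS = 134

# (patients with variants filtered out, patients with variants not prioritized)
excluded = {
    "DEFAULT": (2, 1),
    "VAR-ONLY": (2, 1),
    "CADD": (2, 2),
    "REVEL": (2, 2),
    "MPC": (2, 1),
    "M_CAP": (2, 2),
    "MVP": (2, 2),
    "PRIMATE-AI": (2, 1),
}

# Effective N = 134 - (# filtered out) - (# not prioritized), per setting.
effective_n = {
    setting: TOTAL_PATIENTS - filtered_out - not_prioritized
    for setting, (filtered_out, not_prioritized) in excluded.items()
}

for setting, n in effective_n.items():
    print(f"{setting}: effective N = {n}")
```

The rank statistics for each setting are therefore computed over 130 or 131 patients rather than the full 134.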
Figure 3. Exomiser performance on the IRD patient dataset using different analysis settings. The left-hand panel shows the categorical percentage distribution of the disease-causing variant ranking according to five mutually exclusive ranking bins (“Top”, “2–5”, “6–10”, “>10”, and “Filtered out/Not prioritized”) for each analysis setting. The right-hand panel shows the corresponding cumulative percentage distributions.
Pairwise agreement between Exomiser disease-causing ranking results in the IRD patient dataset using different analysis settings.
| DEFAULT vs. (panel position) | Agreement, % | Cohen’s kappa | Stuart–Maxwell p-value |
|---|---|---|---|
| Row 1, left | 7.6 | −0.013 (“poor”) | 3.4 × 10⁻²² |
| Row 1, right | 64.6 | 0.27 (“fair”) | 8.5 × 10⁻⁶ |
| Row 2, left | 92.3 | 0.80 (“substantial”) | 0.818 |
| Row 2, right | 84.6 | 0.58 (“moderate”) | 0.040 |
| Row 3, left | 75.4 | 0.40 (“fair”) | 0.011 |
| Row 3, right | 79.2 | 0.48 (“moderate”) | 0.011 |
| Row 4 | 59.2 | 0.24 (“fair”) | 1.0 × 10⁻⁷ |
Contingency table counts represent the number of patients whose correctly diagnosed variants were ranked in the corresponding bins by the two analysis settings. DEFAULT: the Exomiser score was obtained from the variant score based on allele frequency (as defined in the filtering step, Table 3) and the original pathogenicity algorithms PolyPhen-2 [14], MutationTaster [13], and SIFT [75] using pathogenicitySources: [POLYPHEN, MUTATION_TASTER, SIFT], plus the gene-specific phenotype score, using hiPhivePrioritiser: {} and omimPrioritiser: {}; VAR-ONLY: as in the DEFAULT analysis setting but using only the variant score, i.e., hiPhivePrioritiser and omimPrioritiser were disabled; CADD/REVEL/MPC/M_CAP/MVP/PRIMATE-AI: as in the DEFAULT analysis setting, using both the variant score and the phenotype score but making use of only one of the newly added pathogenicity algorithms at a time to calculate the variant score, e.g., pathogenicitySources: [CADD]. Agreement between two analysis settings is summarised by the percentage agreement and Cohen’s kappa and assessed via the Stuart–Maxwell test. Cohen’s kappa values are interpreted according to Landis and Koch’s guidelines (i.e., κ < 0.00 as “poor” agreement, 0.00–0.20 as “slight”, 0.21–0.40 as “fair”, 0.41–0.60 as “moderate”, 0.61–0.80 as “substantial”, and 0.81–1 as “almost perfect” agreement) [76]. Counts in bold (on the diagonals) represent agreement between two analysis settings (both settings assigned the diagnosed variants the same rank); underlined counts represent cases where the row analysis setting performed better than the column analysis setting; and counts in italics represent cases where the row analysis setting performed worse than the column analysis setting. Blank cells represent zero counts.
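The agreement summaries named in this note can be sketched in a few lines. The code below computes percentage agreement and Cohen's kappa from a square contingency table and applies the Landis and Koch bands quoted above. The 2 × 2 table is a made-up example, not data from this study, and the Stuart–Maxwell test is not covered.

```python
def percent_agreement(table: list[list[int]]) -> float:
    """Share of patients on the diagonal (same rank bin in both settings), in %."""
    n = sum(sum(row) for row in table)
    return 100.0 * sum(table[i][i] for i in range(len(table))) / n


def cohens_kappa(table: list[list[int]]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_obs - p_exp) / (1.0 - p_exp)


def landis_koch(kappa: float) -> str:
    """Label a kappa value using the bands quoted in the table note."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical 2 x 2 table (rows: setting A bins, columns: setting B bins).
toy = [[40, 10],
       [5, 45]]
kappa = cohens_kappa(toy)
print(percent_agreement(toy), round(kappa, 2), landis_koch(kappa))
```

Applying `landis_koch` to the kappa values reported in the table reproduces the printed labels, for example −0.013 as “poor” and 0.80 as “substantial”.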
Figure 4. Screenshot of the Exomiser HTML output file (from the DEFAULT analysis) for patient P127, who was clinically diagnosed with Usher syndrome type II and molecularly diagnosed with frameshift elongation c.920_923dup:p.(His308Glnfs*16) and inframe deletion c.3832_3834del:p.(Leu1278del) in USH2A (Table S1).