Damian Smedley, Peter N Robinson.
Abstract
Whole exome sequencing has altered the way in which rare diseases are diagnosed and disease genes identified. Hundreds of novel disease-associated genes have been characterized by whole exome sequencing in the past five years, yet the identification of disease-causing mutations is often challenging because of the large number of rare variants that are being revealed. Gene prioritization aims to rank the most probable candidate genes towards the top of a list of potentially pathogenic variants. A promising new approach involves the computational comparison of the phenotypic abnormalities of the individual being investigated with those previously associated with human diseases or genetically modified model organisms. In this review, we compare and contrast the strengths and weaknesses of current phenotype-driven computational algorithms, including Phevor, Phen-Gen, eXtasy and two algorithms developed by our groups called PhenIX and Exomiser. Computational phenotype analysis can substantially improve the performance of exome analysis pipelines.
Year: 2015 PMID: 26229552 PMCID: PMC4520011 DOI: 10.1186/s13073-015-0199-2
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Comparison of exome analysis tools
| Software | Exome input | Types of variant analyzed | Availability | Software approach |
|---|---|---|---|---|
| VEP | Various including VCF, pileup, HGVS notations | All | Website, command line and REST service | Filtering by allele frequency and deleteriousness scores (SIFT, PolyPhen) |
| ANNOVAR | Various including multi-sample VCF | All | Command line | Filtering by allele frequency, inheritance model and deleteriousness scores (SIFT, PolyPhen, MutationTaster, MutationAssessor, LRT, FATHMM, MetaSVM, MetaLR, GERP++, PhyloP, SiPhy, CADD) |
| eXtasy | Single sample VCF | Non-synonymous | Website and command line | Prioritization based on a Random Forest score from combined deleteriousness scores (CAROL, LRT, MutationTaster, PhastCons, PhyloP, PolyPhen, SIFT), haploinsufficiency, and similarity of the gene to genes annotated with the input Human Phenotype Ontology (HPO) phenotypes as measured by sequence similarity, co-expression, and involvement in the same pathway or protein–protein interactions |
| Phevor | Pre-filtered VAAST or ANNOVAR files or functionally annotated multi-sample VCF | All | Website | Prioritization based on semantic similarity of each candidate gene to genes annotated with the input set of ontology terms taken from HPO, Mammalian Phenotype Ontology (MPO), Disease Ontology (DO), and Gene Ontology (GO) |
| Phen-Gen | Multi-sample family VCF | All | Website and command line | Filtering by inheritance model and stringency or reentrance. Prioritization based on predicted variant impact and semantic phenotypic similarity between HPO input and HPO-annotated diseases associated with each exomic candidate or its neighbors in an interaction network |
| PhenIX | Multi-sample family VCF | All coding | Website and command line | Filtering by allele frequency, variant quality, and inheritance model. Prioritization based on predicted deleteriousness (SIFT, PolyPhen, MutationTaster), allele frequency and semantic phenotypic similarity between HPO input and HPO-annotated diseases associated with each exomic candidate |
| Exomiser | Multi-sample family VCF | All coding | Website and command line | Filtering by allele frequency, variant quality, deleteriousness scores and inheritance model. Prioritization based on predicted deleteriousness (SIFT, PolyPhen, MutationTaster), allele frequency and semantic phenotypic similarity between HPO input and HPO-annotated diseases, MPO-annotated mouse and Zebrafish Phenotype Ontology (ZPO)-annotated fish models associated with each exomic candidate or its neighbors in an interaction network |
Abbreviations: CADD Combined Annotation-Dependent Depletion, GERP Genomic Evolutionary Rate Profiling, HGVS Human Genome Variation Society, HPO Human Phenotype Ontology, LRT likelihood ratio test, PolyPhen Polymorphism Phenotyping, REST Representational State Transfer, SIFT Sorting Intolerant from Tolerant, VAAST Variant Annotation, Analysis, Search Tool, VCF variant call format
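The filter-then-prioritize pattern that several tools in the table share can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Variant` fields, the gene names and term IDs, the Jaccard overlap used as a crude stand-in for the ontology-based semantic similarity the tools actually compute, and the equal weighting of the two scores. It does not reproduce the scoring of any listed tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    allele_freq: float      # population minor allele frequency, 0..1
    deleteriousness: float  # hypothetical pathogenicity score, 0..1


def jaccard(a: set, b: set) -> float:
    """Plain set overlap; a stand-in for ontology-aware semantic similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def prioritize(variants, patient_terms, gene_terms, max_af=0.01):
    """Drop common variants, then rank genes by a combined score."""
    best = {}
    for v in variants:
        if v.allele_freq > max_af:
            continue  # frequency filter removes common variants
        pheno = jaccard(patient_terms, gene_terms.get(v.gene, set()))
        score = (v.deleteriousness + pheno) / 2  # assumed equal weighting
        best[v.gene] = max(best.get(v.gene, 0.0), score)
    # Highest combined score first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Placeholder data: gene names and term IDs are invented for the example.
variants = [
    Variant("GENE_A", 0.0001, 0.90),
    Variant("GENE_B", 0.0005, 0.95),
    Variant("GENE_C", 0.2000, 0.99),  # too common, filtered out
]
patient_terms = {"HP:0000001", "HP:0000002"}
gene_terms = {
    "GENE_A": {"HP:0000001", "HP:0000002"},
    "GENE_B": {"HP:0000003"},
}
ranked = prioritize(variants, patient_terms, gene_terms)
```

In this toy input, `GENE_B` carries the more deleterious variant, but `GENE_A` ranks first because its annotated phenotypes match the patient's terms. This is the effect that phenotype-driven prioritization is meant to capture.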
Fig. 1 Benchmarking of all phenotype-based exome analysis tools on 1000 Genomes Project or in-house exomes. Exomes were generated by randomly inserting known disease variants from the Human Genome Mutation Database (HGMD) into either (a, c, e) 50 unaffected exomes from the 1000 Genomes Project or (b, d, f) 50 in-house generated exomes. These exomes were analyzed using each tool, and the ability of each tool to rank the causative variant as the top hit, in the top 10 or in the top 50 was recorded. Default settings, along with filtering with a minor allele frequency cutoff of 1%, were used for all tools. Analysis was performed using (a, b) all phenotype annotations, (c, d) just three of the terms chosen randomly, or (e, f) two of these three terms made less specific and two random terms from the whole of the Human Phenotype Ontology (HPO) added
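The summary statistic described in the caption (how often the causative variant lands at the top, in the top 10 or in the top 50) can be tallied as in the following sketch. The function name, the use of `None` for a variant that was filtered out entirely, and the example ranks are all assumptions made for illustration; they are not results from the benchmark.

```python
def top_k_rates(ranks, cutoffs=(1, 10, 50)):
    """Fraction of samples whose causative variant ranked within each cutoff.

    `ranks` holds the 1-based rank of the spiked-in variant for each sample,
    or None when the tool removed the variant before ranking.
    """
    total = len(ranks)
    if total == 0:
        raise ValueError("no samples to summarize")
    return {
        k: sum(1 for r in ranks if r is not None and r <= k) / total
        for k in cutoffs
    }


# Made-up ranks for eight simulated exomes (not data from the figure)
example_ranks = [1, 3, 1, 27, None, 64, 9, 1]
rates = top_k_rates(example_ranks)
print(rates)  # {1: 0.375, 10: 0.625, 50: 0.75}
```

Counting a filtered-out variant as a miss at every cutoff keeps the denominator equal to the number of simulated exomes, so tools with aggressive filters are not rewarded for discarding the causative variant.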
Number of genes per benchmarked sample
| | 1000 Genomes Project exomes, AD | 1000 Genomes Project exomes, AR | In-house exomes, AD | In-house exomes, AR |
|---|---|---|---|---|
| Before filtering | 10542 ± 783 | 10631 ± 802 | 19235 ± 916 | 19712 ± 976 |
| Exomiser filtered | 388 ± 110 | 38 ± 11 | 973 ± 104 | 557 ± 74 |
| PhenIX filtered | 388 ± 110 | 38 ± 11 | 973 ± 104 | 557 ± 74 |
| Exomiser filtered for eXtasy analysis | 388 ± 110 | 38 ± 11 | 973 ± 104 | 557 ± 74 |
| Phen-Gen filtered | 100 ± 34 | 5 ± 4 | 665 ± 86 | 331 ± 70 |
| ANNOVAR filtered for Phevor analysis | 88 ± 36 | 2 ± 1 | 372 ± 61 | 52 ± 17 |
Abbreviations: AD autosomal dominant, AR autosomal recessive
Fig. 2 Benchmarking of command-line exome analysis software. Exomes were generated by randomly inserting known disease variants from the Human Genome Mutation Database (HGMD) into 1000 unaffected exomes from the 1000 Genomes Project. These were analyzed using each tool, and the ability of each to rank the causative variant as the top hit, in the top 10 or in the top 50 was recorded. Default settings, along with a minor allele frequency cutoff of 1%, were used for all tools. Analysis was performed using all phenotype annotations (a), just three of the terms chosen randomly (b), or two of these three terms made less specific and two random terms from the whole of the Human Phenotype Ontology (HPO) added (c)