Xiao Yuan1,2,3, Jing Wang1, Bing Dai1, Yanfang Sun1, Keke Zhang1, Fangfang Chen1, Qian Peng1, Yixuan Huang4, Xinlei Zhang5, Junru Chen3, Xilin Xu2, Jun Chuan1,2, Wenbo Mu2, Huiyuan Li1,2, Ping Fang2, Qiang Gong1,2, Peng Zhang6.
Abstract
Identifying disease-causing genes from the next-generation sequencing (NGS) data of patients with Mendelian disorders is challenging. To address this, researchers have developed many phenotype-driven gene prioritization methods that take a patient's genotype and phenotype information, or phenotype information only, as input to rank candidate pathogenic genes. Evaluations of these ranking methods help practitioners choose an appropriate tool for their workflows, but retrospective benchmarks are underpowered to provide statistically significant results when differentiating between methods. In this research, the performance of ten recognized causal-gene prioritization methods was benchmarked using 305 cases from the Deciphering Developmental Disorders (DDD) project and 209 in-house cases via a relatively unbiased methodology. The evaluation results show that methods using Human Phenotype Ontology (HPO) terms and Variant Call Format (VCF) files as input achieved better overall performance than those using phenotypic data alone. In addition, LIRICAL and AMELIE, two of the best methods in our benchmark experiments, complement each other in the cases whose causal genes they rank highly, suggesting a possible integrative approach to further enhance diagnostic efficiency. Our benchmarking provides valuable reference information for computer-assisted rapid diagnosis of Mendelian diseases and sheds some light on potential directions for future improvement of disease-causing gene prioritization methods.
Keywords: HPO; Mendelian diseases; benchmarking; gene prioritization
Year: 2022 PMID: 35134823 PMCID: PMC8921623 DOI: 10.1093/bib/bbac019
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Illustration of study workflow. Flowchart of data collection and method implementation in this work. The DDD patient cohort includes 305 cases with developmental disorders (shown in light blue), while the in-house KMCGD patient cohort comprises 209 cases with a wide range of syndromes (shown in various colors). Curated HPO terms and a VCF file for each case in both cohorts are imported into six ‘HPO + VCF’ prioritization methods. Additionally, the curated HPO terms of each case are imported into five ‘HPO-only’ prioritization methods. In particular, AMELIE is run in both ‘HPO + VCF’ mode and ‘HPO-only’ mode (AMELIE_HPO). Finally, for each case, the rank of the known causal gene in the gene list output by each method is recorded, and the performance of each method is evaluated on that basis.
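The final step of the workflow, recording where the known causal gene falls in a method's ranked output, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `causal_gene_rank`, the case-insensitive symbol matching and the example gene list are all assumptions made here.

```python
from typing import Optional, Sequence


def causal_gene_rank(ranked_genes: Sequence[str], causal_gene: str) -> Optional[int]:
    """Return the 1-based position of causal_gene in ranked_genes.

    Returns None when the gene is absent from the list. Matching is
    case-insensitive, which is an assumption of this sketch.
    """
    target = causal_gene.upper()
    for position, gene in enumerate(ranked_genes, start=1):
        if gene.upper() == target:
            return position
    return None


# Hypothetical ranked output of one method for one case (illustration only).
example_output = ["SCN2A", "ARID1B", "KCNQ2"]
print(causal_gene_rank(example_output, "ARID1B"))  # 2
print(causal_gene_rank(example_output, "PAH"))     # None
```

Returning `None` for a missing gene keeps "not ranked at all" distinct from "ranked very low", which matters when hit rates are computed later.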
Overview of the datasets used in this work: basic information, gender composition, age composition, frequent causal genes and disease subgroup composition of the DDD and KMCGD datasets

| Category | Item | DDD | KMCGD |
|---|---|---|---|
| Basic information | Number of cases | 305 | 209 |
| | Average HPO amount | 7.5 | 2.0 |
| | Average variant amount | 100 033 | 83 587 |
| Gender | Male | 141 (46.2%) | 125 (59.8%) |
| | Female | 164 (53.8%) | 84 (40.2%) |
| Age | 0–1 | 3 | 61 |
| | 1–7 | 148 | 30 |
| | 7–18 | 154 | 51 |
| | 18–65 | – | 65 |
| | 65+ | – | 2 |
| Frequent causal genes (number of cases) | >9 | ARID1B (11) | ATP7B (37) |
| | 9 | – | SRD5A2 |
| | 8 | MED13L | – |
| | 7 | ANKRD11, SYNGAP1 | – |
| | 6 | KCNQ2, SATB2, SCN2A | UGT1A1 |
| | 5 | CTNNB1, DDX3X, PPP2R5D, STXBP1 | PAH |
| Disease subgroup | | 305 | 51 |
| | | – | 21 |
| | | – | 26 |
| | | – | 22 |
| | | – | 48 |
| | | – | 41 |
Overview of the methods evaluated in this work: brief features, running time, version number and release year of the 10 methods

| Input | Method | Brief features | Running time | Version | Released |
|---|---|---|---|---|---|
| HPO + VCF | PhenIX | Computational phenotype analysis | 103 s | 1.16 | 2014 |
| | Exomiser | Cross-species phenotype comparison | 35 s | 12.1.0 | 2015 |
| | DeepPVP | Deep learning | 280 s | 2.1 | 2019 |
| | Xrare | Machine learning | 260 s | pub:2015 | 2019 |
| | AMELIE | Text mining and natural language processing | 94 s | Oct 5, 2020 | 2020 |
| | LIRICAL | Likelihood ratio framework | 31 s | 1.3.0 | 2020 |
| HPO-only | Phenolyzer | Machine learning | 107 s | 0.4.0 | 2015 |
| | HANRD | Heterogeneous networks and graph convolution | 628 s | – | 2018 |
| | GADO | Gene network based on transcriptome data | 6 s | 1.0.1 | 2019 |
| | Phen2Gene | Probabilistic model | 6 s | 1.2.3 | 2020 |
The running time was measured under the default settings of each software using a VCF file (29 968 variants, size: 23.87 MB) of case NA12878 from the Genome in a Bottle project and a set of randomly chosen HPO terms: HP:0000002 (Abnormality of body height), HP:0003020 (Enlargement of the wrists), HP:0006089 (Palmar hyperhidrosis), HP:0009023 (Abdominal wall muscle weakness) and HP:0012047 (Hemeralopia).
Figure 2. Distribution plots of performance evaluation results. Distribution plots of performance evaluation results of 10 phenotype-driven gene prioritization methods on the DDD (A) and KMCGD (B) datasets. The distribution plots illustrate the percentage of cases with causal genes ranked in top-1 and within the top-5, -10, -20, -30, -40 and -50 by each method. Each method is represented by a different color.
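The percentages plotted for each method can be derived from the per-case ranks of the causal gene. The sketch below is an assumed formulation rather than the authors' implementation: the name `top_k_percentages`, the treatment of unranked genes (`None`) as misses, and the example ranks are all hypothetical.

```python
from typing import Dict, Optional, Sequence

# Rank thresholds named in the figure caption.
THRESHOLDS = (1, 5, 10, 20, 30, 40, 50)


def top_k_percentages(
    ranks: Sequence[Optional[int]],
    thresholds: Sequence[int] = THRESHOLDS,
) -> Dict[int, float]:
    """Percentage of cases whose causal gene is ranked within each top-k.

    ranks holds one 1-based rank per case; None means the causal gene was
    not in the method's output and therefore counts as a miss for every k.
    """
    total = len(ranks)
    if total == 0:
        raise ValueError("at least one case is required")
    return {
        k: 100.0 * sum(1 for r in ranks if r is not None and r <= k) / total
        for k in thresholds
    }


# Made-up ranks for eight cases (illustration only, not data from the study).
example_ranks = [1, 3, None, 12, 60, 1, 45, 7]
print(top_k_percentages(example_ranks))
# {1: 25.0, 5: 37.5, 10: 50.0, 20: 62.5, 30: 62.5, 40: 62.5, 50: 75.0}
```

Passing every integer from 1 to 50 as `thresholds` yields the points of a cumulative curve over rank cutoffs.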
Figure 3. CDF and bar plots of performance evaluation results. CDF plots (A) and bar plots (B) of performance evaluation results of 10 phenotype-driven gene prioritization methods on the DDD (left) and KMCGD (right) datasets. The CDF plots illustrate the percentage of cases with causal genes ranked within the top k by each method, where k is any integer between 1 and 50 (inclusive). Each method is represented by a different color. The bar plots illustrate the relative proportion of cases whose causal genes are ranked within each designated range. Each range group is represented by a different color. (C) Overlap between the sets of cases with causal genes ranked in top-1 and within top-5, -10, -20, -30, -40 and -50 by LIRICAL and AMELIE in the DDD (left) and KMCGD (right) datasets.
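The overlap analysis in panel C amounts to comparing, for a given cutoff k, the sets of cases that each of two methods ranks within the top k. A minimal sketch under stated assumptions: the function names, the dictionary layout (case identifier mapped to rank, `None` if unranked) and the example values are hypothetical and do not come from the study.

```python
from typing import Dict, Optional, Set


def cases_within_top_k(ranks_by_case: Dict[str, Optional[int]], k: int) -> Set[str]:
    """Identifiers of cases whose causal gene is ranked within the top k."""
    return {
        case
        for case, rank in ranks_by_case.items()
        if rank is not None and rank <= k
    }


def overlap_summary(
    ranks_a: Dict[str, Optional[int]],
    ranks_b: Dict[str, Optional[int]],
    k: int,
) -> Dict[str, int]:
    """Count cases solved within top k by both methods, by one only, or by either."""
    hits_a = cases_within_top_k(ranks_a, k)
    hits_b = cases_within_top_k(ranks_b, k)
    return {
        "both": len(hits_a & hits_b),
        "only_a": len(hits_a - hits_b),
        "only_b": len(hits_b - hits_a),
        "either": len(hits_a | hits_b),
    }


# Made-up ranks from two methods on four cases (illustration only).
method_a = {"case1": 1, "case2": 8, "case3": None, "case4": 2}
method_b = {"case1": 3, "case2": 1, "case3": 4, "case4": None}
print(overlap_summary(method_a, method_b, k=5))
# {'both': 1, 'only_a': 1, 'only_b': 2, 'either': 4}
```

A large `either` count relative to each method's own hits is what complementarity between two methods looks like in these terms.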
Figure 4. Performance evaluation across different disease subgroups. (A) Frequency distribution of the HPO parent classes for the KMCGD dataset. The HPO terms of each case in the KMCGD dataset are assigned to HPO parent classes according to the official HPO hierarchy; some cases involve more than one HPO parent class. (B) Disease subgroup composition of the KMCGD dataset. The number and proportion of cases are labeled for each subgroup. (C) CDF plots of performance evaluation results of 10 phenotype-driven gene prioritization methods on each subgroup of the KMCGD dataset.