Azza Althagafi, Lamia Alsubaie, Nagarajan Kathiresan, Katsuhiko Mineta, Taghrid Aloraini, Fuad Almutairi, Majid Alfadhel, Takashi Gojobori, Ahmad Alfares, Robert Hoehndorf.
Abstract
MOTIVATION: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity, and other genomic features. Phenotype-based methods for identifying variants that are involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them.
Year: 2021 PMID: 34951628 PMCID: PMC8896633 DOI: 10.1093/bioinformatics/btab859
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the DeepSVP model. (a) First, a graph is generated from the ontology axioms in which nodes represent classes or entities annotated with ontology classes, and edges represent axioms that hold between these classes. (b) The DL2Vec workflow takes a set of phenotypes as input and predicts whether a gene is likely associated with these phenotypes using the background knowledge in the graph generated from the ontology and its associations. (c) The combined model uses the prediction score of the DL2Vec phenotype model combined with genomic features derived from SVs. The model outputs a prediction score for each variant that determines how likely the variant is causative of the phenotypes provided as input. G, genes; D, diseases; F, (genomic) features; P, phenotypic score.
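Panel (c) of the figure can be illustrated with a minimal sketch of how one input vector per variant might be assembled. This is an assumption-laden illustration, not the DeepSVP implementation: it assumes that per-gene phenotype scores for the genes overlapped by an SV are reduced to a single value by taking the maximum or the average (suggested by the "maximum score" and "average score" model variants in the results table), and that this value is simply placed alongside the genomic features. The function names, the example values, and the fallback of 0.0 for an SV that overlaps no scored gene are hypothetical.

```python
from statistics import mean


def aggregate_gene_scores(gene_scores, how="max"):
    """Reduce per-gene phenotype scores for one SV to a single value.

    `how` is "max" or "average". Returning 0.0 when no gene has a score
    is an assumption made only for this sketch.
    """
    if not gene_scores:
        return 0.0
    if how == "max":
        return max(gene_scores)
    if how == "average":
        return mean(gene_scores)
    raise ValueError(f"unknown aggregation: {how}")


def build_variant_input(gene_scores, genomic_features, how="max"):
    """Place the aggregated phenotype score P in front of the genomic features F."""
    return [aggregate_gene_scores(gene_scores, how)] + list(genomic_features)


# Hypothetical SV overlapping two genes, with two made-up genomic features.
example_max = build_variant_input([0.2, 0.8], [1.0, 0.0], how="max")
example_avg = build_variant_input([0.2, 0.8], [1.0, 0.0], how="average")
```

A downstream classifier would then map each such vector to a per-variant prediction score; that step is left out here because the excerpt does not describe it.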
Summary of the evaluation for predicting causative variants in the dbVar benchmark dataset (time-based split) for 1503 newly added variants, together with the evaluation for 640 newly added variants associated with 175 new diseases that were not present in our training dataset.
| Method | Model | Synthetic: Recall@1 | Synthetic: Recall@10 | Synthetic: Recall@30 | Synthetic: ROCAUC | Synthetic: PRAUC | Synthetic (novel diseases): Recall@1 | Synthetic (novel diseases): Recall@10 | Synthetic (novel diseases): Recall@30 | Synthetic (novel diseases): ROCAUC | Synthetic (novel diseases): PRAUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSVP models using average score | GO | 435 (0.2894) | 626 (0.4165) | 811 (0.5396) | 0.9647 | 0.3303 | 114 (0.1781) | 192 (0.3000) | 259 (0.4047) | 0.9560 | 0.2163 |
| | MP | 634 (0.4218) | 1035 (0.6886) | 1217 (0.8097) | 0.9850 | 0.5123 | 208 (0.3250) | 330 (0.5156) | 436 (0.6813) | 0.9728 | 0.3873 |
| | HP | 545 (0.3626) | 977 (0.6500) | 1220 (0.8117) | 0.9828 | 0.4528 | 157 (0.2453) | 352 (0.5500) | 447 (0.6984) | 0.9760 | 0.3263 |
| | CL | 157 (0.1045) | 542 (0.3606) | 882 (0.5868) | 0.9740 | 0.1761 | 40 (0.0625) | 125 (0.1953) | 262 (0.4094) | 0.9659 | 0.1050 |
| | UBERON | 254 (0.1690) | 602 (0.4005) | 1097 (0.7299) | 0.9752 | 0.2377 | 32 (0.0500) | 147 (0.2297) | 347 (0.5422) | 0.9627 | 0.1070 |
| | Union | | 1055 (0.7019) | 1248 (0.8303) | 0.9854 | | | | | 0.9858 | |
| DeepSVP models using maximum score | GO | 325 (0.2162) | 536 (0.3566) | 725 (0.4824) | 0.9558 | 0.2670 | 97 (0.1516) | 174 (0.2719) | 245 (0.3828) | 0.9494 | 0.1917 |
| | MP | 237 (0.1577) | 630 (0.4192) | 855 (0.5689) | 0.9605 | 0.2492 | 102 (0.1594) | 156 (0.2437) | 233 (0.3641) | 0.9431 | 0.1949 |
| | HP | 445 (0.2961) | | | | 0.4364 | 122 (0.1906) | 370 (0.5781) | 528 (0.8250) | | 0.3194 |
| | CL | 272 (0.1810) | 835 (0.5556) | 1148 (0.7638) | 0.9801 | 0.2569 | 52 (0.0813) | 250 (0.3906) | 390 (0.6094) | 0.9756 | 0.1429 |
| | UBERON | 259 (0.1723) | 637 (0.4238) | 1049 (0.6979) | 0.9733 | 0.2417 | 69 (0.1078) | 161 (0.2516) | 369 (0.5766) | 0.9656 | 0.1550 |
| | Union | 328 (0.2182) | 948 (0.6307) | 1122 (0.7465) | 0.9750 | 0.3489 | 85 (0.1328) | 363 (0.5672) | 457 (0.7141) | 0.9758 | 0.2585 |
| SV pathogenicity prediction/ranking | StrVCTVRE | 72 (0.0479) | 223 (0.1484) | 405 (0.2695) | 0.9178 | 0.0952 | 34 (0.0531) | 120 (0.1875) | 210 (0.3281) | 0.9308 | 0.1142 |
| | CADD-SV | 38 (0.0253) | 620 (0.4125) | 1020 (0.6786) | 0.9816 | 0.1262 | 9 (0.0141) | 162 (0.2531) | 373 (0.5828) | 0.9871 | 0.0860 |
| | AnnotSV | 19 (0.0126) | 229 (0.1524) | 700 (0.4657) | 0.9605 | 0.2203 | 5 (0.0078) | 60 (0.0938) | 190 (0.2969) | 0.9424 | 0.2319 |
Note: The evaluation inserts one disease-associated SV in a whole genome and reports the rank at which the inserted variant was recovered. Some methods provide the same score for variants, and we break ties randomly and report the absolute number of variants recovered at each rank together with the recall, as well as areas under the ROC curve (using microaverages per genome) and precision–recall curve. Best performing results (using maximum or average score) for each measure are indicated in bold.
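The ranking protocol in the Note can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each genome is a list of variant scores with one known causative index, places the causative variant uniformly at random among variants with the same score, and reports recall at rank k as both a count and a fraction. The function names and the toy scores are hypothetical.

```python
import random


def rank_with_random_ties(scores, target_index, rng):
    """1-based rank of scores[target_index] in descending order.

    Variants tied with the target are ordered randomly, so the target
    lands at a uniformly random position within its tie group.
    """
    target = scores[target_index]
    higher = sum(1 for s in scores if s > target)
    tied = sum(1 for i, s in enumerate(scores)
               if s == target and i != target_index)
    return higher + 1 + rng.randint(0, tied)


def recall_at_k(ranks, k):
    """Number and fraction of causative variants recovered at rank <= k."""
    hits = sum(1 for r in ranks if r <= k)
    return hits, hits / len(ranks)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Hypothetical genomes: (variant scores, index of the inserted causative SV).
genomes = [
    ([0.9, 0.2, 0.1], 0),  # causative variant scores highest
    ([0.5, 0.8, 0.3], 0),  # one variant scores higher
    ([0.4, 0.4, 0.4], 1),  # three-way tie, rank decided randomly
]
ranks = [rank_with_random_ties(scores, idx, rng) for scores, idx in genomes]
hits_at_1, recall_1 = recall_at_k(ranks, 1)
```

The parenthesised values in the table follow the same count-over-total form; for example, 435 of 1503 variants recovered at rank 1 gives 435 / 1503 ≈ 0.2894.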
Summary of the ROCAUC performance for reranking causative variants from the benchmark dataset with DeepSVP that are assigned the same classification by AnnotSV (930 variants classified as pathogenic, 563 variants as likely pathogenic)
| Score | AnnotSV classification | GO | MP | HP | CL | UBERON | Union |
|---|---|---|---|---|---|---|---|
| Maximum score | Pathogenic variants | 0.9032 | 0.9032 | 0.9018 | 0.9018 | 0.9034 | 0.9028 |
| | Likely pathogenic variants | 0.9710 | 0.9711 | 0.9713 | 0.9739 | 0.9704 | 0.9720 |
| Average score | Pathogenic variants | 0.9032 | 0.9028 | 0.9029 | 0.9020 | 0.9028 | 0.9016 |
| | Likely pathogenic variants | 0.9703 | 0.9695 | 0.9704 | 0.9707 | 0.9702 | 0.9694 |
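One way to read a ROCAUC in this setting is through the rank-statistic view of AUC: when a candidate set contains exactly one causative variant, the AUC equals the fraction of the other variants that score below it, with ties counted as one half. The sketch below assumes that single-positive setup and assumes the candidate set has already been restricted to variants sharing one AnnotSV classification. It does not reproduce how the paper pools genomes into its reported values, and the function name and scores are hypothetical.

```python
def single_positive_auc(scores, positive_index):
    """ROC AUC for one positive among negatives (Mann-Whitney form).

    Equals the probability that the positive outscores a randomly chosen
    negative, with a tie counted as half a win.
    """
    positive = scores[positive_index]
    negatives = [s for i, s in enumerate(scores) if i != positive_index]
    if not negatives:
        raise ValueError("need at least one negative variant")
    below = sum(1 for s in negatives if s < positive)
    ties = sum(1 for s in negatives if s == positive)
    return (below + 0.5 * ties) / len(negatives)


# Hypothetical scores for variants that share one AnnotSV classification;
# index 0 is the causative variant in each list.
auc_top = single_positive_auc([0.9, 0.2, 0.1], 0)
auc_middle = single_positive_auc([0.5, 0.8, 0.3], 0)
auc_tied = single_positive_auc([0.4, 0.4, 0.4], 0)
```

Under this reading, a value near 0.90 means the causative variant outscores roughly nine in ten of the variants that AnnotSV places in the same class.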