Indika Kahanda, Christopher Funk, Karin Verspoor, Asa Ben-Hur.
Abstract
The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. It is therefore important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, over several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.
Keywords: human phenotype ontology; structured SVM
Year: 2015 PMID: 26834980 PMCID: PMC4722686 DOI: 10.12688/f1000research.6670.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. A portion of the Organ abnormality subontology.
All HPO parent-child relationships represent “is-a” relationships.
Figure 2. HPO annotations.
a) general format of annotations: genes are annotated with a set of phenotype terms based on their known relationships with diseases b) an example annotation: the amyloid precursor protein (APP) gene is associated with Alzheimer’s disease and cerebroarterial amyloidosis. Therefore, the APP gene is annotated with the set of HPO terms (Organ in orange, Inheritance in green) associated with these diseases.
Number of genes, unique terms and annotations.
The “unique terms” column provides both the number of terms and the number of leaf terms; the “annotations” column provides the number of annotations, as well as their number when expanded using the true-path rule.
| Subont. | Genes | Terms (all/leaf) | Annotations (expanded/unexpanded) |
|---|---|---|---|
| Organ | 2,768 | 1,796/1,337 | 213k/60k |
| Inheritance | 2,668 | 12/10 | 3.6k/3.3k |
| Onset | 926 | 23/20 | 1.7k/1.4k |
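The expansion of annotations under the true-path rule can be illustrated with a minimal sketch. It assumes the ontology is available as a child-to-parents mapping; the function names, the toy DAG and the gene names below are hypothetical and are not taken from the paper.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in the is-a DAG (the term itself excluded)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def expand_true_path(annotations, parents):
    """Add all ancestors of each annotated term (true-path rule)."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# Hypothetical toy DAG: child -> list of parents (assumption, not real HPO terms).
parents = {"C": ["B"], "B": ["A"], "D": ["A", "B"]}
annotations = {"gene1": {"C"}, "gene2": {"D"}}
expanded = expand_true_path(annotations, parents)
```

Because every annotated term pulls in all of its ancestors, the expanded annotation count can only be equal to or larger than the unexpanded one.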
Figure 3. Overview of PHENOstruct.
PHENOstruct takes the set of feature vectors and HPO annotations associated with each gene as input for training. Once trained, it can predict a set of hierarchically consistent HPO terms for a given test gene. PHENOstruct is trained on and makes predictions for a single subontology at a time (DAGs belonging to Organ, Inheritance and Onset subontologies are shown in orange, green and blue, respectively).
Figure 4. Visual interpretation of the structured prediction framework.
The compatibility function, which is the key component of the structured prediction framework, measures compatibility between a given input and a structured output. The compatibility score of the true label (the correct set of HPO annotations) is required to be higher than that of any other label, and the difference between these two scores (the margin) is maximized.
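In symbols, this requirement can be sketched with the standard margin-rescaled structured SVM formulation. The linear form of the compatibility function, the joint feature map $\psi$, the loss $\Delta$, the slack variables $\xi_i$ and the constant $C$ are assumptions of this sketch and are not details given in this section.

```latex
% Compatibility function and prediction rule (linear form assumed)
f(x, y) = \mathbf{w}^{\top} \psi(x, y), \qquad
\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} f(x, y)

% Margin constraints: the true label y_i must outscore every other label y
\min_{\mathbf{w},\, \xi \ge 0} \;
\tfrac{1}{2} \lVert \mathbf{w} \rVert^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
f(x_i, y_i) - f(x_i, y) \ge \Delta(y_i, y) - \xi_i
\quad \forall i,\ \forall y \ne y_i
```

Here $x_i$ is the feature vector of gene $i$, $y_i$ is its true set of HPO terms, and $\mathcal{Y}$ is the set of hierarchically consistent label sets.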
Comparison between PHENOstruct and other methods.
| Subontology | Method | F-max | Precision | Recall | mac-AUC |
|---|---|---|---|---|---|
| Organ | PhenoPPIOrth | 0.20 | 0.27 | 0.15 | 0.52 |
| | Struct->Dis->HPO | 0.23 | 0.16 | 0.41 | 0.49 |
| | Binary SVMs | 0.35 | 0.32 | 0.40 | 0.66 |
| | Clus-HMC-Ens | 0.41 | | 0.43 | 0.65 |
| | PHENOstruct | | 0.35 | | 0.73 |
| Inheritance | PhenoPPIOrth | 0.12 | 0.16 | 0.10 | 0.55 |
| | Struct->Dis->HPO | 0.11 | 0.07 | 0.25 | 0.46 |
| | Binary SVMs | 0.69 | 0.62 | 0.78 | 0.72 |
| | Clus-HMC-Ens | 0.73 | 0.64 | | 0.73 |
| | PHENOstruct | | | 0.81 | 0.74 |
| Onset | PhenoPPIOrth | 0.25 | 0.25 | 0.24 | 0.53 |
| | Struct->Dis->HPO | 0.07 | 0.06 | 0.10 | 0.49 |
| | Binary SVMs | 0.33 | 0.24 | 0.51 | 0.62 |
| | Clus-HMC-Ens | 0.35 | 0.27 | 0.48 | 0.58 |
| | PHENOstruct | | | | 0.64 |
Performance is evaluated using macro AUC and protein-centric F-max, Precision and Recall (as defined above) on the complete HPO graph (i.e. the true-path rule is applied to both annotations and predictions).
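For orientation, protein-centric F-max can be computed along the following lines. The definitions referred to above are not reproduced in this section, so this sketch assumes the usual CAFA-style convention: at each threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and F-max is the best F-measure over thresholds. The function name, the threshold grid and the toy data are hypothetical.

```python
def fmax(scores, truth, thresholds=None):
    """Return (F-max, precision, recall, threshold) under a CAFA-style definition.

    scores: {protein: {term: score in [0, 1]}}
    truth:  {protein: set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = (0.0, 0.0, 0.0, None)
    n_proteins = len(truth)
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            predicted = {term for term, s in scores.get(protein, {}).items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(tp / len(predicted))
            if true_terms:
                recall_sum += tp / len(true_terms)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = recall_sum / n_proteins
        if p + r > 0:
            f = 2 * p * r / (p + r)
            if f > best[0]:
                best = (f, p, r, t)
    return best


# Hypothetical toy data (not from the paper).
truth = {"g1": {"A", "B"}, "g2": {"A"}}
scores = {"g1": {"A": 0.9, "B": 0.6, "C": 0.2}, "g2": {"A": 0.8, "C": 0.3}}
f, p, r, t = fmax(scores, truth)
```

Applying the true-path rule to both `scores` and `truth` before calling the function would correspond to evaluation on the complete graph; skipping it would correspond to leaf-only evaluation.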
Comparison between PHENOstruct and other methods using only leaf terms.
| Subontology | Method | F-max | Precision | Recall | mac-AUC |
|---|---|---|---|---|---|
| Organ | PhenoPPIOrth | 0.03 | 0.05 | 0.03 | 0.51 |
| | Struct->Dis->HPO | 0.01 | 0.01 | 0.02 | 0.50 |
| | Binary SVMs | 0.20 | 0.19 | 0.22 | 0.66 |
| | Clus-HMC-Ens | 0.08 | 0.04 | 0.31 | 0.64 |
| | PHENOstruct | | | | |
| Inheritance | PhenoPPIOrth | 0.11 | 0.15 | 0.09 | 0.55 |
| | Struct->Dis->HPO | 0.01 | 0.01 | 0.02 | 0.46 |
| | Binary SVMs | 0.69 | 0.62 | 0.78 | 0.71 |
| | Clus-HMC-Ens | 0.72 | 0.63 | | 0.73 |
| | PHENOstruct | | | 0.82 | |
| Onset | PhenoPPIOrth | 0.19 | 0.20 | 0.17 | 0.52 |
| | Struct->Dis->HPO | 0.03 | 0.02 | 0.14 | 0.49 |
| | Binary SVMs | 0.29 | 0.21 | 0.47 | 0.62 |
| | Clus-HMC-Ens | 0.28 | 0.21 | 0.43 | 0.58 |
| | PHENOstruct | | | | |
Performance is evaluated using the exact annotations (i.e. only leaf terms) as ground truth. In other words, the true-path rule is not applied. Performance is presented using macro AUC and protein-centric F-max, Precision and Recall as defined above.
PHENOstruct vs. other methods.
Performance across the three HPO subontologies for PHENOstruct, binary SVMs and Clus-HMC-Ens measured using the macro AUC. P-values provide the significance level for the difference between the corresponding method and PHENOstruct.
| Subont. | Terms | Method | AUC | P-value |
|---|---|---|---|---|
| Organ | 1,796 | Binary SVMs | 0.66 | 1.7E-262 |
| | | Clus-HMC-Ens | 0.65 | 0.0E+00 |
| | | PHENOstruct | 0.73 | — |
| Inherit. | 12 | Binary SVMs | 0.72 | 2.2E-01 |
| | | Clus-HMC-Ens | 0.73 | 7.3E-01 |
| | | PHENOstruct | 0.74 | — |
| Onset | 23 | Binary SVMs | 0.62 | 4.4E-03 |
| | | Clus-HMC-Ens | 0.58 | 3.3E-05 |
| | | PHENOstruct | 0.64 | — |
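Macro AUC, as used in this comparison, is the average of the per-term AUCs. A minimal sketch follows, assuming scores and labels are stored as gene-by-term matrices and that terms lacking either positives or negatives are skipped (an assumption of this sketch). The function names and toy values are hypothetical, and the significance test behind the P-values is not reproduced here.

```python
def auc(scores, labels):
    """AUC of one term via the rank (Mann-Whitney) statistic; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_matrix, label_matrix):
    """Average the per-term AUCs; rows are genes, columns are terms."""
    per_term = []
    for j in range(len(score_matrix[0])):
        term_scores = [row[j] for row in score_matrix]
        term_labels = [row[j] for row in label_matrix]
        value = auc(term_scores, term_labels)
        if value is not None:
            per_term.append(value)
    return sum(per_term) / len(per_term)


# Hypothetical toy data: 3 genes x 2 terms (not from the paper).
score_matrix = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
label_matrix = [[1, 0], [0, 1], [1, 0]]
macro = macro_auc(score_matrix, label_matrix)
```

Because each term contributes equally, rare terms weigh as much as frequent ones in this average.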
Performance of PHENOstruct in the Onset subontology.
The average macro AUC for the Onset subontology is 0.64. Terms are displayed in ascending order of frequency.
| Name | Freq. | Depth | AUC |
|---|---|---|---|
| Late onset | 11 | 4 | 0.70 |
| Neonatal death | 14 | 2 | 0.54 |
| Sudden death | 14 | 2 | 0.50 |
| Nonprogressive disorder | 15 | 2 | 0.82 |
| Stillbirth | 21 | 2 | 0.67 |
| Death in childhood | 23 | 2 | 0.65 |
| Neonatal onset | 23 | 3 | 0.64 |
| Rapidly progressive | 33 | 2 | 0.50 |
| Childhood onset | 41 | 3 | 0.62 |
| Death in infancy | 44 | 2 | 0.70 |
| Incomplete penetrance | 58 | 2 | 0.61 |
| Juvenile onset | 90 | 3 | 0.70 |
| Slow progression | 95 | 2 | 0.62 |
| Adult onset | 98 | 3 | 0.71 |
| Death | 111 | 1 | 0.61 |
| Variable expressivity | 132 | 2 | 0.66 |
| Congenital onset | 135 | 3 | 0.60 |
| Progressive disorder | 141 | 2 | 0.70 |
| Infantile onset | 245 | 3 | 0.66 |
| Phenotypic variability | 310 | 1 | 0.65 |
Figure 5. Performance of PHENOstruct in the Organ subontology.
Performance for each term is displayed using AUC against its frequency. The average AUC for the Organ subontology is 0.73.
Figure 6. Example of experimental and predicted annotations.
a) experimental annotation of protein P43681; b) PHENOstruct’s prediction for P43681 (protein-centric precision and recall for this individual protein are 1.0 and 0.62, respectively).
Performance of PHENOstruct in the Inheritance subontology.
The average macro AUC for the Inheritance subontology is 0.74. Terms are displayed in ascending order of frequency.
| Name | Freq. | Depth | AUC |
|---|---|---|---|
| Multifactorial inheritance | 15 | 1 | 0.54 |
| Polygenic inheritance | 15 | 2 | 0.54 |
| Mitochondrial inheritance | 41 | 1 | 0.98 |
| Sporadic | 52 | 1 | 0.61 |
| Somatic mutation | 61 | 1 | 0.76 |
| X-linked dominant inheritance | 62 | 3 | 0.83 |
| X-linked recessive inheritance | 111 | 3 | 0.77 |
| Heterogeneous | 148 | 1 | 0.69 |
| Gonosomal inheritance | 198 | 1 | 0.80 |
| X-linked inheritance | 198 | 2 | 0.80 |
| Autosomal dominant inheritance | 1096 | 1 | 0.78 |
| Autosomal recessive inheritance | 1665 | 1 | 0.73 |
Performance of Binary SVMs in the Inheritance subontology.
| Name | Freq. | Depth | AUC |
|---|---|---|---|
| Multifactorial inheritance | 15 | 1 | 0.62 |
| Polygenic inheritance | 15 | 2 | 0.62 |
| Mitochondrial inheritance | 41 | 1 | 0.96 |
| Sporadic | 52 | 1 | 0.66 |
| Somatic mutation | 61 | 1 | 0.71 |
| X-linked dominant inheritance | 62 | 3 | 0.79 |
| X-linked recessive inheritance | 111 | 3 | 0.70 |
| Heterogeneous | 148 | 1 | 0.65 |
| X-linked inheritance | 198 | 2 | 0.78 |
| Gonosomal inheritance | 198 | 1 | 0.78 |
| Autosomal dominant inheritance | 1096 | 1 | 0.69 |
| Autosomal recessive inheritance | 1665 | 1 | 0.68 |
The macro AUC for the Inheritance subontology is 0.72. Terms are displayed in ascending order of frequency.
Figure 7. Performance of PHENOstruct with individual data sources.
Results are shown for each source of data: network (functional association data); Gene Ontology annotations; literature mining data; genetic variants; and the model that combines all features together.
Figure 8. Performance of PHENOstruct in leave-one-source-out experiments.
Performance is measured by the % change in macro AUC when a single selected source is left out, relative to the macro AUC obtained using all data sources; a negative % change means that performance dropped after leaving out that source of data.
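The % change reported in this experiment can be computed as in the following sketch. The function names are hypothetical, and the AUC values are made-up placeholders chosen only to show the arithmetic; they are not results from the paper.

```python
def percent_change(auc_all, auc_without):
    """% change in macro AUC after leaving one source out (negative = drop)."""
    return 100.0 * (auc_without - auc_all) / auc_all


def leave_one_source_out(auc_all, auc_without_by_source):
    """Map each left-out source to its % change relative to the full model."""
    return {
        source: percent_change(auc_all, value)
        for source, value in auc_without_by_source.items()
    }


# Placeholder values for illustration only (assumptions, not reported results).
changes = leave_one_source_out(0.80, {"network": 0.76, "literature": 0.72})
```

A source whose removal produces the most negative value is the one the combined model depends on most.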
The Organ subontology terms that are well-predicted by variant features with PHENOstruct.
| HPO ID | HPO term | Freq | Depth | AUC | Cancer-related |
|---|---|---|---|---|---|
| HP:0006846 | Acute encephalopathy | 15 | 5 | 1.00 | No |
| HP:0006965 | Acute necrotizing encephalopathy | 15 | 6 | 1.00 | No |
| HP:0003287 | Abnormality of mitochondrial metabolism | 16 | 4 | 1.00 | No |
| HP:0008316 | Abnormal mitochondria in muscle tissue | 16 | 5 | 1.00 | No |
| HP:0012103 | Abnormality of the mitochondrion | 16 | 3 | 1.00 | No |
| HP:0002141 | Gait imbalance | 14 | 5 | 1.00 | No |
| HP:0000148 | Vaginal atresia | 19 | 7 | 1.00 | No |
| HP:0001827 | Genital tract atresia | 19 | 4 | 1.00 | No |
| HP:0002862 | Bladder carcinoma | 22 | 5 | 1.00 | Yes |
| HP:0006740 | Transitional cell carcinoma of the bladder | 22 | 6 | 1.00 | Yes |
| HP:0009725 | Bladder neoplasm | 22 | 4 | 1.00 | Yes |
| HP:0010784 | Uterine neoplasm | 29 | 7 | 0.98 | Yes |
| HP:0002672 | Gastrointestinal carcinoma | 23 | 6 | 0.98 | Yes |
| HP:0006716 | Hereditary nonpolyposis colorectal carcinoma | 23 | 7 | 0.98 | Yes |
| HP:0006749 | Malignant gastrointestinal tract tumors | 23 | 5 | 0.98 | Yes |
| HP:0010747 | Medial flaring of the eyebrow | 14 | 6 | 0.98 | No |
| HP:0002891 | Uterine leiomyosarcoma | 20 | 8 | 0.98 | Yes |
| HP:0100243 | Leiomyosarcoma | 20 | 4 | 0.98 | Yes |
| HP:0004481 | Progressive macrocephaly | 18 | 6 | 0.97 | No |
| HP:0007707 | Congenital primary aphakia | 14 | 7 | 0.97 | No |
| HP:0100834 | Neoplasm of the large intestine | 33 | 6 | 0.97 | Yes |
| HP:0006519 | Alveolar cell carcinoma | 13 | 6 | 0.97 | Yes |
| HP:0100552 | Neoplasm of the tracheobronchial system | 13 | 5 | 0.97 | Yes |
| HP:0009806 | Nephrogenic diabetes insipidus | 15 | 3 | 0.95 | No |
| HP:0100273 | Neoplasm of the colon | 15 | 7 | 0.95 | Yes |
| HP:0005584 | Renal cell carcinoma | 28 | 6 | 0.94 | Yes |
| HP:0003002 | Breast carcinoma | 20 | 3 | 0.94 | Yes |
| HP:0006753 | Neoplasm of the stomach | 39 | 5 | 0.94 | Yes |
| HP:0100013 | Neoplasm of the breast | 31 | 2 | 0.92 | Yes |
| HP:0004808 | Acute myeloid leukemia | 22 | 5 | 0.92 | No |
| HP:0003006 | Neuroblastoma | 19 | 7 | 0.91 | Yes |
| HP:0004376 | Neuroblastic tumors | 19 | 6 | 0.91 | Yes |
| HP:0002370 | Poor coordination | 18 | 6 | 0.90 | No |
| HP:0010786 | Urinary tract neoplasm | 49 | 3 | 0.90 | Yes |
| HP:0000142 | Abnormality of the vagina | 27 | 6 | 0.90 | No |
| HP:0001413 | Micronodular cirrhosis | 14 | 5 | 0.90 | No |
| HP:0009726 | Renal neoplasm | 47 | 5 | 0.90 | Yes |
Terms are listed in descending order of their individual AUCs. 21 out of the 37 (57%) Organ subontology terms that are well-predicted by the variant data are related to cancer.
The top-100 literature features with respect to the 8 HPO terms that have individual AUCs equal to or above 0.9 in the organ subontology.
| Category | Tokens |
|---|---|
| proteins/protein complexes | cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, … |
| genes | hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2 |
| pathways | ras/raf/mek/erk, pi3k-akt-mtor |
| diseases/phenotypes | cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin |
| misc. | sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, … |
The union set of the top-100 literature features with respect to the 8 HPO terms that have individual AUCs equal to or above 0.9. It is composed of 107 unique tokens. The tokens “-308” and “t308” in the “proteins/protein complexes” category are due to mis-tokenization of “miR-308”. Similarly, “-238” in the same category is due to mis-tokenization of “BQ-23”. The token “=-galcer” in the same category originated from α-galcer and β-galcer due to mishandling of the UTF characters α and β.
Validation of false positives in the Organ subontology.
| Gene | HPO term | PubMed ID | Evidence |
|---|---|---|---|
| DKC1 | Postnatal growth … | PMID: 10583221 | “Hoyeraal-Hreidarsson (HH) syndrome is a multisystem … |
| PEX13 | Neonatal hypotonia | PMID: 12897163 | “…In the studies reported here, we crossed these mice with transgenic mice that express Cre recombinase … |
| PEX13 | Retinal dystrophy | PMID: 10441568, … | “…The clinical course of patients with the NALD and IRD presentation is variable and may include … |
| RPE65 | Retinal dystrophy | PMID: 23878505 | “…These results strongly suggest that causal mutations in … |
| BEST1 | Retinitis pigmentosa | PMID: 19853238 | “…Missense mutations in a retinal pigment epithelium protein, … |
| PRPF6 | Retinitis pigmentosa | PMID: 21549338 | “…A missense mutation in … |
| SNRNP200 | Retinitis pigmentosa | PMID: 19878916 | “…Autosomal-dominant … |
| ORC1 | Emphysema | PMID: 22333897 | “…Four individuals were deceased: two siblings with mutations in … |
| CUL7 | Hip dysplasia | PMID: 21396581 | “A predisposing factor in … |
| ACTB | Postnatal microcephaly | PMID: 22366783 | “…Riviere … |
| ACTB | Progressive hearing … | PMID: 16685646 | “…However, aging mice with … |
| MSH2 | Gastrointestinal … | PMID: 8252616 | “… |
| MSH2 | Malignant … | PMID: 8252616 | “… |
| MSH2 | Hereditary … | PMID: 8252616 | “… |
The columns “HPO term”, “PubMed ID” and “Evidence” provide, respectively, the false positive prediction made by PHENOstruct for the given gene, the PubMed ID of the literature containing evidence that the prediction should in fact be considered true, and the excerpt from that literature which contains the evidence. We used the 25 false positive predictions for the 17 proteins that had the highest individual protein-centric precision and found evidence for 14 predictions. Two pieces of evidence come from studies involving mice (indicated within parentheses with the PubMed ID).