Jessica J Y Lee, Michael M Gottlieb, Jake Lever, Steven J M Jones, Nenad Blau, Clara D M van Karnebeek, Wyeth W Wasserman.
Abstract
Phenomics is the comprehensive study of phenotypes at every level of biology: from metabolites to organisms. With high throughput technologies increasing the scope of biological discoveries, the field of phenomics has been developing rapid and precise methods to collect, catalog, and analyze phenotypes. Such methods have allowed phenotypic data to be widely used in medical applications, from assisting clinical diagnoses to prioritizing genomic diagnoses. To channel the benefits of phenomics into the field of inborn errors of metabolism (IEM), we have recently launched IEMbase, an expert-curated knowledgebase of IEM and their disease-characterizing phenotypes. While our efforts with IEMbase have realized benefits, taking full advantage of phenomics requires a comprehensive curation of IEM phenotypes in core phenomics projects, which is dependent upon contributions from the IEM clinical and research community. Here, we assess the inclusion of IEM biochemical phenotypes in a core phenomics project, the Human Phenotype Ontology. We then demonstrate the utility of biochemical phenotypes using a text-based phenomics method to predict gene-disease relationships, showing that the prediction of IEM genes is significantly better using biochemical rather than clinical profiles. The findings herein provide a motivating goal for the IEM community to expand the computationally accessible descriptions of biochemical phenotypes associated with IEM in phenomics resources.
Keywords: Biochemical phenotypes; Clinical informatics; Data mining; Inborn errors of metabolism; Metabolic phenotypes; Text-based phenomics
Year: 2018 PMID: 29340838 PMCID: PMC5959948 DOI: 10.1007/s10545-017-0125-4
Source DB: PubMed Journal: J Inherit Metab Dis ISSN: 0141-8955 Impact factor: 4.982
Fig. 1. An overview of HPO to IEMbase mapping. 287 biochemical phenotypes in IEMbase had 852 associations with 475 unique HPO phenotypes. The figure illustrates such mappings with respect to 26 subclasses of the HPO class “phenotypic abnormality (HP:0000118)”. “Multiple subclasses” refers to HPO phenotypes that belong to more than one subclass, consisting of: abnormality of metabolism/homeostasis (HP:0001939), abnormality of the genitourinary system (HP:0000119), abnormality of the endocrine system (HP:0000818), abnormality of the nervous system (HP:0000707), abnormality of blood and blood-forming tissues (HP:0001871), abnormality of the immune system (HP:0002715), and abnormality of the digestive system (HP:0025031).
Fig. 2. An illustration of the text-based phenotype analysis procedure. Numbered boxes (in orange) represent the main steps of the analysis. First, 563 disease-gene pairs were extracted from IEMbase (v. 1.1.0). Each pair contained the disorder name and gene name and was coupled to a phenotypic profile P (i.e., disease symptoms and biomarkers). Second, genes associated with P were identified using a text-analysis tool by Lever et al. The association strength between P and a gene g was defined as the ratio of the number of sentences in the PubMed literature where P and g appeared together to the total number of sentences where P and g appeared individually. Third, the identified genes were ranked by the strength of their association with P, and the top 100 associated genes were retained. Finally, the causal gene g was identified from the disease-gene pair connected to P, and its rank was recorded.
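The scoring and ranking steps in this caption can be sketched roughly as follows. This is a minimal illustration, not the tool by Lever et al.: the function names, the toy sentence counts, and the placeholder gene labels are all hypothetical, and the denominator is assumed to be the sum of the two individual sentence counts, since the caption does not spell out its exact form.

```python
from typing import Dict, List, Optional


def association_strength(cooccur: int, count_p: int, count_g: int) -> float:
    """Ratio of co-occurrence sentences to sentences mentioning P or g.

    Assumption: the denominator is count_p + count_g. The caption only says
    "the total number of sentences where P and g appeared individually".
    """
    denom = count_p + count_g
    return cooccur / denom if denom else 0.0


def rank_genes(
    profile_count: int,
    gene_counts: Dict[str, int],
    cooccur_counts: Dict[str, int],
    top_n: int = 100,
) -> List[str]:
    """Return up to top_n genes, strongest association first."""
    scores = {
        gene: association_strength(cooccur_counts.get(gene, 0), profile_count, n)
        for gene, n in gene_counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_n]


def causal_rank(ranked: List[str], causal_gene: str) -> Optional[int]:
    """1-based rank of the causal gene, or None if it is outside the list."""
    return ranked.index(causal_gene) + 1 if causal_gene in ranked else None


# Toy counts (hypothetical, not taken from the paper).
ranked = rank_genes(
    profile_count=50,
    gene_counts={"GENE_A": 150, "GENE_B": 50, "GENE_C": 450},
    cooccur_counts={"GENE_A": 20, "GENE_B": 5, "GENE_C": 10},
)
print(ranked, causal_rank(ranked, "GENE_B"))
```

Returning `None` for a causal gene outside the retained list mirrors the caption's cutoff at the top 100 genes; how the original analysis handled such cases is not stated here.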
Table 1. An example disease-gene pair and its phenotypic profile extracted from IEMbase
| Disease name | Dopamine beta-hydroxylase deficiency |
|---|---|
| Associated gene | |
| Phenotypes* | Exercise intolerance |
| | Hypoglycemia |
| | Hypotension, orthostatic |
| | Dopamine (plasma) |
| | Epinephrine (plasma) |
| | Homovanillic acid, HVA (cerebrospinal fluid) |
| | Vanillylmandelic acid, VMA (urine) |
*Only select phenotypes are listed for brevity
Table 2. A summary of text-based phenotype analysis performance
| N | Top 1 | Top 5 | Top 10 | Top 20 | Top 100 |
|---|---|---|---|---|---|
| Number of disease-gene pairs ranked within top N predictions | 31 | 90 | 120 | 173 | 308 |
| McNemar’s test at N | p < 0.001 | p < 0.001 | p < 0.001 |
(a) % Success at N refers to the proportion of IEMbase disease-gene pairs whose causal genes ranked within the top N predictions.
(b) McNemar’s test at N refers to a paired comparison between the causal ranking and the baseline ranking, with a dichotomous trait defined as (1) disease-gene pairs whose causal genes ranked within the top N predictions or (2) disease-gene pairs whose causal genes did not rank within the top N predictions, where N = 1, 5, 10, 20, 100. The reported p-values were adjusted using the Bonferroni correction.
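The "% Success at N" quantity defined in the footnote can be recomputed from the counts in the table above. The sketch below assumes the denominator is the 563 disease-gene pairs mentioned in the Fig. 2 caption; that choice is an assumption made for illustration, and the variable and function names are hypothetical.

```python
# Assumption: all 563 IEMbase disease-gene pairs form the denominator.
TOTAL_PAIRS = 563

# Counts of pairs whose causal gene ranked within the top N (from the table).
TOP_N_COUNTS = {1: 31, 5: 90, 10: 120, 20: 173, 100: 308}


def success_rate(count: int, total: int = TOTAL_PAIRS) -> float:
    """Percentage of pairs whose causal gene ranked within the top N."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


rates = {n: success_rate(c) for n, c in TOP_N_COUNTS.items()}
for n, rate in rates.items():
    print(f"Top {n}: {rate:.1f}%")
```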
Table 3. An overview of impact on gene predictions by biochemical phenotypes vs clinical phenotypes
| N | Top 1 | Top 5 | Top 10 | Top 20 | Top 100 |
|---|---|---|---|---|---|
| Number of disease-gene pairs ranked within top N predictions based on biochemical phenotypes | 19 | 67 | 88 | 132 | 292 |
| Number of disease-gene pairs ranked within top N predictions based on clinical phenotypes | 2 | 12 | 22 | 37 | 132 |
| McNemar’s test at N | p < 0.001 | p < 0.001 | p < 0.001 | p < 0.001 |
(a) % Success at N refers to the proportion of IEMbase disease-gene pairs whose causal gene ranked within the top N predictions.
(b) McNemar’s test at N refers to a paired comparison between the biochemical ranking and the clinical ranking, with a dichotomous trait defined as (1) genes ranked within the top N predictions or (2) genes not ranked within the top N predictions, where N = 1, 5, 10, 20, 100. The reported p-values were adjusted using the Bonferroni correction.
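For readers unfamiliar with the paired comparison described in the footnote, the sketch below shows one common form of it: an exact (binomial) McNemar's test on the discordant pairs, followed by a Bonferroni adjustment. Whether the authors used the exact or the chi-square variant is not stated here, and the adjustment factor of 5 (one test per cutoff N) is an assumption. The discordant counts in the usage lines are hypothetical, since the table reports only marginal totals.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs in the top N under ranking A only (e.g. biochemical).
    c: pairs in the top N under ranking B only (e.g. clinical).
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical discordant counts for a single cutoff N.
p_raw = mcnemar_exact(b=40, c=5)
p_adj = bonferroni(p_raw, n_tests=5)  # assumed: 5 cutoffs tested
print(p_raw, p_adj)
```

The test depends only on the discordant pairs, which is why the marginal counts in the table alone are not enough to reproduce the reported p-values.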
Fig. 3. Distribution of ranks using only biochemical phenotypes vs using only clinical phenotypes. The x-axis represents the subset of phenotypes (biochemical-only and clinical-only). The y-axis represents the ranks of causal genes in the top N predictions. The distribution of ranks is shown as a violin plot (hour-glass figure). A scatter plot of the same distribution (dots) is overlaid on the violin plot to show the position of each data point. The text-based method predicted significantly more causal genes within the top N predictions (N = 1, 5, 10, 20, 100) using biochemical phenotypes than using clinical phenotypes (Table 3; McNemar’s test with Bonferroni correction).