ontologyX: a suite of R packages for working with ontological data.

Daniel Greene^1,2, Sylvia Richardson¹, Ernest Turro^1,2.

Abstract

Summary: Ontologies are widely used constructs for encoding and analyzing biomedical data, but the absence of simple and consistent tools has made exploratory and systematic analysis of such data unnecessarily difficult. Here we present three packages which aim to simplify such procedures. The ontologyIndex package enables arbitrary ontologies to be read into R, supports representation of ontological objects by native R types, and provides a parsimonious set of performant functions for querying ontologies. ontologySimilarity and ontologyPlot extend ontologyIndex with functionality for straightforward visualization and semantic similarity calculations, including statistical routines. Availability and Implementation: ontologyIndex, ontologyPlot and ontologySimilarity are all available on the Comprehensive R Archive Network website under https://cran.r-project.org/web/packages/. Contact: Daniel Greene dg333@cam.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28062448 PMCID: PMC5386138 DOI: 10.1093/bioinformatics/btw763

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Ontological annotation is now used to describe many different biological phenomena, including gene function (Gene Ontology Consortium) and human phenotype abnormality (Köhler), with many ontologies and ontological datasets publicly available. Accounting for the dependency between terms induced by the structure of their ontologies is vital for downstream statistical analysis and visualization. Software is therefore required that integrates ontologies and ontological data with mainstream statistical programming environments, so that the data can be analyzed effectively. The ontoCAT (Adamusiak) package enables simple querying and traversal of ontologies, but many of its key functions are slow and it requires a Java runtime installation. There are software packages enabling manipulation and plotting of graphs, for example graph (Gentleman) and Rgraphviz (Hansen) respectively, which can be used to view sections of ontologies. However, their functions are low level, which makes procedures such as plotting ontological term sets and fine-grained control of graphical parameters quite involved. There are R packages which provide procedures for computing semantic similarities between terms and sets of terms for specific ontologies (Fröhlich; Yu, 2015), but they do not support semantic similarity computation for arbitrary ontologies. Furthermore, the currently available methods are too slow to apply to large datasets. Here we present a suite of R packages, dubbed ‘ontologyX’, consisting of ontologyIndex, ontologyPlot and ontologySimilarity, which together address the issues described and form a consistent, interoperable set of tools that is readily extensible with additional ontological functionality.

2 Methods

ontologyIndex is an R package which was developed to provide a terse, low-level and easy-to-use set of functions for exploiting the structure of ontologies. Ontologies can be read into R from files in Open Biomedical Ontologies (OBO) format, with most commonly used ontologies available in this format on the OBO Foundry’s website (Smith). Ontologies which are only available in a Web Ontology Language (OWL) format may be used by first converting them into OBO format, for example using the ROBOT command line tool (Overton). A custom internal representation of ontologies, the ontology_index class, is used which stores properties of terms including term ancestors, enables fast ontological operations, and can be queried using base R functions. It uses native R types to represent ontological terms and sets of ontological terms, enabling simple integration with R’s features, high-level functions and other packages. It includes functions for performing set operations respecting the structure of the ontology, for example: exclude_descendants, which given term sets A and B, excludes terms in B and their descendants from set A; prune_descendants, which preserves terms in B which are ancestors of terms in A after applying exclude_descendants; and minimal_set, which maps a set of ontological terms onto a non-redundant set. ontologyIndex is lightweight, fast (see Table 1) and readily extended by other packages. For example, the R package gsEasy (Greene, 2016) facilitates gene-set enrichment analysis (Subramanian) using the get_ancestors function to propagate parent-child relations through the GO.

ontologyPlot extends ontologyIndex with functions which considerably ease the task of plotting sets of ontological terms and the ‘is-a’ relations between them, as the user need only pass an ontology_index and a vector of term IDs to the plotting function. It includes several functions for transforming sets of terms to distill the important features for particular visualizations. For example, given a set of ontologically annotated objects, the function remove_uninformative_terms removes terms whose children are annotated to the same objects, leading to simpler diagrams. Figure 1 demonstrates how ontologyPlot can be used to visualize GO annotation for QPCTL and CRNN, and the effect of using remove_uninformative_terms to simplify the figure. ontologyPlot utilizes the Rgraphviz package’s interface to the graphviz (Gansner and North, 2000) graphical layout engine. It further allows graphs to be exported in standard DOT format and does not constrain the graphical parameters, so users can take full advantage of options in any rendering software.
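The structure-aware set operations described above can be illustrated with a small language-agnostic sketch. The Python code below is not the ontologyIndex API: the toy ontology, the term IDs and the function signatures are assumptions made for illustration, and only the function names are borrowed from the text. It stores, for each term, the term plus all of its ancestors, which is one plausible reading of how a precomputed index makes these operations cheap.

```python
# Toy ontology (hypothetical IDs): each term maps to its direct 'is-a' parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],
}


def build_ancestors(parents):
    """Precompute, for every term, the term itself plus all of its ancestors."""
    cache = {}

    def visit(term):
        if term not in cache:
            found = {term}
            for parent in parents[term]:
                found |= visit(parent)
            cache[term] = found
        return cache[term]

    for term in parents:
        visit(term)
    return cache


ANCESTORS = build_ancestors(PARENTS)


def get_ancestors(terms):
    """Union of the ancestor sets (terms included) of a collection of terms."""
    result = set()
    for term in terms:
        result |= ANCESTORS[term]
    return result


def exclude_descendants(roots, terms):
    """Drop every term that is one of `roots` or descends from one of them."""
    roots = set(roots)
    return [t for t in terms if not (ANCESTORS[t] & roots)]


def minimal_set(terms):
    """Keep only terms that are not a proper ancestor of another term in the set."""
    terms = list(dict.fromkeys(terms))  # de-duplicate while keeping order
    redundant = set()
    for t in terms:
        redundant |= ANCESTORS[t] - {t}
    return [t for t in terms if t not in redundant]
```

With the ancestor sets precomputed once, each query reduces to set intersections and unions, so no graph traversal is needed at query time.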

Table 1

Mean execution time for retrieving descendants and ancestors for individual terms in the Human Phenotype Ontology

	Descendants (ms)	Ancestors (ms)
ontoCAT	11.99	12.75
ontologyIndex	0.38	0.14

Fig. 1

Plot of terms descending from the cellular_component term in the GO, extracted using the exclude_descendants function from ontologyIndex, for genes QPCTL and CRNN using ontologyPlot. The left panel shows the full set of ancestral terms used in the annotation of the genes, while the right panel shows only those remaining after remove_uninformative_terms has been called. Terms annotated to both genes, either implicitly or explicitly, are shown in light blue, while those annotated only to QPCTL and only to CRNN are shown in green and purple respectively. The size of the nodes has been set to be proportional to the information content (i.e. negative log frequency) of the terms with respect to gene annotation downloaded from the GO website.

Semantic similarity quantifies similarity between ontological terms and sets of ontological terms. ontologySimilarity extends ontologyIndex to enable similarities between ontological objects to be computed given an ontology_index and sets of term IDs. It facilitates the calculation of similarity at three levels: between ontological terms (ID strings), between ontologically annotated objects (ID string vectors), and within groups of ontologically annotated objects (lists of ID string vectors). It implements Resnik’s (Resnik) and Lin’s (Lin, 1998) expressions for the similarity of terms. Unlike other packages for calculating semantic similarities, ontologySimilarity does not depend on static, pre-built SQLite databases or Bioconductor annotation packages and works with arbitrary term annotations. Furthermore, it offers inferential procedures such as get_sim_p, which assesses the strength of similarity between groups of objects (Westbury). Flexible functions facilitate use in complex methods, for example as in the R package SimReg (Greene), which implements a semantic similarity based regression algorithm.
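For readers unfamiliar with the term-level measures named above, the following Python sketch shows the standard definitions: information content as the negative log frequency of a term among annotated objects, Resnik similarity as the information content of the most informative common ancestor, and Lin similarity as twice that value divided by the summed information content of the two terms. The toy ontology, the annotations and all function names are assumptions for illustration; this is not the ontologySimilarity interface, and returning 0 when both terms have zero information content is a convention chosen here.

```python
import math

# Toy ontology (hypothetical IDs): each term maps to its direct 'is-a' parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],
}

# Hypothetical annotated objects and their explicitly annotated terms.
ANNOTATIONS = {
    "g1": ["A1"],
    "g2": ["A2"],
    "g3": ["AB"],
    "g4": ["B"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def information_content(annotations):
    """IC(t) = -log(fraction of objects annotated, explicitly or implicitly, with t)."""
    n_objects = len(annotations)
    counts = {}
    for terms in annotations.values():
        implied = set()
        for term in terms:
            implied |= ancestors(term)  # annotation propagates to ancestors
        for term in implied:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log(n_objects / count) for term, count in counts.items()}


IC = information_content(ANNOTATIONS)


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC[t] for t in common)


def lin(t1, t2):
    """Resnik similarity normalized by the information content of both terms."""
    denominator = IC[t1] + IC[t2]
    if denominator == 0:
        return 0.0  # convention chosen here for two zero-information terms
    return 2 * resnik(t1, t2) / denominator
```

In this toy example the siblings A1 and A2 share the ancestor A, so their similarity is driven by how rarely A is annotated, whereas A1 and B share only the root and have similarity zero under both measures.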
All similarity routines are written in C++ and called from R (Eddelbuettel), and the user can balance performance and memory usage for downstream analysis by selecting whether to store similarities between terms or term sets, or store an index for fast similarity lookups. We compared the performance of ontologySimilarity against other packages offering functions for calculating pairwise term and gene similarities, the results of which are shown in Table 2. The results indicate that ontologySimilarity executes substantially faster, and suggest tangible advantages for use with large datasets.
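The inferential step mentioned above, assessing whether a group of objects is more similar than expected, can be sketched as a generic random-sampling test over a stored matrix of pairwise object similarities. The Python code below is an assumption-laden illustration and not the implementation of get_sim_p: the group statistic (mean pairwise similarity), the sampling scheme, the function names and the toy matrix are all chosen here for clarity.

```python
import random
from itertools import combinations

# Hypothetical symmetric similarity matrix for six objects:
# objects 0-2 are similar to one another, all other pairs are not.
SIM = [[0.75 if (i < 3 and j < 3) else 0.25 for j in range(6)] for i in range(6)]


def group_similarity(sim, group):
    """Mean pairwise similarity among the members of a group."""
    pairs = list(combinations(group, 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def permutation_p_value(sim, group, n_samples=1000, seed=0):
    """Estimate how often a random group of the same size is at least as similar."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = group_similarity(sim, group)
    population = range(len(sim))
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, len(group))
        if group_similarity(sim, sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_samples + 1)


p_similar = permutation_p_value(SIM, [0, 1, 2])
p_dissimilar = permutation_p_value(SIM, [3, 4, 5])
```

Because the test only reads entries of a precomputed matrix, each random group costs a handful of lookups, which illustrates why storing similarities up front can pay off when many groups are evaluated.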

Table 2

Execution times for computing pairwise similarity matrices for 1000 randomly selected GO terms and 100 randomly selected gene GO annotation sets using Lin's expression for term similarity

	Term sim (s)	Gene sim (s)
GOSim	1075.43	298.34
GOSemSim	1.71	116.72
ontologySimilarity	0.31	0.06
ontologySimilarity (indexed)		0.04

3 Conclusion

The key advantages of ontologyIndex are that it can read in arbitrary ontologies, integrates naturally with R, and provides a solid base for extension. ontologyPlot enables uniquely simple and aesthetically pleasing visualization of ontological terms and ontological annotation with a wide variety of graphical options. ontologySimilarity provides fast and flexible semantic similarity functionality for ontological objects, including assessment of statistical significance, and is suitable for application to high-throughput datasets. Software: The following versions of software packages were used to generate the results presented in this manuscript: ontologyIndex 2.2, ontologyPlot 1.4, ontologySimilarity 2.1, GOSim 1.11, GOSemSim 1.99.4 and ontoCAT 1.26.0.

Funding

This work was supported by National Institute for Health Research award RG65966 (D.G. and E.T.) and the Medical Research Council programme grant MC_UP_0801/1 (D.G. and S.R.). Conflict of Interest: none declared.

10 in total

1. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration.

Authors: Barry Smith; Michael Ashburner; Cornelius Rosse; Jonathan Bard; William Bug; Werner Ceusters; Louis J Goldberg; Karen Eilbeck; Amelia Ireland; Christopher J Mungall; Neocles Leontis; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Richard H Scheuermann; Nigam Shah; Patricia L Whetzel; Suzanna Lewis
Journal: Nat Biotechnol Date: 2007-11 Impact factor: 54.908

2. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.

Authors: Guangchuang Yu; Fei Li; Yide Qin; Xiaochen Bo; Yibo Wu; Shengqi Wang
Journal: Bioinformatics Date: 2010-02-23 Impact factor: 6.937

3. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis.

Authors: Guangchuang Yu; Li-Gen Wang; Guang-Rong Yan; Qing-Yu He
Journal: Bioinformatics Date: 2014-10-17 Impact factor: 6.937

4. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

5. OntoCAT--simple ontology search and integration in Java, R and REST/JavaScript.

Authors: Tomasz Adamusiak; Tony Burdett; Natalja Kurbatova; K Joeri van der Velde; Niran Abeygunawardena; Despoina Antonakaki; Misha Kapushesky; Helen Parkinson; Morris A Swertz
Journal: BMC Bioinformatics Date: 2011-05-29 Impact factor: 3.307

6. Human phenotype ontology annotation and cluster analysis to unravel genetic defects in 707 cases with unexplained bleeding and platelet disorders.

Authors: Sarah K Westbury; Ernest Turro; Daniel Greene; Claire Lentaigne; Anne M Kelly; Tadbir K Bariana; Ilenia Simeoni; Xavier Pillois; Antony Attwood; Steve Austin; Sjoert Bg Jansen; Tamam Bakchoul; Abi Crisp-Hihn; Wendy N Erber; Rémi Favier; Nicola Foad; Michael Gattens; Jennifer D Jolley; Ri Liesner; Stuart Meacham; Carolyn M Millar; Alan T Nurden; Kathelijne Peerlinck; David J Perry; Pawan Poudel; Sol Schulman; Harald Schulze; Jonathan C Stephens; Bruce Furie; Peter N Robinson; Chris van Geet; Augusto Rendon; Keith Gomez; Michael A Laffan; Michele P Lambert; Paquita Nurden; Willem H Ouwehand; Sylvia Richardson; Andrew D Mumford; Kathleen Freson
Journal: Genome Med Date: 2015-04-09 Impact factor: 11.117

7. Phenotype Similarity Regression for Identifying the Genetic Determinants of Rare Diseases.

Authors: Daniel Greene; Sylvia Richardson; Ernest Turro
Journal: Am J Hum Genet Date: 2016-02-25 Impact factor: 11.025

8. GOSim--an R-package for computation of information theoretic GO similarities between terms and gene products.

Authors: Holger Fröhlich; Nora Speer; Annemarie Poustka; Tim Beissbarth
Journal: BMC Bioinformatics Date: 2007-05-22 Impact factor: 3.169

9. Gene Ontology Consortium: going forward.

Authors:
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

10. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data.

Authors: Sebastian Köhler; Sandra C Doelken; Christopher J Mungall; Sebastian Bauer; Helen V Firth; Isabelle Bailleul-Forestier; Graeme C M Black; Danielle L Brown; Michael Brudno; Jennifer Campbell; David R FitzPatrick; Janan T Eppig; Andrew P Jackson; Kathleen Freson; Marta Girdea; Ingo Helbig; Jane A Hurst; Johanna Jähn; Laird G Jackson; Anne M Kelly; David H Ledbetter; Sahar Mansour; Christa L Martin; Celia Moss; Andrew Mumford; Willem H Ouwehand; Soo-Mi Park; Erin Rooney Riggs; Richard H Scott; Sanjay Sisodiya; Steven Van Vooren; Ronald J Wapner; Andrew O M Wilkie; Caroline F Wright; Anneke T Vulto-van Silfhout; Nicole de Leeuw; Bert B A de Vries; Nicole L Washingthon; Cynthia L Smith; Monte Westerfield; Paul Schofield; Barbara J Ruef; Georgios V Gkoutos; Melissa Haendel; Damian Smedley; Suzanna E Lewis; Peter N Robinson
Journal: Nucleic Acids Res Date: 2013-11-11 Impact factor: 16.971
