Context-specific ontology integration: a bayesian approach.

Kshitij Marwah, Dustin Katzin, Amin Zollanvari, Natalya F Noy, Marco Ramoni, Gil Alterovitz.

Abstract

We introduce a principled computational framework and methodology for automated discovery of context-specific functional links between ontologies. Our model leverages disparate free-text literature resources to score the model of dependency linking two terms under a context against their model of independence. We identify linked terms as those having a significant Bayes factor (p < 0.01). To scale our algorithm over massive ontologies, we propose a heuristic pruning technique as an efficient algorithm for inferring such links. We have applied this method to translationalize the Gene Ontology to all other ontologies available at the National Center for Biomedical Ontology (NCBO) BioPortal under the context of the Human Disease Ontology. Our results show that, in addition to broadening the scope of hypotheses for researchers, our work can potentially be used to explore the continuum of relationships among ontologies to guide various biological experiments.

Year: 2012 PMID: 22779057 PMCID: PMC3392068

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Every year, over 400,000 new articles reportedly enter the biomedical literature [1]. This staggering growth of biomedical findings has created an unprecedented corpus of knowledge that is impossible to explore with traditional means of literature consultation and database searches. This information overload has motivated the development of structured information repositories that organize biomedical findings according to hierarchical ontologies. Ontologies find themselves at the heart of two major complementary activities in biomedical research. On the one hand, communities of researchers create and maintain these ontologies to represent different types of entities and relations in different domains of biomedicine. On the other hand, biomedical experimentalists use ontologies to annotate data in order to facilitate data integration and translational discoveries. This activity is greatly intensified by the development of high-throughput experimental platforms such as gene expression microarrays [2], SNP microarrays [3] and next generation sequencing platforms [4]. The rise of such ontological organization has created a new problem: the proliferation of disparate and seemingly unrelated biomedical ontologies. For example, the National Center for Biomedical Ontology’s (NCBO) BioPortal [5] provides over 200 such ontologies to researchers. These ontologies are generally used by scientists to annotate their data, but which ontologies to use and how they relate to each other is generally unclear. What is needed is the integration of these conceptualizations in a principled fashion, a “grand unification” of biological terms. It has been argued [6] that the integration of these available ontologies will have a tremendous impact on the advancement of biomedical sciences. These integrated ontologies will provide a complete basis of biomedical knowledge representation and act as a foundation for inference on new biomedical data.

Furthermore, a quantitative approach to integration would make the complex space of ontologies easier for researchers to navigate by guiding them to the numerous links among ontologies and ranking those links according to a principled metric, thus making the discovery process faster and more efficient. To date, the mapping and integration of ontologies in the biomedical domain has relied on discovering links between syntactically and semantically similar terms across ontologies [7]. Such an approach can relate terms with similar meanings but would not deduce any relationships between seemingly disparate functional spaces such as diseases, drugs and anatomy. Approaches to ontology integration in the data integration community use methods ranging from machine learning [8] to graph matching [9] to natural language processing [10]. These methods again inherently focus on mapping synonyms across ontologies. Recently, the Ontology Alignment Evaluation Initiative [11] was launched as a competition between alignment algorithms on a given standardized dataset. These methods generally cater to the traditional definition of ontology alignment, which considers synonyms. Even the instance-based mapping methods in these initiatives aim to converge two ontologies that represent the same knowledge base. For domains as disparate as those covered by biomedical ontologies, such methods do not work; moreover, their computational complexity makes them infeasible for vocabularies of this scale. Other approaches infer these links by manual curation, which is a tedious and labor-intensive task that scales extremely poorly. Here we propose a novel computational and methodological framework for context-specific integration of biomedical ontologies using free-text literature analysis.

We model context specificity using another ontology and derive context-dependent functional links between ontological concepts occurring as phrases in free-text literature. We cache massive amounts of literature data to enable efficient counts of co-occurring ontology terms. Based on these statistics, the penalized likelihoods of the models of dependence and independence are computed by applying the well-known Bayesian information criterion [12] over a context-sensitive model scoring function. We account for scalability via a depth-first branch and bound heuristic technique that prunes sub-graphs that do not yield significant links. We believe that such a methodological approach would turn machine-processable ontologies into a single landscape of integrated biomedical concepts and annotations. This would enable researchers to bring the entire power of established biomedical knowledge to bear on each single finding.

Methods

Caching Sufficient Statistics

We gather raw free-text literature from disparate sources and drive our concept search by finding exact matches of ontology terms. We use MGREP [13], the concept recognition tool that also powers the NCBO Annotator [14], to efficiently find occurrences of concepts in published literature and thus annotate the documents with those concepts. This allows us to leverage a consolidated vocabulary (of about 4 million ontology concepts) to temper the problem of missing synonyms and term permutations. We also use a pre-computed index containing the transitive closure of ontology terms to semantically expand the annotations, propagating them up the hierarchy of the ontology. The document annotations and the concepts are reverse indexed using a disk-based b-tree structure, an approach commonly used in information retrieval systems. We use Lucene [15], an open source high-powered information retrieval engine, to create and store the b-tree structure. To answer conjunctive queries for efficient counting, we use a bitmap hash-based filter over the stored index. Our integrated pipeline is shown in Figure 1.

Figure 1:

Pipeline used for caching sufficient statistics for model scoring.
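
The conjunctive counting step can be illustrated with a small in-memory stand-in. The sketch below is not the MGREP and Lucene pipeline itself: the class name `ConceptIndex` and its methods are hypothetical, Python integers play the role of bitmaps, and a dictionary replaces the disk-based b-tree. It only shows how keeping one bitmap per concept lets a conjunctive query be answered with a bitwise AND followed by a bit count.

```python
from collections import defaultdict


class ConceptIndex:
    """Toy reverse index: one bitmap (a Python int) per concept.

    Hypothetical illustration only; bit i of a concept's bitmap is set
    when document i is annotated with that concept.
    """

    def __init__(self):
        self._bitmaps = defaultdict(int)
        self._num_docs = 0

    def add_document(self, concepts):
        """Register one annotated document and return its id."""
        doc_id = self._num_docs
        for concept in set(concepts):
            self._bitmaps[concept] |= 1 << doc_id
        self._num_docs += 1
        return doc_id

    def count_all(self, *concepts):
        """Count documents annotated with every one of the given concepts."""
        mask = (1 << self._num_docs) - 1  # start from "all documents"
        for concept in concepts:
            # .get avoids creating empty entries for unknown concepts
            mask &= self._bitmaps.get(concept, 0)
        return bin(mask).count("1")


index = ConceptIndex()
index.add_document(["5-fluorouracil", "cell cycle", "colon cancer"])
index.add_document(["5-fluorouracil", "colon cancer"])
index.add_document(["cell cycle"])
print(index.count_all("5-fluorouracil", "colon cancer"))
```

Because each query reduces to a few integer operations, the same index can serve the many pairwise and triple-wise counts that the scoring step requires.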

Alignment Algorithm

To compute context-dependent links between ontology terms, we have developed a novel technique that relies on statistical analysis of the literature. Our algorithm uses the observed co-occurrence of terms in the literature to infer the relationship between two terms A and B in the context of the ontology term C. As an example, the term A can be the ontology concept 5-fluorouracil, which we want to align with the term B, cell-cycle, under the context of term C, say colon cancer. To do so, it builds a contingency table like the one in Figure 3, collecting the frequencies of co-occurrence of the two terms in the literature: a 2 × 2 table where n++ is the number of papers in which the two terms appear together, n+− is the number of papers in which A appears but B does not, n−+ is the number of papers in which B appears and A does not, and n−− is the number of papers in which neither appears, all in the context of term C.
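
As a minimal sketch, the four cells can be derived from four conjunctive counts by inclusion and exclusion. The sketch assumes that "in the context of term C" means restricting attention to documents annotated with C, which is one reading of the text rather than something it states outright. The helper names `make_counter` and `contingency_table` are hypothetical, and the naive counter stands in for whatever cached index answers the queries.

```python
def make_counter(docs):
    """Return a function counting documents that contain all given concepts.

    docs is a list of sets of concept names (a naive stand-in for an index).
    """
    def count(*concepts):
        return sum(1 for doc in docs if all(c in doc for c in concepts))
    return count


def contingency_table(count, a, b, c):
    """Return (n++, n+-, n-+, n--) for terms a and b among documents with c."""
    n_c = count(c)           # all documents in the context
    n_ac = count(a, c)       # a present (with or without b)
    n_bc = count(b, c)       # b present (with or without a)
    n_abc = count(a, b, c)   # both present
    n_pp = n_abc
    n_pm = n_ac - n_abc
    n_mp = n_bc - n_abc
    n_mm = n_c - n_ac - n_bc + n_abc
    return n_pp, n_pm, n_mp, n_mm


docs = [
    {"5-fluorouracil", "cell cycle", "colon cancer"},
    {"5-fluorouracil", "colon cancer"},
    {"cell cycle", "colon cancer"},
    {"colon cancer"},
    {"5-fluorouracil", "cell cycle"},  # outside the context, so ignored
]
count = make_counter(docs)
print(contingency_table(count, "5-fluorouracil", "cell cycle", "colon cancer"))
```

Only four queries are needed per triple, which is why fast conjunctive counting matters for the overall running time.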

Figure 3:

Graph depicting exponential reduction in running time as the minimum threshold for pruning increases.

Our method uses the Bayesian information criterion to compute the penalized likelihood of the model of dependence A ⇔ B | C (where the two terms are related) and of the model of independence A ⊥ B | C (where the two terms are unrelated), as given in Equation 1, where N is the number of observations, k is the number of parameters of the model, and MLL is the marginal log likelihood of the model. We assume that the models of dependence and independence are equally likely a priori, in which case maximizing the posterior probability coincides with maximizing the marginal likelihood. The marginal log likelihoods of the model of dependency and of the model of independence are expressed in terms of the gamma function Γ, the co-occurrence frequencies n++, n+−, n−+ and n−− described above, the prior precision α, and the prior precision per term αk = α/|T|, where |T| is the number of terms in the dependency; in our particular case, |T| = 2. We use α = 4 for 2 × 2 tables, so that the initial prior precision puts 1 in each cell, maintaining the uniformity of the distribution and the lowest possible precision so as to minimize the bias introduced by the prior.

By plugging the marginal log likelihood into Equation 1, we obtain respectively the penalized likelihood of dependency, BIC (A ⇔ B | C), where the two terms are linked, and that of the model of independence, BIC (A ⊥ B | C), where the two terms are not linked. The final score is the Bayes factor, which estimates how many times more likely the model linking terms A and B in the context of C is than the model in which the terms are not related. We use the pipeline explained in the previous section to efficiently count the co-occurrence frequencies needed to compute the Bayes factor. Context-dependent functional links are then selected as the ones having a Bayes factor greater than 20 (p < 0.01).
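
A minimal sketch of this scoring follows, under assumptions that go beyond the text. It assumes the standard Dirichlet-multinomial marginal likelihood with a symmetric prior (α/4 = 1 per cell for the dependence model, α/2 per margin entry for the independence model), the Schwarz form of the penalty, MLL − (k/2) ln N, with k = 3 free parameters for dependence and k = 2 for independence, and a Bayes factor taken as the exponential of the difference between the two penalized scores. The function names are hypothetical, and the exact form of Equation 1 may differ from this one.

```python
from math import lgamma, log


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of counts under a symmetric Dirichlet prior
    whose total precision alpha is split evenly over the categories."""
    per_cell = alpha / len(counts)
    total = sum(counts)
    result = lgamma(alpha) - lgamma(alpha + total)
    for c in counts:
        result += lgamma(per_cell + c) - lgamma(per_cell)
    return result


def mll_dependence(n_pp, n_pm, n_mp, n_mm, alpha=4.0):
    # One multinomial over the four joint cells (alpha / 4 per cell).
    return log_dirichlet_multinomial([n_pp, n_pm, n_mp, n_mm], alpha)


def mll_independence(n_pp, n_pm, n_mp, n_mm, alpha=4.0):
    # Product of two binomial margins (alpha / 2 per margin entry).
    margin_a = [n_pp + n_pm, n_mp + n_mm]
    margin_b = [n_pp + n_mp, n_pm + n_mm]
    return (log_dirichlet_multinomial(margin_a, alpha)
            + log_dirichlet_multinomial(margin_b, alpha))


def log_bayes_factor(n_pp, n_pm, n_mp, n_mm, alpha=4.0):
    """Penalized score of dependence minus that of independence (natural log)."""
    n = n_pp + n_pm + n_mp + n_mm
    if n <= 0:
        raise ValueError("at least one observation is required")
    # Assumed penalty: (k / 2) * ln N, with k = 3 vs. k = 2 free parameters.
    dependence = mll_dependence(n_pp, n_pm, n_mp, n_mm, alpha) - 1.5 * log(n)
    independence = mll_independence(n_pp, n_pm, n_mp, n_mm, alpha) - 1.0 * log(n)
    return dependence - independence


# A strongly associated table versus a perfectly balanced one.
print(log_bayes_factor(50, 5, 5, 50) > log(20))
print(log_bayes_factor(25, 25, 25, 25) < 0)
```

Working in log space keeps the gamma-function terms from overflowing for large corpora; a link would then be reported when the log score exceeds ln 20.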

Heuristic Pruning Using Depth First Branch and Bound

To apply our algorithm we would, in the worst case, have to compare all possible triples of terms across the ontologies. Such an approach would work for small ontologies but will not scale up to massive ontologies, even with cached statistics. We apply a depth-first branch and bound algorithm to prune away ontology sub-graphs where the likelihood of finding functional links is extremely low. We use the Bayes factor as a scoring cue to find such sub-graphs. We build on the empirical observation that if the Bayes factor for an ontology concept A mapped to another ontology concept B under the context C is less than a user-set threshold ɛ, then the Bayes factor for mappings between the majority of A’s children and the concept B under C would also be less than ɛ. The intuition behind this observation is that any instance of a specific concept, say a paper, is also an instance of a more general concept; this is the subsumption property that the taxonomy structure of an ontology follows. Hence, if not enough evidence is found for linking A to B under C, as demonstrated by the computed Bayes factor, it follows that a major fraction of A’s children would also not have enough evidence of a map to B under C. Extending the empirical observation to span the sub-graphs under A and B in the context of the sub-graph under C lets us use the metric to prune away insignificant portions of the ontological graph. Rather than giving theoretical bounds on the likelihood of matches, we experimentally analyze the effect of the threshold ɛ on the running time and the number of false negatives. Our results show an expected exponential reduction in the computations needed to infer functional links, together with a linear increase in the number of false negatives when we prune the full graph.

We implement the depth-first branch and bound algorithm, which allows us to compute functional links with much greater efficiency at the cost of losing some alignments. The minimum threshold can be controlled by the user, depending on the efficacy of the results required. A suitable threshold can be determined empirically, by running the algorithm with different thresholds and observing the occurrence of “false positive” links. Once this threshold is chosen, we say that if the Bayes factor is greater than ɛ (or the corresponding desired significance level via its p-value), then a high-confidence link exists between the concepts.
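
The pruning idea can be sketched as follows. This is a simplified illustration, not the implementation used here: it walks only the hierarchy under A for a fixed B and C, whereas the text extends pruning to the sub-graphs under all three terms. The function `find_links`, its parameters, and the toy scores are hypothetical; `score` stands for any function returning the link score of a concept against the fixed B under C.

```python
def find_links(root, children, score, epsilon, link_threshold):
    """Depth-first traversal that skips the sub-graph below any concept
    whose score falls under the pruning threshold epsilon.

    Returns the accepted links and the number of concepts actually scored.
    """
    links = []
    evaluated = 0
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:          # ontologies are DAGs: score each node once
            continue
        seen.add(node)
        s = score(node)
        evaluated += 1
        if s < epsilon:
            continue              # prune: descendants are never scored
        if s > link_threshold:
            links.append((node, s))
        # reversed() so that children are visited in their listed order
        stack.extend(reversed(children.get(node, ())))
    return links, evaluated


children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
toy_scores = {"root": 5, "a": 30, "a1": 25, "a2": 1, "b": -2, "b1": 40}

links, evaluated = find_links("root", children, toy_scores.get,
                              epsilon=0, link_threshold=20)
print(links, evaluated)
```

In this toy hierarchy the concept `b1` is never scored because its parent falls below the pruning threshold, even though its own score would have passed. That is exactly the kind of false negative the threshold trades against running time.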

Results

In all, we obtain about 200 ontologies from the National Center for Biomedical Ontology’s BioPortal interface. For caching sufficient statistics, we obtain the dictionary of all available ontology concepts (4,153,358 terms) for searching in the corpora. We then create our b-tree index on a corpus containing the following:

Adverse Event Reporting System [16] database: about 774,606 records
ArrayExpress [17]: 9,281 records
BioSiteMaps [18]: 1,013 records
caNanoLab [19]: 444 records
Conserved Domain Database [20]: 34,735 records
Clinical Trials [21] database: 75,828 records
DrugBank [22]: 4,774 records
Database of Genotypes and Phenotypes [23]: 184 records
Gene Expression Omnibus [24]: 15,968 records
Stanford Microarray Database [25]: 16,148 records
Published articles in PubMed [26]: about 100,000 records

Each element of the corpus contains the full abstract of the corresponding published article. We then apply our proposed algorithm, together with the heuristic pruning technique described earlier, to integrate the Gene Ontology (containing 24,987 concepts) with all available ontologies in BioPortal under the context of the Human Disease Ontology (containing 12,033 concepts). The threshold for a significant link was set to a Bayes factor greater than twenty (p < 0.01), while the threshold for pruning was set to a Bayes factor less than zero. An example of such a network is shown in Figure 5. This is part of a full network containing about 2000 relevant links. Figure 6 shows another network in which we switch the context from Human Disease to Minimal Anatomical Terminology. In this way, our framework can take any two ontologies and compute scalable mappings under any given context.

Figure 5:

A part of mapping network showing links between Gene Ontology (green circles) and Minimal Anatomical Terminology (blue circles) under the context of Human Disease (red links).

Figure 6:

Portion of network showing context-specific links between Gene Ontology (blue circles) and Human Disease (green circles) in context of Minimal Anatomical Terminology (red links).

To validate the soundness of our context-sensitive mappings, we take a random sample of about a thousand high information content links [27] having a significantly high Bayes factor. We repeat the experiment about ten times and use published literature and a domain expert in the field of molecular biology to validate these links. The number of repetitions is constrained by the time the domain expert had available. The precision of the algorithm using this approach was found to be about 0.78. We further validate the completeness of our mappings by again taking a random sample of about a thousand high information content triplets of nodes. We then use published literature and the domain expert to predict links amongst these concepts. These predicted links are then matched against the ones inferred by our algorithm to obtain recall. We repeat the experiment about ten times to obtain the recall of the algorithm, which was found to be about 0.91. This corresponds to an F-measure of about 0.83. These numbers underscore the robustness and quality of our inferred links.
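
For reference, the F-measure combines precision and recall as their harmonic mean. The short sketch below uses the standard definition; the function name `f_measure` and the `beta` parameter are illustrative and not taken from the text. Applied to the rounded precision and recall above, it gives a value close to the reported figure, and the small difference is plausibly due to rounding of the inputs.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_measure(0.78, 0.91), 2))
```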

Discussion

This work is based on data that is changing and evolving over time. New data enters the biomedical literature and ontological databases constantly. Thus, conclusions and links can change over time. This framework provides an efficient and scalable algorithm to incorporate the big data prevalent in the biomedical domain. A limitation of this analysis is its inability to differentiate between positive and negative correlation: although nodes may be connected, their type of association is not computed. Incorporating some shallow semantics from the natural language processing domain would help here. A sliding window that detects relationships in conjunction with ontology concepts could be implemented to classify these alignments. A better algorithm to incorporate and update new data would be a useful addition, accompanied by a graphical visualization toolkit to succinctly map such links. We only consider textual abstracts when caching statistics on ontology terms. Expanding to full-text articles and incorporating varied datasets such as images and experimental data would be interesting and challenging. A further extension of this framework to propagate annotations over these links and perform enrichment analysis on ontologies other than the Gene Ontology would be extremely useful. Another exciting analysis for future work would be to look at the evolution of the derived links over time as biological knowledge expands. Such a network can provide insights into how different biological terms relate to each other as advancements are made and new knowledge is added. These networks can also be used to detect and predict clusters of influence and propagation. Combining these links into a continuous bridge between different domains can help guide biological experiments and analyses.

Conclusion

Our framework and algorithms combine disparate sources of data for the discovery of relationships between ontologies. Unlike prior work, our approach tries to find context-specific functional links between ontologies, which is not possible if only semantically relevant links are considered. By developing a novel algorithm, we identified links across ontologies that can be used for guided expansion of various biomedical experiments. We then augmented this algorithm with heuristic approaches for scaling up to massive data sizes with marginal loss in the functional quality of links. We further validated the utility of our algorithm by manual verification with a domain expert, increasing confidence in our methodological approach. Our work provides a new approach for translationalizing diverse functional spaces in the biomedical domain, making this huge space of knowledge amenable to researchers.

References

M K Kerr; G A Churchill. Experimental design for gene expression microarrays. Biostatistics. 2001-06.

Victor Maojo; Fernando Martin-Sanchez; Casimir Kulikowski; Alfonso Rodriguez-Paton; Martin Fritts. Nanoinformatics and DNA-based computing: catalyzing nanomedicine. Pediatr Res. 2010-05.

Ann-Christine Syvänen. Toward genome-wide SNP genotyping. Nat Genet. 2005-06.

Conrad Plake; Torsten Schiemann; Marcus Pankalla; Jörg Hakenberg; Ulf Leser. AliBaba: PubMed as a graph. Bioinformatics. 2006-07-26.

Matthew D Mailman; Michael Feolo; Yumi Jin; Masato Kimura; Kimberly Tryka; Rinat Bagoutdinov; Luning Hao; Anne Kiang; Justin Paschall; Lon Phan; Natalia Popova; Stephanie Pretel; Lora Ziyabari; Moira Lee; Yu Shao; Zhen Y Wang; Karl Sirotkin; Minghong Ward; Michael Kholodov; Kerry Zbicz; Jeffrey Beck; Michael Kimelman; Sergey Shevelev; Don Preuss; Eugene Yaschenko; Alan Graeff; James Ostell; Stephen T Sherry. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007-10.

Gil Alterovitz; Michael Xiang; David P Hill; Jane Lomax; Jonathan Liu; Michael Cherkassky; Jonathan Dreyfuss; Chris Mungall; Midori A Harris; Mary E Dolan; Judith A Blake; Marco F Ramoni. Ontology engineering. Nat Biotechnol. 2010-02.

Lars J Jensen; Peer Bork. Ontologies in quantitative biology: a basis for comparison, integration, and discovery. PLoS Biol. 2010-05-25.

Alvis Brazma; Helen Parkinson; Ugis Sarkans; Mohammadreza Shojatalab; Jaak Vilo; Niran Abeygunawardena; Ele Holloway; Misha Kapushesky; Patrick Kemmeren; Gonzalo Garcia Lara; Ahmet Oezcimen; Philippe Rocca-Serra; Susanna-Assunta Sansone. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003-01-01.

Aron Marchler-Bauer; John B Anderson; Farideh Chitsaz; Myra K Derbyshire; Carol DeWeese-Scott; Jessica H Fong; Lewis Y Geer; Renata C Geer; Noreen R Gonzales; Marc Gwadz; Siqian He; David I Hurwitz; John D Jackson; Zhaoxi Ke; Christopher J Lanczycki; Cynthia A Liebert; Chunlei Liu; Fu Lu; Shennan Lu; Gabriele H Marchler; Mikhail Mullokandov; James S Song; Asba Tasneem; Narmada Thanki; Roxanne A Yamashita; Dachuan Zhang; Naigong Zhang; Stephen H Bryant. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2008-11-04.