Using a Novel Ontology to Inform the Discovery of Therapeutic Peptides from Animal Venoms.

Joseph D Romano¹, Nicholas P Tatonetti¹.

Abstract

Venoms and venom-derived compounds constitute a rich and largely unexplored source of potentially therapeutic compounds. To facilitate biomedical research, it is necessary to design a robust informatics infrastructure that will allow semantic computation of venom concepts in a standardized, consistent manner. We have designed an ontology of venom-related concepts - named Venom Ontology - that reuses an existing public data source: UniProt's Tox-Prot database. In addition to describing the ontology and its construction, we have performed three separate case studies demonstrating its utility: (1) An exploration of venom peptide similarity networks within specific genera; (2) A broad overview of the distribution of available data among common taxonomic groups spanning the known tree of life; and (3) An analysis of the distribution of venom complexity across those same taxonomic groups. Venom Ontology is publicly available on BioPortal at http://bioportal.bioontology.org/ontologies/CU-VO.

Year: 2016 PMID: 27570672 PMCID: PMC5001765

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

1. Introduction

Venoms are complex mixtures of biological macromolecules and cofactors used for either offensive or defensive purposes. In contrast to poisons (which are harmful only when swallowed), venoms can be introduced to their targets via injection, surface abrasion, or topical application. Most known venoms exert their effect using specialized proteins that have a strong and specific effect on target cells [1]. A vast array of animal species is known to be venomous, and many more are predicted to be venomous. Additionally, individual venoms usually consist of a diverse array of many components. As an example, many species of cone snail (genus Conus) are known to contain upwards of 100 individual peptide components in their venoms. Although venoms are intended to cause harm to prey or to a potential threat, humans have used specific venoms for therapeutic purposes for millennia. Since the advent of modern medicine, researchers have validated many of these uses and have begun to uncover the fundamental mechanisms by which they treat diseases and aid in diagnoses. Some examples of successful venom-derived pharmaceuticals are: ziconotide (from the cone snail Conus magus - treats severe chronic pain) [2], exenatide (from the Gila monster, Heloderma suspectum - treats both noninsulin-dependent diabetes mellitus and obesity) [3], and bombesin (from the European fire-bellied toad Bombina bombina - treats many gastrointestinal illnesses and, interestingly, sudden deafness) [4]. Venoms and their therapeutic uses represent a major unexplored frontier in computational and systems biology. Some studies have successfully characterized and assessed the biodiversity of venom components (e.g., the diversity of ω-conotoxins in species of genus Conus [5,6], or the evolution of certain protein classes among closely related snake species [7-9]), but only within a highly limited context.
There is a severe lack of studies that analyze and assess the properties of, and the relationships among, venoms and their components on a large scale (e.g., across the entire tree of life). Better methods for extracting generalizable knowledge about venoms from existing data would enable researchers to discover novel therapeutic uses for them in a systematic and data-driven way. The most fundamental issue standing in the way of such studies is the almost complete lack of an informatics infrastructure uniting our existing knowledge on venoms. In this study, we present a novel ontology of venoms and related concepts that addresses this problem systematically. Biomedical ontologies allow for consistent and unambiguous naming of entities (in this case, venoms, venom components, and the species from which they are sourced) and of how they are interconnected. We also present a number of initial investigations regarding venom biodiversity across the tree of life, and explore how they can inform the discovery and refinement of novel therapeutic uses for venom compounds.

2. Methods

2.1 Building “venom ontology”

We used Protégé (ver. 4.2) [10] to create the class structure of the Venom Ontology using domain knowledge: By our definition, every venomous species has exactly one venom, and every venom has one or more molecular components that can be classified by the class of molecule they are (e.g., peptide, carbohydrate, inorganic cofactor). Recent reports suggest Conus geographus modifies its venom based on whether it is used defensively or offensively [11], but for the purposes of this ontology the two forms are grouped together as a single venom. If a venom component is a peptide, it has a canonical amino acid sequence. Each of the entities may have one or more other pieces of metadata, including links to other ontologies and structured terminologies. After defining the class structure of the ontology, we populated the ontology with individuals (specific instances of the ontology’s classes) sourced from UniProtKB/Swiss-Prot’s Tox-Prot database [12]. This database is a manually curated list of venom peptides containing numerous annotation tags including species of origin, amino acid sequences, full taxonomic lineage, and automated cross-mappings to other online resources. However, the structure of Tox-Prot does not support semantic reasoning. Due to the large number of individual records in the Tox-Prot database (6,092 at the time of creating the ontology), we added the contained information programmatically by first exporting the ontology from Protégé to an RDF-formatted XML file [13], and then using Apache’s Jena framework [14] to parse the venom records and insert relevant data into the appropriate spot within the ontology’s class hierarchy.
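As a rough illustration of the class structure described above, the following Python sketch models organisms, venoms, and components with plain dataclasses. It is not the authors' Protégé/Jena pipeline: the class names, the `populate` helper, and the example records are all hypothetical, and only the constraints stated in the text (one venom per species, a sequence for every peptide) are encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VenomComponent:
    name: str
    molecule_class: str              # e.g. "peptide", "carbohydrate", "inorganic cofactor"
    sequence: Optional[str] = None   # canonical amino acid sequence (peptides only)

    def __post_init__(self) -> None:
        # The text states that a peptide component has a canonical sequence.
        if self.molecule_class == "peptide" and not self.sequence:
            raise ValueError(f"peptide {self.name!r} needs a sequence")


@dataclass
class Venom:
    components: List[VenomComponent] = field(default_factory=list)


@dataclass
class Organism:
    species: str
    # Exactly one venom per venomous species, following the definition above.
    venom: Venom = field(default_factory=Venom)


def populate(records) -> Dict[str, Organism]:
    """Insert (species, peptide_name, sequence) records into the structure."""
    organisms: Dict[str, Organism] = {}
    for species, peptide_name, sequence in records:
        if species not in organisms:
            organisms[species] = Organism(species)
        organisms[species].venom.components.append(
            VenomComponent(peptide_name, "peptide", sequence)
        )
    return organisms


# Hypothetical records for illustration only (not taken from Tox-Prot).
example_records = [
    ("Species alpha", "peptide-1", "ACDEFGHIK"),
    ("Species alpha", "peptide-2", "LMNPQRSTV"),
    ("Species beta", "peptide-3", "WYACDEFGH"),
]
ontology = populate(example_records)
```

A plain in-memory model like this cannot perform semantic reasoning; it only shows how peptide-centric records map onto the organism, venom, and component hierarchy.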

2.2 Exploratory analysis of venom ontology data records

To demonstrate some potential applications of Venom Ontology, we performed three exploratory analyses of its contained data. The first of these involved assessing the similarity of amino acid sequences for venom peptides produced by species of the same genus. To accomplish this, we grouped species (stored as “Organisms” in the ontology) by genus, along with their derived peptide compounds. We then selected 2 genera that are well represented in the data set, and built “sequence similarity networks” for each of them. In selecting these genera, we looked for ones that are prolific enough within the ontology to generate informative (non-trivial) networks, yet not so prolific as to be unwieldy in terms of visualization or computation. In practice, we looked for two genera with approximately 20 species in the ontology. For each genus selected, we used BLASTp [15] to align all pairs of peptides within the genus. We constructed the networks using peptide sequences as nodes, and the alignments between them as edges. We transformed the BLAST scores for the alignments (which represent the percent coverage of the alignments; denoted S) into values S’ (the transformation equation is not preserved in this text), which allows us to define a “distance” between two peptide sequences (i.e., smaller values of S’ indicate higher similarity), used as edge weights in the final networks. S’ is a value in the interval [0,1), and is generally very small (e.g., < 1×10^-15). Finally, we filtered edges by setting a maximum expect value (“e-value” - a normalized p-value defining confidence that the alignment is non-random) threshold of 1×10^-50. Alignments that fell below this maximum cutoff almost certainly signify evolutionarily related sequences, and are therefore informative for the purposes of constructing these networks. For visualization purposes, we rendered the networks in Cytoscape [16] using the Prefuse force-directed layout [17], and colored nodes (individual peptides) by the species from which those peptides were sourced.
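The edge-filtering step of the network construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the function names, and the example alignments are hypothetical, the edge weights stand in for S’ values (whose defining equation is not available here), and no BLAST call is made.

```python
E_VALUE_CUTOFF = 1e-50  # maximum expect value given in the text


def build_network(alignments, cutoff=E_VALUE_CUTOFF):
    """Build {peptide: {neighbour: edge_weight}} from pairwise alignments.

    Each alignment is a tuple (peptide_a, peptide_b, weight, e_value).
    """
    graph = {}
    for a, b, weight, e_value in alignments:
        # Register both peptides as nodes even if the edge is filtered out.
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if a != b and e_value <= cutoff:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def connected_components(graph):
    """Return the connected components of the network as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node])
        components.append(component)
    return components


# Hypothetical alignments: (peptide_a, peptide_b, weight, e_value).
example_alignments = [
    ("p1", "p2", 1e-20, 1e-80),
    ("p2", "p3", 1e-18, 1e-60),
    ("p3", "p4", 2e-01, 1e-05),  # weak alignment, removed by the cutoff
    ("p4", "p5", 1e-16, 1e-55),
]
network = build_network(example_alignments)
components = connected_components(network)
```

Because the weak p3-p4 alignment is discarded, the example network splits into two components, mirroring how the filtered networks in the study are not fully connected.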
Our second analysis was a basic exploration of the distribution of both species and individual peptides in the ontology across the tree of life. We defined common groupings of animals (cnidarians, molluscs, insects, arachnids, fish, amphibians, reptiles, birds, and mammals) that may contain venomous species. From these large classes, we used NCBI’s Taxonomy database [18] to determine the highest-level taxa common to all members of those groups (grouping multiple taxa for paraphyletic groups, such as “fish”). For each of these taxa, we searched for their frequency of occurrence in the set of all species present in the database. We also enumerated the number of total sequences in the database for the groups listed.
The third and final analysis consisted of observing the complexity of venoms within the ontology. In this context, we simplistically define complexity as the number of distinct peptide components within the venom (e.g., a venom containing 20 peptide components is more complex than a venom containing only 10). We investigated the distributions of venom complexity for each of the taxonomic groups mentioned in the previous paragraph, making note of features such as mean number of peptide components per venom, standard deviation, and skewness (i.e., lack of symmetry, computed as the estimated third standardised moment). It should be noted that the results of these analyses are subject to systematic biases depending on how well the data in Tox-Prot represent the totality of venoms that exist in nature (refer to §4.6 for further discussion).
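The complexity measure and its summary statistics can be sketched as follows. The text does not say which skewness estimator was used, so this sketch assumes the simple moment ratio m3 / m2^1.5 and returns no value for fewer than three observations or zero variance. The function names and the example pairs are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def venom_complexity(pairs):
    """Count distinct peptide components per species from (species, peptide) pairs."""
    peptides = defaultdict(set)
    for species, peptide in pairs:
        peptides[species].add(peptide)
    return {species: len(found) for species, found in peptides.items()}


def skewness(values):
    """Third standardised moment, m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    if n < 3:
        return None
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    if m2 == 0:
        return None
    return m3 / m2 ** 1.5


def summarise(counts):
    """Summary statistics of a list of per-venom peptide counts."""
    counts = list(counts)
    return {
        "minimum": min(counts),
        "median": median(counts),
        "mean": mean(counts),
        "maximum": max(counts),
        "stdev": stdev(counts) if len(counts) > 1 else None,
        "skewness": skewness(counts),
    }


# Hypothetical (species, peptide) pairs; the repeated pair is counted once.
example_pairs = [
    ("sp1", "a"), ("sp1", "b"), ("sp1", "b"),
    ("sp2", "c"),
    ("sp3", "d"), ("sp3", "e"), ("sp3", "f"), ("sp3", "g"),
]
complexity = venom_complexity(example_pairs)
summary = summarise(complexity.values())
```

A long right tail, such as a few species with very many peptides, gives a positive skewness under this definition.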

3. Results

All code and data files used in this study are available for public use on GitHub at http://github.com/JDRomano2/venom_ontology_code. The ontology is available online, hosted both on BioPortal (http://bioportal.bioontology.org/ontologies/CU-VO) and on the project’s homepage, at http://venomkb.tatonettilab.org/ontology. A visualization of the ontology’s class hierarchy and object property associations is shown in the accompanying figure.

3.1 Venom ontology

Venom Ontology presently contains 614 known venomous species, and 6,092 curated peptides, each of which has a known amino acid sequence. There are correspondingly 614 “whole venom extract” entities, arising from an axiom which states that every organism has exactly one whole venom extract. Because our data source is peptide-centric, each whole venom extract (and, correspondingly, each organism) currently included in the ontology has at least one peptide, although this is not defined as necessary (i.e., the ontology allows for whole venom extracts to contain zero or more peptides). We added a small number of synthetic venom compounds (all clinically approved drugs) to the ontology by manually entering them as individuals for the “Synthetic_Venom_Derivative” class. This is a tractable approach presently, but as venom-derived therapeutic agents continue to be discovered and are coerced into a structured format, an automated means for adding them will become necessary - this point is elaborated on below, in §4.5. Venom Ontology was validated using the FaCT++ reasoning engine [19].

3.2 Analysis of the ontology’s contained data

Our analysis of venom peptide sequence similarity for a number of well-represented genera highlights some noteworthy features of venoms that have significant implications for drug discovery. In the accompanying figure, we show two sequence similarity networks - one for genus Loxosceles (recluse spiders) and one for Bungarus (kraits - a genus of venomous snakes) - yet our methods could be applied to any other taxonomic group that is present in the ontology. Since we only kept alignments with strong statistical support (low e-value - see §2.2 for details), the graphs are not fully connected. Small connected components (e.g., the “islands” seen around the periphery of the networks) as well as clusters within larger connected components can be interpreted as groups of peptides that are likely to be closely related on a structural level. Although we originally expected sequences from a given species to segregate together, there are clusters in each of the networks that contain a diverse mixture of sequences from numerous species (denoted in the figure by red arrows). The smaller connected components tend to be more homogeneous in terms of their species composition (i.e., they have higher cluster purity). Subjectively, it is also noteworthy that the networks do not display the properties of “scale-free” networks (characterized primarily by few nodes of very high degree, and many nodes of very low degree), which are arguably the most prevalent family of networks that arise from biological phenomena [20]. While speculation as to why this occurs is beyond the scope of this exploratory analysis, it would be an interesting topic to pursue in a follow-up study. The distribution of species and sequences by higher taxonomic groupings is shown in Table 1. Both “fish” and “reptiles” are common names that consist of multiple clades (i.e., they are paraphyletic). It should be noted that 5 species, containing a total of 1,348 sequences, are not classified within any of these groups.
While this only makes up 0.81% of the total number of species, it contains 22.13% of the total number of sequences found in the ontology. This seems to be the result of numerous sequences that have poorly formed or absent “taxonomic lineage” annotations in Tox-Prot (meaning that some of the ‘orphaned’/unclassified sequences likely come from already classified species that are included in the larger taxonomic groups). After looking at properties of venoms exposed by the ontology at the genus level, we investigated the distribution more generally across the tree of life. Distributions of venom complexity are shown in Table 2. In this portion of the data analysis, we only show the common taxonomic groups from Table 1 that have at least 1 venom and 1 peptide. The final row of the table shows the distribution across all species present in the ontology. Additionally, the accompanying figure shows a graphical representation of these distributions, drawn as violin plots with a logarithmic scale on the vertical axis.

4. Discussion

4.1 Some ontology classes possess no individuals, yet are still informative

The Venom Ontology contains a number of terminal classes that do not currently have any members (“individuals”), including the venom component subclasses “Biological_Macromolecule/Carbohydrate” and “Inorganic_Molecule”. The main rationale for their inclusion is threefold: (1) The ontology is meant to convey computable semantic knowledge of venoms, and with the current structure, ontology reasoning software is able to understand that venoms may contain a number of different components, of which only some may be peptides. (2) Since future revisions to the ontology may incorporate new data sources, we hope to be able to populate these classes with informative instances in a future release. (3) We hope to be able to generate members for these classes using machine learning methods that don’t require a curated dataset of venom components (such as “ontology learning from text” [21]). Another class - “Synthetic Venom Derivative” - seems to be specific enough to allow for manual population using domain knowledge of existing synthetic versions of venoms used as pharmaceuticals. However, existing synthetic venom derivatives are more numerous than it would initially seem. For example, a number of conantokins (a specific sub-class of conotoxins - sourced from snails in the genus Conus) have been modified and produced synthetically, yet none have received approval for clinical use [22,23]. For this reason, a potential follow-up to this study would be a comprehensive survey of synthetic derivatives of venom peptides.

4.2 Grouping venom peptides by genus reveals clusters of similar venoms across species

As briefly alluded to in §3.2, the sequence similarity networks show clusters of venom peptides that contain members from a number of closely related species. This suggests a novel approach for discovering libraries of venom-derived peptides with a similar therapeutic effect. During drug development, having a large number of drug candidates available improves the likelihood of finding a molecule that maximizes therapeutic effect while minimizing toxic effects (a notoriously challenging obstacle in repurposing venoms for clinical use). This proposed approach provides a data-driven framework for discovering venom-derived therapeutic agents, which is an improvement over traditional methods that are almost entirely based on serendipitous discovery or borrowed from ancient traditional medicine [24].

4.3 Non-reptile venomous species are underrepresented in existing data

Recent analyses of venom biodiversity reveal surprising patterns, including that the prevalence of venomous fish is far higher than in any other major taxonomic group, including reptiles [25]. Table 1, however, shows a strong bias towards venomous reptiles in available data (fish peptides make up only 0.23% of venom sequences in the Tox-Prot dataset, while reptilian peptides make up 37.72%). Other discrepancies are also apparent: for example, only one venomous mammal is included in the database: Ornithorhynchus anatinus (duck-billed platypus). While it is uncommon for mammals to be venomous, reviews on the subject have identified numerous others aside from O. anatinus, including multiple shrews, bats, and certain species of loris (taxonomic subfamily Lorinae).

Table 2

Distribution of venom complexity across the tree of life, by common taxonomic groups. A venom’s complexity is defined as the number of distinct peptide components it contains.

Common Name	Minimum	Median	Mean	Maximum	Skewness*
Molluscs	1	4	11.230	118	3.638
Insects	1	2	3.101	15	2.211
Arachnids	1	4	13.020	293	6.576
Fish	1	2	2.800	6	1.517
Amphibians	1	1.5	1.500	2	n/a
Reptiles	1	4	9.496	64	2.271
Mammals	6	6	6.000	6	n/a
All Species	1	4	9.922	293	7.987

* Skewness is the estimated third standardised moment of the empirical distribution. Higher skewness indicates greater lack of symmetry about the mean.

Knowing about these discrepancies, we can prioritize future venom research to include presently underrepresented categories of animals, which should in turn increase the likelihood of discovering novel compounds that have diverse therapeutic effects.

4.4 Apparent complexity of venoms varies across the tree of life

Venoms usually consist of a complex mixture of organic and inorganic molecules, each of which has a particular effect. If we define “complexity” as the number of distinct peptide components in a venom, our results show that venom complexity is highly variable across the tree of life. In Table 2 we list summary statistics for venom complexity distribution across 7 common taxonomic groupings. These data are additionally visualized in the accompanying figure as a violin plot. The plot, shown with number of peptides per venom on a logarithmic scale, highlights that there are many outliers in the dataset - species with extremely complex venoms compared to the mean of 9.922 peptides per venom. Furthermore, each of the taxonomic groups has its own unique distribution. Although the sizes of some groups in the ontology are too small to result in viable statistical inferences (e.g., mammals and amphibians), variable distributions of venom complexity suggest that complexity is regulated in some manner that is conserved by evolution - otherwise, all of the distributions would converge. In particular, insects seem to have venoms that are relatively simple compared to arachnids, molluscs, and reptiles. Interestingly, arachnids have the largest number of outlier species that have extremely complex venoms. Reptiles, by far the most well-represented group in the dataset, have notably fewer highly complex outliers than either molluscs or arachnids. As an example of a quantitative approach to comparing these distributions, Table 3 shows the p-values of the Mann-Whitney U test applied pairwise to all of the distributions shown in Table 2. These observations may be an artifact of data completeness (see §4.6), but if not, they can help to guide research towards richer libraries of venoms that may include important therapeutic compounds.
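A pairwise Mann-Whitney U comparison of this kind can be sketched in pure Python. The text does not state which implementation or options were used, so this sketch assumes a two-sided test with the tie-corrected normal approximation and no continuity correction. Its p-values should therefore not be expected to match published values exactly, particularly for very small groups. All function names and example data are hypothetical.

```python
import math
from itertools import combinations


def rank_with_ties(values):
    """Return 1-based average ranks and the size of each group of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U, p) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


def pairwise_p_values(groups):
    """Apply the test to every pair of named groups."""
    return {
        (a, b): mann_whitney_u(groups[a], groups[b])[1]
        for a, b in combinations(sorted(groups), 2)
    }


# Hypothetical peptide counts per venom, for illustration only.
example_groups = {
    "group_a": [1, 2, 2, 3, 1, 2],
    "group_b": [4, 9, 15, 30, 7, 12],
    "group_c": [1, 3, 2, 2, 1, 4],
}
p_values = pairwise_p_values(example_groups)
```

With many pairwise comparisons, a multiple-testing correction would normally be considered before interpreting individual p-values. The text does not report one.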

4.5 Using Venom Ontology in conjunction with VenomKB to support drug discovery

In a previous study, we described VenomKB - a knowledge base cataloguing putative therapeutic uses of venoms and venom-derived compounds, constructed via manual literature review and automated knowledge discovery techniques applied to MEDLINE [26]. Linking these two separate data resources may optimize the process of computational drug discovery by imposing a polyhierarchical structure on many of VenomKB’s data records (specifically, ones that map to instances in Venom Ontology). For example, if a record in VenomKB describes the therapeutic effect of a compound produced by species “X”, we may be able to find highly similar (and possibly more efficacious) molecules by using Venom Ontology to identify venom peptides from species that are in the same genus as species “X”. In the future, we intend to add a component to the ontology that resolves venom names with their synonyms, which could allow us to identify venoms with multiple therapeutic effects, as well as increase confidence in therapies when multiple studies corroborate the same effect. We plan to fully integrate these two resources, so that VenomKB can be browsed by navigating the hierarchical structure of Venom Ontology, and vice versa.
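The same-genus lookup described in the species “X” example might look like the sketch below. It assumes species are identified by binomial names whose first word is the genus. The function names, the dictionary layout, and the placeholder species and peptide names are hypothetical and do not reflect the VenomKB or Venom Ontology APIs.

```python
def genus_of(species_name):
    """Take the first word of a binomial name as the genus (assumption)."""
    return species_name.split()[0]


def same_genus_candidates(species_x, peptides_by_species):
    """Return peptides from other species that share a genus with species_x."""
    genus = genus_of(species_x)
    return {
        species: list(peptides)
        for species, peptides in peptides_by_species.items()
        if species != species_x and genus_of(species) == genus
    }


# Placeholder names for illustration only.
example_peptides = {
    "Genusone alpha": ["peptide-1", "peptide-2"],
    "Genusone beta": ["peptide-3"],
    "Genusone gamma": ["peptide-4", "peptide-5"],
    "Genustwo delta": ["peptide-6"],
}
candidates = same_genus_candidates("Genusone alpha", example_peptides)
```

The returned peptides would then be candidates for sequence comparison against the compound with the known therapeutic effect.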

4.6 Limitations - structured data on venoms are largely incomplete

One of the most important caveats of this set of investigations is that it is not based on a complete survey of venoms across the tree of life. Since only a small handful of the vast number of venomous species believed to exist have been discovered (and an even smaller number have actually been studied), we treat the Tox-Prot dataset as a “best approximation” of venom diversity based on available data. This introduces various sources of systematic bias into the inferences that are made from the ontology’s contained data. In §2.2, we mention this limitation with regard to our definition of venom complexity (the relative number of peptide components contained within a venom). We analyze our data under the assumption that the Tox-Prot data set does not prioritize certain species for “completeness” - in other words, that the ratio of the actual number of peptides to the number that are in the data set remains consistent for all species. However, this may not be the case. The available data for some species may be substantially more complete than for others. Also, it may be more challenging to run proteomic analyses on some species than others. Each of these factors would affect the consistency of completeness across the dataset. A future goal that could help eliminate these potential sources of bias would be to populate the ontology only with complete proteomic surveys of species’ venoms. We intend for the Venom Ontology to be one of the first major steps towards systematically and consistently coercing newly discovered venoms and venom components into a standardized format. The ontology’s structure suggests numerous ways to define a consistent vocabulary for these semantic concepts.

5. Conclusion

Although the study of venoms traditionally falls into the fields of biological systematics, toxicology, biodiversity, and ecology, venoms constitute a largely untapped library that could be useful for drug discovery purposes. Recently, venoms have piqued the interest of researchers from diverse fields, including both translational bioinformatics and systems pharmacology. Additionally, venom research requires a substantial amount of informatics infrastructure to support rigorous and informative computational and data-driven analysis. In this study, we provide a novel ontology of venom-related concepts that facilitates many of these analyses. In demonstrating possible uses for the ontology, we have discerned several important features of venoms that suggest promising areas to study in the future, including studying venomous species that are underrepresented in existing data sets, and using peptide sequence characteristics to build libraries of related venomous compounds that may improve on existing drugs derived from venoms. As the ontology grows in future releases, we expect to discover many more ways it can facilitate data-driven research of venoms, greatly advancing knowledge and available methodology in a diverse range of medical and biologically based fields.

Table 1

Distribution of species and sequences in Venom Ontology across common taxonomic groups. Some groups with no species or sequences are included for completeness.

Common Name	Taxonomic group(s)	# species in ontology	% total species	# sequences in ontology	% total sequences
Cnidarians	Cnidaria	0	0.00%	0	0.00%
Molluscs	Mollusca	97	15.80%	1089	17.88%
Insects	Insecta	79	12.87%	245	4.02%
Arachnids	Arachnida	183	29.80%	1089	17.88%
Fish	Actinopterygii	4	0.65%	12	0.20%
	Coelacanthimorpha	0	0.00%	0	0.00%
	Chondrichthyes	1	0.16%	2	0.03%
	Cyclostomata	0	0.00%	0	0.00%
	Dipnoi	0	0.00%	0	0.00%
Amphibians	Amphibia	2	0.33%	3	0.05%
Reptiles	Archelosauria	0	0.00%	0	0.00%
	Squamata	242	39.41%	2298	37.72%
Birds	Aves	0	0.00%	0	0.00%
Mammals	Mammalia	1	0.16%	6	0.10%
Other/unclassified		5	0.81%	1348	22.13%
Total		614		6092
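As a quick arithmetic check, the percentages quoted in the text can be recomputed from the counts in Table 1. In this sketch the paraphyletic “fish” rows (Actinopterygii and Chondrichthyes) are summed into one entry and groups with zero counts are omitted. The variable and function names are illustrative only.

```python
# Counts copied from Table 1; fish = Actinopterygii + Chondrichthyes.
species_counts = {
    "Molluscs": 97, "Insects": 79, "Arachnids": 183, "Fish": 4 + 1,
    "Amphibians": 2, "Reptiles": 242, "Mammals": 1, "Other/unclassified": 5,
}
sequence_counts = {
    "Molluscs": 1089, "Insects": 245, "Arachnids": 1089, "Fish": 12 + 2,
    "Amphibians": 3, "Reptiles": 2298, "Mammals": 6, "Other/unclassified": 1348,
}


def percentages(counts):
    """Express each count as a percentage of the column total (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


species_pct = percentages(species_counts)
sequence_pct = percentages(sequence_counts)

print(sum(species_counts.values()), sum(sequence_counts.values()))  # 614 6092
print(species_pct["Other/unclassified"])   # 0.81
print(sequence_pct["Other/unclassified"])  # 22.13
print(sequence_pct["Fish"], sequence_pct["Reptiles"])  # 0.23 37.72
```

The column totals reproduce the 614 species and 6,092 sequences reported for the ontology, and the fish and reptile shares match the figures discussed in §4.3.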

Table 3

Mann-Whitney U test results for all pairs of venom complexity distributions. A p-value of less than 0.05 is taken to indicate that two distributions differ significantly.

	Arachnids	Fish	Insects	Mammals	Molluscs	Reptiles
Amphibians	0.1165	0.417	0.4389	0.6667	0.1669	0.1262
Arachnids		0.1554	1.85E-07	0.8424	0.9086	0.7252
Fish			0.732	0.3657	0.2162	0.1701
Insects				0.1935	2.74E-05	2.20E-07
Mammals					0.8584	0.8128
Molluscs						0.8778

1. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

Review 2. Diversity of the neurotoxic Conus peptides: a model for concerted pharmacological discovery.

Authors: Baldomero M Olivera; Russell W Teichert
Journal: Mol Interv Date: 2007-10

Review 3. Intrathecal ziconotide: a review of its use in patients with chronic pain refractory to other systemic or intrathecal analgesics.

Authors: Mark Sanford
Journal: CNS Drugs Date: 2013-11 Impact factor: 5.749

4. Scale-free networks: a decade and beyond.

Authors: Albert-László Barabási
Journal: Science Date: 2009-07-24 Impact factor: 47.728

Review 5. Chemistry and evolution of toxins in snake venoms.

Authors: C C Yang
Journal: Toxicon Date: 1974-01 Impact factor: 3.033

6. Diversity of Conus neuropeptides.

Authors: B M Olivera; J Rivier; C Clark; C A Ramilo; G P Corpuz; F C Abogadie; E E Mena; S R Woodward; D R Hillyard; L J Cruz
Journal: Science Date: 1990-07-20 Impact factor: 47.728

7. Venom evolution widespread in fishes: a phylogenetic road map for the bioprospecting of piscine venoms.

Authors: William Leo Smith; Ward C Wheeler
Journal: J Hered Date: 2006-06-01 Impact factor: 2.645

8. Proteomic characterization and comparison of Malaysian Bungarus candidus and Bungarus fasciatus venoms.

Authors: Muhamad Rusdi Ahmad Rusmili; Tee Ting Yee; Mohd Rais Mustafa; Wayne C Hodgson; Iekhsan Othman
Journal: J Proteomics Date: 2014-08-19 Impact factor: 4.044

Review 9. Conotoxin gene superfamilies.

Authors: Samuel D Robinson; Raymond S Norton
Journal: Mar Drugs Date: 2014-12-17 Impact factor: 5.118

10. Evolution of separate predation- and defence-evoked venoms in carnivorous cone snails.

Authors: Sébastien Dutertre; Ai-Hua Jin; Irina Vetter; Brett Hamilton; Kartik Sunagar; Vincent Lavergne; Valentin Dutertre; Bryan G Fry; Agostinho Antunes; Deon J Venter; Paul F Alewood; Richard J Lewis
Journal: Nat Commun Date: 2014-03-24 Impact factor: 14.919
