Keywan Hassani-Pak, Martin Castellote, Maria Esch, Matthew Hindle, Artem Lysenko, Jan Taubert, Christopher Rawlings.
Abstract
The chances of raising crop productivity to enhance global food security would be greatly improved if we had a complete understanding of all the biological mechanisms that underpin traits such as crop yield, disease resistance or nutrient and water use efficiency. With more crop genomes emerging all the time, we are closer to having the basic gene-level information needed to begin assembling crop gene catalogues and to use data from other plant species to understand how the genes function and how their interactions govern crop development and physiology. Unfortunately, creating such a complete knowledge base of gene functions, interaction networks and trait biology is technically challenging because the relevant data are dispersed across myriad databases, in a variety of data formats, with variable quality and coverage. In this paper we present a general approach for building genome-scale knowledge networks that provide a unified representation of heterogeneous but interconnected datasets to enable effective knowledge mining and gene discovery. We describe the datasets and outline the methods, workflows and tools that we have developed for creating and visualising these networks for the major crop species, wheat and barley. We present the global characteristics of such knowledge networks and, with an example linking a seed size phenotype to a barley WRKY transcription factor orthologous to TTG2 from Arabidopsis, illustrate the value of integrated data in biological knowledge discovery. The software we have developed (www.ondex.org) and the knowledge resources we have created (http://knetminer.rothamsted.ac.uk) are all open-source and provide a first step towards systematic and evidence-based gene discovery in order to facilitate crop improvement.
Keywords: Bioinformatics; Crop genomics; Data integration; Gene discovery; Knowledge discovery; Knowledge network
Abbreviations: CropNet, Crop knowledge network; GSKN, Genome-scale Knowledge Network; RefNet, Reference knowledge network of model species
Year: 2016 PMID: 28018846 PMCID: PMC5167366 DOI: 10.1016/j.atg.2016.10.003
Source DB: PubMed Journal: Appl Transl Genom ISSN: 2212-0661
Fig. 1. Examples of public data sources that can be integrated into Ondex (A) using the Ondex Integrator and the Ondex Console (B). Following the data integration workflow, the knowledge network (C) is loaded into the Ondex UI for visualisation and exploration (D).
Fig. 2. The Ondex workflow involves parsing, mapping and collapsing the data. Ondex input datasets A and B are merged via common concepts (e.g. Protein). The mapping step creates relations of type equal between “equivalent” concepts. The collapsing step is a network transformation that merges equivalent concepts into a single concept to avoid redundancy. The merged concepts contain a summary of all the data sources as a record of the provenance of the merged network.
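The mapping and collapsing steps described for Fig. 2 can be illustrated with a minimal Python sketch. This is not the Ondex implementation or its API: the `Concept` class, the `map_equal` and `collapse` functions, the accession values, and the rule that two concepts are "equivalent" when they share a concept class and an accession are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    cid: str                 # hypothetical internal identifier
    concept_class: str       # e.g. "Protein", "Gene"
    accession: str           # hypothetical external accession
    sources: set = field(default_factory=set)  # provenance record


def map_equal(concepts):
    """Mapping step: emit (a, "equal", b) for concepts judged equivalent.

    Assumption: equivalence means same concept class and same accession.
    """
    seen = {}
    relations = []
    for c in concepts:
        key = (c.concept_class, c.accession)
        for other in seen.get(key, []):
            relations.append((other.cid, "equal", c.cid))
        seen.setdefault(key, []).append(c)
    return relations


def collapse(concepts, relations):
    """Collapsing step: merge concepts linked by "equal" into one concept."""
    parent = {c.cid: c.cid for c in concepts}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, rel, b in relations:
        if rel == "equal":
            parent[find(b)] = find(a)

    merged = {}
    for c in concepts:
        root = find(c.cid)
        if root not in merged:
            merged[root] = Concept(root, c.concept_class, c.accession)
        merged[root].sources |= c.sources  # keep provenance of every source
    return list(merged.values())


# Two toy datasets, A and B, that share one Protein concept.
concepts = [
    Concept("a1", "Protein", "ACC001", {"A"}),
    Concept("a2", "Gene", "GENE001", {"A"}),
    Concept("b1", "Protein", "ACC001", {"B"}),
]
relations = map_equal(concepts)
merged = collapse(concepts, relations)
for c in merged:
    print(c.concept_class, c.accession, sorted(c.sources))
```

In this sketch the merged Protein concept carries both source labels, which mirrors the provenance summary mentioned in the caption.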
Summary of the data sources and Ondex parsers that were used to create the crop and reference knowledge networks.
| Knowledge | Data source | Data type | Ondex parser | Concept class | Relation type |
|---|---|---|---|---|---|
| Genes | Ensembl | GFF3 | FASTA-GFF3 | Gene | encodes |
| SNP | Ensembl | Tabular | Console | Gene | in_proximity |
| GWAS | Ensembl | Tabular | Console | SNP | associated_with |
| QTL | Gramene | Tabular | Console | QTL | control |
| Homology | Ensembl | Tabular | Console | Protein | ortholog |
| Interaction | TAIR | Tabular | Console | Gene | interacts_with |
| GO annotations | Gene Ontology | GAF2.0 | GAF | Gene/Protein | participates_in |
| Phenotype | TAIR | Tabular | Console | Gene/Protein | has_observed_phenotype |
| Pathway | AraCyc | BioPAX | BioCyc BioPAX | Protein | is_a |
| Protein domain | Ensembl | Tabular | Console | Protein | has_domain |
| Literature | PubMed | Medline XML | Medline/PubMed | Publication | NA |
| Literature citations | TAIR | Tabular | Console | Gene | published_in |
| Gene ontology | Gene Ontology | OBO | GenericOBO | BioProc | is_a |
| Trait ontology | Gramene | OBO | GenericOBO | Trait Ontology | is_a |
Total number of concepts per Concept Class included in BarleyNet and WheatNet (Release June 2016). Note that these networks include the same RefNet.
| Concept class | BarleyNet | WheatNet |
|---|---|---|
| Biological process | 27,486 | 27,525 |
| Cellular component | 3,787 | 3,787 |
| Compound | 5,457 | 2,980 |
| EC | 1,789 | 1,754 |
| Enzyme | 26,698 | 15,150 |
| Gene | 112,091 | 130,815 |
| Molecular function | 9,866 | 9,919 |
| Pathway | 676 | 587 |
| Phenotype | 6,489 | 6,489 |
| Protein complex | 192 | 187 |
| Protein domain | 7,032 | 9,417 |
| Protein | 136,735 | 177,378 |
| Publication | 61,329 | 61,305 |
| QTL | 285 | 0 |
| Reaction | 5,612 | 3,097 |
| RNA | 1,296 | 1,296 |
| SNP | 16,030 | 0 |
| TO | 1,314 | 1,314 |
| Quantitative trait | 30 | 0 |
| Transport | 96 | 54 |
Fig. 3. The ontology types present in the metagraph of the barley GSKN (BarleyNet). Different node shapes and colors represent different Concept Classes. Relation Types are omitted for clarity.
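A metagraph of this kind can be derived from a network by replacing each concept with its Concept Class and counting the resulting class-to-class links. The following Python sketch shows the idea on toy data; the identifiers and the `metagraph` function are hypothetical, the relation directions are assumed, and only the relation type names are taken from the data source table.

```python
from collections import Counter

# Toy network: concept id -> Concept Class (identifiers are hypothetical).
concept_class = {
    "g1": "Gene",
    "p1": "Protein",
    "p2": "Protein",
    "d1": "Protein domain",
}

# (source id, relation type, target id); directions are an assumption.
relations = [
    ("g1", "encodes", "p1"),
    ("p1", "ortholog", "p2"),
    ("p1", "has_domain", "d1"),
    ("p2", "has_domain", "d1"),
]


def metagraph(concept_class, relations):
    """Count relations per (source class, relation type, target class)."""
    counts = Counter()
    for src, rel, dst in relations:
        counts[(concept_class[src], rel, concept_class[dst])] += 1
    return counts


meta = metagraph(concept_class, relations)
for (src_cls, rel, dst_cls), n in sorted(meta.items()):
    print(f"{src_cls} -[{rel}]-> {dst_cls}: {n}")
```

Dropping the relation type from each key would give the simplified view described in the caption, where only Concept Classes and their connections are shown.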
Fig. 4. A heterogeneous knowledge network that links crop-specific information on the left (Traits, QTL and Gene) to RefNet information on the right (Homology, Interaction and Annotation).
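One way to picture how such a network supports gene discovery is as a path search from a crop trait to evidence in the reference network. The Python sketch below runs a breadth-first search over a toy edge list. All node identifiers are hypothetical, the `located_in` relation name is an assumption (the other relation names come from the data source table), and this is not a description of how the authors' tools search the network.

```python
from collections import deque

# Toy heterogeneous network as (source, relation type, target) triples.
edges = [
    ("qtl:Q1", "control", "trait:seed_size"),
    ("gene:barley_1", "located_in", "qtl:Q1"),  # assumed relation name
    ("gene:barley_1", "encodes", "protein:barley_1"),
    ("protein:barley_1", "ortholog", "protein:ath_1"),
    ("protein:ath_1", "has_observed_phenotype", "phenotype:ath_seed"),
]


def find_path(edges, start, goal):
    """Return the shortest node path from start to goal, or None.

    Edges are treated as undirected so that evidence can be followed
    in either direction.
    """
    adjacency = {}
    for src, _rel, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = find_path(edges, "trait:seed_size", "phenotype:ath_seed")
print(" -> ".join(path))
```

The returned path passes from the trait through the QTL, the crop gene and its protein to an orthologous reference protein and its phenotype, which is the left-to-right chain of evidence the figure depicts.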