
AtPID: a genome-scale resource for genotype-phenotype associations in Arabidopsis.

Qi Lv¹,², Yiheng Lan¹, Yan Shi¹, Huan Wang¹, Xia Pan¹, Peng Li³, Tieliu Shi⁴.

Abstract

AtPID (Arabidopsis thaliana Protein Interactome Database, available at http://www.megabionet.org/atpid) is an integrated database resource for protein interaction networks and functional annotation. In the past few years, we collected 5564 mutants with significant morphological alterations and manually curated them into 167 plant ontology (PO) morphology categories. These single- and multiple-gene mutants were indexed and linked to 3919 genes. After integrating these genotype-phenotype associations with the comprehensive protein interaction network in AtPID, we developed a Naïve Bayes method and predicted 4457 novel high-confidence gene-PO pairs involving 1369 genes as a complement. Along with the newly accumulated data on protein interactions and functional annotation, and the updated visualization toolkits, we present a genome-scale resource for genotype-phenotype associations in Arabidopsis in AtPID 5.0. On the updated website, all the new genotype-phenotype associations from mutants, the protein network and the protein annotation information can be displayed together in a comprehensive network view, which will greatly enhance systematic studies of plant protein function and genotype-phenotype associations.

Year: 2016 PMID: 27899679 PMCID: PMC5210528 DOI: 10.1093/nar/gkw1029

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein functional annotation and protein networks, including protein–protein interactions and regulatory relations, are essential for understanding the mechanisms underlying biological systems. AtPID is a comprehensive data resource for protein interactions and functional annotation, developed with Arabidopsis thaliana as the model system. In 2005, we started to collect protein–protein interactions (PPIs) from the literature and released AtPID 1.0, which included only a limited set of curated PPIs and protein functional annotations from TAIR (The Arabidopsis Information Resource) (1). In response to increasing demand for a comprehensive PPI network from related research communities, we extended the PPI network using different computational methods and released AtPID 2.0 in 2006. To further increase coverage and reduce false positives in the predicted dataset, we manually curated more PPIs from the literature and developed a Naïve Bayes classifier to integrate and evaluate all the predicted PPIs; this produced AtPID 3.0 in 2008, a rich source of information for system-level understanding of gene function and biological processes (2). To better serve research on the mechanisms of various physiological activities, we annotated the Arabidopsis proteins in AtPID 4.0 with further information (e.g. functional annotation, subcellular localization, tissue-specific expression, phosphorylation information, SNP phenotype and mutant phenotype) and interaction qualifications (e.g. transcriptional regulation, complex assembly and functional collaboration) through further literature text mining and integration of other resources (3) (Table 1).

Table 1.

A comprehensive comparison of the different versions of the AtPID database

	Functional annotation				Molecular interaction
AtPID version	Protein functional descriptions	Subcellular localization	Mutants	Phenotype annotation	Curated PPIs	Predicted PPIs	Curated transcriptional regulations	Predicted transcriptional regulations
AtPID 3 (2)	32 000	–	–	–	4666	23 396	–	–
AtPID 4 (3)	40 000	10 429	5121 mutants, 3431 genes	–	5565	98 174	8070	–
AtPID 5 (current version)	40 000	11 052	5609 mutants, 3916 genes	8202 mutant-PO associations	45 382	118 556	9435	31 991

Compared with other organisms, plants have unique advantages in mutagenesis and tissue culture. A large number of characterized stable Arabidopsis mutants have been reported in the research literature, and large-scale seed and mutant resources have been built for genome annotation and functional studies, e.g. uNASC Database (The European Arabidopsis Stock Centre), RAPID (RIKEN Arabidopsis Phenome Information Database), CSHL Database (the Arabidopsis Genetrap Website at Cold Spring Harbor Lab), Chloroplast Function Database, SeedGenes Database, AGRICOLA Database (Systematic RNAi knockouts in Arabidopsis), Araport (the Arabidopsis Information Portal) and TAIR (4–10). Mutant phenotypes are especially critical for functional studies of plants. Although great efforts have been made to collect such data in plants, mutant phenotypes are still largely under-annotated. AtPID has been committed to collecting more mutants with significant morphological alterations and to annotating all the mutants’ phenotypes in a systematic way. The Plant Ontology is a controlled vocabulary (ontology) that describes plant anatomy, morphology and stages of development for all plants (11). To index and annotate all the mutants in AtPID within a standard semantic framework, we cooperated with the Shanghai Society for Plant Biology and annotated all the mutants to more specific downstream PO categories.

In this update, the AtPID 5.0 database greatly expands the information on PPIs and mutant phenotypes obtained from the published literature (12–14), public databases and computational approaches. The mutant phenotype data were carefully curated by biologists. In addition, novel associations between genes and phenotypes were predicted with a Naïve Bayes method. Furthermore, we developed a more comprehensive visualization toolkit to view all the interactions at the PPI, transcriptional regulation and genotype–phenotype levels under the same framework, which can show or map all other annotation information in our database for selected genes. These improvements and updates will help researchers exploit the information in our database more effectively and comprehensively.

RESULTS

Summary of new data in the updated AtPID 5.0

Compared with other widely used PPI resources (Table 2), the updated database indexes 45 382 curated PPIs and 118 556 predicted PPIs from literature mining, public databases or computational approaches. These numbers have increased substantially owing to the rapid growth and maturation of biomedical natural language processing and to large-scale experiments for functional studies (15–17). We also generated a comprehensive chloroplast proteomics dataset in Arabidopsis through large-scale proteomics experiments and indexed all 3134 credible chloroplast proteins into our annotation system. Furthermore, we systematically annotated 31 991 transcription factor binding site (TFBS) associations for 6891 genes based on the integration of expression profiling and cis-regulatory element information. This update substantially enriches the protein annotations in our database by tracking recent research progress in related areas and will assist functional experiments and systematic studies.

Table 2.

Numbers of interactions in AtPID 5.0 compared with other widely used data resources

PPI-related database	Description of the PPI database	Curated PPIs	Predicted PPIs
AtPID 5.0	An integrated database resource for protein interaction networks and functional annotation of the proteome (http://www.megabionet.org/atpid)	45 382	118 556
PAIR	The predicted Arabidopsis interactome resource (http://www.cls.zju.edu.cn/pair/) (24)	5990	137 986
TAIR	A database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana (http://www.arabidopsis.org) (25)	6503	–
BioGRID	An interaction repository with data compiled through comprehensive curation efforts (http://thebiogrid.org/) (26)	42 216	–
STRING	A database of predicted functional associations between proteins (http://string-db.org/) (27)	–	>1 000 000

Comprehensive annotation of genotype–phenotype associations

Using text mining and database integration, the previous version (AtPID 4.1) collected 5121 mutants with clearly observable phenotypes related to 3431 genes. In the past few years, through in-depth cooperation with the Shanghai Society for Plant Biology, we collected 488 new mutants and systematically annotated the phenotypes of all existing and newly curated mutants to 167 standardized plant ontology categories (Figure 1A). Comprehensive collection of phenotype data can support studies of phenotype mechanisms, as has been done in the systematic exploration of disease associations (18,19). Strategies and algorithms have been developed to predict gene functions by integrating data at multiple levels (18–20). We integrated three types of information, namely PPIs, co-expression from expression profiling and GO annotation, with a Naïve Bayes method. PPIs were quantified by the extended Czekanowski–Dice distance (21), and missing values were complemented with experimental PPIs of orthologs in 14 other species from the STRING database (22). Shared Smallest Biological Processes (SSBPs) were applied to describe the possibility of gene interactions based on GO annotation (23). Co-expression of gene pairs was computed over the microarrays mentioned above to predict regulatory interactions. The pairwise correlation coefficients among the three types of information were low (PPIs-GO: 0.05; PPIs-co-expression: 0.08; co-expression-GO: −0.03), suggesting that the features were approximately independent of each other and consistent with the assumption of the Naïve Bayes method. The Naïve Bayes model was fitted with the e1071 package in R and reached an average AUC of 0.72. Finally, the prediction contains 4457 novel gene-PO pairs involving 1369 genes, which can supplement the known mutant information.
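
To make the integration step above more concrete, the following is a minimal Python sketch, not the authors' pipeline (which used the e1071 package in R). It assumes the plain Czekanowski–Dice distance, in which each protein's set is its direct interactors plus the protein itself, rather than the extended variant used in AtPID, and it assumes a Gaussian Naïve Bayes over three numeric features per candidate gene-PO pair. The function and class names, the toy network and all feature values are invented for illustration; how the features were actually constructed for each gene-PO pair is not specified here.

```python
import math
from collections import defaultdict


def czekanowski_dice(graph, a, b):
    """Plain Czekanowski-Dice distance between proteins a and b.

    Each protein's set is its direct interactors plus itself (assumption:
    the non-extended form), so identical neighborhoods give distance 0.
    """
    na = set(graph.get(a, ())) | {a}
    nb = set(graph.get(b, ())) | {b}
    return len(na ^ nb) / (len(na | nb) + len(na & nb))


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in rows_by_class.items():
            self.priors[label] = len(rows) / len(X)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-6
                self.stats[label].append((mean, var))
        return self

    def predict_proba(self, row):
        # Sum log prior and per-feature log likelihoods (independence assumption).
        log_post = {}
        for label, prior in self.priors.items():
            total = math.log(prior)
            for value, (mean, var) in zip(row, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var)
                total -= (value - mean) ** 2 / (2 * var)
            log_post[label] = total
        # Normalize in a numerically stable way.
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}


# Hypothetical toy PPI network (adjacency sets), not AtPID data.
toy_ppi = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(czekanowski_dice(toy_ppi, "A", "D"))  # 0.6

# Invented feature rows: (PPI distance, co-expression, GO-based score).
# Label 1 = known gene-PO association, 0 = no known association.
X = [(0.10, 0.8, 0.9), (0.20, 0.7, 0.8), (0.15, 0.9, 0.7),
     (0.80, 0.1, 0.2), (0.90, 0.2, 0.1), (0.70, 0.0, 0.3)]
y = [1, 1, 1, 0, 0, 0]
model = GaussianNaiveBayes().fit(X, y)
print(model.predict_proba((0.12, 0.85, 0.8)))
```

Because the classifier multiplies per-feature likelihoods within each class, it relies on the features being roughly independent, which is why the low pairwise correlations reported above matter.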

Figure 1.

Overview and network display of the curated genotype–phenotype associations in AtPID 5.0. (A) Top left: the top-level Plant Ontology (PO) entries in Arabidopsis and the number of annotated genes related to each PO entry. (B) Bottom right: the flower-associated network. (C) A node with its mouse-over annotation.

User-friendly visualization toolkit for the comprehensive genotype–phenotype network

To display the phenotype annotation information, we redeveloped the network visualization application (Figure 1B and C) in JavaScript; it inherits all the functions of the old Java applet and adds phenotype as a new node type. The new application has better compatibility and performance owing to the optimization of the database structure and the network generation methods, and it presents the network in a more interactive and comprehensive way. All the protein annotation information and protein relations in AtPID 5.0 can be presented simultaneously in the same view, and users can extend the network by double-clicking any node on the border of the current network. Combining genotype–phenotype associations with protein interaction information presents the existing knowledge about selected proteins in an intuitive way, helping biologists understand functional relations, check their hypotheses and design new studies.
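
The double-click expansion described above can be pictured as growing the visible set by one hop on a graph whose nodes are typed as genes or phenotypes. The sketch below is in Python purely for illustration: the AtPID viewer itself is written in JavaScript and its internals are not described here, so the class, its method names and the node identifiers are all hypothetical placeholders.

```python
from collections import defaultdict


class AnnotationNetwork:
    """Undirected graph with typed nodes and labeled edges (illustrative only)."""

    def __init__(self):
        self.node_type = {}             # node id -> "gene" or "phenotype"
        self.edges = defaultdict(dict)  # node id -> {neighbor id: relation}

    def add_edge(self, a, a_type, b, b_type, relation):
        self.node_type[a] = a_type
        self.node_type[b] = b_type
        self.edges[a][b] = relation
        self.edges[b][a] = relation

    def neighborhood(self, seed):
        """Nodes shown when the view is first opened on one node."""
        return {seed} | set(self.edges.get(seed, {}))

    def expand(self, visible, node):
        """Visible set after 'double-clicking' a node that is already on screen."""
        if node not in visible:
            raise ValueError(f"{node} is not in the current view")
        return visible | set(self.edges.get(node, {}))


# Placeholder identifiers, not real AtPID records.
net = AnnotationNetwork()
net.add_edge("geneA", "gene", "geneB", "gene", "PPI")
net.add_edge("geneB", "gene", "geneC", "gene", "transcriptional regulation")
net.add_edge("geneA", "gene", "PO:flower", "phenotype", "gene-PO")
net.add_edge("geneC", "gene", "PO:leaf", "phenotype", "gene-PO")

view = net.neighborhood("geneA")   # geneA, geneB, PO:flower
view = net.expand(view, "geneB")   # geneC joins the view
print(sorted(view))
```

Keeping phenotypes as ordinary nodes with their own type means that the same expansion step reveals PPIs, regulatory relations and gene-PO links alike, which matches the single-view idea described in the text.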

CONCLUSIONS

Here, we provide a significantly improved resource for genotype–phenotype associations, which can support experimental design and facilitate genome-wide systematic studies in Arabidopsis. AtPID 5.0 also presents the functional annotation and protein network through a user-friendly web-based interface. We have substantially extended the current annotation information in AtPID 5.0 through literature curation, bioinformatics predictions and high-throughput experimental data; for example, we generated a comprehensive chloroplast proteomics dataset in Arabidopsis through large-scale proteomics experiments and indexed all the data as evidence for subcellular localization in the current AtPID. We will continue to accumulate more genome-wide data to better serve the research community.

27 in total

1. A trial of phenome analysis using 4000 Ds-insertional mutants in gene-coding regions of Arabidopsis.

Authors: Takashi Kuromori; Takuji Wada; Asako Kamiya; Masahiro Yuguchi; Takuro Yokouchi; Yuko Imura; Hiroko Takabe; Tetsuya Sakurai; Kenji Akiyama; Takashi Hirayama; Kiyotaka Okada; Kazuo Shinozaki
Journal: Plant J Date: 2006-06-30 Impact factor: 6.417

2. Systematic identification of human mitochondrial disease genes through integrative genomics.

Authors: Sarah Calvo; Mohit Jain; Xiaohui Xie; Sunil A Sheth; Betty Chang; Olga A Goldberger; Antonella Spinazzola; Massimo Zeviani; Steven A Carr; Vamsi K Mootha
Journal: Nat Genet Date: 2006-04-02 Impact factor: 38.330

Review 3. Protein networks in disease.

Authors: Trey Ideker; Roded Sharan
Journal: Genome Res Date: 2008-04 Impact factor: 9.043

4. Gene trap lines define domains of gene regulation in Arabidopsis petals and stamens.

Authors: Naomi Nakayama; Juana M Arroyo; Joseph Simorowski; Bruce May; Robert Martienssen; Vivian F Irish
Journal: Plant Cell Date: 2005-07-29 Impact factor: 11.277

5. Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications.

Authors: Pierre Hilson; Joke Allemeersch; Thomas Altmann; Sébastien Aubourg; Alexandra Avon; Jim Beynon; Rishikesh P Bhalerao; Frédérique Bitton; Michel Caboche; Bernard Cannoot; Vasil Chardakov; Cécile Cognet-Holliger; Vincent Colot; Mark Crowe; Caroline Darimont; Steffen Durinck; Holger Eickhoff; Andéol Falcon de Longevialle; Edward E Farmer; Murray Grant; Martin T R Kuiper; Hans Lehrach; Céline Léon; Antonio Leyva; Joakim Lundeberg; Claire Lurin; Yves Moreau; Wilfried Nietfeld; Javier Paz-Ares; Philippe Reymond; Pierre Rouzé; Goran Sandberg; Maria Dolores Segura; Carine Serizet; Alexandra Tabrett; Ludivine Taconnat; Vincent Thareau; Paul Van Hummelen; Steven Vercruysse; Marnik Vuylsteke; Magdalena Weingartner; Peter J Weisbeek; Valtteri Wirta; Floyd R A Wittink; Marc Zabeau; Ian Small
Journal: Genome Res Date: 2004-10 Impact factor: 9.043

6. The BioGRID interaction database: 2015 update.

Authors: Andrew Chatr-Aryamontri; Bobby-Joe Breitkreutz; Rose Oughtred; Lorrie Boucher; Sven Heinicke; Daici Chen; Chris Stark; Ashton Breitkreutz; Nadine Kolas; Lara O'Donnell; Teresa Reguly; Julie Nixon; Lindsay Ramage; Andrew Winter; Adnane Sellam; Christie Chang; Jodi Hirschman; Chandra Theesfeld; Jennifer Rust; Michael S Livstone; Kara Dolinski; Mike Tyers
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

7. Araport: the Arabidopsis information portal.

Authors: Vivek Krishnakumar; Matthew R Hanlon; Sergio Contrino; Erik S Ferlanti; Svetlana Karamycheva; Maria Kim; Benjamin D Rosen; Chia-Yi Cheng; Walter Moreira; Stephen A Mock; Joseph Stubbs; Julie M Sullivan; Konstantinos Krampis; Jason R Miller; Gos Micklem; Matthew Vaughn; Christopher D Town
Journal: Nucleic Acids Res Date: 2014-11-20 Impact factor: 16.971

8. Sustainable funding for biocuration: The Arabidopsis Information Resource (TAIR) as a case study of a subscription-based funding model.

Authors: Leonore Reiser; Tanya Z Berardini; Donghui Li; Robert Muller; Emily M Strait; Qian Li; Yarik Mezheritsky; Andrey Vetushko; Eva Huala
Journal: Database (Oxford) Date: 2016-03-17 Impact factor: 3.451

9. Clustering proteins from interaction networks for the prediction of cellular functions.

Authors: Christine Brun; Carl Herrmann; Alain Guénoche
Journal: BMC Bioinformatics Date: 2004-07-13 Impact factor: 3.169

10. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression.

Authors: Lawrence Hunter; Zhiyong Lu; James Firby; William A Baumgartner; Helen L Johnson; Philip V Ogren; K Bretonnel Cohen
Journal: BMC Bioinformatics Date: 2008-01-31 Impact factor: 3.169
