
Update on genome completion and annotations: Protein Information Resource.

Abstract

The Protein Information Resource (PIR) recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt--the Universal Protein Resource--which now unifies the PIR, Swiss-Prot and TrEMBL databases. The PIRSF (SuperFamily) classification system is central to the PIR/UniProt functional annotation of proteins, providing classifications of whole proteins into a network structure to reflect their evolutionary relationships. Data integration and associative studies of protein family, function and structure are supported by the iProClass database, which offers value-added descriptions of all UniProt proteins with highly informative links to more than 50 other databases. The PIR system allows consistent, rich and accurate protein annotation for all investigators.

Substances: Proteins

Year: 2004 PMID: 15588483 PMCID: PMC3525084 DOI: 10.1186/1479-7364-1-3-229

Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639

Introduction

The high-throughput genome projects have resulted in a rapid accumulation of genome sequences for a large number of organisms. Meanwhile, researchers have begun to tackle gene function and other complex regulatory processes by studying organisms at the global scale for various levels of biological organisation. To fully exploit the value of the data, bioinformatics infrastructures are urgently needed to identify proteins encoded by these genomes and to understand how these proteins function in making up a living cell. The Protein Information Resource (PIR) is a public bioinformatics database, and is located at the Georgetown University Medical Center (Washington, DC). PIR (http://pir.georgetown.edu) provides an advanced framework for comparative analysis and functional annotation of proteins. PIR recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt [1] (http://www.uniprot.org), the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function and was created by joining the information contained in PIR-PSD, Swiss-Prot and TrEMBL. To facilitate consistent and accurate protein annotation, PIR has extended its superfamily concept and developed the PIR SuperFamily (PIRSF) classification system [2]. This framework is supported by the iProClass integrated database of protein family, function and structure [3]. iProClass offers value-added descriptions of all UniProt proteins and has highly informative links to more than 50 other databases of protein family, function, pathway, interaction, modification, structure, genome, ontology, literature and taxonomy (Figure 1).

Figure 1

Diagram of the interrelated links of the iProClass database. Comprehensive protein and superfamily views exist in two types of summary reports. The protein sequence report covers information on family, structure, function, gene, genetics, disease, ontology, taxonomy and literature, with cross-references to relevant molecular databases and executive summary lines, as well as graphical display of domain and motif regions. The superfamily report provides PIR superfamily membership information with length, taxonomy and keyword statistics, complete member listing separated by major kingdoms, family relationships at the whole protein and domain and motif levels with direct mapping to other classifications, structure and function cross-references, and domain and motif graphical display.

PIR, then and now

For more than three decades, PIR has made many protein databases and analysis tools freely accessible to the scientific community. These include the Protein Sequence Database (PSD) -- the first international protein database -- which grew out of the Atlas of Protein Sequence and Structure, edited by Margaret Dayhoff [1965-1978], a pioneer in molecular evolution research. As a core resource, the PIR environment is widely used by researchers to develop other bioinformatics infrastructures and algorithms and to enable basic and applied scientific research. The current version (January 2004) consists of more than 1,232,000 (non-redundant PIR-PSD, Swiss-Prot and TrEMBL) proteins organised into more than 36,290 PIR superfamilies, 145,340 families, 5,720 Pfam and PIR homology domains, 1,300 PROSITE/ProClass motifs, 280 RESID post-translational modification sites, 550,000 FASTA similarity clusters and links to more than 50 molecular biology databases. iProClass cross-references include: databases for protein families (eg COG, InterPro); functions and pathways (eg KEGG, WIT); protein-protein interactions (eg DIP); structures and structural classifications (eg PDB, SCOP, CATH, PDBSum); genes and genomes (eg TIGR, OMIM); ontologies (eg Gene Ontology); literature (NCBI PubMed); and taxonomy (NCBI taxonomy). Coupling protein classification and data integration allows associative studies of protein family, function and structure [3]. Domain-based or structural classification-based searches allow identification of protein families sharing domains or structural-fold classes. Functional convergence (unrelated proteins with the same activity) and functional divergence are revealed by the relationships between the enzyme classification and protein family classification. With the underlying taxonomic information in hand, protein families that occur in given lineages can be identified.
Combining phylogenetic pattern and biochemical pathway information for protein families allows identification of alternative pathways to the same end product in different taxonomic groups, which may suggest potential drug targets. The systematic approach for protein family curation, using integrative data, leads to novel predictions and functional inference for uncharacterised 'hypothetical' proteins, and to detection and correction of genome annotation errors (a few examples are listed in Table 1). Such studies may serve as a basis for further analysis of protein functional evolution and its relationship to the co-evolution of metabolic pathways, cellular networks and organisms.
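The comparison of enzyme classification with family classification described above can be sketched in a few lines. This is a minimal illustration only: the protein identifiers, family names and EC assignments below are hypothetical toy data, not iProClass records, and the function name is an assumption. An EC number shared by several families is only a candidate for functional convergence, since the families must also be shown to be unrelated.

```python
from collections import defaultdict

# Hypothetical toy annotations: (protein id, family id, EC number).
# These are illustrative placeholders, not real database records.
ANNOTATIONS = [
    ("P1", "FAM_A", "1.1.1.1"),
    ("P2", "FAM_A", "1.1.1.1"),
    ("P3", "FAM_B", "1.1.1.1"),
    ("P4", "FAM_C", "2.7.1.1"),
    ("P5", "FAM_C", "3.1.3.2"),
]


def convergence_and_divergence(annotations):
    """Return (convergent, divergent) candidate tables.

    convergent: EC number -> sorted families, for EC numbers seen in 2+ families.
    divergent:  family -> sorted EC numbers, for families with 2+ EC numbers.
    """
    families_by_ec = defaultdict(set)
    ecs_by_family = defaultdict(set)
    for _protein, family, ec in annotations:
        families_by_ec[ec].add(family)
        ecs_by_family[family].add(ec)
    convergent = {ec: sorted(fams) for ec, fams in families_by_ec.items() if len(fams) > 1}
    divergent = {fam: sorted(ecs) for fam, ecs in ecs_by_family.items() if len(ecs) > 1}
    return convergent, divergent


convergent, divergent = convergence_and_divergence(ANNOTATIONS)
print(convergent)  # same activity, several families
print(divergent)   # one family, several activities
```

Adding a taxonomy column to each record and grouping on it in the same way would give the lineage-specific view mentioned in the text.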

Table 1

Protein family classification and integrative associative analysis for functional annotation*

A. Functional inference of uncharacterised hypothetical proteins
SF034452	TIM-barrel signal transduction protein
SF004961	Metal-dependent hydrolase
SF005928	Nucleotidyltransferase
SF005933	ATPase with chaperone activity and inactive LON protease domain
SF005211	Alpha/beta hydrolase
SF014673	Lipid carrier protein
SF005019	[Ni, Fe]-hydrogenase-3-type complex, membrane protein EhaA
B. Correction, or improvement, of genome annotations
SF025624	Ligand-binding protein with an ACT domain
SF005003	Inactive homologue of metal-dependent protease
SF000378	Glycyl radical cofactor protein YfiD
SF000876	Chemotaxis response regulator methylesterase CheB
SF000881	Thioesterase, type II
SF002845	Bifunctional tetrapyrrole methylase and MazG NTPase
C. Enhanced understanding of structure, function and evolutionary relationships
SF005965	Chorismate mutase, AroH class
SF001501	Chorismate mutase, AroQ class, prokaryotic type

*PIRSF protein family reports detail supporting evidence for both experimentally validated and computationally predicted annotations.


Organisational levels of protein groups

PIR has three organisational levels of protein groups -- namely, protein sequence entry, homeomorphic superfamily/family/subfamily and domain superfamily.

Protein sequence entries

A UniProt protein sequence entry represents the unprocessed normal product of a gene (or, sometimes, of very closely-related genes) from a single species. (A number of Swiss-Prot entries still contain identical sequences from different species, which will be unmerged in future releases.) The mature protein chain and its modifications are detailed in the feature table. To the extent that is practical, UniProt aims to have one entry for each genetic locus that encodes protein. When the sequence variation is more extensive than can be conveniently represented within the entry, however, additional entries may be constructed for splice variants, allelic variants and strain variants. The source data from which entries are constructed include entries that represent a single sequence report, either published or deposited in a databank. Often, such reports will need to be 'merged' with other reports representing the same protein sequence. The UniProt staff attempt to identify these cases and perform the required merges.

Protein families

For the purposes of standardising annotation, database entries are organised into families of closely-related sequences. These generally represent proteins with the same function in various organisms. The taxonomic distribution within a family will depend on how well the structure and function of the protein are conserved. As a general guideline, sequences having more than 50 per cent sequence identity are usually similar in structure and function, and the major sequence features are unambiguously aligned by commonly-used multiple sequence alignment programs. Therefore, 50 per cent sequence identity is used by the database staff for the provisional clustering of proteins into families. This threshold is appropriate in many cases; however, some families may be repartitioned into more convenient clusters after PIR review.
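The provisional clustering step can be illustrated with a small sketch. It assumes that the sequences have already been aligned to equal length (with '-' for gaps), that identity is measured over columns where neither sequence has a gap, and that any pair above the threshold is linked (single linkage). None of these choices is stated in the text, so the code is an illustration of the 50 per cent guideline and not the procedure used by the database staff; the sequences are invented.

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences ('-' = gap).

    Columns where either sequence has a gap are ignored; this is only one of
    several common conventions for the denominator.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def provisional_families(aligned, threshold=50.0):
    """Single-linkage clustering: link every pair with identity above threshold."""
    ids = sorted(aligned)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if percent_identity(aligned[i], aligned[j]) > threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented aligned sequences, for illustration only.
aligned = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",
    "seqC": "MKSAYLGKHR",
    "seqD": "GWVPLEDNSC",
}
families = provisional_families(aligned)
print(families)
```

Single linkage can chain distant sequences together through intermediates, which is one reason a manual review step such as the one described above remains useful.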

Homeomorphic superfamilies/families/subfamilies

The PIR superfamily concept [4], the original classification based on sequence similarity, has been used as a guiding principle to provide comprehensive and non-overlapping clustering of PIR protein sequences into a hierarchical order to reflect their evolutionary relationships [5]. To facilitate sensible propagation and standardisation of protein annotation and systematic detection of annotation errors as part of the UniProt project, PIR has extended its hierarchical superfamily concept and developed the PIRSF system, a network classification system based on the evolutionary relationships of whole proteins. Proteins are considered 'homeomorphic' if they share full-length sequence similarity and a common domain architecture, as indicated by the same type, number and order of defined domains. Length deviation may occur for alternative-splice and alternate-initiator variants, sequence fragments and peptides derived from proteolytic processing. Variation of the domain architecture may exist for repeating domains and/or auxiliary domains, which are often mobile and may easily be lost, acquired or functionally replaced during evolution. Classification based on whole proteins, rather than on the component domains, allows annotation of both generic biochemical and specific biological functions. The network structure accommodates a flexible number of levels that reflect varying degrees of sequence conservation (superfamily, family and subfamily). The threshold values of sequence similarity may vary at each level, depending on the evolutionary rate in each group of proteins (ie the taxonomic distribution within a protein group will depend on how well conserved are the structure and function of the protein). The network structure allows improved protein annotation, more accurate extraction of conserved functional residues, and classification of distantly-related orphan proteins. 
Homeomorphic families and subfamilies -- generally representing proteins with the same function in various organisms -- are suitable for propagating standardised protein names, position-specific features (such as functional sites) and keywords. Distantly-related homeomorphic families and orphan proteins sharing a common domain architecture may form a homeomorphic superfamily. It is assumed, although in most cases this has not been investigated in detail, that the molecules in a homeomorphic superfamily share a common evolutionary history. Thus, it should be valid to construct an evolutionary tree from the members of a homeomorphic superfamily. If two groups of proteins with the same architecture or function are shown to have arrived at that structure independently (convergent evolution), they are appropriately separated into two homeomorphic superfamilies. For example, the cytochrome P450 (CYP) [6] and nitric oxide synthase (NOS) [7] families of enzymes both carry out "P450-like" oxygenation reactions and were at first believed to be evolutionarily related. Upon further in-depth analysis, however, no evidence for an evolutionary relationship between the two gene superfamilies was found [5], so this is most likely a case of convergent evolution.
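The domain-architecture half of the 'homeomorphic' criterion (same type, number and order of defined domains) can be expressed as a simple comparison. The sketch below is an assumption-laden illustration: domain names are placeholders, the function names are hypothetical, and full-length sequence similarity, the other half of the criterion, is not modelled. The optional `collapse_repeats` flag imitates the tolerance for repeating domains mentioned above.

```python
from itertools import groupby


def architecture(domains, collapse_repeats=False):
    """Ordered tuple of domain names, N-terminus to C-terminus.

    With collapse_repeats=True, consecutive copies of the same domain are
    merged, so a differing tandem-repeat count is tolerated.
    """
    if collapse_repeats:
        return tuple(name for name, _run in groupby(domains))
    return tuple(domains)


def same_architecture(domains_a, domains_b, collapse_repeats=False):
    """True if both proteins share the same type, number and order of domains."""
    return (architecture(domains_a, collapse_repeats)
            == architecture(domains_b, collapse_repeats))


# Placeholder domain assignments (hypothetical names, not real Pfam entries).
protein_x = ["domA", "domB", "domC"]
protein_y = ["domA", "domB", "domC"]
protein_z = ["domB", "domA", "domC"]          # same domains, different order
protein_w = ["domA", "domB", "domB", "domC"]  # one extra tandem repeat

print(same_architecture(protein_x, protein_y))
print(same_architecture(protein_x, protein_z))
print(same_architecture(protein_x, protein_w, collapse_repeats=True))
```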

Domain superfamilies

Many types of domains have been found in diverse proteins. In common use, for example, the term 'protein kinase superfamily' refers to the collection of all proteins that contain a protein kinase-like domain. PIR calls such a group a 'domain superfamily'. Any given protein sequence will be assigned to only one homeomorphic superfamily, but it may contain sequence segments belonging to several domain superfamilies [5].
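The one-to-one versus one-to-many relationship described here can be captured in a small data model. The sketch is hypothetical: the class, field names, accessions and superfamily labels are invented for illustration and do not reflect the PIR schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    homeomorphic_superfamily: str  # exactly one per protein
    domain_superfamilies: list = field(default_factory=list)  # zero or more


def domain_superfamily_index(proteins):
    """Map each domain superfamily to the sorted accessions that contain it."""
    index = defaultdict(set)
    for protein in proteins:
        for domain in protein.domain_superfamilies:
            index[domain].add(protein.accession)
    return {domain: sorted(accs) for domain, accs in index.items()}


# Invented records for illustration.
proteins = [
    Protein("X1", "HSF_1", ["protein kinase-like", "dom_B"]),
    Protein("X2", "HSF_2", ["protein kinase-like"]),
    Protein("X3", "HSF_2"),
]
index = domain_superfamily_index(proteins)
print(index["protein kinase-like"])
```

Here X1 and X2 sit in different homeomorphic superfamilies yet both belong to the 'protein kinase-like' domain superfamily, which is the situation the text describes.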

Recent directions for additional protein analyses and databases

With the new surge in interest in the fields of subcellular and intracellular signal transduction circuitry and 'systems biology' [8], confirmed protein-protein interactions are being registered at the Human Protein Reference Database (HPRD; http://www.hprd.org) [9]. Another bioinformatics effort under development is the Secreted Protein Discovery Initiative (SPDI), which has begun to identify novel human secreted and transmembrane proteins [10]. A Bayesian networks approach for predicting protein-protein interactions, genome-wide, in yeast [11] is available at: http://genecensus.org/intint. A protein interaction map for Drosophila melanogaster has very recently been developed [12], as a starting point for systems biology modelling of multicellular organisms, including humans. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa using a Markov Cluster algorithm; it can be applied to two genomes and extended to orthologue clustering across multiple species (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating Plasmodium falciparum genes, for example, identifies numerous enzymes that were incompletely annotated in first-pass annotation of that parasite genome [13]. Finally, the evolutionary divergence of large enzyme protein families, based on the complexities of their substrates, can be compared by a profile Hidden Markov Model method; the method was recently used to classify 47 glycosyltransferase families in the CAZy database into four superfamilies [14].

References (14 in total)

1. Nitric oxide synthases: domain structure and alignment in enzyme function and control. (Review)

Authors: Dipak K Ghosh; J C Salerno
Journal: Front Biosci Date: 2003-01-01

2. A Bayesian networks approach for predicting protein-protein interactions from genomic data.

Authors: Ronald Jansen; Haiyuan Yu; Dov Greenbaum; Yuval Kluger; Nevan J Krogan; Sambath Chung; Andrew Emili; Michael Snyder; Jack F Greenblatt; Mark Gerstein
Journal: Science Date: 2003-10-17 Impact factor: 47.728

3. PIRSF: family classification system at the Protein Information Resource.

Authors: Cathy H Wu; Anastasia Nikolskaya; Hongzhan Huang; Lai-Su L Yeh; Darren A Natale; C R Vinayaka; Zhang-Zhi Hu; Raja Mazumder; Sandeep Kumar; Panagiotis Kourtesis; Robert S Ledley; Baris E Suzek; Leslie Arminski; Yongxing Chen; Jian Zhang; Jorge Louie Cardenas; Sehee Chung; Jorge Castro-Alvear; Georgi Dinkov; Winona C Barker
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. UniProt: the Universal Protein knowledgebase.

Authors: Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. The secreted protein discovery initiative (SPDI), a large-scale effort to identify novel human secreted and transmembrane proteins: a bioinformatics assessment.

Authors: Hilary F Clark; Austin L Gurney; Evangeline Abaya; Kevin Baker; Daryl Baldwin; Jennifer Brush; Jian Chen; Bernard Chow; Clarissa Chui; Craig Crowley; Bridget Currell; Bethanne Deuel; Patrick Dowd; Dan Eaton; Jessica Foster; Christopher Grimaldi; Qimin Gu; Philip E Hass; Sherry Heldens; Arthur Huang; Hok Seon Kim; Laura Klimowski; Yisheng Jin; Stephanie Johnson; James Lee; Lhney Lewis; Dongzhou Liao; Melanie Mark; Edward Robbie; Celina Sanchez; Jill Schoenfeld; Somasekar Seshagiri; Laura Simmons; Jennifer Singh; Victoria Smith; Jeremy Stinson; Alicia Vagts; Richard Vandlen; Colin Watanabe; David Wieand; Kathryn Woods; Ming-Hong Xie; Daniel Yansura; Sothy Yi; Guoying Yu; Jean Yuan; Min Zhang; Zemin Zhang; Audrey Goddard; William I Wood; Paul Godowski; Alane Gray
Journal: Genome Res Date: 2003-09-15 Impact factor: 9.043

6. Superfamily classification in PIR-International Protein Sequence Database.

Authors: W C Barker; F Pfeiffer; D G George
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

7. A protein interaction map of Drosophila melanogaster.

Authors: L Giot; J S Bader; C Brouwer; A Chaudhuri; B Kuang; Y Li; Y L Hao; C E Ooi; B Godwin; E Vitols; G Vijayadamodar; P Pochart; H Machineni; M Welsh; Y Kong; B Zerhusen; R Malcolm; Z Varrone; A Collis; M Minto; S Burgess; L McDaniel; E Stimpson; F Spriggs; J Williams; K Neurath; N Ioime; M Agee; E Voss; K Furtak; R Renzulli; N Aanensen; S Carrolla; E Bickelhaupt; Y Lazovatsky; A DaSilva; J Zhong; C A Stanyon; R L Finley; K P White; M Braverman; T Jarvie; S Gold; M Leach; J Knight; R A Shimkets; M P McKenna; J Chant; J M Rothberg
Journal: Science Date: 2003-11-06 Impact factor: 47.728

8. OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors: Li Li; Christian J Stoeckert; David S Roos
Journal: Genome Res Date: 2003-09 Impact factor: 9.043

9. Development of human protein reference database as an initial platform for approaching systems biology in humans.

Authors: Suraj Peri; J Daniel Navarro; Ramars Amanchy; Troels Z Kristiansen; Chandra Kiran Jonnalagadda; Vineeth Surendranath; Vidya Niranjan; Babylakshmi Muthusamy; T K B Gandhi; Mads Gronborg; Nieves Ibarrola; Nandan Deshpande; K Shanker; H N Shivashankar; B P Rashmi; M A Ramya; Zhixing Zhao; K N Chandrika; N Padma; H C Harsha; A J Yatish; M P Kavitha; Minal Menezes; Dipanwita Roy Choudhury; Shubha Suresh; Neelanjana Ghosh; R Saravana; Sreenath Chandran; Subhalakshmi Krishna; Mary Joy; Sanjeev K Anand; V Madavan; Ansamma Joseph; Guang W Wong; William P Schiemann; Stefan N Constantinescu; Lily Huang; Roya Khosravi-Far; Hanno Steen; Muneesh Tewari; Saghi Ghaffari; Gerard C Blobe; Chi V Dang; Joe G N Garcia; Jonathan Pevsner; Ole N Jensen; Peter Roepstorff; Krishna S Deshpande; Arul M Chinnaiyan; Ada Hamosh; Aravinda Chakravarti; Akhilesh Pandey
Journal: Genome Res Date: 2003-10 Impact factor: 9.043

10. Comparison of glycosyltransferase families using the profile hidden Markov model.

Authors: Norihiro Kikuchi; Yeon-Dae Kwon; Masanori Gotoh; Hisashi Narimatsu
Journal: Biochem Biophys Res Commun Date: 2003-10-17 Impact factor: 3.575
