
SPD--a web-based secreted protein database.

Yunjia Chen, Yong Zhang, Yanbin Yin, Ge Gao, Songgang Li, Ying Jiang, Xiaocheng Gu, Jingchu Luo.

Abstract

Using an improved secreted protein prediction approach and comprehensive data sources, including Swiss-Prot, TrEMBL, RefSeq, Ensembl and CBI-Gene, we have constructed secretomes of human, mouse and rat, with a total of 18 152 secreted proteins. All entries are ranked according to prediction confidence. They were further annotated via a proteome annotation pipeline that we developed. We also set up a secreted protein classification pipeline and classified our predicted secreted proteins into different functional categories. To make the dataset more convincing and comprehensive, nine reference datasets are also integrated, such as the secreted proteins from the Gene Ontology Annotation (GOA) system at the European Bioinformatics Institute and the vertebrate secreted proteins from Swiss-Prot. All these entries were grouped via a TribeMCL-based clustering pipeline. We have constructed a web-based secreted protein database, which is publicly available at http://spd.cbi.pku.edu.cn. Users can browse the database via a GO-assignment-based or chromosomal-location-based interface. Text query and sequence similarity search are also provided, and the sequence and annotation data can be downloaded freely from the SPD website.


Year: 2005 PMID: 15608170 PMCID: PMC540047 DOI: 10.1093/nar/gki093

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Secreted proteins, such as cytokines, chemokines, hormones, digestive enzymes, antibodies and components of the extracellular matrix, are secreted from cells into the extracellular space. They play pivotal biological regulatory roles and have potential as protein therapeutics (1). According to the signal hypothesis, the majority of secreted proteins have a signal peptide (2). Signal peptides are located at the N-terminus of nascent proteins, are usually <70 amino acid residues long, and are cleaved as the protein enters the endoplasmic reticulum (ER) lumen. The signal peptide is a hallmark of secreted proteins; however, many transmembrane (TM) proteins also have a signal peptide (3,4). Several secreted protein prediction methods have been developed, mainly based on the analysis of signal peptides, and genome-wide identification of potential novel secreted proteins has been reported (5–7). In this study, we applied an improved secreted protein prediction approach, CJ-SPHMM+TMHMM+PSORT (8–10), to a comprehensive set of data sources, including Swiss-Prot/TrEMBL (11), RefSeq (12), Ensembl (13) and the locally constructed CBI-Gene, and constructed the secretomes of human, mouse and rat. We also set up a complete secreted protein classification pipeline and classified our predicted secreted proteins into different functional categories. To make our predicted results more comprehensive, we collected nine reference datasets, including the Secreted Protein Discovery Initiative (SPDI) (5), the Riken mouse secretome (6) and the secreted proteins from Gene Ontology Annotation (GOA) (14). A TribeMCL-based (15) clustering pipeline was implemented to group our predicted secreted proteins with these reference sequences. All the sequence data and annotation information are publicly available at http://spd.cbi.pku.edu.cn.

CONSTRUCTION OF THE PIPELINE

Collection and annotation of the core dataset

SPD consists of a core dataset and a reference dataset. The core dataset contains 18 152 secreted proteins retrieved from Swiss-Prot/TrEMBL, Ensembl, RefSeq and CBI-Gene. The pipeline used to construct the dataset is shown in Figure 1. A combined strategy of automatic processing and manual intervention was applied to collect as many secreted proteins as possible. The Rank0 dataset from Swiss-Prot includes some partial sequences lacking the N- or C-terminus, because its entries were collected according to database annotation. Because most signal peptides are located at the N-terminus of proteins and are fundamental to secreted protein prediction, and because some entries in CBI-Gene, Ensembl, Swiss-Prot/TrEMBL and RefSeq are hypothetical and truncated, we removed entries without an N-terminal methionine (Met, M) from our prediction results. Therefore, all proteins in the Rank1, Rank2 and Rank3 datasets have an N-terminal Met.

Figure 1

The pipeline we used to collect secreted proteins. Sequence data were downloaded by December 2003. Non-redundant: identical sequences were excluded via pairwise BLASTP. Sec-HMMER (CJ-SPHMM/TMHMM): the tool used to predict signal peptides and transmembrane regions. Set1 and Set2: proteins predicted by PSORT and Sec-HMMER, respectively; Set3: proteins matched by Sec-HMMER only. Rank0: known secreted proteins in Swiss-Prot; Rank1: predicted by both PSORT and Sec-HMMER; Rank2: predicted by either PSORT or Sec-HMMER; and Rank3: predicted by Sec-HMMER only and with a signal peptide >70 amino acids.
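The rank definitions in the Figure 1 legend can be read as a small decision rule. The following is a minimal sketch of that reading, not the authors' implementation: the `Prediction` record, its field names and the order in which the Rank2 and Rank3 conditions are tested are assumptions made for illustration. In particular, the sketch assumes that a Sec-HMMER-only hit with a signal peptide longer than 70 residues goes to Rank3, and that any other single-tool hit goes to Rank2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical record holding the evidence for one protein."""
    sequence: str                # protein sequence, one-letter code
    known_secreted: bool         # annotated as secreted in Swiss-Prot
    psort_secreted: bool         # PSORT predicts secretion
    sechmmer_secreted: bool      # Sec-HMMER (CJ-SPHMM/TMHMM) predicts secretion
    signal_peptide_len: int = 0  # predicted signal peptide length (residues)


def assign_rank(p: Prediction) -> Optional[int]:
    """Return the confidence rank (0-3), or None if the entry is excluded."""
    # Rank0 entries come from database annotation and may be partial sequences.
    if p.known_secreted:
        return 0
    # Predicted entries must start with an N-terminal methionine.
    if not p.sequence.startswith("M"):
        return None
    if p.psort_secreted and p.sechmmer_secreted:
        return 1
    # Assumed precedence: test the Rank3 condition before the Rank2 condition.
    if p.sechmmer_secreted and p.signal_peptide_len > 70:
        return 3
    if p.psort_secreted or p.sechmmer_secreted:
        return 2
    return None
```

Under these assumptions, a protein predicted by both tools is assigned Rank1, while a truncated sequence that does not start with Met is dropped unless it is already a known Swiss-Prot secreted protein.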

All 18 152 sequences were annotated via the Protein Centric Annotation System (PCAS), an integrated protein annotation system that we developed previously (16). Functional classification was also performed on all datasets. We extracted all vertebrate secreted proteins from the Swiss-Prot database. Based on the annotation information, they were assigned to 11 classes: antibiotic protein, apolipoprotein, casein, cytokine, hormone, immune system protein, neuropeptide and defense peptide, protease, protease inhibitor, toxin and Wnt protein. Entries without explicit functional annotation were defined as ‘other secreted proteins’. Based on the cross-link information in Swiss-Prot, we obtained representative motifs and domains from Pfam (17), PROSITE (18), SMART (19) and PRINTS (20), each representing one of the above 11 functional classes. Taking the Wnt entry in Pfam as an example, a protein sequence with a Wnt domain was classified as a Wnt protein. BLAST searches (21) against the above 11 classes were performed with all the predicted secreted proteins as queries. If the query sequence had >50% identity with a known protein in class A and the match covered >80% of the sequence length, we classified this novel secreted protein into class A. Proteins that failed to meet this cutoff were still classified into class A if they contained a motif or domain belonging to proteins in class A. A similar approach to predicting the sub-cellular location of proteins has been described previously (22). A total of ∼3000 novel secreted proteins were classified into these 11 classes. Details of the domains and protein function assignment can be found at the SPD website. However, the majority of the predicted proteins could not be assigned to these classes, since the number of known representative domains is still very small.
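The two-step classification rule described above (a similarity cutoff first, then a domain-based fallback) can be sketched as follows. This is an illustrative sketch only: the `Hit` record, the function name and the tie-breaking choice of taking the hit with the highest identity are assumptions, and the thresholds are simply the >50% identity and >80% coverage values quoted in the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class Hit:
    """Hypothetical summary of one BLAST match to a classified known protein."""
    functional_class: str    # class of the matched known protein
    percent_identity: float  # 0-100
    percent_coverage: float  # 0-100, share of the sequence length matched


def classify(hits: Iterable[Hit],
             query_domains: Set[str],
             class_domains: Dict[str, Set[str]],
             min_identity: float = 50.0,
             min_coverage: float = 80.0) -> Optional[str]:
    """Return a functional class name, or None if the protein stays unassigned."""
    # Step 1: similarity rule. Keep the qualifying hit with the highest
    # identity (an assumed tie-breaker; the text does not specify one).
    best: Optional[Hit] = None
    for hit in hits:
        if hit.percent_identity > min_identity and hit.percent_coverage > min_coverage:
            if best is None or hit.percent_identity > best.percent_identity:
                best = hit
    if best is not None:
        return best.functional_class
    # Step 2: fallback. Assign the first class whose representative
    # motifs/domains overlap with those found in the query.
    for functional_class, domains in class_domains.items():
        if query_domains & domains:
            return functional_class
    return None
```

For example, a query with no qualifying BLAST hit but with a Wnt domain would be assigned to the Wnt class, provided the hypothetical `class_domains` mapping lists that domain for the class.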

Collection and clustering of the reference dataset

The SPD reference dataset consists of the following data sources: SPDI (5), Riken Mouse Secretome (6), the human and pufferfish secretome (7), Swiss-Prot secreted proteins of vertebrates except for human, mouse and rat (11), secreted proteins extracted from GO assignment (14), DBSubLoc (23), NPD (24), TMPDB (25) and NESbase (26). Most of the data were downloaded from their websites or retrieved from the supplementary materials of the related literature. Vertebrate secreted proteins were extracted from Swiss-Prot according to the annotation. As for GOA, we took all entries annotated as ‘extracellular matrix’ (ID 0005578) or ‘extracellular space’ (ID 0005615) as secreted proteins. Together, these nine datasets constitute a comprehensive reference dataset. SPDI, Riken, the human and pufferfish secretome, the Swiss-Prot vertebrates and the GO datasets focus on collecting secreted proteins and were taken as positive controls. On the other hand, NPD, TMPDB and NESbase collect nuclear proteins, transmembrane proteins and proteins with a nuclear export signal, respectively, and may serve as negative controls. In total, the SPD core dataset includes 65–75% positive controls and 5–8% negative controls. DBSubLoc is a database of the sub-cellular location of proteins and is hence useful in both respects. In addition, pairwise BLASTP was performed between each entry of the SPD core dataset and the reference datasets. The output was processed via TribeMCL, and relevant entries were clustered together. The BLAST E-value cutoff was set to 1E−10 and the TribeMCL inflation value was set to 5 (27).
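The pre-clustering step, in which pairwise BLASTP hits are filtered at the E-value cutoff of 1E−10, might look like the sketch below. It assumes that the hits are stored in the standard 12-column tab-separated BLAST tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), that self-hits are skipped, and that a hit exactly at the cutoff is kept; none of these details is stated in the text. The clustering itself, including the inflation value of 5, would be handled by TribeMCL and is not shown.

```python
from typing import Iterable, Iterator, Tuple


def filter_blast_hits(lines: Iterable[str],
                      max_evalue: float = 1e-10) -> Iterator[Tuple[str, str, float]]:
    """Yield (query, subject, evalue) for hits that pass the E-value cutoff."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore malformed rows that lack the 12 standard columns.
        if len(fields) < 12:
            continue
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        # Drop self-hits; keep pairs at or below the cutoff (assumed inclusive).
        if query != subject and evalue <= max_evalue:
            yield query, subject, evalue
```

The resulting pairs could then be written out in whatever input format the clustering step expects.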

WEB INTERFACE

Currently, the SPD web interface includes five modules. (i) Browse: browse SPD proteins according to chromosome and functional classification or GO assignment. (ii) Search: text query of the core dataset by protein IDs, keywords and descriptions, and sequence similarity search against both the core and reference datasets. (iii) Download: download SPD protein sequences, corresponding cDNAs, etc. (iv) Data statistics: a statistics table of the secreted proteins in each division. (v) Help: frequently asked questions about SPD, including descriptions of how to use the web interface and details of the SPD construction pipeline. The chromosomal browser was designed to show the chromosomal distribution of proteins with a common feature, for example, proteins from the same data source or with the same confidence rank (Figure 2A). The GO browser organizes the proteins based on their GO assignments. The text search function supports Boolean mode and can be used to look for proteins with a particular description, length, etc. Sequence similarity search can be used to find entries similar to the query sequence via BLASTP or BLASTX.

Figure 2

Screen snapshots of the SPD database, only partially drawn (A and B) for clarity. (A) An example page of the chromosomal location of human protease secreted proteins. The clickable ‘+’ and ‘−’ symbols denote the plus and minus strands as indicated in the browser at the SPD website. (B) An example of an SPD core dataset entry (TAFA 3.2) showing the four divisions of the entry format and various fields with links to detailed information and the original data source. (C) An example of an SPD reference dataset entry (AY359017) showing two divisions with general information.

Entry format of the core dataset

The data fields of core dataset entries are organized into four major divisions, displayed as separate parts within a table on the web page: the general information, the SPD annotation, the SPD cross-reference and the protein family (Figure 2B). Each entry starts with a header line at the top with links to the PCAS annotation and the original websites of this protein (16). The ‘General information’ section shows names, descriptions, reference papers, cDNAs, the GO assignment, etc., retrieved from the original data sources. Names of Swiss-Prot/TrEMBL/RefSeq entries were taken from the ucsc_kgXref table (28). Ensembl names were retrieved by the EnsMart batch query (29). For the ‘GO’ field, if multiple GO entries were assigned to a protein, a schematic figure can be shown to display the relationship between these GO entries. The ‘SPD annotation’ includes confidence rank, signal peptide cleavage site, functional category, domain structure, chromosomal loci, loci cluster, homolog clusters, etc. The ‘Domain structure’ field displays possible domain architectures derived from the PCAS system. The ‘Loci structure’ and ‘Loci cluster’ fields show the chromosomal context of this entry, obtained from BLAT searches against the UCSC HG16, mm4 and rn3 assemblies, respectively (30,31). Users can browse detailed information such as the intron/exon structure, synteny pair information, genetic band, etc. The ‘Homolog cluster’ field displays similar sequences with overall identity >90% and overall length coverage >90%. The ‘SPD Cross-Reference’ field shows similar sequences from the different reference datasets. By default, only sequences with overall identity >90% are shown. The ‘Protein Family’ field gives the corresponding cluster produced by TribeMCL.

Entry format of the reference dataset

Compared with the format of the core dataset, the layout of reference dataset entries is relatively simple and is divided into two sections. The first section shows the original information, such as the data source, and the second lists similar sequences from the core dataset above a certain overall identity cutoff (Figure 2C).

DISCUSSION

SPD is comprehensive

SPD tends to be inclusive, not exclusive. In other words, SPD collects as many secreted proteins as possible. First, all distinct sequences are kept in the database, including entries with >90% identity to each other, because it is difficult to decide which variant sequence is correct. Second, nine reference datasets were introduced to help increase the coverage. For example, most members of a recently identified secreted protein family, TAFA, are covered by SPD (32). The reference datasets may also give users information, such as similarity relationships between entries, that helps them discern possible redundancy. In addition, the rank system shows the prediction confidence of each entry.

Distinguish true positives from false positives

Four modules are provided to help users judge which entries are true positives. (i) The rank system: proteins of Rank0 or Rank1 tend to be more convincing than those of Rank2 and Rank3. (ii) The category assignment: proteins classified into relevant functional categories are usually more reliable. (iii) Clustering information: users may make a judgment according to the clustering information. For example, an SPD secreted protein might be more reliable if it is grouped into a cluster comprising many entries from the GO secreted proteins, the Riken mouse secretome, etc. (iv) GO assignment: proteins with GO assignments such as ‘extracellular space’ or ‘extracellular matrix’ are likely to be true positives. In contrast, proteins with assignments such as ‘integral to membrane’ tend to be false positives.

SPD is tuned for biologists

SPD has been tuned for biologists looking for novel secreted proteins. First, mRNA or cDNA sequences can be found in the ‘cross-reference’ field. Second, the number of references is shown in the ‘description’ field, which indicates whether the protein is novel or not.

Conflicts with GO assignments

Using the GO browser, users may find that many proteins carry GO assignments that conflict with secretion, such as ‘membrane’ or ‘intracellular’. There are three possible explanations: (i) some proteins can be sorted to multiple locations; (ii) some proteins have low prediction confidence, such as those in Rank3; and (iii) the GO assignment itself may not be very convincing; for example, cellular component information labeled with IEA (inferred from electronic annotation) or NR (not recorded) tends to be less reliable.
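The reasoning above can be illustrated with a small helper that sorts the GO cellular component annotations of one protein into supporting, conflicting and weakly supported conflicting groups. This is a hypothetical sketch, not a feature of the SPD website: the function name, the grouping labels and the input format of (term name, evidence code) pairs are assumptions, and the term lists contain only the examples named in the text.

```python
from typing import Dict, Iterable, List, Tuple

# Terms taken from the examples in the text; real analyses would need fuller lists.
SECRETED_TERMS = {"extracellular space", "extracellular matrix"}
CONFLICTING_TERMS = {"integral to membrane", "membrane", "intracellular"}
# Evidence codes described in the text as less reliable.
WEAK_EVIDENCE = {"IEA", "NR"}


def summarize_go(annotations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (term, evidence_code) pairs by how they bear on secretion."""
    summary: Dict[str, List[str]] = {
        "supporting": [],        # consistent with a secreted protein
        "conflicting": [],       # conflicts, with stronger evidence
        "weak_conflicting": [],  # conflicts, but only IEA/NR evidence
    }
    for term, evidence in annotations:
        if term in SECRETED_TERMS:
            summary["supporting"].append(term)
        elif term in CONFLICTING_TERMS:
            key = "weak_conflicting" if evidence in WEAK_EVIDENCE else "conflicting"
            summary[key].append(term)
    return summary
```

A protein whose only conflicting annotation is an IEA-labeled ‘membrane’ term would, under this sketch, be flagged as a weak conflict and not as strong evidence against secretion.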

FUTURE DEVELOPMENT

The current SPD database contains data from three model organisms. We plan to add secretomes from other organisms when their complete genome sequences become available. Moreover, an evolutionary analysis to construct ortholog groups is underway, to provide useful information for wet-lab biological experiments.

32 in total

1. RefSeq and LocusLink: NCBI gene-centered resources.

Authors: K D Pruitt; D R Maglott
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. The PROSITE database, its status in 2002.

Authors: Laurent Falquet; Marco Pagni; Philipp Bucher; Nicolas Hulo; Christian J A Sigrist; Kay Hofmann; Amos Bairoch
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

3. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

4. Ensembl 2004.

Authors: E Birney; D Andrews; P Bevan; M Caccamo; G Cameron; Y Chen; L Clarke; G Coates; T Cox; J Cuff; V Curwen; T Cutts; T Down; R Durbin; E Eyras; X M Fernandez-Suarez; P Gane; B Gibbins; J Gilbert; M Hammond; H Hotz; V Iyer; A Kahari; K Jekosch; A Kasprzyk; D Keefe; S Keenan; H Lehvaslaiho; G McVicker; C Melsopp; P Meidl; E Mongin; R Pettett; S Potter; G Proctor; M Rae; S Searle; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; A Ureta-Vidal; C Woodwark; M Clamp; T Hubbard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. Secreted protein prediction system combining CJ-SPHMM, TMHMM, and PSORT.

Authors: Yunjia Chen; Peng Yu; Jingchu Luo; Ying Jiang
Journal: Mamm Genome Date: 2003-12 Impact factor: 2.957

6. TMPDB: a database of experimentally-characterized transmembrane topologies.

Authors: Masami Ikeda; Masafumi Arai; Toshikatsu Okuno; Toshio Shimizu
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. Shopping in the genome market with EnsMart.

Authors: Don Gilbert
Journal: Brief Bioinform Date: 2003-09 Impact factor: 11.622

8. A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors: E L Sonnhammer; G von Heijne; A Krogh
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1998

Review 9. Eukaryotic protein secretion.

Authors: M Sakaguchi
Journal: Curr Opin Biotechnol Date: 1997-10 Impact factor: 9.740

10. Identifying secretomes in people, pufferfish and pigs.

Authors: Eric W Klee; Daniel F Carlson; Scott C Fahrenkrug; Stephen C Ekker; Lynda B M Ellis
Journal: Nucleic Acids Res Date: 2004-02-27 Impact factor: 16.971
