Literature DB >> 16381849

LOCATE: a mouse protein subcellular localization database.

J Lynn Fink¹, Rajith N Aturaliya, Melissa J Davis, Fasheng Zhang, Kelly Hanson, Melvena S Teasdale, Chikatoshi Kai, Jun Kawai, Piero Carninci, Yoshihide Hayashizaki, Rohan D Teasdale.

Abstract

We present here LOCATE, a curated, web-accessible database that houses data describing the membrane organization and subcellular localization of proteins from the FANTOM3 Isoform Protein Sequence set. Membrane organization is predicted by the high-throughput, computational pipeline MemO. The subcellular locations of selected proteins from this set were determined by a high-throughput, immunofluorescence-based assay and by manually reviewing >1700 peer-reviewed publications. LOCATE represents the first effort to catalogue the experimentally verified subcellular location and membrane organization of mammalian proteins using a high-throughput approach and provides localization data for approximately 40% of the mouse proteome. It is available at http://locate.imb.uq.edu.au.

Entities: CellLine Disease Gene Species

Mesh：

Substances：

Year: 2006 PMID： 16381849 PMCID： PMC1347432 DOI： 10.1093/nar/gkj069

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Determination of the membrane organization and the subcellular location of a protein are essential to understanding its biochemical function. A cell is divided into different cellular compartments and each compartment is associated with a different range of biochemical processes; by localizing a protein to a specific compartment, or set of compartments, the cellular role of the protein can be inferred. This information can provide insight into the functions of hypothetical or novel proteins and can provide a more specific organellar context in which to investigate a particular protein. Historically, these data have been difficult to produce on a large scale for higher eukaryotic organisms. However, recent advances in membrane organization prediction methods and high-throughput subcellular localization assays have made it possible to generate these datasets. We used high-throughput methods to predict the membrane organization for the entire mouse proteome and to determine the subcellular localization of a subset of the proteome. We then developed a database, LOCATE, to organize and warehouse these data.

DATABASE CONTENT

Dataset

The mouse proteome dataset we used was the FANTOM3 Isoform Protein Sequence set (IPS7) generated by the RIKEN FANTOM Consortium (1). This dataset is comprised of protein sequences based on transcript sequences generated from direct sequencing of full-length transcripts. The sequenced transcripts were clustered into transcriptional units (TUs) where a TU is a grouping of transcripts that arise from a single genomic locus and share at least one nucleotide having the same genomic location and orientation. The IPS7 dataset contains 33 451 protein sequences encoded by 19 853 TUs.

Membrane organization

Protein orientation with respect to the membrane was predicted by MemO, a high-throughput, automated pipeline, which combines publicly available feature predictors with empirically determined annotation rules (1,2) (M. J. Davis, F. Clark, J. L. Fink, Z. Yuan, F. Zhang, T. Kasukawa, Y. Hayashizaki, P. Carnici and R. D. Teasdale, manuscript in preparation). The pipeline is described briefly here. Prediction of signal peptides was performed by a local implementation of SignalP v2.0 (3) and by the Australian National Genomic Information Service (ANGIS, ) version of SPScan. A protein was predicted to contain a signal peptide if the averaged and normalized raw output scores from both methods exceeded a threshold identified to maximize the proportions of true positives and true negatives on a training set. α-Helical transmembrane domain prediction was performed by a consensus method consisting of five currently available predictors: HMMTOP (4), TMHMM v2.0 (5), SVMTM v3.0 (6), MEMSAT (7) and DAS (8). A protein was said to contain a transmembrane domain if at least 7, but no more than 42, consecutive residues in the protein (ignoring a gap of <4 residues) were predicted to participate in a transmembrane domain by at least three of the five predictors. The prediction of the absence or presence of the signal peptide and transmembrane domain provided a classification into one of five categories of membrane organization: We applied this pipeline to the 33 451 protein sequences in the IPS7 dataset and identified 5116 (∼15%) proteins containing signal peptides, and 8238 (∼25%) proteins containing transmembrane domains. These proteins were then allocated to the five membrane organization categories based on combinations of those features. The class breakdown of proteins is shown in Table 1.

Table 1

Distribution of membrane organization classes and high-quality localization data in LOCATE

Membrane organization class	MemO data	Subcellular localization data
	IPS proteins in class (TUs/isoforms)	Isoforms with experimental data	TUs with literature-mined data	Total represented (TUs/isoforms)
Soluble, intracellular protein	13 105/22 265	0	302	302/353
Soluble, secreted protein	2190/2948	0	340	340/469
Type I membrane protein	1038/1548	0	377	377/653
Type II membrane protein	2149/2869	207	408	549/766
Multi-pass membrane protein	2538/3821	210	325	460/652
Total proteins analyzed	19 538/33 451	417	1752	2028/2893

The MemO Data columns show the absolute numbers of proteins classified by MemO into each membrane organization class. The ‘Subcellular localization data’ columns show the number of protein isoforms that have an experimentally determined subcellular location and the number of transcriptional units (TUs) that have a literature-mined subcellular location as well as the total numbers of TUs and isoforms that have subcellular localization data. Localization data mined from other databases is not included here.

soluble intracellular proteins (no transmembrane domains or signal peptide); soluble secreted proteins (signal peptide, no transmembrane domains); type I membrane proteins (one transmembrane domain, signal peptide) (9); type II membrane proteins (one transmembrane domain, no signal peptide) (9); multi-pass membrane protein (multiple transmembrane domains) (9).

Subcellular localization

Proteins were selected for experimentation based on clone availability and the extent of previous characterization of their subcellular localization. When selecting multipass membrane proteins, only those without a predicted ER signal peptide were chosen. N-terminally tagged myc-gene of interest expression constructs were generated using a modified overlapping PCR methodology originally reported by Suzuki et al. (10). The expressed protein, within fixed transfected HeLa cells, was detected by indirect immunofluorescence and representative images were collected and analyzed to determine the protein's subcellular localization. To date, experimental subcellular localization data have been generated for 417 of these selected proteins and localization data based on primary literature review have been gathered for 1752 TUs. Both the experimental and literature-mined localization data were manually examined and evaluated for sufficient quality prior to addition to the database. When evaluating literature-mined localization data, only papers describing the localization of full-length proteins in individual mammalian cells in which the protein is detected directly were included in our analysis. These peer-reviewed observations were not reinterpreted. However, some observations were excluded when considered not to be of a sufficient quality. Because it was not always possible to determine to which protein isoform the literature data referred, we assigned the literature-mined location to all protein isoforms encoded by the corresponding TU. Table 1 summarizes the subcellular localization statistics by membrane organization class. To provide as complete a location description as possible for any given protein, we also include localization data mined from other online databases including LIFEdb (11), Mouse Genome Informatics (12), UniProt (13), RefSeq (14) and others. A total of 7410 TUs and 11 353 protein isoforms are annotated with these data. In total, we have localization data for 8017 TUs and 12 598 protein isoforms representing 41 and 37% of the IPS7 set, respectively.

Data presentation

General information

Information in LOCATE is displayed as a web page which describes a particular protein entry in detail. The page is divided into sections which summarize several types of data. The top of the page contains a summary of the MemO classification and the subcellular localization of the protein as well as associated metadata provided by FANTOM3 annotations such as the protein identifier, a functional description, protein name synonyms, the source organism and links to other databases which also contain this protein.

Transmembrane topology and predicted domains

Knowing what functional domains and motifs exist in a protein is extremely useful when attempting to decipher the cellular role of the protein. We have generated predictions of Pfam and SCOP domains for all proteins in the database and have displayed the predicted domains on a graphical protein schematic diagram alongside the membrane organization data (Figure 1). The presence and position of certain domains in relation to predicted transmembrane domains can provide insights into the validity of the functional annotation of the protein (if one exists) as well as the validity or range of the transmembrane domain prediction.

Figure 1

Visualization of MemO- and Pfam- and SCOP-predicted motif data. (a) Plots the number of computational methods (from 0 to 5) that predict whether a residue in the protein sequence participates in a helical transmembrane domain. Five independent methods are used in the TMD prediction; we assign a residue to a TMD if at least three of the five methods have a positive prediction at that position in the sequence and the range of the predicted TMD fulfils a set of rules defined in the MemO pipeline (M. J. Davis, F. Clark, J. L. Fink, Z. Yuan, F. Zhang, T. Kasukawa, Y. Hayashizaki, P. Carnici and R. D. Teasdale, manuscript in preparation). (b) A schematic diagram of a protein sequence with predicted domains mapped onto it. In this particular diagram, the transmembrane domains predicted by MemO are shown at the top of the figure and the domains predicted by Pfam or SCOP are shown in the bottom of the figure. The schematics are vertically aligned to show the positional relationships of the predicted TMDs and other domains.

Subcellular location data

If a protein entry has high-throughput subcellular localization data, we display the subcellular location(s) in which that particular protein isoform was observed and a high-resolution fluorescent-image which best illustrates the observed localization. Information about the experimental conditions such as the cell type and epitope used in the localization assays is also displayed. If a protein entry has subcellular localization data mined from literature, we display the determined subcellular location(s), the PubMed ID, and a full citation of the data source.

Controlled vocabulary

Consistent naming of subcellular locations is critical to the integrity and extensibility of the LOCATE data. Therefore, we have constructed a controlled vocabulary which describes both experimentally determined and literature-mined subcellular locations. In the case of high-throughput experimental subcellular localization assays, it is not always possible to determine the exact cellular compartment to which the protein is observed to localize. To address this problem, our controlled vocabulary contains a hierarchical set of terms that allows the call to be only as specific as the data allow. This system also reflects the confidence of the localization call; use of a very specific term implies higher confidence. Some proteins have been observed to localize to more than one subcellular compartment; in these cases, we allow the use of multiple terms to describe the observed locations. When mining subcellular localization data from the literature, we use terms that allow for different levels of location resolution and for cellular components that are specific to cells with a lineage or morphology that differs from the model cells used in our experiments. In both vocabularies, we use Gene Ontology (15) terms to describe subcellular locations whenever possible (see the LOCATE website for more details).

Observed spliced isoforms

For each protein in the database, we display a list of all proteins that belong to the same TU to allow comparisons between each of the observed protein isoforms. Specifically, we display the membrane organization and length of each isoform on a splicing graph which illustrates the observed exons and the various alternate splice forms for that particular TU (Figure 2). These graphs enable analysis of the pattern of membrane organization variation within the observed protein isoforms and examination of the possible effects of alternative splicing on membrane organization. The graphs were generated by a customized version of the Splicing Graph Module (16).

Figure 2

Splicing graph. This graph shows the observed exons and splice junctions for the transcriptional unit 101566 and the splice isoforms of the transcripts that arise from this transcriptional unit. The light gray color represents soluble, cytoplasmic proteins (PA101566.2 and PA101566.4); light orange represents a Type II membrane protein (PA101566.1); black represents all observed exons. The green and red bars represent the observed start and stop codons, respectively. The teal rectangle represents the position and range of the MemO-predicted transmembrane domain; note that the transmembrane domain occurs in the exon that only appears in the Type II membrane protein and not in the soluble, cytoplasmic proteins. This is a clear example of how alternate splicing of these transcripts may change the proteins' membrane organization.

Data accessibility

This database does not seek to duplicate information contained in other databases unless it is particularly useful when viewed in juxtaposition with the subcellular localization or membrane organization data. However, we understand the value of convenient data accessibility and provide links to offsite resources such as SymAtlas (17), GenBank (18), RIKEN (1), MGI (19), READ (20), Pfam (21), SCOP (22), UniProt (13), OMIM (23), Entrez Gene (24), BIND (25), the GeneNetwork (26) and the Mouse Retrovirus Tagged Cancer Gene Database (RTCGD) (20) where applicable. Because the major aim of this database effort is to present protein subcellular location data and the predicted membrane organization of the protein, these two features are the primary search mechanisms; proteins can be retrieved by protein class, subcellular localization or both. Alternatively, individual protein entries can be retrieved by searching the database with a protein ID (RIKEN clone/IPS ID, GenBank accession number, Entrez Gene ID), by protein name, by Pfam or SCOP accession number, or by functional description. BLAST searches against the database, and subsets of the database, are also available. The BLAST results are enhanced to display the membrane organization of the hits. We also offer a number of batch data retrieval options. The proteins in any given search can be retrieved as FASTA-formatted protein or transcript sequences, subcellular localization data, membrane organization data or protein schematics. XML-marked-up documents containing these data can also be obtained.

CONCLUSIONS

LOCATE represents a significant contribution to the biological research community by organizing and presenting membrane organization and subcellular localization data for the mouse proteome. The LOCATE search interface allows users to retrieve data and sets of data using several different approaches. The interface to individual proteins was designed to maximize ease of interpretation by providing summaries or visualizations that contain the most relevant points of data; links are provided to the raw data or other details that are necessary for a careful evaluation of the experimental results. LOCATE data can be retrieved as individual entries or downloaded as HTML, plain text or XML files.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

25 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

Review 3. Topogenesis of membrane proteins: determinants and dynamics.

Authors: V Goder; M Spiess
Journal: FEBS Lett Date: 2001-08-31 Impact factor: 4.124

4. The HMMTOP transmembrane topology prediction server.

Authors: G E Tusnády; I Simon
Journal: Bioinformatics Date: 2001-09 Impact factor: 6.937

5. Protein-protein interaction panel using mouse full-length cDNAs.

Authors: H Suzuki; Y Fukunishi; I Kagawa; R Saito; H Oda; T Endo; S Kondo; H Bono; Y Okazaki; Y Hayashizaki
Journal: Genome Res Date: 2001-10 Impact factor: 9.043

6. Large-scale analysis of the human and mouse transcriptomes.

Authors: Andrew I Su; Michael P Cooke; Keith A Ching; Yaron Hakak; John R Walker; Tim Wiltshire; Anthony P Orth; Raquel G Vega; Lisa M Sapinoso; Aziz Moqrich; Ardem Patapoutian; Garret M Hampton; Peter G Schultz; John B Hogenesch
Journal: Proc Natl Acad Sci U S A Date: 2002-03-19 Impact factor: 11.205

7. MGD: the Mouse Genome Database.

Authors: Judith A Blake; Joel E Richardson; Carol J Bult; Jim A Kadin; Janan T Eppig
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

8. Mouse proteome analysis.

Authors: Alexander Kanapin; Serge Batalov; Melissa J Davis; Julian Gough; Sean Grimmond; Hideya Kawaji; Michele Magrane; Hideo Matsuda; Christian Schönbach; Rohan D Teasdale; Zheng Yuan
Journal: Genome Res Date: 2003-06 Impact factor: 9.043

9. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

10. SCOP database in 2004: refinements integrate structure and sequence family data.

Authors: Antonina Andreeva; Dave Howorth; Steven E Brenner; Tim J P Hubbard; Cyrus Chothia; Alexey G Murzin
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

32 in total

1. Automated protein subcellular localization based on local invariant features.

Authors: Chao Li; Xue-hong Wang; Li Zheng; Ji-feng Huang
Journal: Protein J Date: 2013-03 Impact factor: 2.371

2. Systems biology of facial development: contributions of ectoderm and mesenchyme.

Authors: Joan E Hooper; Weiguo Feng; Hong Li; Sonia M Leach; Tzulip Phang; Charlotte Siska; Kenneth L Jones; Richard A Spritz; Lawrence E Hunter; Trevor Williams
Journal: Dev Biol Date: 2017-03-29 Impact factor: 3.582

3. Sculpting MHC class II-restricted self and non-self peptidome by the class I Ag-processing machinery and its impact on Th-cell responses.

Authors: Charles T Spencer; Srdjan M Dragovic; Stephanie B Conant; Jennifer J Gray; Mu Zheng; Parimal Samir; Xinnan Niu; Magdalini Moutaftsi; Luc Van Kaer; Alessandro Sette; Andrew J Link; Sebastian Joyce
Journal: Eur J Immunol Date: 2013-03-05 Impact factor: 5.532

4. Lipase maturation factor LMF1, membrane topology and interaction with lipase proteins in the endoplasmic reticulum.

Authors: Mark H Doolittle; Saskia B Neher; Osnat Ben-Zeev; Jo Ling-Liao; Ciara M Gallagher; Maryam Hosseini; Fen Yin; Howard Wong; Peter Walter; Miklós Péterfy
Journal: J Biol Chem Date: 2009-09-26 Impact factor: 5.157

5. The proteins of intra-nuclear bodies: a data-driven analysis of sequence, interaction and expression.

Authors: Nurul Mohamad; Mikael Bodén
Journal: BMC Syst Biol Date: 2010-04-13

6. A testis-specific regulator of complex and hybrid N-glycan synthesis.

Authors: Hung-Hsiang Huang; Pamela Stanley
Journal: J Cell Biol Date: 2010-08-30 Impact factor: 10.539

7. A screen for endocytic motifs.

Authors: Patrycja Kozik; Richard W Francis; Matthew N J Seaman; Margaret S Robinson
Journal: Traffic Date: 2010-03-04 Impact factor: 6.215

8. Genome-wide analysis of cancer/testis gene expression.

Authors: Oliver Hofmann; Otavia L Caballero; Brian J Stevenson; Yao-Tseng Chen; Tzeela Cohen; Ramon Chua; Christopher A Maher; Sumir Panji; Ulf Schaefer; Adele Kruger; Minna Lehvaslaiho; Piero Carninci; Yoshihide Hayashizaki; C Victor Jongeneel; Andrew J G Simpson; Lloyd J Old; Winston Hide
Journal: Proc Natl Acad Sci U S A Date: 2008-12-16 Impact factor: 11.205

9. The mannose 6-phosphate glycoprotein proteome.

Authors: David E Sleat; Maria Cecilia Della Valle; Haiyan Zheng; Dirk F Moore; Peter Lobel
Journal: J Proteome Res Date: 2008-05-29 Impact factor: 4.466

10. Protein-protein interaction as a predictor of subcellular location.

Authors: Chang Jin Shin; Simon Wong; Melissa J Davis; Mark A Ragan
Journal: BMC Syst Biol Date: 2009-02-25