Valérie Marot-Lassauzaie, Tatyana Goldberg, Jose Juan Almagro Armenteros, Henrik Nielsen, Burkhard Rost.
Abstract
The native subcellular location (also referred to as localization or cellular compartment) of a protein is the one in which it acts most frequently; it is one aspect of protein function. Do ten eukaryotic model organisms differ in their location spectrum, i.e., the fraction of each proteome found in each of seven major cellular compartments? As experimental annotations of locations remain biased and incomplete, we need prediction methods to answer this question. After systematic bias corrections, the complete but faulty prediction methods appeared to be more appropriate for comparing location spectra between species than the incomplete but more accurate experimental data. This work compared the location spectra for ten eukaryotes: Homo sapiens (human), Gorilla gorilla (gorilla), Pan troglodytes (chimpanzee), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit/vinegar fly), Anopheles gambiae (African malaria mosquito), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (baker's yeast), and Schizosaccharomyces pombe (fission yeast). The two largest classes were predicted to be the nucleus and the cytoplasm, together accounting for 47-62% of all proteins, while 7-21% of the proteins were predicted to be in the plasma membrane and 4-15% to be secreted. Overall, the predicted location spectra were largely similar. However, in detail, the differences sufficed to plot trees (UPGMA) and 2D (PCA) maps relating the ten organisms using a simple Euclidean distance in seven states (location classes). The relations based on the simple predicted location spectra captured aspects of cross-species comparisons usually revealed only by much more detailed evolutionary comparisons. Most interestingly, known phylogenetic relations were reproduced better by paralog-only than by ortholog-only trees.
Keywords: Genome sequence analysis; Prediction of cellular compartment; Protein location; Species comparisons
Year: 2021 PMID: 34328525 PMCID: PMC8379119 DOI: 10.1007/s00239-021-10022-4
Source DB: PubMed Journal: J Mol Evol ISSN: 0022-2844 Impact factor: 2.395
Table 1 Reliable Human Protein Atlas (HPA) annotations agree with Swiss-Prot
| HPA level | Nprot | HPA = Swiss-Prot (%) | Expected (%) |
|---|---|---|---|
| Validated | 644 | 95 | 38 |
| Supportive | 1617 | 94 | 39 |
| Uncertain | 783 | 57 | 45 |
| Unreliable | 126 | 39 | 44 |
| Merged | |||
| Reliable | 2261 | 94 | 39 |
| Speculative | 909 | 54 | 45 |
HPA level: reliability level provided by the Human Protein Atlas (HPA) (Thul et al. 2017); Nprot: number of human proteins compared (only HPA proteins with a Swiss-Prot match); HPA = Swiss-Prot: percentage of proteins for which any annotation in Swiss-Prot (experimental only; Boutet et al. 2016) agrees with at least one annotation in HPA; Expected: agreement between annotations after random shuffling (randomly pick proteins from the Swiss-Prot set, compare them to the HPA proteins of the corresponding HPA level, repeat 100 times for each level, and average); Merged: the two best HPA levels (high agreement with Swiss-Prot) were merged into “reliable”, the two worst into “speculative”
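The shuffle-based "Expected" column can be illustrated with a small sketch. The annotation sets below are made-up toy data, the function names are hypothetical, and drawing Swiss-Prot proteins with replacement is an assumption; the legend only says that proteins are picked randomly, compared, and the result averaged over 100 repeats.

```python
import random

def agreement(pairs):
    """Percentage of (set_a, set_b) pairs that share at least one location."""
    hits = sum(1 for a, b in pairs if a & b)
    return 100.0 * hits / len(pairs)

def expected_agreement(hpa_sets, swissprot_pool, repeats=100, seed=0):
    """Average agreement when each HPA protein is paired with a randomly
    drawn Swiss-Prot annotation set (assumption: drawn with replacement)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        drawn = [rng.choice(swissprot_pool) for _ in hpa_sets]
        total += agreement(list(zip(hpa_sets, drawn)))
    return total / repeats

# Toy annotations (hypothetical, not taken from HPA or Swiss-Prot).
hpa = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"plasma membrane"}, {"mitochondrion"}]
swissprot = [{"nucleus"}, {"nucleus"}, {"plasma membrane"}, {"mitochondrion"}]

observed = agreement(list(zip(hpa, swissprot)))
expected = expected_agreement(hpa, swissprot)
print(f"observed {observed:.0f}%, expected after shuffling {expected:.0f}%")
```

A large gap between the observed and the shuffled value, as for the validated and supportive levels in the table, suggests that the agreement is not explained by the class frequencies alone.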
Fig. 1 Protein location in human proteome annotated by experiments. The Venn diagram compares experimental annotations of human proteins between Swiss-Prot (Boutet et al. 2016) and The Human Protein Atlas (HPA) (Thul et al. 2017). The white background (all 21,018 human proteins) is not to scale. We grouped the four HPA annotation levels into reliable (94% agreement with Swiss-Prot, Table 1) and speculative (54% agreement with Swiss-Prot, only slightly above random, Table 1). For instance, 2261 proteins have HPA reliable annotations and match at least one Swiss-Prot annotation (evidence code: ECO:0000269), while 37% of all human proteins (7705 = 1963 + 2572 + 2261 + 909) have reliable experimental annotations (Swiss-Prot ECO:0000269 or HPA validated and supportive)
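The coverage figure quoted in the caption can be re-derived from the four region counts and the proteome size. The variable names are illustrative; the numbers are copied from the caption.

```python
# Counts as quoted in the caption of Fig. 1.
region_counts = [1963, 2572, 2261, 909]
total_human_proteins = 21018

annotated = sum(region_counts)
percent = 100.0 * annotated / total_human_proteins
print(annotated, f"{percent:.1f}%")
```

The sum is 7705, and the fraction rounds to the 37% stated in the caption.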
Fig. 3 Grouping of ten eukaryotes according to location spectra of paralogs and orthologs only. We used InParanoid to identify all in-paralogs and orthologs for the ten species, and extracted the LocTree3 predictions for these two subsets of genes. We computed the Euclidean distances (Eq. 2) between the distributions predicted by LocTree3 (without correction) and used these distances to build a UPGMA tree for each subset of genes. The mean distance between location spectra is two times greater for paralogs than for orthologs, but the trees are scaled to visually remove this effect
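As a minimal sketch of the distance computation, a location spectrum can be taken as the fraction of proteins predicted in each of the seven classes, and two spectra compared by the ordinary Euclidean distance. Eq. 2 is not reproduced in this record, so the plain form sqrt(sum_i (p_i - q_i)^2) is an assumption; the class labels, function names, and per-protein predictions below are hypothetical toy data, not LocTree3 output.

```python
from collections import Counter
from math import sqrt

# The seven location classes named in the text (label spelling is illustrative).
CLASSES = [
    "secreted", "nucleus", "cytoplasm", "plasma membrane",
    "mitochondrion", "endoplasmic reticulum", "Golgi apparatus",
]

def location_spectrum(predicted):
    """Fraction of proteins predicted in each class, in CLASSES order."""
    counts = Counter(predicted)
    n = len(predicted)
    return [counts[c] / n for c in CLASSES]

def euclidean(p, q):
    """Plain Euclidean distance between two seven-state spectra."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy per-protein predictions for one gene subset in two species.
species_a = ["nucleus"] * 5 + ["cytoplasm"] * 3 + ["secreted"] * 2
species_b = ["nucleus"] * 4 + ["cytoplasm"] * 4 + ["plasma membrane"] * 2

spec_a = location_spectrum(species_a)
spec_b = location_spectrum(species_b)
d_ab = euclidean(spec_a, spec_b)
print(f"distance = {d_ab:.4f}")
```

Computing such a distance for every pair of species, once for the paralog subset and once for the ortholog subset, would give the two distance matrices from which the trees are built.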
Fig. 2 Grouping of ten eukaryotes according to predicted location spectra. We computed the Euclidean distances (Eq. 2) between the proteome-wide distributions predicted by LocTree3 with error correction (Eq. 1) (Marot-Lassauzaie et al. 2018) for each of the ten reference organisms. Those values were plotted onto a UPGMA tree (top panel A) and shown through PCA in 2D (lower panel B). (A) UPGMA tree built from the distributions predicted by DeepLoc for the ten organisms. (B) UPGMA tree along with a bar representing the distribution predicted by LocTree3 in the seven main subcellular location classes for each organism. The seven location classes (from left to right): secreted (white), nuclear (gray), cytoplasmic (blue), plasma membrane (green), mitochondrial (yellow), endoplasmic reticulum (orange), and Golgi apparatus (red). Despite the small differences, the resulting tree largely agrees with what we expect from evolution. (C) The PCA adds more details to the comparison between species from the LocTree3 predictions. Two interesting aspects are the large differences between the two yeast species (y-axis) and the approximate triangle between mouse, rat, and human
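UPGMA itself is a simple average-linkage clustering: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted mean of the two old distances. The sketch below shows only the tree topology (no branch lengths) in plain Python. The function name and the three-species distance values are invented for illustration and are not results from the paper; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` could be used instead.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the UPGMA topology as nested tuples.

    dist maps frozenset({a, b}) -> distance for every pair of leaf names.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)                        # working copy of pairwise distances
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Size-weighted average distance from the merged cluster to the rest.
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

# Made-up distances between three location spectra (illustration only).
names = ["human", "mouse", "yeast"]
dist = {
    frozenset(("human", "mouse")): 0.02,
    frozenset(("human", "yeast")): 0.10,
    frozenset(("mouse", "yeast")): 0.12,
}
tree = upgma(names, dist)
print(tree)
```

With these toy values the two mammals are joined first and yeast is attached last, which is the kind of grouping the figure reports for the real spectra.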