Ryan W Christian, Seanna L Hewitt, Eric H Roalson, Amit Dhingra.
Abstract
Plastids are morphologically and functionally diverse organelles that are dependent on nuclear-encoded, plastid-targeted proteins for all biochemical and regulatory functions. However, how plastid proteomes vary temporally, spatially, and taxonomically has been historically difficult to analyze at a genome-wide scale using experimental methods. A bioinformatics workflow was developed and evaluated using a combination of fast and user-friendly subcellular prediction programs to maximize performance and accuracy for chloroplast transit peptides, and the technique was demonstrated on the predicted proteomes of 15 sequenced plant genomes. Gene family grouping was then performed in parallel using modified approaches of reciprocal best BLAST hits (RBH) and UCLUST. A total of 628 protein families were found to have conserved plastid targeting across angiosperm species using RBH, and 828 using UCLUST. However, thousands of clusters were also detected where only one species had predicted plastid targeting, most notably in Panicum virgatum, which had 1,458 proteins with species-unique targeting. An average of 45% overlap was found in plastid-targeted protein-coding gene families compared with Arabidopsis, but an additional 20% of proteins matched against the full Arabidopsis proteome, indicating a unique evolution of plastid targeting. Neofunctionalization through subcellular relocalization is known to impart novel biological functions but has not been described before on a genome-wide scale for the plastid proteome. Further work to correlate these predicted novel plastid-targeted proteins with transcript abundance and high-throughput proteomics will uncover unique aspects of plastid biology and shed light on how the plastid proteome has evolved to influence plastid morphology and biochemistry.
Year: 2020 PMID: 32427841 PMCID: PMC7237471 DOI: 10.1038/s41598-020-64670-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Self-Reported Performance of Six Algorithms on Prediction of Plastid-Targeted Proteins.
| Algorithm | Source | Training Dataset(s) | # of Training Sequences | Plastidial SE | Plastidial SP | Plastidial MCC | Plastidial ACC |
|---|---|---|---|---|---|---|---|
| TargetP* | [ | SWISS-PROT releases 36,37,38 | 940 | 0.85 | 0.69 | 0.72 | N/A (0.921) |
| WolfPSORT | [ | Uniprot version 45 | 2,113 | 0.7 | 0.7 | N/A | N/A |
| PredSL† | [ | Various (Uniprot release 3.5) | 1,002 | 0.9 | 0.91 | 0.88 (0.874) | N/A |
| Localizer‡ | [ | CropPAL (GFP only) | 410 | 0.725 | 0.957 (0.798) | 0.71 | 0.914 (0.916) |
| Multiloc2 (Low-Res)† | [ | BaCelLo Independent Dataset | 132 | 0.77 | 0.53 | 0.72 | N/A (0.853) |
| Multiloc2 (High-Res)† | [ | BaCelLo Independent Dataset | 132 | 0.53 | 0.94 | 0.51 (0.539) | N/A (0.735) |
| PCLR* | [ | ChloroP, TargetP | 847 | 0.87 (0.821) | 0.30 (0.301) | 0.372 | 0.720 |
Self-reported values for overall and plastidial sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), and accuracy (ACC). Values in parentheses were recalculated from the same data and differ from those in the original paper. Programs marked with an asterisk (*) had a confusion matrix available. Those marked with a dagger (†) did not, so confusion matrices were inferred from the available data; the estimates were left as non-integer values, and the resulting MCC and ACC calculations therefore carry rounding errors. Localizer, marked with a double dagger (‡), was re-run with the original dataset provided in the publication’s supplementary information[41].
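The four statistics in the table can all be derived from a single confusion matrix. The sketch below is a minimal illustration rather than the authors' code: the function name and the example counts are hypothetical, and SP is taken as TP/(TP+FP), which is an assumption (the caption gives no formulas) that appears consistent with the class sizes and values reported in the monocot and eudicot table further down.

```python
import math

def plastid_metrics(tp, fp, fn, tn):
    """Return SE, SP, MCC and ACC for a binary plastid / non-plastid call.

    tp, fp, fn, tn may be non-integer, as for the inferred matrices
    described in the caption.
    """
    se = tp / (tp + fn)  # sensitivity: fraction of true plastid proteins recovered
    # ASSUMPTION: "specificity" here is TP / (TP + FP); swap in
    # tn / (tn + fp) if the conventional definition is intended.
    sp = tp / (tp + fp)
    acc = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "MCC": mcc, "ACC": acc}

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = plastid_metrics(tp=80, fp=20, fn=20, tn=380)
print(example)
```

Because MCC uses all four cells of the matrix, it is less sensitive than ACC to the large excess of non-plastid proteins in these datasets.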
Review of Algorithms using modern curated datasets (combined).
| Algorithm | GFP SE | GFP SP | GFP MCC | GFP ACC | GFP & MS SE | GFP & MS SP | GFP & MS MCC | GFP & MS ACC | Difference SE | Difference SP | Difference MCC | Difference ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TargetP | 0.67 | 0.59 | 0.54 | 0.86 | 0.46 | 0.55 | 0.32 | 0.73 | 0.21 | 0.04 | 0.22 | 0.13 |
| Wolf-PSORT | 0.72 | 0.38 | 0.38 | 0.75 | 0.57 | 0.44 | 0.24 | 0.65 | 0.15 | −0.05 | 0.14 | 0.09 |
| PredSL | 0.57 | 0.53 | 0.45 | 0.84 | 0.37 | 0.52 | 0.26 | 0.71 | 0.19 | 0.01 | 0.20 | 0.12 |
| Localizer | 0.68 | 0.71 | 0.63 | 0.90 | 0.46 | 0.58 | 0.34 | 0.74 | 0.22 | 0.14 | 0.29 | 0.16 |
| Multiloc2 | 0.50 | 0.83 | 0.59 | 0.89 | 0.31 | 0.63 | 0.30 | 0.74 | 0.18 | 0.20 | 0.28 | 0.15 |
| PCLR | 0.74 | 0.46 | 0.47 | 0.80 | 0.54 | 0.48 | 0.28 | 0.69 | 0.20 | −0.02 | 0.19 | 0.11 |
For each program, SE, SP, MCC, and ACC are reported relative to in vivo experimental data, using either a conservative dataset of GFP-validated proteins or a larger but more liberal dataset comprising both GFP and mass spectrometry (MS) data. The difference between the two datasets is presented as GFP minus GFP & MS. The MS data showed increased error, especially for observed sensitivity, indicating that a large number of MS-validated proteins are likely artefactual. This in turn suggests that subcellular prediction methods are likely more accurate than comparisons against high-throughput proteomics reports suggest. Sensitivity can be inverted (1-SE) to yield the false negative rate, i.e. the fraction of proteins that were experimentally found to be plastid-targeted by the given experimental method but predicted to be non-plastidial. Likewise, specificity can be inverted (1-SP) to yield the false positive rate, i.e. the fraction of proteins experimentally determined to be non-plastidial that were found by the prediction algorithm to be plastidial.
Figure 1. Venn Diagram of Combinatorial and Standalone Subcellular Prediction Algorithms. Performance measured by MCC on proteins with subcellular localization validated by GFP is represented as a heatmap with high values in green and low values in red. For each intersection, only the best accept threshold is represented. Numbers indicate workflow number followed by the calculated MCC.
Performance of prediction algorithms against GFP-validated proteins from monocots and eudicots.
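| Algorithm | Monocot GFP SE | Monocot GFP SP | Monocot GFP MCC | Monocot GFP ACC | Eudicot GFP SE | Eudicot GFP SP | Eudicot GFP MCC | Eudicot GFP ACC | Monocot-Eudicot SE | Monocot-Eudicot SP | Monocot-Eudicot MCC | Monocot-Eudicot ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|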
| TargetP | 0.62 | 0.71 | 0.59 | 0.87 | 0.68 | 0.56 | 0.53 | 0.86 | −0.06 | 0.15 | 0.05 | 0.02 |
| Wolf-PSORT | 0.72 | 0.38 | 0.35 | 0.71 | 0.72 | 0.38 | 0.39 | 0.76 | 0.00 | −0.01 | −0.04 | −0.05 |
| PredSL | 0.43 | 0.59 | 0.40 | 0.83 | 0.61 | 0.52 | 0.47 | 0.84 | −0.17 | 0.06 | −0.07 | −0.02 |
| Localizer | 0.63 | 0.76 | 0.63 | 0.89 | 0.69 | 0.70 | 0.63 | 0.90 | −0.06 | 0.06 | −0.01 | −0.01 |
| Multiloc2 | 0.40 | 0.89 | 0.54 | 0.87 | 0.53 | 0.81 | 0.60 | 0.90 | −0.12 | 0.08 | −0.06 | −0.03 |
| PCLR | 0.75 | 0.56 | 0.54 | 0.83 | 0.73 | 0.43 | 0.45 | 0.80 | 0.01 | 0.12 | 0.09 | 0.03 |
Performance of each prediction algorithm in monocots and eudicots and the difference between these datasets is presented; dataset sizes are roughly similar for monocot and eudicot sequences, but MCC is still preferable for comparison. 161 plastid-localized proteins and 640 non-plastid-targeted proteins are included for monocots, while eudicots include 489 plastid-targeted and 2,432 non-plastid-targeted proteins. Sensitivity can be inverted (1-SE) to yield the false negative rate, i.e. the fraction of proteins that were experimentally found to be plastid-targeted by the given experimental method but predicted to be non-plastidial. Likewise, specificity can be inverted (1-SP) to yield the false positive rate, i.e. the fraction of proteins experimentally determined to be non-plastidial that were found by the prediction algorithm to be plastidial.
Best combinatorial prediction approaches ranked by Matthews Correlation Coefficient (MCC).
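Given the class sizes stated in this caption, an approximate confusion matrix can be inferred from a reported SE and SP and then used to recompute ACC and MCC. The sketch below does this for the TargetP monocot row (SE 0.62, SP 0.71; 161 plastid and 640 non-plastid proteins). It assumes SP = TP/(TP+FP); under that assumption the recomputed values round to the tabulated ACC of 0.87 and MCC of 0.59, whereas reading SP as TN/(TN+FP) gives an ACC near 0.69. The function name is hypothetical.

```python
import math

def infer_confusion(se, sp, n_pos, n_neg):
    """Infer non-integer TP, FP, FN, TN from rounded SE and SP.

    ASSUMPTION: sp is TP / (TP + FP), not TN / (TN + FP).
    """
    tp = se * n_pos
    fn = n_pos - tp
    fp = tp * (1.0 - sp) / sp
    tn = n_neg - fp
    return tp, fp, fn, tn

# TargetP, monocot GFP dataset, values taken from the table above.
tp, fp, fn, tn = infer_confusion(se=0.62, sp=0.71, n_pos=161, n_neg=640)

acc = (tp + tn) / (tp + fp + fn + tn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(f"ACC = {acc:.2f}, MCC = {mcc:.2f}")
```

Because SE and SP are only given to two decimals, the inferred counts are fractional and the recomputed statistics can differ from the published ones in the last digit.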
| Rank | Workflow | Description | SE | SP | MCC | ACC |
|---|---|---|---|---|---|---|
| 1 | 125 | 2/3 of (TargetP, Localizer, Multiloc2) | 0.646 | 0.785 | 0.659 | 0.907 |
| 2 | 167 | 2/2 of (TargetP, Localizer) | 0.611 | 0.807 | 0.650 | 0.907 |
| 3 | 80 | 3/4 of (TargetP, Localizer, Multiloc2, PCLR) | 0.622 | 0.791 | 0.647 | 0.905 |
| 4 | 127 | 3/3 of (TargetP, Localizer, PCLR) | 0.588 | 0.822 | 0.644 | 0.906 |
| 5 | 152 | 2/3 of (PredSL, Localizer, Multiloc2) | 0.597 | 0.803 | 0.639 | 0.904 |
| 6 | 68 | 3/4 of (TargetP, PredSL, Localizer, Multiloc2) | 0.575 | 0.827 | 0.639 | 0.905 |
| 7 | 161 | 2/3 of (Localizer, Multiloc2, PCLR) | 0.660 | 0.732 | 0.635 | 0.898 |
| 8 | 189 | 2/2 of (Localizer, PCLR) | 0.634 | 0.756 | 0.634 | 0.900 |
| 9 | 188 | 1/2 of (Localizer, Multiloc2) | 0.697 | 0.696 | 0.632 | 0.894 |
| 10 | 4 | 1 of (Localizer) | 0.675 | 0.714 | 0.632 | 0.896 |
| 11 | 19 | 4/5 of (TargetP, PredSL, Localizer, Multiloc2, PCLR) | 0.563 | 0.828 | 0.632 | 0.903 |
| 12 | 100 | 3/4 of (PredSL, Localizer, Multiloc2, PCLR) | 0.578 | 0.807 | 0.630 | 0.902 |
| 13 | 20 | 3/5 of (TargetP, PredSL, Localizer, Multiloc2, PCLR) | 0.660 | 0.688 | 0.606 | 0.888 |
| 14 | 72 | 3/4 of (TargetP, PredSL, Localizer, PCLR) | 0.648 | 0.697 | 0.606 | 0.889 |
| 15 | 69 | 2/4 of (TargetP, PredSL, Localizer, Multiloc2) | 0.678 | 0.656 | 0.595 | 0.882 |
| 16 | 187 | 2/2 of (Localizer, Multiloc2) | 0.474 | 0.870 | 0.594 | 0.896 |
| 17 | 116 | 2/3 of (TargetP, PredSL, Localizer) | 0.663 | 0.664 | 0.592 | 0.883 |
| 18 | 181 | 2/2 of (PredSL, Localizer) | 0.511 | 0.814 | 0.591 | 0.894 |
| 19 | 160 | 3/3 of (Localizer, Multiloc2, PCLR) | 0.462 | 0.880 | 0.590 | 0.895 |
| 20 | 115 | 3/3 of (TargetP, PredSL, Localizer) | 0.491 | 0.835 | 0.588 | 0.894 |
| 21 | 155 | 2/3 of (PredSL, Localizer, PCLR) | 0.698 | 0.629 | 0.587 | 0.875 |
| 22 | 5 | 1 of (Multiloc2) | 0.495 | 0.826 | 0.587 | 0.894 |
| 23 | 101 | 2/4 of (PredSL, Localizer, Multiloc2, PCLR) | 0.706 | 0.621 | 0.585 | 0.873 |
| 24 | 124 | 3/3 of (TargetP, Localizer, Multiloc2) | 0.452 | 0.883 | 0.585 | 0.894 |
| 25 | 79 | 4/4 of (TargetP, Localizer, Multiloc2, PCLR) | 0.445 | 0.895 | 0.585 | 0.894 |
The sensitivity (SE), specificity (SP), Matthews Correlation Coefficient (MCC), and accuracy (ACC) are presented for each workflow. Almost all of the highest-performing workflows included Localizer, followed by Multiloc2 and TargetP. Localizer and Multiloc2 were also the only two programs that ranked highly as standalone algorithms, whereas the remaining workflows combined two or more individual programs.
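Each workflow description in the table, such as "2/3 of (TargetP, Localizer, Multiloc2)", reads as a vote: a protein is called plastid-targeted when at least k of the n listed programs agree. A minimal sketch of that rule, and of enumerating every subset and accept threshold for six programs, follows. The function names and the example calls are hypothetical, and the enumeration order does not attempt to reproduce the workflow numbers used in the table.

```python
from itertools import combinations

PROGRAMS = ["TargetP", "WolfPSORT", "PredSL", "Localizer", "Multiloc2", "PCLR"]

def vote(calls, programs, min_votes):
    """Return True when at least min_votes of the programs call 'plastid'.

    calls maps program name -> bool for a single protein.
    """
    return sum(bool(calls[p]) for p in programs) >= min_votes

def enumerate_workflows(programs):
    """Yield every (subset, accept threshold) pair for the given programs."""
    for size in range(1, len(programs) + 1):
        for subset in combinations(programs, size):
            for min_votes in range(1, size + 1):
                yield subset, min_votes

# Hypothetical per-program calls for one protein.
calls = {"TargetP": True, "WolfPSORT": False, "PredSL": False,
         "Localizer": True, "Multiloc2": False, "PCLR": True}

print(vote(calls, ("TargetP", "Localizer", "Multiloc2"), 2))  # 2/3 workflow
print(sum(1 for _ in enumerate_workflows(PROGRAMS)))          # candidates to score
```

Scoring each candidate against a validated dataset and sorting by MCC would then give a ranking of the kind shown in the table.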
Targeting Prediction for Selected Species.
| Species | Version | Source | Sequences | Chloroplast-Targeted* | Percent Chloroplast-Targeted |
|---|---|---|---|---|---|
| | 1.0 | [ | 26,846 | 1,833 | 6.83% |
| | 1.0 | [ | 27,959 | 1,324 | 4.74% |
| Arabidopsis thaliana | TAIR10 | [ | 35,386 | 2,826 | 7.99% |
| | 3.1 | [ | 52,972 | 4,240 | 8.00% |
| | 1.1 | [ | 32,831 | 2,051 | 6.25% |
| | Wm82 | [ | 73,320 | 5,125 | 6.99% |
| | 1.0 (custom transcriptome) | [ | 57,386 (74,249) | 4,665 | 8.13% (6.28%) |
| | 7.0 | [ | 49,061 | 3,417 | 6.96% |
| Panicum virgatum | 3.1 | DOE-JGI** | 133,775 | 10,262 | 7.67% |
| | 3.0 | [ | 73,013 | 5,741 | 7.86% |
| | 2.1 | [ | 47,089 | 3,615 | 7.68% |
| | 2.2 | [ | 43,001 | 3,461 | 8.05% |
| | 2.1 | [ | 47,205 | 1,875 | 3.97% |
| | 3.1 | [ | 34,727 | 3,918 | 11.28% |
| | 2.0 | [ | 55,564 | 3,932 | 7.08% |
Predicted protein sequences from fifteen species representing a mixture of model organisms and crop species as well as a mixture of monocots, eudicots, and the early diverging species Amborella trichopoda were downloaded from Phytozome (phytozome.jgi.doe.gov) or from the sources indicated in the table. For each species, the version, reference, and sequence count are provided from the original publications. *TargetP and Localizer were used to detect plastid-targeted sequences. **Indicates unpublished but publicly-available data downloaded from Phytozome for Panicum virgatum.
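The last column of the table is the number of predicted chloroplast-targeted sequences divided by the total sequence count. As a quick consistency check, the sketch below recomputes it for three rows; the row labels are placeholders, since species names are not recoverable for every row, and the row with two sequence counts is evaluated against both denominators.

```python
def percent_targeted(targeted, sequences):
    """Percentage of predicted proteins carrying a predicted transit peptide."""
    return 100.0 * targeted / sequences

# (label, sequences, chloroplast-targeted), values copied from the table.
rows = [
    ("row 1", 26846, 1833),
    ("custom transcriptome, first count", 57386, 4665),
    ("custom transcriptome, count in parentheses", 74249, 4665),
]

for label, sequences, targeted in rows:
    print(f"{label}: {percent_targeted(targeted, sequences):.2f}%")
```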
Figure 2. Illustration of RBH and UCLUST Sequence Clustering Methods. Initial (A) and expanded (B) RBH figures indicate clustering between species 1 (blue circles), 2 (green circles), and 3 (orange circles). Bidirectional best BLAST hits between sequences from different species are indicated with black lines; bidirectional better BLAST hits between sequences within the same species are indicated with red lines, and fragments with dotted red lines. For UCLUST, the initial length-sorted (C) run is illustrated with yellow stars indicating centroids, small gray patterned circles indicating non-centroid sequences, large black circles indicating the match range for initial centroids, and black lines indicating sequence clustering for the initial run. For clarity, sequences are patterned to indicate the initial cluster to which they belong, and red dotted lines indicate cluster fragmentation. Randomization of centroids (D) mitigates this artificially induced problem; gray patterned stars indicate randomly selected centroids, light blue circles indicate the match range for randomly seeded centroids, and red lines indicate newly found matches. Distances are not drawn to scale.
Figure 3. Overall Performance of RBH and UCLUST methods. (A) Cluster distribution in RBH and UCLUST. Both methods resulted in similar distributions of clusters, although RBH resulted in slightly more clusters with 13–15 species and UCLUST resulted in more clusters from 2–12 species. The slight increase in clusters with five species is interesting, and may result from sequences with homology within the Poaceae family or within Rosids but with no significant homologs outside those groups. (B) GO annotation similarity in RBH and UCLUST clusters. Lower similarity scores in higher-order clusters are partially due to different annotation methods and thresholds used for different species. Annotation similarity was generally higher in RBH at smaller cluster sizes and higher in UCLUST for larger clusters. Similarity decreased with the increasing representation of species, which may be partially caused by different annotation methods used for different genome sequencing projects, or may alternatively be caused by decreased homology within large clusters.
RBH Clustering Results by Species.
| Species | Total Clusters | Clustered with Arabidopsis Proteome | Plastid-Targeted Clusters | Clustered with Arabidopsis Plastid Proteome | Unique Plastid-targeted | Singleton and Single-Species Clusters | NPTPs |
|---|---|---|---|---|---|---|---|
| | 20533 | 60.97% | 1673 | 44.47% | 667 | 585 | 82 |
| | 7497 | 81.43% | 937 | 61.26% | 187 | 135 | 52 |
| Arabidopsis thaliana | 15817 | 100.00% | 1796 | 100.00% | 375 | 301 | 74 |
| | 17933 | 67.23% | 2380 | 41.81% | 727 | 498 | 229 |
| | 18328 | 70.63% | 1798 | 47.66% | 566 | 426 | 140 |
| | 26629 | 63.60% | 2464 | 43.83% | 905 | 714 | 191 |
| | 30257 | 49.84% | 3100 | 32.13% | 1581 | 1253 | 328 |
| | 18657 | 65.83% | 2204 | 44.01% | 643 | 459 | 184 |
| Panicum virgatum | 43875 | 37.39% | 5234 | 20.27% | 3194 | 2512 | 682 |
| | 20348 | 71.99% | 2167 | 50.21% | 580 | 413 | 167 |
| | 14375 | 82.64% | 1838 | 58.65% | 296 | 184 | 112 |
| | 16618 | 73.25% | 2310 | 43.29% | 509 | 241 | 268 |
| | 16287 | 87.55% | 1486 | 66.42% | 202 | 131 | 71 |
| | 16201 | 68.65% | 2351 | 42.28% | 636 | 386 | 250 |
| | 16711 | 79.83% | 1785 | 56.47% | 353 | 240 | 113 |
Gene families were clustered using reciprocal best BLAST hits between species at a 40% threshold and reciprocal better BLAST hits within species at a 90% threshold, and clusters containing plastid-targeted sequences were identified for each species. For all clusters and for plastid-targeted clusters, the percentage containing at least one Arabidopsis sequence is reported, along with the number of clusters containing a plastid-targeted sequence from only the selected species. NPTPs (Nascent Plastid-Targeted Proteins) are the unique plastid-targeted clusters that remain after singleton and single-species clusters are excluded.
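The reciprocal best hit criterion can be sketched in a few lines once BLAST results are available as tuples. This is an illustrative sketch, not the authors' pipeline: the tuple layout, the use of bit score to rank hits, and filtering on identity and coverage before choosing the best hit are all assumptions.

```python
def best_hits(hits, min_identity=40.0, min_coverage=40.0):
    """Map each query to its highest-scoring subject that passes both thresholds.

    hits: iterable of (query, subject, identity_pct, coverage_pct, bitscore).
    """
    best = {}
    for query, subject, identity, coverage, bitscore in hits:
        if identity < min_identity or coverage < min_coverage:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, **thresholds):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, **thresholds)
    reverse = best_hits(b_vs_a, **thresholds)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}

# Hypothetical hits between two species.
a_vs_b = [("a1", "b1", 85.0, 90.0, 300.0),
          ("a1", "b2", 60.0, 80.0, 150.0),
          ("a2", "b2", 35.0, 90.0, 120.0)]   # fails the 40% identity cut
b_vs_a = [("b1", "a1", 85.0, 90.0, 300.0),
          ("b2", "a1", 60.0, 80.0, 150.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here b2 points to a1, but a1 prefers b1, so only the a1/b1 pair is kept as a cluster edge.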
UCLUST Clustering Results by Species.
| Species | Total Clusters | Clustered with Arabidopsis Proteome | Plastid-Targeted Clusters | Clustered with Arabidopsis Plastid proteome | Unique Plastid-targeted | Singleton and Single-Species Clusters | NPTPs |
|---|---|---|---|---|---|---|---|
| | 19190 | 55.61% | 1721 | 41.78% | 736 | 541 | 195 |
| | 7365 | 78.22% | 909 | 58.53% | 173 | 76 | 97 |
| Arabidopsis thaliana | 13065 | 100.00% | 1783 | 100.00% | 261 | 95 | 166 |
| | 16777 | 57.05% | 2375 | 37.68% | 623 | 225 | 398 |
| | 16821 | 65.75% | 1828 | 46.01% | 551 | 172 | 379 |
| | 20157 | 70.78% | 2320 | 49.31% | 637 | 296 | 341 |
| | 21427 | 56.25% | 2846 | 35.45% | 1197 | 469 | 728 |
| | 18102 | 55.76% | 2249 | 38.86% | 564 | 235 | 329 |
| Panicum virgatum | 29207 | 34.12% | 4725 | 20.00% | 2506 | 1048 | 1458 |
| | 15881 | 78.35% | 1977 | 56.35% | 335 | 95 | 240 |
| | 14753 | 79.49% | 1921 | 57.16% | 229 | 36 | 193 |
| | 17810 | 57.11% | 2427 | 36.51% | 480 | 98 | 382 |
| | 15675 | 81.13% | 1574 | 62.26% | 245 | 79 | 166 |
| | 17395 | 54.56% | 2410 | 36.14% | 554 | 191 | 363 |
| | 16092 | 79.06% | 1805 | 57.62% | 299 | 102 | 197 |
Clustering of gene families was performed using an initial UCLUST iteration with 40% coverage and 40% identity, followed by extraction of random sequences from each cluster to seed additional iterations performed at 90% coverage and identity. Clusters containing shared sequences were merged, followed by identification of clusters containing plastid-targeted sequences in each species. The number of clusters overlapping with Arabidopsis for all clusters and plastid-targeted clusters was identified, as well as the number of clusters containing a plastid-targeted sequence from only the selected species. NPTPs (Nascent Plastid-Targeted Proteins) are the unique plastid-targeted clusters that remain after singleton and single-species clusters are excluded.
Figure 4. Workflow Diagram of Sequence Clustering Methods. For RBH (left panel): (1) initial cluster edges were generated by finding all reciprocal best BLAST hits in all-vs-all comparisons of proteomes from two separate species at a 40% identity, 40% coverage threshold, and (2) secondary cluster edges were generated by finding all reciprocal better BLAST hits in all-vs-all comparisons of each proteome against itself at a 90% identity, 90% coverage threshold. For UCLUST (right panel): (1) an initial run was performed at a 40% identity, 40% coverage threshold on a FASTA file containing sequences from every species in length-sorted order, and (2) random sequences of at least 90% identity and 90% coverage were extracted from each cluster, this subset was length-sorted, and the original length-sorted FASTA file was then concatenated to the new seed sequences. This process was iterated 100 times, with a separate UCLUST run performed for each iteration. Downstream processes for RBH and UCLUST were identical: (3) all clusters or pairs with a shared sequence were condensed into single clusters, (4) all sequences lacking at least 40% identity and 40% coverage, based on BLASTP analysis, to any of the predicted plastid-targeted sequences in the cluster were trimmed out, (5A) all clusters with at least three species were extracted, and (5B) clusters containing plastid-targeted sequences were sorted into “conserved,” “semi-conserved,” and “non-conserved” groups according to the number of species with predicted plastid targeting and the taxonomic grouping of those species. cTP: chloroplast transit peptide.
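Step 3 of the workflow, condensing every cluster or pair that shares a sequence into a single cluster, is a connected-components problem. One common way to implement it is a union-find structure, sketched below; the choice of union-find and the sequence identifiers are assumptions made for illustration, since the caption does not say how the merge was implemented.

```python
def merge_clusters(groups):
    """Merge groups of sequence IDs that share at least one member."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in groups:
        members = list(group)
        if not members:
            continue
        find(members[0])  # register singletons too
        for member in members[1:]:
            union(members[0], member)

    merged = {}
    for seq in list(parent):
        merged.setdefault(find(seq), set()).add(seq)
    return list(merged.values())

# Hypothetical RBH pairs and within-species edges sharing sequences.
edges = [("sp1_g1", "sp2_g7"), ("sp2_g7", "sp3_g4"),
         ("sp1_g9", "sp1_g10"), ("sp3_g2",)]
clusters = merge_clusters(edges)
print(sorted(sorted(c) for c in clusters))
```

The same function accepts RBH pairs or UCLUST clusters from repeated iterations, which matches the caption's statement that downstream processing was identical for both methods.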
Figure 5. RBH Visual Representation. For “unique” clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 628 protein clusters were shared between all 15 species, 1,002 had plastid-targeting specific to either monocots or eudicots, and 2,943 had plastid-targeting specific to only a single species.
Figure 6. UCLUST Visual Representation. For “unique” clusters, single-species and singleton clusters are not represented, leaving only clusters with non-targeted homologs present in other species. The relative size of these unique clusters is represented by the area of the respective geometric shape. Shared protein groups at the kingdom, clade, subclade, and family levels are not represented by figure size. Overall, 828 protein clusters included plastid-targeted sequences from all 15 species, 1,983 had plastid-targeting specific to monocots or eudicots, and 5,632 had plastid-targeting specific to a single species.
Enriched GO terms for Conserved Plastid-Targeted RBH Clusters.
| # | GO term | Description | Ontology | FDR | P-value |
|---|---|---|---|---|---|
| 1 | GO:0015979 | photosynthesis | BIOLOGICAL_PROCESS | 1.73E-44 | 3.99E-47 |
| 2 | GO:0008152 | metabolic process | BIOLOGICAL_PROCESS | 6.40E-27 | 1.59E-29 |
| 3 | GO:0006091 | generation of precursor metabolites and energy | BIOLOGICAL_PROCESS | 1.43E-24 | 3.82E-27 |
| 4 | GO:0009058 | biosynthetic process | BIOLOGICAL_PROCESS | 3.73E-20 | 1.20E-22 |
| 5 | GO:0044711 | single-organism biosynthetic process | BIOLOGICAL_PROCESS | 3.00E-15 | 1.02E-17 |
| 6 | GO:0016043 | cellular component organization | BIOLOGICAL_PROCESS | 4.28E-12 | 1.64E-14 |
| 7 | GO:0071840 | cellular component organization or biogenesis | BIOLOGICAL_PROCESS | 6.51E-12 | 2.66E-14 |
| 8 | GO:0044710 | single-organism metabolic process | BIOLOGICAL_PROCESS | 1.05E-10 | 5.06E-13 |
| 9 | GO:0006629 | lipid metabolic process | BIOLOGICAL_PROCESS | 3.16E-10 | 1.57E-12 |
| 10 | GO:0043604 | amide biosynthetic process | BIOLOGICAL_PROCESS | 7.84E-09 | 4.05E-11 |
| 11 | GO:0043603 | cellular amide metabolic process | BIOLOGICAL_PROCESS | 9.01E-09 | 4.81E-11 |
| 12 | GO:0019725 | cellular homeostasis | BIOLOGICAL_PROCESS | 1.07E-08 | 5.93E-11 |
| 13 | GO:0044699 | single-organism process | BIOLOGICAL_PROCESS | 1.22E-08 | 7.39E-11 |
| 14 | GO:0009987 | cellular process | BIOLOGICAL_PROCESS | 1.22E-08 | 7.21E-11 |
| 15 | GO:0065008 | regulation of biological quality | BIOLOGICAL_PROCESS | 1.54E-08 | 9.56E-11 |
| 16 | GO:0006412 | translation | BIOLOGICAL_PROCESS | 1.67E-08 | 1.10E-10 |
| 17 | GO:0042592 | homeostatic process | BIOLOGICAL_PROCESS | 1.67E-08 | 1.08E-10 |
| 18 | GO:0043043 | peptide biosynthetic process | BIOLOGICAL_PROCESS | 1.73E-08 | 1.17E-10 |
| 19 | GO:0006518 | peptide metabolic process | BIOLOGICAL_PROCESS | 1.80E-08 | 1.25E-10 |
| 20 | GO:1901566 | organonitrogen compound biosynthetic process | BIOLOGICAL_PROCESS | 2.95E-08 | 2.10E-10 |
| 21 | GO:1901564 | organonitrogen compound metabolic process | BIOLOGICAL_PROCESS | 1.04E-07 | 7.58E-10 |
| 22 | GO:0034641 | cellular nitrogen compound metabolic process | BIOLOGICAL_PROCESS | 1.61E-05 | 1.52E-07 |
| 23 | GO:0044271 | cellular nitrogen compound biosynthetic process | BIOLOGICAL_PROCESS | 1.93E-05 | 1.85E-07 |
| 24 | GO:0006807 | nitrogen compound metabolic process | BIOLOGICAL_PROCESS | 2.26E-05 | 2.21E-07 |
| 25 | GO:0044249 | cellular biosynthetic process | BIOLOGICAL_PROCESS | 2.52E-05 | 2.51E-07 |
| 26 | GO:1901576 | organic substance biosynthetic process | BIOLOGICAL_PROCESS | 4.41E-05 | 4.55E-07 |
| 27 | GO:0034645 | cellular macromolecule biosynthetic process | BIOLOGICAL_PROCESS | 4.47E-05 | 4.69E-07 |
| 28 | GO:0009059 | macromolecule biosynthetic process | BIOLOGICAL_PROCESS | 4.74E-05 | 5.06E-07 |
| 29 | GO:0010467 | gene expression | BIOLOGICAL_PROCESS | 2.96E-04 | 3.31E-06 |
| 30 | GO:0009536 | plastid | CELLULAR_COMPONENT | 1.35E-279 | 2.41E-283 |
| 31 | GO:0005622 | intracellular | CELLULAR_COMPONENT | 1.07E-222 | 3.82E-226 |
| 32 | GO:0044424 | intracellular part | CELLULAR_COMPONENT | 1.22E-222 | 6.49E-226 |
| 33 | GO:0044464 | cell part | CELLULAR_COMPONENT | 6.98E-222 | 6.21E-225 |
| 34 | GO:0005623 | cell | CELLULAR_COMPONENT | 6.98E-222 | 5.62E-225 |
| 35 | GO:0005737 | cytoplasm | CELLULAR_COMPONENT | 2.03E-218 | 2.16E-221 |
| 36 | GO:0044444 | cytoplasmic part | CELLULAR_COMPONENT | 1.38E-217 | 1.72E-220 |
| 37 | GO:0043229 | intracellular organelle | CELLULAR_COMPONENT | 1.94E-193 | 2.76E-196 |
| 38 | GO:0043226 | organelle | CELLULAR_COMPONENT | 1.97E-193 | 3.16E-196 |
| 39 | GO:0043231 | intracellular membrane-bounded organelle | CELLULAR_COMPONENT | 1.16E-179 | 2.06E-182 |
| 40 | GO:0043227 | membrane-bounded organelle | CELLULAR_COMPONENT | 5.74E-179 | 1.12E-181 |
| 41 | GO:0009579 | thylakoid | CELLULAR_COMPONENT | 2.71E-68 | 5.78E-71 |
| 42 | GO:0016020 | membrane | CELLULAR_COMPONENT | 1.08E-20 | 3.06E-23 |
| 43 | GO:0005739 | mitochondrion | CELLULAR_COMPONENT | 2.48E-12 | 8.82E-15 |
| 44 | GO:0005840 | ribosome | CELLULAR_COMPONENT | 1.48E-11 | 6.31E-14 |
| 45 | GO:1990904 | ribonucleoprotein complex | CELLULAR_COMPONENT | 3.36E-11 | 1.55E-13 |
| 46 | GO:0030529 | intracellular ribonucleoprotein complex | CELLULAR_COMPONENT | 3.36E-11 | 1.55E-13 |
| 47 | GO:0032991 | macromolecular complex | CELLULAR_COMPONENT | 1.22E-08 | 7.29E-11 |
| 48 | GO:0009507 | chloroplast | CELLULAR_COMPONENT | 2.01E-07 | 1.61E-09 |
| 49 | GO:0043228 | non-membrane-bounded organelle | CELLULAR_COMPONENT | 3.31E-06 | 3.06E-08 |
| 50 | GO:0043232 | intracellular non-membrane-bounded organelle | CELLULAR_COMPONENT | 3.31E-06 | 3.06E-08 |
| 51 | GO:0044434 | chloroplast part | CELLULAR_COMPONENT | 3.09E-04 | 3.52E-06 |
| 52 | GO:0044435 | plastid part | CELLULAR_COMPONENT | 3.34E-04 | 3.86E-06 |
| 53 | GO:0005198 | structural molecule activity | MOLECULAR_FUNCTION | 3.04E-05 | 3.08E-07 |
Clusters containing at least 13 species with predicted or likely plastid-targeted sequences were mined for common GO terms and compared against terms extracted for the total set of RBH-derived clusters using BLAST2GO. All terms passing the p = 1.0E−5 enrichment threshold in core plastid-targeted clusters are shown.
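To make the enrichment statistics concrete, the sketch below computes a one-sided Fisher's exact (hypergeometric) p-value for a GO term and applies a Benjamini-Hochberg correction across terms. The caption names only BLAST2GO, so the exact test and correction used here are assumptions, and the counts and p-values are hypothetical. One property worth keeping in mind when reading the table is that a Benjamini-Hochberg adjusted value is never smaller than the raw p-value it was derived from.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(X >= k): k of n test clusters carry the term, K of N reference clusters do."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical term: 30 of 100 conserved clusters vs 200 of 5,000 overall.
p = enrichment_p(k=30, n=100, K=200, N=5000)
raw = [0.01, 0.04, 0.03]
adj = benjamini_hochberg(raw)
print(p, adj)
```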
Enriched GO terms for Conserved Plastid-Targeted UCLUST Clusters.
| # | GO term | Description | Ontology | FDR | P-value |
|---|---|---|---|---|---|
| 1 | GO:0008152 | metabolic process | BIOLOGICAL_PROCESS | 3.36E-32 | 9.19E-35 |
| 2 | GO:0015979 | photosynthesis | BIOLOGICAL_PROCESS | 1.24E-29 | 3.62E-32 |
| 3 | GO:0044710 | single-organism metabolic process | BIOLOGICAL_PROCESS | 1.38E-21 | 4.57E-24 |
| 4 | GO:0044711 | single-organism biosynthetic process | BIOLOGICAL_PROCESS | 5.52E-16 | 2.15E-18 |
| 5 | GO:0044699 | single-organism process | BIOLOGICAL_PROCESS | 2.89E-15 | 1.18E-17 |
| 6 | GO:0006091 | generation of precursor metabolites and energy | BIOLOGICAL_PROCESS | 1.15E-13 | 4.93E-16 |
| 7 | GO:0005975 | carbohydrate metabolic process | BIOLOGICAL_PROCESS | 1.20E-10 | 5.84E-13 |
| 8 | GO:0006629 | lipid metabolic process | BIOLOGICAL_PROCESS | 1.33E-06 | 7.01E-09 |
| 9 | GO:0051234 | establishment of localization | BIOLOGICAL_PROCESS | 1.85E-04 | 1.23E-06 |
| 10 | GO:0006810 | transport | BIOLOGICAL_PROCESS | 1.85E-04 | 1.20E-06 |
| 11 | GO:0051179 | localization | BIOLOGICAL_PROCESS | 2.65E-04 | 1.81E-06 |
| 12 | GO:0016043 | cellular component organization | BIOLOGICAL_PROCESS | 2.98E-04 | 2.10E-06 |
| 13 | GO:0044723 | single-organism carbohydrate metabolic process | BIOLOGICAL_PROCESS | 3.01E-04 | 2.23E-06 |
| 14 | GO:0071840 | cellular component organization or biogenesis | BIOLOGICAL_PROCESS | 4.29E-04 | 3.26E-06 |
| 15 | GO:0042592 | homeostatic process | BIOLOGICAL_PROCESS | 8.59E-04 | 6.87E-06 |
| 16 | GO:0009536 | plastid | CELLULAR_COMPONENT | 1.01E-165 | 1.97E-169 |
| 17 | GO:0044464 | cell part | CELLULAR_COMPONENT | 1.54E-140 | 6.02E-144 |
| 18 | GO:0005623 | cell | CELLULAR_COMPONENT | 1.52E-139 | 8.87E-143 |
| 19 | GO:0044444 | cytoplasmic part | CELLULAR_COMPONENT | 8.76E-120 | 6.83E-123 |
| 20 | GO:0005737 | cytoplasm | CELLULAR_COMPONENT | 3.57E-119 | 3.49E-122 |
| 21 | GO:0044424 | intracellular part | CELLULAR_COMPONENT | 2.06E-110 | 2.41E-113 |
| 22 | GO:0005622 | intracellular | CELLULAR_COMPONENT | 1.27E-104 | 1.73E-107 |
| 23 | GO:0043229 | intracellular organelle | CELLULAR_COMPONENT | 6.39E-93 | 1.05E-95 |
| 24 | GO:0043226 | organelle | CELLULAR_COMPONENT | 6.39E-93 | 1.12E-95 |
| 25 | GO:0043231 | intracellular membrane-bounded organelle | CELLULAR_COMPONENT | 6.87E-82 | 1.34E-84 |
| 26 | GO:0043227 | membrane-bounded organelle | CELLULAR_COMPONENT | 1.90E-81 | 4.07E-84 |
| 27 | GO:0009579 | thylakoid | CELLULAR_COMPONENT | 4.05E-39 | 9.47E-42 |
| 28 | GO:0016020 | membrane | CELLULAR_COMPONENT | 4.66E-36 | 1.18E-38 |
| 29 | GO:0071944 | cell periphery | CELLULAR_COMPONENT | 7.58E-11 | 3.55E-13 |
| 30 | GO:0005886 | plasma membrane | CELLULAR_COMPONENT | 1.32E-07 | 6.69E-10 |
| 31 | GO:0009507 | chloroplast | CELLULAR_COMPONENT | 3.01E-04 | 2.18E-06 |
| 32 | GO:0005840 | ribosome | CELLULAR_COMPONENT | 5.00E-04 | 3.90E-06 |
| 33 | GO:0003824 | catalytic activity | MOLECULAR_FUNCTION | 9.90E-19 | 3.67E-21 |
Clusters containing at least 13 species with predicted or likely plastid-targeted sequences were mined for common GO terms and compared against terms extracted for the total set of UCLUST-derived clusters using BLAST2GO. All terms passing the p = 1.0E−5 enrichment threshold in core plastid-targeted clusters are shown.
Figure 7. Correlation of Total Proteome Size with Nascent Plastid-Targeted Proteins (NPTPs). (A) Clusters containing at least three species and with predicted plastid-targeted proteins in only one species were compared to the total proteome size for both RBH and UCLUST clustering methods. Although the correlation was moderately linear when P. virgatum was included, its extremely large proteome skewed results. (B) Correlation after removal of P. virgatum. The weakly linear correlation suggests that the evolution of novel transit peptides may be a largely random process.
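The comparison in Figure 7 can be approximated from the tables above by correlating each species' sequence count with its RBH NPTP count, with and without P. virgatum. This sketch makes three assumptions: that "total proteome size" corresponds to the Sequences column, that rows appear in the same species order in both tables (the Arabidopsis and P. virgatum rows are consistent with this), and that the first of the two counts is used for the custom-transcriptome row. It uses a plain Pearson coefficient, which may not be the statistic the authors plotted.

```python
import math

# Sequences per species (targeting table) and RBH NPTPs, in table order.
proteome = [26846, 27959, 35386, 52972, 32831, 73320, 57386, 49061,
            133775, 73013, 47089, 43001, 47205, 34727, 55564]
nptps = [82, 52, 74, 229, 140, 191, 328, 184,
         682, 167, 112, 268, 71, 250, 113]
P_VIRGATUM = 8  # zero-based index of the 133,775-sequence row

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_all = pearson(proteome, nptps)
r_without = pearson(
    [x for i, x in enumerate(proteome) if i != P_VIRGATUM],
    [y for i, y in enumerate(nptps) if i != P_VIRGATUM],
)
print(f"r with P. virgatum: {r_all:.2f}; without: {r_without:.2f}")
```

Dropping a single high-leverage point and recomputing, as here, is a simple way to see how much of an apparent linear trend depends on one species.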