Jorge Alberto Jaramillo-Garzón, Joan Josep Gallardo-Chacón, César Germán Castellanos-Domínguez, Alexandre Perera-Lluna.
Abstract
BACKGROUND: Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally, as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned about each protein is usually annotated in databases through different methods, such as that proposed by the Gene Ontology (GO) consortium. Different methods have been proposed to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and their reported success rates are much lower than those reported by non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence.
Year: 2013 PMID: 23441934 PMCID: PMC3660269 DOI: 10.1186/1471-2105-14-68
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Definition and size of the classes
| GO term | Acronym | Samples | GO term | Acronym | Samples |
|---|---|---|---|---|---|
| Nucleotide binding | Ntbind | 47 | Reproduction* | Reprod* | 337 |
| Molecular function* | MF* | 268 | Carbohydrate metabolic process | ChMet | 315 |
| DNA binding | DnaBind | 107 | Generation of precursor metabolites and energy | MetEn | 150 |
| Transcription factor activity | TranscFact | 307 | | | |
| RNA binding | RnaBind | 43 | Nucleobase, nucleoside, nucleotide, nucleic acid metabolic process* | NaMet* | 712 |
| Catalytic activity* | Catal* | 334 | | | |
| Receptor binding | RecBind | 38 | DNA metabolic process | DnaMet | 191 |
| Transporter activity | Transp | 125 | Translation | Transl | 82 |
| Binding* | Bind* | 173 | Protein modification process | ProtMod | 391 |
| Protein binding* | ProtBind* | 630 | Lipid metabolic process | LipMet | 324 |
| Kinase activity | Kinase | 68 | Transport | Transport | 531 |
| Transferase activity* | Transf* | 173 | Response to stress | StressResp | 790 |
| Hydrolase activity | Hydrol | 190 | Cell cycle | CellCycle | 234 |
| Enzyme regulator activity | EnzReg | 41 | Cell communication* | CellComm* | 66 |
| | | | Signal transduction | SigTransd | 305 |
| | | | Cell-cell signaling | Cell-cell | 53 |
| | | | Multicellular organismal development* | MultDev* | 490 |
| Cellular component* | CC* | 234 | Biological process* | BP* | 879 |
| Extracellular region | ExtcellReg | 109 | Metabolic process* | Met* | 279 |
| Cell wall | CellWall | 77 | Cell death | CellDeath | 95 |
| Intracellular* | Intracell* | 167 | Catabolic process | Catabolic | 479 |
| Nucleus* | Nucleus* | 421 | Biosynthetic process* | Biosint* | 1125 |
| Nucleoplasm | NuclPlasm | 51 | Response to external stimulus* | ExtResp* | 65 |
| Nucleolus | Nucleolus | 84 | Tropism | Tropism | 36 |
| Cytoplasm* | CitPlasm* | 168 | Response to biotic stimulus | BioResp | 275 |
| Mitochondrion | Mitochond | 244 | Response to abiotic stimulus | AbioResp | 642 |
| Endosome | Endosome | 58 | Anatomical structure morphogenesis | StrMorph | 366 |
| Vacuole | Vacuole | 171 | Response to endogenous stimulus | EndoResp | 332 |
| Peroxisome | Peroxisome | 32 | Embryonic development | EmbDev | 139 |
| Endoplasmic reticulum | EndRet | 109 | Post-embryonic development* | PostDev* | 375 |
| Golgi apparatus | GolgiApp | 100 | Pollination | Poll | 43 |
| Cytosol | Cytosol | 389 | Flower development | FlowerDev | 228 |
| Ribosome | Ribosome | 98 | Cellular process* | CP* | 1486 |
| Plasma membrane | PlasmMb | 353 | Response to extracellular stimulus | ExtcellResp | 59 |
| Plastid | Plastid | 696 | Photosynthesis | Photosyn | 102 |
| Thylakoid | Thylk | 147 | Cellular component organization | CellOrg | 757 |
| Membrane* | Mb* | 472 | Cell growth | CellGrowth | 133 |
| | | | Protein metabolic process* | ProtMet* | 187 |
| | | | Cellular homeostasis | CellHom | 53 |
| | | | Secondary metabolic process | SecMet | 164 |
| | | | Cell differentiation | CellDiff | 267 |
| | | | Growth* | Growth* | 64 |
| | | | Regulation of gene expression, epigenetic | RGE | 103 |
The list of GO terms covered by this analysis is intended to provide a complete landscape of GO predictability at the three levels of protein functionality in Embryophyta plants. For classification purposes, classes marked with an asterisk (*) were redefined: the number of samples in each of those categories counts the sequences associated with that class and with none of its descendants that are also listed.
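The redefinition of the starred classes can be read as a set difference. The following is a minimal sketch, not the authors' implementation: the function name, the protein identifiers and the toy annotation sets are hypothetical, and it assumes that each parent's annotation set already includes the proteins annotated to its descendants.

```python
def redefine_class(term, annotations, listed_descendants):
    """Return proteins annotated to `term` but to none of its listed descendants.

    `annotations` maps a GO term name to a set of protein IDs (hypothetical data).
    `listed_descendants` maps a term to those of its descendants that also
    appear in the class list.
    """
    members = set(annotations.get(term, ()))
    for child in listed_descendants.get(term, ()):
        members -= set(annotations.get(child, ()))
    return members


# Hypothetical toy data: P2 and P3 also belong to a listed descendant.
annotations = {
    "Binding": {"P1", "P2", "P3", "P4"},
    "DNA binding": {"P2"},
    "RNA binding": {"P3"},
}
listed_descendants = {"Binding": ["DNA binding", "RNA binding"]}

binding_star = redefine_class("Binding", annotations, listed_descendants)
```

Under these assumptions, `binding_star` keeps only the proteins that would be counted for the redefined `Bind*` class.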
Table 2. Initial set of features extracted from amino acid sequences
| Category | Feature | Number of features |
|---|---|---|
| Physical-chemical | Sequence length | 1 |
| | Molecular weight | 1 |
| | Positively charged residues (%) | 1 |
| | Negatively charged residues (%) | 1 |
| | Isoelectric point | 1 |
| | GRAVY | 1 |
| Primary structure statistics | Amino acid frequencies | 20 |
| | Amino acid dimer frequencies | 400 |
| Secondary structure statistics | Structure frequencies | 3 |
| | Structural dimer frequencies | 9 |
| Total | | 438 |
Features are divided into three broad categories: physical-chemical features, primary structure composition statistics and secondary structure composition statistics.
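Part of this feature set can be computed directly from the sequence string. The sketch below is illustrative only and covers sequence length, charged-residue percentages, the 20 amino acid frequencies and the 400 dimer frequencies. It makes several assumptions that the source does not state: K and R are taken as the positively charged residues and D and E as the negatively charged ones, dimers are counted with overlap, and non-standard residues are ignored. Molecular weight, isoelectric point, GRAVY and the secondary structure statistics are omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 dimers


def sequence_features(seq):
    """Return [length, %positive, %negative] + 20 frequencies + 400 dimer frequencies."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    # Relative frequency of each standard amino acid.
    aa_freq = [seq.count(a) / n for a in AMINO_ACIDS]

    # Overlapping dimers (assumption); dimers with non-standard residues are skipped.
    dimer_counts = dict.fromkeys(DIMERS, 0)
    n_pairs = max(n - 1, 1)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in dimer_counts:
            dimer_counts[pair] += 1
    dimer_freq = [dimer_counts[d] / n_pairs for d in DIMERS]

    # Charged residue percentages (assumed definitions: K/R positive, D/E negative).
    positive = 100.0 * sum(seq.count(a) for a in "KR") / n
    negative = 100.0 * sum(seq.count(a) for a in "DE") / n

    return [n, positive, negative] + aa_freq + dimer_freq


features = sequence_features("MKDEAR")  # hypothetical toy sequence
```

The resulting vector has 3 + 20 + 400 = 423 entries; the remaining features in the table would require a physicochemical scale or a secondary structure predictor.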
Table 3. Feature clusters
| Cluster | Description | Features | Cluster | Description | Features |
|---|---|---|---|---|---|
| 1 | Protein length | 34 | 9 | Proline | 14 |
| 2 | Negative charge / Acidic | 8 | 10 | Glutamine | 35 |
| 3 | Positive charge / Basic | 30 | 11 | Arginine | 26 |
| 4 | Alanine | 10 | 12 | Tryptophan | 38 |
| 5 | Cysteine | 38 | 13 | Tyrosine | 35 |
| 6 | Hydrophobic | 46 | 14 | Alpha helices | 6 |
| 7 | Histidine | 29 | 15 | Beta sheets | 4 |
| 8 | Asparagine / Methionine | 85 | | | |
Description of the clusters of features with similar information content. A complete description of the features belonging to each cluster can be found in Additional file 2.
Figure 1. Prediction performance with different feature clusters for the three ontologies: (a) Molecular function, (b) Cellular component and (c) Biological process. Rows represent the classes in Table 1, while columns represent the feature groups in Table 3. For each ontology, the best-predicted categories are ordered from top to bottom and the most discriminant feature groups from left to right.
Figure 3. Prediction performance with the complete set of features for the three ontologies: (a) Molecular function, (b) Cellular component and (c) Biological process. Bars in the left plots show the sensitivity and specificity of the SVMs. Lines depict the geometric mean as a global performance measure for SVM (green) and BLASTP (blue). The right plots depict the variability across the five folds of cross-validation. For each ontology, the best-predicted categories are ordered from top to bottom.
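The three measures named in this caption can be computed from binary labels as follows. This is a minimal sketch assuming that the geometric mean is taken over sensitivity and specificity; the function name and the toy labels are hypothetical and not taken from the paper.

```python
import math


def sensitivity_specificity_gmean(y_true, y_pred):
    """Return (sensitivity, specificity, geometric mean) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    # Guard against classes with no positive or no negative samples.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Hypothetical labels for one GO term: 1 = protein has the term, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
```

Unlike plain accuracy, the geometric mean drops to zero when either class is entirely misclassified, which is useful for unbalanced classes such as those in Table 1.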
Figure 2. Performance variation as a function of the identity cutoff for the three ontologies: (a) Molecular function, (b) Cellular component and (c) Biological process. Green and blue plots show the variation of the general prediction performance for SVM and BLASTP, respectively, according to the identity percentage cutoff used in the dataset. Boxplots show the dispersion across the 75 GO terms.
Figure 4. Propagated prediction performance for the three ontologies: (a) Molecular function, (b) Cellular component and (c) Biological process. Prediction performance when propagating the predictions of children nodes to their parents. Note that the asterisks in the category names have been removed, since categories now include all their members.
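Propagating the predictions of children nodes to their parents can be sketched as an upward traversal of the ontology graph. The code below is an illustrative assumption about how such propagation might be done, not the authors' procedure; the function name is hypothetical and the `parents` mapping is a small hand-written fragment restricted to terms listed in Table 1.

```python
def propagate_predictions(predicted, parents):
    """Return the predicted terms plus all of their ancestors.

    `parents` maps each term to its direct parents among the listed classes.
    """
    propagated = set(predicted)
    stack = list(predicted)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):
            if parent not in propagated:
                propagated.add(parent)
                stack.append(parent)
    return propagated


# Illustrative fragment of the molecular function hierarchy (listed classes only).
parents = {
    "DNA binding": ["Binding"],
    "Binding": ["Molecular function"],
    "Kinase activity": ["Transferase activity"],
    "Transferase activity": ["Catalytic activity"],
    "Catalytic activity": ["Molecular function"],
}

propagated = propagate_predictions({"DNA binding"}, parents)
```

With this fragment, a protein predicted as `DNA binding` is also counted as a member of `Binding` and `Molecular function`.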