Marie E Bolger, Borjana Arsova, Björn Usadel.
Abstract
Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plant sciences. Although genome and transcriptome sequencing has become orders of magnitude cheaper and more efficient, functional annotation often lags behind. Annotation might be hampered by the lack of a comprehensive enumeration of simple-to-use tools available to the plant researcher. In this comprehensive review, we present (i) typical ontologies to be used in the plant sciences, (ii) useful databases and resources used for functional annotation, (iii) what to expect from an annotated plant genome, (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typical steps used to annotate plant genomes/transcriptomes using publicly available resources.
Year: 2018 PMID: 28062412 PMCID: PMC5952960 DOI: 10.1093/bib/bbw135
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Overview of the number of annotated genes for the genome of the model plant A. thaliana based on analysis of GO terms. The GoSlim annotations were downloaded from the TAIR Web site (ftp://ftp.arabidopsis.org/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt, downloaded July 2016). For each of the three main GO domains, the respective annotations were categorized according to the evidence code. The ‘Experimental’ category includes genes annotated with evidence codes IDA (Inferred from Direct Assay), IMP (Inferred from Mutant Phenotype), IGI (Inferred from Genetic Interaction), IPI (Inferred from Physical Interaction) or IEP (Inferred from Expression Profile). ‘Curated’ includes genes annotated with evidence codes IC (Inferred by Curator), NAS (Non-traceable Author Statement) or TAS (Traceable Author Statement), but lacking any annotation covered by the ‘Experimental’ category. ‘Electronic’ includes genes annotated with evidence codes ISS (Inferred from Sequence or Structural Similarity), ISO (Inferred from Sequence Orthology), ISM (Inferred from Sequence Model), IBA (Inferred from Biological Aspect of Ancestor), RCA (Inferred from Reviewed Computational Analysis) or IEA (Inferred from Electronic Annotation), but lacking any annotation from the ‘Experimental’ or ‘Curated’ categories. (A) The three aspects are shown separately. (B) The best annotation from multiple domains is shown, with the combination of Molecular Function and Biological Process on the left, and all three domains combined on the right.
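The categorization described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the input is assumed to be an iterable of `(gene_id, aspect, evidence_code)` tuples already parsed from an annotation file, the single-letter aspect codes `F`, `P` and `C` are an assumption about how the three GO domains are encoded, and the function names are hypothetical.

```python
# Evidence-code groups as listed in the figure caption.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}
CURATED = {"IC", "NAS", "TAS"}
ELECTRONIC = {"ISS", "ISO", "ISM", "IBA", "RCA", "IEA"}

# Lower rank means stronger evidence; a gene keeps its strongest category.
RANK = {"Experimental": 0, "Curated": 1, "Electronic": 2}


def category_of(code):
    """Map an evidence code to its category, or None if it is not listed."""
    if code in EXPERIMENTAL:
        return "Experimental"
    if code in CURATED:
        return "Curated"
    if code in ELECTRONIC:
        return "Electronic"
    return None


def best_category_per_gene(annotations, aspects=None):
    """Return {gene_id: best category} over the selected GO aspects.

    annotations: iterable of (gene_id, aspect, evidence_code) tuples
                 (assumed input shape, not a specific file format).
    aspects:     optional set of aspect codes to keep, e.g. {"F", "P"};
                 None keeps all aspects.
    """
    best = {}
    for gene, aspect, code in annotations:
        if aspects is not None and aspect not in aspects:
            continue
        cat = category_of(code)
        if cat is None:
            continue  # codes outside the three groups are ignored
        if gene not in best or RANK[cat] < RANK[best[gene]]:
            best[gene] = cat
    return best


def count_categories(best):
    """Count how many genes fall into each category."""
    counts = {name: 0 for name in RANK}
    for cat in best.values():
        counts[cat] += 1
    return counts
```

Under these assumptions, calling `best_category_per_gene(rows, aspects={"F"})` corresponds to a single bar group in panel A, while `aspects={"F", "P"}` or `aspects=None` corresponds to the combined views in panel B.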
Available resources for protein family- or domain-based functional identifications
| Resource | Version | Families | Web address | Comments |
|---|---|---|---|---|
| PFAM | 30.0 | 16 306 | | |
| TIGRFAM | 15.0 | 4488 | | |
| PANTHER | 11.0 | 13 096 | | |
| SMART | 7.1 | 1312 | | License necessary |
| EggNOG | 4.5 | 190 648 (37 127 plants) | | |
| INTERPROSCAN | 58.0 | >40 000 integrated entries | | Meta engine including all other resources except EggNOG but not necessarily the most recent version at all times |
| CDD | 3.15 | 52 411 (11 474 from CDD curation) | | Uses RPS-BLAST and includes partly older versions of PFAM, SMART and TIGRFAM |
Available resources to complement functional annotation
| Resource | Web address | Comments |
|---|---|---|
| TMHMM | | Can be downloaded and installed locally for academics. Online version allows the submission of 10 000 sequences at most |
| TOPCONS | | Can be downloaded and installed freely (GPL v2). Online version allows the submission of 100 MB sequence data at most |
| TargetP | | Can be downloaded and installed locally for academics. The online version allows the submission of 2000 sequences at most |
| Plant-mPLoc | | At time of writing, problem with multifasta submission |
| AtSubP | | Up to 2000 predictions |
| Predotar | | Only N-terminal signals for mitochondria and chloroplasts |
| PHOSFER | | Free for academic use only |
| PhosPhAt | | |
| PlantPhos | | Uploads <2 MB |
| Musite | | ≤100 predictions; can be downloaded and installed locally freely (GPL v3) |
| TAIR/Protein Interaction Data | | |
| Arabidopsis Predicted Interactome and Arabidopsis Interactions Viewer | | Downloadable from TAIR, these are the data for interactome v2.0 (also available at the Arabidopsis Interactions Viewer). In total, 70 000 predicted interactions and 3000 experimentally determined interactions |
| IntAct | | Interactions from literature curations or user submissions; part of the IMEx consortium |
| AtPIN | | Incorporates data from: IntAct, BioGRID, TAIR, Predicted Interactome for |
| ANAP | | Integrates 11 interaction databases |
| M.I.N.D | | In total, 12 102 high-confidence protein–protein interactions, based on split-ubiquitin system in yeast; in addition, >3000 Arabidopsis membrane proteins in a separate screen are included |
| PPIM | | Contains predictions and information from literature |
| PRIN | | Predictions based on interologs in various model organisms, where studies have been carried out |
Integrated tools for the functional analysis of plant genomes
| Resource | Time taken | Annotation rate (%) | Comments |
|---|---|---|---|
| Reference | — | 51 | At least one GO term assigned including cellular component |
| Blast2GO | 8 h 23 min | 78 | BLAST is performed locally or as WebBLAST via NCBI; InterProScan is performed as a Web service at the European Bioinformatics Institute (EBI) |
| KAAS | 10 min (single-directional best hit (SBH) only, as the input was a survey sample of sequences) | 29 | Runs as a Web service, no user resources needed |
| GhostKOALA | 28 min | 26 | Runs as a Web service, no user resources needed |
| Mercator | 5 min | 56 | Runs as a Web service, no user resources needed |
| TRAPID | 5 min | 56 | Runs as a Web service, no user resources needed |
Note. For the analysis, the first 1476 proteins from the Brassica proteome version 5, representing exactly 10 000 lines of text, were downloaded from http://www.genoscope.cns.fr/brassicanapus/data/ alongside their GO annotations and submitted to the various services. Where available, searches were limited to plant data sets. In the case of Blast2GO, WebBLAST was used. We have rounded the values, as annotations are subject to updates and the time taken will depend on server load. These values should therefore be seen as a general orientation.
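As a rough illustration of the benchmark set-up described in the note, the sketch below takes the first N records from a FASTA file and computes an annotation rate. It assumes that "annotation rate" means the percentage of submitted proteins that received at least one annotation; the note does not spell out the formula, so this definition, the helper names and the input shapes are all assumptions rather than the authors' procedure.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines.

    The identifier is taken as the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:].split()[0], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def first_n_records(lines, n):
    """Return the first n FASTA records as a list of (id, sequence)."""
    return list(islice(read_fasta(lines), n))


def annotation_rate(submitted_ids, annotated_ids):
    """Percentage of submitted proteins with at least one annotation."""
    submitted = set(submitted_ids)
    if not submitted:
        raise ValueError("no proteins submitted")
    annotated = submitted & set(annotated_ids)
    return 100.0 * len(annotated) / len(submitted)
```

With a file handle opened on a proteome FASTA file, `first_n_records(handle, 1476)` would mirror the subset size used in the note, and `annotation_rate` could then be applied to the identifiers returned by each service.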
Tools and Web sites useful in annotating large protein families
| Resource | Function | Web address |
|---|---|---|
| CoGe | Compares genomes, find synteny | |
| PlantTFDB | Plant Transcription Factor families | |
| Potsdam plntfdb | Plant Transcription Factor families | |
| P450 Database | P450 protein families | |
| CAZy | Enzymes acting on carbohydrates | |
| Aramemnon | Plant membrane proteins | |
| Merops Database | Peptidases | |
| PLAZA | Generalist Plant Family database | |
| GreenPhylDB | Generalist Plant Family database | |
Note. (a) Also lists a comprehensive set of tools for transmembrane domains, subcellular localization and lipid modifications.
Figure 2. Flowchart for the annotation of plant genomes/transcriptomes.