Wilfred W Li, Greg B Quinn, Nickolai N Alexandrov, Philip E Bourne, Ilya N Shindyalov.
Abstract
Using an integrative genome annotation pipeline (iGAP) for proteome-wide protein structure and functional domain assignment, we analyzed all the proteins of Arabidopsis thaliana. Three-dimensional structures at the level of the domain are assigned by fold recognition and threading based on a novel fold library that extends common domain classifications. iGAP is being applied to proteins from all available proteomes as part of a comparative proteomics resource. The database is accessible from the web.
Year: 2003 PMID: 12914659 PMCID: PMC193643 DOI: 10.1186/gb-2003-4-8-r51
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Comparison of different annotation pipelines
| Pipeline | Focus area | Applications | Coverage |
| GeneQuiz | Sequence homology | BLAST, FASTA, COILS, | 65 genomes |
| PEDANT | Gene prediction | BLAST, PSI-BLAST, | 133 complete genomes, 91 partial genomes |
| PAT | Sequence homology | WU-BLAST, PSI-BLAST, | 103+ genomes, continuous expansion |
Figure 1. The integrative genome annotation pipeline (iGAP). Processing of initial structural information is shown on the left and processing of initial sequence information on the right. Green shading indicates a processing step involving structure information and blue shading a processing step involving a sequence. Steps boxed with dotted lines indicate partial integration into the benchmarking scheme. See text for further details.
Figure 2. Overview of the user interface. The information stored in the database may be accessed by known identifiers, keywords, browsing classifications (SCOP and FOLDLIB) and by sequence. Identifiers supported include Arabidopsis locus id, NCBI gi number, SCOP id, PDB id, FOLDLIB id and PFAM id. Keywords are limited to those available in each original data source.
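The identifier-based access described in the Figure 2 caption implies some way of telling identifier types apart. The following is a minimal sketch of such a dispatcher, not the PAT implementation: the function name `classify_identifier` and the regular expressions are assumptions based on common conventions for Arabidopsis locus ids, NCBI gi numbers, PDB ids and PFAM accessions. SCOP and FOLDLIB ids are omitted because their formats are not given here.

```python
import re

# Hypothetical patterns based on common identifier conventions;
# none of them are taken from the PAT source code.
PATTERNS = [
    # Arabidopsis locus id, e.g. At5g11550 or At5g11550.1
    ("arabidopsis_locus", re.compile(r"^At[1-5CM]g\d{5}(\.\d+)?$", re.IGNORECASE)),
    # PFAM accession, e.g. PF00067
    ("pfam", re.compile(r"^PF\d{5}$", re.IGNORECASE)),
    # NCBI gi number: digits only (checked before PDB, so an all-digit
    # four-character query is treated as a gi number)
    ("ncbi_gi", re.compile(r"^\d+$")),
    # PDB id: one digit followed by three alphanumeric characters, e.g. 1gcc
    ("pdb", re.compile(r"^\d[A-Za-z0-9]{3}$")),
]


def classify_identifier(query):
    """Return the first matching identifier type, or 'keyword' if none match."""
    q = query.strip()
    for name, pattern in PATTERNS:
        if pattern.match(q):
            return name
    return "keyword"


for example in ["At5g11550.1", "15239082", "1gcc", "PF00067", "cytochrome P450"]:
    print(example, "->", classify_identifier(example))
```

Anything that matches no pattern falls through to a keyword search, which mirrors the caption's distinction between identifier and keyword access.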
Database feature comparison
| Databases | Features | Scope | Level of integration | Learning curve | Drawbacks |
| Entrez Genome | Domains from CDD (SMART, PFAM) | All sequences published or voluntarily deposited; 1,000+ genomes | High | Easy to high | Complex system |
| EBI Proteome Analysis Database | InterPro member databases (SwissProt, PFAM, SMART, TIGRFAM, PRINTS, PROSITE, ProDom, PIR SuperFamily) | Complete proteomes in SwissProt and TrEMBL | Medium | Easy to moderate | SRS based query interface free to academia |
| MatDB | | | Medium | Easy to moderate | Query response time varies |
| Proteins of Arabidopsis thaliana (PAT) | Domains from SCOP, predicted domains from PDP, and full length PDB chains with less than 90% sequence identity (FOLDLIB) | Currently 87; expanding to provide coverage for all known proteomes | Medium | Easy to moderate | Presentation |
| TAIR | GO and other ontology development | Comprehensive resource devoted to Arabidopsis | Medium | Easy to moderate | No structural information |
| SUPERFAMILY | HMM (SAM) models for SCOP domains | 107 genomes | Low to medium | Easy to moderate | Presentation style |
| Gene 3D | Structural assignment based on CATH domain classification using PSI-BLAST | 66 genomes | Low | Easy | Annotation not dynamically linked to CATH |
Comparison of PAT with other resources
| PAT | PEDANT/MatDB | TAIR/GO | EBI Proteomes/InterPro | |
| 94% A-E | 30.9% PDB | 38% ALL | 77.3% InterPro | |
| 84% A-D | 26.7% SCOP | 14% Non-IEA | 0.07% PDB | |
| 65% A-C | ||||
| 46% A-B | ||||
| 38% A | ||||
| Target | Other sources | PAT results | PAT reliability | |
| AP2 domain (1gcc) | 140 hits by BLAST against NR | 155 hits | C (90% certainty) or above | |
| 15239082 (At5g11550.1) | No hits by PSI-BLAST | 1EE4 | C | |
| 15228210 (At3g47660) | FYVE/PHD zinc finger | FYVE/PHD zinc finger; | A (99.9% certainty); | |
| Cytochrome P450 | 238 (TAIR GO) | 249 hits | C or above | |
| Protein-kinase-like domain | 1037 hits (PEDANT/MatDB) 951 hits (TAIR GO) | 1,179 hits | C or above | |
| Alpha/beta hydrolase fold | 194 hits (PEDANT/MatDB, SCOP 3.65) | 340 hits | C or above | |
| Human | 69 hits (PEDANT/MatDB, | 1,086 hits | C or above | |
(a) Percent coverage against specific data sources. (b) The PDB sequence of 1gcc [22] was used to perform a standard BLAST search. The putative protein with gi number 15239082 (At5g11550.1) returns no hits using PSI-BLAST. The putative protein with gi number 15228210 (locus id At3g47660) contains a FYVE/PHD zinc finger domain and an RCC1-like domain (a regulator of chromosome condensation). TAIR also reported a sugar transporter signature for this protein from a Prosite search. The term 'cytochrome P450' was used to search the TAIR GO annotation (release). This count was obtained with the search-by-keyword query feature after we loaded the TAIR GO data into our database. The cytochrome P450 fold in the SCOP hierarchy was used to retrieve the hits from PAT. Actual hits may vary between releases.
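The PAT column in part (a) lists coverage by reliability range (A, A-B, A-C, and so on). Assuming these figures are cumulative, that is, each range includes all higher-reliability grades, the contribution of each individual grade follows by subtraction. The sketch below illustrates that arithmetic; the function name `per_grade_coverage` is hypothetical and the cumulative reading is an assumption, not something the table states.

```python
# Cumulative PAT coverage (percent) copied from part (a) of the table.
# Assumption: each entry includes all higher-reliability grades.
CUMULATIVE = [("A", 38), ("B", 46), ("C", 65), ("D", 84), ("E", 94)]


def per_grade_coverage(cumulative):
    """Convert cumulative coverage into the increment added by each grade."""
    result = {}
    previous = 0
    for grade, total in cumulative:
        result[grade] = total - previous
        previous = total
    return result


increments = per_grade_coverage(CUMULATIVE)
# Share left without any assignment at grade E or better.
unassigned = 100 - CUMULATIVE[-1][1]

print(increments)
print("unassigned:", unassigned)
```

Under that reading, grade A alone accounts for 38 points and grades B to E add 8, 19, 19 and 10 points respectively, leaving 6% without an assignment.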
Figure 3. Classes of Arabidopsis proteome annotation. (a) The functional annotation on Arabidopsis proteins provided by the NCBI NR database. In this database, 36.4% of Arabidopsis proteins are reliably assigned on the basis of experimental evidence; 55.6% are annotated when automated annotation is included. These data are based on the 17 October 2001 release of NR. (b) Structural annotation provided by PAT. PAT has 69.3% coverage with a C reliability or better.
Sampling of known Arabidopsis protein structures in PAT
| PDB ID | SCOP family | SCOP superfamily | GI number | Name | Domain found | Reliability | Number of unknown or putative proteins with similar domain : total number* | |
| 1dj2 | Nitrogenase iron protein-like | P-loop containing nucleotide triphosphate hydrolases | 15230358 | Adenylosuccinate synthetase | 1dj2 (48-490) | A | 1:2 | |
| 1dcf | The receiver domain of the ethylene receptor | CheY-like | 15219629 | The receiver domain of the ethylene receptor | 1dcf (605-736) | A | 19:33 | |
| 1jh7 | Cyclic nucleotide phospho-diesterase | Cyclic nucleotide phospho-diesterase | 15234068 | Putative protein | 1fsi (1-181) | A | 2:2 | |
| 2aak | Ubiquitin conjugating enzyme | Ubiquitin conjugating enzyme | 15223746 | Ubiquitin conjugating enzyme | 1a3s (1-151) | A | 6:12 | |
| 1vok | TATA-box binding protein (TBP), carboxy-terminal domain | TATA-box binding protein-like | 15231241 | TATA sequence-binding protein 1 | 1ais (12-198) | A | 0:2 | |
| 3nul | Profilin (actin-binding protein) | Profilin (actin-binding protein) | 15224838 | Profilin 1 | 3nul (2-131) | A | 0:4 | |
| 1ibj | Cystathionine synthase-like | PLP-dependent transferases | 15230203 | Cystathionine beta-lyase precursor | 1ibj (1-464) | A | 41:54 | |
| PDB ID | SCOP family | SCOP superfamily | GI number | Name | Domain found | Reliability | Method | |
| 1gp4,6 | Penicillin synthase-like | Clavaminate synthase-like | 15235853 | Putative leucoantho-cyanidin dioxygenase | 1hjg (43-350) | A | 123D | |
| 1e6b (88-220) | Glutathione | Pseudo SCOP entry by PAT (glutathione | 15226952 | Putative glutathione | 1fw1 (89-193) | A | WU-BLAST | |
| Thioredoxin-like (glutathione | 1fw1 [1-218] | A | 123D | |||||
| 1fw1 [11-215] | A | WU-BLAST | ||||||
| 1e6b (8-87) | Thioredoxin-like | 1fw1 (11-89) | A | WU-BLAST | ||||
(a) The known Arabidopsis PDB ids were obtained from the NCBI pdbaa FASTA file (9/1/02 release). Each PDB id was used as a query in the PAT id search field. The 'Domain found' column lists some of the domains found in the protein. Use the GI number to search the PAT web site to see all possible domain assignments. If there are multiple domain boundaries specified, only the longest possible domain boundary is listed. *Non-NR entries were also excluded from the statistics collected in the last column of the table. Only predictions with higher than C reliability (90% certainty) are included. The non-NR entries (contributed by Ceres, Inc) were absent from NR of NCBI at the time of analysis. 1gp4, 1gp6, and 1e6b were not in SCOP release 1.55 or the FOLDLIB in this study (see Table 1b). 1j6y was an NMR structure and was excluded. (b) The sequences of the three structures not in the FOLDLIB were analyzed as unknown proteins. The assignment by SCOP release 1.59 is enclosed in parentheses. In the case of 1e6b, two distinct domains are classified by SCOP 1.59; the two regions are listed after the PDB id. In the case of 1gp4 or 1gp6, only 123D correctly produced an A prediction. In the case of 1e6b, the template is predicted correctly by both 123D and WU-BLAST, but WU-BLAST produced multiple domains, two of which coincide with the SCOP release 1.59 assignment.
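One way to make the agreement between predicted domain boundaries and the SCOP 1.59 regions for 1e6b concrete is to compute how much of each SCOP region a prediction covers. The sketch below is illustrative only and is not the criterion used by the authors. It assumes that the residue ranges in part (b) of the table share one numbering scheme, which the table does not state, and the helper names are hypothetical.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive ranges (start, end)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def coverage_of_reference(predicted, reference):
    """Fraction of the reference range covered by the predicted range."""
    reference_length = reference[1] - reference[0] + 1
    return overlap_length(predicted, reference) / reference_length


# Ranges copied from part (b) of the table for 1e6b.
scop_c_terminal = (88, 220)     # SCOP 1.59 region 1e6b (88-220)
wublast_c_terminal = (89, 193)  # WU-BLAST domain 1fw1 (89-193)
scop_n_terminal = (8, 87)       # SCOP 1.59 region 1e6b (8-87)
wublast_n_terminal = (11, 89)   # WU-BLAST domain 1fw1 (11-89)

for label, predicted, reference in [
    ("C-terminal", wublast_c_terminal, scop_c_terminal),
    ("N-terminal", wublast_n_terminal, scop_n_terminal),
]:
    shared = overlap_length(predicted, reference)
    fraction = coverage_of_reference(predicted, reference)
    print(f"{label}: {shared} shared residues, {fraction:.2f} of SCOP region")
```

Ranges are treated as inclusive at both ends, matching the usual convention for residue numbering.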
Figure 4. SCOP classifications for the Arabidopsis thaliana proteome. (a) Occurrences of SCOP folds. Folds belonging to the same SCOP class are shaded the same color. (b) Occurrences of SCOP families. Families belonging to the same fold are shaded the same color. Families belonging to the same fold but to different superfamilies are indicated by striped bars. The top 15 folds and families are shown. Data are based on SCOP release 1.59.