Damiano Piovesan, Pier Luigi Martelli, Piero Fariselli, Andrea Zauli, Ivan Rossi, Rita Casadio.
Abstract
We introduce BAR-PLUS (BAR(+)), a web server for functional and structural annotation of protein sequences. BAR(+) is based on a large-scale genome cross-comparison and a non-hierarchical clustering procedure characterized by a metric that ensures a reliable transfer of features within clusters. In this version, the method takes advantage of a large-scale pairwise sequence comparison of 13,495,736 protein chains, including 988 complete proteomes. Available sequence annotation is derived from UniProtKB, GO, Pfam and PDB. When PDB templates are present within a cluster (with or without their SCOP classification), profile Hidden Markov Models (HMMs) are computed on the basis of sequence-to-structure alignment and are associated with the cluster (Cluster-HMM). The resulting library of 10,858 HMMs is made available for aligning even distantly related sequences for structural modelling. The server also provides pairwise alignments of the query sequence to the structural targets, computed from the corresponding Cluster-HMM. In its present version, BAR(+) allows three main categories of annotation: PDB [with or without SCOP (*)] and GO and/or Pfam; PDB (*) without GO and/or Pfam; and GO and/or Pfam without PDB (*). A fourth case is no annotation. Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant. BAR(+) is available at http://bar.biocomp.unibo.it/bar2.0.
Year: 2011 PMID: 21622657 PMCID: PMC3125743 DOI: 10.1093/nar/gkr292
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. BAR+ implementation. Our method collects sequences from the protein universe (UniProtKB), including 988 genomes. All the associated features are included as well: PDB structures (± SCOP classification) (red circles), GO terms (including Molecular Function, Biological Process and Cellular Localization) and Pfam models (blue circles). An extensive BLAST alignment of all the 13 495 736 sequences is performed in a GRID environment. The sequence similarity network is built by connecting two sequences only if their sequence identity (SI) is ≥ 40% with an overlapping coverage (COV) ≥ 90%. Separating the connected components yields 913 762 clusters. A cluster may contain from 2 up to 87 893 sequences (the largest cluster contains ABC transporters from Prokaryotes, Eukaryotes and Archaea). Stand-alone sequences are called Singletons (30.4% of the total protein universe). Sequences inherit the annotations present within their cluster. When a cluster is endowed with one or more PDB templates, a Cluster-HMM is generated from all the sequences that have SI ≥ 40% and COV ≥ 90% with the structure(s) (pink subset). The Cluster-HMM can be used to align all the other sequences in the cluster to the template(s).
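The network construction described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that pairwise BLAST results are already available as `(id_a, id_b, identity_pct, coverage_pct)` tuples; that tuple layout, the function name `cluster_sequences` and the toy data are hypothetical and are not part of BAR+. Connected components are found here with a union-find structure, which is one common way to do it and not necessarily the authors' implementation.

```python
from collections import defaultdict

SI_MIN = 40.0   # minimum sequence identity (%), threshold from the caption
COV_MIN = 90.0  # minimum alignment coverage (%), threshold from the caption


def cluster_sequences(seq_ids, hits):
    """Split a similarity network into clusters (>= 2 sequences) and singletons.

    hits: iterable of (id_a, id_b, identity_pct, coverage_pct) tuples
          (hypothetical layout, assumed for this sketch).
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, si, cov in hits:
        # An edge exists only if both thresholds are met.
        if si >= SI_MIN and cov >= COV_MIN:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for s in seq_ids:
        groups[find(s)].add(s)

    clusters = [g for g in groups.values() if len(g) >= 2]
    singletons = sorted(next(iter(g)) for g in groups.values() if len(g) == 1)
    return clusters, singletons


# Toy data, for illustration only.
seq_ids = ["A", "B", "C", "D", "E"]
hits = [
    ("A", "B", 55.0, 95.0),  # edge
    ("B", "C", 41.0, 90.0),  # edge (thresholds are inclusive)
    ("C", "D", 80.0, 60.0),  # no edge: coverage too low
    ("D", "E", 35.0, 99.0),  # no edge: identity too low
]
clusters, singletons = cluster_sequences(seq_ids, hits)
print(sorted(sorted(c) for c in clusters))  # [['A', 'B', 'C']]
print(singletons)                           # ['D', 'E']
```

Note that A and C end up in the same cluster although no direct A-C edge exists: membership is transitive through B, which is how a single connected component can grow to tens of thousands of sequences.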
Figure 2. Different types of annotations are possible with BAR+. After clustering, and depending on the features (structure, domains and function) annotated in the cluster, sequences within a cluster can inherit different types of annotation. The percentage of sequences endowed with a given annotation type and inheriting validated annotation (P < 0.01) is indicated. (A) Sequences within clusters. Percentages are computed with respect to the 9 401 223 sequences comprised in 913 762 clusters. Inherited: sequences that inherit annotations by falling into a cluster. Without validated annotation: the slice comprises sequences with no annotation and sequences with non-validated annotations. (B) Singletons (stand-alone sequences). Percentages are computed with respect to 4 091 908 singleton sequences.
The fine-grained types of annotation with BAR+
| GO annotation | Pfam annotation | Set | PDB (%) | SCOP Mono (%) | SCOP Multi (%) | Without PDB (%) |
|---|---|---|---|---|---|---|
| GO validated | Pfam validated | Clusters | 8251 (0.90) | 3613 (0.40) | 1461 (0.16) | 83 266 (9.11) |
| GO validated | Pfam validated | Sequences | 2 982 449 (22.10) | 1 408 542 (10.44) | 1 028 565 (7.62) | 2 903 431 (21.51) |
| GO validated | Pfam | Clusters | 8334 (0.91) | 3647 (0.40) | 1463 (0.16) | 85 886 (9.40) |
| GO validated | Pfam | Sequences | 2 984 057 (22.11) | 1 409 647 (10.45) | 1 028 569 (7.62) | 2 922 876 (21.66) |
| GO validated | Without Pfam | Clusters | 320 (0.04) | 123 (0.01) | 25 | 6251 (0.68) |
| GO validated | Without Pfam | Sequences | 42 202 (0.31) | 15 415 (0.11) | 7363 (0.05) | 143 533 (1.06) |
| GO | Pfam validated | Clusters | 8938 (0.98) | 3887 (0.43) | 1504 (0.16) | 133 895 (14.65) |
| GO | Pfam validated | Sequences | 3 042 649 (22.55) | 1 450 437 (10.75) | 1 029 707 (7.63) | 3 311 421 (24.54) |
| GO | Pfam | Clusters | 9357 (1.02) | 4033 (0.44) | 1526 (0.17) | 322 937 (35.34) |
| GO | Pfam | Sequences | 3 045 465 (22.57) | 1 451 928 (10.76) | 1 029 755 (7.63) | 3 739 076 (27.71) |
| GO | Pfam | Singletons | 2608 (0.02) | 10 | 5 | 1 515 720 (11.23) |
| GO | Without Pfam | Clusters | 452 (0.05) | 176 (0.02) | 30 | 45 539 (4.98) |
| GO | Without Pfam | Sequences | 46 311 (0.34) | 17 020 (0.13) | 7400 (0.05) | 330 354 (2.45) |
| GO | Without Pfam | Singletons | 279 | 2 | 2 | 129 212 (0.96) |
| Without GO | Pfam validated | Clusters | 679 (0.07) | 345 (0.04) | 15 | 54 314 (5.94) |
| Without GO | Pfam validated | Sequences | 44 172 (0.33) | 27 775 (0.21) | 654 | 547 459 (4.06) |
| Without GO | Pfam | Clusters | 779 (0.09) | 377 (0.04) | 16 | 122 236 (13.38) |
| Without GO | Pfam | Sequences | 44 582 (0.33) | 27 983 (0.21) | 656 | 695 684 (5.15) |
| Without GO | Pfam | Singletons | 205 | | | 702 834 (5.21) |
| Without GO | Without Pfam | Clusters | 270 (0.03) | 83 (0.01) | 5 | 412 192 (45.11) |
| Without GO | Without Pfam | Sequences | 5308 (0.04) | 1771 (0.01) | 154 | 1 494 443 (11.07) |
| Without GO | Without Pfam | Singletons | 129 | 1 | 0 | 1 743 526 (12.92) |
Percentages are evaluated with respect to the total number of sequences in the database (13 495 736 sequences). Bold character: sequences that inherit the annotation type.
a: Values are negligible. Validated: P ≤ 0.01 (see text for details, 11). Within BAR+ clusters, 35 different types of annotations are possible: (i) +GO+Pfam+PDB [with or without SCOP (Monodomain, Multidomain)*]; GO and Pfam are or are not validated (no. of levels = 12). (ii) +Pfam+PDB (with or without SCOP)* (no. of levels = 6). (iii) +GO+PDB (with or without SCOP)* (no. of levels = 6). (iv) +Pfam+GO (no. of levels = 4). (v) +PDB (with or without SCOP)* (no. of levels = 3). (vi) +GO (no. of levels = 2). (vii) +Pfam (no. of levels = 2). Seventy percent of the initial set falls into clusters (913 762) and 53% into validated clusters. Some 6% of the sequences are annotated without validation and the remaining 11% are not annotated (rightmost bottom cell). About 17% and 13% of the sequences are singletons with and without annotations, respectively.
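The counts in the footnote can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the table, its footnote and the Figure 2 caption; the helper name `pct` and the variable names are introduced here for illustration.

```python
TOTAL = 13_495_736     # sequences in the database (table footnote)
CLUSTERED = 9_401_223  # sequences comprised in clusters (Figure 2 caption)

# Number of levels for annotation groups (i) to (vii) in the footnote.
levels = [12, 6, 6, 4, 3, 2, 2]
print(sum(levels))  # 35


def pct(count):
    """Percentage of the whole database, rounded as in the table."""
    return round(100 * count / TOTAL, 2)


# Sequences with validated GO and Pfam that also have a PDB template.
print(pct(2_982_449))  # 22.1

# Share of sequences that fall into clusters ("seventy percent").
print(round(100 * CLUSTERED / TOTAL))  # 70

# Validated + annotated without validation + not annotated = clustered share.
print(53 + 6 + 11)  # 70
# Singletons with and without annotations.
print(17 + 13)      # 30
```

The two rounded shares sum to 100%, so clustered sequences and singletons together account for the whole set.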
Figure 3. BAR+ at work. A query sequence has been submitted. Provided that, after running BLAST, the sequence has SI ≥ 40% with COV ≥ 90% to any sequence of BAR+, it is included in a cluster. In the above example, the cluster is well annotated and the sequence inherits all the possible annotations from the cluster, including GO terms (203), PDB structures, ligands, SCOP and Pfam annotations, and the Cluster-HMM. Furthermore, the alignment(s) of the query sequence to the cluster template(s), computed with the Cluster-HMM, are provided in PIR format. All the sequences that align with the query are returned. (•••) Only the top and bottom portions of the page are shown.
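The query workflow in the Figure 3 caption can be sketched as a threshold test followed by annotation inheritance. Everything below is a simplified illustration, not the server's code: the function `annotate_query`, the dictionary layout and the identifiers are hypothetical placeholders, and choosing the passing hit with the highest identity when several hits qualify is an assumption made for this sketch.

```python
def annotate_query(blast_hits, seq_to_cluster, clusters, si_min=40.0, cov_min=90.0):
    """Return the annotations a query inherits, or None if no hit qualifies.

    blast_hits:     list of (subject_id, identity_pct, coverage_pct)
    seq_to_cluster: maps a database sequence id to its cluster id
    clusters:       maps a cluster id to its sets of annotations
    (All three layouts are assumed for this sketch.)
    """
    passing = [
        (subject, si, cov)
        for subject, si, cov in blast_hits
        if si >= si_min and cov >= cov_min and subject in seq_to_cluster
    ]
    if not passing:
        return None  # the query does not enter any cluster

    # Assumption: the hit with the highest identity decides the cluster.
    subject, _, _ = max(passing, key=lambda hit: hit[1])
    cluster_id = seq_to_cluster[subject]
    annotations = clusters[cluster_id]
    return {
        "cluster": cluster_id,
        "go": sorted(annotations["go"]),
        "pfam": sorted(annotations["pfam"]),
        "pdb": sorted(annotations["pdb"]),
    }


# Placeholder identifiers, for illustration only.
seq_to_cluster = {"s1": "c1", "s2": "c1"}
clusters = {"c1": {"go": {"go_term_1"}, "pfam": {"pfam_1"}, "pdb": {"pdb_1"}}}

result = annotate_query(
    [("s1", 62.0, 97.0), ("s2", 45.0, 91.0)], seq_to_cluster, clusters
)
print(result)
print(annotate_query([("s1", 39.9, 100.0)], seq_to_cluster, clusters))  # None
```

A query that fails either threshold against every database sequence gets no inherited annotation, which corresponds to the singleton case.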