Giuseppe Profiti, Pier Luigi Martelli, Rita Casadio.
Abstract
BAR 3.0 updates our server BAR (Bologna Annotation Resource) for predicting protein structural and functional features from sequence. We increase data volume, query capabilities and information conveyed to the user. The core of BAR 3.0 is a graph-based clustering procedure of UniProtKB sequences, following strict pairwise similarity criteria (sequence identity ≥40% with alignment coverage ≥90%). Each cluster contains the available annotation downloaded from UniProtKB, GO, PFAM and PDB. After statistical validation, GO terms and PFAM domains are cluster-specific and annotate new sequences that enter the cluster by satisfying the similarity constraints. BAR 3.0 includes 28 869 663 sequences in 1 361 773 clusters, of which 22.2% (22 241 661 sequences) and 47.4% (24 555 055 sequences) have at least one validated GO term and one PFAM domain, respectively. PDB structures are included in 1.4% of the clusters (36% of all sequences); each of these clusters is associated with a hidden Markov model that allows building template-target alignments suitable for structural modeling. Another 3 399 026 sequences are singletons. BAR 3.0 offers an improved search interface, allowing queries by UniProtKB accession, FASTA sequence, GO term, PFAM domain, organism, PDB and ligand(s). When evaluated on the CAFA2 targets, BAR 3.0 largely outperforms our previous version and scores among state-of-the-art methods. BAR 3.0 is publicly available and accessible at http://bar.biocomp.unibo.it/bar3.
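The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes that clusters are the connected components of a graph whose edges join sequence pairs meeting both thresholds; the abstract only says the procedure is graph-based, so the connected-components reading, the function name `cluster_sequences` and the input tuple layout are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

MIN_IDENTITY = 0.40  # sequence identity threshold quoted in the abstract
MIN_COVERAGE = 0.90  # alignment coverage threshold quoted in the abstract


def cluster_sequences(ids, alignments):
    """Group sequence ids into clusters and singletons.

    `alignments` is a hypothetical iterable of (id_a, id_b, identity, coverage)
    tuples, with identity and coverage expressed as fractions in [0, 1].
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in alignments:
        # An edge exists only when BOTH criteria are satisfied.
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for seq_id in ids:
        groups[find(seq_id)].add(seq_id)

    clusters = [members for members in groups.values() if len(members) > 1]
    singletons = [next(iter(m)) for m in groups.values() if len(m) == 1]
    return clusters, singletons
```

With this reading, a sequence that has no partner satisfying both thresholds stays outside every cluster and is reported as a singleton, which matches how the abstract separates clustered sequences from singletons.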
Year: 2017 PMID: 28453653 PMCID: PMC5570247 DOI: 10.1093/nar/gkx330
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Distribution of sequences in clusters and singletons with their annotations
| | In clusters | In singletons |
|---|---|---|
| All sequences | 28 869 663 | 3 399 026 |
| From SwissProt | 519 015 | 17 478 |
| From TrEMBL | 28 350 648 | 3 381 548 |
| | 82 672 | 6 092 |
| From SwissProt | 57 391 | 3 684 |
| From TrEMBL | 25 281 | 2 408 |
| | 20 556 103 | 1 506 125 |
| From SwissProt | 494 047 | 14 277 |
| From TrEMBL | 20 062 056 | 1 491 848 |
| | 23 263 014 | 1 509 339 |
| From SwissProt | 487 946 | 12 111 |
| From TrEMBL | 22 775 068 | 1 497 228 |
| | 35 660 | 1 185 |
As defined by the Gene Ontology Consortium, experimental GO terms are those associated with the evidence codes EXP, IDA, IPI, IMP, IGI and IEP (http://geneontology.org/page/guide-go-evidence-codes).
Statistics of inherited annotations in BAR 3.0
| | # Clusters | # Sequences included | # Sequences inheriting new annotations |
|---|---|---|---|
| Total number of clusters | 1 361 773 | | |
| With any validated annotation | 674 463 | 25 448 877 | 16 430 135 |
| With validated GO terms | 302 159 | 22 241 661 | 15 938 828 |
| With validated PFAM | 645 502 | 24 555 055 | 16 105 082 |
| With at least one PDB | 19 015 | 11 653 046 | 11 626 119 |
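The percentages quoted in the abstract can be re-derived from the counts in the two tables above. The short check below is only arithmetic on those published counts; the variable names are illustrative. It assumes that "36% of all sequences" uses clustered plus singleton sequences as the denominator, which is the reading that reproduces the quoted figure (clustered sequences alone would give roughly 40%).

```python
# Counts copied from the two tables above.
TOTAL_CLUSTERS = 1_361_773
CLUSTERED_SEQUENCES = 28_869_663
SINGLETON_SEQUENCES = 3_399_026
CLUSTERS_WITH_GO = 302_159
CLUSTERS_WITH_PFAM = 645_502
CLUSTERS_WITH_PDB = 19_015
SEQUENCES_IN_PDB_CLUSTERS = 11_653_046


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Assumption: "all sequences" = clustered sequences + singletons.
all_sequences = CLUSTERED_SEQUENCES + SINGLETON_SEQUENCES

go_cluster_pct = percent(CLUSTERS_WITH_GO, TOTAL_CLUSTERS)
pfam_cluster_pct = percent(CLUSTERS_WITH_PFAM, TOTAL_CLUSTERS)
pdb_cluster_pct = percent(CLUSTERS_WITH_PDB, TOTAL_CLUSTERS)
pdb_sequence_pct = percent(SEQUENCES_IN_PDB_CLUSTERS, all_sequences)

print(f"Clusters with validated GO terms:  {go_cluster_pct:.1f}%")
print(f"Clusters with validated PFAM:      {pfam_cluster_pct:.1f}%")
print(f"Clusters with at least one PDB:    {pdb_cluster_pct:.1f}%")
print(f"Sequences in clusters with a PDB:  {pdb_sequence_pct:.0f}%")
```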
Figure 1. Distribution of statistically validated annotations among sequences in BAR 3.0. (A) Percentage of clustered sequences per statistically validated annotation type. The first value is the percentage of sequences falling in clusters with that annotation; the second value is the percentage of sequences inheriting at least one annotation they did not have in UniProt. (B) Percentage of singleton sequences by annotation type.
Figure 2. BAR 3.0 benchmarked against the three best performing methods in CAFA2. Except for BAR 3.0, values are as reported in the assessment (8). The F1-score is the harmonic mean of precision and recall, where precision is the ratio of correct annotations to all predicted annotations and recall is the ratio of correct annotations to the true annotations. The other methods shown are MS-knn (21), EVEX (22) and the one from the Paccanaro Lab. Dashed bars show the upper limits of performance when exact values are not available. The CAFA2 paper (8) does not list the exact performance of the Paccanaro Lab method in the CC and BP sub-ontologies or of EVEX in the MF sub-ontology, since these methods did not rank among the best 10. The performance reported with dashed bars corresponds to the 10th-ranked method in the corresponding sub-ontology.
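The F1-score definition in the caption of Figure 2 can be made concrete with a small sketch. It computes precision, recall and their harmonic mean for a single protein, treating predicted and true annotations as plain sets of GO identifiers. The function name and the example GO terms are illustrative assumptions; the CAFA2 assessment additionally aggregates over targets and prediction-score thresholds, which this sketch does not attempt.

```python
def f1_score(predicted, true):
    """Harmonic mean of precision and recall for two sets of annotations."""
    predicted, true = set(predicted), set(true)
    correct = len(predicted & true)

    # Precision: correct annotations over all predicted annotations.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: correct annotations over all true annotations.
    recall = correct / len(true) if true else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predicted terms are among the 4 true terms.
predicted_terms = {"GO:0003677", "GO:0005634", "GO:0006355"}
true_terms = {"GO:0003677", "GO:0006355", "GO:0005515", "GO:0046872"}
example_f1 = f1_score(predicted_terms, true_terms)
```

In the example, precision is 2/3 and recall is 2/4, so the F1-score is 4/7, about 0.57.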