Matt E Oates, Jonathan Stahlhacke, Dimitrios V Vavoulis, Ben Smithers, Owen J L Rackham, Adam J Sardar, Jan Zaucha, Natalie Thurlby, Hai Fang, Julian Gough.
Abstract
We present updates to the SUPERFAMILY 1.75 (http://supfam.org) online resource and protein sequence collection. The hidden Markov model library that provides sequence homology to SCOP structural domains remains unchanged at version 1.75. In the last 4 years SUPERFAMILY has more than doubled its holding of curated complete proteomes over all cellular life, from 1400 proteomes reported previously in 2010 up to 3258 at present. Outside of the main sequence collection, SUPERFAMILY continues to provide domain annotation for sequences provided by other resources such as UniProt, Ensembl, PDB, much of JGI Phytozome and selected subcollections of NCBI RefSeq. Despite this growth in data volume, SUPERFAMILY now provides users with an expanded and daily updated phylogenetic tree of life (sTOL). This tree is built with genomic-scale domain annotation data as before, but is now updated whenever new species are introduced to the sequence library. Our Gene Ontology and other functional and phenotypic annotations previously reported have stood up to critical assessment by the function prediction community. We have now introduced these data in an integrated manner online at the level of an individual sequence, and, in the case of whole genomes, with enrichment analysis against a taxonomically defined background.
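The abstract does not say which statistical test underlies the enrichment analysis. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for term enrichment, not confirmed as the SUPERFAMILY method), the probability of seeing at least `k` annotated domains in a genome can be computed as below. The function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: items in the background (e.g. domains in a taxonomic background)
    K: background items carrying the annotation of interest
    n: items in the genome being tested
    k: items in the genome carrying the annotation
    """
    upper = min(n, K)
    # Sum the probability mass from k up to the largest possible overlap.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented toy numbers: 8 of 20 genome domains carry a term
# that 50 of 1000 background domains carry.
p_value = hypergeom_enrichment_p(k=8, n=20, K=50, N=1000)
```

A small p-value here would suggest that the term is over-represented in the genome relative to the chosen background. In practice a multiple-testing correction would also be needed across terms.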
Year: 2014 PMID: 25414345 PMCID: PMC4383889 DOI: 10.1093/nar/gku1041
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
A summary of the proteomes included in SUPERFAMILY 1.75 comparing the currently available database with the initial release
| | SUPERFAMILY 1.75 in 2010 | SUPERFAMILY 1.75 in 2014 |
|---|---|---|
| Eukaryota | 341 | 498 |
| Archaea | 87 | 165 |
| Bacteria | 1177 | 2595 |
| Viruses | NA | 5239 |
| Plasmids | 2354 | 4573 |
| Metagenomes | 118 | 121 |
| UniProt (Complete Proteomes) | NA | 5255 |
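The abstract's figure of 3258 proteomes matches the sum of the three cellular domains in the 2014 column. The short check below reproduces that arithmetic using only counts copied from the table; the variable names are illustrative. The 2010 column sums to 1605, which differs from the 1400 quoted in the abstract, and the record does not explain the difference.

```python
# Proteome counts copied from the table above (cellular life only).
counts_2010 = {"Eukaryota": 341, "Archaea": 87, "Bacteria": 1177}
counts_2014 = {"Eukaryota": 498, "Archaea": 165, "Bacteria": 2595}

total_2010 = sum(counts_2010.values())
total_2014 = sum(counts_2014.values())

# Growth of each domain between the two snapshots, as a ratio.
fold_change = {name: counts_2014[name] / counts_2010[name] for name in counts_2010}

print(total_2010, total_2014)
for name, ratio in fold_change.items():
    print(f"{name}: {ratio:.2f}x")
```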
A summary of the HMM sequence coverage in SUPERFAMILY 1.75 comparing the currently available database with the initial release
| | Proteins with assignments (%), 2010 | Proteins with assignments (%), 2014 | Amino acid coverage (%), 2010 | Amino acid coverage (%), 2014 |
|---|---|---|---|---|
| Eukarya | 59.11 | 56.8 | 38.9 | 39.91 |
| Archaea | 65.13 | 62.9 | 61.67 | 60.5 |
| Bacteria | 68.08 | 67.27 | 63.4 | 62.97 |
| Viruses | NA | 63.0 | NA | 25.52 |
| Plasmids | 47.0 | 48.79 | 47.0 | 48.22 |
| Metagenomes | 51.47 | 57.67 | 54.1 | 60.48 |
| Protein Data Bank | NA | 89.94 | NA | 89.11 |
| UniProt (All Proteins) | 64.0 | 64.73 | 56.0 | 58.78 |
Figure 1. Summary of all genome updates and additions at the level of taxonomic Class since the release of SUPERFAMILY 1.75. Eukarya in red, Archaea in green and Bacteria in blue. The size of each pie chart is log scaled based on the number of proteomes within each Class. Light colouration is the proportion of genomes that have been added to the database within a Class, and the dark colouration represents updated genomes. The grey colouring seen in Eukarya represents the relatively few genomes that have not been altered since the release of 1.75.
Figure 2. This Venn diagram demonstrates the extent to which the sequence space of the SUPERFAMILY proteome collection is not covered by the PDB and UniProt. Each value in the diagram describes the number of distinct (collapsed to 100% sequence identity) amino acid sequences in each sequence collection.
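The counting behind such a diagram can be sketched with plain set operations. This sketch assumes that "collapsed to 100% sequence identity" means exact, full-length string equality. The function name and the toy sequences are invented for illustration and are not taken from SUPERFAMILY.

```python
def venn_counts(superfamily, pdb, uniprot):
    """Count distinct sequences in each of the seven regions of a three-set Venn diagram."""
    # Converting to sets collapses identical sequences within each collection.
    s, p, u = set(superfamily), set(pdb), set(uniprot)
    return {
        "SUPERFAMILY only": len(s - p - u),
        "PDB only": len(p - s - u),
        "UniProt only": len(u - s - p),
        "SUPERFAMILY & PDB only": len((s & p) - u),
        "SUPERFAMILY & UniProt only": len((s & u) - p),
        "PDB & UniProt only": len((p & u) - s),
        "all three": len(s & p & u),
    }


# Toy sequences, invented for illustration; "MKV" appears twice on purpose.
toy_superfamily = ["MKV", "MKV", "MAA", "MTT"]
toy_pdb = ["MAA", "MGG"]
toy_uniprot = ["MAA", "MTT", "MCC"]

regions = venn_counts(toy_superfamily, toy_pdb, toy_uniprot)
```

The "SUPERFAMILY only" region corresponds to the sequence space that the caption describes as not covered by the PDB or UniProt.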
Figure 3. How to create your own phylogenetic trees that are built daily against the most recent updates to the SUPERFAMILY sequence collection. In this example the family Hominidae has been selected from the table, and links to phylogenetic resources provided by the sTOL method are given at the top of the page. A user may also select individual species of interest and create trees annotated by domain inclusion directly from domain summary pages.
Figure 4. Viewing the ancestral domain content for the last common ancestor of all Metazoa, linked from the summary of domain assignments for Homo sapiens. From the main SUPERFAMILY assignments page for a proteome (accessible from the Taxonomy page under Browse on the side menu), a user can view reconstructed ancestral states for any common ancestor, as long as the clade has sufficient whole proteome data.