Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L L Sonnhammer, John Tate, Marco Punta.
Abstract
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
Year: 2013 PMID: 24288371 PMCID: PMC3965110 DOI: 10.1093/nar/gkt1223
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The reduction in size of RP versus full alignments
| Family identifier (accession) | Seed | Full | RP75 | RP55 | RP35 | RP15 |
|---|---|---|---|---|---|---|
| ABC_tran (PF00005) | 55 | 363 409 | 26% (93 265) | 21% (77 150) | 16% (57 358) | 8% (28 903) |
| COX1 (PF00115) | 94 | 254 351 | 1% (2006) | 0.7% (1661) | 0.4% (1218) | 0.2% (538) |
| zf-H2C2_2 (PF13465) | 163 | 227 898 | 61% (138 033) | 27% (60 664) | 15% (34 039) | 9% (21 562) |
| WD40 (PF00400) | 1804 | 193 252 | 65% (125 805) | 52% (100 531) | 36% (69 386) | 23% (21 562) |
| MFS_1 (PF07690) | 195 | 181 668 | 30% (55 719) | 25% (55 719) | 17% (55 719) | 8% (55 719) |
| RVT_1 (PF00078) | 152 | 172 360 | 5% (8257) | 4% (6662) | 3% (5373) | 2% (3604) |
| BPD_transp_1 (PF00528) | 81 | 156 339 | 23% (36 523) | 19% (29 422) | 14% (22 134) | 7% (10 630) |
| Response_reg (PF00072) | 57 | 151 337 | 29% (44 329) | 25% (37 848) | 20% (29 453) | 10% (15 208) |
| GP120 (PF00516) | 24 | 146 453 | N/A | N/A | N/A | N/A |
| HATPase_c (PF02518) | 659 | 129 386 | 28% (36 085) | 24% (30 935) | 19% (24 121) | 10% (12 473) |
The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on representative proteomes (RPs), which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the size of the RP alignment is shown as a percentage of the full alignment, with the number of sequences given in brackets.
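The percentages in the table can be reproduced from the bracketed sequence counts: each one is the RP alignment size divided by the full alignment size. The sketch below does this for the ABC_tran (PF00005) row. The function name `percent_of_full` is hypothetical and introduced only for illustration, and rounding to the nearest whole percent is an assumption about how the table values were derived.

```python
def percent_of_full(rp_count: int, full_count: int) -> float:
    """Return the RP alignment size as a percentage of the full alignment."""
    if full_count <= 0:
        raise ValueError("full_count must be positive")
    return 100.0 * rp_count / full_count


# Counts copied from the ABC_tran (PF00005) row of the table.
full = 363_409
rp_counts = {"RP75": 93_265, "RP55": 77_150, "RP35": 57_358, "RP15": 28_903}

# Round to whole percentages (an assumption about the table's rounding).
percentages = {
    name: round(percent_of_full(count, full))
    for name, count in rp_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}% of the full alignment")
```

For this row the computed values are 26, 21, 16 and 8, matching the percentages given in the table.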
Figure 1. Table from the ‘Alignments’ tab of the family page for COX1 (PF00115), showing the availability of different views and different alignments for COX1. The posterior probability-based alignment is only available for the full alignments as it is derived from the alignment of a sequence to the HMM, as indicated by the subscript 1 in the corresponding seed alignment cell.
Figure 2. Results from searching Pfam with the Hepatitis B virus isolate G376-7, complete genome (GenBank accession AF384371.1), providing a striking example of overlapping genes. The six reading frames are displayed graphically in the top box of the results page. All three reading frames from the positive strand contain matches to Pfam-A, which are tabulated below. The positions of stop codons are indicated by the square lollipops. The results are shown with the ‘protein’ coordinates of the open reading frame, but it is also possible to toggle this to DNA sequence coordinates. This search tool accepts sequences up to 80 000 nucleotides in length, and searches the Pfam-A HMM library using the gathering threshold.
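The six reading frames referred to in the Figure 2 caption can be illustrated with a minimal six-frame translation sketch. This is not the Pfam implementation: it assumes the standard genetic code, the frame labels (`+1` to `-3`) and all function names are hypothetical, and the subsequent search of each translated frame against the Pfam-A HMM library is not shown. Only the 80 000 nucleotide limit is taken from the caption.

```python
from itertools import product

MAX_LENGTH = 80_000  # sequence length limit stated in the caption

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    if len(dna) > MAX_LENGTH:
        raise ValueError(f"sequence exceeds {MAX_LENGTH} nucleotides")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAAGG")
for label, protein in frames.items():
    # Positions of '*' correspond to the stop codons drawn on each frame.
    stops = [i for i, aa in enumerate(protein) if aa == "*"]
    print(label, protein, "stop codons at", stops)
```

In this toy input, frame `+1` translates to `MA*` and frame `-1` to `P*A`, so each of those frames contains one stop codon.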
Breakdown of contextual hits that are reported by Pfam entries in Pfam 27.0, according to the protein family type
| Entry type | % Context regions reported in Pfam 27.0 | % Context regions not reported in Pfam 27.0 |
|---|---|---|
| Family | 4 | 7 |
| Domain | 13 | 13 |
| Motif | <1 | 2 |
| Repeat | 20 | 41 |
| All | 37 | 63 |
Each percentage is expressed relative to all 10 559 contextual domains, with the totals across all entry types shown in the bottom row of the table.
Figure 3. Graphical representation of the Pfam sequence annotations for the human tyrosine-protein kinase ABL1 sequence (UniProtKB accession P00519). This sequence matches four different Pfam-A entries: SH3_1 (PF00018), SH2 (PF00017), Pkinase_Tyr (PF07714) and F_actin_bind (PF08919). Between the Pkinase_Tyr and F_actin_bind families is a long region of disorder, indicated by the presence of the grey boxes on the sequence. A disorder prediction does not necessarily mean that the sequence is not conserved, as highlighted by the presence of an overlapping Pfam-B region (striped box).