Hugo Y K Lam, Ekta Khurana, Gang Fang, Philip Cayting, Nicholas Carriero, Kei-Hoi Cheung, Mark B Gerstein.
Abstract
Pseudofam (http://pseudofam.pseudogene.org) is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125,000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family, and the pseudogenes in a family are then aligned to one another by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.
Year: 2008 PMID: 18957444 PMCID: PMC2686518 DOI: 10.1093/nar/gkn758
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The generation of Pseudofam. (1) Identify pseudogenes by homology to the existing proteins of the genome. (2) Map all the parent proteins to their protein families. (3) Assign the identified pseudogenes to their parent protein families. (4) Align the pseudogenes in each family to build the pseudogene families. (5) Calculate the key statistics for the families and organize the data into the Pseudofam database.
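Steps (2) and (3) of Figure 1 amount to joining two mappings: pseudogene to parent protein, and parent protein to protein family. The following is a minimal sketch of that join, not the Pseudofam implementation; the function name and all identifiers (`pg1`, `protA`, `PF_X`, and so on) are hypothetical, and it assumes that a parent protein may belong to several families and that pseudogenes whose parent has no family are left unassigned.

```python
from collections import defaultdict


def assign_pseudogenes(parent_of, families_of):
    """Group pseudogenes by the protein families of their parent proteins.

    parent_of:   dict mapping pseudogene ID -> parent protein ID (step 1 output)
    families_of: dict mapping protein ID -> list of family IDs (step 2 output)
    Returns a dict mapping family ID -> list of pseudogene IDs (step 3).
    """
    families = defaultdict(list)
    for pseudogene, parent in parent_of.items():
        # Assumption: a parent without any family leaves the pseudogene unassigned.
        for family in families_of.get(parent, ()):
            families[family].append(pseudogene)
    return dict(families)


# Hypothetical toy input; these are not real Pseudofam or Pfam identifiers.
parent_of = {"pg1": "protA", "pg2": "protA", "pg3": "protB", "pg4": "protC"}
families_of = {"protA": ["PF_X"], "protB": ["PF_X", "PF_Y"]}

pseudofam = assign_pseudogenes(parent_of, families_of)
for family, members in pseudofam.items():
    print(family, members)
```

In this toy input, `pg4` is dropped because its parent `protC` maps to no family, and `pg3` appears in two families because its parent does.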
Figure 2. The alignment of a pseudogene family. Each pseudogene in a family is first aligned to its parent protein. Then, the pseudogene is aligned with the parent protein domain by transferring the corresponding alignment from the Pfam multiple alignments. Finally, all the aligned pseudogene domains, together with their aligned parent protein domains, are adjusted together to generate the final alignment.
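The transfer step in Figure 2 can be illustrated by projecting a pairwise alignment onto the columns of the family alignment. The sketch below is a simplified illustration under stated assumptions, not the Pseudofam algorithm: the function name is hypothetical, `-` is assumed to be the only gap character, the pairwise alignment is assumed to cover exactly the parent's domain residues, and pseudogene insertions relative to the parent are simply dropped rather than handled in a later adjustment step.

```python
def transfer_alignment(parent_pair, pseudo_pair, parent_msa_row, gap="-"):
    """Place a pseudogene into the family alignment columns of its parent.

    parent_pair, pseudo_pair: the two gapped rows of the pairwise alignment.
    parent_msa_row: the parent's gapped row in the family multiple alignment.
    Returns the pseudogene as a row with the same length as parent_msa_row.
    """
    if len(parent_pair) != len(pseudo_pair):
        raise ValueError("pairwise alignment rows must have equal length")

    # Keep one pseudogene symbol per parent residue; columns where the parent
    # has a gap (pseudogene insertions) are dropped in this simplified sketch.
    aligned = [q for p, q in zip(parent_pair, pseudo_pair) if p != gap]

    residues_in_msa = sum(1 for c in parent_msa_row if c != gap)
    if len(aligned) != residues_in_msa:
        raise ValueError("parent residues differ between the two alignments")

    # Walk the parent's family row: copy gaps, substitute pseudogene symbols.
    symbols = iter(aligned)
    return "".join(gap if c == gap else next(symbols) for c in parent_msa_row)


# Hypothetical toy sequences, chosen only to show the column bookkeeping.
row = transfer_alignment("MKT-LV", "MR-ALV", "M-KT--LV")
print(row)
```

Because every pseudogene row ends up in the coordinate system of the family alignment, rows from different parents in the same family become directly comparable column by column.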
Figure 3. The pseudogene family ontology. An upper ontology that describes the various relationships between a pseudogene family and other genomic elements. The solid lines represent direct relationships and the dashed lines represent inferred or indirect relationships. The core part is represented in blue, while the well-established relationships are in dark gray and the secondary aspects of a pseudogene family are in light gray. For detailed concepts and relationships concerning pseudogenes, see Supplementary Figure S1.
Numbers of protein and pseudogene families in different species out of 9318 PfamA families
| Species | Protein families | Pseudogene families | Pseudogenized (%) |
|---|---|---|---|
| Homo sapiens (HS) | 3486 | 1790 | 51.35 |
| Pan troglodytes (PT) | 3443 | 1906 | 55.36 |
| Canis familiaris (CF) | 3151 | 1529 | 48.52 |
| Mus musculus (MM) | 3461 | 1654 | 47.79 |
| Rattus norvegicus (RN) | 3138 | 1489 | 47.45 |
| Anopheles gambiae (AG) | 2715 | 570 | 20.99 |
| Gallus gallus (GG) | 2911 | 860 | 29.54 |
| Drosophila melanogaster (DM) | 2620 | 201 | 7.67 |
| Danio rerio (DR) | 3145 | 1125 | 35.77 |
| Caenorhabditis elegans (CE) | 2633 | 360 | 13.67 |
| Total | 3821 | 2986 | 78.15 |
The number of protein families is the number of families that have at least one protein in the species. The number of pseudogene families is the subset of those protein families that have at least one pseudogene. The pseudogenized percentage is the number of pseudogene families divided by the number of protein families.
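The percentage column can be recomputed from the two count columns as pseudogene families divided by protein families. The short sketch below does this with the counts copied from the table above; the variable and function names are illustrative only.

```python
# Counts copied from the table: label -> (protein families, pseudogene families).
FAMILY_COUNTS = {
    "HS": (3486, 1790),
    "PT": (3443, 1906),
    "CF": (3151, 1529),
    "MM": (3461, 1654),
    "RN": (3138, 1489),
    "AG": (2715, 570),
    "GG": (2911, 860),
    "DM": (2620, 201),
    "DR": (3145, 1125),
    "CE": (2633, 360),
    "Total": (3821, 2986),
}
PFAM_A_FAMILIES = 9318  # total PfamA families stated in the table caption


def pseudogenized_percent(protein_families, pseudogene_families):
    """Share of protein families that contain at least one pseudogene."""
    if protein_families <= 0:
        raise ValueError("protein_families must be positive")
    return 100.0 * pseudogene_families / protein_families


for label, (proteins, pseudogenes) in FAMILY_COUNTS.items():
    print(f"{label}: {pseudogenized_percent(proteins, pseudogenes):.2f}%")

# Share of all PfamA families that have at least one pseudogene in any species.
pfam_share = 100.0 * FAMILY_COUNTS["Total"][1] / PFAM_A_FAMILIES
print(f"Share of PfamA: {pfam_share:.2f}%")
```

The last figure comes to about 32%, which matches the abstract's statement that the pseudogene families cover approximately one-third of PfamA.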
Spearman's rank correlation of protein family sizes (upper-right triangle) and pseudogene family sizes (lower-left triangle) between different species
| | HS | PT | CF | MM | RN |
|---|---|---|---|---|---|
| HS | - | 0.92 | 0.77 | 0.84 | 0.75 |
| PT | 0.89 | - | 0.79 | 0.84 | 0.77 |
| CF | 0.60 | 0.62 | - | 0.78 | 0.85 |
| MM | 0.58 | 0.60 | 0.57 | - | 0.80 |
| RN | 0.57 | 0.59 | 0.59 | 0.67 | - |
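For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration of that definition; the family sizes in the usage example are invented for demonstration and are not taken from Pseudofam, and how the original analysis treated families absent from one species is not stated here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sizes of the same five families in two species (made-up data).
sizes_species_a = [120, 45, 45, 8, 300]
sizes_species_b = [100, 50, 30, 10, 280]
print(f"rho = {spearman(sizes_species_a, sizes_species_b):.3f}")
```

Because only ranks enter the calculation, a few very large families do not dominate the coefficient, which makes the statistic a reasonable choice for family sizes that span orders of magnitude.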