Mikita Suyama, Eoghan Harrington, Peer Bork, David Torrents.
Abstract
The identification and classification of genes and pseudogenes in duplicated regions remains a challenge for standard automated genome annotation procedures. Using an integrated homology and orthology analysis independent of current gene annotation, we have identified 9,484 and 9,017 gene duplicates in human and mouse, respectively. On the basis of the integrity of their coding regions, we have classified them into functional and inactive duplicates, allowing us to define the first consistent and comprehensive collection of 1,811 human and 1,581 mouse unprocessed pseudogenes. Furthermore, of the total of 14,172 human and mouse duplicates predicted to be functional genes, as many as 420 are not included in current reference gene databases and therefore correspond to likely novel mammalian genes. Some of these correspond to partial duplicates with less than half of the length of the original source genes, yet they are conserved and syntenic among different mammalian lineages. The genes and unprocessed pseudogenes obtained here will enable further studies on the mechanisms involved in gene duplication as well as on the fate of duplicated genes.
Year: 2006 PMID: 16846249 PMCID: PMC1484586 DOI: 10.1371/journal.pcbi.0020076
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Assessment of the Identification and Classification Procedures
Figure 1. Schematic Representation of the Procedure Employed to Classify the Three Major Types of Genes and Derived Sequences Identified in Human and Mouse According to Their Origin between and within Each Species
Dashed boxes denote key action steps in the procedure. See text for details.
Figure 2. Analysis of Gene Coverage between Mouse and Human Paralogs
(A) Identification of orthologous duplicated pairs. Genes are labeled with letters (same letters in human and mouse mean best reciprocal orthologs, e.g., genes “a,” “c,” and “d”). Numbers within circles in tree nodes represent gene duplication events. Dashed lines indicate orthology between human and mouse duplication nodes, which is inferred from the orthologous relations between the products of that duplication in each of the organisms.
(B) Distribution of orthologous duplication nodes in human according to the coverage of the shortest coding region relative to the longest one. The line corresponds to the exponential curve fitted to the observed data (see Materials and Methods).
(C) Distribution of the coverage of all duplicates found in human (columns), and probability of being functional according to the coverage (P_f, dashed line).
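The quantities described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a human and a mouse duplication node are orthologous when both products of the human duplication have best reciprocal orthologs that are together the products of one mouse duplication, and that coverage is the length of the shorter coding region divided by the length of the longer one. All function names, gene names, node labels, and the bin count are hypothetical.

```python
from collections import Counter


def coverage(len_a, len_b):
    """Coverage of the shorter coding region relative to the longer one."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("coding-region lengths must be positive")
    return min(len_a, len_b) / max(len_a, len_b)


def orthologous_node_pairs(human_nodes, mouse_nodes, orthologs):
    """Pair duplication nodes whose two products are both orthologous.

    human_nodes, mouse_nodes: dict of node id -> (gene1, gene2)
    orthologs: dict of human gene -> mouse gene (best reciprocal orthologs)
    Assumption: a node is represented only by its two duplication products.
    """
    # Index mouse nodes by the unordered pair of their products.
    mouse_index = {frozenset(genes): node for node, genes in mouse_nodes.items()}
    pairs = []
    for h_node, (g1, g2) in human_nodes.items():
        if g1 in orthologs and g2 in orthologs:
            key = frozenset((orthologs[g1], orthologs[g2]))
            m_node = mouse_index.get(key)
            if m_node is not None:
                pairs.append((h_node, m_node))
    return pairs


def coverage_histogram(coverages, n_bins=10):
    """Count coverage values in equal-width bins over (0, 1]."""
    # A coverage of exactly 1.0 is placed in the last bin.
    counts = Counter(min(int(c * n_bins), n_bins - 1) for c in coverages)
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical toy data: genes a, c, and d have best reciprocal orthologs.
orthologs = {"h_a": "m_a", "h_c": "m_c", "h_d": "m_d"}
human_nodes = {"H1": ("h_a", "h_c"), "H2": ("h_d", "h_e")}
mouse_nodes = {"M1": ("m_c", "m_a"), "M2": ("m_d", "m_f")}

node_pairs = orthologous_node_pairs(human_nodes, mouse_nodes, orthologs)
histogram = coverage_histogram([0.05, 0.5, 0.95, 1.0])
```

In this toy input only node H1 is paired with a mouse node (M1), because gene h_e has no ortholog. The exponential fit in panel B and the probability P_f in panel C are defined in the paper's Materials and Methods and are not reproduced here.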