Alinda Nagy, György Szláma, Eszter Szarka, Mária Trexler, László Bányai, László Patthy.
Abstract
Because the appearance of novel protein domain architectures (DA) is closely associated with biological innovations, there is growing interest in the genome-scale reconstruction of the evolutionary history of the domain architectures of multidomain proteins. Such analyses, however, usually ignore that a significant proportion of the Metazoan sequences analyzed is mispredicted and that this may seriously affect the validity of the conclusions. To estimate the contribution of errors in gene prediction to differences in DA of predicted proteins, we have used the high-quality, manually curated UniProtKB/Swiss-Prot database as a reference. For genome-scale analysis of domain architectures of predicted proteins we focused on RefSeq, EnsEMBL and NCBI's GNOMON predicted sequences of Metazoan species with completely sequenced genomes. Comparison of the DA of UniProtKB/Swiss-Prot sequences of worm, fly, zebrafish, frog, chicken, mouse, rat and orangutan with those of human Swiss-Prot entries identified relatively few cases where orthologs had different DA, although the percentage with different DA increased with evolutionary distance. In contrast, comparison of the DA of human, orangutan, rat, mouse, chicken, frog, zebrafish, worm and fly RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with those of the corresponding/orthologous human Swiss-Prot entries identified a significantly higher proportion of domain architecture differences than the comparison of Swiss-Prot entries. Analysis of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with DAs different from those of their Swiss-Prot orthologs confirmed that the higher rate of domain architecture differences is due to errors in gene prediction, the majority of which could be corrected with our FixPred protocol.
We have also demonstrated that contamination of databases with incomplete, abnormal or mispredicted sequences introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences. Here we have shown that in the case of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of Metazoan species, the contribution of gene prediction errors to domain architecture differences of orthologs is comparable to or greater than that of true gene rearrangements. We have also demonstrated that domain architecture comparison may serve as a useful tool for the quality control of gene predictions and may thus guide the correction of sequence errors. Our findings caution that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of orthologous and paralogous proteins is presented in an accompanying paper [1].
Year: 2011 PMID: 24710207 PMCID: PMC3927609 DOI: 10.3390/genes2030449
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
| Species | Database | Sequences compared | Sequences with different DA | Different DA (%) |
|---|---|---|---|---|
| | Swiss-Prot | 2156 | 6 | 0.3 |
| | Swiss-Prot | 14522 | 167 | 1.1 |
| | Swiss-Prot | 1799 | 54 | 3.0 |
| | Swiss-Prot | 1371 | 13 | 0.9 |
| | Swiss-Prot | | 42 | 2.1 |
| | Swiss-Prot | | 50 | 4.8 |
| | Swiss-Prot | 852 | 50 | 5.9 |
The species are listed in the order of increasing evolutionary distance from Homo sapiens.
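The last numeric column of the table above appears to be the second count expressed as a percentage of the first. A minimal sketch of that arithmetic, assuming this reading of the columns (the function and variable names are illustrative and not taken from the paper), uses only the rows where both counts survive:

```python
# Assumed reading of each row: (sequences compared, sequences with a different DA).
# Only the five rows where both counts are present in the table are used.
rows = [(2156, 6), (14522, 167), (1799, 54), (1371, 13), (852, 50)]


def percent_different(total: int, differing: int) -> float:
    """Percentage of compared sequences whose DA differs, rounded to one decimal."""
    return round(100.0 * differing / total, 1)


percentages = [percent_different(total, differing) for total, differing in rows]
print(percentages)
```

The recomputed values agree with the percentages listed in the table for these rows, which supports the assumed column interpretation.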
Positional distribution of DA differences observed when sequences from different databases (Swiss-Prot, TrEMBL, RefSeq, EnsEMBL, GNOMON) were compared with sequences of orthologous human Swiss-Prot proteins.
| | | | | |
|---|---|---|---|---|
| Type 1 transition | 46.74 | 27.96 | 9.19 | 9.19 |
| Type 2 transition | 25.9 | 24.3 | 29.4 | 20 |
| Type 3 transition | 10.2 | 6.2 | 40.3 | 33.3 |
| | | | | |
| Type 1 transition | 43.9 | 40.8 | 0 | 13 |
| Type 2 transition | 35.5 | 31.7 | 6.8 | 23.9 |
| Type 3 transition | 39.6 | 17.4 | 8.9 | 33.8 |
| | | | | |
| Type 1 transition | 39.80 | 33.80 | 9.98 | 9.98 |
| Type 2 transition | 30.70 | 22.90 | 26.60 | 18.00 |
| Type 3 transition | 20.78 | 11.69 | 38.49 | 28.72 |
| | | | | |
| Type 1 transition | 50.9 | 35.4 | 0 | 11.3 |
| Type 2 transition | 41.7 | 25.1 | 9.2 | 22.4 |
| Type 3 transition | 26.3 | 15.7 | 14.8 | 42.1 |
| | | | | |
| Type 1 transition | 41.20 | 32.72 | 9.96 | 9.96 |
| Type 2 transition | 29.62 | 20.02 | 28.50 | 20.74 |
| Type 3 transition | 21.60 | 15.08 | 37.29 | 25.22 |
The numbers in the different categories represent the percent of total assignments.
Figure 1. Correction of the sequence of complement component C7 of Gallus gallus with the FixPred protocol. The DA of the GNOMON-predicted sequence of complement component C7 from Gallus gallus (XP_424774) was found to differ from those of its mammalian and fish orthologs (CO7_HUMAN, CO7_PIG, B5X0R1_SALSA): whereas the latter contain TSP_1, Ldl_recept_a, MACPF, TSP_1, Sushi and Sushi domains, the Gallus gallus ortholog lacks the domains downstream of the MACPF domain. The sequence “XP_424774_CORRECTED” was predicted by the use of alternative gene models and is supported by ESTs. The sequence predicted by FixPred was experimentally verified by cloning the full-length cDNA; the cDNA sequence was deposited in GenBank (accession cDNA: HQ878377; accession protein: ADY17228). (a) Comparison of the DAs of XP_424774 and XP_424774_CORRECTED with that of CO7_HUMAN; (b) Alignment of the sequences of XP_424774 and XP_424774_CORRECTED with those of CO7_HUMAN, CO7_PIG and B5X0R1_SALSA.
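The idea of using DA comparison as a quality-control step can be sketched as follows. This is only an illustration under stated assumptions, not the FixPred protocol itself: the function `flag_da_outliers` is hypothetical, the DA of each sequence is assumed to be already available as an ordered tuple of domain names, and the "consensus" is simply the most frequent DA among the orthologs. The example data are the C7 architectures described in the Figure 1 caption.

```python
from collections import Counter


def flag_da_outliers(das):
    """Return (consensus DA, sorted ids whose DA differs from the consensus).

    `das` maps a sequence id to an ordered tuple of domain names.
    The consensus is the most frequent DA; on a tie, the DA encountered
    first wins (an arbitrary choice made for this sketch).
    """
    counts = Counter(das.values())
    consensus, _ = counts.most_common(1)[0]
    outliers = sorted(seq_id for seq_id, da in das.items() if da != consensus)
    return consensus, outliers


# Domain architectures as described in the Figure 1 caption.
full_c7 = ("TSP_1", "Ldl_recept_a", "MACPF", "TSP_1", "Sushi", "Sushi")
c7_orthologs = {
    "CO7_HUMAN": full_c7,
    "CO7_PIG": full_c7,
    "B5X0R1_SALSA": full_c7,
    "XP_424774": ("TSP_1", "Ldl_recept_a", "MACPF"),
}

consensus, outliers = flag_da_outliers(c7_orthologs)
print(outliers)
```

A flagged sequence is only a candidate for misprediction; as the caption notes, the correction itself relied on alternative gene models, EST support and experimental verification.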
Figure 2. Correction of the sequence of cathepsin H of Gallus gallus with the FixPred protocol. The DA of the GNOMON-predicted sequence of cathepsin H from Gallus gallus (XP_001232765) was found to differ from those of its mammalian orthologs (CATH_HUMAN, CATH_MOUSE, CATH_PIG, CATH_RAT): whereas the DA of the latter contains an Inhibitor_I29 and a Peptidase_C1 domain, the chicken protein lacks the Inhibitor_I29 domain. The sequence “XP_001232765_CORRECTED” was predicted by the use of ESTs BM427347, BI066433, AM064052, BU425005 and BI064908. The sequence predicted by FixPred was experimentally verified by cloning the full-length cDNA; the cDNA sequence was deposited in GenBank (accession cDNA: JF514547; accession protein: AEC13302). (a) Comparison of the DAs of XP_001232765 and XP_001232765_CORRECTED with that of CATH_HUMAN; (b) Alignment of the sequences of XP_001232765 and XP_001232765_CORRECTED with those of CATH_HUMAN, CATH_MOUSE, CATH_PIG, CATH_RAT.
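The distinction between terminal and internal DA differences can be made concrete with a small sketch. It assumes each DA is an ordered list of domain names and uses Python's standard `difflib.SequenceMatcher` to align the two lists; the function name `classify_da_difference` and the position labels are illustrative choices, not the procedure used by the authors. The examples are the cathepsin H and C7 architectures from the figure captions.

```python
from difflib import SequenceMatcher


def classify_da_difference(reference, predicted):
    """Return the set of positions at which two DAs differ.

    Positions are 'N-terminal', 'C-terminal' or 'internal', judged from the
    edit operations needed to turn `reference` into `predicted`.
    An empty set means the two DAs are identical.
    """
    positions = set()
    matcher = SequenceMatcher(a=reference, b=predicted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        at_start = i1 == 0 and j1 == 0
        at_end = i2 == len(reference) and j2 == len(predicted)
        if at_start:
            positions.add("N-terminal")
        if at_end:
            positions.add("C-terminal")
        if not at_start and not at_end:
            positions.add("internal")
    return positions


# Cathepsin H: the predicted chicken protein lacks the Inhibitor_I29 domain.
cath_human = ["Inhibitor_I29", "Peptidase_C1"]
cath_chicken_predicted = ["Peptidase_C1"]

# Complement C7: the predicted chicken protein lacks the domains after MACPF.
c7_human = ["TSP_1", "Ldl_recept_a", "MACPF", "TSP_1", "Sushi", "Sushi"]
c7_chicken_predicted = ["TSP_1", "Ldl_recept_a", "MACPF"]

print(classify_da_difference(cath_human, cath_chicken_predicted))
print(classify_da_difference(c7_human, c7_chicken_predicted))
```

Under these assumptions the cathepsin H case is classified as an N-terminal difference and the C7 case as a C-terminal difference, the kind of terminal differences the abstract associates with incomplete or mispredicted sequences.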
Times of divergence of Homo sapiens from the lineages of the species analyzed. In our analyses we used average values determined for all genes taken from the homepage of TimeTree [44].
| | | | |
|---|---|---|---|
| Homo/Pongo | 15.96 | 15.48 | |
| Primates/Glires | 94.72 | 103.74 | 91 |
| Mammalia/Sauropsida | 274.80 | 324.81 | 325 |
| Amniota/Amphibia | 389.66 | 360.50 | 361 |
| Sarcopterygii/Actinopterygii | 444.25 | 454.94 | 455 |
| Deuterostomia/Protostomia | 826.36 | 980.12 | 910 |
| Coelomata/Pseudocoelomata | 993.57 | 867.44 | 728 |
The species are listed in the order of increasing evolutionary distance from Homo sapiens.
| | | | |
|---|---|---|---|
| 66.66 | 0.00 | 0.00 | 33.33 |
| 73.70 | 11.29 | 5.07 | 9.91 |
| 65.70 | 9.14 | 5.71 | 19.42 |
| 65.60 | 34.38 | 0.00 | 0.00 |
| 74.30 | 12.39 | 5.31 | 7.96 |
| 71.00 | 20.39 | 3.95 | 4.61 |
| 60.80 | 27.11 | 3.61 | 8.43 |
The numbers in the different categories represent the percent of total DA differences.