Gary W Stuart, Michael W Berry.
Abstract
BACKGROUND: Eukaryotic whole genome sequences are accumulating at an impressive rate. Effective methods for comparing multiple whole eukaryotic genomes on a large scale are needed. Most attempted solutions involve the production of large-scale alignments, and many of these require a high-stringency pre-screen for putative orthologs in order to reduce the effective size of the dataset and provide a reasonably high but unknown fraction of correctly aligned homologous sites for comparison. As an alternative, highly efficient methods that do not require the pre-alignment of operationally defined orthologs are also being explored.
Year: 2004 PMID: 15606920 PMCID: PMC544558 DOI: 10.1186/1471-2105-5-204
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Genes and Genomes Compared
| Organism | SVD "top 5" | Genome Total |
| Hsap | 996 (23%) | 25,319 (14%) |
| Mmus | 881 (20%) | 25,371 (14%) |
| Rnov | 670 (15%) | 21,204 (12%) |
| Frub | 536 (12%) | 37,439 (21%) |
| Agam | 573 (13%) | 16,091 (9%) |
| Dmel | 443 (10%) | 18,107 (10%) |
| Cele | 135 (3%) | 21,124 (12%) |
| Scer | 113 (3%) | 5,855 (3%) |
| Pfal | 23 (1%) | 5,049 (3%) |
| Total | 4,370 (100%) | 175,559 (100%) |
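The totals and percentage shares in the table can be re-derived from the listed counts. The following is a minimal sketch of that arithmetic; the counts are transcribed from the table, and the helper name `shares` is a hypothetical choice, not something taken from the paper.

```python
# Counts transcribed from the table: (SVD "top 5" proteins, genome total).
counts = {
    "Hsap": (996, 25319),
    "Mmus": (881, 25371),
    "Rnov": (670, 21204),
    "Frub": (536, 37439),
    "Agam": (573, 16091),
    "Dmel": (443, 18107),
    "Cele": (135, 21124),
    "Scer": (113, 5855),
    "Pfal": (23, 5049),
}


def shares(values):
    """Return the column sum and each value as a whole-number percentage of it."""
    total = sum(values)
    return total, [round(100 * v / total) for v in values]


top5_total, top5_pct = shares([top5 for top5, _ in counts.values()])
genome_total, genome_pct = shares([genome for _, genome in counts.values()])

for name, t, g in zip(counts, top5_pct, genome_pct):
    print(f"{name}: {t}% of top-5 proteins, {g}% of all proteins")
```

Comparing the two percentage columns shows which proteomes contribute more or fewer "top 5" proteins than their size alone would suggest.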
A selected list of protein family/motifs identified by SVD-derived singular triplets (st's). In this summary table, unique example proteins (rsv-gi#) were chosen from the 5 to 40 "top five" proteins identified as members of a given family by as many as 8 distinct right singular vectors. As examples, six individual ras proteins representing six broad categories of ras (highlighted in italics) are defined by a total of 13 right singular vectors, and 18 ribosomal proteins (highlighted in bold) are defined by a total of 65 right singular vectors. The lengths of continuous copep strings identified from the corresponding left singular vectors and their specificities (E-values) as revealed by pairwise BLAST are also provided.
| 421a | 1 | 11415030 | HIST1H4J | H4 histone family, member E | 62 aa's (1e-54) |
| 417a | 2 | 21166389 | HIST1H2BC | H2B histone family, member L | 75 aa's (4e-67) |
| | | | | | 60 aa's (2e-55) |
| 408 | 1 | 4501885 | ACTB | beta actin; beta cytoskeletal actin | 42 aa's (9e-38) |
| | | | | | 79 aa's (3e-62) |
| 392a | 1 | 5174735 | TUBB2 | tubulin, beta, 2 | 45 aa's (7e-41) |
| | | | | | 14 aa's (2e-11) |
| | | | | | 77 aa's (3e-60) |
| 387 | 3 | 31981690 | Hspa8 | heat shock 70kD protein 8 | 40 aa's (2e-35) |
| 385a | 1 | 11024714 | UBB | ubiquitin B precursor; polyubiquitin B | 77 aa's (2e-68) |
| 378a | 5 | 26051216 | CAMK2B | calmodulin-dependent protein kinase IIB isoform 7 | 14 aa's (2e-10) |
| 373a | 2 | 4502201 | ARF1 | ADP-ribosylation factor 1 | 86 aa's (1e-41) |
| 371a | 3 | 6679439 | Ppia | peptidylprolyl isomerase A; cyclophilin A | 55 aa's (2e-48) |
| 368a | 5 | 25150942 | Tcb-1 | transposable element tcb1 transposase (1O615) | 88 aa's (7e-74) |
| 363 | 3 | 33149310 | UBE2D3 | ubiquitin-conjugating enzyme E2D 3 isoform 1 | 138 aa's (7e-91) |
| 354 | 3 | 4502549 | CALM2 | calmodulin 2; phosphorylase kinase delta | 40 aa's (1e-19) |
| | | | | | 44 aa's (3e-33) |
| | | | | | 15 aa's (2e-12) |
| 347a | 3 | 51873060 | Eef1a1 | eukaryotic translation elongation factor 1 alpha 1 | 24 aa's (4e-19) |
| | | | | | 92 aa's (2e-89) |
| 341a | 5 | 31980772 | Ppp1cc | protein phosphatase 1, catalytic, gamma isoform | 20 aa's (5e-17) |
| 337 | 5 | 24648716 | mod(mdg4) | modifier of mdg4 | 32 aa's (2e-29) |
| 334 | 5 | 24653107 | Galpha49B | G protein alpha49B | 19 aa's (9e-18) |
| | | | | | 78 aa's (8e-74) |
| 329a | 2 | 34878793 | Pcdha13 | protocadherin alpha 13 | 17 aa's (8e-14) |
| 327 | 3 | 32307119 | PPP2R2B | Serine/threonine protein phosphatase 2A, neuronal | 23 aa's (7e-20) |
| 324 | 1 | 31982919 | ZNF430 | zinc finger protein 430 | 18 aa's (3e-11) |
| 322a | 3 | 34871376 | LOC287293 | similar to high mobility group 1 protein | 15 aa's (9e-13) |
| 321a | 3 | 4504445 | HNRPA1 | heterogeneous nuclear ribonucleoprotein A1 | 23 aa's (2e-18) |
| 320a | 2 | 25141298 | kin-1 | cyclic AMP-dependent catalytic subunit (kin-1) | 66 aa's (4e-62) |
| 316a | 5 | 22094075 | Slc25a5 | solute carrier family 25; adenine nucleotide | 27 aa's (7e-22) |
| 308a | 3 | 9845502 | LAMR1 | laminin receptor 1 (67kD, ribosomal protein SA) | 68 aa's (1e-60) |
| 304 | 3 | 6978809 | Eno1 | enolase 1, alpha | 32 aa's (3e-27) |
| | | | | | 139 aa's (1e-13) |
| 295 | 2 | 31083250 | PPP2R5C | Ser/threo protein phosphatase 2A, 56 kD regulator, | 16 aa's (6e-12) |
| | | | | | 58 aa's (7e-56) |
| | | | | | 77 aa's (7e-64) |
| 288 | 1 | 22129671 | Olfr493 | olfactory receptor MOR204–35 | 12 aa's (3e-08) |
| 287 | 2 | 38076430 | LOC193565 | similar to T-cell receptor alpha chain | 16 aa's (2e-12) |
| 285a | 3 | 6754140 | H2-Q7 | histocompatibility 2, Q region locus 7 | 19 aa's (5e-16) |
| | | | | | 27 aa's (4e-23) |
| | | | | | 9 aa's (2e-06) |
| | | | | | 17 aa's (4e-13) |
| 276 | 4 | 24580529 | M(2)21AB | Minute (2) 21AB CG2674-PA | 25 aa's (5e-20) |
| 272 | 1 | 25742772 | Kcna2 | potassium voltage-gated channel, shaker-related, | 12 aa's (1e-09) |
| | | | | | 11 aa's (3e-09) |
| | | | | | 54 aa's (2e-49) |
| | | | | | 34 aa's (8e-30) |
| 253a | 6 | 15809016 | MRLC2 | myosin regulatory light chain MRCL2 | 19 aa's (7e-16) |
| | | | | | 10 aa's (4e-08) |
| 240a | 5 | 24639734 | Dlc | dynein light chain ATPase | 22 aa's (4e-21) |
| 237a | 4 | 34865959 | gpdh | similar to glyceraldehyde-3-phosphate | 16 aa's (7e-13) |
| | | | | | 9 aa's (9e-07) |
| | | | | | 11 aa's (6e-09) |
| | | | | | 81 aa's (1e-78) |
| | | | | | 15 aa's (4e-12) |
| | | | | | 16 aa's (8e-14) |
| | | | | | 13 aa's (1e-10) |
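The idea of reading protein groups off right singular vectors can be sketched in a few lines. The example below is an illustration only, not the authors' pipeline: it assumes NumPy, uses invented toy strings in place of real protein sequences, applies no weighting or scaling to the counts, and the helper names (`tetrapeptides`, `count_matrix`, `top_projections`) are hypothetical. It builds a tetrapeptide-by-protein count matrix, takes its SVD, and lists the proteins with the strongest positive or negative values on a chosen right singular vector.

```python
import numpy as np


def tetrapeptides(seq, k=4):
    """All overlapping peptides of length k in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matrix(proteins, k=4):
    """Rows are tetrapeptides, columns are proteins, entries are counts."""
    vocab = sorted({p for s in proteins.values() for p in tetrapeptides(s, k)})
    index = {p: i for i, p in enumerate(vocab)}
    A = np.zeros((len(vocab), len(proteins)))
    for j, seq in enumerate(proteins.values()):
        for p in tetrapeptides(seq, k):
            A[index[p], j] += 1
    return vocab, A


def top_projections(vector, names, n=5, negative=False):
    """The n entries with the largest (or, if negative, smallest) values."""
    order = np.argsort(vector)
    picked = order[:n] if negative else order[::-1][:n]
    return [(names[i], float(vector[i])) for i in picked]


# Invented toy strings, NOT real protein sequences.
proteins = {
    "toy1": "MKLVDTGAGK",
    "toy2": "MKLVDSGAGK",
    "toy3": "ACDEFGHIKL",
    "toy4": "ACDEFGHIKM",
}
names = list(proteins)
vocab, A = count_matrix(proteins)

# Each singular triplet is (U[:, k], s[k], Vt[k]).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(2):
    print(k, top_projections(Vt[k], names, n=2))
    print(k, "negative end:", top_projections(Vt[k], names, n=2, negative=True))
```

The sign of a singular vector is arbitrary in any one SVD run, so only the grouping at each end of a vector is meaningful in this sketch, not which end is called positive.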
Comparison of seven ras family clusters provided by right singular vectors with KOG and HomoloGene clusters. Only proteins having one of the five strongest projections ("top five") for a given singular vector are used in the comparison. Few genomes have KOG members specifically identified by NCBI; however, most or all of the "top 5" proteins for a given rsv would likely be identified as members of the same KOG family. For 197a (Rab11), the KOG # provided in parentheses is that of the closely related human protein.
| rsv | gi# / Ensembl# | Projection | Organism | Gene | KOG | HomoloGene |
| 197a | 6679583 | 0.06900 | Mmus | Rab11b | (0087) | 3109 |
| (Rab11) | 14249144 | 0.06892 | Rnov | Rab11b | na | 3109 |
| | 31209781 | 0.06827 | Agam | na | na | 3109 |
| | 31209783 | 0.06827 | Agam | na | na | 3109 |
| | 31209785 | 0.06826 | Agam | na | na | 3109 |
| 236a | 31542143 | 0.05883 | Mmus | Arha | na | 1257 |
| (ApRas) | 16923986 | 0.05883 | Rnov | Arha2 | na | 1257 |
| | 10835049 | 0.05873 | Hsap | RHOA | 0393 | 1257 |
| | 28395033 | 0.05610 | Hsap | ARHC | 0393 | 22408 |
| | en131312 | 0.05412 | Frub | na | na | na |
| 277 | 27689505 | 0.07229 | Rnov | Rab5c | na | 20961 |
| (Rab5) | 4759020 | 0.07214 | Hsap | RAB5C | 0092 | 20961 |
| | 31225537 | 0.07022 | Agam | na | na | 20961 |
| | 31225545 | 0.07022 | Agam | na | na | 20961 |
| | 31225553 | 0.07022 | Agam | na | na | 20961 |
| 277a | 15718763 | 0.04278 | Hsap | KRAS2 | 0395 | 2159 |
| (HaRas) | 4885425 | 0.04243 | Hsap | HRAS | 0395 | 3907 |
| | 34861217 | 0.04243 | Rnov | Hras1 | na | 3907 |
| | 4505451 | 0.04176 | Hsap | NRAS | 0395 | 20564 |
| | 34859609 | 0.04165 | Rnov | Nras | na | 20564 |
| 350a | 9845511 | 0.07403 | Hsap | RAC1 | 0393 | 23126 |
| (RasC3) | 38081613 | 0.07403 | Mmus | Rac1 | na | 23126 |
| | 9845509 | 0.06942 | Hsap | RAC1b | 0393 | 23126 |
| | 4826962 | 0.06820 | Hsap | RAC3 | 0393 | 3705 |
| | 18875380 | 0.06820 | Mmus | Rac3 | na | 3705 |
| 387a | 34861437 | 0.03486 | Rnov | Rab1B | na | 23689 |
| (Rab1) | 21313162 | 0.03413 | Mmus | Rab1B | na | 23689 |
| | 13569962 | 0.03400 | Hsap | RAB1B | 0084 | 23689 |
| | 27709432 | 0.03400 | Rnov | Rab1B-like | na | 27733 |
| | en156199 | 0.03396 | Frub | na | na | na |
| 389a | 4758988 | 0.04851 | Hsap | RAB1A | 0084 | 3067 |
| (Rab/Ras) | 6679587 | 0.04851 | Mmus | Rab1A | na | 3067 |
| | 13569962 | 0.04840 | Hsap | RAB1B | 0084 | 23689 |
| | en160503 | 0.04824 | Frub | na | na | na |
| | 13592035 | 0.04811 | Rnov | Rab1A | na | 3067 |
Comparison of four unrelated protein clusters provided by right singular vectors with KOG and HomoloGene clusters. Descriptions for each of these clusters are provided in Table 2. Only proteins having one of the five strongest projections ("top five") for a given singular vector are used in the comparison.
| rsv | gi# / Ensembl# | Projection | Organism | Gene | KOG | HomoloGene |
| 272a | en165011 | 0.06928 | Frub | na | na | na |
| (Kcna) | 25742772 | 0.06865 | Rnov | Kcna2 | na | 21034 |
| | 4826782 | 0.06834 | Hsap | Kcna2 | 1545 | 21034 |
| | 31543024 | 0.06821 | Mmus | Kcna2 | na | 21034 |
| | 27465523 | 0.06632 | Rnov | Kcna1 | na | 183 |
| 304 | 12963491 | 0.101507 | Mmus | Eno1 | na | 1093 |
| (Eno) | 6978809 | 0.101252 | Rnov | Eno1 | na | 1093 |
| | 4503571 | 0.097337 | Hsap | Eno1 | 2670 | 1093 |
| | 51770896 | 0.092899 | Mmus | Eno1 | na | 1093 |
| | en150208 | 0.091209 | Frub | na | na | na |
| 316a | 32189350 | 0.11376 | Rnov | Slc25a5 | na | 37448 |
| (Slc25) | 22094075 | 0.11343 | Mmus | Slc25a5 | na | 37448 |
| | 4502099 | 0.11202 | Hsap | Slc25a5 | 0749 | 37448 |
| | en159404 | 0.1034 | Frub | na | na | na |
| | 20863388 | 0.10117 | Mmus | Slc25a4 | na | 36058 |
| 373a | 4502201 | 0.12887 | Hsap | Arf1 | 0070 | 1253 |
| (Arf) | 6680716 | 0.12887 | Mmus | Arf1 | na | 1253 |
| | 11968098 | 0.12887 | Rnov | Arf1 | na | 1253 |
| | 24668762 | 0.12856 | Dmel | Arf79F | 0070 | 1253 |
| | 24668773 | 0.12856 | Dmel | Arf79F | 0070 | 1253 |
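One simple way to summarize such a comparison is to count, for each right singular vector, how many of its annotated "top five" proteins fall in the same external cluster. The sketch below does this for the four clusters in the table above, using the last column of each row as the cluster ID (with `None` standing in for "na"). The tally itself and the helper name `dominant_cluster` are illustrative assumptions, not a measure reported in the paper.

```python
from collections import Counter

# Cluster IDs transcribed from the last column of the table, in row order.
top_five = {
    "272a": [None, "21034", "21034", "21034", "183"],
    "304": ["1093", "1093", "1093", "1093", None],
    "316a": ["37448", "37448", "37448", None, "36058"],
    "373a": ["1253", "1253", "1253", "1253", "1253"],
}


def dominant_cluster(ids):
    """Return (most common cluster ID, its count, number of annotated proteins)."""
    annotated = [i for i in ids if i is not None]
    cluster, hits = Counter(annotated).most_common(1)[0]
    return cluster, hits, len(annotated)


summary = {rsv: dominant_cluster(ids) for rsv, ids in top_five.items()}
for rsv, (cluster, hits, annotated) in summary.items():
    print(f"{rsv}: {hits} of {annotated} annotated proteins share cluster {cluster}")
```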
Figure 1. Ras families and sub-families defined by singular vectors (labeled at right). For comparison, dominant peptide strings identified by SVD (boxes) are shown within a Clustal-X alignment. The aligned region corresponds to the first 181 aa's of the 192 aa human Rac3 protein. Protein sequences are labeled by gi# (or Ensembl# for Frub). Asterisks (*) indicate globally conserved residues. Subfamily motifs associated with negative vector values are denoted with an "a" suffix (e.g. 350a).
Figure 2. Left singular vectors depicted as tetrapeptide projection value frequency distributions. Distributions for singular vectors 277 (A) and 389 (B) are shown in purple; normal distributions having the same standard deviation are shown in blue. For both distributions, the vast majority of values fall between 0.015 and -0.015. Dashed lines mark the cut-off values used to extract dominant tetrapeptides summarizing correlated peptide (copep) motifs. Selected strings of overlapping tetrapeptides describing parts of these motifs are shown boxed above the approximate regions in the distribution in which they appear.
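The step from dominant tetrapeptides to longer copep strings can be illustrated with a small sketch. It assumes that tetrapeptides passing a cut-off are chained whenever the last three residues of one match the first three of the next. The projection values, the 0.02 cut-off, and the function names `dominant` and `assemble` are all invented for illustration; the caption does not state the actual cut-offs. The simple chaining rule ignores cycles and follows only the alphabetically first branch.

```python
def dominant(values, cutoff):
    """Tetrapeptides whose projection value is at or above the cut-off."""
    return {p for p, v in values.items() if v >= cutoff}


def assemble(peptides):
    """Chain tetrapeptides that overlap by three residues into longer strings."""
    peptides = set(peptides)
    # A chain starts at a peptide that no other selected peptide leads into.
    starts = sorted(
        p for p in peptides
        if not any(q != p and q[1:] == p[:3] for q in peptides)
    )
    strings = []
    for start in starts:
        chain, current, used = start, start, {start}
        while True:
            followers = sorted(
                q for q in peptides if q not in used and current[1:] == q[:3]
            )
            if not followers:
                break
            current = followers[0]  # simplification: take the first branch only
            used.add(current)
            chain += current[-1]
        strings.append(chain)
    return strings


# Invented projection values for one left singular vector (illustration only).
values = {
    "GAGK": 0.031, "AGKS": 0.029, "GKST": 0.027,
    "DTAG": 0.022, "TAGQ": 0.021,
    "LLLL": 0.004, "QEEY": -0.020,
}
copeps = assemble(dominant(values, cutoff=0.02))
print(copeps)  # ['DTAGQ', 'GAGKST']
```

For the negative tail of a distribution, the same chaining could be applied after selecting values at or below a negative cut-off.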
Figure 3. SVD-based proteome phylogeny (A) of nine eukaryotes with percentage branch support (top: bootstrap; bottom: novel jackknife). An unsupported alternative phylogeny containing the "ecdysozoan" lineage is indicated by the dashed red branches. Percentage branch support values for the various clades of the tree are also provided to the left (B) for trees built using all proteins, as well as trees built after poorly described proteins are removed using either of two alternative vector magnitude inclusion values (>0.005, >0.05).
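Filtering by vector magnitude can be sketched as follows. The example assumes that each protein is represented by a vector in the reduced SVD space and that "magnitude" means its Euclidean norm; both are assumptions made for illustration. The protein names, vector values, and the helper `keep_well_described` are invented. Only the two thresholds (0.005 and 0.05) come from the caption.

```python
import math


def magnitude(vec):
    """Euclidean norm of a protein vector (assumed meaning of 'magnitude')."""
    return math.sqrt(sum(x * x for x in vec))


def keep_well_described(vectors, threshold):
    """Keep proteins whose vector magnitude exceeds the inclusion value."""
    return {name: v for name, v in vectors.items() if magnitude(v) > threshold}


# Invented three-dimensional protein vectors (illustration only).
vectors = {
    "protA": [0.08, -0.03, 0.01],
    "protB": [0.004, 0.002, -0.001],
    "protC": [0.02, 0.01, 0.0],
}

loose = keep_well_described(vectors, 0.005)
strict = keep_well_described(vectors, 0.05)
print(sorted(loose), sorted(strict))
```

The stricter inclusion value keeps fewer proteins, which is the trade-off the branch support comparison in panel B explores.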