Takashi Abe, Shigehiko Kanaya, Hiroshi Uehara, Toshimichi Ikemura.
Abstract
As a result of remarkable progress in DNA sequencing technology, vast quantities of genomic sequences have been decoded. Homology searches of amino acid sequences, for example with BLAST, have become a basic tool for assigning functions to genes and proteins when genomic sequences are decoded. Although homology search has clearly been a powerful and irreplaceable method, the functions of only 50% or fewer of the genes can be predicted when a novel genome is decoded. A prediction method independent of homology search is urgently needed. By analyzing oligonucleotide compositions in genomic sequences, we previously developed a modified Self-Organizing Map, 'BLSOM', that clustered genomic fragments according to phylotype with no advance knowledge of phylotype. Using BLSOM for di-, tri- and tetrapeptide compositions, we developed a system that separates (self-organizes) proteins by function. When oligopeptide frequencies were analyzed in proteins previously classified into COGs (clusters of orthologous groups of proteins), BLSOMs could faithfully reproduce the COG classifications. This indicates that proteins whose functions are unknown because they lack significant sequence similarity to proteins of known function can be related to proteins of known function on the basis of similarity in oligopeptide composition. BLSOM was applied to predict the functions of vast quantities of proteins derived from mixed genomes in environmental samples.
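To make the idea of an oligopeptide composition concrete, the following is a minimal sketch of how a protein sequence could be turned into a fixed-length frequency vector of the kind a SOM takes as input. The function name `oligopeptide_composition`, the handling of non-standard residues, and the normalization to relative frequencies are assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def oligopeptide_composition(seq, k=2, alphabet=AMINO_ACIDS):
    """Return relative frequencies of all k-mers over `alphabet`, in a fixed order.

    k=2, 3, 4 correspond to di-, tri- and tetrapeptides.
    Assumption: k-mers containing a letter outside `alphabet` are skipped.
    """
    seq = seq.upper()
    letters = set(alphabet)
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= letters
    )
    total = sum(counts.values())
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# A dipeptide vector over 20 residues has 20**2 = 400 components.
vector = oligopeptide_composition("MKTAYIAKQR", k=2)
```

With the full 20-letter alphabet the vector length grows as 20^k (400, 8000, 160000 for k = 2, 3, 4), which is one plausible reason for working with grouped amino acids at higher k.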
Year: 2009 PMID: 19801558 PMCID: PMC2762413 DOI: 10.1093/dnares/dsp018
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. The distribution of pure lattice points. (A) Di20-w/o window; 20-amino acid groups, full-length sequences without window. (B) Di20-W200; 20-amino acid groups, 200-amino acid windows. (C) Di11-W200; 11-amino acid groups, 200-amino acid windows. (D) Tri11-W100; 11-amino acid groups, 100-amino acid windows. (E) Tri11-W200; 11-amino acid groups, 200-amino acid windows. (F) Tetra6-W200; 6-amino acid groups, 200-amino acid windows. ‘Pure lattice points’ are colored in red and ‘mixed lattice points’ are colored in blue. Lattice points with no sequence or with only one sequence are shown in white.
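The condition names in this caption combine an amino acid grouping (20, 11 or 6 groups) with a window length (100 or 200 residues, or none). The sketch below illustrates both preprocessing steps. The grouping in `EXAMPLE_GROUPS` is hypothetical and is not the grouping used in the study, which is not given here; the non-overlapping window step and the dropping of a trailing short fragment are likewise assumptions.

```python
# HYPOTHETICAL 11-group partition of the 20 residues, for illustration only.
EXAMPLE_GROUPS = ["A", "G", "ST", "C", "DE", "NQ", "KR", "H", "ILMV", "FWY", "P"]


def reduce_alphabet(seq, groups=EXAMPLE_GROUPS):
    """Replace each residue by a one-letter label of its group ('a', 'b', ...).

    Residues that belong to no group are mapped to 'X'.
    """
    label = {aa: chr(ord("a") + i) for i, group in enumerate(groups) for aa in group}
    return "".join(label.get(aa, "X") for aa in seq.upper())


def windows(seq, size=200, step=200):
    """Yield consecutive windows of `size` residues.

    Assumption: windows do not overlap (step == size) and a trailing
    fragment shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


# A 450-residue protein gives two 200-residue windows under these assumptions.
fragments = [reduce_alphabet(w) for w in windows("MKTAYIAKQR" * 45, size=200)]
```

Reducing 20 residues to 11 groups shrinks a tripeptide vector from 8000 to 11^3 = 1331 components, and 6 groups give 6^4 = 1296 components for tetrapeptides.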
Occurrence levels of pure lattice points
| Analysis condition | Proportion (%) of pure lattice points |
|---|---|
| Di20-w/o window | 15.2 |
| Di20-W200 | 35.1 |
| Di11-W200 | 10.7 |
| Tri11-W100 | 19.1 |
| Tri11-W200 | 45.4 |
| Tetra6-W200 | 18.1 |
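As a rough illustration of how a percentage like those in the table could be computed, the sketch below counts lattice points whose sequences all come from a single COG. It assumes that a 'pure' lattice point holds at least two sequences from one COG, that a 'mixed' point holds sequences from more than one COG, and that the percentage is taken over lattice points holding at least two sequences. These definitions are inferred from the Figure 1 caption and may differ from the denominator actually used in the study.

```python
from collections import defaultdict


def pure_lattice_percentage(assignments):
    """Percentage of pure lattice points among points with >= 2 sequences.

    `assignments` is an iterable of (lattice_point, cog_id) pairs,
    one pair per sequence mapped onto the BLSOM.
    """
    cogs_at = defaultdict(list)
    for point, cog in assignments:
        cogs_at[point].append(cog)

    # Points with zero or one sequence are ignored (shown in white in Figure 1).
    occupied = [cogs for cogs in cogs_at.values() if len(cogs) >= 2]
    if not occupied:
        return 0.0
    pure = sum(1 for cogs in occupied if len(set(cogs)) == 1)
    return 100.0 * pure / len(occupied)


example = [
    ((0, 0), "COG0419"), ((0, 0), "COG0419"),  # pure point
    ((0, 1), "COG0419"), ((0, 1), "COG1196"),  # mixed point
    ((1, 1), "COG0477"),                       # single sequence, ignored
]
percentage = pure_lattice_percentage(example)
```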
Figure 2. Clustering of protein sequences according to COG. (A) Di20-W200. (B) Tri11-W200. (C) Tetra6-W200. Sequences of 20 COG examples are presented. The number of sequences classified into each lattice point is shown by the height of a vertical bar, with a color representing each of the 20 NcCOGs.
COG pairs associated commonly on three BLSOMs
| COG | Associated COG |
|---|---|
| COG0419 ATPase involved in DNA repair | COG0497 ATPase involved in DNA repair |
| COG0419 ATPase involved in DNA repair | COG1196 Chromosome segregation ATPases |
| COG0419 ATPase involved in DNA repair | COG4942 Membrane-bound metallopeptidase |
| COG0419 ATPase involved in DNA repair | COG5022 Myosin heavy chain |
| COG1196 Chromosome segregation ATPases | COG5022 Myosin heavy chain |
| COG0439 Biotin carboxylase | COG4770 Acetyl/propionyl-CoA carboxylase, alpha subunit |
| COG0477 Permeases of the major facilitator superfamily | COG0697 Permeases of the drug/metabolite transporter (DMT) superfamily |
| COG0477 Permeases of the major facilitator superfamily | COG2814 Arabinose efflux permease |
| COG0515 Serine/threonine protein kinase | COG5099 RNA-binding protein of the Puf family, translational repressor |
| COG1298 Flagellar biosynthesis pathway, component FlhA | COG4789 Type III secretory pathway, component EscV |
| COG3839 ABC-type sugar transport systems, ATPase components | COG3842 ABC-type spermidine/putrescine transport systems, ATPase components |
Figure 3. Venn diagrams representing COG predictions obtained by three BLSOMs. (A) Sargasso HomCOG proteins. The number and percentage in parentheses give the number of Sargasso proteins properly assigned to a COG by each BLSOM and the corresponding percentage. (B) Sargasso proteins not assigned to a COG by BLAST. The number in parentheses gives the number of Sargasso proteins assigned to a COG by each BLSOM.
Figure 4. Identity and coverage levels found for the NcCOG protein with the lowest E-value for each of the Sargasso queries, which included HomCOG proteins. Sargasso proteins commonly assigned (A) by all three BLSOMs, (B) by at least two BLSOMs and (C) by any BLSOM.
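The three agreement levels used in Figures 3 and 4 (all three BLSOMs, at least two, any) can be sketched as simple set bookkeeping over per-map predictions. The function name `consensus_levels` and the input format are hypothetical, and the sketch assumes that 'commonly assigned' means assigned to the same COG by the maps in question.

```python
from collections import Counter


def consensus_levels(pred_a, pred_b, pred_c):
    """Group proteins by how many of three BLSOMs agree on their COG.

    Each argument maps a protein id to the COG predicted by one BLSOM;
    proteins left unassigned by a map are simply absent from its dict.
    Returns three sets: (all three agree, at least two agree, any assignment).
    """
    all_three, at_least_two, any_map = set(), set(), set()
    for protein in set(pred_a) | set(pred_b) | set(pred_c):
        calls = [p[protein] for p in (pred_a, pred_b, pred_c) if protein in p]
        # Size of the largest group of maps that agree on one COG.
        top = Counter(calls).most_common(1)[0][1]
        any_map.add(protein)
        if top >= 2:
            at_least_two.add(protein)
        if top == 3:
            all_three.add(protein)
    return all_three, at_least_two, any_map


di = {"p1": "COG0419", "p2": "COG0477", "p3": "COG0515"}
tri = {"p1": "COG0419", "p2": "COG0477"}
tetra = {"p1": "COG0419", "p2": "COG2814"}
three, two_plus, any_one = consensus_levels(di, tri, tetra)
```

The sets are nested by construction, so stricter agreement trades the number of proteins covered for a more conservative prediction.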