Chon-Kit Kenneth Chan, Arthur L Hsu, Sen-Lin Tang, Saman K Halgamuge.
Abstract
Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that, compared to the other three, dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences. Furthermore, we observed that increasing the order of oligonucleotide frequency may worsen the assignment result in some cases, which indicates the possible existence of an optimal species-specific oligonucleotide frequency. We replaced SOM with a growing self-organising map (GSOM), which gives comparable results while improving speed by 7%-15%.
Year: 2008 PMID: 18288261 PMCID: PMC2235928 DOI: 10.1155/2008/513701
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
Figure 1. Concept of mixed pair: the mixed pair between A and B is truly mixed (IoM = ML and LoM = Phylum). The mixed pair between B and C is not truly mixed because .
Figure 3. The taxonomy distribution of the 10 species in (a) Set 1, (b) Set 2, and the 4 species in (c) simMC_Phrap. Each letter represents a single species. The numbers below the taxonomic levels indicate the maximum number of mixed pairs at that taxonomic level. For example, in (a), the maximum number of mixed pairs at the taxonomic level of Class is 12, which consists of the (a,j), (c,e), (c,b,d,g,h,i), and (e,b,d,g,h,i) mixed pairs.
Figure 2. Dinucleotide frequency counting for the short sequence “AATACTTT.”
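The counting illustrated for “AATACTTT” can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes an overlapping sliding window over a single strand, ignores windows containing characters other than A, C, G, T, and the function names `oligo_counts` and `oligo_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product


def oligo_counts(seq, k=2):
    """Count every k-mer over A, C, G, T using overlapping windows (assumed scheme)."""
    seq = seq.upper()
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 4**k possible k-mers so the feature vector has a fixed length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: seen[kmer] for kmer in kmers}


def oligo_frequencies(seq, k=2):
    """Normalise the counts so that they sum to 1 (all zeros for an empty input)."""
    counts = oligo_counts(seq, k)
    total = sum(counts.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


if __name__ == "__main__":
    counts = oligo_counts("AATACTTT", k=2)
    print({kmer: n for kmer, n in counts.items() if n})
```

Under these assumptions the eight-base sequence yields seven overlapping dinucleotides (AA, AT, TA, AC, CT, TT, TT), so TT is counted twice. Setting `k` to 3, 4, or 5 gives the tri-, tetra-, and pentanucleotide vectors of length 64, 256, and 1024.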
Training parameters used for the SOM and GSOM training.
| Training parameter | Phase 2 | Phase 3 |
|---|---|---|
| Learning length | 15 epochs | 70 epochs |
| Learning rate | 0.1 | 0.05 |
| Neighbourhood size | 3 | 1 |
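To make the role of the three parameters concrete, the following is a minimal sketch of how such a two-phase schedule might drive a plain SOM update. It is not the authors' implementation: the rectangular grid, the square bubble neighbourhood, the constant learning rate within a phase, the random initialisation, and all names are assumptions made for illustration, and the growing step of GSOM is omitted.

```python
import math
import random

# Phase parameters copied from the table above; the dictionary keys are illustrative names.
PHASES = [
    {"name": "Phase 2", "epochs": 15, "learning_rate": 0.1, "radius": 3},
    {"name": "Phase 3", "epochs": 70, "learning_rate": 0.05, "radius": 1},
]


def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_dist = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_phase(weights, samples, epochs, learning_rate, radius):
    """Pull the winning node and its grid neighbours within `radius` towards each sample."""
    for _ in range(epochs):
        for x in samples:
            br, bc = best_matching_unit(weights, x)
            for r, row in enumerate(weights):
                for c, w in enumerate(row):
                    # Assumed neighbourhood: square bubble on a rectangular grid.
                    if max(abs(r - br), abs(c - bc)) <= radius:
                        for i in range(len(w)):
                            w[i] += learning_rate * (x[i] - w[i])
    return weights


def train(samples, rows=4, cols=4, seed=0):
    """Run the phases in order on a randomly initialised rows x cols map."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for phase in PHASES:
        train_phase(weights, samples, phase["epochs"],
                    phase["learning_rate"], phase["radius"])
    return weights
```

In this sketch the earlier phase uses a larger learning rate and neighbourhood to order the map coarsely, and the later, longer phase uses smaller values to fine-tune individual nodes.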
Figure 4. Illustration of method used to compare SOM and GSOM.
The evaluation of clustering results using F-measure.
| Oligonucleotide frequency | Set 1 SOM | Set 1 GSOM | Set 2 SOM | Set 2 GSOM | simMC_Phrap SOM | simMC_Phrap GSOM |
|---|---|---|---|---|---|---|
| Di | 0.95 | 0.95 | 0.94 | 0.94 | 0.92 | 0.91 |
| Tri | 0.97 | 0.97 | 0.98 | 0.98 | 0.90 | 0.90 |
| Tetra | 0.97 | 0.97 | 0.98 | 0.98 | 0.92 | 0.90 |
| Penta | 0.97 | 0.97 | 0.99 | 0.99 | 0.89 | 0.89 |
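The formula behind these scores is not reproduced in this excerpt. As a point of reference only, the sketch below implements one widely used clustering F-measure, in which each true class is matched to its best-scoring cluster and the per-class scores are weighted by class size. Whether the authors used exactly this definition is an assumption, and the function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure (assumed definition)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j  # share of the cluster that belongs to the class
            recall = n_ij / n_i     # share of the class captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


if __name__ == "__main__":
    species = ["a", "a", "b", "b"]
    print(clustering_f_measure(species, [0, 0, 1, 1]))
    print(clustering_f_measure(species, [0, 0, 0, 1]))
```

Under this definition a score of 1 means every species occupies its own cluster, and any mixing of species within a cluster lowers the score.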
Training results in the mixing regions for species Set 1.
| Taxonomic level | SOM Di | SOM Tri | SOM Tetra | SOM Penta | GSOM Di | GSOM Tri | GSOM Tetra | GSOM Penta |
|---|---|---|---|---|---|---|---|---|
| Kingdom | — | — | — | — | — | — | — | — |
| Phylum | — | — | — | — | — | — | — | — |
| Class | ML, ML, L | — | — | — | ML, ML, L | — | — | — |
| Order | ML, ML | ML, L | ML | L | M, L | ML, L | L, L | — |
| Family | — | — | — | — | — | — | — | — |
| Genus | — | — | — | — | — | — | — | — |
| Species | M, L | M, L | ML | ML | M, L | M, L | ML, L | ML, L |
Figure 5. The labelled cluster maps for clustering species Set 1 by (a) SOM and (b) GSOM with the pentanucleotide frequency. Each hexagon represents a single node. If a node contains input samples from only a single species, it is displayed with a letter that uniquely identifies the species. Grey nodes contain samples from two or more species, and the number of species is displayed on the node. A node without a label received no input sample “hits.”
Training results in the mixing regions for species Set 2.
| Taxonomic level | SOM Di | SOM Tri | SOM Tetra | SOM Penta | GSOM Di | GSOM Tri | GSOM Tetra | GSOM Penta |
|---|---|---|---|---|---|---|---|---|
| Kingdom | — | — | — | — | — | — | — | — |
| Phylum | MH, ML, ML | — | L | — | H, ML, L, L | — | — | L |
| Class | — | — | — | — | — | — | — | — |
| Order | — | — | — | — | — | — | — | — |
| Family | — | — | — | — | — | — | — | — |
| Genus | — | — | — | — | — | — | — | — |
| Species | — | — | — | — | — | — | — | — |
Training results in the mixing regions for the contigs 8 kb from simMC_Phrap.
| Taxonomic level | SOM Di | SOM Tri | SOM Tetra | SOM Penta | GSOM Di | GSOM Tri | GSOM Tetra | GSOM Penta |
|---|---|---|---|---|---|---|---|---|
| Kingdom | — | — | — | — | — | — | — | — |
| Phylum | — | — | — | — | — | — | — | — |
| Class | — | — | — | — | — | — | — | — |
| Order | — | — | — | — | — | — | — | — |
| Family | — | — | — | — | — | — | — | — |
| Genus | MH | MH | M | MH | MH | MH | MH | MH |
| Species | — | — | — | — | — | — | — | — |
| Strain | L | — | — | L | — | — | — | — |
Speed comparison for the first two training phases of SOM and GSOM. The improvement columns give the percentage speed improvement of GSOM compared to SOM.
| Oligonucleotide frequency | Species Set 1 SOM (sec) | Species Set 1 GSOM (sec) | Species Set 1 improvement | Species Set 2 SOM (sec) | Species Set 2 GSOM (sec) | Species Set 2 improvement | simMC_Phrap SOM (sec) | simMC_Phrap GSOM (sec) | simMC_Phrap improvement |
|---|---|---|---|---|---|---|---|---|---|
| Di | 54 | 34 | 37% | 24 | 15 | 38% | 2 | 1 | 50% |
| Tri | 188 | 115 | 39% | 74 | 45 | 39% | 7 | 4 | 43% |
| Tetra | 779 | 475 | 39% | 236 | 147 | 38% | 31 | 18 | 42% |
| Penta | 3031 | 1847 | 39% | 878 | 518 | 41% | 144 | 80 | 44% |
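The improvement percentages in this table are consistent with the relative reduction in training time, (SOM time minus GSOM time) divided by SOM time. The formula is inferred from the tabulated values rather than stated in this excerpt, and the function name `speed_improvement` is hypothetical.

```python
def speed_improvement(som_seconds, gsom_seconds):
    """Percentage reduction in training time of GSOM relative to SOM (inferred formula)."""
    return 100.0 * (som_seconds - gsom_seconds) / som_seconds


if __name__ == "__main__":
    # (SOM sec, GSOM sec) pairs for Species Set 1, taken from the table above.
    set1 = {"Di": (54, 34), "Tri": (188, 115), "Tetra": (779, 475), "Penta": (3031, 1847)}
    for name, (som, gsom) in set1.items():
        print(name, round(speed_improvement(som, gsom)))
```

For example, the Species Set 1 dinucleotide entry gives 100 × (54 − 34) / 54 ≈ 37%, matching the table.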
Speed comparison for the overall training time of SOM and GSOM. The improvement columns give the percentage speed improvement of GSOM compared to SOM.
| Oligonucleotide frequency | Species Set 1 SOM (sec) | Species Set 1 GSOM (sec) | Species Set 1 improvement | Species Set 2 SOM (sec) | Species Set 2 GSOM (sec) | Species Set 2 improvement | simMC_Phrap SOM (sec) | simMC_Phrap GSOM (sec) | simMC_Phrap improvement |
|---|---|---|---|---|---|---|---|---|---|
| Di | 313 | 274 | 12% | 133 | 121 | 9% | 11 | 10 | 9% |
| Tri | 1048 | 942 | 10% | 427 | 387 | 9% | 39 | 36 | 8% |
| Tetra | 4639 | 3932 | 15% | 1297 | 1203 | 7% | 173 | 158 | 9% |
| Penta | 16839 | 15709 | 7% | 4702 | 4387 | 7% | 720 | 662 | 8% |