Abstract
BACKGROUND: Comparison and classification of metagenome samples is one of the major tasks in the study of microbial communities in natural environments or in niches on the human body. Bioinformatics methods play important roles in this task, including 16S rRNA gene analysis and alignment-based or alignment-free methods applied to metagenomic data. Alignment-free methods have the advantage of not depending on known genome annotations and therefore have high potential for studying complicated microbiomes. However, the existing alignment-free methods are all based on unsupervised learning strategies (e.g., PCA or hierarchical clustering). These methods are powerful in revealing major similarities and grouping relations between microbiome samples, but they cannot be applied to discriminate predefined classes of interest, which might not be the dominant grouping in the data. Supervised classification is needed in the latter scenario, with the goal of classifying samples into predefined classes and finding the features that discriminate the classes. The effectiveness of supervised classification with alignment-based features on metagenomic data has been shown in some recent studies. The application of alignment-free supervised classification methods to metagenome data has not yet been well explored.
Year: 2013 PMID: 24053649 PMCID: PMC3849074 DOI: 10.1186/1471-2164-14-641
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Major categories of machine-learning methods for analysis of microbiome samples with sequencing data
| Feature type | Alignment-free (based on sequence signature features without using any database) | Alignment-based (based on features obtained by mapping sequence reads to annotation databases) | OTU-based or microbial taxon-based |
| Type of machine learning | Unsupervised clustering | Supervised classification | Unsupervised clustering | Supervised or unsupervised | |
The highlighted category (alignment-free supervised classification of metagenome data) is that of the current work.
Simulation experiments
| Simulation1 | 0 | 3 | 0 | 3 | 0 | 3 |
| | 0 | 5 | 0 | 5 | 0 | 5 |
| | 0 | 10 | 0 | 10 | 0 | 10 |
| Simulation2 | 1 | 1 | - | - | - | - |
| | 1 | 5 | - | - | - | - |
| | 5 | 5 | - | - | - | - |
| | 5 | 10 | - | - | - | - |
| | 10 | 10 | - | - | - | - |
The symbol “-” means that no experiment was run with these parameters.
Figure 1. The flow chart of generating the simulation data.
Summary of the NGS short-read data for the tree genomes
| Fagaceae | SRX017683 | SRR037802 | 1.2G |
| | SRX017436 | SRR037484 | 0.816G |
| | SRX017340 | SRR037437 | 1.4G |
| | SRX017339 | SRR037158 | 0.979G |
| | SRX017338 | SRR037157 | 1.2G |
| | SRX016680 | SRR035946 | 1.1G |
| Moraceae | SRX017643 | SRR037748 | 1.2G |
| | SRX017840 | SRR038268 | 1.1G |
| | SRX017740 | SRR037888 | 0.487G |
| | SRX017645 | SRR037751 | 0.899G |
The LOOCV error rates on the simulation data
| 3-tuple | 0.16 | - | - | - | - | 0.18 | 0.28 | 0.28 | 0.28 | NaN |
| 4-tuple | 0.14 | - | - | 0.14 | 0.18 | 0.18 | 0.24 | 0.34 | 0.40 | NaN |
| 5-tuple | 0.06 | 0.06 | 0.06 | 0.08 | 0.02 | 0.02 | 0.02 | 0.00 | 0.02 | 0.12 |
| 6-tuple | 0.04 | 0.04 | 0.04 | 0.04 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 7-tuple | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 |
| 8-tuple | 0.26 | 0.10 | 0.10 | 0.14 | 0.28 | 0.30 | 0.40 | 0.46 | 0.42 | 0.46 |
This table shows the LOOCV error rates when no seed is inserted in group 1 and five seeds of length 6 are inserted in group 2 with the seed density 0.005.
“# of selected” indicates the number of features selected at each level. “features” indicates the k-tuple features with the particular value of k. “all” is the situation in which all 4^k features are used. “-” means that the feature selection level does not exist for the particular k value. “NaN” means that the method failed to converge to a result within a given amount of time.
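To make the k-tuple feature space concrete, here is a minimal Python sketch that counts k-tuples in a set of reads and returns one value per possible k-tuple, so the vector has 4^k components. The helper name `ktuple_frequencies`, the normalization to relative frequencies, and the skipping of windows with ambiguous bases are assumptions made for illustration; the source does not specify how the features were computed.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(reads, k):
    """Relative frequency of every k-tuple over A, C, G, T (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    # One entry per possible k-tuple, so the vector always has 4**k components.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


reads = ["ACGTACGT", "TTGACA"]  # toy reads, not data from the study
freqs = ktuple_frequencies(reads, 3)
print(len(freqs))  # 64 possible 3-tuples
```

The same function with k = 8 yields 65,536 features, which illustrates why the feature selection levels in the table matter more as k grows.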
Result of simulation 1
| #kind of seeds | Seed density | Length = 4 | | Length = 5 | | Length = 6 | | Length = 7 | |
| 0/3 | 0.003 | 5-tuple | 0.78 | 5-tuple | 0.88 | 6-tuple | 0.82 | 7-tuple | 1.00 |
| | | 20 features | | 5 features | | 5 features | | 5 features | |
| | 0.005 | 5-tuple | 0.76 | 5-tuple | 0.90 | 6-tuple | 1.00 | 6-tuple | 1.00 |
| | | 50 features | | 10 features | | 5 features | | 5 features | |
| | 0.01 | 5-tuple | 0.90 | 5-tuple | 1.00 | 5-tuple | 1.00 | 5-tuple | 1.00 |
| | | 20 features | | 5 features | | 5 features | | 5 features | |
| 0/5 | 0.003 | 4-tuple | 0.86 | 5-tuple | 0.78 | 6-tuple | 1.00 | 7-tuple | 1.00 |
| | | 200 features | | 50 features | | 10 features | | 5 features | |
| | 0.005 | 4-tuple | 0.94 | 5-tuple | 0.94 | 6-tuple | 1.00 | 5-tuple | 1.00 |
| | | 100 features | | 10 features | | 5 features | | 5 features | |
| | 0.01 | 4-tuple | 1.00 | 5-tuple | 1.00 | 5-tuple | 1.00 | 5-tuple | 1.00 |
| | | 5 features | | 5 features | | 5 features | | 5 features | |
| 0/10 | 0.003 | 8-tuple | 0.60 | 5-tuple | 0.96 | 6-tuple | 1.00 | 6-tuple | 1.00 |
| | | 50 features | | 30 features | | 5 features | | 5 features | |
| | 0.005 | 4-tuple | 0.96 | 5-tuple | 1.00 | 6-tuple | 1.00 | 6-tuple | 1.00 |
| | | 100 features | | 10 features | | 5 features | | 5 features | |
| | 0.01 | 4-tuple | 1.00 | 4-tuple | 1.00 | 4-tuple | 1.00 | 5-tuple | 1.00 |
| | | 20 features | | 5 features | | 5 features | | 5 features | |
This table summarizes all the results in simulation 1.
“#kind of seeds” shows how many types of seeds are inserted into the two groups, for example, “0/3” means no seed was inserted in group 1 and 3 kinds of seeds were inserted in group 2. “Length = 4, 5, 6, or 7” means the length of inserted seeds is 4, 5, 6 or 7, respectively.
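The “0/3” notation can be illustrated with a rough Python sketch of planting seeds into a random background sequence. Everything here is an assumption for illustration: the source does not define seed density or the insertion procedure, so the sketch treats density as the number of planted seeds per base and overwrites bases in place. The function names `random_seeds` and `plant_seeds` are hypothetical.

```python
import random


def random_seeds(n_kinds, length, rng):
    """Draw n_kinds random seed sequences of the given length (hypothetical helper)."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n_kinds)]


def plant_seeds(background, seeds, density, rng):
    """Overwrite random positions of background with seeds.

    Assumption: density is the number of planted seeds per base, so
    round(density * len(background)) seeds are planted in total.
    """
    if not seeds:
        return background  # the "0" side of "0/3": nothing is planted
    seq = list(background)
    for _ in range(round(density * len(seq))):
        seed = rng.choice(seeds)
        pos = rng.randrange(len(seq) - len(seed) + 1)
        seq[pos:pos + len(seed)] = seed  # same-length overwrite keeps the total length
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the toy example is reproducible
background = "".join(rng.choice("ACGT") for _ in range(2000))
seeds = random_seeds(3, 6, rng)                          # "3" in "0/3": three kinds of seeds
group1 = plant_seeds(background, [], 0.005, rng)         # "0" in "0/3": no seeds
group2 = plant_seeds(background, seeds, 0.005, rng)      # seeds planted at density 0.005
```

Overwriting rather than inserting keeps sequence lengths equal across the two groups; a true insertion would be an equally plausible reading of the source.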
Result of simulation 2
| 1/1 | 4-tuple | 0.96 | 5-tuple | 1.00 | 5-tuple | 1.00 | 6-tuple | 1.00 |
| | 5 features | | 5 features | | 5 features | | 5 features | |
| 1/5 | 4-tuple | 1.00 | 5-tuple | 1.00 | 5-tuple | 1.00 | 6-tuple | 1.00 |
| | 5 features | | 5 features | | 5 features | | 5 features | |
| 5/5 | 3-tuple | 1.00 | 5-tuple | 1.00 | 4-tuple | 1.00 | 3-tuple | 1.00 |
| | 5 features | | 5 features | | 5 features | | 5 features | |
| 5/10 | 4-tuple | 1.00 | 4-tuple | 1.00 | 4-tuple | 1.00 | 5-tuple | 1.00 |
| | 5 features | | 5 features | | 5 features | | 5 features | |
| 10/10 | 3-tuple | 1.00 | 4-tuple | 1.00 | 3-tuple | 1.00 | 4-tuple | 1.00 |
| | 5 features | | 5 features | | 5 features | | 5 features | |
“#kind of seeds” and “best result” have the same meaning as in Table 4. Seed density is fixed at 0.01.
Figure 2. The selected features from the experiment in Table 4. This figure shows some of the features selected in the Table 4 experiment. The inserted seeds in this experiment were TGTTGA, ACGACA, AACCTG, GCGGGG and ATCTGT. The first and second rows show the 5 and 10 selected features of length 6, respectively. The third row shows the 20 selected features of length 7. Yellow shading marks a feature that is an inserted seed, and green shading marks a feature that is the reverse complement of a seed.
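The yellow and green labels can be reproduced with a short Python sketch that computes the reverse complement of each inserted seed and labels a selected feature accordingly. The function `label_feature` is a hypothetical helper, and treating a longer feature as a match when it contains the seed is an assumption, since the source does not say how the 7-tuple features were matched.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The five seeds listed in the figure caption.
seeds = ["TGTTGA", "ACGACA", "AACCTG", "GCGGGG", "ATCTGT"]
rc_seeds = [reverse_complement(s) for s in seeds]


def label_feature(feature):
    """Label a selected feature (hypothetical helper).

    Assumption: a feature matches when it contains the seed or its reverse
    complement, which also covers exact matches for features of length 6.
    """
    if any(s in feature for s in seeds):
        return "seed"
    if any(rc in feature for rc in rc_seeds):
        return "reverse complement"
    return "other"


for seed, rc in zip(seeds, rc_seeds):
    print(seed, rc)
```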
Figure 3. The LOOCV error rates on real genome data. This figure shows the LOOCV error rates for different feature lengths and feature selection levels on the tree genome data. Each line shows the LOOCV error rates for one feature length.
Figure 4. The LOOCV error rate on real metagenome data. This figure shows the LOOCV error rates for different feature lengths and feature selection levels on the IBD vs. non-IBD metagenome data. Each line shows the LOOCV error rates for one feature length.
Figure 5. The histogram of the best LOOCV error rates on the permuted IBD data.
The LOOCV and test accuracies of the 5 experiments on the IBD data
| LOOCV Acc | 88% | 86% | 88% | 90% | 86% |
| Test Acc | 78% | 84% | 78.4% | 73% | 86.5% |
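For readers unfamiliar with leave-one-out cross-validation (LOOCV), the following Python sketch shows how an LOOCV error rate is computed: each sample is held out once, a classifier is trained on the rest, and the fraction of misclassified held-out samples is reported. The nearest-centroid classifier is only a simple stand-in chosen to keep the example self-contained; it is not the classifier used in the study, and the toy data are invented for illustration.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (stand-in classifier)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda label: dist2(centroids[label], x))


def loocv_error(X, y):
    """Fraction of samples misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


# Toy feature vectors for two well-separated groups (not data from the study).
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
y = [0, 0, 0, 1, 1, 1]
error = loocv_error(X, y)
print(error, 1 - error)  # error rate and the corresponding accuracy
```

An LOOCV accuracy of 88% in the table corresponds to an LOOCV error rate of 0.12 under this definition.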