Andrey Kislyuk, Srijak Bhatnagar, Jonathan Dushoff, Joshua S Weitz.
Abstract
BACKGROUND: The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed.
Year: 2009 PMID: 19799776 PMCID: PMC2765972 DOI: 10.1186/1471-2105-10-316
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Binning diagram. Diagram of binning data pathways and main MCMC iteration loop.
Redundancies in oligonucleotide dimension space
| Oligonucleotide order | Total dimensions | Independent dimensions |
|---|---|---|
| 1 | 4 | 1 |
| 2 | 20 | 7 |
| 3 | 84 | 25 |
| 4 | 340 | 103 |
| 5 | 1364 | 391 |
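The middle column of the table above is consistent with counting every oligonucleotide of length 1 through n over a four-letter alphabet, i.e. 4 + 4^2 + ... + 4^n. The short sketch below reproduces only that column; the function name is illustrative, and the smaller counts in the last column (what remains once redundancies are removed) are not derived here.

```python
def total_kmer_dimensions(max_order: int) -> int:
    """Count all oligonucleotides of length 1..max_order over {A, C, G, T}."""
    return sum(4 ** k for k in range(1, max_order + 1))


# Orders 1 through 5, matching the rows of the table.
totals = [total_kmer_dimensions(n) for n in range(1, 6)]
print(totals)  # [4, 20, 84, 340, 1364]
```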
Figure 2. Fragment likelihood separation. Log likelihood values of fragments from pairs of species according to models fitted by the classifier. Points' positions on the two axes represent log likelihoods of each fragment according to the first and second model, respectively. A, Helicobacter acinonychis vs. Vibrio fischeri, good separation (98% accuracy, D = 1.31); B, Streptococcus pneumoniae vs. Streptococcus pyogenes, poor separation (57% accuracy, D = 0.22). Fragment length was 800 in both cases. 500 fragments per species were supplied.
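The kind of two-axis plot described in this caption can be illustrated with a small sketch: score each fragment under two k-mer models and assign it to the model with the higher log likelihood. Everything below is an assumption made for illustration, not the authors' implementation: the models are fitted from labeled synthetic fragments rather than by the self-training MCMC procedure, overlapping 3-mers are treated as independent, and the function names and base compositions are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

K = 3  # k-mer order used by this toy model (assumption)


def fit_kmer_model(sequences, k=K, pseudocount=1.0):
    """Return log-probabilities for all 4**k k-mers, with additive smoothing."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers) + pseudocount * len(kmers)
    return {km: math.log((counts[km] + pseudocount) / total) for km in kmers}


def log_likelihood(fragment, model, k=K):
    """Sum the log-probabilities of every overlapping k-mer in the fragment."""
    return sum(model[fragment[i:i + k]] for i in range(len(fragment) - k + 1))


def random_fragment(rng, weights, length):
    """Draw a synthetic fragment with the given A, C, G, T weights."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(0)
at_rich = [random_fragment(rng, [0.35, 0.15, 0.15, 0.35], 400) for _ in range(100)]
gc_rich = [random_fragment(rng, [0.15, 0.35, 0.35, 0.15], 400) for _ in range(100)]

# Fit one model per synthetic source on half of the fragments.
model_a = fit_kmer_model(at_rich[:50])
model_b = fit_kmer_model(gc_rich[:50])

# Score the held-out fragments under both models; each pair is one plot point.
held_out = [(f, 0) for f in at_rich[50:]] + [(f, 1) for f in gc_rich[50:]]
points = [(log_likelihood(f, model_a), log_likelihood(f, model_b)) for f, _ in held_out]
predicted = [0 if ll_a >= ll_b else 1 for ll_a, ll_b in points]
accuracy = sum(p == label for p, (_, label) in zip(predicted, held_out)) / len(held_out)
print(f"accuracy on synthetic fragments: {accuracy:.2f}")
```

Sources with very different base composition separate cleanly in this toy setting; closely related sources, as in panel B, would produce points clustered near the diagonal.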
Figure 3. Pairwise genome divergence distributions. Cumulative distributions of pairwise divergences (D) between all completed bacterial genomes retrieved from GenBank. Fragment lengths of 400 to 1000 were used to compute D. Divergences based on k-mer order 2, 3, and 4 are represented in panels A, B, and C, respectively. The vertical cut-off line at D = 1 indicates an empirical boundary above which the binning algorithm works with high accuracy. For fragment length 400, over 80% of all randomly selected pairs are observed to have divergences above this line.
Figure 4. Algorithm accuracy vs. fragment divergence. Sets of 2, 3, 5, and 10 genomes were sampled randomly from a set of 1055 completed bacterial chromosomes, and experiments were conducted as described in Materials and Methods. Trials were conducted with 400- and 800-nt fragments. Classification accuracy for the majority of genome pairs with overall divergence above 1 is in the high-performance range (accuracy > 0.9), and above divergence 3 accuracy exceeds 0.9 in over 95% of the trials. Results for Bayesian posterior distribution sampling were not significantly different (Additional file 3).
Table 2. Summary of species' characteristics, including all independent monomer and dimer frequencies, for the 5 pairs of genomes used in the trials shown in Figures 5 and 6.
| 63% | 0.186 | 0.041 | 0.044 | 0.048 | 0.054 | 0.127 | 0.114 | |
| 62% | 0.189 | 0.040 | 0.057 | 0.037 | 0.068 | 0.097 | 0.098 | |
| 36% | 0.322 | 0.128 | 0.046 | 0.092 | 0.063 | 0.025 | 0.037 | |
| 32% | 0.337 | 0.118 | 0.047 | 0.109 | 0.059 | 0.015 | 0.038 | |
| 40% | 0.301 | 0.105 | 0.050 | 0.082 | 0.066 | 0.027 | 0.042 | |
| 39% | 0.303 | 0.126 | 0.040 | 0.079 | 0.058 | 0.037 | 0.060 | |
| 35% | 0.324 | 0.122 | 0.042 | 0.097 | 0.060 | 0.017 | 0.037 | |
| 33% | 0.333 | 0.121 | 0.053 | 0.110 | 0.066 | 0.026 | 0.035 | |
| 31% | 0.343 | 0.134 | 0.038 | 0.110 | 0.055 | 0.008 | 0.027 | |
| 33% | 0.335 | 0.122 | 0.053 | 0.112 | 0.065 | 0.026 | 0.033 | |
Figure 5. Algorithm accuracy vs. fragment length. Fragment length-dependent performance on 2-species datasets. The same trials as in Figure 4 were performed on a subset of pairs of genomes while varying simulated fragment size from 40 to 1000. The species' characteristics are given in Table 2.
Summary of algorithm performance on JGI FAMeS data.
| Sources | D3 | Fragments per source | Fragment length | Accuracy |
|---|---|---|---|---|
| APOW1005, PPD1199, AIBF1022, AHZI1134, AHXO1014 | 2.3451 | 500 | 400 | 0.87 |
| BCSB1222, ABFI1048, AHYP1295, AKNK1296, AAZH3626 | 1.9598 | 500 | 400 | 0.69 |
| AHYT1136, AHYI1010, PIT10099, AINZ1029, AHZF1044 | 1.9314 | 500 | 400 | 0.85 |
| PPD1199, AUNI1013, ABSU1031, AABS2846, AHXO1014 | 1.8881 | 500 | 400 | 0.89 |
| AOTU1003, BCSB1222, AIOH1083, AIFS1040, AHXX1063 | 1.8032 | 500 | 400 | 0.86 |
| BCSB1222, VNY1182, AHXF1121, AKNK1296, AHZI1134 | 1.3563 | 500 | 400 | 0.81 |
| KPY1561, AOTY1222, BAHF1005, POG1025, AAOP1172 | 1.2429 | 500 | 400 | 0.79 |
| BCSB1222, AADD1003, AUNI1013, KPR1102, AHXO1014 | 1.1571 | 500 | 400 | 0.87 |
| AICI1287, AAOO1711, AKNK1296, AHXX1063, KPR1102 | 1.0279 | 500 | 400 | 0.72 |
| AHYT1136, AAWX1070, WBJ1361, AIAI1092, AXBY1147 | 0.9987 | 500 | 400 | 0.65 |
| AICI1287, AHYT1136, AAWX1070, AADE1259, AINZ1029 | 0.9856 | 500 | 400 | 0.72 |
| AUSC1572, AHYF1232, AAON1449, AIAX1019, ACBK1133 | 0.8884 | 500 | 400 | 0.78 |
| Average (12 trials, 5 sources, | 1.46 | 500 | 400 | 0.79 |
Random subsets of 5 sources each were selected from the FAMeS simLC dataset, with a genomic fragment divergence, D3, as shown. Fragments were truncated to the indicated length where appropriate. Reads from the dataset were used raw with no trimming.
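As a quick arithmetic check, the averages in the last row can be recomputed from the twelve trial rows above. The sketch below copies the D3 and final-column values from the table and assumes the final column is classification accuracy; the variable names are illustrative.

```python
from statistics import mean

# (D3, accuracy) for the twelve FAMeS simLC trials, in table order.
trials = [
    (2.3451, 0.87), (1.9598, 0.69), (1.9314, 0.85), (1.8881, 0.89),
    (1.8032, 0.86), (1.3563, 0.81), (1.2429, 0.79), (1.1571, 0.87),
    (1.0279, 0.72), (0.9987, 0.65), (0.9856, 0.72), (0.8884, 0.78),
]

mean_d3 = mean([d for d, _ in trials])
mean_accuracy = mean([a for _, a in trials])
print(f"mean D3 = {mean_d3:.4f}, mean accuracy = {mean_accuracy:.4f}")
```

The mean divergence comes to about 1.465 and the mean accuracy to about 0.79, in line with the 1.46 and 0.79 reported in the average row.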
Figure 6. Algorithm accuracy vs. source ratio. Fragment ratio-dependent performance on 2-species datasets. The same trials as in Figure 4 were performed on a subset of pairs of genomes while varying species' contributions to the dataset from 2% to 98%. Fragment sizes were fixed at 400 nt (A) and 1000 nt (B). The species' characteristics are given in Table 2.
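Performance comparison of LikelyBin and CompostBin on the pairs of genomes analyzed in Figures 5 and 6 and Table 2.
| Frag L | Frag N | D | LikelyBin accuracy | CB seeds | CompostBin accuracy |
|---|---|---|---|---|---|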
| 400 | 500 | 1.02 | 0.94 | 10 | 0.93 | ||
| 400 | 500 | 1.15 | 0.92 | 10 | 0.76 | ||
| 400 | 500 | 0.97 | 0.96 | 10 | 0.12* | ||
| 400 | 500 | 0.99 | 0.93 | 10 | 0.73 | ||
| 400 | 500 | 0.92 | 0.94 | 10 | 0.17* | ||
Frag L, Fragment length; Frag N, Number of fragments per source; CB seeds, labeled fragments supplied to CompostBin for training. LikelyBin matched or exceeded CompostBin's accuracy on every pair despite being completely unsupervised, whereas CompostBin required a fraction of the input fragments to be labeled to seed its clustering algorithm. We supplied training fragments to CompostBin without regard to their origin (protein- or RNA-coding). In a likely practical scenario, only 16S RNA-coding fragments would be labeled, and these would have different k-mer distributions from protein-coding regions, possibly confounding classification. (*) Convergence toward a good clustering was not observed in CompostBin for these datasets; accuracy can be less than 50% due to labeled input.
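One way to read the remark that accuracy can fall below 50% with labeled input: when clusters are anonymous, a two-way clustering can be scored under whichever cluster-to-source mapping fits best, which never drops below 50% for two sources; when seed labels pin each cluster to a source, that freedom is lost. The sketch below illustrates this under that assumed scoring scheme, which the source does not spell out; the function names and toy labels are hypothetical.

```python
from itertools import permutations


def accuracy_fixed_labels(true_labels, predicted):
    """Fraction of fragments whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)


def accuracy_best_mapping(true_labels, predicted, n_clusters):
    """Score anonymous clusters under the most favourable cluster-to-source mapping."""
    best = 0.0
    for mapping in permutations(range(n_clusters)):
        relabeled = [mapping[p] for p in predicted]
        best = max(best, accuracy_fixed_labels(true_labels, relabeled))
    return best


true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # clusters are nearly swapped

pinned = accuracy_fixed_labels(true_labels, predicted)
anonymous = accuracy_best_mapping(true_labels, predicted, n_clusters=2)
print(pinned, anonymous)  # 0.2 0.8
```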
The method of sampling the posterior distribution of the MCMC chain by averaging random accepted models from the steady state was compared to the method of selecting the model with the overall maximum log likelihood.
| Frag L | Method | D | Accuracy | LL |
|---|---|---|---|---|
| 400 | Steady state sampled | 1.08 | 0.95 | -1054490.36 |
| | | 1.09 | 0.94 | -1040007.41 |
| 400 | Maximum log likelihood | 1.02 | 0.94 | -1055584.16 |
| 1000 | Steady state sampled | 1.95 | 0.97 | -2648159.80 |
| | | 2.52 | 0.99 | -2637429.69 |
| 1000 | Maximum log likelihood | 2.12 | 0.98 | -2645204.57 |
| 400 | Steady state sampled | 1.08 | 0.90 | -1045063.72 |
| | | 1.33 | 0.95 | -1040811.10 |
| 400 | Maximum log likelihood | 1.15 | 0.92 | -1047966.99 |
| 1000 | Steady state sampled | 2.02 | 0.96 | -2624742.76 |
| | | 2.22 | 0.97 | -2615376.71 |
| 1000 | Maximum log likelihood | 2.19 | 0.96 | -2626080.18 |
| 400 | Steady state sampled | 0.93 | 0.96 | -1059955.55 |
| | | 1.18 | 0.93 | |
| 400 | Maximum log likelihood | 0.97 | 0.96 | -1061298.85 |
| 1000 | Steady state sampled | 1.71 | 0.99 | -2656860.50 |
| | | 2.28 | 0.99 | -2634722.55 |
| 1000 | Maximum log likelihood | 1.69 | 0.98 | -2658488.27 |
| 400 | Steady state sampled | 0.99 | 0.90 | -1049716.33 |
| | | 1.00 | 0.95 | -1045188.54 |
| 400 | Maximum log likelihood | 0.99 | 0.93 | -1050316.80 |
| 1000 | Steady state sampled | 1.92 | 0.97 | -2636903.64 |
| | | 2.21 | 0.97 | -2624299.41 |
| 1000 | Maximum log likelihood | 1.75 | 0.97 | -2636046.52 |
| 400 | Steady state sampled | 0.96 | 0.95 | -1037936.55 |
| | | 1.05 | 0.89 | -1033285.36 |
| 400 | Maximum log likelihood | 0.92 | 0.94 | -1037505.67 |
| 1000 | Steady state sampled | 1.84 | 0.98 | |
| | | 2.36 | 0.99 | -2581181.80 |
| 1000 | Maximum log likelihood | 1.94 | 0.98 | -2601394.32 |
Frag L, Fragment length; LL, Output model log likelihood
The resulting accuracy differences were negligible. Accuracy was also compared between 3-mer and 4-mer models. While 4-mer models slightly outperformed 3-mer models on average, they incurred a significant increase in run time (not shown). NC_ identifiers refer to GenBank accession numbers for the genomes listed in each trial.
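The two ways of reading a result off an MCMC chain compared in this table, averaging randomly chosen accepted models from the steady state versus keeping the single model with the highest log likelihood, can be sketched on a toy problem. The example below is not the paper's k-mer mixture model: it assumes a one-parameter model (a GC fraction estimated from hypothetical counts), a flat prior, and a Gaussian random-walk Metropolis sampler, and all names and settings are illustrative.

```python
import math
import random


def log_likelihood(theta, n_gc, n_total):
    """Binomial log likelihood (up to a constant) of a GC fraction theta."""
    return n_gc * math.log(theta) + (n_total - n_gc) * math.log(1.0 - theta)


def metropolis(n_gc, n_total, steps=5000, burn_in=1000, step_size=0.05, seed=0):
    """Return (steady-state averaged estimate, maximum-log-likelihood estimate)."""
    rng = random.Random(seed)
    theta = 0.5
    ll = log_likelihood(theta, n_gc, n_total)
    best_theta, best_ll = theta, ll
    accepted = []  # accepted models after burn-in
    for step in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        if not 0.0 < proposal < 1.0:
            continue  # outside the flat prior's support: reject
        proposal_ll = log_likelihood(proposal, n_gc, n_total)
        # Metropolis rule: always accept uphill moves, sometimes accept downhill ones.
        if proposal_ll >= ll or rng.random() < math.exp(proposal_ll - ll):
            theta, ll = proposal, proposal_ll
            if step >= burn_in:
                accepted.append(theta)
            if ll > best_ll:
                best_theta, best_ll = theta, ll
    sampled = rng.sample(accepted, k=min(100, len(accepted)))
    return sum(sampled) / len(sampled), best_theta


averaged, max_ll = metropolis(n_gc=360, n_total=1000)
print(f"steady-state average: {averaged:.3f}, maximum log likelihood: {max_ll:.3f}")
```

On this toy problem both estimates land close to the observed fraction of 0.36, which mirrors the negligible accuracy differences reported above, though the toy setting says nothing about how the two approaches behave on real metagenomic data.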