Andre Lamurias, Mantas Sereika, Mads Albertsen, Katja Hose, Thomas Dyhre Nielsen.
Abstract
MOTIVATION: Despite recent advancements in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. Deep graph learning algorithms have been proposed in other fields to deal with complex graph data structures. The graph structure generated during the assembly process could be integrated with contig features to obtain better bins with deep learning.Entities:
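The "contig features" mentioned above typically combine k-mer composition with abundance. As a minimal sketch of how a k-mer composition vector can be derived from a contig sequence (this is an illustration, not GraphMB's actual implementation; binners commonly use k=4, i.e. tetranucleotide frequencies):

```python
from itertools import product

def kmer_composition(seq, k=4):
    """Normalized k-mer frequency vector for one contig.

    Canonical (reverse-complement-collapsed) k-mers, as real binners
    often use, are deliberately omitted to keep the sketch short.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:  # skip windows containing N or other ambiguity codes
            counts[index[km]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]
```

Each contig thus maps to a fixed-length vector (256 dimensions for k=4), which can be concatenated with per-sample abundance values before being fed to the auto-encoder.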
Year: 2022 PMID: 35972375 PMCID: PMC9525014 DOI: 10.1093/bioinformatics/btac557
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. GraphMB's workflow. (a) The metagenome of an environmental sample is sequenced and assembled into contigs. (b) Initial embeddings are computed with a variational auto-encoder based on k-mer composition and abundance features. (c) The inputs to the GNN are the initial contig embeddings and the graph structure provided by the assembly graph. The thickness of each edge corresponds to the number of reads that cover it. (d) The GNN model learns new embeddings by aggregating neighboring contigs (nodes in the assembly graph). (e) The final embeddings are clustered and bins are obtained.
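Step (d) of the workflow can be sketched as one round of coverage-weighted neighbor aggregation over the assembly graph. This is a simplified illustration under assumed names (`aggregate_embeddings`, `alpha`); the paper's actual GNN layers learn their aggregation weights rather than using a fixed mixing factor:

```python
import numpy as np

def aggregate_embeddings(emb, edges, weights, alpha=0.5):
    """One round of neighbor aggregation on an assembly graph.

    emb:     (n_contigs, d) initial contig embeddings (e.g. from a VAE)
    edges:   list of (u, v) contig index pairs from the assembly graph
    weights: per-edge read coverage, used to weight each neighbor
    alpha:   mixing factor between a node's own embedding and its neighbors
    """
    n, d = emb.shape
    agg = np.zeros((n, d))
    wsum = np.zeros(n)
    for (u, v), w in zip(edges, weights):
        # edges are undirected: each endpoint aggregates the other
        agg[u] += w * emb[v]
        agg[v] += w * emb[u]
        wsum[u] += w
        wsum[v] += w
    has_nbrs = wsum > 0
    agg[has_nbrs] /= wsum[has_nbrs, None]
    out = emb.copy()
    out[has_nbrs] = (1 - alpha) * emb[has_nbrs] + alpha * agg[has_nbrs]
    return out
```

Contigs with no neighbors in the assembly graph keep their initial embeddings, so isolated contigs are still binned on composition and abundance alone.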
Summary of the datasets used to compare binners
| Datasets | Total size (Gbp) | Reads N50 (kbp) | Assembly length (Gbp) | Contigs N50 (kbp) | Mean cov. | Contigs | Edges | Samples |
|---|---|---|---|---|---|---|---|---|
| Strong100 | 7.5 | 13.3 | 0.17 | 175.0 | 42 | 852 | 670 | 1 |
| Hjor | 16.0 | 8.7 | 0.86 | 80.4 | 13 | 19 496 | 5937 | 4 |
| Viby | 17.2 | 14.0 | 1.32 | 101.0 | 7 | 23 389 | 7800 | 4 |
| Damh | 26.7 | 14.3 | 1.93 | 119.0 | 8 | 32 771 | 14 066 | 4 |
| Mari | 23.3 | 10.1 | 1.69 | 83.1 | 8 | 36 611 | 12 651 | 4 |
| AalE | 27.7 | 10.2 | 1.92 | 83.4 | 8 | 40 827 | 12 425 | 4 |
| Hade | 45.2 | 9.8 | 3.01 | 73.9 | 9 | 70 402 | 27 952 | 4 |
| Soil | 115.0 | 7.7 | 1.98 | 93.3 | 19 | 51 135 | 69 522 | 1 |
Note: Total size refers to the total number of base pairs in the dataset. Reads N50 is the N50 length of the reads. Assembly length is the sum of the lengths of all contigs. Contigs N50 is the N50 value for contigs. Mean cov. refers to the mean base coverage of all contigs. Contigs and Edges refer to the number of contigs in each assembly and of edges in the assembly graph. Samples is the number of samples available to calculate abundance.
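The N50 columns above follow the standard definition: the length L such that sequences of length at least L together cover at least half of the total length. A small helper, as a sketch of that common definition:

```python
def n50(lengths):
    """N50: the length L such that contigs (or reads) of length >= L
    account for at least half of the summed length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, `n50([2, 3, 4, 5, 6])` returns 5, since the 6 kbp and 5 kbp sequences together already cover more than half of the 20 kbp total.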
Results obtained with GraphMB and state-of-the-art binning tools
| HQ bins | Strong100 | Hjor | Viby | Damh | Mari | AalE | Hade | Soil |
|---|---|---|---|---|---|---|---|---|
| GraphBin | 30 | 11 | 15 | 14 | 16 | 12 | 6 | 0 |
| Maxbin2 | 27 | 12 | 19 | 16 | 14 | 12 | 19 | 0 |
| SemiBin-ocean | 30 | 11 | 1 | 22 | 18 | 21 | 7 | 0 |
| SemiBin-train | 27 | 7 | 4 | 23 | 22 | 32 | 25 | 0 |
| VAMB | 28 | 22 | 12 | 22 | 30 | 37 | 28 | 0 |
| MetaBAT2 | 32 | 23 | 29 | 41 | 39 | 43 | 44 | 2 |
| GraphMB | 33 | 25 | 23 | 43 | 48 | 46 | 52 | 3 |
| Δ VAMB | 5 | 3 | 11 | 21 | 18 | 9 | 24 | 3 |
| Δ MetaBAT | 1 | 2 | −6 | 2 | 9 | 3 | 8 | 1 |
| Δ % VAMB | 15.2% | 12.0% | 47.8% | 48.8% | 37.5% | 19.6% | 46.2% | 100.0% |
| Δ % MetaBAT | 3.0% | 8.0% | −26.1% | 4.7% | 18.8% | 6.5% | 15.4% | 33.3% |
| GraphMB dRep unique | 0 | 1 | 2 | 4 | 6 | 8 | 12 | 2 |
| DASTool w/o GraphMB | 37 | 32 | 32 | 41 | 43 | 43 | 51 | 15 |
| DASTool w/GraphMB | 37 | 33 | 32 | 46 | 47 | 48 | 58 | 16 |
Note: The WWTP datasets are sorted by ascending size of assembly in terms of number of contigs. The Soil dataset is separate because it has a much higher complexity than the WWTP datasets.
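The Δ % rows appear to express the HQ-bin difference relative to GraphMB's own HQ count (e.g. Viby vs. VAMB: 11/23 ≈ 47.8%). A quick check, assuming that reading of the table:

```python
def delta_pct(graphmb_hq, baseline_hq):
    """Improvement over a baseline, as a percentage of GraphMB's HQ count."""
    return round(100 * (graphmb_hq - baseline_hq) / graphmb_hq, 1)
```

This reproduces the reported percentages for every dataset, including the 100% on Soil, where the baseline (VAMB) recovered no HQ bins.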
AMBER evaluation metrics on the simulated Strong100 dataset for GraphMB and state-of-the-art binning tools
| | AP (bp) | AC (bp) | F1 | HQ |
|---|---|---|---|---|
| GraphBin | 0.848 | 0.613 | 0.712 | 23 |
| MaxBin2 | 0.818 | 0.765 | 0.791 | 14 |
| SemiBin-ocean | 0.858 | 0.783 | 0.819 | 26 |
| SemiBin-train | 0.826 | | 0.823 | 20 |
| VAMB | | 0.755 | 0.849 | 26 |
| MetaBAT2 | 0.905 | 0.592 | 0.716 | 26 |
| GraphMB | 0.967 | 0.762 | | |
Note: The metrics used are Average Purity (AP, bp), Average Completeness (AC, bp), F1-score, and the number of HQ bins according to these purity and completeness metrics.
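The F1 column is consistent with the harmonic mean of AP and AC; for GraphBin, 2 · 0.848 · 0.613 / (0.848 + 0.613) ≈ 0.712. A small check of that relationship:

```python
def f1(purity, completeness):
    """Harmonic mean of average purity and average completeness."""
    return 2 * purity * completeness / (purity + completeness)
```

The same formula reproduces the reported F1 for MaxBin2, SemiBin-ocean, and MetaBAT2 to three decimals.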