Alice Chiodi, Francesco Comandatore, Davide Sassera, Giulio Petroni, Claudio Bandi, Matteo Brilli.
Abstract
In recent years, the advent of NGS technology has made genome sequencing much cheaper than in the past; the high parallelization capability and the possibility of sequencing more than one organism at once have opened the door to processing whole symbiotic consortia. However, this approach requires the development of specific bioinformatics tools able to analyze these data. In this work, we describe SeqDex, a tool that starts from a preliminary assembly obtained by sequencing a mixture of DNA from different organisms and identifies the contigs coming from one organism of interest. SeqDex is a fully automated machine learning-based tool that exploits partial taxonomic affiliations and compositional analysis to predict the taxonomic affiliations of contigs in an assembly. In the literature, there are few methods able to deconvolve host-symbiont datasets, and most of them rely heavily on user curation and are therefore time-consuming. The problem has strong similarities with metagenomic studies, where mixed samples are sequenced and the bioinformatics challenge is to separate contigs on the basis of their source organism; however, in symbiotic systems, additional information can be exploited to improve the output. To assess the ability of SeqDex to deconvolve host-symbiont datasets, we compared it to state-of-the-art methods for metagenomic binning and for host-symbiont deconvolution on three case studies. The results point to the good performance of the presented tool which, together with its ease of use and customization potential, makes SeqDex a useful tool for rapid identification of endosymbiont sequences.
Keywords: NGS; binning; deconvolution; machine learning; symbiont
Year: 2019 PMID: 31608107 PMCID: PMC6761303 DOI: 10.3389/fgene.2019.00853
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Contigs are used to obtain the read-pair graph by exploiting the paired sequencing (left branch). The network is used in several steps of the procedure, for instance, to extend the taxonomy information obtained through sequence comparison (middle branch). The k-mer frequencies are also calculated (right branch) and combined with the (extended) taxonomy. The contig dataset is then split in two depending on the presence of taxonomy labels; the labeled contigs are used to train the machine learning models (gray box) after partitioning them again into a training and a test set. Training of the models is repeated N times to provide error estimates that are independent of the actual contigs in the training and test sets. By default, classification is performed only at the Superkingdom level; if the user wants to proceed further down the taxonomy hierarchy, additional iterations, each focusing on a different taxonomic rank (green branch), can be performed. After that, SeqDex uses the trained models to predict the taxonomic affiliations of unlabeled contigs. Again, the read-pair network can be used to correct the predictions made by the machine learning models. At this point, contigs can be recovered, and two alternatives exist: (A) when there is more than one bacterium in the sequenced sample, the user can direct SeqDex along the flow marked (A): (i) run UMAP, (ii) run DBSCAN on the UMAP-transformed k-mer frequencies, (iii) identify the cluster containing the target 16S gene, (iv) extend the predicted taxonomy information using the read-pair graph, and (v) extract the contigs identified as coming from the target organism. Alternatively, (B) SeqDex can directly extract the contigs classified as coming from the target organism after the machine learning step.
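Two of the steps in Figure 1 can be illustrated with a minimal Python sketch: computing k-mer frequencies for a contig, and extending taxonomy labels along the read-pair graph. The function names, the default k = 4, the single-pass majority-vote rule, and the toy data are assumptions made for illustration; they are not taken from the SeqDex source code and may differ from what the tool actually does.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return relative frequencies of all 4**k DNA k-mers in a contig.

    k-mers containing characters other than A, C, G, T are ignored.
    (Hypothetical helper; k=4 is an assumed default.)
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


def extend_labels(graph, labels):
    """Give each unlabeled contig the majority label of its labeled neighbors.

    graph  : dict mapping contig id -> set of contigs linked by read pairs
    labels : dict mapping contig id -> taxonomy label (labeled contigs only)
    Ties and contigs without labeled neighbors stay unlabeled.
    Only the original labels vote, so the result is order-independent.
    (Assumed rule for illustration, not the published procedure.)
    """
    extended = dict(labels)
    for contig, neighbors in graph.items():
        if contig in labels:
            continue
        votes = Counter(labels[n] for n in neighbors if n in labels)
        ranked = votes.most_common(2)
        if not ranked:
            continue  # no labeled neighbor
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the two top labels
        extended[contig] = ranked[0][0]
    return extended


# Toy example: c1 is unlabeled but linked by read pairs to two bacterial contigs;
# c4 has no links and therefore stays unlabeled.
freqs = kmer_frequencies("ACGTACGTAC", k=2)
graph = {"c1": {"c2", "c3"}, "c2": {"c1"}, "c3": {"c1"}, "c4": set()}
labels = {"c2": "Bacteria", "c3": "Bacteria"}
extended = extend_labels(graph, labels)
```

In this sketch the k-mer frequency dictionary plays the role of the feature vector for one contig, and `extended` is the label set that would then be used to split contigs into labeled and unlabeled groups.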
Figure 2. Genome-based F1 scores for all datasets and targets considered in the work. (A) Simulated dataset; (B) Ca. Fokinia solitaria dataset; (C and D) Pratylenchus penetrans dataset: (C) Ca. Cardinium hertigii; (D) Wolbachia pipientis.
Performances calculated with respect to the whole genome sequence of Neisseria gonorrhoeae.
| | Blobology | MaxBin | SeqDex-SVM | SeqDex-RF |
|---|---|---|---|---|
| Sensitivity | 0.7442 | 0.9586 | 0.9586 | |
| Precision | 0.9518 | 0.9862 | 0.9862 | |
| Accuracy | 0.9909 | 0.9969 | | |
| F1 score | 0.8533 | 0.9576 | | |
Highest values in bold.
Performances of the classifications with respect to the whole Ca. Fokinia solitaria genome.
| | Blobology | BusyBee | MetaBAT | MaxBin | SeqDex-SVM | SeqDex-RF |
|---|---|---|---|---|---|---|
| Sensitivity | 0.8684 | 0.9864 | | | | |
| Precision | 0.9843 | 0.1422 | 1.0000 | 0.9246 | | |
| Accuracy | 0.9973 | 0.8881 | 0.9997 | 0.9984 | | |
| F1 score | 0.9227 | 0.2489 | 0.9932 | 0.9592 | | |
Highest values in bold.
Figure 3. In the main panel, we show the Blobology plot obtained for the Pratylenchus dataset as an example of cases in which host and symbiont(s) are not clearly discernible in the GC and coverage dimensions. In the inset, we represent the contigs from the symbionts (as identified through homology) in the UMAP space used by SeqDex to partition the contigs from the symbionts.
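The GC and coverage dimensions of a Blobology-style plot can be computed per contig roughly as in the sketch below. The function names and the input format (a contig sequence plus the lengths of the reads mapped to it) are assumptions for illustration and do not reproduce the Blobology or SeqDex code.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def mean_coverage(contig_length, mapped_read_lengths):
    """Average depth: total mapped read bases divided by contig length.

    Assumes every read lies entirely within the contig.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(mapped_read_lengths) / contig_length


# Toy contig: 10 bases, 4 of which are G or C, covered by three 5-base reads.
contig = "ATGCATGCAT"
point = (gc_content(contig), mean_coverage(len(contig), [5, 5, 5]))
```

Each contig becomes one `(GC, coverage)` point; when host and symbiont have similar GC content and sequencing depth, their points overlap and these two dimensions alone cannot separate them.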
Performances of the classifications with respect to the whole Ca. Cardinium hertigii genome retrieved from the Pratylenchus penetrans dataset.
| | Blobology | BusyBee | MetaBAT | MaxBin | SeqDex-SVM | SeqDex-RF |
|---|---|---|---|---|---|---|
| Sensitivity | 0.9781 | 0.8793 | 0.9289 | 0.9388 | 0.929 | |
| Precision | 0.0033 | 0.3466 | 0.6394 | 0.6649 | 0.6682 | |
| Accuracy | 0.1423 | 0.9946 | 0.9981 | 0.9983 | | |
| F1 score | 0.0067 | 0.5118 | 0.7662 | 0.7575 | 0.7773 | |
Highest values in bold.
Performances of the classifications with respect to the whole Wolbachia pipientis genome retrieved from the Pratylenchus penetrans dataset.
| | Blobology | BusyBee | MetaBAT | MaxBin | SeqDex-SVM | SeqDex-RF |
|---|---|---|---|---|---|---|
| Sensitivity | 0.9974 | 0.9922 | 0.9800 | 0.9810 | 0.9868 | 0.9868 |
| Precision | 0.0034 | 0.2091 | 0.2096 | 0.4273 | 0.4327 | |
| Accuracy | 0.5042 | 0.9936 | 0.9974 | 0.9936 | 0.9977 | |
| F1 score | 0.0068 | 0.3454 | 0.3453 | 0.5963 | 0.6016 | |
Highest values in bold.
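The four measures reported in these tables follow the standard definitions, which can be sketched as below. The function names and the toy counts are assumptions for illustration; whether the underlying counts are contigs or bases is not specified here. As a consistency check, the first sensitivity and precision values of the Wolbachia pipientis table (0.9974 and 0.0034) give an F1 score that rounds to the first value of the F1 row (0.0068).

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion counts."""
    sensitivity = tp / (tp + fn)            # also called recall
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def f1_from(precision, sensitivity):
    """F1 score as the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, chosen only to show the arithmetic.
toy = classification_metrics(tp=90, fp=10, tn=870, fn=30)

# Check against reported values from the Wolbachia pipientis table.
f1_check = round(f1_from(precision=0.0034, sensitivity=0.9974), 4)
```

The check illustrates why a method can reach near-perfect sensitivity and still score poorly: with very low precision, the harmonic mean stays close to the precision value.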