Florian Plaza Oñate1,2, Emmanuelle Le Chatelier2, Mathieu Almeida2, Alessandra C L Cervino1, Franck Gauthier2, Frédéric Magoulès3, S Dusko Ehrlich2, Matthieu Pichaud1.
Abstract
MOTIVATION: Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet these tools are limited by the available reference genomes, which are far from covering all microbial variability, as many species have not yet been sequenced or have only a few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique to discover and reconstitute the gene repertoire of microbial species. While current methods accurately identify species core parts, they miss many accessory genes or split them into small gene groups that remain unassociated with core clusters.
Year: 2019 PMID: 30252023 PMCID: PMC6499236 DOI: 10.1093/bioinformatics/bty830
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Simplified model illustrating the rationale behind the method. Each of the six samples except the fourth carries a strain of a microbial species, represented by a circle. The absolute abundance of each strain is indicated on the bottom right. Core genes (red, blue, yellow) are present in all the strains, while accessory genes (green, purple) are found only in some. In addition, the yellow gene is tagged as shared because it is observed in sample 4, which does not contain the species. After shotgun sequencing, core genes yield directly proportional mapped read counts across samples, the proportionality coefficient being roughly equal to the ratio of their lengths. In contrast, such a relationship between a core and an accessory gene is observed only in the subset of samples where the accessory gene is present.
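The rationale in the Fig. 1 caption can be illustrated with a noise-free toy model. The sketch below is not taken from the paper: the abundances, gene lengths, the set of samples carrying the accessory gene and the `BASES_PER_READ` depth factor are all assumed values chosen so that the arithmetic is exact.

```python
# Toy version of the Fig. 1 model. All numbers are illustrative assumptions.
ABUNDANCE = [4, 2, 1, 0, 3, 5]  # species abundance per sample; the fourth sample lacks the species
GENE_LENGTH = {"core_a": 1000, "core_b": 1500, "accessory": 500}  # hypothetical lengths (bp)
ACCESSORY_SAMPLES = {0, 2, 5}  # 0-based samples whose strain carries the accessory gene
BASES_PER_READ = 100  # hypothetical depth factor


def expected_counts(gene):
    """Noise-free mapped read counts: abundance times gene length, scaled by depth."""
    counts = []
    for sample, abundance in enumerate(ABUNDANCE):
        carried = gene != "accessory" or sample in ACCESSORY_SAMPLES
        counts.append(abundance * GENE_LENGTH[gene] // BASES_PER_READ if carried else 0)
    return counts


def count_ratios(gene_counts, reference_counts):
    """Per-sample count ratios, restricted to samples where both genes are detected."""
    return [g / r for g, r in zip(gene_counts, reference_counts) if g > 0 and r > 0]


core_a = expected_counts("core_a")
core_b = expected_counts("core_b")
accessory = expected_counts("accessory")

# Two core genes: the ratio equals the length ratio (1500 / 1000) in every carrier sample.
core_b_ratios = count_ratios(core_b, core_a)
# Accessory vs core: the length ratio (500 / 1000) holds only where the gene is present.
accessory_ratios = count_ratios(accessory, core_a)

print(core_b_ratios)
print(accessory_ratios)
```

In this toy setting the core-to-core ratio is constant over all five carrier samples, whereas the accessory-to-core ratio is constant only over the three samples in `ACCESSORY_SAMPLES`; the accessory gene has zero counts elsewhere.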
Fig. 2. Method for comparing gene count profiles and classifying genes in MSPs. The counts of a gene are compared across metagenomic samples to the counts of the core seed with which it is associated. The coefficient of proportionality between the two is estimated to be 0.75. The solid line, whose slope is this coefficient, corresponds to expected counts. Dashed lines represent the gene quantification thresholds before and after adjustment according to this coefficient. Black and grey crosses are respectively structural and undetermined zeros. Only structural zeros are taken into account to assign the gene to a given class (cf. braces). Black and grey points are respectively inlier and outlier samples. The distance between the unique outlier and the expected proportional count corresponds to the residual.
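The quantities named in the Fig. 2 caption (coefficient of proportionality, expected counts, zeros, inliers, outliers, residual) can be made concrete with a small sketch. This is not the estimator used by MSPminer: the median of per-sample ratios stands in as a simple robust estimate, and the `threshold` and `tolerance` parameters, the rule separating structural from undetermined zeros, and the count vectors are all assumptions made for illustration.

```python
from statistics import median


def estimate_alpha(gene, seed):
    """Median of per-sample count ratios: a simple robust stand-in estimator (assumption)."""
    return median(g / s for g, s in zip(gene, seed) if g > 0 and s > 0)


def classify_samples(gene, seed, alpha, threshold=3.0, tolerance=0.5):
    """Label each sample; `threshold` and `tolerance` are hypothetical parameters.

    Assumed rule: a zero count is 'structural' when the expected count
    alpha * seed reaches the quantification threshold, 'undetermined' otherwise.
    """
    labels = []
    for g, s in zip(gene, seed):
        expected = alpha * s
        if g == 0:
            labels.append("structural_zero" if expected >= threshold else "undetermined_zero")
        elif abs(g - expected) <= tolerance * expected:
            labels.append("inlier")
        else:
            labels.append("outlier")
    return labels


# Illustrative counts: the gene follows 0.75 x seed except for one outlier (last sample).
seed = [40, 20, 8, 2, 60, 100, 80]
gene = [30, 15, 6, 0, 0, 75, 20]

alpha = estimate_alpha(gene, seed)
labels = classify_samples(gene, seed, alpha)
# Residual: observed count minus expected proportional count.
residuals = [g - alpha * s for g, s in zip(gene, seed)]

print(alpha)
print(labels)
```

With these made-up vectors the median ratio is unaffected by the single outlier, which is the reason for preferring a robust estimate over a mean of ratios in this sketch.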
Fig. 3. MSPminer workflow
Fig. 4. Evaluation of the measures of proportionality. (A) Comparison of Pearson's correlation coefficient, Spearman's correlation coefficient and the proposed measure of proportionality for detecting an association between the median abundance vector of the core genes of the simulated species and the abundance vector of each of its genes. The x-axis corresponds to the percentage of samples where a gene is detected and the y-axis corresponds to the intensity of the relationship between the compared vectors. The closer the value is to 1, the stronger the relationship. (B) Comparison of the performance of the robust (black) and the non-robust (grey) measures of proportionality for detecting a relationship between the noisy abundance vector of each gene of the simulated species and the outlier-free median abundance vector of its core genes. The proportion of outliers is gradually increased to 5%, 10% and 20%.
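As a rough illustration of the comparison described in panel (A), the sketch below computes Pearson's and Spearman's coefficients between a core abundance vector and a gene detected in only half of the samples. The vectors are invented, the coefficients are implemented from their textbook definitions, and the proposed measure of proportionality itself is not reproduced here; restricting Pearson's coefficient to the samples where the gene is detected is used only as a crude stand-in.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient from its textbook definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient applied to ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative vectors: the gene equals 0.5 x core where present and 0 elsewhere.
core = [40, 20, 10, 30, 50, 60]
gene = [20, 0, 5, 0, 25, 0]

pearson_all = pearson(core, gene)
spearman_all = spearman(core, gene)

# Crude stand-in for a proportionality-aware comparison: keep detected samples only.
detected = [(c, g) for c, g in zip(core, gene) if g > 0]
pearson_detected = pearson([c for c, _ in detected], [g for _, g in detected])

print(pearson_all, spearman_all, pearson_detected)
```

On these vectors both coefficients computed over all samples fall well below 1 even though the gene is exactly proportional to the core wherever it is detected, while the coefficient restricted to detected samples is 1 up to floating-point error.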
Fig. 5. Evaluation of the clustering algorithm. (A) Impact on clustering of the number of samples in which the simulated species is detected. (B) Impact of strain mixture on clustering.