Bayesian identification of bacterial strains from sequencing data.

Aravind Sankar¹, Brandon Malone^1,2, Sion C Bayliss³, Ben Pascoe³, Guillaume Méric³, Matthew D Hitchings⁴, Samuel K Sheppard³, Edward J Feil³, Jukka Corander^5,6, Antti Honkela¹.

Abstract

Rapidly assaying the diversity of a bacterial species present in a sample obtained from a hospital patient or an environmental source has become possible after recent technological advances in DNA sequencing. For several applications it is important to accurately identify the presence and estimate relative abundances of the target organisms from short sequence reads obtained from a sample. This task is particularly challenging when the set of interest includes very closely related organisms, such as different strains of pathogenic bacteria, which can vary considerably in terms of virulence, resistance and spread. Using advanced Bayesian statistical modelling and computation techniques we introduce a novel pipeline for bacterial identification that is shown to outperform the currently leading pipeline for this purpose. Our approach enables fast and accurate sequence-based identification of bacterial strains while using only modest computational resources. Hence it provides a useful tool for a wide spectrum of applications, including rapid clinical diagnostics to distinguish among closely related strains causing nosocomial infections. The software implementation is available at https://github.com/PROBIC/BIB.

Keywords: pathogenic bacteria; probabilistic modelling; staphylococcus aureus; strain identification

Substances：
DNA, Bacterial

Year: 2016 PMID： 28348870 PMCID： PMC5320594 DOI： 10.1099/mgen.0.000075

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

1. Benchmarking data have been deposited in Figshare; DOI: 10.6084/m9.figshare.1617539 (url – http://figshare.com/articles/Benchmarking_data_for_bacterial_strain_identification/1617539)

Impact Statement

Bacterial samples can today be routinely scrutinized with in-depth sequencing covering the majority of the genomes of the organisms present. This provides a basis for discovering the presence of mixed colonies and identification of the relative abundances of different strains at subspecies level. Here we present a novel analysis pipeline by which such identification problems can be solved considerably more rapidly and accurately compared with the existing methods, while only using modest computational resources. Our pipeline is a useful tool for a wide spectrum of applications, including contamination detection and rapid clinical diagnostics to distinguish among closely related strains causing nosocomial infections.

Introduction

Different strains of pathogenic bacteria are known to often vary in terms of virulence, resistance and geographical spread (Méric ). Rapid and inexpensive sequence-based identification of the strain(s) colonising a patient would be highly desirable. Previous research shows that patients can often host several strains of specific species of the genus Staphylococcus (Ueta ). The current approach is to isolate single colonies, assuming that the sample is homogeneous. Here we consider an approach which allows a robust test of this assumption, and if there is diversity, an efficient means to compare similarities between whole host populations present in different individuals. Since single random colonies might be misleading, it is beneficial to allow for a more flexible approach where pooled colony data can be directly utilized. With the growing tendency to routinely sequence samples from infected patients in the hospital environment, the identification would be additionally advantageous for pathogen surveillance and monitoring purposes without necessitating the use of extensive computational resources for de novo genome assembly. Moreover, samples with mixed presence of several strains are problematic for assembly-based analyses, which calls for alternative approaches. The identification of bacteria from sequencing data has been widely considered in metagenomic community profiling (Segata ; Franzosa ). As our primary identification and estimation focus is at a much higher level of resolution than in typical metagenomics studies, whole-genome or whole-metagenome shotgun sequencing data is by definition a necessity for a successful implementation of a platform for this purpose. Typical metagenomic approaches for such data are based on defining a set of markers for each clade of interest (Segata ; Sunagawa ). However, these methods are typically not sensitive enough to identify the pathogens responsible for infections in sufficient detail. 
Eyre have presented a method for detecting mixed infections, but the method assumes there are at most two strains in each sample, which may not hold, in particular if a sample has become contaminated at any phase of the preparation and sequencing process. A Bayesian statistical method capable of using all the sequencing data was recently introduced (Francis ; Hong ), but our experiments suggest that its practical performance may also be inadequate. The computational problem in bacterial strain identification is analogous to the widely studied problem of transcript isoform expression estimation in RNA-sequencing (RNA-seq) data analysis: both aim at identifying and quantifying the abundance of several closely related sequences from short-read data. In both cases a significant fraction of reads will align perfectly to multiple sequences of interest. Several probabilistic models have been proposed for solving this problem (Xing ; Jiang & Wong, 2009). Based on its success in recent assessments of methods for tackling this problem (SEQC/MAQC-III Consortium, 2014; Kanitz ), we use the BitSeq (Glaus ; Hensman ) method to obtain a fast and accurate solution in our Bayesian Identification of Bacteria (BIB) pipeline for bacterial strain identification from unassembled sequence reads. In this paper we focus on Staphylococcus aureus and Staphylococcus epidermidis, which represent two of the most widespread causes of nosocomial infections and impose a considerable burden on the public health system worldwide (Harris ; Méric ). Using a diverse collection of strains from these two species as a model system, we demonstrate that clinically relevant, fast and highly accurate identification of the strains colonising a patient is possible in less than 10 min on a standard single-CPU desktop computer. Our BIB pipeline improves significantly upon the state-of-the-art approach for sequence-based identification of bacteria.

Methods

Our pipeline is built by a combination of the following two central ideas: defining core genomes of the target set of strains by excluding more variable regions to strengthen the analysis, and using a fast fully probabilistic method to estimate the relative frequencies of the target strains in a sample. These ideas translate to two analysis steps: Step 1. Cluster the strains, perform multiple sequence alignment to find the strain-specific common core genome and construct an index for read alignment. Step 2. Align the reads to the reference core genomes allowing multiple matches and use a probabilistic method to estimate the strain abundances using the alignments. Step 1 only needs to be done once for each collection of reference sequences while Step 2 needs to be performed for every sample. The two steps will be detailed further below, followed by description of the synthetic data generation process and characterisation of the real data which are used for empirical evaluation and comparison against the leading alternative identification method.

Step 1: Reference strain selection and core genome extraction.

We demonstrate our pipeline on a collection of 30 S. aureus and 3 S. epidermidis strains whose phylogenetic tree is illustrated in Fig. 1. The tree was reconstructed using the UPGMA method with p-distance in the MEGA6 software (Tamura ). The tree displays a natural partition with 13 S. aureus strain clusters, each of which corresponds to an already established clonal complex (Feil & Enright, 2004), while each S. epidermidis strain forms a cluster of its own, representing the three previously identified main complexes within the species (Méric ). The strains selected to represent each cluster are indicated by bold type in Fig. 1.
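As an illustration of the tree-building step, the following is a minimal sketch of p-distance computation followed by UPGMA clustering. It is not the MEGA6 implementation; the function names `p_distance` and `upgma` and the toy sequences are assumptions made for this example only.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(names, seqs):
    """Return the UPGMA merge order as (cluster_a, cluster_b, distance) tuples."""
    clusters = {i: (name,) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): p_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(names)), 2)}
    merges, next_id = [], len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        i, j = sorted(pair)
        merges.append((clusters[i], clusters[j], dist[pair]))
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k not in pair:
                # Size-weighted average distance to the merged cluster
                d_new = (ni * dist[frozenset((i, k))]
                         + nj * dist[frozenset((j, k))]) / (ni + nj)
                dist[frozenset((next_id, k))] = d_new
        # Drop all distances that involve the two merged clusters
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges

# Toy aligned sequences (hypothetical, not real Staphylococcus data)
toy_names = ["A", "B", "C"]
toy_seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAATTTT"]
toy_merges = upgma(toy_names, toy_seqs)
```

In this toy case the two most similar sequences are joined first, and the third joins at the average of its distances to the first two.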

Fig. 1.

Phylogenetic tree of the investigated Staphylococcus strains. Inset: Enlarged view of the S. aureus branch illustrating the clustering of the strains within clonal complexes. The scale measures base-level sequence dissimilarity, showing that the S. aureus clusters differ by approximately two to ten substitutions every 1 kb while strains within each cluster differ by less than one substitution every 5 kb.

Microbial genomes are often highly dynamic and susceptible to horizontal gene transfer and translocation of genomic regions (Gogarten ; Lawrence, 2002). As a consequence, mobile elements may confuse genome-based identification of strains. In order to avoid issues with misalignment of reads and incorrect abundance estimates, we discard the non-core parts of the reference genomes and use only core alignment, i.e. parts of the genome shared by all strains of a species, as a basis for the analysis. A multiple sequence alignment for the 16 cluster prototype bacterial strains shown in bold in Fig. 1 was obtained using progressive Mauve (Darling ). The accessory genome regions were detected and discarded using the standard settings, resulting in an ungapped core alignment which was used to represent the genomic variation in the target set of strains. These ungapped sequences are used to construct an index for read alignment.
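The idea of an ungapped core alignment can be sketched as follows. This is a simplified column-wise filter, assuming the multiple alignment is already available as equal-length strings; the actual alignment tool works on larger aligned blocks, and the function name `ungapped_core` and the toy input are hypothetical.

```python
def ungapped_core(alignment):
    """Keep only alignment columns in which every strain has a base (no '-').

    alignment: dict mapping strain name -> aligned sequence (equal lengths).
    Returns a dict mapping strain name -> ungapped core sequence.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if "-" not in col]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}

# Toy alignment (hypothetical): columns 3 and 4 contain gaps and are dropped
toy_alignment = {
    "strain1": "ACG-TA",
    "strain2": "AC-TTA",
    "strain3": "ATGTTA",
}
toy_core = ungapped_core(toy_alignment)
```

All resulting core sequences have the same length, which is what makes read proportions comparable across strains of one species.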

Step 2: Strain abundance estimation.

The gapless core genomes extracted as described above were considered as the reference sequences in the BitSeq (Glaus ; Hensman ) method to estimate the relative proportion of each strain in our reference collection in a sample. We used Bowtie 2 (Langmead & Salzberg, 2012) to align the reads to the reference sequences allowing for multiple matches. We then used estimateVBExpression from BitSeq to estimate the relative proportions of each of the strains in the sequenced samples. Our full method pipeline is referred to as Bayesian Identification of Bacteria (BIB) in the remainder of the article.

Abundance estimation model in detail.

The strain abundance estimation was based on a statistical model of sequencing data as a mixture of reads from a set of known reference sequences (Xing ; Li ). The relative abundances of the sequences are the unknown parameters θ. In our case the references were the core genomes of randomly selected representatives of each cluster. Reads not mapping to the core genomes were ignored. After introducing indicator variables I_n defining the sequence of origin of each read r_n, the likelihood p(r_n | I_n) of a read (single or paired-end) is defined in Equation (1) of Glaus and depends on the mismatches in the alignment as well as the length of the reference sequence. The position model was not used in BIB because it would be difficult to estimate with almost no unique alignments. We used a conjugate Dirichlet(α,...,α) prior over θ with α = 1. Smaller α would mean weaker regularisation, but α ≥ 1 is needed for log-concavity of the model, which aids convergence. We used fast collapsed variational inference to optimise an approximate posterior distribution over I after marginalising out θ (Hensman , 2015). The posterior distribution over the unknown abundances θ was obtained from these as in Hensman .
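To make the mixture model concrete, the sketch below estimates θ with a plain expectation-maximisation (EM) loop for the maximum a posteriori solution under a symmetric Dirichlet(α) prior. This is a simpler stand-in for illustration, not the collapsed variational inference used by BitSeq; the function name `em_abundances` and the toy likelihood table are assumptions.

```python
def em_abundances(likelihoods, alpha=1.0, iterations=200):
    """MAP estimate of mixture proportions theta by EM.

    likelihoods[n][m] is p(r_n | I_n = m), the likelihood of read n under
    reference m (0.0 if the read does not align to m). Reads with no
    alignment at all must be removed beforehand. Requires alpha >= 1.
    """
    n_refs = len(likelihoods[0])
    theta = [1.0 / n_refs] * n_refs
    for _ in range(iterations):
        counts = [0.0] * n_refs
        for row in likelihoods:
            # E-step: posterior probability that the read came from each reference
            weights = [t * l for t, l in zip(theta, row)]
            total = sum(weights)
            for m, w in enumerate(weights):
                counts[m] += w / total
        # M-step: Dirichlet-regularised update (alpha = 1 gives maximum likelihood)
        denom = len(likelihoods) + n_refs * (alpha - 1.0)
        theta = [(c + alpha - 1.0) / denom for c in counts]
    return theta

# Toy data (hypothetical): 3 reads unique to reference 0, 1 unique to
# reference 1, and 4 reads that align equally well to both.
toy_likelihoods = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] + [[1.0, 1.0]] * 4
toy_theta = em_abundances(toy_likelihoods)
```

In the toy data the ambiguous reads carry no information about the split, so the estimate is driven by the 3:1 ratio among the unique reads.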

Generation of data for validation experiments.

For the primary set of experiments, each sample was created by randomly mixing the reads from a number of real single-strain sequencing data sets (Data citations 1–3). Details of the strains and the used mixing proportions are provided in Table S1 (available in the online Supplementary Material) and the full data set is available through Data citation 4. These data are obtained independently of the reference sequences used in the model and represent realistic sequencing data obtained from other strains in the same clusters. To test more thoroughly the effect of dropped clusters in the presence of a more diverse representation of different strains, we additionally simulated reads using MetaSim (Richter ).
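A mixture with known proportions can be assembled along the following lines. This is only a sketch: the exact sampling procedure is not described here, so sampling without replacement, the function name `mix_reads` and the toy read identifiers are all assumptions.

```python
import random

def mix_reads(reads_by_strain, proportions, total_reads, seed=0):
    """Draw reads from single-strain data sets to form a mixed sample.

    reads_by_strain: dict strain -> list of reads from a single-strain run.
    proportions: dict strain -> target share of the mixed sample (sums to 1).
    """
    rng = random.Random(seed)
    mixed = []
    for strain, share in proportions.items():
        n = round(share * total_reads)
        # Sample without replacement from that strain's own reads
        mixed.extend(rng.sample(reads_by_strain[strain], n))
    rng.shuffle(mixed)
    return mixed

# Toy reads (hypothetical): each read is tagged with its strain of origin
toy_reads = {
    "strainA": [("strainA", i) for i in range(100)],
    "strainB": [("strainB", i) for i in range(100)],
}
toy_mix = mix_reads(toy_reads, {"strainA": 0.7, "strainB": 0.3}, total_reads=50)
```

Because the origin of every read is known, the true proportions of the mixture are known exactly, which is what makes such mixtures suitable for benchmarking.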

Results

We tested the BIB pipeline on several DNA sequencing data sets from Staphylococcus strains. We used two different types of data sets: data sets with artificial mixtures of genuine reads from single strain sequencing experiments, and synthetic data sets generated using MetaSim. We report the results from our pipeline and compare against Pathoscope 2 (Francis ; Hong ) as well as naive estimation from strain frequencies among uniquely mapping reads. To ensure that the other methods can fully utilise the same information about the strains, we used the same read alignments as input to all methods, essentially only replacing the final abundance estimation step in our pipeline.
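The naive baseline can be sketched as counting cluster frequencies among reads that align to exactly one cluster. The input format (a mapping from read identifier to the set of clusters it aligns to) and the function name `unique_read_proportions` are assumptions made for this illustration.

```python
from collections import Counter

def unique_read_proportions(alignments, clusters):
    """Estimate cluster proportions from uniquely aligning reads only.

    alignments: dict read id -> set of clusters the read aligns to.
    clusters: iterable of all cluster names in the reference index.
    """
    counts = Counter(next(iter(hits))
                     for hits in alignments.values() if len(hits) == 1)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in clusters}

# Toy alignments (hypothetical): most reads are ambiguous and get discarded
toy_alignments = {
    "read1": {"CC1"},
    "read2": {"CC1"},
    "read3": {"CC2"},
    "read4": {"CC1", "CC2"},
    "read5": {"CC1", "CC2", "CC3"},
    "read6": {"CC1"},
}
toy_unique = unique_read_proportions(toy_alignments, ["CC1", "CC2", "CC3"])
```

The sketch makes the weakness of this baseline visible: every multi-mapping read is ignored, so the estimate rests on a small fraction of the data.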

Clustering and selection of strains

The strains used in the experiment and their phylogenetic relationships are illustrated in Fig. 1. The phylogenetic tree illustrates the clonal complex (CC) structure of the S. aureus population (Feil & Enright, 2004), where members of the same complex are highly similar and interchangeable in terms of strain identification (Méric ). Choosing one representative for each CC corresponds to the clustering illustrated in Fig. 1.

Identification of Staphylococcus strains from sequencing data

We generated 30 synthetic mixtures of sequencing reads from different strains of species of the genus Staphylococcus as described and analysed these data sets using BIB. As a benchmark, we also tested the same identification and quantification using Pathoscope instead of BitSeq. Each analysed data set contained a mixture of two to six Staphylococcus strains. The number of reads varied between one million and three million. The data sets mixed from previous data (Data citations 1–3) are available in Figshare (Data citation 4). Details of the samples and mixing proportions are given in Table S1. Strain-level identification is very difficult, as typically only around 0.1–0.2 % of the reads map uniquely to the core genome. Full genome alignments have more unique hits, but given the volatility of the accessory genome these are also likely to be more misleading. The absolute errors in the abundance estimation in the experiments are illustrated in Fig. 2. We split our analysis into two cases: strains not present in the samples (true negatives) and strains that are present (true positives). All methods are reliable in identifying true negatives. For true positives, BIB consistently provides very accurate quantification (absolute errors mean ± sd 0.014 ± 0.023) while Pathoscope and the naive unique mapping read analysis are significantly less accurate (Pathoscope absolute errors 0.11 ± 0.11, unique reads 0.14 ± 0.12). BIB quantification results remain accurate all the way down to the least abundant strains, which had only 3 % abundance in our data. A scatter plot in Fig. 3 comparing the errors of BIB and Pathoscope for each experiment shows that BIB is essentially always more accurate than Pathoscope (P < 10^−16; Wilcoxon signed rank test) and often by a wide margin.
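The error summary described above can be computed along these lines. This is a sketch under assumptions: the helper names are hypothetical, the sample standard deviation is used (the text does not state which form was reported), and the toy numbers are not taken from the experiments.

```python
from statistics import mean, stdev

def split_errors(true_props, est_props):
    """Absolute errors, split into strains present and strains absent."""
    present, absent = [], []
    for cluster, truth in true_props.items():
        err = abs(est_props.get(cluster, 0.0) - truth)
        (present if truth > 0 else absent).append(err)
    return present, absent

def summarise(errors):
    """Mean and sample standard deviation of a list of absolute errors."""
    sd = stdev(errors) if len(errors) > 1 else 0.0
    return mean(errors), sd

# Toy sample (hypothetical proportions)
toy_truth = {"CC1": 0.7, "CC2": 0.3, "CC3": 0.0}
toy_estimate = {"CC1": 0.68, "CC2": 0.31, "CC3": 0.01}
toy_present, toy_absent = split_errors(toy_truth, toy_estimate)
toy_present_mean, toy_present_sd = summarise(toy_present)
```

Pooling `present` and `absent` errors over all samples before calling `summarise` would give figures of the same form as the mean ± sd values quoted in the text.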

Fig. 2.

Magnitudes of errors in proportion estimates of BIB, Pathoscope and naive estimation among uniquely mapping reads (Unique) in strains really present in the experiment (true positives; left) and those not present in the experiment (true negatives; right). The “Unique” method is implemented by simply computing the frequencies of different strain clusters among unique alignments. Lower values indicate better results.

Fig. 3.

Scatter plot comparing the estimation errors of BIB and Pathoscope on true positives. Points below the diagonal are cases where BIB is more accurate while points above the diagonal are cases where Pathoscope is more accurate.

Estimation in the presence of contaminant species

The alignment of reads to reference genomes makes BIB highly robust to contamination by unrelated species. We tested this by generating ten of the samples with 3–30 % contamination from Bacillus subtilis subsp. subtilis strain 168. Full details of the experiments are given in Table S1. After filtering the non-aligning reads, which include most of the Bacillus reads, the estimation accuracy on the proportions of Staphylococcus strains is almost as good as with the uncontaminated samples, as illustrated in Fig. 4. The corresponding median errors are 0.002 for the uncontaminated samples and 0.01 for the contaminated samples, respectively. Addition of the contaminant reads is visible as a drop in the total rate of aligned reads, but given the significant and variable number of unmappable reads originating from the auxiliary genome the mapping rate is at best an unreliable measure of the contamination level.

Fig. 4.

Comparison of errors in estimation of proportions of Staphylococcus strains with and without Bacillus contamination. Errors on contaminated samples are slightly higher, but overall still very low.

Estimation in the presence of unknown strain clusters

When the reads of unknown origin stem from a species or strain related closely enough to allow for the reads aligning well with those included in the index, they tend to be assigned to the most closely related included reference strains. This is illustrated by two examples in Fig. 5. In the first example dropping Cluster 1 from the index causes the reads to get assigned to Cluster 2 which is in the evolutionary sense closest to Cluster 1 in the phylogenetic tree in Fig. 1. In the second example, dropping Cluster 13, results in the reads getting split more evenly among the available alternatives because the branch to Cluster 13 splits off from the rest very early.

Fig. 5.

Two examples of error spectra when some strain clusters present in a sample are not included in the index. The plots show the profile of true and estimated proportions (top) as well as the errors in the estimation (bottom). The error profile lines will always show a bump at the dropped cluster index because they cannot be estimated while the other shape shows how the reads get reassigned.

Estimation without clustering

Clustering of very similar strains when defining the reference set is an essential part of BIB. Fig. 6 shows a typical example of the consequences of excluding the clustering step. As seen, the contribution of a single cluster representative truly present in a sample tends to get split up between all strains representing the same cluster in the reference set as they are too similar to be differentiated. Furthermore, the method is unable to separate strains 1–9 belonging to Clusters 1 and 2, even though the two were usually properly separated in the experiments with clustering of the reference strains. This is most likely because the difference from using six or nine strains to represent the data is not as substantial as the difference between one or two strain clusters where the clearly simpler model is able to drive the other coefficient to zero. It is likely that no statistical method would be able to truthfully resolve the origins of the reads when the sources are too similar to each other. Hence, it is of importance to ensure biological meaningfulness of the reference set of strains prior to assignment.

Fig. 6.

An example of error profile in strain abundance estimation without clustering. The vertical dotted lines indicate the borders between different clusters.

Analysis of clinical S. aureus samples

To illustrate the practical applicability of BIB we tested it on S. aureus short-read data generated at the Wellcome Trust Sanger Institute as part of a Europe-wide surveillance project [“Genetic diversity in Staphylococcus aureus (European collection)” study; Data citation 5], with kind permission from Matthew Holden. Initial analysis of these data revealed they were of poor quality, probably resulting from contamination, and for this reason they have not previously been published. All isolates were recovered from cases of invasive S. aureus disease. The estimated abundance profiles of selected samples are shown in Fig. 7. In the top two isolates (ERR038357 and ERR038367) a single cluster is robustly identified (> 95 % share for the dominant strain, all other shares < 1 %), indicating that the level of contamination in these samples is low. In contrast, isolates ERR033658 and ERR033686 (rows 3 and 4) show clearer evidence of mixed clusters due to contamination. We also note that the cluster profiles are similar within these two samples, which is consistent with a single source of contamination for both runs. Isolate ERR038366 (bottom row) represents a completely failed sample, possibly caused by problems with sequence barcoding.

Fig. 7.

Estimated cluster abundance profiles from diverse clinical samples. The two top rows represent clean samples where one cluster clearly dominates. Rows 3 and 4 represent contaminated samples where the true cluster can still be fairly reliably identified. The bottom row shows a completely failed sample, possibly due to problems with sequence barcoding.

Analysis of samples of a more recombinant species S. epidermidis

To test the applicability of BIB to more recombinant species than S. aureus, we applied it to 83 S. epidermidis isolates sequenced by Méric (Data citation 6, details in Table S2). We analysed the samples using two clusterings of an extended set of 143 strains analysed by Méric : a coarse clustering representing a clinically relevant partition of the population into three clonal complexes, and a more detailed clustering into 11 groups which subdivides the clonal complexes further. We randomly selected representatives from each cluster as prototype isolates in the BIB database. Samples used as prototypes were excluded from the analysis, leaving 82 samples in the three-cluster case and 75 in the 11-cluster case. The estimated relative abundances of the true and other clusters are shown in Fig. 8. The figure shows that in the case of the clinically motivated population partition the correct cluster is always identified as dominant with a large margin, but the absolute errors are larger than in the S. aureus case, typically around 20 %. The results in the 11-cluster case show more variability, but the correct cluster is still identified as dominant in 85 % of the cases and the abundance estimates for the other clusters are in most cases very small.
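The "correct cluster identified as dominant" criterion can be expressed as a short check. The function name `dominant_hit_rate` and the toy abundance profiles are hypothetical and serve only to show how such a percentage would be computed.

```python
def dominant_hit_rate(samples):
    """Fraction of samples whose most abundant estimated cluster is the true one.

    samples: list of (true_cluster, {cluster: estimated_abundance}) pairs.
    """
    hits = sum(max(est, key=est.get) == truth for truth, est in samples)
    return hits / len(samples)

# Toy single-isolate samples (hypothetical estimates)
toy_samples = [
    ("C1", {"C1": 0.80, "C2": 0.15, "C3": 0.05}),
    ("C2", {"C1": 0.10, "C2": 0.85, "C3": 0.05}),
    ("C3", {"C1": 0.55, "C2": 0.05, "C3": 0.40}),  # misidentified
    ("C1", {"C1": 0.70, "C2": 0.20, "C3": 0.10}),
]
toy_rate = dominant_hit_rate(toy_samples)
```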

Fig. 8.

Estimated relative abundances of the correct and other clusters in S. epidermidis data with two different clusterings. (Higher values are better for correct cluster, lower values for other clusters.) The results show more leakage of estimates to incorrect clusters, but the true cluster is still identified as dominant in all cases with three clusters and in 85 % of cases with 11 clusters.

Runtime

For a new sample, the pipeline requires running of programs for read alignment (Bowtie 2) and abundance estimation (BitSeq being the core part of BIB). The time required by these two steps is approximately equal. A typical analysis of 1 M reads takes approximately 10 min on a single-CPU desktop computer representing a standard level of hardware.

Discussion

Interpretation of proportions and benchmarking

Our pipeline can estimate strain abundances as proportions of the sequencing reads. These would be expected to be related to the proportions of DNA from the different strains. Depending on the relative lengths of different genomes, this may deviate slightly from cell counts between species, but should be consistent within a species because we only consider the shared core genome of equal length. Such minor variations should not affect any practical applications. Our empirical evaluation is based solely on synthetic mixtures of sequencing reads from different single-strain sequencing experiments. Such mixtures are necessary to enable accurate benchmarking of the methods. Because we use actual reads from various experiments they will not perfectly match the reference and thus represent a much more realistic test than synthetic reads generated from references. Experiments based on laboratory-derived mixed cultures would add significant extra uncertainty because it is difficult to accurately control the strain proportions during the cultivation process.
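The relationship between read proportions and cell proportions can be illustrated with a small calculation. It assumes that the number of reads from an organism is proportional to its cell count multiplied by the length of the sequence the reads are mapped to, and nothing else; the function name `reads_to_cells` and the lengths used are hypothetical.

```python
def reads_to_cells(read_props, genome_lengths):
    """Convert read proportions to cell proportions.

    Assumes reads from organism s are proportional to cells_s * length_s,
    so cell shares are read shares divided by length, then renormalised.
    """
    per_length = {s: read_props[s] / genome_lengths[s] for s in read_props}
    total = sum(per_length.values())
    return {s: v / total for s, v in per_length.items()}

# Two species with different (hypothetical) core genome lengths
between_species = reads_to_cells(
    {"species_a": 0.5, "species_b": 0.5},
    {"species_a": 2.0e6, "species_b": 2.5e6},
)

# Two strains of one species sharing a core genome of equal length
within_species = reads_to_cells(
    {"strain_1": 0.7, "strain_2": 0.3},
    {"strain_1": 2.0e6, "strain_2": 2.0e6},
)
```

With equal core genome lengths the read and cell proportions coincide, while between species of different lengths they differ by a few percentage points in this toy case.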

Applicability to different bacterial species

The main assumption behind our BIB method is that each putative biologically meaningful source is adequately represented by a single core genome sequence to which the reads can be mapped. As illustrated in this paper, this works with high fidelity for species like S. aureus whose population structure has clear well-separated lineages (Méric ). The results with more highly recombining species such as S. epidermidis show more variability in the proportion estimates, but the correct cluster is still always identified as the dominant one in the clinically most relevant three-cluster case and in a great majority of samples in the more detailed scenario. Improving the performance for more recombinant species is an important avenue of future work. The current state-of-the-art probabilistic identification method Pathoscope 2 (Francis ; Hong ) is essentially based on a similar assumption and is expected to be similarly vulnerable to strong deviations from the assumption. However, our experiments demonstrated that BIB delivers a considerably higher level of estimation accuracy without requiring more extensive computational resources.

As illustrated in the results shown in Fig. 6, clustering the strains is essential for accurate identification results. It is not surprising that distinguishing among multiple highly related strains is not feasible. It is more striking that clustering also aids identification of read origin between the more separated sources. We suspect this may be due to the prior used in the Bayesian model, but further work is needed to properly understand the phenomenon. In transcript-level RNA-seq analysis clustering of similar transcripts has been suggested for improving the accuracy by Turro . Unlike our off-line clustering, their algorithm is run on-line together with the inference separately for every sample. Our approach can easily incorporate additional expert knowledge and guarantee consistent clustering, making interpretation of the results more straightforward. This approach is expected to work especially well for any species that has clear subpopulation boundaries, since every potentially mixed sample will correspondingly have a clearly delineable structure among its reads, apart from those representing contamination, which can be efficiently filtered out by our pipeline.

The S. epidermidis results highlight a trade-off in the number of clusters: increasing the number of clusters raises the risk of misidentification while providing potentially more relevant information. The optimal number of clusters clearly depends on the aim of the analysis. Selection of the representative sequence for each cluster for more highly recombinant species like S. epidermidis is an interesting question for future work. It seems that the optimal answer will depend on the assumed distribution of analysed samples. Presumably one could simulate data from the non-selected representatives in each cluster and test the method, but this could become quite computationally demanding, as the choices in different clusters are clearly not independent. Here we have used random selection to avoid such questions.

Relationship to transcript-level RNA-seq analysis

The underlying statistical problem in bacterial strain identification is the same as that underlying most transcript-level RNA-seq expression estimation methods: how to estimate the probability of a read originating from a given reference sequence. A number of methods solve the same problem there, including RSEM (Li ; Li & Dewey, 2011), Cufflinks (Trapnell ), Miso (Katz ), BitSeq (Glaus ; Hensman ), TIGAR (Nariai , 2014), eXpress (Roberts & Pachter, 2013), Sailfish (Patro ) and many others. These are all based on different inference methods applied to the same probabilistic model first proposed in (Xing ). This is also essentially the same as the model used by Pathoscope (Francis ; Hong ). There are also a number of other RNA-seq analysis methods based on other models. We have chosen to use the fast variational Bayes (VB) version of BitSeq (Hensman ) as the core ingredient in BIB because it provides very high accuracy while being reasonably fast according to recent broad assessments (SEQC/MAQC-III Consortium, 2014; Kanitz ).

Conclusion

In this paper we have presented the BIB pipeline for probabilistic identification and quantification of relative abundance of bacterial strains in mixed samples from unassembled sequence data. The pipeline is based on alignment of reads to representative core genomes followed by deconvolution of multi-mapping reads using BitSeq, a method previously proposed for RNA-seq analysis. Our BIB pipeline can rapidly and reliably estimate the proportions of the reference strains with a typical deviation of at most a few percentage points, using approximately one million sequencing reads. BIB improves significantly upon the accuracy of both naive analysis and the previous state-of-the-art method in strain identification. Application of BIB to clinical samples suggests it has significant potential both in strain identification and in flagging problematic samples, such as contaminated ones.

33 in total

Review 1. Analyses of clonality and the evolution of bacterial pathogens.

Authors: Edward J Feil; Mark C Enright
Journal: Curr Opin Microbiol Date: 2004-06 Impact factor: 7.934

2. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

3. Analysis and design of RNA sequencing experiments for identifying isoform regulation.

Authors: Yarden Katz; Eric T Wang; Edoardo M Airoldi; Christopher B Burge
Journal: Nat Methods Date: 2010-11-07 Impact factor: 28.547

4. TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference.

Authors: Naoki Nariai; Osamu Hirose; Kaname Kojima; Masao Nagasaki
Journal: Bioinformatics Date: 2013-07-02 Impact factor: 6.937

5. Metagenomic microbial community profiling using unique clade-specific marker genes.

Authors: Nicola Segata; Levi Waldron; Annalisa Ballarini; Vagheesh Narasimhan; Olivier Jousson; Curtis Huttenhower
Journal: Nat Methods Date: 2012-06-10 Impact factor: 28.547

6. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

7. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs.

Authors: Yi Xing; Tianwei Yu; Ying Nian Wu; Meenakshi Roy; Joseph Kim; Christopher Lee
Journal: Nucleic Acids Res Date: 2006-06-06 Impact factor: 16.971

8. Streaming fragment assignment for real-time analysis of sequencing experiments.

Authors: Adam Roberts; Lior Pachter
Journal: Nat Methods Date: 2012-11-18 Impact factor: 28.547

9. MetaSim: a sequencing simulator for genomics and metagenomics.

Authors: Daniel C Richter; Felix Ott; Alexander F Auch; Ramona Schmid; Daniel H Huson
Journal: PLoS One Date: 2008-10-08 Impact factor: 3.240

Review 10. Computational meta'omics for microbial community studies.

Authors: Nicola Segata; Daniela Boernigen; Timothy L Tickle; Xochitl C Morgan; Wendy S Garrett; Curtis Huttenhower
Journal: Mol Syst Biol Date: 2013-05-14 Impact factor: 11.429
