Literature DB >> 29134153

HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities.

Weizhi Song1,2, Kerrin Steensen1,3, Torsten Thomas1,4.   

Abstract

The development and application of metagenomic approaches have provided an opportunity to study and define horizontal gene transfer (HGT) on the level of microbial communities. However, no current metagenomic data simulation tools offers the option to introduce defined HGT within a microbial community. Here, we present HgtSIM, a pipeline to simulate HGT event among microbial community members with user-defined mutation levels. It was developed for testing and benchmarking pipelines for recovering HGTs from complex microbial datasets. HgtSIM is implemented in Python3 and is freely available at: https://github.com/songweizhi/HgtSIM.

Entities:  

Keywords:  Bioinformatics; Horizontal gene transfer; Simulator

Year:  2017        PMID: 29134153      PMCID: PMC5681852          DOI: 10.7717/peerj.4015

Source DB:  PubMed          Journal:  PeerJ        ISSN: 2167-8359            Impact factor:   2.984


Introduction

Horizontal gene transfer (HGT) has been recognized as an important force in microbial evolution and adaptation (Soucy, Huang & Gogarten, 2015). A number of pipelines have been developed to identify HGTs in draft or completed genomes of isolated microorganisms (Adato et al., 2015; Hasan et al., 2012; Podell & Gaasterland, 2007; Ravenhall et al., 2015; Trappe, Marschall & Renard, 2016; Zhu, Kosoy & Dittmar, 2014). In recent years, the development and application of metagenomic approaches have provided novel and vast amounts of information on the genomic composition of uncultured microorganisms (Thomas, Gilbert & Meyer, 2012). This offers an opportunity to study HGT on the level of microbial communities, however new bioinformatics tools and pipelines have to be developed to reliably detect any HGT events in metagenomic datasets. Simulations of metagenomics reads have been essential for the development and benchmarking of pipelines for the quality control, assembly and annotation of metagenomic data (Peng et al., 2012; Kang et al., 2015). These simulation tools typically produce reads based on defined sets of reference genomes with user-defined abundance distributions and often considering realistic error models for common sequencing technologies (Escalona, Rocha & Posada, 2016). However, no current simulation tool offers the option to introduce defined HGT within the microbial community data simulated, thus allowing to test pipelines that aim to detect HGT. Here, we have developed a pipeline called HgtSIM, which can simulate HGTs between the genomes of microbial communities. The pipeline can simulate HGTs with different degrees of similarity for transferred genes found in donor and recipient genomes, thus allowing to assess the detection of relatively recent or past transfers.

Methods

Simulation of gene mutations

The transfer of genes into a recipient genome often involves subsequent mutations that reflect evolutionary drift or adaptation to the new genomic context (e.g., change in codon usage to match tRNA availability). To simulate such mutations without disrupting reading frames and to confine the mutations to a defined range, we use codons as units of mutations. The mutations of codons were grouped into four categories (C): (1) one-base, silent mutation; (2) one-base, non-silent mutation; (3) two-bases mutations and (4) three-bases mutations (Table 1).
Table 1

Mutation types of codons.

The changed bases are displayed in bold. The corresponding amino acid change is given in parenthesis. As the number of silent two- and three-bases mutations are low (1%) compared to non-silent mutations, we here combined them into the same categories. The start and stop codons were excluded when calculating the number of mutation types.

CategoryMutation typeExampleTotal number
C1One-base, silentATC (Ile) → ATA (Ile)124
C2One-base, non-silentGCC (Ala) →ACC (Thr)356
C3Two bases, silentAGG (Arg) →CGT(Arg)20
Two bases, non-silentCTC (Leu) → CCT (Pro)1,394
C4Three bases, silentAGT (Ser) →TCC (Ser)12
Three bases, non-silentGTG (Val) →TAC (Tyr)1,400
The algorithm for simulating random mutations is as follows: Get the length (L) of each gene to be transferred. Define the number of bases need to be changed (N) based on a user-defined identity value (I) and L: i.e., N = LI∕100. Define the type of mutations based on N and a user-defined ratio of the four mutation categories. For example, if a ratio of 1:1:1:1 is specified for C1:C2:C3:C4, then, N = C1 + C2 + 2C3 + 3C4. Randomly select C1, C2, C3 and C4 codons and perform the corresponding mutations. All changed nucleotides are recorded in a mutation report file. A BlastP-based comparison between the amino acid sequences is also provided.

Simulation of gene transfers

The steps to simulate random gene transfers are as follows (Fig. 1):
Figure 1

The workflow of HgtSIM.

Add flanking sequences (if specified) to the (mutated) genes to be transferred. These flanking regions could, for example, be transposon insertion sequences. Get the total length or the total number of intergenic regions of the recipient genome (P) and user-defined number of genes (Q) to be transferred. Randomly select Q numbers between 1 and P and cut the recipient genome at corresponding positions to create sub-sequences. If user wants to insert gene transfers only into intergenic regions, then the recipient genome will be cut in the middle position of the selected intergenic regions. Randomly assign the (mutated) genes to be transferred to the cut points and concatenate them with the sub-sequences. All the break positions and the (mutated) genes inserted to these positions are recorded in an insertion report file. The Python3 implementation of this HgtSIM algorithm, parameter setting and all scripts used here are available at: https://github.com/songweizhi/HgtSIM.

Mutation types of codons.

The changed bases are displayed in bold. The corresponding amino acid change is given in parenthesis. As the number of silent two- and three-bases mutations are low (1%) compared to non-silent mutations, we here combined them into the same categories. The start and stop codons were excluded when calculating the number of mutation types.

Results and Discussion

The effect of mutation categories on the level of coded amino acid changes

The correlation of mutation on the nucleotide level and the resulting amino acid changes under different ratios of mutation categories were assessed by performing random mutations on 100 genes selected from ten Alphaproteobacteria genomes (Table 2). The values for the level of user-defined nucleotide mutations and the values for the resulting changes in protein sequences were similar for the category ratios of “0:0:0:1” and “1:0:1:1” (Fig. 2). This correlation analysis provides the user with information on the level of protein sequence changes that occur at any given nucleotide mutation level and category settings.
Table 2

The selected 20 genomes used in this study.

ClassStrainNCBI BioProject IDGenome size (Mbp)
AlphaproteobacteriaAcidiphilium multivorum AIU301 60101 3.58
Ketogulonigenium vulgarum WSH 001 161161 2.64
Mesorhizobium australicum WSM2073 47287 3.74
Methylocapsa acidiphila B2 72841 5.91
Methyloferula stellata AR4 165575 4.04
Rhodovibrio salinarum DSM 9154 84315 4.30
Roseobacter litoralis Och 149 19357 3.98
Sphingobium japonicum UT26S 1 19949 3.35
Starkeya novella DSM 506 37659 4.54
Tistrella mobilis KA081020 065 76349 3.74
BetaproteobacteriaAlicycliphilus denitrificans K601 50751 4.76
Dechlorosoma suillum PS 37693 3.63
Gallionella capsiferriformans ES 2 32827 3.02
Herbaspirillum seropedicae SmR1 47945 5.26
Nitrosospira multiformis ATCC 25196 13912 3.04
Ramlibacter tataouinensis TTB310 16294 3.88
Sideroxydans lithotrophicus ES 1 33161 2.41
Snodgrassella alvi wkB2 167602 2.99
Sulfuricella denitrificans skB26 170011 2.86
Tetrathiobacter kashmirensis WT001 67337 4.16
Figure 2

The correlation of mutation on the nucleotide level and the resulting aa changes under different mutation category ratios.

The four numbers separated by colon refer to the ratio between C1, C2, C3 and C4.

The correlation of mutation on the nucleotide level and the resulting aa changes under different mutation category ratios.

The four numbers separated by colon refer to the ratio between C1, C2, C3 and C4.

The effect of assemble k-mer range on the recovery of simulated HGTs

We next demonstrated the usefulness of HgtSIM to assess the recovery rate of HGTs from a simulated metagenomic shotgun-sequencing dataset after various sequence assembly processes. For this, 10 genes each were selected from the ten Alphaproteobacteria genomes and randomly transferred to ten Betaproteobacteria genomes (Table 2) with various degrees of mutation (0%, 5%, 10%, 15%, 20%, 25% and 30%). The ratio of mutation types was set to 1:0:1:1 and a flanking sequence of “TAGATGAGTGATTAGTTAGTTA” were added to the two ends of all transfers. Ten million paired-end, error-free 100-bp reads (corresponding to a coverage of 26.4×) with 250 bp insert size were simulated with an in-house script from the 20 genomes for each mutation group. To get an even sequencing depth distribution of the 20 genomes, their relative abundances were all set to one. The simulated reads were then assembled with IDBA_UD 1.1.1 (Peng et al., 2012) and metaSPAdes 3.9.0 (Nurk et al., 2017) with multiple k-mer ranges (Fig. 3). A gene transfer was considered to be recovered during the assembly if at least one of the gene’s two flanking regions was >1 Kbp and the flanking region matched its recipient genome. To do this, a blastn (Altschul et al., 1990) was performed between the introduced gene transfers and the contigs produced by the assemblers. The blast results were then filtered with an identity cutoff of 99% and a coverage cutoff of 99% for the transferred genes.
Figure 3

The effect of assemble k-mer range on the recovery of HGT events.

The results show that the best recovery was obtained with a k-mer range of 20–124 for IDBA_UD as well as 21–55 and 21–125 for metaSPAdes (Fig. 3). The number of genes recovered by the two assemblers were reduced when the user-defined nucleotide mutation levels were low (i.e., <5%). When no mutation was introduced, only one and nine genes were recovered by IDBA_UD and metaSPAdes, respectively.

The effect of sequencing depth on the recovery of no-mutation HGTs

We then investigated how sequencing depth might affect the recovery of HGTs with no mutations. To do this, between one to 20 million paired-end 100-bp reads with 250 bp insert size were simulated for the 20 genomes, no error was introduced to the simulated reads during simulation and no mutation was introduced to the 100 transferred genes. The simulated reads were then assembled with IDBA_UD and metaSPAdes with the optimal k-mer ranges identified above. Assembly statistics (total length, number of contigs, N50 and percentage of recovered reference genomes) were obtained with MetaQUAST 4.5 (Mikheenko, Saveliev & Gurevich, 2015). The results show that the quality of assemblies improved with increasing sequencing depth (Figs. 4A–4C) and over 98.5% of sequences for the reference genomes were reconstructed with sequencing depths of greater 6.6× (Fig. 4A). We found that the number of gene transfers recovered by IDBA_UD and metaSPAdes was not linearly correlated with sequencing depth. The best recovery (69 out of 100 transfers) was observed with metaSPAdes at a k-mer range of 21–125 and sequencing depth of 7.9× (Fig. 4D). With a k-mer range of 21–55, 43 gene transfers were recovered by metaSPAdes when the sequencing depth is 4×. As for IDBA_UD, the best recovery was obtained with a sequencing depth of 5.28×. The decrease of HGT recovery rates beyond a certain coverage threshold was surprising and contrasted the improved general quality measurements of the assemblies (Figs. 4A–4C). One possible explanation is that assemblers under certain coverage condition are more likely to assemble contigs for the transferred genes that lack flanking regions, and visual inspection of assembly graphs found instances of this.
Figure 4

The total length (A), percentage of recovered sequences (B), contig number and N50 (C) of assembler produced assemblies. (D) Number of recovered transfers.

The lines showing the number of contigs and N50 of metaSPAdes produced assemblies with two different k-mer settings were overlapping in panel (C).

The total length (A), percentage of recovered sequences (B), contig number and N50 (C) of assembler produced assemblies. (D) Number of recovered transfers.

The lines showing the number of contigs and N50 of metaSPAdes produced assemblies with two different k-mer settings were overlapping in panel (C).

The effect of read length and insert size on the recovery of no-mutation HGTs

We also simulated how insert size and read length might influence recovery of transfer events. As the best recovery of no-mutation HGTs was observed with metaSPAdes and a k-mer range of 21–125 at sequencing depth of 7.9× (Fig. 4D), we simulated reads with different length (100 bp and 250 bp) and insert sizes (250 bp, 500 bp and 1 Kbp) to this depth. More no-mutation gene transfers were recovered with reads length of 100 bp than with 250 bp. For the datasets with 250 bp read length the recovery of gene transfers was improved with longer insert sizes (Table 3).
Table 3

The effect of reads length and insert size on the recovery of 100 simulated HGT events.

Reads length (bp)100250
Insert size (bp)2505001,0002505001,000
Recovered gene transfers695563152351

The effect of the DNA length on the rate of transfer recovery

The correlation between the length of the DNA transferred and its recovery rate was also analyzed. The three datasets shown in Fig. 4D were used for this analysis. There was no statistically supported correlation between the gene length and the recovery rate, indicating that gene length has no impact on recovery rate under the given experimental conditions (Fig. 5).
Figure 5

The correlation between the length of gene transfers and their recovery rate.

Conclusions

Our study demonstrates how various aspects of metagenomic sequencing projects (e.g., insert length, read length, assembly parameters, gene length) can influence the potential to recover HGT from metagenomic datasets. Testing and benchmarking of various parameters and tools with simulated datasets produced by HgtSIM will in the future help to develop robust pipelines that have maximal success in recovering HGT from complex metagenomic data.
  14 in total

1.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Authors:  Yu Peng; Henry C M Leung; S M Yiu; Francis Y L Chin
Journal:  Bioinformatics       Date:  2012-04-11       Impact factor: 6.937

2.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

3.  MetaQUAST: evaluation of metagenome assemblies.

Authors:  Alla Mikheenko; Vladislav Saveliev; Alexey Gurevich
Journal:  Bioinformatics       Date:  2015-11-26       Impact factor: 6.937

Review 4.  Horizontal gene transfer: building the web of life.

Authors:  Shannon M Soucy; Jinling Huang; Johann Peter Gogarten
Journal:  Nat Rev Genet       Date:  2015-08       Impact factor: 53.242

5.  Detecting horizontal gene transfer by mapping sequencing reads across species boundaries.

Authors:  Kathrin Trappe; Tobias Marschall; Bernhard Y Renard
Journal:  Bioinformatics       Date:  2016-09-01       Impact factor: 6.937

6.  Detecting Horizontal Gene Transfer between Closely Related Taxa.

Authors:  Orit Adato; Noga Ninyo; Uri Gophna; Sagi Snir
Journal:  PLoS Comput Biol       Date:  2015-10-06       Impact factor: 4.475

7.  Metagenomics - a guide from sampling to data analysis.

Authors:  Torsten Thomas; Jack Gilbert; Folker Meyer
Journal:  Microb Inform Exp       Date:  2012-02-09

8.  GIST: Genomic island suite of tools for predicting genomic islands in genomic sequences.

Authors:  Mohammad Shabbir Hasan; Qi Liu; Han Wang; John Fazekas; Bernard Chen; Dongsheng Che
Journal:  Bioinformation       Date:  2012-02-28

9.  DarkHorse: a method for genome-wide prediction of horizontal gene transfer.

Authors:  Sheila Podell; Terry Gaasterland
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

10.  MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities.

Authors:  Dongwan D Kang; Jeff Froula; Rob Egan; Zhong Wang
Journal:  PeerJ       Date:  2015-08-27       Impact factor: 2.984

View more
  3 in total

Review 1.  Current Methods for Recombination Detection in Bacteria.

Authors:  Anton E Shikov; Yury V Malovichko; Anton A Nizhnikov; Kirill S Antonets
Journal:  Int J Mol Sci       Date:  2022-06-02       Impact factor: 6.208

2.  MetaCHIP: community-level horizontal gene transfer identification through the combination of best-match and phylogenetic approaches.

Authors:  Weizhi Song; Bernd Wemheuer; Shan Zhang; Kerrin Steensen; Torsten Thomas
Journal:  Microbiome       Date:  2019-03-04       Impact factor: 14.650

Review 3.  Current and Promising Approaches to Identify Horizontal Gene Transfer Events in Metagenomes.

Authors:  Gavin M Douglas; Morgan G I Langille
Journal:  Genome Biol Evol       Date:  2019-10-01       Impact factor: 3.416

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.