CAFE: a software suite for analysis of paired-sample transposon insertion sequencing data.

Anna Abramova^1,2, Adriana Osińska^1,2,3, Haveela Kunche^1,2,4, Emil Burman^1,2, Johan Bengtsson-Palme^1,2.

Abstract

SUMMARY: Sequencing of transposon insertion libraries is used to determine the relative fitness of individual mutants at a large scale. However, there is a lack of tools for specifically analyzing data from such experiments with paired sample designs. Here, we introduce CAFE (Coefficient-based Analysis of Fitness by read Enrichment), a software package that can analyze data from paired transposon mutant sequencing experiments, generate fitness coefficients for each gene and condition, and perform appropriate statistical testing on these fitness coefficients.
AVAILABILITY AND IMPLEMENTATION: CAFE is implemented in Perl and R. The source code is freely available for download under the MIT License from https://github.com/bengtssonpalme/cafe and http://microbiology.se/software/cafe/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. The evaluation data can be obtained from https://microbiology.se/sw/cafe/example_data.tgz.

Year: 2021 PMID: 33393985 PMCID: PMC8034522 DOI: 10.1093/bioinformatics/btaa1086

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Over recent years, a variety of approaches for investigating the fitness of mutants at a large scale have emerged (Chao et al., 2016; Goodman et al., 2011; van Opijnen and Camilli, 2013). Generally, these approaches are based on insertion of a transposon en masse in the target genome, followed by sequencing of tags from the resulting mutants, which allows their relative fitness to be determined under experimental or in vivo conditions. While a number of software packages exist for analysis of this type of data (Blanchard et al., 2015; McCoy et al., 2017; Zhao et al., 2017; Zomer et al., 2012), these packages lack a feature that is critical for identifying gene-level fitness effects that are unique to a given condition. The missing feature is the ability to compare a given treatment condition to a control, under the assumption that the starting collection of transposon mutants comes from the same pool for each paired replicate. This type of experimental setup allows a direct assessment of the genes that have significant effects on fitness specifically when one experimental factor is altered (such as exposure to a selective agent or the presence of other species). In this paper, we introduce CAFE (Coefficient-based Analysis of Fitness by read Enrichment), a software package that analyzes data from paired transposon mutant sequencing experiments, generates fitness coefficients for each gene and condition, and performs appropriate statistical testing on these fitness coefficients.

2 Implementation

The CAFE package is based upon the concept of condition-specific fitness coefficients (FCs). Each FC describes the relative importance of a gene under a certain condition and is derived from a comparison of how mutants coming from the same source population differ after growth in a studied condition and a control condition. The FC is defined as:

[Equation not preserved in this copy of the text]

where FC is the condition-specific fitness coefficient, n is the number of reads assigned to a gene, s is the total number of mapped reads from a sample, u is the number of reads mapped to intergenic regions, and ‘cond’ and ‘ctrl’ denote the test condition and the control, respectively. The reads corresponding to intergenic transposon insertions (u) serve as a normalization factor, under the assumption that they show no effect between the condition and the control; a different factor can be used if a better no-effect control exists.

CAFE is a set of command-line tools for analysis of sequence data implemented in Perl, combined with an R package for statistical analysis of the read counts generated. The entire software package should be functional under any version of Unix or Linux, including macOS. The R package also runs well in the Windows version of R. For full functionality, the command-line tools depend on cutadapt (Martin, 2011), Trim Galore! (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and Bowtie2 (Langmead and Salzberg, 2012).
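To illustrate how a score built from the quantities n, s and u could behave, the following minimal Python sketch computes a log2 ratio of intergenic-normalized gene abundances between a condition and its paired control. The functional form, the function name `fitness_coefficient` and the example counts are all assumptions made for illustration; this is not the published CAFE formula, which should be taken from the original article or the CAFE source code.

```python
import math


def fitness_coefficient(n_cond, s_cond, u_cond, n_ctrl, s_ctrl, u_ctrl):
    """Hypothetical log2-ratio fitness score (assumed form, NOT CAFE's definition).

    n_*: reads assigned to the gene
    s_*: total mapped reads in the sample
    u_*: reads mapped to intergenic regions (assumed no-effect reference)
    """
    # Relative abundance of the gene, scaled by the relative abundance
    # of the intergenic (no-effect) reads in the same sample.
    cond = (n_cond / s_cond) / (u_cond / s_cond)
    ctrl = (n_ctrl / s_ctrl) / (u_ctrl / s_ctrl)
    # Positive values: gene reads enriched in the condition; negative: depleted.
    return math.log2(cond / ctrl)


# Made-up counts: the condition sample is sequenced twice as deeply as the control.
# Gene keeps the same share of reads in both samples, so the score is 0.
fc_neutral = fitness_coefficient(200, 1_000_000, 50_000, 100, 500_000, 25_000)
# Gene share drops four-fold in the condition, so the score is about -2.
fc_depleted = fitness_coefficient(50, 1_000_000, 50_000, 100, 500_000, 25_000)

print(fc_neutral, fc_depleted)
```

In this assumed form, differences in sequencing depth between the paired samples cancel out, which is the role the text assigns to the no-effect normalization factor. A real implementation would also need to handle genes with zero reads, which this sketch does not.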

3 Evaluation

To assess the performance of the CAFE package compared to other commonly used software for transposon sequencing analysis, we used data from an InSeq experiment comparing Pseudomonas aeruginosa transposon mutant libraries after overnight growth to the frozen state of the same libraries (available as CAFE example data from https://microbiology.se/sw/cafe/example_data.tgz). We analyzed these data using CAFE and then ran ESSENTIALS (Zomer et al., 2012), MAGenTA (McCoy et al., 2017) and TnseqDiff (Zhao et al., 2017) on either the reads from the experiment or the counts resulting from the Perl portion of CAFE, depending on the input required by each tool (Table 1; see Supplementary Text for details). A few important conclusions can be drawn from this analysis. First, ESSENTIALS and, to some degree, TnseqDiff produce unrealistically small P-values (Supplementary Fig. S1). For example, ESSENTIALS produces P-values smaller than 10^-100 and TnseqDiff generates P-values as small as 10^-30 for this dataset. With only five replicates in each group and considerable within-treatment variation, such small P-values hint at an over-confident statistical method. Furthermore, MAGenTA indicates that virtually all genes have significant differences (Supplementary Fig. S2), which is very unlikely to be true and, in any case, is not a useful result for filtering out relevant hits. We also see that all four tools agree that 689 genes differ significantly between the two treatments (Supplementary Fig. S3). MAGenTA stands out as the most liberal, having identified 1691 genes as significant that were not reported by any of the other tools. Furthermore, it is notable that ESSENTIALS shares 879 reported genes with only MAGenTA (which reports almost all genes as significant), while CAFE and TnseqDiff share only 332 and 16 reported genes with MAGenTA, respectively.
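The overlap counts discussed above (genes reported by all four tools, by a single tool only, or by exactly two tools) come down to set operations on each tool's list of significant genes. The Python sketch below shows one way to compute them; the gene identifiers, the `significant` dictionary and the helper names are hypothetical placeholders, not data from the evaluation.

```python
# Hypothetical significant-gene sets per tool (placeholder IDs, not real results).
significant = {
    "CAFE": {"g1", "g2", "g3"},
    "ESSENTIALS": {"g1", "g2", "g4"},
    "MAGenTA": {"g1", "g2", "g3", "g4", "g5"},
    "TnseqDiff": {"g1"},
}

# Genes that every tool reports as significant.
shared_by_all = set.intersection(*significant.values())


def unique_to(tool):
    """Genes reported by `tool` and by no other tool."""
    others = set().union(
        *(genes for name, genes in significant.items() if name != tool)
    )
    return significant[tool] - others


def shared_only_between(tool_a, tool_b):
    """Genes reported by exactly these two tools and no others."""
    others = set().union(
        *(genes for name, genes in significant.items() if name not in (tool_a, tool_b))
    )
    return (significant[tool_a] & significant[tool_b]) - others


print(sorted(shared_by_all))                                  # ['g1']
print(sorted(unique_to("MAGenTA")))                           # ['g5']
print(sorted(shared_only_between("ESSENTIALS", "MAGenTA")))   # ['g4']
```

With real output, each set would be filled with the gene identifiers whose adjusted P-value falls at or below the chosen threshold for that tool.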
It is worth pointing out that we have no way of knowing the ‘true’ result in this case; we can only make reasonable assumptions about what a plausible distribution of significantly differential genes would look like, and the results reported by CAFE and ESSENTIALS seem to match the expected distributions best. We also investigated robustness against false positives on a no-effect dataset and found that CAFE and ESSENTIALS far outperformed the other two tools in this respect (Supplementary Text and Supplementary S4 and S5).

Table 1.

Comparison of CAFE with three other commonly used software packages for transposon sequencing analysis

	CAFE	ESSENTIALS	MAGenTA	TnseqDiff
Total number of reported genes	5703	5697	5697	5700
Number of genes with adjusted P-value ≤ 0.05	2375	2847	4920	973
Percentage significant genes	41.6%	50.0%	86.4%	17.1%
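The percentage row of Table 1 follows directly from the two count rows above it. As a quick consistency check, this short Python sketch recomputes it from the counts given in the table; the variable names are illustrative only.

```python
# Counts copied from Table 1.
reported = {"CAFE": 5703, "ESSENTIALS": 5697, "MAGenTA": 5697, "TnseqDiff": 5700}
significant = {"CAFE": 2375, "ESSENTIALS": 2847, "MAGenTA": 4920, "TnseqDiff": 973}

# Percentage of reported genes with adjusted P-value <= 0.05, to one decimal place.
percent = {
    tool: round(100 * significant[tool] / reported[tool], 1) for tool in reported
}

for tool, value in percent.items():
    print(f"{tool}: {value}%")
# CAFE: 41.6%
# ESSENTIALS: 50.0%
# MAGenTA: 86.4%
# TnseqDiff: 17.1%
```

The recomputed values match the percentages reported in the table.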

4 Conclusions

Our evaluation of currently used statistical methods for analysis of transposon insertion sequencing data reveals substantial flaws in the methodological assumptions, particularly when the samples are paired. Here, we introduce a new solution to this paired-sample transposon sequencing library problem in the form of the CAFE software package, which performs both the bioinformatic processing of sequence data from such experiments and the statistical analysis. The R package part of CAFE can operate on any type of count data from paired transposon sequencing experiments, regardless of whether the CAFE tools were used for preprocessing. The CAFE package is open source and available from GitHub (https://github.com/bengtssonpalme/cafe) as well as from https://microbiology.se/software/cafe/.

Funding

This work was supported by the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (FORMAS; grant 2016-00768), the Sahlgrenska Academy at the University of Gothenburg, the Swedish Research Council (VR; grant 2019-00299) under the frame of JPI AMR (EMBARK; JPIAMR2019-109), the Centre for Antibiotic Resistance Research at the University of Gothenburg, the Adlerbertska research foundation, the O.E. and Edla Johansson foundation, the Swedish Cancer and Allergy fund (Cancer- och Allergifonden) and Längmanska Kulturfonden.

Conflict of Interest: none declared.

References

1. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

2. Identifying microbial fitness determinants by insertion sequencing using genome-wide transposon mutant libraries.

Authors: Andrew L Goodman; Meng Wu; Jeffrey I Gordon
Journal: Nat Protoc Date: 2011-11-17 Impact factor: 13.491

3. MAGenTA: a Galaxy implemented tool for complete Tn-Seq analysis and data visualization.

Authors: Katherine Maia McCoy; Margaret L Antonio; Tim van Opijnen
Journal: Bioinformatics Date: 2017-09-01 Impact factor: 6.937

4. Transposon insertion sequencing: a new tool for systems-level analysis of microorganisms.

Authors: Tim van Opijnen; Andrew Camilli
Journal: Nat Rev Microbiol Date: 2013-05-28 Impact factor: 60.633

5. TnseqDiff: identification of conditionally essential genes in transposon sequencing studies.

Authors: Lili Zhao; Mark T Anderson; Weisheng Wu; Harry L T Mobley; Michael A Bachman
Journal: BMC Bioinformatics Date: 2017-07-06 Impact factor: 3.169

6. The design and analysis of transposon insertion sequencing experiments.

Authors: Michael C Chao; Sören Abel; Brigid M Davis; Matthew K Waldor
Journal: Nat Rev Microbiol Date: 2016-02 Impact factor: 60.633

7. ESSENTIALS: software for rapid analysis of high throughput transposon insertion sequencing data.

Authors: Aldert Zomer; Peter Burghout; Hester J Bootsma; Peter W M Hermans; Sacha A F T van Hijum
Journal: PLoS One Date: 2012-08-10 Impact factor: 3.240

8. Transposon insertion mapping with PIMMS - Pragmatic Insertional Mutation Mapping System.

Authors: Adam M Blanchard; James A Leigh; Sharon A Egan; Richard D Emes
Journal: Front Genet Date: 2015-04-09 Impact factor: 4.599
