Gerry Tonkin-Hill, Neil MacAlasdair, Christopher Ruis, Aaron Weimann, Gal Horesh, John A Lees, Rebecca A Gladstone, Stephanie Lo, Christopher Beaudoin, R Andres Floto, Simon D W Frost, Jukka Corander, Stephen D Bentley, Julian Parkhill.
Abstract
Population-level comparisons of prokaryotic genomes must take into account the substantial differences in gene content resulting from horizontal gene transfer, gene duplication and gene loss. However, the automated annotation of prokaryotic genomes is imperfect, and errors due to fragmented assemblies, contamination, diverse gene families and mis-assemblies accumulate over the population, leading to profound consequences when analysing the set of all genes found in a species. Here, we introduce Panaroo, a graph-based pangenome clustering tool that is able to account for many of the sources of error introduced during the annotation of prokaryotic genome assemblies. Panaroo is available at https://github.com/gtonkinhill/panaroo.
Keywords: Bacteria; Clustering; Horizontal gene transfer; Pangenome; Prokaryote
Year: 2020 PMID: 32698896 PMCID: PMC7376924 DOI: 10.1186/s13059-020-02090-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 a An overview conceptualising the problem with current gene annotation methods and the stages Panaroo uses to correct for annotation errors. b Expanded specific stages in the process. (i) Contamination appears in the graph as poorly supported components. In the default mode, Panaroo removes contamination by recursively removing poorly supported nodes of degree 1. (ii) Genes are often mis-annotated near contig breaks [19]. Panaroo corrects such mis-annotations by recursively removing poorly supported nodes of degree 1. (iii) Panaroo corrects cases where the same DNA sequence has been translated in multiple reading frames into a single gene by clustering concomitant genes at the DNA level. (iv) Panaroo uses context and a lower clustering threshold to combine diverse gene families into a single gene. (v) Annotation algorithms may predict a gene in some but not all samples, even when the samples share exactly the same DNA sequence. Panaroo finds missing genes by searching for the gene sequence in the surrounding DNA.
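The recursive trimming described in stages (i) and (ii) can be illustrated with a minimal sketch. This is not Panaroo's code: the function name `trim_low_support_leaves`, the plain-dictionary graph representation, the `support` counts and the `min_support` cutoff are all hypothetical, and the sketch also drops isolated poorly supported nodes (degree 0) that are left behind, which is an assumption beyond what the caption states.

```python
def trim_low_support_leaves(adjacency, support, min_support):
    """Repeatedly remove poorly supported nodes of degree <= 1.

    adjacency   -- dict mapping node -> set of neighbouring nodes (undirected)
    support     -- dict mapping node -> number of genomes containing that node
    min_support -- hypothetical cutoff; nodes below it count as poorly supported
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(neighbours) for node, neighbours in adjacency.items()}

    changed = True
    while changed:
        changed = False
        for node in list(graph):
            is_leaf = len(graph[node]) <= 1  # assumption: also trim isolated nodes
            if is_leaf and support[node] < min_support:
                # Detach the node from its (at most one) neighbour, then drop it.
                for neighbour in list(graph[node]):
                    graph[neighbour].discard(node)
                del graph[node]
                # Removing a leaf may expose a new leaf, so scan again.
                changed = True
    return graph


if __name__ == "__main__":
    # A well-supported chain a-b-c with a poorly supported tail d-e hanging off c.
    adjacency = {
        "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
        "d": {"c", "e"}, "e": {"d"},
    }
    support = {"a": 10, "b": 10, "c": 10, "d": 1, "e": 1}
    print(sorted(trim_low_support_leaves(adjacency, support, min_support=2)))
```

In this toy graph, the tail is removed one node at a time from its tip, while well-supported leaves such as `a` are kept regardless of their degree.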
Fig. 2 Pangenome counts for 413 Mycobacterium tuberculosis genomes from an outbreak in London [27]. The maximum pairwise SNP distance between these isolates was 9, suggesting extremely limited variation. Consequently, we would expect a very limited accessory genome and a core genome of approximately 4000 genes. All tools with the exception of Panaroo found in excess of 2500 accessory genes, which can be attributed to annotation errors.
Fig. 3 Error counts for the different algorithms after comparing with simulated data on different scenarios. Accessory genome inflation refers to the number of erroneous clusters that do not correspond to any simulated gene cluster. Missing genes refer to false-negative gene calls where the annotation is not present in the final pangenome. a Even in simulations of pangenome variation from a single E. coli reference with only relatively simple sources of error, Panaroo outperforms other methods across a variety of gene gain/loss rates and mutation rates. b In more realistic simulations of sequencing data, the only method with reasonable control of the error rate is Panaroo.
Fig. 4 a The estimated pangenome, core and accessory sizes from the different algorithms in the global K. pneumoniae dataset. b The number of conflicting gene annotations in the inferred clusters of the different algorithms.
Fig. 5 a A diagram indicating how gene triplets are called in the graph. A single genome can only pass through a node once; thus, variations in the arrangement of genes in different genomes can be called using triplets. These triplets are summarised as a binary presence/absence matrix. b A family of related plasmids present in the N. gonorrhoeae pangenome gene network. The path highlighted in red contained 4 structural variant gene triplets that were significantly negatively associated with tetracycline resistance (that is, associated with tetracycline susceptibility) in a structural variant pan-GWAS (all adjusted p values < 0.05). The gene highlighted in yellow, group_1999, was found to be a tetM resistance gene. c A subsection of the N. gonorrhoeae pangenome gene network of the region surrounding gene group_1138. The presence of gene triplets (group_771-group_1002-group_1138) and (group_1131-group_795-group_1138) is positively associated with tetracycline resistance whilst the triplets (group_1002-group_795-group_1131) and (group_771-group_1002-group_795) are negatively associated with tetracycline resistance (all adjusted p values < 0.05).
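The triplet encoding in panel a can be sketched roughly as follows. This is an illustration only, not the tool's implementation: the function names, the representation of each genome as an ordered list of gene cluster labels, and the choice to treat a triplet and its reverse as the same variant are all assumptions made here.

```python
def gene_triplets(path):
    """Return the set of consecutive gene triplets along one genome's path.

    path -- ordered list of gene cluster labels (hypothetical representation).
    A triplet and its reverse are treated as identical (an assumption), so each
    is stored in a canonical orientation.
    """
    triplets = set()
    for i in range(len(path) - 2):
        forward = tuple(path[i:i + 3])
        triplets.add(min(forward, forward[::-1]))
    return triplets


def triplet_matrix(genome_paths):
    """Summarise triplets as a binary presence/absence matrix.

    genome_paths -- dict mapping genome name -> ordered list of cluster labels.
    Returns (columns, rows): the sorted triplets, and a dict mapping each
    genome to a list of 0/1 flags aligned with the columns.
    """
    per_genome = {name: gene_triplets(path) for name, path in genome_paths.items()}
    columns = sorted(set().union(*per_genome.values()))
    rows = {
        name: [int(triplet in found) for triplet in columns]
        for name, found in per_genome.items()
    }
    return columns, rows


if __name__ == "__main__":
    paths = {
        "g1": ["A", "B", "C", "D"],
        "g2": ["A", "C", "B", "D"],  # B and C swapped relative to g1
        "g3": ["D", "C", "B", "A"],  # g1 read in the opposite direction
    }
    columns, rows = triplet_matrix(paths)
    for name, flags in rows.items():
        print(name, flags)
```

Under these assumptions, a rearrangement such as the swap in `g2` changes which triplets are present, whereas reading the same gene order from the other strand, as in `g3`, does not. Each column of the resulting matrix could then be tested for association with a phenotype.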
Fig. 6 The inferred gene gain and loss rates of each of the 51 major clades of the Global Pneumococcal Sequencing project plotted above the respective log odds ratio of invasive disease in that clade. Clades which had significant odds ratios in Gladstone et al. [23] are represented in dark yellow.
Fig. 7 The CPU time and memory required for each of the algorithms for 10, 100 and 1000 N. gonorrhoeae isolates. Each tool was run with 5 CPUs.