Tizian Schulz, Roland Wittler, Jens Stoye.
Abstract
One of the most basic kinds of analysis to be performed on a pangenome is the detection of its core, i.e., the information shared among all members. Pangenomic core detection is classically done on the gene level, and many tools focus exclusively on core detection in prokaryotes. Here, we present a new method for sequence-based pangenomic core detection. Our model generalizes from a strict core definition, allowing us to flexibly determine suitable core properties depending on the research question and the dataset under consideration. We propose an algorithm based on a colored de Bruijn graph that runs in linear time with respect to the number of k-mers in the graph. An implementation of our method is called Corer. Because it uses a colored de Bruijn graph, it works alignment-free, has a small memory footprint, and accepts assembled genomes as well as sequencing reads as input.
Keywords: Bioinformatics; Genomics
Year: 2022 PMID: 35663029 PMCID: PMC9160775 DOI: 10.1016/j.isci.2022.104413
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Pangenome properties
| Species | | Core | Status |
|---|---|---|---|
| | 18 | 0.27 | closed ( |
| | 48 | 0.88 | closed ( |
| | 153 | 0.09 | open ( |
| Listeria monocytogenes | 263 | 0.07 | open ( |
Overview of all prokaryotic pangenomes used in our experiments. The notion core k-mer refers to Definition 3.
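To make the Core column more concrete, the following is a minimal sketch of how a fraction of core k-mers could be computed for a small set of sequences. It is not Corer's implementation and does not use a colored de Bruijn graph library; a plain dictionary stands in for the colored graph. The function names and the `quorum` parameter are hypothetical, and the sketch assumes that a core k-mer is one present in at least `quorum` input genomes, which may differ from the paper's Definition 3.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_colors(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each canonical k-mer to the set of genomes (colors) containing it."""
    colors: dict[str, set[str]] = defaultdict(set)
    for name, sequence in genomes.items():
        sequence = sequence.upper()
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                colors[canonical(kmer)].add(name)
    return dict(colors)


def core_fraction(genomes: dict[str, str], k: int, quorum: int) -> float:
    """Fraction of distinct k-mers found in at least `quorum` genomes (assumed core rule)."""
    colors = kmer_colors(genomes, k)
    if not colors:
        return 0.0
    core = sum(1 for members in colors.values() if len(members) >= quorum)
    return core / len(colors)


if __name__ == "__main__":
    toy = {"a": "ACGTA", "b": "ACGTT", "c": "ACGAA"}
    print(core_fraction(toy, k=3, quorum=3))
```

Setting `quorum` to the number of genomes corresponds to a strict core, while lower values relax it. A single pass over all k-mers is enough for this count, although the sketch says nothing about how Corer itself traverses the graph.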
Figure 1. Result comparison
Result comparison of Corer, Panaroo, and SibeliaZ on four prokaryotic pangenomes. Shown are the numbers of genes on which two tools agree (green) or disagree (blue/red). Results of SibeliaZ are not shown for L. monocytogenes because its output did not contain any genes.
Figure 2. Run time and memory comparison
Run time and memory usage comparison of all three tools.
Figure 3. Influence of δ
Influence of δ on core sizes (left) and run time and memory consumption for core prediction (right).
Non-gene core annotations
| Element | Total | Fraction in core of Corer | Fraction in core of SibeliaZ |
|---|---|---|---|
| miRNA | 3203 | 96% | 94% |
| tRNA | 11339 | 100% | 100% |
| ncRNA | 8563 | 90% | 79% |
| ps. transcript | 16065 | 90% | 68% |
| snoRNA | 1277 | 95% | 96% |
| snRNA | 234 | 100% | 62% |
| rRNA | 72 | 100% | 100% |
| tr. elem. | 66233 | 86% | 31% |
Non-gene related annotations found within the predicted cores of Corer and SibeliaZ from a pangenome of 18 accessions of Arabidopsis thaliana.
Figure 4. Drosophila core genomes
Phylogenetic tree of genus Drosophila taken from FlyBase (Larkin et al., 2021). Each pie chart represents the ratio of genes inside (blue) and outside (orange) the core of pangenomes built from species’ assemblies below each internal node of the tree.
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| | NCBI | See |
| | NCBI | See |
| | NCBI | See |
| | NCBI | See |
| | ENA | Study Accession PRJEB2457 |
| Drosophila reference genomes | | |
| Corer software | This paper | |
| Panaroo | | |
| SibeliaZ | | |
| Bifrost | | |
| Prokka | | |
| Augustus | | |
| BLAST+ | | |
| BUSCO | | |