Christine Jandrasits, Piotr W Dabrowski, Stephan Fuchs, Bernhard Y Renard.
Abstract
BACKGROUND: The increasing application of next generation sequencing technologies has led to the availability of thousands of reference genomes, often providing multiple genomes for the same or closely related species. The current approach to represent a species or a population with a single reference sequence and a set of variations cannot represent their full diversity and introduces bias towards the chosen reference. There is a need for the representation of multiple sequences in a composite way that is compatible with existing data sources for annotation and suitable for established sequence analysis methods. At the same time, this representation needs to be easily accessible and extendable to account for the constant change of available genomes.
Keywords: Data structure; Pan-genome; Whole genome alignment
Year: 2018 PMID: 29334898 PMCID: PMC5769345 DOI: 10.1186/s12864-017-4401-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison of pan-genome tools. We analyzed tools for pan-genome analysis that are available or currently under development. This table lists the corresponding publications or websites. We compared the intended use cases of the tools and the prerequisite data required in order to use them. We evaluated the availability of features needed to work with the pan-genome in subsequent analyses, e.g. updating the set of included genomes. Furthermore, we assessed whether the proposed data structures take into account structural variants and whether it is possible to visualize the resulting pan-genome
| Name | Objective | Input | Visualization of pan-genome | Structural variants | Update: add | Update: remove | Possibility to include annotation |
|---|---|---|---|---|---|---|---|
| svaha | Graph construction | Reference sequence + variants | External | Yes | No | No | No |
| cdbg | Graph construction | Multiple reference sequences | External | Yes | No | No | No |
| cdbg_search | Graph construction | Multiple reference sequences | External | Yes | No | No | No |
| SplitMEM | Graph construction | Multiple reference sequences | External | Yes | No | No | No |
| TwoPaCo | Graph construction | Multiple reference sequences | External | Yes | No | No | No |
| GCSA2 | Graph indexing | Variation graph | No | No | No | No | No |
| GCSA | Graph indexing; multiple sequence mapping | Reference sequence + variants | No | No | No | No | No |
| BWBBLE | Multiple sequence mapping | Reference sequence + variants | No | No | No | No | No |
| GenomeMapper | Multiple sequence mapping | Reference sequence + variants | No | No | No | No | No |
| panVC | Multiple sequence variant detection | Whole genome alignment | External | Yes | No | No | Yes |
| MHC-PRG | Multiple sequence variant detection; pan-genome data structure | Multiple sequence alignment AND variants | No | No | Yes | No | No |
| GenomeRing | Pan-genome data structure | Whole genome alignment | Yes | Yes | No | No | Yes |
| JST | Pan-genome data structure | Reference sequence + variants | No | Yes | Yes | Yes | Yes |
| vg | Pan-genome data structure | Reference sequence + variants OR multiple reference sequences | External | Yes | Yes* | Yes* | Yes |
| PanCake | Pan-genome data structure | Multiple reference sequences AND pairwise alignment | External | Yes | Yes | No | No |
| seq-seq-pan | Pan-genome data structure | Multiple reference sequences | External | Yes | Yes | Yes | Yes |

The three rightmost columns (update: add, update: remove, possibility to include annotation) are grouped under "Functionality" in the original table header. * Adding and removing of genomes in vg can be achieved using a combination of several steps
Fig. 1 Visualization of the alignment workflow for an example with three genomes. Input genomes (g1-3) are depicted as green, yellow and blue blocks. All sub-sequences are part of locally collinear blocks (LCBs) in the final result and are therefore marked within the whole genomes and numbered according to their appearance in the respective genome. The first two genomes are aligned and provided as separated blocks of aligned sub-sequences. Block I and II indicate a rearrangement of sub-sequence 3 of g1 when compared to g2 and parts of g1 are not present in g2. Consensus sequences are built individually for each LCB in the alignment and concatenated with stretches of 'N' as delimiters to form a consensus genome (depicted in red with delimiters in gray). It is used in the alignment with g3, which is presented in detail in steps a-e.

(a) The consensus genome is aligned with the third genome (g3, blue), yielding six blocks. Block I and III represent a rearrangement of sub-sequence 6 of g1. Block II shows a large deletion in g3 compared to the consensus genome. Block IV-VI show single-sequence blocks.

(b) Blocks resulting from alignment with the consensus genome are broken up into smaller blocks at delimiter positions (Block II in a is now Block II-VI in b). The small single-sequence block with sub-sequence 5 of the consensus genome (Block IV in a) is merged to its neighboring sub-sequence 4 of the consensus genome, introducing gaps into sub-sequence 3 of g3 (see Block IV in b).

(c) Remaining single-sequence blocks of both genomes (depicted in lighter red and blue) are concatenated with stretches of 'N' as delimiters (c.a). Sequences are aligned (c.b) and resulting blocks are resolved at delimiter positions (c.c). Small single-sequences would also be merged to neighboring blocks (not shown).

(d) Aligned and single-sequence blocks from step c are joined with initially aligned blocks and all blocks are sorted by their position in the consensus genome.

(e) The full alignment is traced back using the newly formed blocks and the alignment of the first two genomes.

(f) A consensus genome is built from the full alignment and alignment of additional genomes is achieved by consecutive repetition of steps a-f
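The delimiter handling described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the seq-seq-pan implementation: the function names, the delimiter length of 10 and the toy sequences are assumptions, and the sketch further assumes that no LCB consensus sequence itself contains a run of 'N' as long as the delimiter.

```python
import re

DELIMITER_LENGTH = 10  # assumption: illustrative value, not taken from the paper


def build_consensus_genome(lcb_consensus, delimiter_length=DELIMITER_LENGTH):
    """Concatenate per-LCB consensus sequences, separated by stretches of 'N'."""
    return ("N" * delimiter_length).join(lcb_consensus)


def delimiter_spans(consensus_genome, delimiter_length=DELIMITER_LENGTH):
    """Return half-open (start, end) spans of all delimiter stretches."""
    pattern = re.compile("N{%d,}" % delimiter_length)
    return [match.span() for match in pattern.finditer(consensus_genome)]


def split_at_delimiters(start, end, spans):
    """Break the half-open block [start, end) into pieces that skip delimiters.

    `spans` must be sorted by position, as returned by delimiter_spans().
    """
    pieces = []
    cursor = start
    for d_start, d_end in spans:
        if d_end <= cursor or d_start >= end:
            continue  # delimiter lies outside the remaining part of the block
        if d_start > cursor:
            pieces.append((cursor, d_start))
        cursor = max(cursor, d_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


# Toy example: three LCB consensus sequences (hypothetical data).
genome = build_consensus_genome(["ACGTACGT", "GGGCCC", "TTTT"])
spans = delimiter_spans(genome)
# A block covering positions 2..36 spans both delimiters and is broken up.
pieces = split_at_delimiters(2, 36, spans)
print(pieces)
```

A block that was aligned across a delimiter is thereby cut into one piece per original LCB, which mirrors step b in the caption.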
Fig. 2 Visualization of the phylogenetic tree used to simulate genomes with EVOLVER. The corresponding Newick tree is (((D:0.015625,E:0.0333)B:0.01,C:0.015625)A:0.03125, (((K:0.03125,L:0.015625)J:0.005,I:0.015625)G:0.02083, H:0.02083)F:0.005); (drawn with the online version of Phylodendron [43])
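The Newick string in the Fig. 2 caption can be read into a small tree structure to inspect the simulated phylogeny. The parser below is a minimal sketch written for this string (labels, branch lengths, no quoting or comments); the function names and the dictionary layout are assumptions made for illustration.

```python
NEWICK = ("(((D:0.015625,E:0.0333)B:0.01,C:0.015625)A:0.03125, "
          "(((K:0.03125,L:0.015625)J:0.005,I:0.015625)G:0.02083, "
          "H:0.02083)F:0.005);")


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            pos += 1  # skip ')'
        # Read the optional "name:length" label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        return {"name": name,
                "length": float(length) if length else 0.0,
                "children": children}

    return parse_node()


def leaf_distances(node, distance=0.0):
    """Return {leaf name: summed branch length from the root}."""
    distance += node["length"]
    if not node["children"]:
        return {node["name"]: distance}
    result = {}
    for child in node["children"]:
        result.update(leaf_distances(child, distance))
    return result


tree = parse_newick(NEWICK)
distances = leaf_distances(tree)
for leaf in sorted(distances):
    print(leaf, round(distances[leaf], 6))
```

The summed branch lengths give a rough idea of how far each simulated leaf genome has diverged from the root of the tree.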
Fig. 3 F-scores for comparing alignments using different sort orders for genomes. Genomes of each dataset were sorted by similarity and dissimilarity and randomly (100 times for the simulated and M. tuberculosis datasets and 10 times for the S. aureus and E. coli datasets) and aligned using the sequential workflow. The F-score is used as measure of consistency for alignment when comparing alignments with the dissimilar and random sort orders to the alignment with genomes sorted by similarity. All F-scores were similar within datasets and greater than 0.93 for all comparisons
Effect of merging short one-sequence LCBs
| Dataset | Workflow | Total alignment length | Mean number of sequences in LCB | Number of short LCBs | Number of short one-sequence LCBs | Precision | Recall | F-Score |
|---|---|---|---|---|---|---|---|---|
| Simulated dataset (13 genomes) | With merging step | 4809015 | 9.2 | 0 | 0 | 0.993 | 0.475 | 0.643 |
| Simulated dataset (13 genomes) | Without merging step | 4789770 | 5.5 | 318 | 156 | 0.993 | 0.475 | 0.643 |
| Second dataset | With merging step | 4826979 | 16.1 | 0 | 0 | - | - | - |
| Second dataset | Without merging step | 4859842 | 7.5 | 154 | 109 | - | - | - |
We compare the results of sequentially aligning two genome datasets with and without the merging step in the workflow. To estimate the fragmentation of the alignment, we compare the total alignment length, the number of sequences per block and the number of short (< 10 bp) LCBs, focusing on those that contain sequences from only one genome. Comparing the precision, recall and F-score of both alignments against the true alignment of the simulated dataset shows that the merging step does not affect the accuracy of the alignment
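The merging of short one-sequence LCBs evaluated in this table can be illustrated with a simplified sketch. It assumes that each block is a mapping from genome name to aligned sequence, that a short block is always merged into the directly preceding block, and that the other genomes in that block are padded with gaps. The data layout and function name are hypothetical, and the actual workflow may choose the neighboring block differently.

```python
SHORT_LCB_THRESHOLD = 10  # bp; the table note defines short LCBs as < 10 bp


def merge_short_single_blocks(blocks, threshold=SHORT_LCB_THRESHOLD):
    """Merge short one-sequence blocks into the preceding block.

    Each block maps genome name -> aligned sequence; all sequences within one
    block have equal length. The input list is left unchanged.
    """
    merged = []
    for block in blocks:
        length = len(next(iter(block.values())))
        if merged and len(block) == 1 and length < threshold:
            (genome, sequence), = block.items()
            previous = merged[-1]
            if genome in previous:
                # Append the short sequence and pad all other genomes with gaps.
                for name in previous:
                    previous[name] += sequence if name == genome else "-" * length
                continue
        merged.append(dict(block))
    return merged


# Toy alignment (hypothetical data): a 3 bp block that only contains g1.
blocks = [
    {"g1": "ACGT-ACGTAAC", "g2": "ACGTTACG-AAC"},
    {"g1": "TTG"},
    {"g1": "GGGCCCAAATTT", "g2": "GGGCCCAAATTT"},
]
result = merge_short_single_blocks(blocks)
for block in result:
    print(block)
```

In this toy case the number of blocks drops from three to two, while the aligned columns of the first and third block are untouched. Only gap columns are added for g2.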
Precision, Recall and F-Score for alignments of the simulated dataset
| Aligner | Precision | Recall | F-score |
|---|---|---|---|
| TBA | 0.993 | 0.999 | 0.997 |
| progressiveMauve | 0.992 | 0.477 | 0.644 |
| seq-seq-pan | 0.993 | 0.475 | 0.643 |
| Mugsy | 0.999 | 0.474 | 0.643 |
| Mugsy with duplications | 0.999 | 0.474 | 0.643 |
| progressiveCactus | 0.892 | 0.473 | 0.618 |
| progressiveCactus with duplications | 0.999 | 0.339 | 0.506 |
We compare the results of all alignment tools with the true alignment of the simulated genomes. Aligners are sorted first by F-score and then by Recall
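The reported F-scores are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes the F-score for several rows from the rounded precision and recall values in the table. The function name is an assumption, and because the inputs are already rounded to three decimals, a recomputed value can differ from a reported one in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as listed in the table above.
table_rows = {
    "progressiveMauve": (0.992, 0.477),
    "seq-seq-pan": (0.993, 0.475),
    "Mugsy": (0.999, 0.474),
    "progressiveCactus": (0.892, 0.473),
    "progressiveCactus with duplications": (0.999, 0.339),
}

recomputed = {name: round(f_score(p, r), 3) for name, (p, r) in table_rows.items()}
for name, value in recomputed.items():
    print(f"{name}: {value}")
```

The low F-scores of most aligners in this table are driven by recall rather than precision, which the formula makes easy to see: with precision near 1, the F-score is approximately 2R / (1 + R).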
F-score for pairwise comparison of alignment results for the M. tuberculosis dataset
| | seq-seq-pan | Mugsy* | progressiveMauve | TBA | progressiveCactus |
|---|---|---|---|---|---|
| seq-seq-pan | - | | | | |
| Mugsy* | | - | 0.990 | 0.974 | |
| progressiveMauve | | 0.990 | - | 0.972 | 0.928 |
| TBA | | 0.974 | 0.972 | - | 0.914 |
| progressiveCactus | | | 0.928 | 0.914 | - |
We estimate the similarity of alignments of progressiveMauve, Mugsy, progressiveCactus, seq-seq-pan and TBA, by calculating the pairwise F-score. The aligner with the most similar alignment is shown in bold for each aligner. * Aligning 43 M. tuberculosis genomes caused a segmentation fault in Mugsy. We were able to align 39 genomes and therefore compare the results only for this set of sequences
Run-time and memory usage. We compare seq-seq-pan to other whole genome aligners in terms of run-time and memory usage. Time and memory are indicated for single-threaded processes. Individual steps for TBA can be run in parallel
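| Dataset | Aligner | Elapsed wall clock time (hh:mm) | Maximum resident set size (GB) |
|---|---|---|---|
| Simulated dataset (13 genomes) | seq-seq-pan | 00:30 | 0.77 |
| Simulated dataset (13 genomes) | progressiveMauve | 02:33 | 4.93 |
| Simulated dataset (13 genomes) | Mugsy | 01:08 | 1.01 |
| Simulated dataset (13 genomes) | progressiveCactus | 03:41 | 1.00 |
| Simulated dataset (13 genomes) | TBA | 04:59 | 0.34 |
| M. tuberculosis dataset (43 genomes) | seq-seq-pan | 02:06 | 1.20 |
| M. tuberculosis dataset (43 genomes) | progressiveMauve | 09:03 | 2.79 |
| M. tuberculosis dataset (43 genomes) | Mugsy* | 14:52 | 3.26 |
| M. tuberculosis dataset (43 genomes) | progressiveCactus | 47:09 | 5.54 |
| M. tuberculosis dataset (43 genomes) | TBA | 386 days | 1.32 |
| Larger datasets (S. aureus and E. coli) | seq-seq-pan | 08:55 | 4.27 |
| Larger datasets (S. aureus and E. coli) | seq-seq-pan | 68:19 | 8.5 |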
For the larger datasets (S. aureus and E. coli), only seq-seq-pan was used for the alignment because of the run-time limitations of the other tools. * Aligning 43 M. tuberculosis genomes caused a segmentation fault in Mugsy. This table lists data for aligning 39 genomes with Mugsy, but the whole set of 43 genomes for all other tools
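To put the wall clock times into proportion, the hh:mm values can be converted to minutes and expressed relative to seq-seq-pan. The sketch below does this for the row group in which seq-seq-pan takes 02:06. The helper name is an assumption, the TBA entry is left out because it is given in days rather than hh:mm, and the Mugsy figure refers to 39 instead of 43 genomes, so that ratio is only indicative.

```python
def to_minutes(hhmm):
    """Convert an 'hh:mm' string (hours may exceed 24) to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Wall clock times copied from the table above.
runtimes = {
    "seq-seq-pan": "02:06",
    "progressiveMauve": "09:03",
    "Mugsy": "14:52",  # measured on 39 genomes, see table note
    "progressiveCactus": "47:09",
}

baseline = to_minutes(runtimes["seq-seq-pan"])
relative = {tool: to_minutes(t) / baseline for tool, t in runtimes.items()}
for tool, factor in relative.items():
    print(f"{tool}: {to_minutes(runtimes[tool])} min, {factor:.1f}x seq-seq-pan")
```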
Comparison of seq-seq-pan and PanCake
| | seq-seq-pan | PanCake | Nucmer |
|---|---|---|---|
| Time for construction (hh:mm:ss) | 02:06:00 | 88:10:00 | 03:04:00 |
| Maximum memory usage | 1.20 GB | 2.34 GB | 0.10 GB |
| Pan-genome file size | 198 MB | 36 MB | - |
| Time to add genome | 00:04:01 | 05:33:52 | 00:08:48 |
| Mean time for extraction of sequence* | 00:00:09 | 00:01:08 | - |
| Mean time for removing genome** | 00:00:19 | Not available | - |
| Time for consensus genome creation | 00:00:47 | Not available | - |
First, we compare the run-time and memory usage of pan-genome creation for the set of 43 M. tuberculosis genomes. PanCake requires pairwise genome comparisons by nucmer. Run-time and memory requirements for nucmer are listed separately, as these comparisons can be run in parallel. We also evaluate the file size of the resulting pan-genome. We time all available features (adding a genome, extracting part of a genome or the whole genome, removing a genome and constructing a consensus genome). * Extraction times for whole genomes and parts of sequences are equal. We extracted the interval 500-1000 for all genomes. ** Each of the 43 genomes was removed from the whole set