Abstract
MOTIVATION: Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole genomes challenging. Currently, none of the available methods offers the possibility to align or extend WGA profiles.
Year: 2019 PMID: 31510683 PMCID: PMC6612806 DOI: 10.1093/bioinformatics/btz377
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Example of the SuperGenome data structure with its arrays G and SG for a genome with its aligned genomic sequence. As visually encoded, the array G represents the alignment position for each base, while SG stores the genomic positions of the aligned bases. Negative values in G and SG indicate the alignment of the reverse complement of the base. (B) Based on the three genomic sequences g1, g2 and g3 and their alignment , a SuperGenome data structure (union of the SG and G arrays) is computed. In addition, a consensus sequence from the alignment is deduced. (C) A new genomic sequence g4 is combined with . For this a pairwise alignment is computed and again a SuperGenome data structure is deduced. (D) Update of by g4: for every position j in , (1) array (orange) contains the index of the aligned consensus sequence positions, which is used to determine the original genomic positions (example shown in blue). (2) This allows a coordinate transfer and (red) into a common coordinate system of . (E) From the updated SuperGenome data structure the new alignment is easily deduced.
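The arrays described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_supergenome` and `transfer_coordinates`, the 1-based positions, the use of 0 for gap columns, and the assumption that the consensus sequence has exactly one character per column of the original alignment are all choices made here for illustration.

```python
def build_supergenome(aligned_row, reverse=False):
    """Build illustrative G and SG arrays for one gapped alignment row.

    Positions are 1-based so that the sign can encode the strand;
    a 0 in SG marks a gap column (an assumption of this sketch).
    """
    n = sum(ch != "-" for ch in aligned_row)  # number of genomic bases
    sign = -1 if reverse else 1
    G = [0] * n                  # G[p-1]  = alignment column of base p
    SG = [0] * len(aligned_row)  # SG[c-1] = genomic position in column c
    seen = 0
    for col, ch in enumerate(aligned_row, start=1):
        if ch == "-":
            continue
        seen += 1
        # A reverse-complemented row lists the genomic bases back to front.
        pos = n - seen + 1 if reverse else seen
        G[pos - 1] = sign * col
        SG[col - 1] = sign * pos
    return G, SG


def transfer_coordinates(sg_consensus_in_guide, sg_genome_in_old):
    """Map a genome from its old alignment into the guiding alignment.

    sg_consensus_in_guide: SG array of the consensus in the guiding
        alignment (guide column -> consensus position = old column).
    sg_genome_in_old: SG array of the genome in its old alignment.
    """
    new_sg = []
    for c in sg_consensus_in_guide:
        if c == 0:  # consensus has a gap here, so the genome does too
            new_sg.append(0)
            continue
        g = sg_genome_in_old[abs(c) - 1]
        # If the consensus is aligned in reverse, the strand flips.
        new_sg.append(-g if c < 0 else g)
    return new_sg


G, SG = build_supergenome("AC-GT")
print(G)   # [1, 2, 4, 5]
print(SG)  # [1, 2, 0, 3, 4]
print(transfer_coordinates([1, 2, 0, 3, 4, 5], SG))  # [1, 2, 0, 0, 3, 4]
```

Storing both directions (base to column and column to base) makes the coordinate transfer in panel (D) a pair of array lookups per position, at the cost of keeping two arrays per genome.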
Fig. 2. Workflow of merging several alignments and genomes into a large WGA using the SuperGenome data structure. The workflow consists of six steps: (1) build a SuperGenome for every alignment, (2) compute a consensus sequence from each input alignment, (3) align all consensus and genome sequences, (4) build the SuperGenome of the guiding alignment, (5) merge all alignments and genomes (marked with a star) via coordinate transfer and (6) output the alignment of all sequences in XMFA format, derived from the updated SuperGenome data structure.
Datasets used for the WGA computations, as obtained from the NCBI FTP server
| Organism | No. of strains | Median genome length (Mb) | GC content (%) |
|---|---|---|---|
| | 13 | 5.760 | 35.1 |
| | 30 | 2.975 | 37.9 |
| | 72 | 1.046 | 41.3 |
| | 128 | 4.385 | 65.6 |
| | 166 | 5.590 | 57.2 |
| | 176 | 2.847 | 32.8 |
| | 326 | 4.100 | 67.7 |
Note: All statistics including the median genome length and median GC content have been derived from NCBI.
Evaluation of WGAs generated by progressiveMauve (PM) and GPA based on simulated WGAs by EVOLVER (EVO)
| No. of strains | Runtime PM [min] | Runtime GPA [min] | PC score PM | PC score GPA | F-score PM with EVO | F-score GPA with EVO | F-score PM with GPA | TC score GPA in PM | TC score PM in EVO | TC score GPA in EVO | Memory usage PM [GB] | Memory usage GPA [GB] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 26 | 8 | 97.27% | 97.12% | 0.985 | 0.985 | 0.992 | 92.54% | 92.26% | 90.39% | 0.72 | 2.50 |
| 20 | 125 | 39 | 98.28% | 98.24% | 0.997 | 0.996 | 0.996 | 93.94% | 96.40% | 95.36% | 1.34 | 4.59 |
| 40 | 678 | 121 | 94.78% | 93.86% | 0.976 | 0.971 | 0.990 | 62.84% | 60.18% | 58.74% | 2.33 | 10.14 |
| 80 | 14541 | 163 | 94.49% | 95.02% | 0.976 | 0.977 | 0.984 | 63.67% | 68.03% | 70.68% | 23.11 | 16.83 |
| 326 | — | 653 | — | 88.14% | — | 0.948 | — | — | — | 14.13% | — | 47.21 |
Note: All WGAs were evaluated with respect to their runtime (wall-clock time), average PC score, F-score achieved against EVO and maximal amount of RAM used during WGA computation. In addition, the TC score and F-score between PM and GPA are reported; for these, PM is used as the reference. GPA was run with several different values of the merge-size parameter k; for each dataset, the run with the highest F-score achieved with EVO is reported.
Evaluation of WGAs generated by GPA based on simulated WGAs with several different parameters k for the maximal merge size
| Merge size k (80 strains) | Runtime [min] | PC score | F-score (GPA with EVO) | F-score (GPA with PM) | TC score (GPA in PM) | TC score (GPA in EVO) |
|---|---|---|---|---|---|---|
| 3 | 24 | 88.03% | 0.942 | 0.947 | 55.40% | 61.46% |
| 7 | 34 | 90.38% | 0.955 | 0.961 | 59.02% | 64.90% |
| 12 | 96 | 94.35% | 0.976 | 0.982 | 62.29% | 69.61% |
| 17 | 163 | 94.70% | 0.977 | 0.983 | 63.67% | 70.68% |
| 22 | 177 | 94.53% | 0.977 | 0.984 | 63.55% | 70.62% |
Note: All WGAs were evaluated with respect to their runtime (wall-clock time) and PC score. In addition, TC score and F-scores between GPA and EVO as well as GPA and progressiveMauve are reported.
Comparison between the WGA construction using the original progressiveMauve (PM) method and our SuperGenome-based iterative profile alignment approach of GPA
| Organism | No. of strains | Runtime PM [min] | Runtime GPA [min] | No. of LCBs PM | No. of LCBs GPA | PC score PM | PC score GPA | TC score (GPA in PM) | F-score |
|---|---|---|---|---|---|---|---|---|---|
| (4.385 Mb) | 10 | 27 | 9 | 3 | 6 | 97.90% | 97.97% | 93.99% | 0.998 |
| (4.385 Mb) | 20 | 98 | 22 | 16 | 4 | 96.83% | 97.29% | 88.31% | 0.994 |
| (4.385 Mb) | 40 | 475 | 41 | 230 | 72 | 96.00% | 96.61% | 85.05% | 0.993 |
| (4.385 Mb) | 80 | 1665 | 105 | 852 | 148 | 92.36% | 93.03% | 73.03% | 0.990 |
| (4.385 Mb) | 128 | >350 h (a) | 527 | — | 437 | — | 90.67% | — | — |
| (5.590 Mb) | 10 | 55 | 25 | 177 | 160 | 38.77% | 38.94% | 73.13% | 0.991 |
| (5.590 Mb) | 20 | 691 | 90 | 784 | 769 | 41.31% | 41.42% | 67.18% | 0.989 |
| (5.590 Mb) | 40 | 9401 | 99 | 3369 | 1463 | 42.91% | 43.15% | 55.94% | 0.981 |
| (5.590 Mb) | 80 | 120 038 | 329 | 3986 | 4632 | 42.15% | 42.38% | 45.02% | 0.974 |
| (5.590 Mb) | 166 | >6250 h (a) | 771 | — | 8558 | — | 41.19% | — | — |
| (2.847 Mb) | 10 | 16 | 7 | 221 | 53 | 82.45% | 81.81% | 73.07% | 0.983 |
| (2.847 Mb) | 20 | 64 | 33 | 415 | 362 | 81.58% | 81.11% | 67.83% | 0.986 |
| (2.847 Mb) | 40 | 552 | 28 | 1222 | 555 | 76.33% | 75.70% | 56.63% | 0.976 |
| (2.847 Mb) | 80 | 5213 | 121 | 2492 | 2498 | 73.02% | 72.50% | 47.37% | 0.976 |
| (2.847 Mb) | 176 | >1650 h (a) | 198 | — | 3213 | — | 71.91% | — | — |
| (4.100 Mb) | 10 | 24 | 9 | 64 | 56 | 58.77% | 58.99% | 95.87% | 0.995 |
| (4.100 Mb) | 20 | 99 | 16 | 88 | 115 | 58.56% | 58.45% | 90.06% | 0.991 |
| (4.100 Mb) | 40 | 503 | 96 | 165 | 172 | 57.65% | 57.78% | 85.48% | 0.990 |
| (4.100 Mb) | 80 | 1683 | 78 | 314 | 373 | 58.69% | 58.39% | 75.57% | 0.981 |
| (4.100 Mb) | 326 | >6250 h (a) | 504 | — | 3708 | — | 47.37% | — | — |
| | | | | | | | | | |
| (5.760 Mb) | 13 | 225 | 61 | 380 | 214 | 60.00% | 60.63% | 62.40% | 0.977 |
| (2.975 Mb) | 30 | 628 | 28 | 293 | 254 | 73.41% | 73.92% | 61.50% | 0.989 |
| (1.046 Mb) | 72 | 351 | 14 | 4 | 80 | 98.35% | 98.36% | 86.68% | 0.995 |
Note: The results are divided into two groups according to whether GPA used a guide tree (guide tree) or not (random). The runtime of the WGA computation and the number of LCBs of the respective WGAs are reported. In addition, the WGAs were evaluated with respect to their average PC score, the TC score (% of identical aligned columns in PM) and the F-score. For the calculation of both the TC score and the F-score, PM is used as the reference. GPA was run with several different values of the merge-size parameter k; for each dataset, the run with the highest PC score is reported.
(a) Time elapsed until we manually aborted the computation.
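The table note describes the TC score as the percentage of reference (PM) columns that are reproduced identically in the evaluated alignment. A minimal sketch of that computation follows, assuming each alignment column is represented as a tuple holding one genomic position per genome (0 for a gap). The function names `columns_from_sg` and `tc_score` are hypothetical, and the handling of all-gap or single-genome columns is not specified in the source and is ignored here.

```python
def columns_from_sg(sg_arrays):
    """Turn per-genome SG arrays (column -> genomic position, 0 = gap)
    into a list of column tuples, one tuple per alignment column."""
    return list(zip(*sg_arrays))


def tc_score(test_columns, reference_columns):
    """Percentage of reference columns found identically in the test alignment."""
    reference = set(reference_columns)
    if not reference:
        raise ValueError("reference alignment has no columns")
    shared = reference & set(test_columns)
    return 100.0 * len(shared) / len(reference)


# Toy example with three genomes and three columns per alignment.
reference = columns_from_sg([[1, 2, 3], [1, 2, 3], [1, 0, 2]])
test = columns_from_sg([[1, 2, 3], [1, 2, 3], [1, 2, 0]])
print(round(tc_score(test, reference), 2))  # 33.33
```

Because a column counts only if every genome agrees, a single misplaced base invalidates the whole column, which is one reason the TC values in the table fall as the number of strains grows.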
Fig. 3. Comparison of the measured runtime needed to construct the WGA as a function of the number of genomes for the datasets of S. aureus (left) and B. pertussis (right). In addition to the direct comparison between progressiveMauve (orange) and GPA, the upper left section shows only the runtime (wall-clock time) of GPA (blue) and the GPA CPU time (green), together with the r² values of the linear regression. The respective regression curves (PM cubic/GPA linear) were computed with R.
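The linear regression mentioned in the caption was computed by the authors in R. As a rough illustration only, the sketch below fits an ordinary least-squares line in plain Python and reports r². The helper name `linear_fit` is hypothetical, and the input is the GPA runtime column of the 4.100 Mb dataset from the comparison table above, so the result need not match the values shown in the figure.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with r^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Number of strains and GPA runtime [min] for the 4.100 Mb dataset.
strains = [10, 20, 40, 80, 326]
gpa_minutes = [9, 16, 96, 78, 504]
slope, intercept, r_squared = linear_fit(strains, gpa_minutes)
print(f"minutes per additional genome: {slope:.2f}, r^2: {r_squared:.3f}")
```

A cubic fit for the progressiveMauve runtimes would follow the same least-squares idea with a degree-3 polynomial; it is omitted here to keep the sketch dependency-free.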