Guillaume Marçais, Arthur L Delcher, Adam M Phillippy, Rachel Coston, Steven L Salzberg, Aleksey Zimin.
Abstract
The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
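The abstract does not spell out how the 141 Tbp limit follows from a 48-bit index. One reading, which is an assumption here and not something stated in the text, is that 47 of the 48 bits are available to address sequence positions (for example, if one bit is reserved). The sketch below only checks that arithmetic and compares the result with the genome sizes listed in the data set table; the constant names are illustrative.

```python
# Assumption (not stated in the source): of the 48 index bits, 47 address
# sequence positions, e.g. because one bit is reserved.
INDEX_BITS = 48
USABLE_BITS = INDEX_BITS - 1  # hypothetical reserved bit

# Largest number of addressable base positions under that assumption.
max_positions_bp = 2 ** USABLE_BITS
max_positions_tbp = max_positions_bp / 1e12  # 1 Tbp = 10^12 bp

# Genome sizes taken from the data set table, in base pairs.
genome_sizes_bp = {
    "Arabidopsis": 120e6,
    "Human": 3.09e9,
    "Chimp": 3.31e9,
}

print(f"Addressable positions: {max_positions_bp} bp")
print(f"Approximate limit: {max_positions_tbp:.1f} Tbp")
for name, size_bp in genome_sizes_bp.items():
    # How many copies of this genome would fit under the limit.
    print(f"{name}: limit is about {max_positions_bp / size_bp:,.0f}x the genome size")
```

Under this assumption the limit is 2^47 bp, which rounds to the 141 Tbp quoted above and is several orders of magnitude larger than any genome in the comparison.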
Year: 2018 PMID: 29373581 PMCID: PMC5802927 DOI: 10.1371/journal.pcbi.1005944
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Comparison of aligner features.
A checkmark means the feature is present and usable; otherwise the feature is absent or its use is impractical. Features that are absent by design are marked with a dash.
| Aligner | Graphical User Interface | Multi-platform Windows/Linux | Multi-threaded | Callable from C++, scripting languages | Whole genome aln. | Short read aln. | Long read aln. | SAM format output | P-value output |
|---|---|---|---|---|---|---|---|---|---|
| MUMmer4 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |||
| MUMmer3 | ✔ | ||||||||
| Blast | ✔ | ✔ | ✔ | ✔ | ✔ | ||||
| Blat | ✔ | ✔ | |||||||
| Mauve | ✔ | ✔ | ✔ | ||||||
| LASTZ | ✔ | ✔ | ✔ | ||||||
| bwa-mem | ✔ | - | ✔ | ✔ | ✔ | ||||
| Bowtie2 | ✔ | - | ✔ | - | ✔ | ||||
| BLASR | ✔ | - | - | ✔ | ✔ | ✔ |
Description of the data sets used for aligner comparisons.
The Illumina and PacBio data for A. thaliana are available from [23]; the human Illumina and PacBio reads are from the Ashkenazi child data set (available from the Genome in a Bottle project [24], NCBI SRA accession SRX847862). The reference genomes are the Arabidopsis thaliana Col-0 reference genome [18], the human reference genome version GRCh38.p7 [19], and the chimpanzee (Pan troglodytes) genome [20] (release PanTro4, GenBank accession GCF_000001515.6).
| Reference | Genome size | Illumina: number of reads | Illumina: bases in reads | Illumina: average read size | PacBio: number of reads | PacBio: bases in reads | PacBio: average read size |
|---|---|---|---|---|---|---|---|
| Arabidopsis | 120 Mb | 23 M | 6919 M | 300 bp | 481 K | 2748 M | 5713 bp |
| Human | 3.09 Gb | 264 M | 39.1 G | 300 bp | 3.9 M | 30.5 G | 7821 bp |
| Chimp | 3.31 Gb | | | | | | |
Timing and memory usage to align two genome sequences for nucmer3 and nucmer4 compared to the Mauve and LASTZ aligners, which are built using similar data structures.
We list both wall clock time and CPU time to show how effectively the code uses multiple cores. Nucmer4 is the fastest, but not the most memory-efficient, aligner. Nucmer3 failed to align the human and chimp assemblies because of its restriction on the size of the reference sequence. The LASTZ and Mauve runs on the human to chimp alignment took over two days, and we stopped them after that. LASTZ defaults are optimized for high sensitivity, resulting in slow performance. For fairness of the timing comparisons we therefore ran LASTZ twice: once with default settings ("LASTZ default") and once with parameters that give sensitivity matching that of nucmer4 with default settings ("LASTZ match"). We list the parameters in the supplement.
| Aligner | Metric | Arabidopsis | Tardigrade | Human/Chimp |
|---|---|---|---|---|
| nucmer3 | Wall time (min) | 17.5 | 19.6 | fail |
| nucmer3 | CPU time (min) | 17.1 | 19.2 | fail |
| nucmer3 | Memory (GB) | 2.1 | 2.3 | fail |
| nucmer4 | Wall time (min) | 3.7 | 4.0 | 207 |
| nucmer4 | CPU time (min) | 22 | 26 | 2897 |
| nucmer4 | Memory (GB) | 4.6 | 4.9 | 66 |
| Mauve | Wall time (min) | 41 | 273 | > 2 days |
| Mauve | CPU time (min) | 38.6 | 268 | > 2 days |
| Mauve | Memory (GB) | 3.3 | 4.0 | > 2 days |
| LASTZ default | Wall time (min) | 1122 | > 2 days | > 2 days |
| LASTZ default | CPU time (min) | 1113 | > 2 days | > 2 days |
| LASTZ default | Memory (GB) | 1.3 | | |
| LASTZ match | Wall time (min) | 66 | 77 | > 2 days |
| LASTZ match | CPU time (min) | 66 | 76 | > 2 days |
| LASTZ match | Memory (GB) | 0.6 | 0.4 | |
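One way to read the wall clock and CPU columns together is to divide CPU time by wall time, which gives a rough estimate of the average number of cores kept busy during a run. The sketch below applies that ratio to the completed runs in the table above; the function names are illustrative, and the ratio is only an approximation because it ignores I/O waits and serial phases.

```python
# (wall minutes, CPU minutes) copied from the table above.
# Runs reported as "fail" or "> 2 days" are omitted.
TIMINGS = {
    ("nucmer3", "Arabidopsis"): (17.5, 17.1),
    ("nucmer3", "Tardigrade"): (19.6, 19.2),
    ("nucmer4", "Arabidopsis"): (3.7, 22.0),
    ("nucmer4", "Tardigrade"): (4.0, 26.0),
    ("nucmer4", "Human/Chimp"): (207.0, 2897.0),
    ("Mauve", "Arabidopsis"): (41.0, 38.6),
    ("Mauve", "Tardigrade"): (273.0, 268.0),
    ("LASTZ default", "Arabidopsis"): (1122.0, 1113.0),
    ("LASTZ match", "Arabidopsis"): (66.0, 66.0),
    ("LASTZ match", "Tardigrade"): (77.0, 76.0),
}


def average_busy_cores(wall_min, cpu_min):
    """CPU time divided by wall time: rough average number of busy cores."""
    return cpu_min / wall_min


def wall_speedup(baseline_wall_min, new_wall_min):
    """How many times faster the new run finished in wall clock terms."""
    return baseline_wall_min / new_wall_min


for (aligner, dataset), (wall, cpu) in TIMINGS.items():
    ratio = average_busy_cores(wall, cpu)
    print(f"{aligner:14s} {dataset:12s} CPU/wall = {ratio:.1f}")

for dataset in ("Arabidopsis", "Tardigrade"):
    old_wall = TIMINGS[("nucmer3", dataset)][0]
    new_wall = TIMINGS[("nucmer4", dataset)][0]
    print(f"nucmer4 vs nucmer3 on {dataset}: "
          f"{wall_speedup(old_wall, new_wall):.1f}x faster (wall clock)")
```

For the single-threaded aligners the ratio stays close to 1, while for nucmer4 it is roughly 6 on the two smaller genomes and about 14 on the human/chimp alignment, which is consistent with the lower wall clock times despite higher total CPU time.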
Timing and memory usage to align PacBio and Illumina reads to the Arabidopsis thaliana reference genome.
Timings reported here include the time used to build the genome index. The alignments reported by nucmer3 and nucmer4 for the Illumina data were identical. Nucmer3 experienced a reproducible crash when aligning PacBio reads to the A. thaliana reference.
| Aligner | PacBio: time (min) | PacBio: memory (MB) | PacBio: aligned (Mbp) | PacBio: aligned reads | Illumina: time (min) | Illumina: memory (MB) | Illumina: aligned (Mbp) | Illumina: aligned reads |
|---|---|---|---|---|---|---|---|---|
| blasr | 95 | 4065 | 1780 | 435888 | | | | |
| bwa-mem | 49 | 2162 | 1944 | 420912 | 30 | 3360 | 6112 | 21874366 |
| bowtie2 | | | | | 24 | 686 | 5580 | 18716070 |
| nucmer3 | fail | fail | fail | fail | 334 | 4688 | 5651 | 19873013 |
| nucmer4 | 24 | 5743 | 1713 | 424271 | 29 | 1283 | 5651 | 19873013 |
Performance of Nucmer4, BLASR and BWA MEM on data simulated by pbsim from human and Arabidopsis reference genomes.
All numbers are percentages of the total bases in the reads that were aligned correctly, missed, or aligned incorrectly. The numbers may not add to exactly 100 due to rounding.
| Aligner | Arabidopsis: aligned correctly | Arabidopsis: missed | Arabidopsis: aligned incorrectly | Human: aligned correctly | Human: missed | Human: aligned incorrectly |
|---|---|---|---|---|---|---|
| nucmer4 | 94.0 | 3.5 | 2.5 | 84.4 | 10.9 | 4.6 |
| blasr | 98.2 | 0.2 | 1.7 | 91.8 | 5.0 | 3.2 |
| bwa-mem | 98.7 | 0.5 | 0.8 | 91.6 | 5.9 | 2.5 |
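The rounding note in the caption can be checked directly: each aligner's three percentages for a genome should sum to 100 within a small rounding tolerance. The sketch below is only a consistency check on the values in the table above; the tolerance of 0.15 percentage points is an assumption chosen for three values rounded to one decimal place.

```python
# (aligned correctly, missed, aligned incorrectly) percentages from the table.
RESULTS = {
    ("nucmer4", "Arabidopsis"): (94.0, 3.5, 2.5),
    ("nucmer4", "Human"): (84.4, 10.9, 4.6),
    ("blasr", "Arabidopsis"): (98.2, 0.2, 1.7),
    ("blasr", "Human"): (91.8, 5.0, 3.2),
    ("bwa-mem", "Arabidopsis"): (98.7, 0.5, 0.8),
    ("bwa-mem", "Human"): (91.6, 5.9, 2.5),
}

# Assumed tolerance: three values rounded to 0.1 can drift by up to 0.15.
TOLERANCE = 0.15


def rounding_gap(percentages):
    """Signed difference between the row total and 100 percent."""
    return sum(percentages) - 100.0


for (aligner, genome), row in RESULTS.items():
    gap = rounding_gap(row)
    status = "ok" if abs(gap) <= TOLERANCE else "check"
    print(f"{aligner:8s} {genome:12s} total = {sum(row):.1f} ({status})")
```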
Timing and memory usage to align Illumina and PacBio reads to human reference.
| Reads | Aligner | Build index: time (min) | Build index: memory (GB) | Align: time (min) | Align: memory (GB) | Aligned bases (Gbp) | Aligned reads |
|---|---|---|---|---|---|---|---|
| Illumina | bwa-mem | 96 | 4.5 | 197 | 11.2 | 38.46 | 263155221 |
| Illumina | bowtie2 | 51 | 18.6 | 163 | 4.0 | 38.00 | 258560571 |
| Illumina | nucmer4 | 36 | 45.1 | 146 | 45.5 | 36.71 | 250689492 |
| PacBio | blasr | 40 | 29.4 | 1680 | 47.9 | 24.41 | 3836927 |
| PacBio | bwa-mem | 96 | 4.5 | 1473 | 7.7 | 25.86 | 3820163 |
| PacBio | nucmer4 | 36 | 45.1 | 850 | 50.1 | 23.02 | 3784039 |
Fig 1. Scaling of nucmer4’s performance when aligning Illumina reads to the A. thaliana genome with 1–32 threads.
All tests were run on a 32-core AMD Opteron computer.