Daniel D Sommer, Arthur L Delcher, Steven L Salzberg, Mihai Pop.
Abstract
BACKGROUND: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run.
Year: 2007 PMID: 17324286 PMCID: PMC1821043 DOI: 10.1186/1471-2105-8-64
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the Minimus pipeline. Several independent modules of the AMOS package (shown as ovals) interact through the AMOS API with a central data structure (called a Bank). The order of execution of the individual modules is shown by the arrow. Note that the inputs and outputs of minimus follow the AMOS file format (AMOS message files). The AMOS package provides converters between this file format and virtually all commonly used formats for representing sequence data and genome assemblies.
Figure 2. Top: BLAST alignments of contigs resulting from the minimus assembly of zebrafish shotgun reads to the human GPC3 protein. The top four matches correspond to contigs that do not have any significant similarity to each other at the nucleotide level, indicating the presence of at least 4 homologues of the GPC3 gene in the zebrafish genome. Bottom: Alignment of the zebrafish GPC3 protein to the human GPC3 protein, highlighting that the minimus assembly covers the majority of this gene.
Comparison of Minimus and phrap in the assembly of 10 mouse BACs from data obtained from the NCBI Trace Archive.
| BAC | BAC size (bp) | # Reads/seq. coverage | Assembler | Running time | # Contigs | N50 contig size (kbp) | Coverage (%) | # errors |
| RP23-179K16 | 195,061 | 3685 / 8 | Minimus | 1 m 45 s | 40 | 4.2 | 99.9 | 0 |
| | | | phrap | 2 m 55 s | 14 | 16.1 | 99.9 | 2 |
| RP23-188E5 | 157,996 | 2983 / 7 | Minimus | 1 m 5 s | 43 | 4.3 | 99.9 | 0 |
| | | | phrap | 2 m 33 s | 16 | 16.9 | 99.7 | 2 |
| RP23-111A22 | 200,329 | 5428 / 10 | Minimus | 56 s | 244 | 4.8 | 98.7 | 3 |
| | | | phrap | 1 m 43 s | 183 | 17 | 98.4 | 14 |
| RP23-271013 | 239,837 | 7601 / 14 | Minimus | 3 m 11 s | 448 | 1.5 | 100.0 | 2 |
| | | | phrap | 6 m 30 s | 329 | 6.3 | 98.6 | 10 |
| RP23-283E4 | 178,084 | 5708 / 15 | Minimus | 3 m 49 s | 713 | 1.4 | 99.9 | 2 |
| | | | phrap | 9 m 53 s | 467 | 3.9 | 98.7 | 8 |
| RP23-286D16 | 195,068 | 4969 / 8 | Minimus | 41 s | 90 | 9 | 99.9 | 2 |
| | | | phrap | 1 m 22 s | 264 | 40 | 99.9 | 5 |
| RP23-296N18 | 187,242 | 1536 / 6 | Minimus | 36 s | 52 | 6.5 | 99.9 | 0 |
| | | | phrap | 1 m 5 s | 34 | 16.1 | 99.0 | 12 |
| RP23-319P12 | 190,514 | 5629 / 14 | Minimus | 1 m 39 s | 131 | 4.9 | 99.9 | 3 |
| | | | phrap | 2 m 29 s | 139 | 18 | 98.0 | 18 |
| RP23-363E23 | 199,409 | 5301 / 12 | Minimus | 1 m 19 s | 111 | 5.5 | 99.9 | 3 |
| | | | phrap | 2 m 12 s | 178 | 20 | 100 | 15 |
| RP23-425H1 | 188,835 | 1536 / 6 | Minimus | 14 s | 46 | 10 | 97.2 | 1 |
| | | | phrap | 38 s | 28 | 21 | 98.4 | 5 |
Minimus ran considerably faster than phrap and produced fewer errors on every BAC, in most cases at the expense of a larger number of contigs. Note that the table contains two quantities denoted "coverage": the sequencing coverage (reported in the #Reads/seq. coverage column) represents the total amount of DNA in the sequenced reads divided by the size of the sequenced molecule (here, the BAC), i.e. the redundancy in the sequenced data; the column headed "coverage" represents the fraction of the reference sequence covered by assembled contigs. The latter measure does not take into account assembly errors, i.e. partial contig matches contribute to the overall coverage.
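The two coverage measures described in the note above can be illustrated with a minimal Python sketch. This is not the procedure the authors used; the function names, the half-open interval convention, and the example read lengths and alignment coordinates are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def sequencing_coverage(read_lengths: Iterable[int], target_size: int) -> float:
    """Total bases in the reads divided by the size of the sequenced target."""
    return sum(read_lengths) / target_size


def reference_coverage(
    alignments: Iterable[Tuple[int, int]], reference_size: int
) -> float:
    """Percentage of reference positions covered by at least one contig alignment.

    Each alignment is a half-open interval [start, end) in reference
    coordinates (an assumed convention). Overlapping intervals are merged so
    that no position is counted twice. Partial contig matches count toward
    the total, so misassembled contigs still contribute.
    """
    covered = 0
    cur_start: Optional[int] = None
    cur_end: Optional[int] = None
    for start, end in sorted(alignments):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / reference_size


# Hypothetical inputs, not taken from the paper's data.
print(sequencing_coverage([800] * 1000, 100_000))
print(reference_coverage([(0, 400), (300, 700), (900, 1000)], 1000))
```

Merging intervals before summing matters because contigs aligned to overlapping regions of the reference would otherwise be double-counted.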
Figure 3. Dot plots of alignments of assemblies produced by minimus (top) and phrap (bottom) to the completed Brucella suis genome. The horizontal lines indicate the boundaries between assembled contigs, which are represented on the y axis. The vertical line separates the two chromosomes of Brucella suis, which are represented on the x axis. The minimus assembly (top) perfectly matches the reference sequence, as indicated by all matches lying along the main diagonal (except the contig at the bottom center, which spans the origin of the circular chromosome). The phrap assembly (bottom) shows many discrepancies with respect to the reference sequence (off-diagonal segments), including several contigs that incorrectly join segments of the two distinct chromosomes (e.g., second and third contigs from the bottom). Note that the ordering of the contigs implied by these figures is an artifact of the alignment to the reference sequence and does not correspond to the order in which the contigs were reported by the specific assembly tools. The discrepancies between the phrap assembly and the reference sequence prevent us from providing a consistent ordering for this assembly.
Comparison of Minimus and phrap in the assembly of two bacterial genomes (Brucella suis and Staphylococcus aureus).
| Genome | Genome size (Mbp) | # Reads/seq. coverage | Assembler | Running time | # Contigs | N50 contig size (kbp) | Coverage (%) | # errors |
| Brucella suis | 3.3 | 36080 / 7.8 | Minimus | 6 m 30 s | 110 | 43 | 99.9 | 0 |
| | | | phrap | 30 m 2 s | 39 | 196 | 99.7 | 8 |
| Staphylococcus aureus | 2.9 | 49014 / 9.2 | Minimus | 16 m 40 s | 85 | 51 | 99.8 | 0 |
| | | | phrap | 40 m | 30 | 190 | 99.7 | 5 |
Minimus ran faster than phrap and produced no errors. However, it generated a considerably larger number of contigs. Note that the table contains two quantities denoted "coverage": the sequencing coverage (reported in the #Reads/seq. coverage column) represents the total amount of DNA in the sequenced reads divided by the size of the genome, i.e. the redundancy in the sequenced data; the column headed "coverage" represents the fraction of the reference sequence covered by assembled contigs. The latter measure does not take into account assembly errors, i.e. partial contig matches contribute to the overall coverage.
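Both tables report an N50 contig size. As a rough illustration, the sketch below computes N50 under the common definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. The record does not state which variant the authors used (for example, one normalized by genome size rather than assembly size), so this definition, the function name, and the example lengths are assumptions.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty assembly).

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    contig lengths.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, not taken from the paper's assemblies.
print(n50([80, 70, 50, 40, 30, 20]))
```

Under this definition a more fragmented assembly of the same sequence yields a smaller N50, which matches the pattern in the tables wherever Minimus reports more contigs than phrap.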