Jürgen Kleffe, Robert Weißmann, Florian F Schmitzberger.
Abstract
We compare the results of three assembler programs, Celera, Phrap and Mira2, for the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons, we focus not on computation speed or the number of assembled contigs but on how far the assemblies agree in content. We identify genome regions that all three programs assemble consistently in order to estimate a lower bound on the number of erroneously identified single nucleotide polymorphisms (SNPs) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de novo assemblies, spanning only 34% (3.3 Mb) of the bacterial genome; 175 of these regions (~1.5 Mb) include erroneous SNPs and insertions/deletions. Within these triplets this leads, on average, to one error per 7,155 base pairs. When Mira2 is replaced by the most recent version, Mira3, the latter number drops even further, to 5,923. Our results therefore suggest that a considerable number of erroneous SNPs may be present in current sequence data and that mathematicians should urgently take up research on the numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are supplied as input. Such errors will severely hamper molecular diagnostics as well as efforts to relate genome variation to disease. This issue needs to be addressed urgently as the field is moving fast into clinical applications.
Keywords: SNP; genome comparison; genome variation; sequence assembly; single nucleotide polymorphism
Year: 2010 PMID: 26279623 PMCID: PMC4510600 DOI: 10.4137/GEI.S3653
Source DB: PubMed Journal: Genomics Insights ISSN: 1178-6310
Figure 1. Three pairwise complete matches shown by blue boxes and three 3-fold complete matches shown by green boxes. The arrows indicate contigs of the three assemblies A to C. The boxes show regions of near-perfect alignment between assemblies.
Figure 2. Some parts of multiple alignments of 3-fold complete matches.
Comparing the results of the assemblers Celera, Phrap, Mira2 and Mira3 obtained for 109,289 Sanger reads.
| Assembler | Contigs | Total size | Av. length | W50 |
|---|---|---|---|---|
| Celera | 162 | 10,232,260 | 63,162 | 22 |
| Phrap | 390 | 10,383,774 | 26,625 | 42 |
| Mira2 | 1,289 | 11,583,563 | 8,986 | 81 |
| Mira3 | 1,478 | 10,691,233 | 7,233 | 226 |

| Celera, Phrap, Mira2 | All | With errors | Share with errors |
|---|---|---|---|
| 3-fold matches | 509 | 175 | 31% |
| Total size | 3,341,792 | 1,501,621 | 45% |
| Alignment errors | 467 | 467 | |
| Bases per error | 7,155 | 3,940 | |

| Celera, Phrap, Mira3 | All | With errors | Share with errors |
|---|---|---|---|
| 3-fold matches | 514 | 216 | 42% |
| Total size | 3,115,451 | 1,400,282 | 45% |
| Alignment errors | 526 | 526 | |
| Bases per error | 5,923 | 2,662 | |
The upper block provides summary data for all considered assemblies; the middle block describes 3-fold complete matches of Celera, Phrap and Mira2, while the last block gives the same data with Mira2 replaced by Mira3.
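The derived rows of the table can be checked from the raw counts. The following is a minimal sketch, assuming (the source does not state the formulas) that average length is total size divided by the number of contigs, truncated to whole bases, and that bases per error is total size divided by the number of alignment errors. The function names are illustrative only, and the check is limited to the first value column of each 3-fold block.

```python
# Values copied from the table above.
ASSEMBLIES = {
    "Celera": (162, 10_232_260),   # (contigs, total size)
    "Phrap": (390, 10_383_774),
    "Mira2": (1_289, 11_583_563),
    "Mira3": (1_478, 10_691_233),
}
MATCH_BLOCKS = {
    "Celera/Phrap/Mira2": (3_341_792, 467),  # (total size, alignment errors)
    "Celera/Phrap/Mira3": (3_115_451, 526),
}


def average_length(contigs: int, total_size: int) -> int:
    """Average contig length in bases, truncated (assumed formula)."""
    return total_size // contigs


def bases_per_error(total_size: int, errors: int) -> float:
    """Mean number of aligned bases per alignment error (assumed formula)."""
    return total_size / errors


for name, (contigs, total) in ASSEMBLIES.items():
    print(f"{name}: average length {average_length(contigs, total):,}")
for name, (total, errors) in MATCH_BLOCKS.items():
    print(f"{name}: {bases_per_error(total, errors):,.1f} bases per error")
```

Under these assumptions the average lengths reproduce the tabulated values, and the two ratios fall within one base of the reported 7,155 and 5,923.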
Relative frequencies of letter replacements.
| | A | C | G | T | gap |
|---|---|---|---|---|---|
| A | | 0.139 | 0.142 | 0.061 | 0.076 |
| C | 0.125 | | 0.144 | 0.144 | 0.094 |
| G | 0.131 | 0.188 | | 0.104 | 0.050 |
| T | 0.047 | 0.153 | 0.106 | | 0.046 |
| gap | 0.108 | 0.070 | 0.057 | 0.015 | |
All 3-fold matching contig sections were aligned using ClustalW, and the pairs of letters that appeared in error columns (as seen in Figure 2) were recorded. The upper part shows relative frequencies observed for Celera, Phrap and Mira2 contigs, while the lower part results from replacing Mira2 by Mira3.
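The tallying step described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the input is taken to be three already-aligned, equal-length strings with `-` as the gap symbol, and each distinct unordered pair of differing letters is counted once per error column. The function names and the toy sequences are made up. The counts are normalised to sum to 1, which is consistent with each half of the table summing to 1.000.

```python
from collections import Counter
from itertools import combinations


def error_columns(aligned):
    """Yield (index, column) for alignment columns whose letters differ."""
    for i, column in enumerate(zip(*aligned)):
        if len(set(column)) > 1:
            yield i, column


def replacement_frequencies(aligned):
    """Relative frequencies of unordered letter pairs seen in error columns."""
    counts = Counter()
    for _, column in error_columns(aligned):
        # Assumption: every distinct pair of differing letters counts once.
        for a, b in combinations(sorted(set(column)), 2):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Toy alignment (made-up sequences; '-' marks a gap).
toy = [
    "ACGTACGT",
    "ACGAACGT",
    "ACGTAC-T",
]
freqs = replacement_frequencies(toy)
for pair, freq in freqs.items():
    print(pair, freq)
```

In a real analysis the aligned strings would come from the ClustalW output for each 3-fold match; parsing that output is left out here.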
Results of the Celera assembler for 109,289 Sanger reads in original, reversed and shuffled order.
| Read order | Contigs | Total size | Av. length | W50 |
|---|---|---|---|---|
| Original | 162 | 10,232,260 | 63,162 | 22 |
| Reversed | 164 | 10,229,889 | 62,377 | 22 |
| Shuffled | 165 | 10,230,667 | 62,004 | 22 |

| Original, reversed, shuffled | All | With errors | Share with errors |
|---|---|---|---|
| 3-fold matches | 215 | 36 | 17% |
| Total size | 9,850,412 | 378,653 | 4% |
| Alignment errors | 162 | 162 | |
| Bases per error | 60,805 | 2,337 | |
The upper block provides summary data for the three assemblies while the lower block describes 3-fold complete matches of contigs resulting from original, reversed and shuffled reads.
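The three input orders compared here can be produced with a few lines of code. The sketch below is hypothetical: the source does not describe how the reversed and shuffled read sets were generated, so the function name, the fixed seed and the in-memory list of reads are all assumptions made for illustration.

```python
import random


def read_orderings(reads, seed=0):
    """Return the same reads in original, reversed and shuffled order."""
    original = list(reads)
    shuffled = original[:]
    # A fixed seed (an assumption) keeps the shuffled order reproducible.
    random.Random(seed).shuffle(shuffled)
    return {
        "original": original,
        "reversed": original[::-1],
        "shuffled": shuffled,
    }


# Toy read identifiers standing in for the 109,289 Sanger reads.
orders = read_orderings(["read1", "read2", "read3", "read4"])
for name, reads in orders.items():
    print(name, reads)
```

Each ordering contains exactly the same reads, so any difference between the resulting assemblies can only come from the order in which the assembler receives them.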
Figure 3. Examples of direction-dependent alignments.
Figure 4. The effect of exchanging the calculation of an alignment with complementing sequences. Details are described in the text.