Stefan Taudien, Burkhard Steuernagel, Ruvini Ariyadasa, Daniela Schulte, Thomas Schmutzer, Marco Groth, Marius Felder, Andreas Petzold, Uwe Scholz, Klaus FX Mayer, Nils Stein, Matthias Platzer.
Abstract
BACKGROUND: Next generation sequencing of BACs is a viable option for deciphering the sequence of even large and highly repetitive genomes. To optimize this strategy, we examined the influence of read length on the quality of Roche/454 sequence assemblies, the extent to which Illumina/Solexa mate pairs (MPs) improve the assemblies by scaffolding, and whether barcoding of BACs is dispensable.
Year: 2011 PMID: 21999860 PMCID: PMC3213688 DOI: 10.1186/1756-0500-4-411
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Comparison of the 454 assemblies of barcoded BACs with their Sanger reference sequences.
| BAC | chemistry¹ | sequence depth | average read length (bp) | L50 (bp) | L80 (bp) | misassemblies² | gaps | total gap size (bp) | penalty³ |
|---|---|---|---|---|---|---|---|---|---|
| 184G09 (120,562 bp) | bcFLX | 27 | 224 | 52,352 | 52,352 | 0 | 1 | 50 | 1 |
| 184G09 (120,562 bp) | bcTi | 56 | 256 | 121,630 | 121,630 | 0 | 0 | 0 | 0 |
| 184G09 (120,562 bp) | bcTids | 27 | 256 | 120,569 | 120,569 | 0 | 0 | 0 | 0 |
| 259I16 (124,050 bp) | bcFLX | 15 | 225 | 11,912 | 3,586 | 6 | 9 | 490 | 21 |
| 259I16 (124,050 bp) | bcTi | 25 | 253 | 68,888 | 14,394 | 2 | 3 | 177 | 7 |
| 259I16 (124,050 bp) | bcTids | 15 | 253 | 24,258 | 10,367 | 6 | 5 | 355 | 17 |
| 631P08 (101,158 bp) | bcFLX | 26 | 223 | 52,601 | 11,098 | 6 | 5 | 199 | 17 |
| 631P08 (101,158 bp) | bcTi | 66 | 252 | 25,788 | 17,610 | 2 | 2 | 77 | 6 |
| 631P08 (101,158 bp) | bcTids | 26 | 252 | 52,257 | 17,582 | 1 | 1 | 14 | 3 |
| 711N16 (112,178 bp) | bcFLX | 26 | 219 | 16,866 | 3,203 | 17 | 4 | 392 | 38 |
| 711N16 (112,178 bp) | bcTi | 41 | 292 | 21,923 | 3,860 | 5 | 3 | 30 | 13 |
| 711N16 (112,178 bp) | bcTids | 26 | 292 | 21,921 | 3,859 | 5 | 3 | 25 | 13 |
| all | bcFLX | 23 | 223 | 33,433 | 17,560 | 29 | 19 | 1,131 | 77 |
| all | bcTi | 47 | 263 | 59,557 | 39,374 | 9 | 8 | 284 | 26 |
| all | bcTids | 23 | 263 | 54,751 | 38,094 | 12 | 9 | 394 | 33 |
1 bcFLX: barcoded FLX; bcTi: barcoded Titanium; bcTids: barcoded Titanium, downsampled to the same sequence depth as bcFLX.
2 Number of misassemblies.
3 To quantify assembly quality, a penalty of 2 per misassembly and 1 per gap was assigned.
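The penalty described in footnote 3 is simple arithmetic, and a minimal sketch makes it easy to check against the table. The function name `assembly_penalty` and the dictionary layout are illustrative assumptions; the misassembly and gap counts are the values reported above for BAC 259I16.

```python
def assembly_penalty(misassemblies: int, gaps: int) -> int:
    """Penalty per footnote 3: 2 per misassembly plus 1 per gap."""
    return 2 * misassemblies + gaps


# (misassemblies, gaps) for BAC 259I16, copied from the table above.
rows_259i16 = {"bcFLX": (6, 9), "bcTi": (2, 3), "bcTids": (6, 5)}

penalties = {
    chemistry: assembly_penalty(misassemblies, gaps)
    for chemistry, (misassemblies, gaps) in rows_259i16.items()
}
print(penalties)  # {'bcFLX': 21, 'bcTi': 7, 'bcTids': 17}
```

The results match the last column of the table for this BAC, which is also how the order of the misassembly and gap columns can be cross-checked.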
Error rates of different chemistries by comparison to the Sanger reference sequences.
| BAC | chemistry¹ | sequence depth | error rate | Q² |
|---|---|---|---|---|
| 184G09 | bcFLX | 27 | 1.16E-04 | 39 |
| 184G09 | bcTi | 56 | 1.49E-04 | 38 |
| 184G09 | bcTids | 27 | 1.49E-04 | 38 |
| 259I16 | bcFLX | 15 | 4.08E-04 | 34 |
| 259I16 | bcTi | 25 | 3.49E-04 | 35 |
| 259I16 | bcTids | 15 | 4.40E-04 | 34 |
| 631P08 | bcFLX | 26 | 1.94E-04 | 37 |
| 631P08 | bcTi | 66 | 1.43E-04 | 38 |
| 631P08 | bcTids | 26 | 3.09E-04 | 35 |
| 711N16 | bcFLX | 26 | 1.66E-04 | 38 |
| 711N16 | bcTi | 41 | 5.24E-04 | 33 |
| 711N16 | bcTids | 26 | 5.00E-04 | 33 |
| all | bcFLX | | 2.27E-04 | 36 |
| all | bcTi | | 2.90E-04 | 35 |
| all | bcTids | | 3.37E-04 | 35 |
1 bcFLX: barcoded FLX; bcTi: barcoded Titanium; bcTids: barcoded Titanium, downsampled to the same sequence depth as bcFLX.
2 Q = -10*log10(error rate).
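The quality value in footnote 2 can be reproduced from the reported error rates with a few lines of code. This is a minimal sketch: the function name `phred_quality` is hypothetical, and rounding to the nearest integer is an assumption that happens to agree with the tabulated values.

```python
import math


def phred_quality(error_rate: float) -> int:
    """Q = -10 * log10(error rate), rounded to the nearest integer (assumed)."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in the interval (0, 1]")
    return round(-10 * math.log10(error_rate))


# Pooled error rates ("all" rows) copied from the table above.
pooled_error_rates = {"bcFLX": 2.27e-4, "bcTi": 2.90e-4, "bcTids": 3.37e-4}

qualities = {
    chemistry: phred_quality(rate)
    for chemistry, rate in pooled_error_rates.items()
}
print(qualities)  # {'bcFLX': 36, 'bcTi': 35, 'bcTids': 35}
```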
Figure 1. Threshold determination for the scaffolding of 454 contigs by Illumina MPs. The number of bridging MPs per gap was normalized to the total number of all gap-bridging MPs per BAC (x-axis) and counted in bins of 0.02 (y-axis, logarithmic). An exponential relationship is observed for bins 0.02-0.44 (red line). The first bin (≤0.02) deviates from this relationship, most likely due to weakly supported false positive bridgings. Therefore, only contig pairings >0.02 were considered for scaffolding.
Figure 2. Example of scaffolding of 454 contigs by Illumina MPs: (top) effective scaffolding of 390L10 (3 scaffolds comprising 18 out of 21 contigs and 105 out of 108 kb); (bottom) ineffective scaffolding of 601I24 due to multiple bridging options (2 scaffolds comprising 4 out of 15 contigs and 20 out of 102 kb). Contigs are represented by arrows in 5' to 3' direction. Numbers on the connecting lines indicate the normalized MP values (MP per gap/MP per BAC) supporting the gap bridging. Scaffolding was omitted for values below the threshold of 0.02 (orange rectangles). Contigs marked by a blue rectangle remain unscaffolded either due to a lack of MPs or because bridging support was below the threshold.
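The normalization and threshold described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function name, the `(bac, contig_a, contig_b)` key layout, and the toy counts are all hypothetical.

```python
from collections import defaultdict

MIN_NORMALIZED_SUPPORT = 0.02  # threshold given in the Figure 1 caption


def filter_bridgings(mp_counts, threshold=MIN_NORMALIZED_SUPPORT):
    """Keep contig pairs whose normalized MP support exceeds the threshold.

    mp_counts maps (bac, contig_a, contig_b) to the number of bridging MPs.
    Support is normalized to the total number of bridging MPs of the same BAC.
    """
    totals = defaultdict(int)
    for (bac, _, _), n_mps in mp_counts.items():
        totals[bac] += n_mps

    kept = {}
    for key, n_mps in mp_counts.items():
        total = totals[key[0]]
        if total == 0:
            continue  # no bridging MPs at all for this BAC
        normalized = n_mps / total
        if normalized > threshold:
            kept[key] = normalized
    return kept


# Hypothetical toy data: one BAC with 100 bridging MPs in total.
toy_counts = {
    ("bacA", "c1", "c2"): 60,
    ("bacA", "c2", "c3"): 39,
    ("bacA", "c1", "c3"): 1,  # normalized 0.01, discarded
}
kept_bridgings = filter_bridgings(toy_counts)
print(sorted(kept_bridgings))
```

Resolving conflicts between the remaining bridgings (for example, one contig end linked to several partners) is a separate step that this sketch does not attempt.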
Scaffolding contigs from 454 assemblies of 96 BACs by Illumina MPs
| | contig pairs | MPs | contigs | length (bp) | fraction |
|---|---|---|---|---|---|
| gap bridgings, total | 1,665 | 52,234 | | | |
| gap bridgings, discarded¹ | 1,021 | 3,846 | | | |
| gap bridgings, subjected to scaffolding | 644 | 48,388 | | | |
| conflict-free scaffolding | 481 | 40,522 | 678 | 8,798,614 | 0.79 |
| not scaffolded due to missing MPs or conflicts | | | 795 | 2,318,108 | 0.21 |
| total | | | 1,473 | 11,116,722 | 1.00 |
1 Discarded because the ratio of MPs per contig pair to all bridging MPs per BAC was ≤ 0.02.
Comparison of the 96 barcoded BAC 454 assemblies without and with scaffolding by Illumina MPs.
| contigs and scaffolds | without scaffolding | with scaffolding | Fold change |
|---|---|---|---|
| Average [bp] | 7,546 | 11,161 | 1.5 |
| Maximum [bp] | 89,835 | 114,443 | 1.3 |
| L50¹ [bp] | 21,694 | 53,258 | 2.5 |
| L80 [bp] | 7,392 | 22,695 | 3.1 |
| L90 [bp] | 3,784 | 7,644 | 2.0 |
| N50² | 157 | 67 | 0.4 |
| N80 | 418 | 161 | 0.4 |
| N90 | 616 | 244 | 0.4 |
1 Length of the contig such that using equal or longer contigs produces 50% of the overall assembly length;
2 Number of largest contigs such that their bases produce 50% of the overall assembly length.
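The two footnote definitions can be computed in a single pass over the contigs sorted by decreasing length. The sketch below follows the naming used in this table (L for the length, N for the number of contigs); the function name and the toy contig lengths are hypothetical, and integer arithmetic is used to avoid floating-point edge cases at the cutoff.

```python
def length_and_count_at(lengths, percent=50):
    """Return (Lx, Nx) as defined in the footnotes above.

    Lx: length of the contig at which the cumulative length of all equal or
        longer contigs first reaches x percent of the assembly length.
    Nx: number of largest contigs needed to reach that cumulative length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 100 >= percent * total:
            return length, count


# Hypothetical toy assembly of seven contigs, 300 bp in total.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
l50, n50 = length_and_count_at(toy_lengths, 50)
l80, n80 = length_and_count_at(toy_lengths, 80)
print(l50, n50, l80, n80)  # 70 2 40 4
```

In the toy assembly, the two largest contigs (80 + 70 bp) already cover half of the 300 bp, so L50 is 70 bp and N50 is 2. Scaffolding raises the length statistics and lowers the counts, which is the pattern seen in the table.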
Figure 3. Threshold determination for the classification of contigs as chimeric and non-chimeric in the non-barcoded assembly of BAC pool 2. X-axis: minimal fraction of reads contributed by a sole BAC to a contig. Y-axis (logarithmic): fraction of total length comprised by those contigs. For all 3 assemblies (unmasked, masked in regions with 20mer frequencies >72 and >36) an exponential relationship was observed for fractions from a sole BAC <0.96, followed by a steep decline in the range between 0.96 and 1.00. Therefore, contigs with fractions ≥0.96 were classified as non-chimeric, whereas those <0.96 were regarded as chimeric.
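The classification rule in the Figure 3 caption can be sketched as follows, assuming that each read in a contig carries the name of its BAC of origin (as identified by its barcode). The function name, the input layout, and the toy read lists are hypothetical.

```python
from collections import Counter

NON_CHIMERIC_MIN_FRACTION = 0.96  # threshold given in the Figure 3 caption


def classify_contig(read_bacs, min_fraction=NON_CHIMERIC_MIN_FRACTION):
    """Classify a contig from the BAC of origin of each of its reads.

    Returns (label, dominant_bac, fraction), where fraction is the share of
    reads contributed by the single most frequent BAC.
    """
    if not read_bacs:
        raise ValueError("a contig needs at least one read")
    dominant_bac, n_reads = Counter(read_bacs).most_common(1)[0]
    fraction = n_reads / len(read_bacs)
    label = "non_chim" if fraction >= min_fraction else "chim"
    return label, dominant_bac, fraction


# Hypothetical toy contigs.
mostly_one_bac = ["bacA"] * 97 + ["bacB"] * 3   # 0.97 from bacA
mixed = ["bacA"] * 60 + ["bacB"] * 40           # 0.60 from bacA

print(classify_contig(mostly_one_bac)[0])  # non_chim
print(classify_contig(mixed)[0])           # chim
```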
Statistics of non-chimeric and chimeric contigs >1 kb generated by the assemblies of unmasked and masked bcTi reads of pool 2 without separation by barcodes.
| non-bc contigs | class² | unmasked | m72¹ | m36¹ |
|---|---|---|---|---|
| number | non_chim | 354 | 461 | 562 |
| number | chim | 328 | 239 | 199 |
| total length (bp) | non_chim | 2,912,280 | 2,888,385 | 3,024,333 |
| total length (bp) | chim | 2,570,258 | 1,590,145 | 973,179 |
| total length (bp) | total | 5,482,538 | 4,478,530 | 3,997,512 |
| average length (bp) | non_chim | 8,227 | 6,252 | 5,381 |
| average length (bp) | chim | 7,836 | 6,681 | 4,890 |
| fraction of total contig length | non_chim | 0.53 | 0.64 | 0.76 |
| fraction of total contig length | chim | 0.47 | 0.36 | 0.24 |
1 m72, m36: Reads were masked in regions where the 20mer frequency exceeds 72x and 36x, respectively.
2 Contigs in which the fraction of reads from a sole BAC was <0.96 were regarded as chimeric (chim), all others as non-chimeric (non_chim).
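Footnote 1 does not say how k-mers were counted or how masked regions were delimited, so the following is only one plausible reading: count every k-mer across all reads, then replace with `N` every base covered by a k-mer whose count exceeds the cutoff. The function names and the toy reads are hypothetical, and a small `k` is used so the example stays readable (the study used 20mers with cutoffs of 72 and 36).

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_read(read, counts, k, max_freq, mask_char="N"):
    """Mask each base covered by a k-mer seen more than max_freq times."""
    masked = list(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > max_freq:
            masked[i:i + k] = mask_char * k
    return "".join(masked)


# Hypothetical toy reads; the 3-mer ACG occurs three times, all others once.
toy_reads = ["ACGTTT", "GGACGA", "TACGCC", "CCCTAG"]
kmer_counts = count_kmers(toy_reads, k=3)
masked_reads = [mask_read(r, kmer_counts, k=3, max_freq=2) for r in toy_reads]
print(masked_reads)  # ['NNNTTT', 'GGNNNA', 'TNNNCC', 'CCCTAG']
```

Lowering `max_freq` masks more sequence, which is consistent with the table: the m36 assemblies contain less total contig length but a smaller chimeric fraction than the m72 and unmasked ones.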
Figure 4. Examples of chimeric contigs in the non-barcoded assembly of unmasked sequences from a 48-BAC pool (non-bc). Coloured curves represent the coverage by reads from different BACs as identified by barcodes. BAC names and their read fraction in the whole non-bc contig are given in the same colour above the curves. Grey curves depict the 20mer frequency. Below the diagrams, the nucleotide identities to the corresponding contigs from the barcoded assemblies are given (rc = reverse complement). Red arrows indicate the points where the non-bc contigs are misassembled. (A, B, C) Non-repetitive contig parts are joined by known repetitive elements (blue bars). (D) Non-repetitive contig parts are joined at an unknown low complexity repeat without increased kmer frequency.
Figure 5. Sequence comparison by dotter between the chimeric contigs (y-axis) shown in Figure 4 and the corresponding contigs from the assembly of masked sequences (masked in regions with 20mer frequencies >36, m36; x-axis). (A) The two contig parts of c24, wrongly assembled at the TA-repeat, are separated in m36, forming three non-chimeric contigs. (B) The misassemblies at the LTRs in c31 are not observed in the m36 assembly. (C) The misassembly at the LTR in c83 is not present in the m36 assembly. The short section from 555O10 is contained in a chimeric contig of the masked assembly (m36_c286). (D) Contig c35 is also chimeric in the masked assembly (m36_c41) with the same read fractions and >99% nucleotide identity.