Sacha A F T van Hijum, Aldert L Zomer, Oscar P Kuipers, Jan Kok.
Abstract
With genome sequencing efforts increasing exponentially, valuable information accumulates on genomic content of the various organisms sequenced. Projector 2 uses (un)finished genomic sequences of an organism as a template to infer linkage information for a genome sequence assembly of a related organism being sequenced. The remaining gaps between contigs for which no linkage information is present can subsequently be closed with direct PCR strategies. Compared with other implementations, Projector 2 has several distinctive features: a user-friendly web interface, automatic removal of repetitive elements (repeat-masking) and automated primer design for gap-closure purposes. Moreover, when using multiple fragments of a template genome, primers for multiplex PCR strategies can also be designed. Primer design takes into account that, in many cases, contig ends contain unreliable DNA sequences and repetitive sequences. Closing the remaining gaps in prokaryotic genome sequence assemblies is thereby made very efficient and virtually effortless. We demonstrate that the use of single or multiple fragments of a template genome (i.e. unfinished genome sequences) in combination with repeat-masking results in mapping success rates close to 100%. The web interface is freely accessible at http://molgen.biol.rug.nl/websoftware/projector2.
Year: 2005 PMID: 15980536 PMCID: PMC1160117 DOI: 10.1093/nar/gki356
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The Projector 2 procedure. From top to bottom: I, single (left) or fragmented n-multiple templates (right) are optionally repeat-masked; II, contigs (middle) are fragmented and also optionally repeat-masked; III, the (unique) contig fragments are compared against the (unique) template fragments using BLAST, yielding IV, mapped contigs. Arrows with a plus sign (+) signify contigs that were successfully mapped; those with a minus sign (−) could not be mapped. V, for the mapped contigs, gap-closing sequences and PCR primers are designed and a visual SVG output is generated.
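Step II of the procedure can be illustrated with a minimal Python sketch. It assumes that each contig is represented by one left (L) and one right (R) end fragment, and that a short stretch at each contig end is skipped because contig ends often contain unreliable sequence. The function name, the fragment length and the number of skipped bases are hypothetical choices made for illustration; they are not the settings used by Projector 2.

```python
from typing import NamedTuple


class EndFragments(NamedTuple):
    """Left (L) and right (R) end fragments of one contig."""
    left: str
    right: str


def end_fragments(contig: str, frag_len: int = 500, skip: int = 50) -> EndFragments:
    """Cut one fragment from each end of a contig.

    `frag_len` and `skip` are illustrative assumptions, not Projector 2
    defaults. `skip` bases at each contig end are left out because contig
    ends may contain unreliable sequence.
    """
    if frag_len <= 0 or skip < 0:
        raise ValueError("frag_len must be positive and skip non-negative")
    if len(contig) < 2 * (frag_len + skip):
        raise ValueError("contig too short for two non-overlapping end fragments")
    left = contig[skip:skip + frag_len]
    right_end = len(contig) - skip
    right = contig[right_end - frag_len:right_end]
    return EndFragments(left, right)


# Toy contig: 10 unreliable bases, 20 C, 40 G, 20 T, 10 unreliable bases.
toy = "A" * 10 + "C" * 20 + "G" * 40 + "T" * 20 + "A" * 10
frags = end_fragments(toy, frag_len=20, skip=10)
```

In a pipeline of this kind, the two fragments would then be compared against the template with BLAST (step III), and a contig would count as mapped only if its fragments can be placed on the template.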
Table 1. The number of contigs containing repetitive elements and a high G+C-content on either end was determined for the four sequence assemblies shown (see Table 2 for details).
| Genome sequence assembly origin | Total number of contigs in sequence assembly | Total number of contigs used for mapping [a] | Contigs with repeats on one end [b] | Contigs with repeats on both ends [b] | Contigs with repeats (%) [c] | Contigs with high G+C-content on one end [b] | Contigs with high G+C-content on both ends [b] | Contigs with high G+C-content (%) [c] |
|---|---|---|---|---|---|---|---|---|
| L.lactis A | 210 | 131 | 34 | 12 | 35 | 6 | 1 | 5 |
| M.tuberculosis A | 516 | 491 | 176 | 25 | 41 | 0 | 30 | 6 |
| N.europaea | 477 | 409 | 31 | 12 | 11 | 2 | 4 | 1 |
| R.typhi A | 154 | 108 | 49 | 50 | 92 | 1 | 27 | 26 |
The genome sequence assemblies used were L.lactis MG1363 (A), M.tuberculosis CDC1551 (A), N.europaea ATCC 19718 and R.typhi str. Wilmington (A).
[a] The contigs used for mapping were >1200 bp.
[b] On one or both ends of a contig.
[c] The number of contigs with at least one end that contains repeats, or with a high G+C-content, divided by the total number of contigs used for mapping × 100%.
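The percentage defined in the last footnote can be checked against the table rows with a short Python sketch. The function name is hypothetical, and rounding to the nearest integer is an assumption that happens to reproduce the tabulated values.

```python
def percent_with_feature(one_end: int, both_ends: int, contigs_used: int) -> int:
    """Percentage of contigs with the feature on at least one end.

    Implements the footnote definition: (one end + both ends) divided by
    the number of contigs used for mapping, times 100. Rounding to the
    nearest integer is an assumption.
    """
    if contigs_used <= 0:
        raise ValueError("contigs_used must be positive")
    return round(100 * (one_end + both_ends) / contigs_used)


# First table row: 34 and 12 contigs with repeats on one or both ends,
# 131 contigs used for mapping.
repeats_pct = percent_with_feature(34, 12, 131)   # (34 + 12) / 131 = 35.1%
# Same row, high G+C-content: 6 and 1 contigs.
gc_pct = percent_with_feature(6, 1, 131)          # (6 + 1) / 131 = 5.3%
```

The same calculation applied to the last row (49 and 50 contigs with repeats, 108 contigs used) gives 92%, matching the table.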
Table 2. Mapping results for unfinished genome assemblies on their isogenic counterparts and related genomes
| Target genome | Template genome | Number of templates | Repeat mask performed | Contigs correctly mapped | Total mapped contigs | Gaps <15 kb | % Gaps PCR [a] |
|---|---|---|---|---|---|---|---|
| L.lactis A | L.lactis A | 1 | + | 106 | 106 | 102 | 96 |
| L.lactis A | L.lactis A | 1 | − | 106 | 125 | 83 | 66 |
| L.lactis A | L.lactis A | 10 | + | 106 | 109 | 99 | 91 |
| L.lactis A | L.lactis A | 10 | − | 106 | 163 | 69 | 42 |
| L.lactis A | L.lactis B [b] | 1 | + | 82 | 85 | 79 | 93 |
| L.lactis A | L.lactis B [b] | 1 | − | 82 | 90 | 74 | 82 |
| L.lactis A | L.lactis B [b] | 10 | + | 80 | 86 | 74 | 86 |
| L.lactis A | L.lactis B [b] | 10 | − | 81 | 116 | 61 | 53 |
| M.tuberculosis A | M.tuberculosis A | 1 | + | 436 | 436 | 428 | 98 |
| M.tuberculosis A | M.tuberculosis A | 1 | − | 426 | 488 | 389 | 80 |
| M.tuberculosis A | M.tuberculosis A | 10 | + | 430 | 440 | 417 | 95 |
| M.tuberculosis A | M.tuberculosis A | 10 | − | 434 | 545 | 351 | 65 |
| M.tuberculosis A | M.tuberculosis B | 1 | + | 426 | 432 | 419 | 97 |
| M.tuberculosis A | M.tuberculosis B | 1 | − | 429 | 487 | 386 | 80 |
| M.tuberculosis A | M.tuberculosis B | 10 | + | 423 | 434 | 418 | 97 |
| M.tuberculosis A | M.tuberculosis B | 10 | − | 428 | 548 | 349 | 64 |
| M.tuberculosis A | M.leprae [c] | 1 | + | 195 | 231 | 125 | 54 |
| N.europaea | N.europaea | 1 | + | 63 | 63 | 62 | 98 |
| N.europaea | N.europaea | 1 | − | 61 | 65 | 55 | 85 |
| N.europaea | N.europaea | 10 | + | 62 | 63 | 60 | 95 |
| N.europaea | N.europaea | 10 | − | 61 | 157 | 32 | 21 |
| R.typhi | R.typhi | 1 | + | 79 | 79 | 75 | 95 |
| R.typhi | R.typhi | 1 | − | 79 | 97 | 60 | 62 |
| R.typhi | R.typhi | 10 | + | 78 | 79 | 74 | 94 |
| R.typhi | R.typhi | 10 | − | 78 | 97 | 60 | 62 |
| R.typhi | R.prowazekii [b] | 1 | + | 73 | 79 | 74 | 94 |
| R.typhi | R.prowazekii [b] | 1 | − | 74 | 96 | 64 | 67 |
| R.typhi | R.prowazekii [b] | 10 | + | 77 | 79 | 75 | 95 |
| R.typhi | R.prowazekii [b] | 10 | − | 79 | 96 | 61 | 64 |
For each mapping procedure, the results were compared with the reference results obtained by using an isogenic template and repeat-masking (represented in boldface). Genome origins: L.lactis MG1363 (A), L.lactis IL1403 (B), M.tuberculosis CDC1551 (A), M.tuberculosis H37Rv (B), M.leprae TN, N.europaea ATCC 19718, R.typhi str. Wilmington (A) and R.prowazekii str. Madrid E (B).
[a] The number of gaps that could be closed with direct PCR is defined as the number of gaps with sizes <15 kb divided by the total number of PCRs (= total number of mapped contigs) × 100%.
[b] The target genomes R.typhi (13) (Supplementary Figure S1) and L.lactis MG1363 (14) contain an inversion compared with their respective templates, resulting in two incorrectly mapped contigs.
[c] This mapping was performed to demonstrate the mapping success when using a template genome with very limited colinearity under optimal conditions (one template with repeat-masking).
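The "% Gaps PCR" column follows directly from the footnote definition, as the Python sketch below illustrates. The function name is hypothetical, and rounding to the nearest integer is an assumption.

```python
def percent_gaps_pcr(gaps_under_15kb: int, mapped_contigs: int) -> int:
    """Percentage of gaps that could be closed with direct PCR.

    Follows the footnote definition: gaps smaller than 15 kb divided by
    the total number of PCRs (taken as the total number of mapped
    contigs), times 100. Rounding to the nearest integer is an assumption.
    """
    if mapped_contigs <= 0:
        raise ValueError("mapped_contigs must be positive")
    return round(100 * gaps_under_15kb / mapped_contigs)


# L.lactis A on its isogenic template, one template fragment:
with_masking = percent_gaps_pcr(102, 106)     # repeat-masked run
without_masking = percent_gaps_pcr(83, 125)   # run without repeat-masking
```

For this pair of runs, omitting repeat-masking raises the number of mapped contigs from 106 to 125 while the number of gaps <15 kb falls from 102 to 83, so the percentage drops from 96 to 66.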
Figure 2. SVG output of Projector 2 runs performed with and without repeat-masking. (A–D) Details of the results for mapping of: (A) R.typhi contigs on its isogenic template; (B) N.europaea contigs on its isogenic template; (C) L.lactis MG1363 contigs on L.lactis IL1403; and (D) M.tuberculosis contigs on M.tuberculosis. For each inset, the mapping results with (lower panel) and without (upper panel) repeat-masking of the target and template sequences are shown. A ruler in base pairs (kb or Mb) is shown above each mapping. The template fragments are indicated by dark green triangles below this scale. The mapped contigs, one or at most two on each line, are shown below the template fragments. Within each mapped contig, the L and R fragments used to map the contig are indicated with green boxes. Contig numbers are shown for the mapping in (D). A dot (•) indicates an incorrectly mapped contig on that line.
Figure 3. Results of the mapping procedures described in Table 2. The percentage of mapped contigs is plotted for four bacterial genome assemblies mapped onto (non-)isogenic templates (for details see Table 2).