Alessandro M Varani, Patricia Siguier, Edith Gourbeyre, Vincent Charneau, Mick Chandler.
Abstract
Insertion sequences (ISs) play a key role in prokaryotic genome evolution but are seldom well annotated. We describe a web application pipeline, ISsaga (http://issaga.biotoul.fr/ISsaga/issaga_index.php), that provides computational tools and methods for high-quality IS annotation. It uses established ISfinder annotation standards and permits rapid processing of single or multiple prokaryote genomes. ISsaga provides general prediction and annotation tools, information on genome context of individual ISs and a graphical overview of IS distribution around the genome of interest.
Year: 2011 PMID: 21443786 PMCID: PMC3129680 DOI: 10.1186/gb-2011-12-3-r30
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Flow diagram of the ISsaga pipeline. The figure shows how the different ISsaga functions are assembled. Following loading of the appropriate genome file, the system identifies ORFs using the ORF identification module. Module (a): if the file is pre-annotated, the protocol performs a BLASTP (filter off and e-value 1e-5) analysis followed by BLASTX (filter off and e-value 1e-5) to identify any ORFs that may have been overlooked. If the file is not annotated, an automatic Glimmer annotation is performed prior to BLASTP and BLASTX. Identified ORFs are included in a candidate ORF list. The replicon is then subjected to BLASTN (filter off, word size 7 and e-value 1e-5) analysis, which yields an IS prediction and generates a web-based annotation table. If no ORFs are found, BLASTN is performed against the ISfinder database and any candidate ISs are fed into the IS prediction step. This step identifies partial ISs without ORFs. In a second module (b), ISs that have been identified and are already present in ISfinder are automatically fed into an IS report that must then be validated (module (c)). These modules are linked to the web interface (module (d)), which permits annotation management and provides tools for identifying and defining new ISs.
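The search settings quoted in the Figure 1 caption can be written out as command lines. The following is a minimal sketch, assuming the NCBI BLAST+ command-line programs; the caption does not say which BLAST implementation ISsaga runs. The helper names `blast_command` and `pipeline_commands`, the database names `isfinder_prot` and `isfinder_nt`, and the file names are hypothetical. "Filter off" is mapped here to `-seg no` for the protein-level searches and `-dust no` for BLASTN, and the choice of query for each search is likewise an assumption.

```python
# Sketch only: builds BLAST+ argument lists mirroring the settings in the caption.
# Database names and file names are hypothetical placeholders, not ISsaga's own.

def blast_command(program, query, db, evalue="1e-5", word_size=None):
    """Return a BLAST+ argument list with low-complexity filtering disabled."""
    if program not in ("blastp", "blastx", "blastn"):
        raise ValueError(f"unsupported program: {program}")
    cmd = [program, "-query", query, "-db", db, "-evalue", evalue]
    # "Filter off": DUST applies to nucleotide searches, SEG to protein-level ones.
    cmd += ["-dust", "no"] if program == "blastn" else ["-seg", "no"]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    return cmd


def pipeline_commands(orf_proteins, replicon):
    """Searches in the order the caption lists them for a genome with ORFs."""
    return [
        blast_command("blastp", orf_proteins, "isfinder_prot"),
        blast_command("blastx", replicon, "isfinder_prot"),
        blast_command("blastn", replicon, "isfinder_nt", word_size=7),
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("orfs.faa", "replicon.fna"):
        print(" ".join(cmd))
```

The functions only assemble the argument lists; passing one to `subprocess.run` would execute the search if BLAST+ and suitably formatted databases were available.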
Figure 2. A section of the original GenBank file (left) and of the extracted file after correct annotation using ISsaga.
Predictor performance
|  | GB | -IS | +IS | Manual |
|---|---|---|---|---|
| Total IS ORF | 1 | 4 | 4 | 2 |
| Complete ORF | - | 0 | 0 | 0 |
| Partial ORF | - | 1 | 1 | 1 |
| Pseudogene | 1 | 2 | 2 | 1 |
| Unknown ORF | - | 1 | 1 | 0 |
| Total IS | - | 4 | 4 | 2 |
| Different IS | - | 4 | 4 | 2 |
| Total IS ORF | 15 | 22 | 24 | 19 |
| Complete ORF | - | 4 | 12 | 12 |
| Partial ORF | - | 1 | 2 | 6 |
| Pseudogene | 1 | 4 | 4 | 1 |
| Unknown ORF | - | 13 | 6 | 0 |
| Total IS | - | 20 | 21 | 16 |
| Different IS | - | 16 | 17 | 12 |
| Total IS ORF | 14 | 25 | 28 | 27 |
| Complete ORF | - | 12 | 26 | 26 |
| Partial ORF | - | 2 | 0 | 0 |
| Pseudogene | - | 1 | 1 | 1 |
| Unknown ORF | - | 10 | 1 | 0 |
| Total IS | - | 19 | 19 | 18 |
| Different IS | - | 10 | 10 | 9 |
| Total IS ORF | 15 | 33 | 35 | 35 |
| Complete ORF | - | 18 | 24 | 27 |
| Partial ORF | - | 4 | 2 | 3 |
| Pseudogene | - | 8 | 8 | 5 |
| Unknown ORF | - | 3 | 1 | 0 |
| Total IS | - | 25 | 25 | 23 |
| Different IS | - | 12 | 12 | 14 |
| Total IS ORF | - | 7 | 7 | 3 |
| Complete ORF | - | 0 | 2 | 2 |
| Partial ORF | - | 1 | 1 | 1 |
| Pseudogene | - | 0 | 0 | 0 |
| Unknown ORF | - | 6 | 4 | 0 |
| Total IS | - | 7 | 7 | 3 |
| Different IS | - | 6 | 6 | 2 |
| Total IS ORF | 75 | 143 | 144 | 160 |
| Complete ORF | - | 81 | 123 | 125 |
| Partial ORF | - | 43 | 11 | 27 |
| Pseudogene | - | 7 | 7 | 8 |
| Unknown ORF | - | 12 | 3 | 0 |
| Total IS | - | 115 | 115 | 119 |
| Different IS | - | 27 | 27 | 26 |
| Total IS ORF | 11 | 21 | 22 | 20 |
| Complete ORF | - | 13 | 19 | 19 |
| Partial ORF | - | 7 | 1 | 1 |
| Pseudogene | - | 1 | 1 | 0 |
| Unknown ORF | - | 0 | 1 | 0 |
| Total IS | - | 18 | 19 | 16 |
| Different IS | - | 6 | 7 | 4 |
| Total IS ORF | 49 | 53 | 54 | 57 |
| Complete ORF | - | 18 | 45 | 47 |
| Partial ORF | - | 27 | 5 | 9 |
| Pseudogene | - | 3 | 3 | 1 |
| Unknown ORF | 3 | 5 | 1 | 0 |
| Total IS | - | 38 | 39 | 36 |
| Different IS | - | 18 | 19 | 18 |
The table compares the IS annotations of eight bacterial genomes contained in the corresponding GenBank files (GB) with those obtained by manual annotation (Manual) and with those produced by the ISsaga predictor using two different IS reference databases. In one database (-IS), the reference ISs contained in the genome under test were removed, while in the other (+IS) these ISs were included. The total number of IS-associated ORFs (Total IS ORF) is divided into four categories: Complete ORF, Partial ORF, Pseudogene and Unknown ORF. The 'Unknown' category includes all examples that the predictor cannot classify as complete or partial because the reference database contains too few closely related examples. The categories 'Total IS' and 'Different IS' are based on nucleotide predictions, which take into account the number of ORFs carried by each IS. For example, an IS that includes two ORFs is counted as two examples in 'Complete ORF' but as a single IS in 'Total IS'.
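The counting rule in the last sentence of the caption can be illustrated with a small tally. This is a sketch with made-up records, not data from the table: the IS names, the record layout and the `summarize` helper are hypothetical, and 'Different IS' is assumed to mean the number of distinct IS names.

```python
from collections import Counter

# Hypothetical predictions: (IS name, copy number, ORF category).
predictions = [
    ("ISexample1", 1, "Complete ORF"),
    ("ISexample1", 1, "Complete ORF"),  # second ORF carried by the same IS copy
    ("ISexample1", 2, "Partial ORF"),
    ("ISexample2", 1, "Pseudogene"),
]


def summarize(rows):
    """Tally ORF-level and IS-level counts from per-ORF records."""
    orf_counts = Counter(category for _, _, category in rows)
    copies = {(name, copy) for name, copy, _ in rows}  # one entry per IS copy
    return {
        "Total IS ORF": len(rows),
        **orf_counts,
        "Total IS": len(copies),
        "Different IS": len({name for name, _ in copies}),
    }


summary = summarize(predictions)
for label, count in summary.items():
    print(f"{label}: {count}")
```

The two-ORF copy of `ISexample1` contributes two entries to 'Complete ORF' but only one to 'Total IS', which is why the ORF rows and the IS rows of the table need not add up to the same number.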
Figure 3. Part of the individual IS report. This example shows the four complete copies of ISAcma18 from the genome of Acaryochloris marina. The top section shows the genome coordinates of each IS. Note that copies 2 and 3 are at some distance from each other. The lower section shows the flanking 49 bp and the corresponding DRs. Note that the left 'DR' of copy 2 (marked in red) is present as the right 'DR' of copy 3 (marked in red) whereas the right 'DR' of copy 2 (marked in black) is present as the left 'DR' of copy 3 (marked in black).
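The pattern pointed out for copies 2 and 3 (each copy's left 'DR' matching the other copy's right 'DR') can be checked mechanically once the flanking repeats have been extracted. The sketch below is illustrative only: the `ISCopy` record, the `drs_exchanged` helper and the sequences are made up, and it assumes the repeats are reported on the same strand so that plain string comparison is enough.

```python
from typing import NamedTuple


class ISCopy(NamedTuple):
    name: str
    left_dr: str   # repeat flanking the left end of the copy
    right_dr: str  # repeat flanking the right end of the copy


def drs_exchanged(a, b):
    """True if a's left DR equals b's right DR and a's right DR equals b's left DR."""
    return (
        a.left_dr != a.right_dr          # the copy does not have a matching DR pair
        and a.left_dr == b.right_dr
        and a.right_dr == b.left_dr
    )


# Made-up sequences, not the ISAcma18 repeats shown in the figure.
copy1 = ISCopy("copy 1", "TTAAC", "TTAAC")
copy2 = ISCopy("copy 2", "ATTGC", "GGCAT")
copy3 = ISCopy("copy 3", "GGCAT", "ATTGC")

print(drs_exchanged(copy2, copy3))  # copies with swapped flanks
print(drs_exchanged(copy1, copy2))  # an ordinary copy compared with copy 2
```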
Figure 4. Decision tree to determine complete, partial or uncategorized IS-associated ORFs based on global and local alignments against the ISfinder protein dataset.