| Literature DB >> 21798037 |
Klas Hatje1, Oliver Keller, Björn Hammesfahr, Holger Pillmann, Stephan Waack, Martin Kollmar.
Abstract
BACKGROUND: Obtaining transcripts of homologs of closely related organisms and retrieving the reconstructed exon-intron patterns of the genes is a very important process during the analysis of the evolution of a protein family and the comparative analysis of the exon-intron structure of a certain gene from different species. Due to the ever-increasing speed of genome sequencing, the gap to genome annotation is growing. Thus, tools for the correct prediction and reconstruction of genes in related organisms become more and more important. The tool Scipio, which can also be used via the graphical interface WebScipio, performs significant hit processing of the output of the Blat program to account for sequencing errors, missing sequence, and fragmented genome assemblies. However, Scipio has so far been limited to high sequence similarity and unable to reconstruct short exons.Entities:
Year: 2011 PMID: 21798037 PMCID: PMC3162530 DOI: 10.1186/1756-0500-4-265
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1The extended Scipio workflow. This diagram depicts the activity and data flow of a Scipio run. Scipio needs a protein and a target genome sequence, both in FASTA format, as input to start a Blat run. Every single Blat hit is subsequently processed and filtered, and assembled in the case of hits on multiple targets. The gap_length describes the number of amino acids of an unmatched query subsequence. The intron_length is the corresponding length of the unmatched target subsequence in nucleotides.
Figure 2Screenshot of the multiple results view of WebScipio. The screenshot shows the result of the search for multiple homologs of one of the muscle class-II myosins from human in the human genome. The search parameters were --min_identity = 60%, --max_mismatch = ∞, and --multiple_results = yes to get as many homologs as possible. On top, the opened quick view of all reconstructed gene structures is shown. Next, a panel with the different results is shown. Green numbers mark complete results (100% of the query sequence reconstructed) while red numbers mark incomplete results (might contain gaps, mismatches, frameshifts, etc.). Result hit number 2 was selected and shows the result for the closest homolog to the query sequence with no gaps (unmapped query sequence) but 101 mismatches.
Figure 3Modelling of additional/missing bases in gene predictions. Case A shows the result of the search of a kinesin from Neurospora crassa (query sequence) in Neusrospora discreta (target sequence) using the old and the new Scipio version. The --min_intron_len parameter has been set to 22. Case B shows the result of a search of the dynactin p62 homolog from Phytophthora ramorum (query sequence) in Phytophthora sojae (target sequence). To get the correct gene prediction the following Scipio parameters have been used: --min_identity = 60%, --min_score = 0.3, --max_mismatch = ∞, --gap_to_close = 15, --min_intron_length = 22. The colour coding is explained in the legend and applies to all gene structure figures. For further information see Additional file 2.
Figure 4Reconstruction of very short exons. Case A shows the result for the reconstruction of the human dynamitin (dynactin p50) gene, that contains a 3 amino acid exon and a following 2 amino acid exon that are differentially included in the final transcript. These exons could not be reconstructed with Blat and the old Scipio version, but using the new Scipio version that enables Needlman-Wunsch searches. The --exhaust_align_size parameter has been set to 15,000 bp because of the length of the intron. Case B shows the result of the reconstruction of the coronin gene from Puccinia graminis f. sp. tritici. The small but evolutionarily conserved exon 7 can now correctly be reconstructed. Case C shows the result of the reconstruction of the mouse dynactin p150 gene that contains three short exons of 7, 6, and 7 amino acids close to the 5'-end of the gene. For the correct reconstruction, the --exhaust_align_size parameter has been increased to 10,000 bp, because of the length of the intron, and the --exhaust_gap_size has been set to 21 because of the length of the query that could not be mapped. The colour coding of the scheme is the same as in Figure 3. For further information see Additional file 2.
Figure 5Reconstructing short exons at low homology intron borders. The scheme shows two examples for the reconstruction of short exons in regions where the intron borders of the neighbouring exons show some homology to the unmatched query sequence. The value for the --max_move_exon parameter has been set to 3 (case A) and 6 (case B), respectively. The colour coding of the scheme is the same as in Figure 3. For further information see Additional file 2.
Figure 6Reconstructing genes on chromosome assemblies. The scheme shows an example of the search for the rat homolog (target sequence) of the human Kif5C kinesin motor protein. The C-terminal about 25 amino acids of the rat Kif5C homolog are missing in the respective chromosome assembly. Using Scipio v1.0 a very short identical stretch of four amino acids, found on a different chromosome, has artificially been added to the 3'-end of the gene generating an "intron" of millions of base pairs (Note the scale of the introns!). The new parameter --single_target_hits now prevents this mis-assembly. The colour coding of the scheme is the same as in Figure 3.
Details of the dynein heavy chain genes used for the cross-species search
| Protein | status* | ||||
|---|---|---|---|---|---|
| Name | Length [aa] | Length [aa] | Length [bp] | Exons | |
| DHC1 | 4646 | 4561 | P | 62248 | 77 |
| DHC2 | 4307 | 4234 | P | 413085 | 89 |
| DHC3A | 4707 | 4690 | ✓ | 358204 | 92 |
| DHC3B | 4624 | 4582 | P | 298827 | 78 |
| DHC4A | 4507 | 4508 | ✓ | 361115 | 82 |
| DHC4B | 4462 | 4428 | P | 115251 | 79 |
| DHC4C | 4486 | 4339 | P | 486267 | 69 |
| DHC5 | 4589 | 4584 | ✓ | 140290 | 79 |
| DHC6 | 4509 | 4457 | P | 134640 | 86 |
| DHC7A | 4024 | 4019 | ✓ | 253605 | 62 |
| DHC7B | 4070 | 3966 | P | 187156 | 60 |
| DHC7C | 3960 | 3960 | ✓ | 201577 | 73 |
| DHC8 | 4265 | 4064 | F | 80895 | 73 |
| DHC9A | 4158 | 4062 | P | 302568 | 75 |
| DHC9B | 4612 | 4597 | P | 369531 | 85 |
| DHC11 | 4779 | 4779 | ✓ | 81205 | 43 |
* Status: P = partial sequence (short part of the sequence missing); F = sequence fragment (large region of the gene missing in the genome); ✓ = sequence complete.
Figure 7Example cross-species searches. The results of four searches with dynein heavy chain sequences from Homo sapiens in the elephant (Loxodonta africana) genome are shown. All genes are spread on several hundred thousands of base pairs. Statistics to the sequence results are given below the gene structure cartoons. An "intron?" is an intron for which the borders do not correspond to the standard splice sites GT---AG or GC---AG. The colour coding of the scheme is the same as in Figure 3.
Figure 8Diagrams of the improvements introduced with the new Scipio version. The diagrams describe the improvement of the gene reconstructions of the DHC genes in the cross-species search of the human homologs (query sequences) in elephant (target sequence) using different Scipio versions and parameters. (A) The base-line is the result of the search using the old Scipio v1.0. The maximal possible annotation is represented by the gene reconstructions based on the manually annotated elephant DHC genes (reference dataset, purple). The blue bars show the reconstruction with Scipio v1.5 using --blat_tilesize = 7, --exhaust_align_size = 500 and --exhaust_gap_size = 21 (dataset s1). Green bars are results from the second search (dataset s2) with same parameters as for the first search, except for --blat_tilesize = 6 and --exhaust_gap_size = 18 (three times the tilesize). This dataset represents improvements independent of Scipio. The red bars represent searches with same parameters as for dataset s1, except for the increased parameters --exhaust_align_size = 5,000 and --exhaust_gap_size = 25 (dataset s4). This data takes far longer to compute compared to the first search, because of the Needleman-Wunsch search in longer regions. For the DHC1 gene Scipio v1.0 maps too many amino acids of the human query sequence to the elephant genome. So the negative bar representing the other datasets shows that these datasets cover the right number of 4561 amino acids. (B) This diagram depicts the number of gaps (human query sequence not matched in the elephant genome) and questionable introns (intron?; introns with uncommon splice sites) for the searches with the old Scipio version and the new version applying different parameters as in (A). The detailed values of the diagrams are shown in tables in Additional file 3.
Test scenario 1: Reconstruction of the Loxodonta africana dynein heavy chain genes in the whole genome sequence based on human protein sequences
| Tool | Predicted genes | Missing exons1 | Wrong exons2 | Exon sens. % | Exon sens. (ov.)3 % | Exon spec. % | Nucl. sens. % | Nucl. spec. % | Execution time per prot. seq. |
|---|---|---|---|---|---|---|---|---|---|
| Scipio 1.54 | 16 | 11 | 6 | 93.4 | 99.1 | 93.3 | 98.7 | 99.8 | 70 m 46s |
| Exonerate5 | 2145 | 6 | 5669 | 94.8 | 99.5 | 16.5 | 99.7 | 18.5 | 123 m 23s |
| Exonerate6 | 16 | 287 | 62 | 73.4 | 76.1 | 90.2 | 76.1 | 94.3 | 121 m 27s |
| Augustus7 | 1374928 | 0 | 3909434 | 47.9 | 100.0 | 0.0 | 100.0 | 0.3 | > 10 days |
| BLAT8 | - | 9 | 264228 | 19.6 | 99.3 | 0.1 | 97.4 | 2.6 | 7 m 24s |
| Scipio 1.04 | 16 | 14 | 46 | 86.1 | 98.8 | 83.2 | 97.9 | 99.4 | 8 m 24s |
1 Number of annotated exons, which are not overlapped by any predicted exon
2 Number of predicted exons, which are not overlapped by any annotated exon
3 Number annotated exons, which are overlapped by at least one predicted exon divided by the number of annotated exons
4 Mammalia cross species default options (for detailed parameters see Additional file 5)
5 Parameters: --model protein2genome
6 Parameters: --model protein2genome --bestn 1
7 Parameters: --species = human --genemodel = exactlyone (for more parameters see Additional file 5)
8 Parameters as in 4: -tileSize = 7 -minIdentity = 54 -minScore = 15 -oneOff = 1
Test scenario 2: Reconstruction of the Loxodonta africana dynein heavy chain gene structures in the respective gene regions based on human protein sequences
| Tool | Pred. genes | Missing exons1 | Wrong exons2 | Exon sens. % | Exon sens. (ov.)3 % | Exon spec. % | Nucl. sens. % | Nucl. spec. % |
|---|---|---|---|---|---|---|---|---|
| Scipio 1.54 | 16 | 13 | 6 | 93.1 | 98.9 | 93.1 | 98.6 | 99.8 |
| Scipio 1.55 | 16 | 4 | 7 | 94.7 | 99.7 | 93.7 | 99.2 | 99.8 |
| Prosplign6 | 16 | 1 | 41 | 95.7 | 99.9 | 92.6 | 99.9 | 98.7 |
| Exonerate7 | 32 | 7 | 6 | 94.8 | 99.4 | 94.6 | 99.6 | 99.5 |
| Exonerate8 | 16 | 255 | 4 | 75.7 | 78.8 | 95.6 | 79.2 | 99.7 |
| Prot_map9 | 16 | 4 | 27 | 91.7 | 99.7 | 86.2 | 99.3 | 99.7 |
| Fgenesh+10 | 16 | 10 | 10 | 94.9 | 99.2 | 94.8 | 99.0 | 99.7 |
| Wise211 | 39 | 3 | 16 | 93.3 | 99.8 | 91.2 | 99.7 | 98.9 |
| Augustus12 | 16 | 132 | 111 | 81.9 | 89.0 | 83.2 | 89.9 | 88.7 |
| Fgenesh10 | 161 | 111 | 342 | 80.2 | 90.8 | 67.3 | 91.8 | 62.3 |
| Genscan13 | 194 | 138 | 520 | 76.3 | 88.5 | 57.9 | 90.4 | 55.3 |
| BLAT14 | - | 16 | 19 | 19.9 | 98.7 | 19.4 | 97.0 | 98.9 |
| Scipio 1.04 | 16 | 16 | 10 | 86.2 | 98.7 | 85.9 | 97.8 | 99.8 |
1 Number of annotated exons, which are not overlapped by any predicted exon
2 Number of predicted exons, which are not overlapped by any annotated exon
3 Number annotated exons, which are overlapped by at least one predicted exon divided by the number of annotated exons
4 Mammalia cross species default options (for detailed parameters see Additional file 5)
5 Mammalia cross species default options; -tileSize = 6 (for detailed parameters see Additional file 5)
6 Parameters: -full -two_stages
7 Parameters: --model protein2genome
8 Parameters: --model protein2genome --bestn 1
9 Similarity: Weak; Search for one best alignment only (for more parameters see Additional file 5)
10 Organism: Human
11 Parameters: -both
12 Parameters: --species = human --genemodel = exactlyone (for more parameters see Additional file 5)
13 Organism: Vertebrate; Suboptimal exon cutoff: 1.00
14 Parameters as in 4: -tileSize = 7 -minIdentity = 54 -minScore = 15 -oneOff = 1
Test scenario 3 and 4: Difficult cases for reconstruction of gene structures
| Tool | Ned kinesin | Phs dynactin p62 | Hs dynactin p50 | Pug coronin | Mm dynactin p150 | Hs myosin | Th CAPα |
|---|---|---|---|---|---|---|---|
| Scipio 1.51 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Prosplign2 | o | o | ✓ | - | ✓ | - | ✓ |
| Exonerate3 | o | o | - | - | ✓ | - | - |
| Prot_map4 | o | o | - | - | - | o | - |
| Fgenesh+5 | o | o | - | - | - | o | o |
| Wise26 | o | o | - | - | - | - | - |
| Augustus7 | o | o | - | - | - | - | - |
| Fgenesh5 | o | o | - | - | - | - | - |
| Genscan8 | o | o | - | - | - | - | - |
| BLAT9 | o | o | - | - | - | - | - |
| Scipio 1.01 | o | o | - | - | - | - | - |
Ned: Neurospora discreta, Phs: Phytophthora sojae, Hs: Homo sapiens, Pug: Puccinia graminis, Mm: Mus musculus, Th: Thielavia heterothallica
✓ All exons are reconstructed correctly
o All annotated exons are matched by or overlap with predicted exons
- Exons are missing
1 Ned, Phs: cross species default options; Hs, Mm: default options, exhaust_align_size = 15000; Pug, Th: default options (for detailed parameters see Figures 3, 4, and 5 and Additional file 5)
2 Parameters: -full -two_stages
3 Parameters: --model protein2genome
4 Similarity: Weak; Search for one best alignment only (for more parameters see Additional file 5)
5 Organisms: Ned, Th: Neurospora crassa; Phs: Phytophtora; Hs: Human; Pug: Puccina; Mm: Mouse
6 Parameters: -both
7 Parameters: --genemodel = exactlyone; Organisms: Ned: --species = neurospora; Phs, Pug, Th: --species = generic; Hs, Mm: --species = human (for more parameters see Additional file 5)
8 Organism: Vertebrate; Suboptimal exon cutoff: 1.00
9 Parameters: -minScore = 15; Ned, Phs: -tileSize = 5 -minIdentity = 54 -oneOff = 1; Hs, Pug, Mm, Th: -tileSize = 7 -minIdentity = 81