Oliver Keller, Florian Odronitz, Mario Stanke, Martin Kollmar, Stephan Waack.
Abstract
BACKGROUND: For many types of analyses, data about gene structure and locations of non-coding regions of genes are required. Although a vast amount of genomic sequence data is available, precise annotation of genes is lagging behind. Finding the gene corresponding to a given protein sequence by means of conventional tools is error-prone and cannot be completed without manual inspection, which is time-consuming and requires considerable experience.
Year: 2008 PMID: 18554390 PMCID: PMC2442105 DOI: 10.1186/1471-2105-9-278
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The Scipio workflow. This diagram depicts the data flow of a Scipio run. Scipio needs two FASTA files as input, one containing the protein query and one containing the genome sequence. Scipio starts BLAT and processes the BLAT results in a series of steps, successively refining and assembling the hits. Scipio's output is a YAML file, which can be converted into a GFF file or a log file. YAML files can also be edited manually and read by a parser; parsers exist for all modern programming languages. The resulting data structure can then be processed further.
Figure 2. Types of discrepancies. This chart lists all types of discrepancies between the protein query and target translation/DNA that are known to Scipio. The identifiers as written into the log files are given.
Figure 3. In-Species Performance. This chart shows Scipio's performance when searching in-species. The charts show histograms depicting how many sequences were found on a particular number of contigs in the genome. Black rectangles represent ten, grey ones five and white ones single sequences. 'Complete' means the queries were found without discrepancies. 'Complete (mm/fs)' means that Scipio found the complete gene without gaps but with discrepancies such as mismatches or frameshifts. 'Incomplete' means that Scipio could not determine the complete gene structure with standard parameters.
Figure 4. Cross-Species Performance. This chart shows Scipio's performance when searching cross-species. The charts show the dependency of the completeness of the gene reconstruction on the identity of the protein sequences. Red dots show searches with human sequences against the genome of Pongo pygmaeus abelii, black dots show searches with human sequences against the genome of Callithrix jacchus. A: Completeness compared to the query sequence. B: Completeness compared to the manually annotated sequence. A BLAT tile size of 7 was used. C: As in B, but with tile size 5.
Sequences tested in comparative analysis.
| Species | Typical target size (kbp) | # of target sequences | # of query sequences |
| | 2.8 | 213 289 | 8 |
| | 9.4 | 31 198 | 37 |
| | 14 | 66 482 | 32 |
| | 27 | 26 605 | 35 |
| | 1 827 | 6 181 | 35 |
| | 70 | 21 425 | 38 |
| | 113 | 36 206 | 40 |
| | 825 | 9 080 | 42 |
| | 6 969 | 69 724 | 37 |
| | 21 575 | 13 | 38 |
| | 2 409 | 143 | 15 |
| | 153 287 | 24 | 40 |
| | 150 724 | 22 | 40 |
A list of nine insect, one fungal, and two primate genomes that were searched for kinesin and myosin genes to compare the performance of Scipio with that of BLAT and Exonerate. Genomic target sequences were taken from different stages of assembly, as can be seen from the different typical target sizes (D. melanogaster, H. sapiens and M. mulatta were given as sets of complete chromosomes; two versions of the N. vitripennis genome were compared; the genome of A. gambiae was given partly in chromosomes, partly in small contigs). The protein query sequences were taken from the same species as the genome.
Percentage of residues left unmatched.
| Species | Scipio (automatic assembling) | BLAT (with manual assembling) | Exonerate (with manual assembling) | BLAT (without assembling) | Exonerate (without assembling) |
| | 15.03 | 15.41 | 17.03 | 43.61 | 43.64 |
| | 7.11 | 7.44 | 8.14 | 31.83 | 30.33 |
| | 7.93 | 8.43 | 7.86 | 36.09 | 36.38 |
| | 0.20 | 0.60 | 0.45 | 10.17 | 9.38 |
| | 1.11 | 1.50 | 1.38 | 3.70 | 3.89 |
| | 0.11 | 0.45 | 0.95 | 9.05 | 10.14 |
| | 0.73 | 1.01 | 0.21 | 10.43 | 10.22 |
| | 1.26 | 1.84 | 1.33 | 1.84 | 1.88 |
| | 0.02 | 0.31 | 0.26 | 0.31 | 0.63 |
| | 0.00 | 0.32 | 0.05 | 0.32 | 0.05 |
| | 0.01 | 0.18 | 0.08 | 0.18 | 2.63 |
| | 0.06 | 0.92 | 0.10 | 0.92 | 0.10 |
| | 2.29 | 3.09 | 2.31 | 3.09 | 7.82 |
| total | 1.67 | 2.16 | 1.87 | 8.24 | 8.71 |
The percentage of residues of the query sequences that the compared tools failed to recover from the target sequence. To obtain comparable results, the hits proposed by BLAT and Exonerate were assembled manually: a collection of best-scoring non-overlapping hits was chosen for each query. The last two columns show the results if only the best-scoring hit for each query was used.
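The assembling step described in this caption can be illustrated with a small sketch. The original assembling was done by hand; the greedy selection below is only one plausible way to pick "best-scoring non-overlapping hits", and the `Hit` record, its fields, and both function names are hypothetical and not part of BLAT, Exonerate, or Scipio.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One alignment hit in query coordinates (0-based, end exclusive).

    Hypothetical record for illustration; not a BLAT or Exonerate format.
    """
    q_start: int
    q_end: int
    score: float
    matched: int  # query residues matched inside this hit


def pick_non_overlapping(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best-scoring hits that do not overlap on the query."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Accept the hit only if it is disjoint from every hit kept so far.
        if all(hit.q_end <= c.q_start or hit.q_start >= c.q_end for c in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.q_start)


def percent_unmatched(query_len: int, hits: List[Hit]) -> float:
    """Percentage of query residues not recovered by the given hits."""
    matched = sum(h.matched for h in hits)
    return 100.0 * (query_len - matched) / query_len


# Toy data (invented): a 100-residue query with three candidate hits.
hits = [
    Hit(0, 50, 90.0, 48),
    Hit(40, 80, 60.0, 38),   # overlaps the first hit, so it is dropped
    Hit(50, 100, 70.0, 45),
]
assembled = pick_non_overlapping(hits)
best_only = [max(hits, key=lambda h: h.score)]

print(percent_unmatched(100, assembled))  # with assembling
print(percent_unmatched(100, best_only))  # without assembling
```

On this toy input, the assembled hits leave far fewer residues unmatched than the single best hit, which is the kind of gap the "with manual assembling" and "without assembling" columns compare.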
Percentage of perfectly aligned queries.
| Species | Scipio (automatic assembling) | BLAT (with manual assembling) | Exonerate (with manual assembling) | BLAT (without assembling) | Exonerate (without assembling) |
| | 25.0 | 0.0 | 25.0 | 0.0 | 12.5 |
| | 37.8 | 2.7 | 21.6 | 2.7 | 16.2 |
| | 15.6 | 3.1 | 12.5 | 0.0 | 9.4 |
| | 71.4 | 11.4 | 57.1 | 11.4 | 54.3 |
| | 68.6 | 11.4 | 65.7 | 11.4 | 65.7 |
| | 63.2 | 7.9 | 55.3 | 2.6 | 55.3 |
| | 87.5 | 7.5 | 52.5 | 7.5 | 50.0 |
| | 88.1 | 0.0 | 76.2 | 0.0 | 76.2 |
| | 94.6 | 8.1 | 73.0 | 8.1 | 73.0 |
| | 100.0 | 10.5 | 86.8 | 10.5 | 86.8 |
| | 86.7 | 6.7 | 73.3 | 6.7 | 73.3 |
| | 62.5 | 0.0 | 50.0 | 0.0 | 50.0 |
| | 12.5 | 0.0 | 10.0 | 0.0 | 10.0 |
| total | 64.5 | 5.5 | 51.7 | 4.8 | 50.3 |
The percentage of query sequences that the programs predicted at exactly the correct location, with 100% matching residues and without frameshifts or false positives. This figure indicates how much manual postprocessing of the hits is needed.
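The criterion in this caption can be written as a simple predicate. The following is a minimal sketch, assuming each prediction has already been compared against a reference annotation; the `Prediction` record and its fields are hypothetical and do not correspond to any output format of the compared tools.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """Evaluation summary for one query (hypothetical fields)."""
    query_len: int        # residues in the protein query
    matched: int          # query residues matched by the prediction
    frameshifts: int      # frameshifts reported in the alignment
    false_positives: int  # predicted residues absent from the reference
    correct_locus: bool   # prediction lies at the annotated location


def is_perfect(p: Prediction) -> bool:
    """True if the prediction meets all criteria named in the caption."""
    return (
        p.correct_locus
        and p.matched == p.query_len
        and p.frameshifts == 0
        and p.false_positives == 0
    )


def percent_perfect(preds: List[Prediction]) -> float:
    """Percentage of queries whose prediction is perfect."""
    if not preds:
        raise ValueError("no predictions given")
    return 100.0 * sum(is_perfect(p) for p in preds) / len(preds)


# Toy data (invented): four queries, one of them perfectly aligned.
preds = [
    Prediction(500, 500, 0, 0, True),   # perfect
    Prediction(500, 498, 0, 0, True),   # two residues unmatched
    Prediction(300, 300, 1, 0, True),   # one frameshift
    Prediction(300, 300, 0, 0, False),  # wrong location
]
print(percent_perfect(preds))
```

Any query that fails the predicate is one that would need manual inspection, which is why a higher percentage implies less postprocessing.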