Alan L Kwan, Linya Li, David C Kulp, Susan K Dutcher, Gary D Stormo.
Abstract
BACKGROUND: The availability of whole-genome sequences allows for the identification of the entire set of protein-coding genes as well as their regulatory regions. This can be accomplished using multiple complementary methods, including ESTs, homology searches and ab initio gene predictions. Previously, the Genie gene-finding algorithm was trained on a small set of Chlamydomonas genes and shown to improve the accuracy of gene prediction in this species compared to other available programs. To improve ab initio gene finding in Chlamydomonas, we build a new training set of over 2,300 cDNAs by assembling more than 167,000 Chlamydomonas EST entries in GenBank with the EST assembly tool PASA.
Year: 2009 PMID: 19422688 PMCID: PMC2694837 DOI: 10.1186/1471-2164-10-210
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Analysis of PASA gene models: Categorization of the 2384 PASA EST assembly gene models
| Category | Count |
| Alignment to NCBI NRdb | 957/2384 |
| Absent from the NCBI NRdb | 1427/2384 |
| Exact overlap in | 222/1427 |
| Partial overlap in | 260/1427 |
| No overlap in | 945/1427 |
| Single exon | 835 |
| Tested via RT-PCR | 13 |
| Verified via RT-PCR | 10 |
Analysis of PASA gene models: RT-PCR testing of 13 novel, single exon PASA gene assemblies
| Assembly ID | Outcome |
| 3146_3724 | Present in cDNA |
| 5172_6168 | Present in cDNA |
| 8132_9749 | Present in cDNA |
| 9104_10933 | Present in cDNA |
| 9866_11843 | Present in cDNA |
| 11161_13363 | Present in cDNA |
| 11240_13451 | Present in cDNA |
| 11709_14017 | Present in cDNA |
| 14828_17825 | Present in cDNA |
| 16095_19351 | Present in cDNA |
| 14105_16951 | Not present in cDNA |
| 15620_18773 | Not present in cDNA |
| 14205_17074 | Not present in cDNA |
Present: A product of the correct size was found in samples by RT-PCR
Not present: No product was obtained by RT-PCR
Assembly ID numbers can be downloaded from . For primers used see Additional file 3
Comparing GreenGenie2 and GeneMark.hmm-ES 3.0 in gb140 catalog
| Level | n | GreenGenie2 Sensitivity | GreenGenie2 Specificity | GeneMark.hmm-ES 3.0 Sensitivity | GeneMark.hmm-ES 3.0 Specificity |
| Gene Level | (n = 140) | 0.51 | 0.47 | 0.31 | 0.24 |
| Exon Level | (n = 1145) | 0.83 | 0.83 | 0.79 | 0.74 |
| Initial Exons | (n = 133) | 0.65 | 0.60 | 0.50 | 0.40 |
| Internal Exons | (n = 870) | 0.87 | 0.88 | 0.84 | 0.84 |
| Terminal Exons | (n = 133) | 0.82 | 0.75 | 0.78 | 0.63 |
| Single Exon | (n = 7) | 0.71 | 0.62 | 0.00 | 0.00 |
| Nucleotide Level | (n = 713682) | 0.93 | 0.92 | 0.91 | 0.89 |
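The table above reports sensitivity and specificity without defining them here. A minimal sketch at the exon level follows, assuming the convention common in gene-prediction evaluation: sensitivity = TP / (TP + FN), and "specificity" = TP / (TP + FP), which is precision. The exon representation and the function name `exon_level_accuracy` are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical exon representation: (sequence id, start, end).
Exon = Tuple[str, int, int]


def exon_level_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for exact exon matches.

    Assumption: an exon counts as a true positive only if both of its
    boundaries match an annotated exon exactly.
    """
    true_pos = len(annotated & predicted)
    # Sensitivity: fraction of annotated exons that were predicted exactly.
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    # Specificity (gene-finding convention): fraction of predicted exons that are correct.
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
annotated = {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600), ("chr1", 700, 800)}
predicted = {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 650)}
sn, sp = exon_level_accuracy(annotated, predicted)
```

In this toy case two of four annotated exons are recovered and two of three predictions are correct. Gene-level and nucleotide-level figures would follow the same ratios with genes or bases as the counted unit.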
Comparison of gg2v3 and FGC07 catalog by overlap interval analysis
| Type of overlap | Count | Type of overlap | Count |
| Exact Overlap | 1,324 | Exact Overlap | 0 |
| Partial Overlap | 5,425 | Partial Overlap | 2,826 |
| No Overlap | 1,574 | No Overlap | 1,149 |
| Other | 16 | Other | 16 |
| Total | 8,339 | Total | 3,981 |
Complete model: Any model that includes a starting ATG gene feature and terminates with a stop codon (TAA, TAG or TGA).
Incomplete model: Any model that lacks a start or stop codon or both.
Other: Models with interlaced overlaps and concatenated exact overlaps.
Figure 1. Diagram of four classes of gene-level interval overlaps. Interval overlap analysis identifies four classes of predictions between the two catalogs. Grey tracks represent identical stretches of the genomic assembly. Blue and green boxes distinguish exons from the two catalogs. (A) Predictions are exact overlaps; (B) predictions show a partial gene overlap, with an exact overlap of the 5' exon and a partial overlap of the 3' exon; (C) predictions show a partial gene overlap, with an exact overlap of the terminal exons and an extra exon in one catalog but not the other; (D) a unique prediction present in one catalog but not in the other.
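The overlap classes in the figure can be illustrated with a small sketch. It assumes each gene model is a list of exons with inclusive (start, end) coordinates on the same sequence and strand, and it collapses panels B and C into a single "partial" class. The function name `classify_overlap` and the exact criteria are hypothetical; the paper's own interval analysis may apply different rules.

```python
from typing import List, Tuple

# Hypothetical exon representation: inclusive (start, end) on one sequence and strand.
Exon = Tuple[int, int]


def classify_overlap(model_a: List[Exon], model_b: List[Exon]) -> str:
    """Classify two gene models as 'exact', 'partial', or 'none'."""
    a, b = sorted(model_a), sorted(model_b)
    if a == b:
        # Every exon boundary agrees (panel A).
        return "exact"
    # Two inclusive intervals share a base when each starts before the other ends.
    shares_bases = any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)
    # Shared bases without full agreement covers panels B and C; otherwise panel D.
    return "partial" if shares_bases else "none"


# Toy coordinates, invented for illustration only.
panel_a = classify_overlap([(100, 200), (300, 400)], [(100, 200), (300, 400)])
panel_b = classify_overlap([(100, 200), (300, 400)], [(100, 200), (300, 450)])
panel_c = classify_overlap([(100, 200), (300, 400)], [(100, 200), (240, 260), (300, 400)])
panel_d = classify_overlap([(100, 200)], [(500, 600)])
```

Tallying these labels over all model pairs from two catalogs would give counts of the kind reported in the overlap table, although the "Other" category is not handled by this sketch.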
Experimental analysis of 13 randomly selected predictions that differ between the gg2v3 and FGC07 catalogs
| Models with alternate exon termini predicted in | Novel exons predicted in | |||
| 4t254 | + | -- | 1t16 | + |
| 11t344 | + | -- | 1t34 | + |
| 25t123 | + | -- | 1t147 | + |
| 24t200 | + | -- | 11t344 | + |
| 5t126 | -- | -- | 15t291 | + |
| 30t106 | + | |||
| 30t170 | + | |||
| 3t257 | -- | |||
+: A product of the correct size was found in samples by RT-PCR
--: No product was obtained by RT-PCR
*For primers see Additional files 4 and 5.
Experimental analysis of 10 randomly selected predictions unique to the gg2v3 or FGC07 catalogs
| Predictions exclusive to | Predictions exclusive to | ||
| Outcome | Outcome | ||
| 3t69 | + | 141597 | + |
| 19t170 | + | 181956 | + |
| 30t189 | + | 184911 | + |
| 76t11 | + | 141023 | -- |
| 69t65 | -- | 180935 | -- |
+: A product of the correct size was found in samples by RT-PCR
--: No product was obtained by RT-PCR
*For primers see Additional files 6 and 7.