William A Moskal, Hank C Wu, Beverly A Underwood, Wei Wang, Christopher D Town, Yongli Xiao.
Abstract
BACKGROUND: Several lines of evidence support the existence of novel genes and other transcribed units that have not yet been annotated in the Arabidopsis genome. Two gene prediction programs that make use of comparative genomic analysis, Twinscan and EuGene, have recently been deployed on the Arabidopsis genome. The ability of these programs to use sequence data from other species has allowed both Twinscan and EuGene to predict over 1000 genes that are intergenic with respect to the most recent annotation release. A high-throughput RACE pipeline was used in an attempt to verify the structure and expression of these novel genes.
Year: 2007 PMID: 17229318 PMCID: PMC1783852 DOI: 10.1186/1471-2164-8-18
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Gene structure statistics for intergenic Twinscan predictions and intergenic EuGene predictions
| Statistic | Twinscan | EuGene |
|---|---|---|
| # of intergenic predictions | 1515 | 1774 |
| Mean CDS length | 342 bp | 254 bp |
| Mean number of exons | 1.8 | 1.5 |
| Percent of single exon genes | 854 (56%) | 1086 (61%) |
| Mean CDS length, single-exon predictions | 303 bp | 260 bp |
| Mean CDS length, multi-exon | 391 bp | 245 bp |
| # of spliced predictions < 100 bp | 27 | 239 |
| # of predictions > 300 bp | 608 (40%) | 403 (23%) |
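The percentages in the table above can be re-derived from the raw counts. The following is a minimal Python sketch of that arithmetic. It assumes the first value column is Twinscan and the second is EuGene (the order given in the caption), and that the "> 300 bp" row refers to CDS length; the dictionary keys and the `percent` helper are illustrative names, not taken from the paper.

```python
# Counts copied from the table above. The column assignment (Twinscan first,
# EuGene second) is an assumption based on the order used in the caption.
counts = {
    "Twinscan": {"total": 1515, "single_exon": 854, "over_300bp": 608},
    "EuGene": {"total": 1774, "single_exon": 1086, "over_300bp": 403},
}


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


for program, c in counts.items():
    single = percent(c["single_exon"], c["total"])
    longer = percent(c["over_300bp"], c["total"])
    print(f"{program}: {single}% single-exon, {longer}% longer than 300 bp")
```

Rounding to whole percentages mirrors the precision reported in the table.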
Success rate and structural characteristics for RACE-targeted intergenic genes.
| Statistic | Twinscan | EuGene | Combined |
|---|---|---|---|
| Number targeted | 726 | 623 | 1071* |
| Mean CDS length | 456 bp | 354 bp | - |
| Median CDS length | 357 bp | 264 bp | - |
| Average # of exons | 2.0 | 1.4 | - |
| % single exon predictions | 423 (58%) | 452 (73%) | - |
| Number successful† | 304 (42%) | 257 (41%) | 378 (35%) |
| Mean CDS length | 445 bp | 452 bp | 441 bp |
| Median CDS length | 339 bp | 330 bp | 321 bp |
| Average # of exons | 1.8 | 1.7 | 1.7 |
* Primer pairs were designed for a total of 1071 loci. Of these, 623 primer pairs were compatible with a EuGene prediction and 726 were compatible with a Twinscan prediction; 278 primer pairs were compatible with both a Twinscan and a EuGene prediction.
† For Twinscan and EuGene, success rates are defined as the number of FL sequences obtained that overlap a Twinscan or EuGene prediction, as compared to the number targeted. For the Combined category, the success rate represents the total number of loci verified with respect to the total number targeted.
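The two footnotes describe simple arithmetic: the combined number of targeted loci counts primer pairs shared by both programs only once, and each success rate is the number of successes divided by the number targeted. A minimal Python sketch of that reasoning follows, using only figures reported in the table and footnotes; the variable and function names are illustrative.

```python
# Values copied from the table and its footnotes.
twinscan_targeted = 726
eugene_targeted = 623
both_targeted = 278  # primer pairs compatible with both predictions

# Inclusion-exclusion: loci targeted for either program, counted once.
combined_targeted = twinscan_targeted + eugene_targeted - both_targeted


def success_rate(successful: int, targeted: int) -> int:
    """Return the success rate as a percentage rounded to the nearest integer."""
    return round(100 * successful / targeted)


rates = {
    "Twinscan": success_rate(304, twinscan_targeted),
    "EuGene": success_rate(257, eugene_targeted),
    "Combined": success_rate(378, combined_targeted),
}
print(combined_targeted, rates)
```

The combined total and the three rates reproduce the 1071 loci and the 42%, 41%, and 35% values reported in the table.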
Figure 1. Merging of EuGene predictions. At this locus, Twinscan predicts a single large 3-exon ORF (yellow), while EuGene splits this gene into 2 smaller ORFs in the same frame (green). A minimal set of sequence assemblies generated by PASA is shown, with ORFs in red. We recovered experimental evidence supporting transcription of both the merged ORF, as best predicted by Twinscan, and one of the smaller ORFs, as better predicted by EuGene. Poly(A) tail locations are denoted by a green 'A'.
Figure 2. Example of a previously unpredicted gene. This region contains a EuGene prediction, At01eug01210 (green), and a Twinscan prediction, At.chr1.1.117 (yellow). Sequencing reads are shown in black. A minimal set of sequence assemblies is also shown, with potential ORFs highlighted in red. Conserved splice junctions are shown as blue bars. Poly(A) tail locations are denoted by a green 'A'.
Figure 3. Extent of alternative splicing of previously unannotated genes. Number of isoforms per 100 genes.
Figure 4. Structural accuracy of Twinscan and EuGene predictions. Gene-level sensitivity (Sn) and specificity (Sp) were calculated using GTF files generated from BLAT alignment coordinates and the Eval software package.
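To illustrate what gene-level sensitivity and specificity measure, here is a minimal Python sketch using the convention common in gene-prediction benchmarking: Sn is the fraction of reference gene structures reproduced exactly by a prediction, and Sp is the fraction of predictions that exactly match a reference structure. This is not the Eval implementation. The exact-match criterion, the toy coordinates, and all names are assumptions for illustration, and strand and chromosome are ignored for brevity.

```python
from typing import FrozenSet, List, Tuple

Exon = Tuple[int, int]   # (start, end) of one coding exon
Gene = FrozenSet[Exon]   # a gene structure represented as a set of exons


def gene_level_sn_sp(reference: List[Gene], predicted: List[Gene]) -> Tuple[float, float]:
    """Return (Sn, Sp), counting a gene as correct only if every exon matches."""
    ref_set = set(reference)
    pred_set = set(predicted)
    exact = ref_set & pred_set  # structures present in both sets
    sn = len(exact) / len(ref_set) if ref_set else 0.0
    sp = len(exact) / len(pred_set) if pred_set else 0.0
    return sn, sp


# Hypothetical toy data: three experimentally supported structures and
# two predictions, only one of which matches exactly.
reference = [
    frozenset({(100, 250), (400, 520)}),
    frozenset({(900, 1200)}),
    frozenset({(2000, 2300)}),
]
predicted = [
    frozenset({(100, 250), (400, 520)}),  # exact match
    frozenset({(900, 1150)}),             # wrong exon end, counted as incorrect
]

sn, sp = gene_level_sn_sp(reference, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

In this toy case one of three reference genes is recovered and one of two predictions is correct. An exact-match criterion is strict by design: a prediction with a single misplaced exon boundary counts as a miss at the gene level.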
Top ten blastx hits among 378 intergenic ORFs.
| Twinscan/EuGene prediction | Accession | Description | E-value |
|---|---|---|---|
| At.chr4.1.125/At04eug01370 | AAB61038.1 | contains similarity to membrane associated salt-inducible protein {Arabidopsis thaliana;} | 8.8e-273 |
| At.chr5.6.182/At05eug23610 | NP_001031936.1 | hydrolase, hydrolyzing O-glycosyl compounds {Arabidopsis thaliana;} | 6.5e-247 |
| At.chr5.5.142/At05eug19100 | NP_001031908.1 | Nucleotidyltransferase {Arabidopsis thaliana;} | 1e-240 |
| At.chr1.2.81/At01eug05270 | NP_001030977.1 | unknown protein {Arabidopsis thaliana;} | 1.7e-237 |
| At.chr1.10.7/At01eug36540 | AAG51252.1 | acetyl-CoA carboxylase, putative; {Arabidopsis thaliana;} | 2.2e-214 |
| At.chr1.15.124/At01eug50060 | NP_001031203.1 | unknown protein {Arabidopsis thaliana;} | 1e-202 |
| At.chr5.10.162/At05eug30680 | NP_001031973.1 | unknown protein {Arabidopsis thaliana;} | 3.7e-196 |
| At.chr3.3.273/At03eug12310 | AAG51009.1 | FKBP-type peptidyl-prolyl cis-trans isomerase, putative {Arabidopsis thaliana;} | 3.7e-196 |
| At.chr3.11.252/At03eug36080 | NP_001030807.1 | unknown protein {Arabidopsis thaliana;} | 2.2e-191 |
| At.chr2.1.132/At02eug01230 | NP_197902.1 | unknown protein {Arabidopsis thaliana;} | 1.4e-187 |
Figure 5. Expression pattern of At.chr1.16.98. GUS staining pattern observed for enhancer trap line ET7211, which tags a novel gene predicted by Twinscan as At.chr1.16.98.
Figure 6. Promoter-reporter analysis. The promoter of intergenic gene At.chr1.15.120 drives expression of the GUS and GFP reporter genes in identical patterns in independent lines generated using two different transformation constructs. Expression of the reporter gene is seen in the hydathode region of leaves (A, B) and at cauline branch junctions (C, D).