
EuGène-maize: a web site for maize gene prediction.

Abstract

MOTIVATION: A large part of the maize B73 genome sequence is now available, and emerging sequencing technologies will offer cheap and easy ways to sequence areas of interest from many other maize genotypes. One of the steps required to turn these sequences into valuable information is gene content prediction. To date, there is no publicly available gene predictor specifically trained for maize sequences. To this end, we have chosen to train the EuGène software, which can combine several sources of evidence into a consolidated gene model prediction.

AVAILABILITY: http://genome.jouy.inra.fr/eugene/cgi-bin/eugene_form.pl.

Substances: DNA, Complementary

Year: 2010 PMID: 20400755 PMCID: PMC2859131 DOI: 10.1093/bioinformatics/btq123

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The B73 maize genome sequence is now available (Schnable et al., 2009). We can anticipate that next generation sequencing technologies will soon supply a deluge of genomic data from other maize genotypes. Therefore, in addition to the annotation provided by the maize sequence consortium for the B73 genotype, the community will also need a tool for annotating maize genomic sequences produced from other genotypes. To date, the www.maizesequence.org web site provides the Filtered Gene Set, which includes the annotation of 32 540 gene models (RefGen_v1) based on biological evidence. However, no tool is yet provided for annotating user-supplied sequences. Fgenesh (Salamov and Solovyev, 2000) was among the first ab initio gene prediction programs available for maize, although it was trained for monocot species rather than for maize specifically. Combiner software such as EuGène (Foissac and Schiex, 2005) can improve on its own ab initio predictions by integrating information from sequence alignment software, from splice site and translation start site predictors, or from other gene finders. EuGène uses probabilistic Markov models to discriminate coding sequences from non-coding ones, and genuine splice sites from false ones. Gene models generated by EuGène are associated with a score based on all the available information. In order to calculate the weight of each information source and to calibrate its ab initio prediction module, EuGène was trained using a curated maize gene sequence set built in this study.
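
As a rough illustration of how Markov models can separate coding from non-coding sequence, the following Python sketch trains two fixed-order Markov models with add-one smoothing and scores a sequence by its log-likelihood ratio. It is not EuGène's implementation: the function names, the model order, the smoothing scheme and the toy training sequences are all assumptions made for illustration, and real gene finders typically use frame-aware models trained on far larger sets.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, nxt = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if nxt in ALPHABET and all(c in ALPHABET for c in ctx):
                counts[ctx][nxt] += 1
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx].values()) + len(ALPHABET) * pseudocount
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in ALPHABET
        }
    return model


def log_likelihood(seq, model, order=2):
    """Sum the log transition probabilities along the sequence."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if ctx in model and nxt in model[ctx]:
            total += model[ctx][nxt]
    return total


def coding_score(seq, coding_model, noncoding_model, order=2):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_likelihood(seq, coding_model, order) - log_likelihood(
        seq, noncoding_model, order
    )


# Toy training data (hypothetical): GC-rich "coding", AT-rich "non-coding".
coding_model = train_markov(["GCCGCGGCGGCC" * 5])
noncoding_model = train_markov(["ATATTATAATTA" * 5])

gc_score = coding_score("GCGGCCGCG", coding_model, noncoding_model)
at_score = coding_score("ATTATAAT", coding_model, noncoding_model)
```

With these toy models, a GC-rich query scores above zero and an AT-rich query below zero; in practice the two training sets would be the curated coding and non-coding regions.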

2 METHODS

To build cognate gDNA/cDNA pairs, 6700 maize genomic sequences (including 4000 BACs) and 5500 full-length cDNAs (FLcDNAs) were extracted from the NCBI databases and filtered using an automatic pipeline developed in-house. First, cDNA sequences were trimmed using Seqclean (http://www.tigr.org/) and the UniVec database (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html). Then cDNA redundancy was removed based on an ‘all against all’ cDNA BLAST analysis (E-value threshold 1e-100). Non-redundant cDNAs were then aligned onto the BAC sequences using BLAT (Kent, 2002). To avoid ambiguous alignments, such as a cDNA mapping to several gDNAs, a spliced alignment was computed using GenomeThreader (Gremme et al., 2005), and the best alignment was retained as the correct gDNA/cDNA pair. Next, for each gDNA/cDNA pair, the coding sequence was determined from an alignment with the corresponding protein from maize or rice. Each of the 247 validated pairs was manually checked before inclusion in the training set. The third-party prediction programs used by EuGène-maize are SpliceMachine (Degroeve et al., 2005; donor and acceptor splice site prediction), GenomeThreader (spliced alignment), BLASTX (Altschul et al., 1997; protein alignment) and, optionally, Fgenesh (ab initio gene prediction). The sequence database used for prediction contains 69 306 maize FLcDNAs from GenBank, 20 508 rice mRNAs from RAPdb, 342 491 maize PlantGDB-assembled Unique Transcripts (PUTs) from PlantGDB, 593 maize proteins from UniProt/Swiss-Prot, 19 836 rice proteins from RAPdb and 26 751 Arabidopsis proteins from TAIR (see the EuGène-maize web site for an updated listing of sequence resources and corresponding full references).
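
The step that keeps a single gDNA per cDNA can be sketched as follows in Python. This is only an illustration under stated assumptions: the text does not say which criterion defines the "best" alignment, so the `score` field, the `SplicedAlignment` record and the `best_pairs` function are hypothetical, and the identifiers in the toy data are invented.

```python
from typing import Dict, Iterable, NamedTuple


class SplicedAlignment(NamedTuple):
    cdna_id: str
    gdna_id: str
    score: float  # hypothetical quality score; higher is assumed to be better


def best_pairs(alignments: Iterable[SplicedAlignment]) -> Dict[str, SplicedAlignment]:
    """Keep, for each cDNA, the single highest-scoring spliced alignment."""
    best: Dict[str, SplicedAlignment] = {}
    for aln in alignments:
        current = best.get(aln.cdna_id)
        if current is None or aln.score > current.score:
            best[aln.cdna_id] = aln
    return best


# Toy input: cDNA_1 maps to two genomic sequences, cDNA_2 to one.
toy_alignments = [
    SplicedAlignment("cDNA_1", "gDNA_A", 0.91),
    SplicedAlignment("cDNA_1", "gDNA_B", 0.99),
    SplicedAlignment("cDNA_2", "gDNA_A", 0.97),
]
pairs = best_pairs(toy_alignments)
```

Here `pairs["cDNA_1"]` retains only the alignment to `gDNA_B`, so each cDNA ends up in exactly one gDNA/cDNA pair before the manual check.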

3 RESULTS

3.1 Gene prediction assessment

The training gene set was compared with 330 curated genes (Haberer et al., 2005) and was found to be representative of maize genes (Table 1). To assess EuGène-maize, we performed gene prediction on eight BACs (AC211245, AC190915, AC204601, AC186187, AC211225, AC193983, AC200414 and AC194325) for which manually curated annotations of 42 genes are available (Liu et al., 2007). We compared these results (Table 2) with the GeneBuilder (Liang et al., 2009) predictions of B73 RefGen_v1 (Schnable et al., 2009). Nearly all loci are detected by both predictors; however, GeneBuilder missed several mono-exonic genes. Exon-level assessment shows that GeneBuilder is more sensitive (79% versus 73%) yet less specific (74% versus 84%) than EuGène. A gene containing 18 exons was incorrectly split by both tools (GRMZM2G119544_E01 and GRMZM2G119496_E01). Another gene (GRMZMM2G086779) containing four exons was split by EuGène only. In two other instances (GRMZM2G520535 and GRMZM2G177098), GeneBuilder incorrectly merged two adjacent genes, whereas EuGène made this error only once.

Table 1.

Comparison of several maize gene set statistics

	Training set^a	Curated set^b	Maize all^c	Maize cDNA^d
Genes	247	330	32 540	20 867
Av. gene size (kb)	3.5	4	3.7	3.5
Exons	1321	1520	–	–
Av. no. of exons/gene	5.4	4.6	5.3	4.7
Av. exon size (kb)	0.22	0.25	0.3	0.3
Av. intron size (kb)	0.52	0.6	0.52	0.58
G + C gene (%)	47.6	–	47.1	47.1
G + C exon (%)	53.4	55.4	52.7	53.4
G + C intron (%)	42.3	42.3	42.1	42.5

^a The curated maize gene training set built in this study.

^b A maize set of curated genes from Haberer et al. (2005).

^c All maize genes in the B73 RefGen_v1 filtered set.

^d Maize gene models supported by FLcDNAs, from Schnable et al. (2009).

Table 2.

EuGène-maize and GeneBuilder assessment comparison

	Missed loci	Exon Se (%)	Exon Sp (%)
GeneBuilder B73 RefGen_v1	4	79	74
EuGène-maize	1	73	84

Se, sensitivity (fraction of actual exons predicted among total actual exons); Sp, specificity (fraction of actual exons predicted among total predicted exons).
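
The two exon-level measures defined above can be computed as in the following Python sketch. It assumes that a predicted exon counts as correct only when its sequence, strand and both boundaries match an annotated exon exactly; the text does not state the matching criterion, so this rule, the `exon_se_sp` function and the toy coordinates are illustrative assumptions.

```python
from typing import Set, Tuple

Exon = Tuple[str, int, int, str]  # (sequence id, start, end, strand)


def exon_se_sp(actual: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent, using exact exon matches."""
    correct = len(actual & predicted)
    se = 100.0 * correct / len(actual) if actual else 0.0
    sp = 100.0 * correct / len(predicted) if predicted else 0.0
    return se, sp


# Toy data: four annotated exons, three predicted, two of them exact matches.
actual_exons = {
    ("bac1", 100, 200, "+"),
    ("bac1", 300, 400, "+"),
    ("bac1", 500, 650, "+"),
    ("bac1", 800, 900, "+"),
}
predicted_exons = {
    ("bac1", 100, 200, "+"),
    ("bac1", 300, 400, "+"),
    ("bac1", 500, 640, "+"),  # wrong 3' boundary, so not counted as correct
}
se, sp = exon_se_sp(actual_exons, predicted_exons)
```

In this toy case two of four annotated exons are recovered (Se = 50%) and two of three predictions are correct (Sp is about 66.7%), which shows how a predictor can trade one measure against the other.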


3.2 Web access

EuGène-maize is available online (see Availability section). Genomic sequences can be masked prior to the prediction step. Masking is computed by RepeatMasker (A.F.A. Smit et al., unpublished data) using the mips Repeat Element Database (Redat 4.3) (Spannagl et al., 2007). The RepeatMasker low-complexity masking option is disabled. The user may also submit, if available, the output file from the Fgenesh software (version 2.4). The gene prediction computation takes <5 min for a 200 kb genomic sequence, but >1 h if the RepeatMasker option is enabled. The results are compressed into an archive file and e-mailed to the user. The archive contains the parameters and options used for prediction, the submitted sequence, the masked sequence (if relevant), the annotation files (GFF, GFF3 and FASTA formats) and an HTML file that allows the results to be displayed in a web browser.
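
Once the archive is unpacked, the GFF3 annotation can be summarized with a few lines of Python. The sketch below relies only on the standard nine tab-separated GFF3 columns; the exact feature types and attribute tags written by EuGène-maize are not described here, so the toy records, the `example` source label and the `count_feature_types` function are assumptions made for illustration.

```python
from collections import Counter


def count_feature_types(gff3_text: str) -> Counter:
    """Count features per type (column 3) in GFF3-formatted text."""
    counts: Counter = Counter()
    for line in gff3_text.splitlines():
        # Skip blank lines, directives (##...) and comments (#...).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Hypothetical records standing in for a real annotation file.
rows = [
    ("seq1", "example", "gene", "100", "900", ".", "+", ".", "ID=gene1"),
    ("seq1", "example", "mRNA", "100", "900", ".", "+", ".", "ID=mrna1;Parent=gene1"),
    ("seq1", "example", "exon", "100", "300", ".", "+", ".", "Parent=mrna1"),
    ("seq1", "example", "exon", "500", "900", ".", "+", ".", "Parent=mrna1"),
]
toy_gff3 = "##gff-version 3\n" + "\n".join("\t".join(r) for r in rows)
feature_counts = count_feature_types(toy_gff3)
```

Dividing the exon count by the gene count in such a summary gives a quick per-sequence figure that can be set against the averages in Table 1.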

10 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. SpliceMachine: predicting splice sites from high-dimensional local context representations.

Authors: Sven Degroeve; Yvan Saeys; Bernard De Baets; Pierre Rouzé; Yves Van de Peer
Journal: Bioinformatics Date: 2004-11-25 Impact factor: 6.937

3. A GeneTrek analysis of the maize genome.

Authors: Renyi Liu; Clémentine Vitte; Jianxin Ma; A Assibi Mahama; Thanda Dhliwayo; Michael Lee; Jeffrey L Bennetzen
Journal: Proc Natl Acad Sci U S A Date: 2007-07-05 Impact factor: 11.205

4. MIPS plant genome information resources.

Authors: Manuel Spannagl; Georg Haberer; Rebecca Ernst; Heiko Schoof; Klaus F X Mayer
Journal: Methods Mol Biol Date: 2007

5. The B73 maize genome: complexity, diversity, and dynamics.

Authors: Patrick S Schnable; Doreen Ware; Robert S Fulton; Joshua C Stein; Fusheng Wei; Shiran Pasternak; Chengzhi Liang; Jianwei Zhang; Lucinda Fulton; Tina A Graves; Patrick Minx; Amy Denise Reily; Laura Courtney; Scott S Kruchowski; Chad Tomlinson; Cindy Strong; Kim Delehaunty; Catrina Fronick; Bill Courtney; Susan M Rock; Eddie Belter; Feiyu Du; Kyung Kim; Rachel M Abbott; Marc Cotton; Andy Levy; Pamela Marchetto; Kerri Ochoa; Stephanie M Jackson; Barbara Gillam; Weizu Chen; Le Yan; Jamey Higginbotham; Marco Cardenas; Jason Waligorski; Elizabeth Applebaum; Lindsey Phelps; Jason Falcone; Krishna Kanchi; Thynn Thane; Adam Scimone; Nay Thane; Jessica Henke; Tom Wang; Jessica Ruppert; Neha Shah; Kelsi Rotter; Jennifer Hodges; Elizabeth Ingenthron; Matt Cordes; Sara Kohlberg; Jennifer Sgro; Brandon Delgado; Kelly Mead; Asif Chinwalla; Shawn Leonard; Kevin Crouse; Kristi Collura; Dave Kudrna; Jennifer Currie; Ruifeng He; Angelina Angelova; Shanmugam Rajasekar; Teri Mueller; Rene Lomeli; Gabriel Scara; Ara Ko; Krista Delaney; Marina Wissotski; Georgina Lopez; David Campos; Michele Braidotti; Elizabeth Ashley; Wolfgang Golser; HyeRan Kim; Seunghee Lee; Jinke Lin; Zeljko Dujmic; Woojin Kim; Jayson Talag; Andrea Zuccolo; Chuanzhu Fan; Aswathy Sebastian; Melissa Kramer; Lori Spiegel; Lidia Nascimento; Theresa Zutavern; Beth Miller; Claude Ambroise; Stephanie Muller; Will Spooner; Apurva Narechania; Liya Ren; Sharon Wei; Sunita Kumari; Ben Faga; Michael J Levy; Linda McMahan; Peter Van Buren; Matthew W Vaughn; Kai Ying; Cheng-Ting Yeh; Scott J Emrich; Yi Jia; Ananth Kalyanaraman; An-Ping Hsia; W Brad Barbazuk; Regina S Baucom; Thomas P Brutnell; Nicholas C Carpita; Cristian Chaparro; Jer-Ming Chia; Jean-Marc Deragon; James C Estill; Yan Fu; Jeffrey A Jeddeloh; Yujun Han; Hyeran Lee; Pinghua Li; Damon R Lisch; Sanzhen Liu; Zhijie Liu; Dawn Holligan Nagel; Maureen C McCann; Phillip SanMiguel; Alan M Myers; Dan Nettleton; John Nguyen; Bryan W Penning; Lalit Ponnala; Kevin L Schneider; David C Schwartz; Anupma Sharma; Carol Soderlund; Nathan M Springer; Qi Sun; Hao Wang; Michael Waterman; Richard Westerman; Thomas K Wolfgruber; Lixing Yang; Yeisoo Yu; Lifang Zhang; Shiguo Zhou; Qihui Zhu; Jeffrey L Bennetzen; R Kelly Dawe; Jiming Jiang; Ning Jiang; Gernot G Presting; Susan R Wessler; Srinivas Aluru; Robert A Martienssen; Sandra W Clifton; W Richard McCombie; Rod A Wing; Richard K Wilson
Journal: Science Date: 2009-11-20 Impact factor: 47.728

6. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

7. Ab initio gene finding in Drosophila genomic DNA.

Authors: A A Salamov; V V Solovyev
Journal: Genome Res Date: 2000-04 Impact factor: 9.043

8. Structure and architecture of the maize genome.

Authors: Georg Haberer; Sarah Young; Arvind K Bharti; Heidrun Gundlach; Christina Raymond; Galina Fuks; Ed Butler; Rod A Wing; Steve Rounsley; Bruce Birren; Chad Nusbaum; Klaus F X Mayer; Joachim Messing
Journal: Plant Physiol Date: 2005-12 Impact factor: 8.340

9. Evidence-based gene predictions in plant genomes.

Authors: Chengzhi Liang; Long Mao; Doreen Ware; Lincoln Stein
Journal: Genome Res Date: 2009-06-18 Impact factor: 9.043

10. Integrating alternative splicing detection into gene prediction.

Authors: Sylvain Foissac; Thomas Schiex
Journal: BMC Bioinformatics Date: 2005-02-10 Impact factor: 3.169
