Wheat Estimated Transcript Server (WhETS): a tool to provide best estimate of hexaploid wheat transcript sequence.

Rowan A C Mitchell, Nathalie Castells-Brooke, Jan Taubert, Paul J Verrier, David J Leader, Christopher J Rawlings.

Abstract

Wheat biologists face particular problems because of the lack of genomic sequence and the three homoeologous genomes which give rise to three very similar forms for many transcripts. However, over 1.3 million available public-domain Triticeae ESTs (of which approximately 850,000 are wheat) and the full rice genomic sequence can be used to estimate likely transcript sequences present in any wheat cDNA sample to which PCR primers may then be designed. Wheat Estimated Transcript Server (WhETS) is designed to do this in a convenient form, and to provide information on the number of matching EST and high quality cDNA (hq-cDNA) sequences, tissue distribution and likely intron position inferred from rice. Triticeae EST and hq-cDNA sequences are mapped onto rice loci and stored in a database. The user selects a rice locus (directly or via Arabidopsis) and the matching Triticeae sequences are assembled according to user-defined filter and stringency settings. Assembly is achieved initially with the CAP3 program and then with a single nucleotide polymorphism (SNP)-analysis algorithm designed to separate homoeologues. Alignment of the resulting contigs and singlets against the rice template sequence is then displayed. Sequences and assembly details are available for download in fasta and ace formats, respectively. WhETS is accessible at http://www4.rothamsted.bbsrc.ac.uk/whets.

Year: 2007; PMID: 17439966; PMCID: PMC1933201; DOI: 10.1093/nar/gkm220

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

Wheat is the most widely grown crop in the world with massive importance for human nutrition. However, genomics and DNA sequence analysis in wheat present particular problems. Cultivated wheat (Triticum aestivum) is an allohexaploid species with three homoeologous genomes (A, B, D), each comprising seven pairs of chromosomes. All three genomes are very large, so that together they contain about 30× as much DNA as rice and 6× as much as the human genome. Due to the technical difficulties, the complete genome sequence of wheat will not be available for several years at the earliest. However, there is a rich resource of wheat ESTs, of which there are ∼850 000, and a further ∼500 000 from other Triticeae species in dbEST [(1); January 2007]. These can be mapped to the genes of rice as the most closely related fully sequenced genome (2). In this way, all the ESTs derived from the same transcript can be grouped and linked to information on the orthologues in rice and Arabidopsis. This procedure thus facilitates the application of knowledge gained in model species, particularly Arabidopsis, to wheat crops.

The aim of Wheat Estimated Transcript Server (WhETS) is to allow the user to flexibly assemble Triticeae ESTs mapped to rice genes in this way and to provide access to the results in a convenient form. A common way to exploit ESTs is to use pre-existing assemblies such as Unigene (3). However, because WhETS assembles related sequences in real time, the user can adjust the set of ESTs to be used, alter the stringency setting and view the effect of the changes on the assembly. Also, by anchoring the ESTs to rice loci, non-contiguous ESTs representing the same genes are automatically treated as part of the same set.

The three very similar homoeologues of wheat genes are frequently all expressed (4), but assembly programs do not normally separate these, so they are grouped together in contigs. These homoeologous sequences are best identified by analysis of shared SNPs, such as can be achieved with SNPserver (5), which uses autoSNP (6) algorithms to separate alleles or homoeologues. A similar approach is included in WhETS to provide a best estimate of homoeologue-specific sequence. Aligning the Triticeae ESTs to rice has the additional advantage that likely intron positions can be inferred; these can be used to derive allelic markers, an approach taken in the USDA wheat SNP database (http://wheat.pw.usda.gov/SNP/new/index.shtml). Part of the aim of WhETS is therefore to bring together the useful features of Unigene, SNPserver and the USDA SNP database into one site tailored specifically for wheat transcripts. However, it also has features unavailable elsewhere, such as the ability to display Triticeae EST distribution corresponding to a set of rice loci and the option to filter sequences according to library source tissue.

DESCRIPTION

Database

WhETS has a relational database containing sequences and annotation for all Triticeae EST and high quality cDNA (hq-cDNA) sequences from dbEST; coding sequences, annotation and intron positions for rice from The Institute for Genomic Research (TIGR) rice pseudomolecule release 4 (7); and annotation for each locus from The Arabidopsis Information Resource (TAIR) version 6 (8) (Figure 1). EST sequences are first masked for vector contamination using the cross_match program (9). The WhETS database also contains the results of blast (10) similarity searches: blastp of all the TIGR rice protein sequences against all the TAIR Arabidopsis proteins, and blastn of the Triticeae ESTs against the TIGR rice CDS. These tables contain the top scoring hits and any lower scoring hits with longer aligned regions, thus defining many-to-many relationships between Arabidopsis and rice genes and between rice genes and Triticeae sequences. The database is updated weekly by automated scripts which compare the contents with Triticeae entries in GenBank using Entrez utilities (11). Any missing entries are downloaded and any extra ones deleted. New sequences are subjected to a blastn search against the rice sequences, and the sequences and blast results are added to the WhETS database (Figure 1).
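The rule for which blast hits are stored ("the top scoring hits and any lower scoring hits with longer aligned regions") can be illustrated with a minimal Python sketch. This is one possible reading of that sentence, not the WhETS implementation (which is in Perl and is detailed in the Supplementary Data). The `Hit` record, its field names and the function `retain_hits` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One blast hit; the field names are assumptions for this sketch."""
    query: str        # e.g. a Triticeae EST accession
    subject: str      # e.g. a rice locus identifier
    bit_score: float
    align_len: int    # length of the aligned region in bases


def retain_hits(hits: list[Hit]) -> list[Hit]:
    """For each query keep the top-scoring hit, plus any lower-scoring
    hit whose aligned region is longer than that of the top hit.

    Assumption: "longer" is judged against the top-scoring hit only.
    """
    by_query: dict[str, list[Hit]] = {}
    for hit in hits:
        by_query.setdefault(hit.query, []).append(hit)

    kept: list[Hit] = []
    for query_hits in by_query.values():
        ranked = sorted(query_hits, key=lambda h: h.bit_score, reverse=True)
        best = ranked[0]
        kept.append(best)
        kept.extend(h for h in ranked[1:] if h.align_len > best.align_len)
    return kept
```

Because a query can keep more than one subject, and a subject can be hit by many queries, a table filled this way naturally encodes the many-to-many relationships described above.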

Figure 1.

WhETS database preparation steps. Dashed arrows indicate steps which are repeated in automatic weekly updates.
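The weekly update described above amounts to a set comparison between the accessions held locally and those currently in GenBank. The following Python sketch shows only that comparison; how the two accession lists are obtained (for example via Entrez utilities) is left out, and the function name `plan_update` is a hypothetical one chosen for this illustration.

```python
def plan_update(
    local_accessions: set[str], genbank_accessions: set[str]
) -> tuple[set[str], set[str]]:
    """Return (to_download, to_delete) for one update cycle.

    to_download: entries present in GenBank but missing locally; these
                 would then be searched against the rice sequences.
    to_delete:   entries held locally that are no longer in GenBank.
    """
    to_download = genbank_accessions - local_accessions
    to_delete = local_accessions - genbank_accessions
    return to_download, to_delete
```

Only the sequences in `to_download` need a new blastn search, which is what keeps an update of this kind cheap relative to rebuilding the database.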

Real-time operation

The main part of WhETS requires TIGR rice loci identifiers. Users can start directly by supplying these as input, or they can start with a set of Arabidopsis AGI numbers or Triticeae accession numbers. WhETS will then retrieve all the matching rice loci for these. The user can then select filter settings for species, tissue and sequence type (EST or hq-cDNA). WhETS will then display the number and accessions of all the matching Triticeae sequences for each locus. The user can then select the locus for which they wish to obtain sequences in the main part of WhETS.

When a single rice locus is selected, the user can again filter for species, tissue and sequence type. The sequences which pass this filter are assembled using the CAP3 program (12), and the resulting contigs are passed to an algorithm which analyses shared SNPs. If the contigs are found to contain groups which share more SNPs (i.e. base differences from the consensus) than a user-defined cut-off (default five SNPs per kb), these are split off into separate contigs. The CAP3 step tends to assemble paralogues which match the same rice locus into separate contigs, whereas the SNP analysis step is designed to separate homoeologues. However, by selecting higher stringency the user may also separate allelic forms. Conversely, in situations where there are relatively few ESTs it can be useful to assemble the sequences from wheat and related species with low stringency. WhETS also assembles the hq-cDNA sequences where present, using a much higher base quality setting for the CAP3 program than that used for ESTs. This has the effect that the consensus sequence of any contig will normally be the same as any hq-cDNA within it.

After assembly, the rice CDS is aligned to the contigs’ consensus and singlet sequences with blast and the results are displayed using a modified version of a Perl script from the Korf et al. study (13). For contigs, links are supplied which open windows detailing all species, tissue, sequence type and cultivar of the constituent sequences. Singlets link out to the original NCBI entry. The main output for user downloading is a fasta file containing the rice template CDS, contigs’ consensus and singlet sequences. Additional details, such as intron positions, are supplied in the descriptor fields of this file. Also available are other files, such as ace format files for each of the contigs containing all the information on the constituent sequences and their alignment, and a spreadsheet-compatible file containing details of all SNPs used to split contigs.

WhETS is implemented in MySQL (http://www.mysql.com/) and Perl using some Bioperl (14) modules. More details on allocation of blast hits within the WhETS database, strand of EST used and the algorithm for separating contigs into putative homoeologues are available in the Supplementary Data.
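The idea of splitting a contig on shared SNPs can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the WhETS algorithm itself (that is described in the Supplementary Data): reads are assumed to be already aligned to equal length with `-` marking uncovered columns, the consensus is a simple per-column majority, and reads are linked whenever a pair shares more SNPs per kb of overlap than the cut-off. All function names are hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority base per column; '-' marks columns a read does not cover."""
    cols = []
    for column in zip(*reads):
        counts = Counter(base for base in column if base != "-")
        cols.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(cols)


def snp_set(read: str, cons: str) -> frozenset[tuple[int, str]]:
    """Positions and bases where a read differs from the consensus."""
    return frozenset(
        (i, b) for i, (b, c) in enumerate(zip(read, cons)) if b != "-" and b != c
    )


def split_contig(reads: dict[str, str], cutoff_per_kb: float = 5.0) -> list[list[str]]:
    """Return groups of read names: the reads left in the original contig
    first, then each group split off because its members share SNPs."""
    names = list(reads)
    cons = consensus([reads[n] for n in names])
    snps = {n: snp_set(reads[n], cons) for n in names}

    parent = {n: n for n in names}  # union-find over linked reads

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    linked: set[str] = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = sum(
                1 for x, y in zip(reads[a], reads[b]) if x != "-" and y != "-"
            )
            if overlap == 0:
                continue
            shared = len(snps[a] & snps[b])
            if shared * 1000 / overlap > cutoff_per_kb:
                parent[find(b)] = find(a)
                linked.update((a, b))

    main = [n for n in names if n not in linked]
    groups: dict[str, list[str]] = {}
    for n in names:
        if n in linked:
            groups.setdefault(find(n), []).append(n)
    return [g for g in [main, *groups.values()] if g]
```

In this sketch a difference seen in only one read (for example a sequencing error) never links that read to another, so it stays in the original contig; only differences shared between reads cause a split. Raising `cutoff_per_kb` makes splitting less likely and lowering it makes splitting more likely; how this maps onto the stringency settings offered by WhETS is not specified here.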

EXAMPLE OUTPUT

To test that WhETS correctly separates homoeologues, we examined the well-characterized WAXY locus, which encodes granule-bound starch synthase I. The three homoeologous forms are all sequenced, as are several allelic variants of these. As only 2 715 wheat hq-cDNA sequences are available in total, WhETS is normally used with ESTs only. We therefore ran WhETS with the orthologous rice locus Os06g04200, setting the filter to use ESTs and wheat sequences only. Figure 2 shows the output and how the resulting contigs match the known homoeologues. From ESTs alone, WhETS correctly identifies the homoeologues and indicates the existence of a splice variant of the B homoeologue with a deletion in its 5′ UTR. Also shown (Figure 3) is the additional window detailing constituent sequences of one of the contigs.

Figure 2.

Output from WhETS for Os06g04200.1. The black line at the top corresponds to the rice gene CDS with intron position and size indicated as red vertical lines with triangles. Thin horizontal lines below this indicate the coverage of hits from the Triticeae sequences. The rows below show these hits, with blast HSPs for contigs and singlets shown as lines coloured according to percentage identity, and coordinates aligned to the rice template. The CAP3 step gives three contigs; one of these (contig 1) is then divided into five new contigs by the SNP analysis step (contigs 1.1, 1.2, etc.). The genome of origin (A, B, D) has been added to the screenshot according to 100% identity matches of the contig consensus to the known-homoeologue transcript sequences (exons of accessions AB019622, AB019623 and AB019624). Contigs 1.1 and 1.2 are not combined because of a lack of substantial overlap. Contigs 1.4 and 1.5 appear to be splice variants with an indel in the 5′ UTR. Contigs 2 and 3 appear quite different and may be paralogues.
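The genome labels in Figure 2 rest on 100% identity matches between a contig consensus and the known homoeologue transcripts. A minimal Python sketch of such a check is given below. It assumes the consensus lies entirely within the reference transcript and uses plain substring matching rather than an alignment; the function name `matching_genomes` and the toy sequences are hypothetical and are not the real exon sequences of the accessions named in the caption.

```python
def matching_genomes(consensus: str, references: dict[str, str]) -> list[str]:
    """Return the genome labels whose reference transcript contains the
    consensus with 100% identity (exact, case-insensitive substring)."""
    query = consensus.upper()
    return [genome for genome, seq in references.items() if query in seq.upper()]


# Toy reference sequences for illustration only.
references = {
    "A": "ATGGCGGCTCTGGTC",
    "B": "ATGGCGGCACTGGTC",
    "D": "ATGGCTGCTCTGGTC",
}
print(matching_genomes("GCGGCACTG", references))
```

A short consensus that covers only a region conserved between homoeologues would match more than one genome, which is why the function returns a list rather than a single label.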

Figure 3.

Display window showing details for a contig, which is opened by clicking on the contig 1.1 link in Figure 2.

CONCLUSION

WhETS is designed to be a practical tool for wheat biologists to rapidly get the best estimate of transcript sequence for a target gene, supplemented with information on tissue distribution and likely gene structure. It is particularly aimed at producing wheat sequences from which to design PCR primers for cDNA templates.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

13 in total

1. CAP3: A DNA sequence assembly program.

Authors: X Huang; A Madan
Journal: Genome Res Date: 1999-09 Impact factor: 9.043

2. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP.

Authors: Gary Barker; Jacqueline Batley; Helen O' Sullivan; Keith J Edwards; David Edwards
Journal: Bioinformatics Date: 2003-02-12 Impact factor: 6.937

3. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. Discrimination of homoeologous gene expression in hexaploid wheat by SNP analysis of contigs grouped from a large number of expressed sequence tags.

Authors: K Mochida; Y Yamazaki; Y Ogihara
Journal: Mol Genet Genomics Date: 2003-11-01 Impact factor: 3.291

5. Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Authors: B Ewing; L Hillier; M C Wendl; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

6. dbEST--database for "expressed sequence tags".

Authors: M S Boguski; T M Lowe; C M Tolstoshev
Journal: Nat Genet Date: 1993-08 Impact factor: 38.330

7. The map-based sequence of the rice genome.

Authors:
Journal: Nature Date: 2005-08-11 Impact factor: 49.962

8. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. SNPServer: a real-time SNP discovery tool.

Authors: David Savage; Jacqueline Batley; Tim Erwin; Erica Logan; Christopher G Love; Geraldine A C Lim; Emmanuel Mongin; Gary Barker; German C Spangenberg; David Edwards
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. The TIGR Rice Genome Annotation Resource: improvements and new features.

Authors: Shu Ouyang; Wei Zhu; John Hamilton; Haining Lin; Matthew Campbell; Kevin Childs; Françoise Thibaud-Nissen; Renae L Malek; Yuandan Lee; Li Zheng; Joshua Orvis; Brian Haas; Jennifer Wortman; C Robin Buell
Journal: Nucleic Acids Res Date: 2006-12-01 Impact factor: 16.971
