Literature DB >> 22334568

Comprehensive functional analyses of expressed sequence tags in common wheat (Triticum aestivum).

Alagu Manickavelu¹, Kanako Kawaura, Kazuko Oishi, Tadasu Shin-I, Yuji Kohara, Nabila Yahiaoui, Beat Keller, Reina Abe, Ayako Suzuki, Taishi Nagayama, Kentaro Yano, Yasunari Ogihara.

Abstract

About 1 million expressed sequence tag (EST) sequences comprising 125.3 Mb nucleotides were accreted from 51 cDNA libraries constructed from a variety of tissues and organs under a range of conditions, including abiotic stresses and pathogen challenges in common wheat (Triticum aestivum). Expressed sequence tags were assembled with stringent parameters after processing with inbuild scripts, resulting in 37,138 contigs and 215,199 singlets. In the assembled sequences, 10.6% presented no matches with existing sequences in public databases. Functional characterization of wheat unigenes by gene ontology annotation, mining transcription factors, full-length cDNA, and miRNA targeting sites were carried out. A bioinformatics strategy was developed to discover single-nucleotide polymorphisms (SNPs) within our large EST resource and reported the SNPs between and within (homoeologous) cultivars. Digital gene expression was performed to find the tissue-specific gene expression, and correspondence analysis was executed to identify common and specific gene expression by selecting four biotic stress-related libraries. The assembly and associated information cater a framework for future investigation in functional genomics.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
DNA, Complementary

Year: 2012 PMID： 22334568 PMCID： PMC3325080 DOI： 10.1093/dnares/dss001

Source DB: PubMed Journal: DNA Res ISSN： 1340-2838 Impact factor: 4.458

Introduction

Wheat provides 21% of food calories and 20% of protein to more than 4.5 billion people worldwide.[1] Demand for wheat in the developing world is projected to increase 60% by 2050. At the same time, climate change-induced temperature increases are estimated to reduce wheat production by 29%.[2] The advent of new molecular genetic technology and the dramatic increase in plant gene sequence data have provided opportunities to underpin wheat breeding programmes in order to improve yield, grain quality, and disease resistance.[3] Many of these technologies have been designed to facilitate detection and understanding of the alterations in gene expression that accompany differential development or that result from the perception of changes to the environment. Expressed sequence tag (EST) projects provide a very useful and quick means of accessing gene sequence and expression information. When combined with breakthroughs in highly parallel designs for gene expression analysis, large-scale EST projects now offer new perspectives for understanding the molecular basis of important traits in plants of agricultural relevance.[4] EST sequencing projects have been completed or are under way for many plant species. These projects have provided useful tools for intragenomic[5] and intergenomic[6] comparisons, gene discovery,[7-9] molecular marker identification,[10] microarray development,[11-15] and polyploid species genomic resource development.[16-18] As robot throughput increases and cost-per-read drops, determination of a sequence tag for a large proportion of genes is now reasonable using this random cDNA sequencing approach. For example, the availability of the complete genome sequence of Arabidopsis thaliana revealed that the 105 000 ESTs available at the end of 2000 were enough to tag 60% of the 25 500 genes.[19] The complete genome sequences of several plant species are known and the rate at which whole genomes are being sequenced is increasing. Correct annotation of these genomes remains problematic despite gene prediction algorithms becoming ever more sophisticated. While wheat genome sequencing is rapidly progressing (www.wheatgenome.org), a quicker and complementary approach to identifying a large number of wheat genes is EST and full-length cDNA sequencing. These resources will prove invaluable for annotating the genomes of wheat and other monocots and as substrates for transgenic improvement of crops. As in Arabidopsis and rice, these tools will prove to be critical in speeding up the genetic improvement of wheat. DNA markers constructed from ESTs are effective since they are contained within an exon region of genes that are actually expressed. Examination of DNA sequence databases permits a direct search for sequence polymorphisms and thus molecular markers. These polymorphisms are typically single-nucleotide polymorphisms (SNPs) or small insertions–deletions (indels). More importantly, the SNPs are identified in EST sequences, thus the polymorphisms can be used to directly map functional, expressed genes, rather than DNA sequences derived from conventional RAPD and AFLP techniques, which are typically not genes. This has led to studies on linkage disequilibrium in genes to better characterize associations between phenotype and genotype.[20-22] To identify an SNP from an EST database, the database must be composed of ESTs derived from different genotypes, followed by alignment of the same EST sequences from different genotypes.[23] SNP markers rely upon the underlying redundancy within EST collections and assume that distinct genotype of a plant genome will be represented within a collection. Earlier, we described extensive wheat EST resources including full-length sequences,[24-27] and the usage of wheat transcriptome analysis by making a custom microarray with ESTs.[28,29] Here, we extend our efforts and describe a collection of further ESTs and the complete functional analysis of the whole EST so far developed (∼1 million ESTs). Our work is based on a set of cDNA libraries established from 51 different tissues of interest varied from growth stages and biotic and abiotic stresses in 10 different cultivars.

Materials and methods

Plant materials and cDNA library construction

Eighteen libraries from various growth stages, 25 libraries from abiotic stresses (cold, drought, saline, and mineral toxic), and eight libraries from biotic stresses (leaf rust, powdery mildew, and blast) were constructed from eight different wheat lines (Table 1). Out of 51 libraries, 20 libraries were newly constructed and included for this study. Double-stranded cDNAs were synthesized as previously described.[24] cDNAs were ligated with pBlueScript SK(+) digested with EcoRI and XhoI. After transformation by electroporation, transformed bacterial cells were initially cultured in the SOC medium for 1 h before culture at 37°C for 2 h in 2× LB medium. Cultured cells were stored in 20% glycerol at −80°C until use. Transformed bacteria were randomly selected and plasmid DNA was extracted.[24] Inserted cDNAs were sequenced from both ends using dye terminator cycle sequencing (Applied Biosystems, Foster City, CA, USA).

Table 1.

List and characteristics of cDNA libraries

Library name	Genotype	Stage	Condition	No. of EST	Accession number
whcs	CS	Callus	GS	11 505	CJ518205–CJ523460, CJ627048–CJ632007
whr	CS	Root	GS	19 227	BJ277129–BJ287630
whs	CS	Seedling	GS	13 356	HX000001–HX010004
whdl	CS	Seedling crown	GS	12 761	BJ221844–BJ231912
whh	CS	Spike at heading	GS	20 648	BJ255495–BJ266779
whf	CS	Spike at flowering	GS	21 106	BJ243195–BJ255494
whoh	CS	Pistil at heading	GS	20 736	BJ266780–BJ277128
whpc	CS	Anther at meiosis	GS	11 016	CJ576197–CJ580898, CJ682880–CJ687382
whhg	CSMT4B^a	Anther at meiosis	GS	9669	CJ549536–CJ554132, CJ657247–CJ661657
whsh	CS	Young spikelet	GS	11 302	CJ730709–CJ736986
whyd	CS	Spikelet at late flowering	GS	14 708	BJ300204–BJ312233
whok	CS	DPA5	GS	12 159	CJ570869–CJ576196, CJ677747–CJ682879
whms	CSDT3DL^b	DPA5	GS	12 036	CJ565425–CJ570868, CJ672462–CJ677746
whe	CS	DPA10	GS	19 200	BJ231913–BJ243194
whdp	CS	DPA20	GS	13 455	CJ523461–CJ529179, CJ632008–CJ637596
whsl	CS	DPA30	GS	15 522	BJ287631–BJ300203
whsp	CS	Seedling	GS	12 783	HX247045–HX247474
whca	CS(Sp5A)^c	Seedling	GS	794	HX247475–HX257200
whkp	CSKmppd^d	Seedling	Grown under continuous light	12 950	CJ554133–CJ559953, CJ661658–CJ667190
whkv	CS	Seedling	Grown under continuous light after cold treatment	15 360	CJ559954–CJ565424, CJ667191–CJ672461
whem	Kitakei1354	Dormant seed	With water supply	11 671	CJ539482–CJ544724, CJ647722–CJ652633
whei	Kitakei1354	Dormant seed	With water supply after wounded	11 743	CJ534326–CJ539481, CJ642661–CJ647721
whsc	Kitakei1354	Shoot	Cold treatment after excision of grain part	13 079	CJ586310–CJ591776, CJ692607–CJ697797
whsd	Kitakei1354	Shoot	Dehydration	11 897	CJ591777–CJ596845, CJ697798–CJ702615
whrd	Kitakei1354	Root	Dehydration	12 436	CJ580899–CJ586309, CJ687383–CJ692606
whv3	Valuevskaya	Shoot	3 days cold condition	10 069	CJ601934–CJ606680, CJ707296–CJ711731
whv	Valuevskaya	Shoot	16 days cold condition	11 087	CJ596846–CJ601933, CJ702616–CJ707295
whva	Valuevskaya	Shoot	ABA treatment	10 631	CJ606681–CJ611586, CJ711732–CJ715867
whvd	Valuevskaya	Shoot	Five days dehydration	11 767	CJ611587–CJ616926, CJ715868–CJ720922
whvh	Valuevskaya	Shoot	Heat shock treatment	10 090	CJ616927–CJ621531, CJ720923–CJ725419
whvs	Valuevskaya	Liquid cultured cells	Liquid cultured cells	12 327	CJ621532–CJ627047, CJ725420–CJ730708
whrs6^e	CS	Root	Salt stress for 6 h	13 110	HX010005–HX019847
whss6^e	CS	Leaf	Salt stress for 6 h	13 312	HX019848–HX030180
whrs24^e	CS	Root	Salt stress for 24 h	12 949	HX030181–HX040252
whss24^e	CS	Leaf	Salt stress for 24 h	12 487	HX040253–HX050054
whatl^e	Atlas66	Root	No treatment	20 519	CJ773323–CJ797201
whatlal^e	Atlas66	Root	50 mM Al for 6 h	28 795	CJ822818–CJ848636
whsct^e	Scout66	Root	No treatment	27 717	CJ797202–CJ822817
whsctal^e	Scout66	Root	50 mM Al for 6 h	29 824	CJ848637–CJ872807
whhb^e	Halberd	Root	No treatment	22 488	HX124648–HX143755
whhbb^e	Halberd	Root	10mM boric acid for 24 h	22 522	HX143756–HX163093
whcr^e	Cranbrook	Root	No treatment	22 680	HX163094–HX182808
whcrb^e	Cranbrook	Root	10mM boric acid for 24 h	22 616	HX182809–HX201765
whthls^e	Thatcher	Seedling	Infected with leaf rust	30 307	CJ872808–CJ896490
whthkles^e	NILThatcher	Seedling	Infected with leaf rust	24 701	CJ896491–CJ919993
whchan^e	Chancellor	Seedling	Infected with powdry mildew	29 281	CJ919994–CJ944155
whchu^e	NILChancellor	Seedling	Infected with powdry mildew	28 799	CJ944156–CJ968175
whnr^e	Norin4	Seedling	No treatment	34 415	HX050055–HX071918
whnrpr48^e	Norin4	Seedling	Infected with blast strain Pr48 at 23°C for 4 days	23 987	HX071919–HX084716
whnrpr58r^e	Norin4	Seedling	Infected with blast strain Pr58 at 23°C for 4 days	36 360	HX084717–HX106894
whnrpr58s^e	Norin4	Seedling	Infected with blast strain Pr58 at 27°C for 4 days	30 797	HX106895–HX124647
			Total	894 756

CS, Chinese Spring; GS, growth stage; DPA, days to post-anthesis; NIL, near-isogenic lines.

aMono-telosomic 4BS of CS.

bDitelosomic 4BS of CS.

cSpelta5A chromosome substituted in CS.

dNear-isogenic line.

eNewly constructed libraries.

List and characteristics of cDNA libraries CS, Chinese Spring; GS, growth stage; DPA, days to post-anthesis; NIL, near-isogenic lines. aMono-telosomic 4BS of CS. bDitelosomic 4BS of CS. cSpelta5A chromosome substituted in CS. dNear-isogenic line. eNewly constructed libraries.

EST processing and assembly

The chromatogram files were base called and quality trimmed using PHRED[30] with default parameters. Vector, library linker-primer, and EcoRI adapter sequences were removed using CROSSMATCH. Repeat, ambiguous sequences (PHRED quality values <30) and poly (A) tails or poly (T) sequences (at most 10 bases) in the ESTs were trimmed. Subsequently, ESTs with sequences <30 bp were omitted from the final data set. The remaining high-quality sequence was used for further study. All sequence data are available from the DNA database of Japan (Table 1). The processed EST sequence files were combined and assembled into contigs using the CAP3 program[31] with a high and low stringency level (high 95% homology in a 20 bp overlap; low 80% homology in a 15 bp overlap). Default CAP3 settings include -p 90 -h 20; the custom parameter settings used were -p 85 -h 90. The CAP3 -p option specifies overlap per cent identity cut-off, while the -h option specifies the maximum alignment overhang percentage.

2.2.1. Sequence annotation

Using the BLAST program (BLASTX with a search threshold of 1e−5), the sequences of the contigs were searched against seven databases (NCBI's nr; http://www.ncbi.nlm.nih.gov/genbank, Uniprot; http://www.uniprot.org, RAP-DB; http://rapdb.dna.affrc.go.jp, RGAP; http://rice.plantbiology.msu.edu, Tair9; http://www.arabidopsis.org, MaizeGDB; http://www.maizegdb.org, and Brachypodium database; http://db.brachypodium.org). According to Ewing et al.,[7] only contigs were taken for further analyses. The gene ontology (GO) terms[32] of each contig was derived by InterProScan.[33] The GO terms were then converted into GO slim term using EBI website (http://www.ebi.ac.uk/QuickGO/) by written perl-script for this purpose. Open reading frames (ORFs) were searched by translating sequences into amino acids by six frames (three per frame in the plus and minus strands).

Transcription factor

The PlnTFDB[34] containing 29 473 sequences of plant genes involved in transcriptional control was used to mine our data by local BLAST. The default parameters mentioned in the database was used for prediction (Filter ‘on’, gapped alignment ‘on’, substitution matrix ‘blosum 62’, E-value ≤1e−10). We included two meta-rules in our classification scheme: (i) if a protein harbours domains characteristic of a transcription factor (TF) family and a transcriptional regulators (TR) family, we assigned it to the TF family, (ii) when the protein of interest contains domains characteristic of more than one TF family or more than one TR family, it was assigned to the family to which its characteristic domains matched with the lowest E-value.

Full-length cDNA

The contigs were classified as full length if it aligned with our full-length cDNA data.[27] BLASTN searches of the contigs against the full-length sequences yielded a candidate hit list (E-value ≤1e−100) of putative full-length sequences that either covered the start and stop codon of the subject sequence or possessed sufficient sequence up/down-stream of the match to contain putative start and stop signals. In a few instances, some contigs covered all but the start methionine, and were also included as full-length sequences. In addition, contigs were aligned (E-value <1e−5, ≥98% query coverage, ≥98% identity) with barley full-length sequences[35] to know the similarity as well as the full-length nature.

miRNA analysis

To identify conserved miRNA in wheat, contigs were annotated with the plant small RNA regulator target analysis database (http://plantgrn.noble.org/psRNATarget/) containing small RNA of 15 plant species including wheat.[36] This database contained 2192 published miRNA sequences, including 32 from Triticum aestivum, 148 from sorghum, 496 from Oryza sativa, 319 from maize, and 224 from A. thaliana. Potential targets were predicted according to the rules applied by[37,38]: (i) the number of allowed mis-matches at complementary sites between miRNA sequences and potential mRNA targets is four or fewer; and (ii) no gaps are allowed at the complementary sites.

SNP discovery

Sequence variants or SNPs were mined in wheat contigs with two criteria, and perl-scripts were written for each category. In the first criterion, only contigs with ≥4 ESTs were selected, and SNPs were declared only when there was no mismatch, no gaps, or N's were admitted before and after an SNP site; in addition, the alternative base to the consensus sequence was present at least more than twice in an alignment. In the second criterion, the SNPs were mined only in the significant sequence of contigs that was worked out by counting the nucleotides of either end of the contigs containing a minimum of four EST members. To find the SNP between cultivars, in addition to the above parameters, the contigs containing the minimum of two consistent EST from the same cultivar were selected. For homoeologous SNPs, the contigs containing the minimum of four EST from the same cultivar were chosen. The visual inspection of SNP was carried out using Tablet software.[39]

Digital gene expression and correspondence analysis

For statistical analysis of gene expression profiles, contigs harbouring five or more constituents were selected from 37 138 contigs. Similarities between contigs or libraries were estimated using Pearson's correlation coefficient.[40] Hierarchical clustering was applied to compare EST expression profiles among the 51 wheat tissue/treatments and libraries. Expression profiles are displayed based on the number of constituents in a contig (from 0 to 4647; red intensity), along with an increasing number of constituents. Contigs specific to DREB (dehydration-responsive element binding), NAC (nitrogen assimilation control), OMT (O-methyl transferase), and miRNA 172 were selected to show the differential gene expression in cultivars and growth stage. Correspondence analysis (CA) was carried out by selecting four disease-related libraries (whthls, whthkles, whchan, and whchul; Table 1) as per the procedure detailed in Hamada et al.[41] and visualized by a custom-build viewer (available based on request).

Results

cDNA library construction and assembly

Our previous studies carried out the construction and analysis of 31 libraries.[26] Here, we have reported the further addition of 20 libraries and their combined analyses for comprehensive view of wheat EST. At the maximum, we have accumulated ∼1 million ESTs (Table 1). The libraries were generated from developmental stages and stresses. After trimming low-quality bases, vector sequences, and shortness (<30 bp), 0.68 million ESTs were used for CAP3 assembly under stringent conditions resulting in 37 138 contigs and 215 199 singlets. When assembled using relaxed settings in CAP3, 65 426 contigs and 66 875 singlets were obtained. The high stringent condition was chosen to achieve a more complete isolation of individual paralogues, orthologues, and homoeologues compared with using a low stringent condition. The total sequence of transcript assemblies in stringent parameter settings, containing both the singlets and contigs, developed in this study was 125.3 Mb with the GC% of 51.9%. This is the maximum transcriptome sequences developed in any plant species. The GC% value is similar to rice but less than that reported in the wheat 3B chromosome exon coding sequence.[42] The length of the singlets varied from 31 to 884 bp with an average of 430 bp. The maximum singlets were grouped under 500–600 bp lengths (Fig. 1). As we found a large number of singlets, we subsequently discriminated the singlet contribution among the 10 cultivars (Fig. 1B). There was no correlation observed between the number of ESTs and the singlets. However, the stress-related libraries from four cultivars contributed to 55% of the total singlets. The contig length ranged from 46 to 3960 bp with an average of 879 bp, and ∼70% of the contigs extended from 501 to 1000 bp (Fig. 2A). The number of ESTs grouped in each contig varied between 2 and 4647, with 78% of the contigs containing 2–10 EST members (Fig. 2B).

Figure 1.

Figure 2.

Distribution of contig length and their EST member constitution. (A) Sequence length frequency of contigs. (B) Number of EST members in contigs.

Analysis of singlet sequence length and their genotype-wise distribution. (A) Sequence length distribution of singlet. (B) Genotype-wise frequency (%) of singlet (AT, Atlas; SC, Scout; CC, Chancellor; TC, Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese Spring; KT, Kitakei1354; NR, Norin4; VV, Valuevskaya). Distribution of contig length and their EST member constitution. (A) Sequence length frequency of contigs. (B) Number of EST members in contigs.

Functional annotation

The contig resulted from stringent parameter assembly was used for functional annotation. The function of each contig was derived after annotation with rice, Arabidopsis, maize, and Brachypodium databases, in addition to the protein sequences in the GenBank nr database (BLASTX; E-value <1e−5). The recently sequenced Brachypodium was included for its close originated relation with wheat. On annotation, maximum similarity was observed in Brachypodium followed by rice (Table 2). In rice, further annotation was carried out to find the chromosome-wise sequence similarity and identified that chromosome 1 is having much co-linearity followed by chromosome 3 (Fig. 3). On overall annotation, ∼3500 genes were found to have no similarity, suggesting new genes in our data. To further validate these new genes, updated tentative consensus sequences from the DFCI wheat gene index (http://compbio.dfci.harvard.edu/cgi-bin/tgi/gimain.pl?gudb=wheat) were annotated and resulted in the same number of new genes (∼3500), confirming the importance of our new EST assembly and analysis in wheat. To estimate the total number of full-length cDNAs in our collection, we searched our contig data against our 11 902 full-length cDNA data. With the stringent criteria of >95% similarity and the expected cut-off value of <1e−100, we found ∼7000 contigs that were full length in nature, further validated by identification of high similarity of wheat contigs with barley full-length sequences, indicating the robustness of our data and their applications to wheat functional genomics.

Table 2.

Annotation of wheat contigs

Database and Species	URL	No. and percentage of similarity
RAP-DB (build 5) (Rice)	http://rapdb.dna.affrc.go.jp	32.405 (87.26%)
RGAP (Rice)	http://rice.plantbiology.msu.edu	32.430 (87.32%)
TAIR9 (Arabidopsis)	http://www.arabidopsis.org	30.504 (82.14%)
MaizeGDB	http://www.maizegdb.org/	31.206 (84.03%)
Brachypodium Database	http://db.brachypodium.org	32.522 (87.57%)
All databases		33.909 (91.31%)
Total contigs		37.138

The updated (until May 2011) sequence was retrieved and similarity search was carried out.

Figure 3.

Sequence similarity of wheat contigs with rice genome. Based on the result, the contig was grouped in rice chromosome wise.

Annotation of wheat contigs The updated (until May 2011) sequence was retrieved and similarity search was carried out. Sequence similarity of wheat contigs with rice genome. Based on the result, the contig was grouped in rice chromosome wise. We also analysed length and ORF distribution in the contigs for both plus and minus strands. To obtain meaningful results, only contigs with >5 ESTs were selected and ORFs were identified. GO annotation of the wheat contigs was performed on the basis of ORF mining of the data. The GO terms were organized into three categories representing molecular functions, biological processes, and cellular components.[32] The sum of the wheat contigs per category did not add up to 100% as some contigs were classified into more than one category. Of the total contig set, 21 125 (56%) were annotated into the molecular function category (describing the biochemical activity performed by the gene product), 13 354 (36%) into the biological process GO category (describing the ordered assembly of more than one molecular function), and 13 356 (36%) into the cellular component GO category (describing subcellular compartments of a cell) (Fig. 4). Among the molecular function, the most highly represented categories were binding, catalytic activity, redox activity, and structural activity (Fig. 4A). Among the biological processes, the largest proportion of functionally assigned contigs fell into metabolic, transport, and translation processes, while redox activity, biosynthetic process, regulation, phosphorylation, and transcription comprised 34% of the contigs (Fig. 4B). For the cell component category, almost all contig sequences were annotated into the cell–cell subcategory, 28% into the membrane category, and 23% into the intracellular category (Fig. 4C). Together, all three GO categories accounted for ∼82% of the assigned wheat contig set.

Figure 4.

Functional classification of contig sequences based on GO categorization. Sequences were evaluated for their predicted involvement in molecular function, biological process, and cellular component.

Functional classification of contig sequences based on GO categorization. Sequences were evaluated for their predicted involvement in molecular function, biological process, and cellular component. The role and importance of TF lead to mining of our data and resulted in 1183 contigs containing either single or multiple transcription factors. Among the TFs, the CCAAT family was found in as many as 69 contigs. miRNA target sequence analysis of the wheat transcriptome identified different miRNA target sequences in 5180 contigs which ranged from 19 to 24 nt long. The majority of the small RNAs are 20–24 nt long, which is a typical range for dicer-derived products; the 21-nt class is predominant. Among species-specific miRNA, rice had maximum homology followed by maize and Medicago truncatula (Fig. 5). The number of hits for each species is roughly proportional to the number of sequences for that species in the database. Due to the limited number of wheat miRNA sequences in the database, there was only 200 contigs with wheat-specific miRNA target sequences. Among miRNAs, miRNA 395, 172, and 164 target sequences alone were found in 831 contigs, showing the relative abundance of these miRNA target sequences in wheat.

Figure 5.

miRNA target sequence analysis in wheat contigs. The database bars indicate the available miRNA in the database and the hit bars indicate the number of wheat genes having miRNA target sequence. tae, Triticum; sbi, Sorghum bicolor; osa, Oryza sativa; zma, Zea mays; ath, Arabidopsis thaliana; mtr, Medicago truncatula; ghr, Gossibium hirsutum; ptc, Populus trichocorpa; bna, Brassica napus; gma, Glycine max; pta, Pinus taeda; sly, Solanum lycopersicum; bra, Brassica rapa; bol, Brassica oleraceae; cre, Chlamydomonas reinhardtii.

Sequence polymorphism/SNP mining

The SNPs were mined in our large-scale wheat transcriptome data by applying both relax and stringent criteria. For both criteria, to find reliable SNPs, the common conditions of SNP should be present at a given position when there is no mismatch present 2 bp before or after the SNP site. A total of 51 067 SNPs were detected from 20 609 contigs using the first criterion, resulting in the identification of an SNP site every 96 bp. This value is considerably higher than those reported for earlier studies of wheat.[23,43] Hence, the second criterion was applied with an interest to differentiate the homoeologous SNPs from intergenome SNP by calculating the significant sequence length of each contig before mining the SNPs. This approach avoided finding SNPs on either end of the contigs and resulted in the identification of only 6352 SNPs in the wheat contigs. Further classification of the SNPs present between cultivars found 3515 SNPs with a frequency of one SNP per 614 bp. As there were genome constituents of wheat (three homologous genomes) and selective gene expression among the three genomes, we examined the SNP within each cultivar and found 2837 SNPs with an SNP site every 470 bp. The overall SNP frequency based on the stringent criteria was one SNP per 483 bp. Transitions (70%) were more frequent than transversions (30%). As expected, a significant positive correlation (P < 0.05) was observed between the number of SNPs detected in a contig and the number of reads present in that contig. The cultivars, except Chinese spring and Norin4, contain more inter-cultivar sequence variation than homoeologous SNPs. Among cultivars, Halberd, Valuevskaya, and Cranbrook contain almost the same number of SNPs in both cases. Cultivars Atlas, Kitakei1354, Scout, and Thatcher had much less homoeologous sequence variation than cultivar differences (Supplementary Fig. S1).

Digital gene expression

EST frequencies approximate the message abundance in the mRNA population used to construct a cDNA library. We have already attempted to make tissue expression maps of a large number of ESTs from stress-related libraries for in silico screening of stress responsive genes in wheat.[26] Here, we aimed to determine the global gene expression of wheat from 51 cDNA libraries, including growth stages and biotic and abiotic stresses. Contigs containing >5 ESTs were subjected to a correlated clustering analysis[7] to compare expression profiles in the different libraries. When the result was displayed in the form of a dendrogram (Fig. 6), many libraries with similar origins agglomerated together. All four libraries derived from root tissues treated with boric acid and aluminium united. In the same manner, biotic stress (leaf rust, powdery mildew, and blast)-related tissues were grouped in the same clade. The tissues collected from cv. Valucvskaya, which mainly undergo abiotic stress, were clustered separately.

Figure 6.

Correlated clustering of wheat cDNA libraries based on gene expression (AT, Atlas; SC, Scout; CC, Chancellor; TC, Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese Spring; KT, Kitakei1354; NR, Norin4; VV, Valuevskaya). To further determine the tissue-specific gene expression of select genes, digital gene expression was carried out for DREB and NAC TFs, OMT gene, and miRNA 172 targeting site genes in wheat transcripts (Supplementary Fig. S2). The DREB genes are expressed mainly in dehydration-related tissue libraries and the spatial expression was mostly at root tissues. In the case of NAC TF genes, as expected, the expression was only noted in salt-treated libraries and the expression level was greater in root tissue followed by shoot, spikelet, and seed. Interestingly, some biotic stress-related libraries also expressed NAC TF. Based on the preliminary result obtained from our previous study on OMT against self-defence in wheat,[29] the OMT-related contigs were mined and its gene expression was analysed. We found ubiquitous expression of OMT genes irrespective of stress, suggesting its defence role against stress. While mining of miRNA in wheat transcriptome, we found an abundance of miRNA 172 target sites. The digital gene expression analysis showed its abundance in all tissues and under all treatments in wheat. In addition to digital expression, a new method of gene discovery and/or gene expression based on CA was carried out to determine specific and common gene expression between libraries or treatments. To identify the common molecular plant–athogen interactions, four libraries constructed for leaf rust and powdery mildew diseases were selected and examined by CA. Contigs with more than or equal to four ESTs were selected, to identify specific genes for leaf rust and powdery mildew diseases, in addition to common disease resistance- and susceptibility-related genes (Supplementary Table S1). The number of genes expressed for powdery mildew outplayed the leaf rust disease. However, the number of disease susceptibility-related genes was more in leaf rust, suggesting different disease reaction mechanisms among diseases in wheat. Overall around 100 new genes were identified from these four disease-related libraries, which could have immense value for future research of molecular plant–pathogen interactions.

Discussion

Global wheat transcriptome analysis was carried out by accumulating ∼1 million ESTs from 51 cDNA libraries. In comparison with other studies of wheat which were biased towards one stress or growth stage of a few cultivars,[3,26] here we used all growth stages, and biotic and abiotic stresses for 10 different cultivars. The work flow of EST assembly and further analyses were summarized in Fig. 7. Many perl-script programs were written specifically for processing the ESTs, resulting in a 24% reduction in total ESTs. The stringent parameter in EST assembly resulted in more singlets compared with contigs which helped for further analysis, i.e. SNP mining.[23] The average length of the contigs (879 bp) is higher than other studies,[3,23,44] and ∼80% of the contigs containing 2–10 EST members had homoeologous or paralogous genes. To determine the EST member contribution to the contig, we further classified the data into stress- and growth stage-related parameters (Supplementary Fig. S3). Among the 20 stress-related libraries, biotic stress-related libraries contributed more ESTs to the contigs than libraries for abiotic stress (Supplementary Fig. S3A). Among the growth stages, libraries of the spikelet at late flowering contributed more than those of other stages, suggesting differential gene expression among the cultivars with or without any stress (Supplementary Fig. S3B). The high number of unigenes and GC% also suggests the possibility of more genes present in wheat than in other crops.[45] This is supported by the recent study of the megabase level sequencing in 3B, which estimated 50 000 genes per diploid genome as a result of the additional non-collinear genes interspersed within the highly conserved ancestral grass gene backbone, suggesting accelerated evolution in the Triticeae lineages.[44] The presence of additional genes was further confirmed by identification of new genes based on functional annotation (Fig. 3). When putative wheat gene sequences were analysed for ORF length based on their hit status, we observed significantly shorter ORFs in sequences with no hits. These results suggest that ORF length, not sequence length, is a better indicator of finding transcripts with protein coding capacity and subsequently getting a hit in a sequence database. On the other hand, more than one-third of the sequences without a hit still contained an ORF >300 bp, suggesting that sequences without a hit but with a relatively long ORF may represent new genes with protein coding capacity. We confirmed this by finding more full-length cDNA sequences. Our results also showed a higher no hit percentage in singlet sequences, most likely due to the fact that singletons represent rare genes in the wheat genome that are not well described in other organisms.

Figure 7.

Schematic diagram explaining the comprehensive EST analysis. The software used in the respective step was mentioned in parallel.

Functional characterization

GO analysis revealed expected categories such as molecular, biological, and cellular processes (Fig. 5). In wheat, the major molecular processes were binding and catalytic activities, similarly found in other Poaceae species.[44,46,47] Metabolic, transport, and translation functions accounted for 50% of the biological processes. Among cellular processes, as expected, membrane, intracellular, and ribosome functions played a major role. Using data from the plant TF database,[34] we found 1183 contigs from wheat that had a high similarity with 2197 different coding sequences of TF from seven species; the high similarity percentage was found with rice. The number of TF found in wheat contigs was higher than reported in Salvia sclorea calyx which was sequenced using 454 pyrosequncing.[48] The most represented TF family in wheat was CCAAT. The CCAAT box is a common cis-acting element found in the promoter and enhancer regions of a large number of genes in higher eukaryotes (for review).[49,50] In addition to this TF family, several other TF families known to be involved in plant development were also present in our data. The role of miRNAs in developmental and stress regulation is not well established in wheat, and increasingly tissue-specific and developmental regulation of miRNAs is being found mostly in animal species.[51] Through cDNA sequencing efforts, we have identified transcripts that encode 945 different miRNAs,[52] although additional wheat-specific miRNAs may still remain in the cDNA collection. Indeed, a large number of ncRNAs have the potential to form miRNA-like stem loop precursors (data not shown), but experimental validation of these potential miRNA is required. In our study, we found an abundance of miRNA 172 target sites in as many as 236 contigs, and their uniform expression irrespective of growth stage and/or stress was confirmed by in silico gene expression analysis (Supplementary Fig. S2D). The digital gene expression pattern of genes related to biotic and abiotic stresses (OMT, DREB, and NAC TFs), and epigenetic gene silencing has helped us to determine their quantitative and qualitative gene expression patterns.[7] This approach permits both the association of tissues via their common patterns of gene expression and the association of genes via their tissue-dependent expression patterns. The correlation clustering of 51 libraries formed the groups of the libraries based on respective treatments or stages which confirmed the importance of libraries as well gene expression specific to treatment.[26] CA is an explorative computational method for the study of associations between variables. Much like principle component analysis, it displays a low dimensional projection of the data, e.g. into the plane with three-dimensional view (Supplementary Fig. S4), which can be achieved for two to three variables simultaneously, thus revealing associations between them. Traditionally, CA has been used prevalently in categorical data in the social sciences, but its application has been extended also to physical quantities and to proteomics.[53] This method allows us to quickly analyse the set of EST libraries and to discover molecular pathogenicity on wheat. In four disease-related libraries, ribulose 1–5 biphosphate and S-adenosyl methionine genes were commonly found. In leaf rust, UDP-glucosyl transferase and chlorophyll a–b binding protein were highly expressed,[29] while in powdery mildew treatment, ADP-ribosylation factor, lipid transfer protein, and oxalate oxidase were specifically expressed. The advantage of CA analysis helped to find the common molecular resistance mechanism in wheat.[54] While comparing susceptible and resistance reaction mechanisms, some of the genes, such as 40S, 60S ribosomal protein and Zinc finger domain-containing protein genes, have copy number variations between the two mechanisms. We have shown that the application of CA to EST data provides an informative and concise means of visualizing these data, being capable of uncovering relationships both among either gene and between genes, in particular or common stages.

SNP mining

Assembly of EST sequences into contigs in a polyploidy species like hexaploid wheat results in each contig being composed of ESTs from homoeologous loci and members of gene families.[23] SNPs are the most abundantly found co-dominant polymorphic sites in greater proportion both in intronic and in exonic regions of the genome. They occur with variable frequencies and have become very popular in plant genetics and breeding due to their amenability for high throughput genotyping. In continuation of our earlier study to discriminate homoeologous gene expression of hexaploid wheat by SNP analysis of contigs grouped from 10 cDNA libraries from Chinese Spring,[43] we have mined our large-scale data to find SNPs from 10 different cultivars with various stress treatments. With relaxed criteria, we estimated ∼50 000 SNPs in wheat with an SNP frequency of one SNP per 96 bp—a number that is higher than our previous report.[43] This high number might be due to the EST originating from 10 different cultivars with various stress conditions, although the possibilities of over-estimation from the end sequences could not be excluded. Hence, a new criterion was applied by calculating the significant sequence length to avoid the end sequence and sequencing error-based SNPs. This approach accounted for only 20% of the SNPs obtained by the initial approach, and could lead to an underestimation of nucleotide diversity, although it guarantees the elimination of false positives. The stringent parameter resulted in the SNP frequency of one SNP in every 483 bp. In comparison of SNPs among huge genome size species, coffee has one SNP every 222 bp;[55] sugarcane has one SNP every 290 bp;[16] cotton has one SNP every 500 bp;[17] oak has a frequency similar to cotton with one SNP every 471 bp.[18] Our result in wheat compared with other species could explain the low polymorphism found in polyploidy species. The comparison of SNPs found between and within cultivars showed higher SNP frequency between cultivars (Supplementary Fig. S1), confirming that the low level of polymorphism identified between homoeologous genomes compared with inter-genome differences could be useful to select parents for linkage mapping studies.

Conclusion

The global wheat EST assembly presented here provides an unprecedented look at the wheat transcriptome and contributes tools for wheat genetics and genomics effort. The development and inclusion of cDNA libraries from all growth stages, various tissues, and treatments portray the complete picture of wheat transcriptome. Functional annotation and characterization give new ideas about wheat expressed genome, at least in part. The identified SNPs are invaluable resources for functional genomics and molecular breeding application. This set of processed EST sequences provides a seed for future investigation of wheat functional genomics using both long and short oligonucleotide arrays. Our data will thus act as a backbone for wheat genome sequence assembly, which is progressing rapidly.

Supplementary Data

Supplementary Data are available at www.dnaresearch.oxfordjournals.org.

Funding

This work was supported by Grants-in-Aid for Scientific Research on priority areas ‘Comparative Genomics’ and the National Bio-resource Project from the Ministry of Education, Culture, Sports, Science and Technology of Japan. This is contribution No. 1007 from the Kihara Institute for Biological Research, Yokohama City University.

53 in total

1. Structure of linkage disequilibrium and phenotypic associations in the maize genome.

Authors: D L Remington; J M Thornsberry; Y Matsuoka; L M Wilson; S R Whitt; J Doebley; S Kresovich; M M Goodman; E S Buckler
Journal: Proc Natl Acad Sci U S A Date: 2001-09-18 Impact factor: 11.205

2. Discrimination of homoeologous gene expression in hexaploid wheat by SNP analysis of contigs grouped from a large number of expressed sequence tags.

Authors: K Mochida; Y Yamazaki; Y Ogihara
Journal: Mol Genet Genomics Date: 2003-11-01 Impact factor: 3.291

3. Transcriptome analysis of salinity stress responses in common wheat using a 22k oligo-DNA microarray.

Authors: Kanako Kawaura; Keiichi Mochida; Yukiko Yamazaki; Yasunari Ogihara
Journal: Funct Integr Genomics Date: 2005-11-19 Impact factor: 3.410

4. Construction and evaluation of cDNA libraries for large-scale expressed sequence tag sequencing in wheat (Triticum aestivum L.).

Authors: D Zhang; D W Choi; S Wanamaker; R D Fenton; A Chin; M Malatrasi; Y Turuspekov; H Walia; E D Akhunov; P Kianian; C Otto; K Simons; K R Deal; V Echenique; B Stamova; K Ross; G E Butler; L Strader; S D Verhey; R Johnson; S Altenbach; K Kothari; C Tanaka; M M Shah; D Laudencia-Chingcuanco; P Han; R E Miller; C C Crossman; S Chao; G R Lazo; N Klueva; J P Gustafson; S F Kianian; J Dubcovsky; M K Walker-Simmons; K S Gill; J Dvorák; O D Anderson; M E Sorrells; P E McGuire; C O Qualset; H T Nguyen; T J Close
Journal: Genetics Date: 2004-10 Impact factor: 4.562

5. Gene expression profiles during the initial phase of salt stress in rice.

Authors: S Kawasaki; C Borchert; M Deyholos; H Wang; S Brazille; K Kawai; D Galbraith; H J Bohnert
Journal: Plant Cell Date: 2001-04 Impact factor: 11.277

6. Tablet--next generation sequence assembly visualization.

Authors: Iain Milne; Micha Bayer; Linda Cardle; Paul Shaw; Gordon Stephen; Frank Wright; David Marshall
Journal: Bioinformatics Date: 2009-12-04 Impact factor: 6.937

7. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

8. Insights into corn genes derived from large-scale cDNA sequencing.

Authors: Nickolai N Alexandrov; Vyacheslav V Brover; Stanislav Freidin; Maxim E Troukhan; Tatiana V Tatarinova; Hongyu Zhang; Timothy J Swaller; Yu-Ping Lu; John Bouck; Richard B Flavell; Kenneth A Feldmann
Journal: Plant Mol Biol Date: 2008-10-21 Impact factor: 4.076

9. A global assembly of cotton ESTs.

Authors: Joshua A Udall; Jordan M Swanson; Karl Haller; Ryan A Rapp; Michael E Sparks; Jamie Hatfield; Yeisoo Yu; Yingru Wu; Caitriona Dowd; Aladdin B Arpat; Brad A Sickler; Thea A Wilkins; Jin Ying Guo; Xiao Ya Chen; Jodi Scheffler; Earl Taliercio; Ricky Turley; Helen McFadden; Paxton Payton; Natalya Klueva; Randell Allen; Deshui Zhang; Candace Haigler; Curtis Wilkerson; Jinfeng Suo; Stefan R Schulze; Margaret L Pierce; Margaret Essenberg; Hyeran Kim; Danny J Llewellyn; Elizabeth S Dennis; David Kudrna; Rod Wing; Andrew H Paterson; Cari Soderlund; Jonathan F Wendel
Journal: Genome Res Date: 2006-02-14 Impact factor: 9.043

10. OryzaExpress: an integrated database of gene expression networks and omics annotations in rice.

Authors: Kazuki Hamada; Kohei Hongo; Keita Suwabe; Akifumi Shimizu; Taishi Nagayama; Reina Abe; Shunsuke Kikuchi; Naoki Yamamoto; Takaaki Fujii; Koji Yokoyama; Hiroko Tsuchida; Kazumi Sano; Takako Mochizuki; Nobuhiko Oki; Youko Horiuchi; Masahiro Fujita; Masao Watanabe; Makoto Matsuoka; Nori Kurata; Kentaro Yano
Journal: Plant Cell Physiol Date: 2010-12-23 Impact factor: 4.927

12 in total

1. Genome-wide fungal stress responsive miRNA expression in wheat.

Authors: Behçet Inal; Mine Türktaş; Hakan Eren; Emre Ilhan; Sezer Okay; Mehmet Atak; Mustafa Erayman; Turgay Unver
Journal: Planta Date: 2014-08-26 Impact factor: 4.116

2. Transcriptome-wide analysis of WRKY transcription factors in wheat and their leaf rust responsive expression profiling.

Authors: Lopamudra Satapathy; Dharmendra Singh; Prashant Ranjan; Dhananjay Kumar; Manish Kumar; Kumble Vinod Prabhu; Kunal Mukhopadhyay
Journal: Mol Genet Genomics Date: 2014-08-07 Impact factor: 3.291

Review 3. RNA-Seq-based DNA marker analysis of the genetics and molecular evolution of Triticeae species.

Authors: Kazuhiro Sato; Kentaro Yoshida; Shigeo Takumi
Journal: Funct Integr Genomics Date: 2021-08-18 Impact factor: 3.410

4. Characteristics and Regulating Roles of Wheat TaHsfA2-13 in Abiotic Stresses.

Authors: Xiangzhao Meng; Baihui Zhao; Mingyue Li; Ran Liu; Qianqian Ren; Guoliang Li; Xiulin Guo
Journal: Front Plant Sci Date: 2022-06-27 Impact factor: 6.627

5. Editorial: Drought Threat: Responses and Molecular-Genetic Mechanisms of Adaptation and Tolerance in Wheat.

Authors: Dev Mani Pandey; Yin-Gang Hu; Yuri Shavrukov; Narendra Kumar Gupta
Journal: Front Plant Sci Date: 2022-06-30 Impact factor: 6.627

6. Transcriptome-wide identification of bread wheat WRKY transcription factors in response to drought stress.

Authors: Sezer Okay; Ebru Derelli; Turgay Unver
Journal: Mol Genet Genomics Date: 2014-04-19 Impact factor: 3.291

7. Co-expression and co-responses: within and beyond transcription.

Authors: Takayuki Tohge; Alisdair R Fernie
Journal: Front Plant Sci Date: 2012-11-08 Impact factor: 5.753

8. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family.

Authors: Keng-Hoong Ng; Chin-Kuan Ho; Somnuk Phon-Amnuaisuk
Journal: PLoS One Date: 2012-10-11 Impact factor: 3.240

9. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts.

Authors: Jonathan Vincent; Zhanwu Dai; Catherine Ravel; Frédéric Choulet; Said Mouzeyar; M Fouad Bouzidi; Marie Agier; Pierre Martre
Journal: Database (Oxford) Date: 2013-05-09 Impact factor: 3.451

10. De Novo Assembled Wheat Transcriptomes Delineate Differentially Expressed Host Genes in Response to Leaf Rust Infection.

Authors: Saket Chandra; Dharmendra Singh; Jyoti Pathak; Supriya Kumari; Manish Kumar; Raju Poddar; Harindra Singh Balyan; Puspendra Kumar Gupta; Kumble Vinod Prabhu; Kunal Mukhopadhyay
Journal: PLoS One Date: 2016-02-03 Impact factor: 3.240