
Comprehensive functional analyses of expressed sequence tags in common wheat (Triticum aestivum).

Alagu Manickavelu, Kanako Kawaura, Kazuko Oishi, Tadasu Shin-I, Yuji Kohara, Nabila Yahiaoui, Beat Keller, Reina Abe, Ayako Suzuki, Taishi Nagayama, Kentaro Yano, Yasunari Ogihara.

Abstract

About 1 million expressed sequence tag (EST) sequences comprising 125.3 Mb nucleotides were accumulated from 51 cDNA libraries constructed from a variety of tissues and organs under a range of conditions, including abiotic stresses and pathogen challenges, in common wheat (Triticum aestivum). The ESTs were assembled with stringent parameters after processing with in-house scripts, resulting in 37,138 contigs and 215,199 singlets. Of the assembled sequences, 10.6% had no matches with existing sequences in public databases. Functional characterization of the wheat unigenes was carried out by gene ontology annotation and by mining transcription factors, full-length cDNAs, and miRNA target sites. A bioinformatics strategy was developed to discover single-nucleotide polymorphisms (SNPs) within our large EST resource, and SNPs between cultivars and within cultivars (homoeologous) are reported. Digital gene expression analysis was performed to find tissue-specific gene expression, and correspondence analysis of four biotic stress-related libraries was used to identify common and specific gene expression. The assembly and associated information provide a framework for future investigation in functional genomics.


Year:  2012        PMID: 22334568      PMCID: PMC3325080          DOI: 10.1093/dnares/dss001

Source DB:  PubMed          Journal:  DNA Res        ISSN: 1340-2838            Impact factor:   4.458


Introduction

Wheat provides 21% of food calories and 20% of protein to more than 4.5 billion people worldwide.[1] Demand for wheat in the developing world is projected to increase 60% by 2050. At the same time, climate change-induced temperature increases are estimated to reduce wheat production by 29%.[2] The advent of new molecular genetic technology and the dramatic increase in plant gene sequence data have provided opportunities to underpin wheat breeding programmes in order to improve yield, grain quality, and disease resistance.[3] Many of these technologies have been designed to facilitate detection and understanding of the alterations in gene expression that accompany differential development or that result from the perception of changes to the environment. Expressed sequence tag (EST) projects provide a very useful and quick means of accessing gene sequence and expression information. When combined with breakthroughs in highly parallel designs for gene expression analysis, large-scale EST projects now offer new perspectives for understanding the molecular basis of important traits in plants of agricultural relevance.[4] EST sequencing projects have been completed or are under way for many plant species. These projects have provided useful tools for intragenomic[5] and intergenomic[6] comparisons, gene discovery,[7-9] molecular marker identification,[10] microarray development,[11-15] and polyploid species genomic resource development.[16-18] As robot throughput increases and cost-per-read drops, determination of a sequence tag for a large proportion of genes is now reasonable using this random cDNA sequencing approach. For example, the availability of the complete genome sequence of Arabidopsis thaliana revealed that the 105 000 ESTs available at the end of 2000 were enough to tag 60% of the 25 500 genes.[19] The complete genome sequences of several plant species are known and the rate at which whole genomes are being sequenced is increasing. 
Correct annotation of these genomes remains problematic despite gene prediction algorithms becoming ever more sophisticated. While wheat genome sequencing is rapidly progressing (www.wheatgenome.org), a quicker and complementary approach to identifying a large number of wheat genes is EST and full-length cDNA sequencing. These resources will prove invaluable for annotating the genomes of wheat and other monocots and as substrates for transgenic improvement of crops. As in Arabidopsis and rice, these tools will prove to be critical in speeding up the genetic improvement of wheat. DNA markers constructed from ESTs are effective since they lie within exon regions of genes that are actually expressed. Examination of DNA sequence databases permits a direct search for sequence polymorphisms and thus molecular markers. These polymorphisms are typically single-nucleotide polymorphisms (SNPs) or small insertions–deletions (indels). More importantly, because the SNPs are identified in EST sequences, the polymorphisms can be used to directly map functional, expressed genes, rather than DNA sequences derived from conventional RAPD and AFLP techniques, which are typically not genes. This has led to studies on linkage disequilibrium in genes to better characterize associations between phenotype and genotype.[20-22] To identify an SNP from an EST database, the database must be composed of ESTs derived from different genotypes, and the same EST sequences from different genotypes must then be aligned.[23] SNP markers rely upon the underlying redundancy within EST collections and assume that distinct genotypes of a plant genome will be represented within a collection.
Earlier, we described extensive wheat EST resources including full-length sequences,[24-27] and the use of these ESTs in wheat transcriptome analysis through a custom microarray.[28,29] Here, we extend our efforts and describe a collection of further ESTs and a comprehensive functional analysis of all the ESTs developed so far (∼1 million ESTs). Our work is based on a set of 51 cDNA libraries established from tissues of interest, covering growth stages and biotic and abiotic stresses in 10 different cultivars.

Materials and methods

Plant materials and cDNA library construction

Eighteen libraries from various growth stages, 25 libraries from abiotic stresses (cold, drought, salinity, and mineral toxicity), and eight libraries from biotic stresses (leaf rust, powdery mildew, and blast) were constructed from eight different wheat lines (Table 1). Of the 51 libraries, 20 were newly constructed for this study. Double-stranded cDNAs were synthesized as previously described.[24] cDNAs were ligated into pBlueScript SK(+) digested with EcoRI and XhoI. After transformation by electroporation, transformed bacterial cells were initially cultured in SOC medium for 1 h and then at 37°C for 2 h in 2× LB medium. Cultured cells were stored in 20% glycerol at −80°C until use. Transformed bacteria were randomly selected and plasmid DNA was extracted.[24] Inserted cDNAs were sequenced from both ends using dye terminator cycle sequencing (Applied Biosystems, Foster City, CA, USA).
Table 1. List and characteristics of cDNA libraries

Library name | Genotype | Stage | Condition | No. of EST | Accession number
whcs | CS | Callus | GS | 11 505 | CJ518205–CJ523460, CJ627048–CJ632007
whr | CS | Root | GS | 19 227 | BJ277129–BJ287630
whs | CS | Seedling | GS | 13 356 | HX000001–HX010004
whdl | CS | Seedling crown | GS | 12 761 | BJ221844–BJ231912
whh | CS | Spike at heading | GS | 20 648 | BJ255495–BJ266779
whf | CS | Spike at flowering | GS | 21 106 | BJ243195–BJ255494
whoh | CS | Pistil at heading | GS | 20 736 | BJ266780–BJ277128
whpc | CS | Anther at meiosis | GS | 11 016 | CJ576197–CJ580898, CJ682880–CJ687382
whhg | CSMT4B (a) | Anther at meiosis | GS | 9669 | CJ549536–CJ554132, CJ657247–CJ661657
whsh | CS | Young spikelet | GS | 11 302 | CJ730709–CJ736986
whyd | CS | Spikelet at late flowering | GS | 14 708 | BJ300204–BJ312233
whok | CS | DPA5 | GS | 12 159 | CJ570869–CJ576196, CJ677747–CJ682879
whms | CSDT3DL (b) | DPA5 | GS | 12 036 | CJ565425–CJ570868, CJ672462–CJ677746
whe | CS | DPA10 | GS | 19 200 | BJ231913–BJ243194
whdp | CS | DPA20 | GS | 13 455 | CJ523461–CJ529179, CJ632008–CJ637596
whsl | CS | DPA30 | GS | 15 522 | BJ287631–BJ300203
whsp | CS | Seedling | GS | 12 783 | HX247045–HX247474
whca | CS(Sp5A) (c) | Seedling | GS | 794 | HX247475–HX257200
whkp | CSKmppd (d) | Seedling | Grown under continuous light | 12 950 | CJ554133–CJ559953, CJ661658–CJ667190
whkv | CS | Seedling | Grown under continuous light after cold treatment | 15 360 | CJ559954–CJ565424, CJ667191–CJ672461
whem | Kitakei1354 | Dormant seed | With water supply | 11 671 | CJ539482–CJ544724, CJ647722–CJ652633
whei | Kitakei1354 | Dormant seed | With water supply after wounded | 11 743 | CJ534326–CJ539481, CJ642661–CJ647721
whsc | Kitakei1354 | Shoot | Cold treatment after excision of grain part | 13 079 | CJ586310–CJ591776, CJ692607–CJ697797
whsd | Kitakei1354 | Shoot | Dehydration | 11 897 | CJ591777–CJ596845, CJ697798–CJ702615
whrd | Kitakei1354 | Root | Dehydration | 12 436 | CJ580899–CJ586309, CJ687383–CJ692606
whv3 | Valuevskaya | Shoot | 3 days cold condition | 10 069 | CJ601934–CJ606680, CJ707296–CJ711731
whv | Valuevskaya | Shoot | 16 days cold condition | 11 087 | CJ596846–CJ601933, CJ702616–CJ707295
whva | Valuevskaya | Shoot | ABA treatment | 10 631 | CJ606681–CJ611586, CJ711732–CJ715867
whvd | Valuevskaya | Shoot | Five days dehydration | 11 767 | CJ611587–CJ616926, CJ715868–CJ720922
whvh | Valuevskaya | Shoot | Heat shock treatment | 10 090 | CJ616927–CJ621531, CJ720923–CJ725419
whvs | Valuevskaya | Liquid cultured cells | Liquid cultured cells | 12 327 | CJ621532–CJ627047, CJ725420–CJ730708
whrs6 (e) | CS | Root | Salt stress for 6 h | 13 110 | HX010005–HX019847
whss6 (e) | CS | Leaf | Salt stress for 6 h | 13 312 | HX019848–HX030180
whrs24 (e) | CS | Root | Salt stress for 24 h | 12 949 | HX030181–HX040252
whss24 (e) | CS | Leaf | Salt stress for 24 h | 12 487 | HX040253–HX050054
whatl (e) | Atlas66 | Root | No treatment | 20 519 | CJ773323–CJ797201
whatlal (e) | Atlas66 | Root | 50 mM Al for 6 h | 28 795 | CJ822818–CJ848636
whsct (e) | Scout66 | Root | No treatment | 27 717 | CJ797202–CJ822817
whsctal (e) | Scout66 | Root | 50 mM Al for 6 h | 29 824 | CJ848637–CJ872807
whhb (e) | Halberd | Root | No treatment | 22 488 | HX124648–HX143755
whhbb (e) | Halberd | Root | 10 mM boric acid for 24 h | 22 522 | HX143756–HX163093
whcr (e) | Cranbrook | Root | No treatment | 22 680 | HX163094–HX182808
whcrb (e) | Cranbrook | Root | 10 mM boric acid for 24 h | 22 616 | HX182809–HX201765
whthls (e) | Thatcher | Seedling | Infected with leaf rust | 30 307 | CJ872808–CJ896490
whthkles (e) | NILThatcher | Seedling | Infected with leaf rust | 24 701 | CJ896491–CJ919993
whchan (e) | Chancellor | Seedling | Infected with powdery mildew | 29 281 | CJ919994–CJ944155
whchu (e) | NILChancellor | Seedling | Infected with powdery mildew | 28 799 | CJ944156–CJ968175
whnr (e) | Norin4 | Seedling | No treatment | 34 415 | HX050055–HX071918
whnrpr48 (e) | Norin4 | Seedling | Infected with blast strain Pr48 at 23°C for 4 days | 23 987 | HX071919–HX084716
whnrpr58r (e) | Norin4 | Seedling | Infected with blast strain Pr58 at 23°C for 4 days | 36 360 | HX084717–HX106894
whnrpr58s (e) | Norin4 | Seedling | Infected with blast strain Pr58 at 27°C for 4 days | 30 797 | HX106895–HX124647
Total | | | | 894 756 |

CS, Chinese Spring; GS, growth stage; DPA, days to post-anthesis; NIL, near-isogenic lines.

(a) Mono-telosomic 4BS of CS.

(b) Ditelosomic 4BS of CS.

(c) Spelta5A chromosome substituted in CS.

(d) Near-isogenic line.

(e) Newly constructed libraries.


EST processing and assembly

The chromatogram files were base called and quality trimmed using PHRED[30] with default parameters. Vector, library linker-primer, and EcoRI adapter sequences were removed using CROSSMATCH. Repeats, ambiguous sequences (PHRED quality values <30), and poly(A) tails or poly(T) sequences (at most 10 bases) in the ESTs were trimmed. Subsequently, ESTs shorter than 30 bp were omitted from the final data set. The remaining high-quality sequences were used for further study. All sequence data are available from the DNA Data Bank of Japan (Table 1). The processed EST sequence files were combined and assembled into contigs using the CAP3 program[31] at both a high and a low stringency level (high, 95% homology in a 20 bp overlap; low, 80% homology in a 15 bp overlap). Default CAP3 settings include -p 90 -h 20; the custom parameter settings used were -p 85 -h 90. The CAP3 -p option specifies the overlap per cent identity cut-off, while the -h option specifies the maximum alignment overhang percentage.
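As an illustration of how the two CAP3 options quoted above fit together, the following minimal sketch builds the command lines for the default and custom settings. It is not the authors' pipeline: the helper name cap3_command and the input file name wheat_ests.fasta are hypothetical, and only the -p and -h values stated in the text are used.

```python
from typing import List


def cap3_command(reads_fasta: str, percent_identity: int,
                 max_overhang_percent: int) -> List[str]:
    """Build the argument list for one CAP3 run.

    -p: overlap per cent identity cut-off
    -h: maximum alignment overhang percentage
    """
    return [
        "cap3", reads_fasta,
        "-p", str(percent_identity),
        "-h", str(max_overhang_percent),
    ]


# "wheat_ests.fasta" is a hypothetical file name used only for illustration.
default_cmd = cap3_command("wheat_ests.fasta", 90, 20)  # defaults quoted in the text
custom_cmd = cap3_command("wheat_ests.fasta", 85, 90)   # custom settings quoted in the text

print(" ".join(default_cmd))
print(" ".join(custom_cmd))
```

Passing either list to subprocess.run would launch the assembly, provided a cap3 executable is available on the PATH.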

Sequence annotation

Using the BLAST program (BLASTX with a search threshold of 1e−5), the contig sequences were searched against seven databases (NCBI's nr, http://www.ncbi.nlm.nih.gov/genbank; UniProt, http://www.uniprot.org; RAP-DB, http://rapdb.dna.affrc.go.jp; RGAP, http://rice.plantbiology.msu.edu; TAIR9, http://www.arabidopsis.org; MaizeGDB, http://www.maizegdb.org; and the Brachypodium database, http://db.brachypodium.org). Following Ewing et al.,[7] only contigs were taken for further analyses. The gene ontology (GO) terms[32] of each contig were derived by InterProScan.[33] The GO terms were then converted into GO slim terms using the EBI website (http://www.ebi.ac.uk/QuickGO/) with a Perl script written for this purpose. Open reading frames (ORFs) were searched by translating the sequences into amino acids in six frames (three frames on each of the plus and minus strands).
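The six-frame ORF search can be sketched as follows. This is a minimal illustration rather than the script used in the study: it scans codons directly instead of producing an amino acid translation, it assumes an ORF runs from an ATG to the first in-frame stop codon, and all function names are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq: str, frame: int) -> int:
    """Length in nucleotides of the longest ATG..stop ORF in one frame."""
    best = 0
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            best = max(best, i + 3 - start)  # include the stop codon
            start = None
    return best


def six_frame_orfs(seq: str) -> List[Tuple[str, int, int]]:
    """Longest ORF per frame as (strand, frame, length) for all six frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            results.append((strand, frame, longest_orf_in_frame(s, frame)))
    return results


# Toy sequence (not from the data set): one ORF, ATG AAA TTT TAG, in frame 2.
for strand, frame, length in six_frame_orfs("CCATGAAATTTTAGGG"):
    print(strand, frame, length)
```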

Transcription factor

The PlnTFDB,[34] containing 29 473 sequences of plant genes involved in transcriptional control, was used to mine our data by local BLAST. The default parameters given in the database were used for prediction (filter ‘on’, gapped alignment ‘on’, substitution matrix ‘BLOSUM62’, E-value ≤1e−10). We included two meta-rules in our classification scheme: (i) if a protein harboured domains characteristic of both a transcription factor (TF) family and a transcriptional regulator (TR) family, we assigned it to the TF family; (ii) when the protein of interest contained domains characteristic of more than one TF family or more than one TR family, it was assigned to the family whose characteristic domains matched with the lowest E-value.
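The two meta-rules can be expressed compactly in code. The sketch below is one possible reading of them, not the authors' implementation: the DomainHit record, the assign_family helper, and the example family names and E-values are all illustrative assumptions.

```python
from typing import List, NamedTuple, Optional


class DomainHit(NamedTuple):
    family: str    # family name (illustrative)
    kind: str      # "TF" or "TR"
    evalue: float  # E-value of the match to the family's characteristic domain


def assign_family(hits: List[DomainHit],
                  max_evalue: float = 1e-10) -> Optional[DomainHit]:
    """Pick one family for a protein according to the two meta-rules."""
    hits = [h for h in hits if h.evalue <= max_evalue]
    if not hits:
        return None
    # Rule (i): if both TF and TR families match, prefer the TF families.
    tf_hits = [h for h in hits if h.kind == "TF"]
    candidates = tf_hits if tf_hits else hits
    # Rule (ii): among the remaining families, take the lowest E-value.
    return min(candidates, key=lambda h: h.evalue)


# Made-up example: the TR hit has the best E-value, but a TF family wins.
example = [
    DomainHit("NAC", "TF", 1e-20),
    DomainHit("SET", "TR", 1e-50),
    DomainHit("MYB", "TF", 1e-30),
]
print(assign_family(example))
```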

Full-length cDNA

A contig was classified as full length if it aligned with our full-length cDNA data.[27] BLASTN searches of the contigs against the full-length sequences yielded a candidate hit list (E-value ≤1e−100) of putative full-length sequences that either covered the start and stop codons of the subject sequence or possessed sufficient sequence upstream and downstream of the match to contain putative start and stop signals. In a few instances, contigs covered all but the start methionine; these were also included as full-length sequences. In addition, contigs were aligned (E-value <1e−5, ≥98% query coverage, ≥98% identity) with barley full-length sequences[35] to assess both their similarity and their full-length status.
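The barley comparison thresholds translate into a simple hit filter. The following sketch assumes each BLAST hit has already been parsed into a record; the Hit fields, the way query coverage is computed (aligned query span divided by query length), and the example values are assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    query_start: int    # 1-based, inclusive
    query_end: int      # 1-based, inclusive
    query_length: int
    evalue: float


def query_coverage(hit: Hit) -> float:
    """Percentage of the query covered by the alignment (assumed definition)."""
    span = abs(hit.query_end - hit.query_start) + 1
    return 100.0 * span / hit.query_length


def passes_barley_filter(hit: Hit, max_evalue: float = 1e-5,
                         min_coverage: float = 98.0,
                         min_identity: float = 98.0) -> bool:
    """Apply the E-value, query coverage, and identity thresholds from the text."""
    return (hit.evalue < max_evalue
            and query_coverage(hit) >= min_coverage
            and hit.percent_identity >= min_identity)


# Made-up hits: the first satisfies all thresholds, the second covers only half the query.
good = Hit("contig_1", "barley_fl_1", 99.0, 1, 990, 1000, 1e-120)
partial = Hit("contig_2", "barley_fl_2", 99.5, 1, 500, 1000, 1e-80)
print(passes_barley_filter(good), passes_barley_filter(partial))
```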

miRNA analysis

To identify conserved miRNAs in wheat, contigs were annotated with the plant small RNA regulator target analysis database (http://plantgrn.noble.org/psRNATarget/), which contains small RNAs of 15 plant species including wheat.[36] This database contained 2192 published miRNA sequences, including 32 from Triticum aestivum, 148 from sorghum, 496 from Oryza sativa, 319 from maize, and 224 from A. thaliana. Potential targets were predicted according to previously applied rules[37,38]: (i) the number of allowed mismatches at complementary sites between miRNA sequences and potential mRNA targets is four or fewer; and (ii) no gaps are allowed at the complementary sites.
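The two target-prediction rules amount to an ungapped comparison between a candidate site and the reverse complement of the miRNA. The sketch below illustrates that check only; it is not the psRNATarget scoring scheme, it treats G:U wobble pairs as mismatches (an assumption), and the example miRNA is a made-up sequence.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq: str) -> str:
    """Reverse complement of an RNA sequence (T is read as U)."""
    seq = seq.upper().replace("T", "U")
    return "".join(RNA_COMPLEMENT[b] for b in reversed(seq))


def count_mismatches(mirna: str, site: str) -> int:
    """Mismatches between a miRNA and an ungapped target site (5'->3')."""
    if len(mirna) != len(site):
        raise ValueError("ungapped comparison needs equal lengths")
    expected = reverse_complement_rna(mirna)
    site = site.upper().replace("T", "U")
    return sum(1 for a, b in zip(expected, site) if a != b)


def is_potential_target(mirna: str, site: str, max_mismatches: int = 4) -> bool:
    """Rule (i): at most four mismatches; rule (ii): no gaps (equal lengths)."""
    if len(mirna) != len(site):
        return False
    return count_mismatches(mirna, site) <= max_mismatches


# Made-up 21-nt miRNA and its perfectly complementary site.
mirna = "AUGGCUAACGUUAGCCAUGCA"
perfect_site = reverse_complement_rna(mirna)
print(count_mismatches(mirna, perfect_site), is_potential_target(mirna, perfect_site))
```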

SNP discovery

Sequence variants (SNPs) were mined in the wheat contigs under two criteria, and Perl scripts were written for each. Under the first criterion, only contigs with ≥4 ESTs were selected, and an SNP was declared only when no mismatches, gaps, or N's were present before and after the SNP site; in addition, the alternative base to the consensus sequence had to be present more than twice in the alignment. Under the second criterion, SNPs were mined only in the significant sequence of each contig, which was worked out by counting nucleotides from either end of contigs containing a minimum of four EST members. To find SNPs between cultivars, in addition to the above parameters, contigs containing a minimum of two consistent ESTs from the same cultivar were selected. For homoeologous SNPs, contigs containing a minimum of four ESTs from the same cultivar were chosen. Visual inspection of SNPs was carried out using Tablet software.[39]
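A simplified version of the first criterion can be sketched on a gapped multiple alignment of the ESTs in one contig. Several details below are assumptions rather than the authors' rules: reads are equal-length strings in which "-" marks a gap and "." marks positions a read does not cover, only biallelic sites are reported, the flank is 2 bp (the value given in the Results), and the alternative base must occur at least three times (one reading of "more than twice").

```python
from collections import Counter
from typing import List, Tuple


def column(reads: List[str], i: int) -> List[str]:
    """Bases at alignment position i from the reads that cover it."""
    return [r[i] for r in reads if r[i] != "."]


def is_clean(col: List[str]) -> bool:
    """True when all covering reads agree on one unambiguous base."""
    return len(col) > 0 and len(set(col)) == 1 and col[0] in "ACGT"


def call_snps(reads: List[str], min_reads: int = 4, flank: int = 2,
              min_alt_count: int = 3) -> List[Tuple[int, str, str]]:
    """Return (position, consensus base, alternative base) for each SNP."""
    if len(reads) < min_reads:
        return []
    snps = []
    for i in range(flank, len(reads[0]) - flank):
        col = column(reads, i)
        if any(b not in "ACGT" for b in col):
            continue  # gap or N at the candidate site itself
        counts = Counter(col).most_common()
        if len(counts) != 2:
            continue  # monomorphic, or more than two alleles
        (cons, _), (alt, alt_n) = counts
        if alt_n < min_alt_count:
            continue
        neighbours = [j for j in range(i - flank, i + flank + 1) if j != i]
        if all(is_clean(column(reads, j)) for j in neighbours):
            snps.append((i, cons, alt))
    return snps


# Toy alignment: four reads carry A and three carry G at position 4.
toy_reads = ["ACGTACGTA"] * 4 + ["ACGTGCGTA"] * 3
print(call_snps(toy_reads))
```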

Digital gene expression and correspondence analysis

For statistical analysis of gene expression profiles, contigs harbouring five or more constituents were selected from the 37 138 contigs. Similarities between contigs or libraries were estimated using Pearson's correlation coefficient.[40] Hierarchical clustering was applied to compare EST expression profiles among the 51 wheat tissue/treatment libraries. Expression profiles are displayed based on the number of constituents in a contig (from 0 to 4647), with red intensity increasing with the number of constituents. Contigs specific to DREB (dehydration-responsive element binding), NAC (nitrogen assimilation control), OMT (O-methyl transferase), and miRNA 172 were selected to show differential gene expression among cultivars and growth stages. Correspondence analysis (CA) was carried out on four disease-related libraries (whthls, whthkles, whchan, and whchul; Table 1) following the procedure detailed in Hamada et al.[41] and visualized with a custom-built viewer (available on request).
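The similarity measure underlying the clustering can be illustrated with a small, dependency-free sketch. The EST counts below are invented for illustration, and the use of 1 − r as the distance passed to hierarchical clustering is a common choice assumed here, not one stated in the text.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def correlation_distances(profiles: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances (1 - r) between libraries, keyed by library pair."""
    names = sorted(profiles)
    return {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}


# Hypothetical EST counts for four contigs in three libraries from Table 1.
profiles = {
    "whthls": [12, 0, 5, 9],
    "whthkles": [10, 1, 4, 8],
    "whchan": [0, 7, 2, 1],
}
for pair, dist in correlation_distances(profiles).items():
    print(pair, round(dist, 3))
```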

Results

cDNA library construction and assembly

Our previous studies covered the construction and analysis of 31 libraries.[26] Here, we report the addition of 20 further libraries and the combined analysis of all libraries for a comprehensive view of wheat ESTs. In total, we have accumulated ∼1 million ESTs (Table 1). The libraries were generated from developmental stages and stress treatments. After removal of low-quality bases, vector sequences, and short reads (<30 bp), 0.68 million ESTs were used for CAP3 assembly under stringent conditions, resulting in 37 138 contigs and 215 199 singlets. When assembled using relaxed settings in CAP3, 65 426 contigs and 66 875 singlets were obtained. The high-stringency condition was chosen to achieve a more complete separation of individual paralogues, orthologues, and homoeologues than is possible under low stringency. The total sequence of the transcript assemblies under the stringent parameter settings, containing both singlets and contigs, was 125.3 Mb with a GC content of 51.9%. This is the largest set of transcriptome sequences developed in any plant species. The GC content is similar to that of rice but lower than that reported for the exon coding sequence of wheat chromosome 3B.[42] The length of the singlets varied from 31 to 884 bp with an average of 430 bp. The largest group of singlets fell in the 500–600 bp length class (Fig. 1A). As we found a large number of singlets, we subsequently examined the singlet contribution of each of the 10 cultivars (Fig. 1B). No correlation was observed between the number of ESTs and the number of singlets. However, the stress-related libraries from four cultivars contributed 55% of the total singlets. Contig length ranged from 46 to 3960 bp with an average of 879 bp, and ∼70% of the contigs were between 501 and 1000 bp long (Fig. 2A). The number of ESTs grouped in each contig varied between 2 and 4647, with 78% of the contigs containing 2–10 EST members (Fig. 2B).
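Summary statistics of this kind (GC content, minimum, maximum, and mean length) are straightforward to compute from an assembly. The following is a minimal sketch on toy sequences; the function names are hypothetical, and ambiguous bases are excluded from the GC denominator as an assumption.

```python
from typing import Iterable, List, Tuple


def gc_percent(sequences: Iterable[str]) -> float:
    """GC content (%) over all unambiguous bases in the sequences."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(b) for b in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


def length_summary(sequences: List[str]) -> Tuple[int, int, float]:
    """Minimum, maximum, and mean sequence length."""
    lengths = [len(s) for s in sequences]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)


# Toy sequences, not taken from the assembly.
toy = ["GGCC", "ATAT", "ACGTAC"]
print(gc_percent(toy), length_summary(toy))
```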
Figure 1.

Analysis of singlet sequence length and their genotype-wise distribution. (A) Sequence length distribution of singlet. (B) Genotype-wise frequency (%) of singlet (AT, Atlas; SC, Scout; CC, Chancellor; TC, Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese Spring; KT, Kitakei1354; NR, Norin4; VV, Valuevskaya).

Figure 2.

Distribution of contig length and their EST member constitution. (A) Sequence length frequency of contigs. (B) Number of EST members in contigs.


Functional annotation

The contigs resulting from the stringent-parameter assembly were used for functional annotation. The function of each contig was derived by annotation against the rice, Arabidopsis, maize, and Brachypodium databases, in addition to the protein sequences in the GenBank nr database (BLASTX; E-value <1e−5). The recently sequenced Brachypodium was included because of its close evolutionary relationship with wheat. Maximum similarity was observed with Brachypodium, followed by rice (Table 2). For rice, further annotation was carried out to determine chromosome-wise sequence similarity; chromosome 1 showed the greatest co-linearity, followed by chromosome 3 (Fig. 3). Overall, ∼3500 genes had no similarity, suggesting that they are new genes in our data. To further validate these new genes, the updated tentative consensus sequences from the DFCI wheat gene index (http://compbio.dfci.harvard.edu/cgi-bin/tgi/gimain.pl?gudb=wheat) were annotated, which yielded the same number of new genes (∼3500), confirming the importance of our new EST assembly and analysis in wheat. To estimate the total number of full-length cDNAs in our collection, we searched our contig data against our 11 902 full-length cDNA sequences. With the stringent criteria of >95% similarity and an E-value cut-off of <1e−100, we found ∼7000 contigs that were full length. This was further validated by the high similarity of wheat contigs to barley full-length sequences, indicating the robustness of our data and their applicability to wheat functional genomics.
Table 2. Annotation of wheat contigs

Database and species | URL | No. and percentage of similarity
RAP-DB (build 5) (Rice) | http://rapdb.dna.affrc.go.jp | 32 405 (87.26%)
RGAP (Rice) | http://rice.plantbiology.msu.edu | 32 430 (87.32%)
TAIR9 (Arabidopsis) | http://www.arabidopsis.org | 30 504 (82.14%)
MaizeGDB | http://www.maizegdb.org/ | 31 206 (84.03%)
Brachypodium Database | http://db.brachypodium.org | 32 522 (87.57%)
All databases | | 33 909 (91.31%)
Total contigs | | 37 138

The updated (until May 2011) sequence was retrieved and similarity search was carried out.

Figure 3.

Sequence similarity of wheat contigs with rice genome. Based on the result, the contig was grouped in rice chromosome wise.

We also analysed length and ORF distribution in the contigs for both plus and minus strands. To obtain meaningful results, only contigs with >5 ESTs were selected and ORFs were identified. GO annotation of the wheat contigs was performed on the basis of ORF mining of the data. The GO terms were organized into three categories representing molecular functions, biological processes, and cellular components.[32] The percentages of wheat contigs per category did not add up to 100% because some contigs were classified into more than one category. Of the total contig set, 21 125 (56%) were annotated into the molecular function category (describing the biochemical activity performed by the gene product), 13 354 (36%) into the biological process GO category (describing the ordered assembly of more than one molecular function), and 13 356 (36%) into the cellular component GO category (describing subcellular compartments of a cell) (Fig. 4). Within the molecular function category, the most highly represented subcategories were binding, catalytic activity, redox activity, and structural activity (Fig. 4A). Among the biological processes, the largest proportion of functionally assigned contigs fell into metabolic, transport, and translation processes, while redox activity, biosynthetic process, regulation, phosphorylation, and transcription comprised 34% of the contigs (Fig. 4B). For the cellular component category, almost all contig sequences were annotated into the cell–cell subcategory, 28% into the membrane category, and 23% into the intracellular category (Fig. 4C). Together, all three GO categories accounted for ∼82% of the assigned wheat contig set.
Figure 4.

Functional classification of contig sequences based on GO categorization. Sequences were evaluated for their predicted involvement in molecular function, biological process, and cellular component.

The role and importance of TFs led us to mine our data for them, which resulted in 1183 contigs containing either single or multiple transcription factors. Among the TFs, the CCAAT family was found in as many as 69 contigs. miRNA target sequence analysis of the wheat transcriptome identified different miRNA target sequences, ranging from 19 to 24 nt in length, in 5180 contigs. The majority of the small RNAs are 20–24 nt long, which is a typical range for dicer-derived products; the 21-nt class is predominant. Among species-specific miRNAs, those of rice had maximum homology, followed by maize and Medicago truncatula (Fig. 5). The number of hits for each species is roughly proportional to the number of sequences for that species in the database. Owing to the limited number of wheat miRNA sequences in the database, there were only 200 contigs with wheat-specific miRNA target sequences. Among miRNAs, target sequences of miRNA 395, 172, and 164 alone were found in 831 contigs, showing the relative abundance of these miRNA target sequences in wheat.
Figure 5.

miRNA target sequence analysis in wheat contigs. The database bars indicate the available miRNA in the database and the hit bars indicate the number of wheat genes having miRNA target sequence. tae, Triticum aestivum; sbi, Sorghum bicolor; osa, Oryza sativa; zma, Zea mays; ath, Arabidopsis thaliana; mtr, Medicago truncatula; ghr, Gossypium hirsutum; ptc, Populus trichocarpa; bna, Brassica napus; gma, Glycine max; pta, Pinus taeda; sly, Solanum lycopersicum; bra, Brassica rapa; bol, Brassica oleracea; cre, Chlamydomonas reinhardtii.


Sequence polymorphism/SNP mining

SNPs were mined in our large-scale wheat transcriptome data by applying both relaxed and stringent criteria. Under both criteria, to find reliable SNPs, an SNP was accepted at a given position only when no mismatch was present within 2 bp before or after the SNP site. A total of 51 067 SNPs were detected in 20 609 contigs using the first criterion, corresponding to one SNP site every 96 bp. This frequency is considerably higher than those reported in earlier studies of wheat.[23,43] Hence, the second criterion was applied with the aim of differentiating homoeologous SNPs from intergenome SNPs by calculating the significant sequence length of each contig before mining the SNPs. This approach avoided finding SNPs at either end of the contigs and resulted in the identification of only 6352 SNPs in the wheat contigs. Further classification found 3515 SNPs between cultivars, with a frequency of one SNP per 614 bp. Because wheat comprises three homoeologous genomes and gene expression is selective among them, we examined SNPs within each cultivar and found 2837 SNPs, with one SNP site every 470 bp. The overall SNP frequency based on the stringent criteria was one SNP per 483 bp. Transitions (70%) were more frequent than transversions (30%). As expected, a significant positive correlation (P < 0.05) was observed between the number of SNPs detected in a contig and the number of reads present in that contig. All cultivars except Chinese Spring and Norin4 contained more inter-cultivar sequence variation than homoeologous SNPs. Among cultivars, Halberd, Valuevskaya, and Cranbrook contained almost the same number of SNPs in both cases. Cultivars Atlas, Kitakei1354, Scout, and Thatcher had much less homoeologous sequence variation than cultivar differences (Supplementary Fig. S1).
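The transition/transversion split reported above follows from the standard purine/pyrimidine classification of base substitutions. The sketch below shows that classification on invented SNP pairs; the function names are hypothetical and the example data are not from the study.

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transition_fraction(snps: Iterable[Tuple[str, str]]) -> float:
    """Fraction of SNPs that are transitions."""
    kinds = [substitution_type(ref, alt) for ref, alt in snps]
    return kinds.count("transition") / len(kinds)


# Invented example: seven transitions and three transversions.
toy_snps = [("A", "G")] * 4 + [("C", "T")] * 3 + [("A", "C"), ("G", "T"), ("A", "T")]
print(transition_fraction(toy_snps))
```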

Digital gene expression

EST frequencies approximate the message abundance in the mRNA population used to construct a cDNA library. We have already attempted to make tissue expression maps of a large number of ESTs from stress-related libraries for in silico screening of stress-responsive genes in wheat.[26] Here, we aimed to determine the global gene expression of wheat from 51 cDNA libraries, including growth stages and biotic and abiotic stresses. Contigs containing >5 ESTs were subjected to a correlated clustering analysis[7] to compare expression profiles in the different libraries. When the result was displayed in the form of a dendrogram (Fig. 6), many libraries with similar origins clustered together. All four libraries derived from root tissues treated with boric acid and aluminium grouped together. In the same manner, biotic stress (leaf rust, powdery mildew, and blast)-related tissues were grouped in the same clade. The tissues collected from cv. Valuevskaya, which mainly underwent abiotic stress, clustered separately.
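Because EST frequency is used as a proxy for message abundance, the input to the clustering is essentially a table of EST counts per contig and library. The sketch below builds such a table from contig memberships and keeps contigs with more than five ESTs; the input format, the function name, and the example records are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expression_table(memberships: Iterable[Tuple[str, str]],
                     min_ests: int = 6) -> Dict[str, Dict[str, int]]:
    """Count ESTs per contig and library, keeping contigs with >5 ESTs.

    memberships: one (contig_id, library_name) pair per assembled EST.
    """
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for contig, library in memberships:
        table[contig][library] += 1
    return {contig: dict(libs) for contig, libs in table.items()
            if sum(libs.values()) >= min_ests}


# Invented memberships: contig_A has six ESTs, contig_B only two.
records = ([("contig_A", "whrd")] * 4 + [("contig_A", "whsd")] * 2
           + [("contig_B", "whrd")] * 2)
print(expression_table(records))
```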
Figure 6.

Correlated clustering of wheat cDNA libraries based on gene expression (AT, Atlas; SC, Scout; CC, Chancellor; TC, Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese Spring; KT, Kitakei1354; NR, Norin4; VV, Valuevskaya).

To further determine the tissue-specific expression of selected genes, digital gene expression analysis was carried out for DREB and NAC TFs, the OMT gene, and miRNA 172 target-site genes in wheat transcripts (Supplementary Fig. S2). The DREB genes were expressed mainly in dehydration-related tissue libraries, and spatially the expression was mostly in root tissues. In the case of NAC TF genes, as expected, expression was noted only in salt-treated libraries, and the expression level was greater in root tissue, followed by shoot, spikelet, and seed. Interestingly, some biotic stress-related libraries also expressed NAC TFs. Based on the preliminary result obtained from our previous study on OMT in wheat self-defence,[29] the OMT-related contigs were mined and their gene expression was analysed. We found ubiquitous expression of OMT genes irrespective of stress, suggesting a defence role against stress. While mining miRNAs in the wheat transcriptome, we found an abundance of miRNA 172 target sites. The digital gene expression analysis showed their abundance in all tissues and under all treatments in wheat. In addition to digital expression, a new method of gene discovery and/or gene expression analysis based on CA was used to determine specific and common gene expression between libraries or treatments. To identify common molecular plant–pathogen interactions, the four libraries constructed for leaf rust and powdery mildew diseases were selected and examined by CA. Contigs with four or more ESTs were selected to identify genes specific to leaf rust and powdery mildew diseases, in addition to common disease resistance- and susceptibility-related genes (Supplementary Table S1).
The number of genes expressed for powdery mildew exceeded that for leaf rust. However, the number of disease susceptibility-related genes was greater for leaf rust, suggesting different disease reaction mechanisms among diseases in wheat. Overall, around 100 new genes were identified from these four disease-related libraries, which could have immense value for future research on molecular plant–pathogen interactions.

Discussion

Global wheat transcriptome analysis was carried out by accumulating ∼1 million ESTs from 51 cDNA libraries. In contrast to other wheat studies, which were biased towards one stress or growth stage in a few cultivars,[3,26] we covered all growth stages and both biotic and abiotic stresses across 10 different cultivars. The workflow of EST assembly and further analyses is summarized in Fig. 7. Many Perl scripts were written specifically for processing the ESTs, resulting in a 24% reduction in total ESTs. The stringent assembly parameters yielded more singlets than contigs, which helped further analyses such as SNP mining.[23] The average contig length (879 bp) is higher than in other studies,[3,23,44] and ∼80% of the contigs containing 2–10 EST members had homoeologous or paralogous genes. To determine the EST member contribution to the contigs, we further classified the data into stress- and growth stage-related parameters (Supplementary Fig. S3). Among the 20 stress-related libraries, biotic stress-related libraries contributed more ESTs to the contigs than libraries for abiotic stress (Supplementary Fig. S3A). Among the growth stages, libraries of the spikelet at late flowering contributed more than those of other stages, suggesting differential gene expression among the cultivars with or without any stress (Supplementary Fig. S3B). The high number of unigenes and GC% also suggests the possibility that more genes are present in wheat than in other crops.[45] This is supported by the recent study of megabase-level sequencing of chromosome 3B, which estimated 50 000 genes per diploid genome as a result of the additional non-collinear genes interspersed within the highly conserved ancestral grass gene backbone, suggesting accelerated evolution in the Triticeae lineages.[44] The presence of additional genes was further confirmed by the identification of new genes based on functional annotation (Fig. 3).
When putative wheat gene sequences were analysed for ORF length based on their hit status, we observed significantly shorter ORFs in sequences with no hits. These results suggest that ORF length, not sequence length, is a better indicator of finding transcripts with protein coding capacity and subsequently getting a hit in a sequence database. On the other hand, more than one-third of the sequences without a hit still contained an ORF >300 bp, suggesting that sequences without a hit but with a relatively long ORF may represent new genes with protein coding capacity. We confirmed this by finding more full-length cDNA sequences. Our results also showed a higher no hit percentage in singlet sequences, most likely due to the fact that singletons represent rare genes in the wheat genome that are not well described in other organisms.
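To make the ORF-length criterion concrete, the sketch below scans a nucleotide sequence in all six reading frames and reports the longest ATG-to-stop ORF, then applies the >300 bp threshold mentioned above. It is an illustrative simplification under stated assumptions: the function names are hypothetical, only A/C/G/T are handled, and partial ORFs lacking a start or stop codon (common in ESTs) are ignored. How ORFs were actually called in the study is not described in this section.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def longest_orf_length(seq):
    """Length in bp (stop codon included) of the longest ATG..stop ORF, six frames."""
    seq = seq.upper()
    best = 0
    # Forward strand and its reverse complement.
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None                   # close it at the in-frame stop
    return best

def is_candidate_coding(seq, min_orf_bp=300):
    """Flag a no-hit sequence whose longest ORF exceeds the threshold."""
    return longest_orf_length(seq) > min_orf_bp

# ATG AAA TTT TAG sits in frame 2 of this toy sequence, giving a 12 bp ORF.
print(longest_orf_length("CCATGAAATTTTAGGG"))  # prints 12
```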
Figure 7.

Schematic diagram explaining the comprehensive EST analysis. The software used in each step is listed alongside it.

Functional characterization

GO analysis revealed the expected categories, namely molecular, biological, and cellular processes (Fig. 5). In wheat, the major molecular processes were binding and catalytic activities, as similarly found in other Poaceae species.[44,46,47] Metabolic, transport, and translation functions accounted for 50% of the biological processes. Among cellular processes, as expected, membrane, intracellular, and ribosome functions played a major role. Using data from the plant TF database,[34] we found 1183 wheat contigs with high similarity to 2197 different TF coding sequences from seven species; the highest similarity percentage was found with rice. The number of TFs found in wheat contigs was higher than that reported in the Salvia sclarea calyx, which was sequenced using 454 pyrosequencing.[48] The most represented TF family in wheat was CCAAT. The CCAAT box is a common cis-acting element found in the promoter and enhancer regions of a large number of genes in higher eukaryotes (reviewed elsewhere).[49,50] In addition to this TF family, several other TF families known to be involved in plant development were also present in our data. The role of miRNAs in developmental and stress regulation is not well established in wheat, and tissue-specific and developmental regulation of miRNAs is increasingly being found, mostly in animal species.[51] Through cDNA sequencing efforts, we have identified transcripts that encode 945 different miRNAs,[52] although additional wheat-specific miRNAs may still remain in the cDNA collection. Indeed, a large number of ncRNAs have the potential to form miRNA-like stem-loop precursors (data not shown), but experimental validation of these potential miRNAs is required. In our study, we found an abundance of miRNA 172 target sites in as many as 236 contigs, and their uniform expression irrespective of growth stage and/or stress was confirmed by in silico gene expression analysis (Supplementary Fig. S2D).
The digital gene expression patterns of genes related to biotic and abiotic stresses (OMT, DREB, and NAC TFs) and to epigenetic gene silencing helped us to determine their quantitative and qualitative expression patterns.[7] This approach permits both the association of tissues via their common patterns of gene expression and the association of genes via their tissue-dependent expression patterns. The correlation clustering of the 51 libraries grouped them by their respective treatments or stages, which confirmed the importance of the libraries as well as gene expression specific to treatment.[26] CA is an explorative computational method for the study of associations between variables. Much like principal component analysis, it displays a low-dimensional projection of the data, e.g. onto a plane or in a three-dimensional view (Supplementary Fig. S4), which can be achieved for two to three variables simultaneously, thus revealing associations between them. Traditionally, CA has been used predominantly for categorical data in the social sciences, but its application has also been extended to physical quantities and to proteomics.[53] This method allowed us to quickly analyse the set of EST libraries and to explore the molecular basis of pathogenicity in wheat. In the four disease-related libraries, ribulose-1,5-bisphosphate and S-adenosyl methionine genes were commonly found. In leaf rust, UDP-glucosyl transferase and chlorophyll a–b binding protein were highly expressed,[29] while in the powdery mildew treatment, ADP-ribosylation factor, lipid transfer protein, and oxalate oxidase were specifically expressed. CA thus helped to find the common molecular resistance mechanism in wheat.[54] When comparing susceptible and resistant reaction mechanisms, some genes, such as those for 40S and 60S ribosomal proteins and zinc finger domain-containing proteins, showed copy number variations between the two mechanisms.
We have shown that the application of CA to EST data provides an informative and concise means of visualizing these data, capable of uncovering relationships both among genes and between genes and libraries, at specific or common stages.
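As a rough illustration of what CA operates on, the sketch below computes the standardized residuals of a contig-by-library EST count table and their sum of squares, the total inertia (the chi-square statistic divided by the grand total). A full CA then takes the singular value decomposition of this residual matrix to obtain the low-dimensional axes; that step is omitted here to keep the example free of external libraries. The function name and the toy tables are assumptions, and the sketch assumes no empty rows or columns.

```python
from math import sqrt

def ca_standardized_residuals(table):
    """Return (residual matrix, total inertia) for a contig-by-library count table."""
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]        # contig profiles' weights
    col_mass = [sum(col) / total for col in zip(*table)]  # library profiles' weights
    # Deviation of each observed proportion from independence, standardized.
    residuals = [
        [(n / total - r * c) / sqrt(r * c) for n, c in zip(row, col_mass)]
        for row, r in zip(table, row_mass)
    ]
    inertia = sum(x * x for row in residuals for x in row)
    return residuals, inertia

# Rows: contigs; columns: libraries (invented counts).
# Perfectly library-specific contigs give the maximum inertia for a 2 x 2 table.
_, specific = ca_standardized_residuals([[4, 0], [0, 4]])
# Contigs expressed in the same proportions everywhere give (near) zero inertia.
_, common = ca_standardized_residuals([[10, 10], [20, 20]])
print(specific)  # prints 1.0
```

High inertia indicates that contigs and libraries are strongly associated (library-specific expression); inertia near zero indicates expression shared in the same proportions across libraries.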

SNP mining

Assembly of EST sequences into contigs in a polyploid species like hexaploid wheat results in each contig being composed of ESTs from homoeologous loci and members of gene families.[23] SNPs are the most abundant co-dominant polymorphic sites, occurring in both intronic and exonic regions of the genome. They occur with variable frequencies and have become very popular in plant genetics and breeding owing to their amenability to high-throughput genotyping. Continuing our earlier study, which discriminated homoeologous gene expression in hexaploid wheat by SNP analysis of contigs grouped from 10 cDNA libraries from Chinese Spring,[43] we mined our large-scale data to find SNPs among 10 different cultivars under various stress treatments. With relaxed criteria, we estimated ∼50 000 SNPs in wheat, with a frequency of one SNP per 96 bp, a number that is higher than in our previous report.[43] This high number might be due to the ESTs originating from 10 different cultivars under various stress conditions, although the possibility of over-estimation from the end sequences could not be excluded. Hence, a new criterion was applied by calculating the significant sequence length to avoid SNPs arising from end sequences and sequencing errors. This approach retained only 20% of the SNPs obtained by the initial approach and could lead to an underestimation of nucleotide diversity, although it guarantees the elimination of false positives. The stringent parameter resulted in an SNP frequency of one SNP in every 483 bp. Among species with large genomes, coffee has one SNP every 222 bp;[55] sugarcane has one SNP every 290 bp;[16] cotton has one SNP every 500 bp;[17] and oak has a frequency similar to cotton, with one SNP every 471 bp.[18] Our result in wheat, compared with other species, could explain the low polymorphism found in polyploid species.
The comparison of SNPs found between and within cultivars showed higher SNP frequency between cultivars (Supplementary Fig. S1), confirming that the low level of polymorphism identified between homoeologous genomes compared with inter-genome differences could be useful to select parents for linkage mapping studies.
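The SNP figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes that each frequency is the surveyed sequence length divided by the SNP count, and treats the approximate values in the text (∼50 000 SNPs, 20% retained) as exact; the variable and function names are illustrative only.

```python
def implied_span_bp(snp_count, bp_per_snp):
    """Sequence length implied by an SNP count and a 'one SNP per N bp' rate."""
    return snp_count * bp_per_snp

relaxed_snps = 50_000                       # "~50 000 SNPs" under relaxed criteria
stringent_snps = relaxed_snps * 20 // 100   # "only 20%" retained -> 10 000

relaxed_span = implied_span_bp(relaxed_snps, 96)        # 4 800 000 bp
stringent_span = implied_span_bp(stringent_snps, 483)   # 4 830 000 bp

print(relaxed_span, stringent_span)  # prints 4800000 4830000
```

Under these assumptions both rates imply roughly 4.8 Mb of surveyed sequence, so the fivefold drop in SNP count matches the roughly fivefold drop in frequency (483/96 ≈ 5.03).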

Conclusion

The global wheat EST assembly presented here provides an unprecedented look at the wheat transcriptome and contributes tools for wheat genetics and genomics efforts. The development and inclusion of cDNA libraries from all growth stages, various tissues, and treatments portray the complete picture of the wheat transcriptome. Functional annotation and characterization give new insights into the wheat expressed genome, at least in part. The identified SNPs are invaluable resources for functional genomics and molecular breeding applications. This set of processed EST sequences provides a seed for future investigation of wheat functional genomics using both long and short oligonucleotide arrays. Our data will thus act as a backbone for wheat genome sequence assembly, which is progressing rapidly.

Supplementary Data

Supplementary Data are available at www.dnaresearch.oxfordjournals.org.

Funding

This work was supported by Grants-in-Aid for Scientific Research on priority areas ‘Comparative Genomics’ and the National Bio-resource Project from the Ministry of Education, Culture, Sports, Science and Technology of Japan. This is contribution No. 1007 from the Kihara Institute for Biological Research, Yokohama City University.
  53 in total

1.  Structure of linkage disequilibrium and phenotypic associations in the maize genome.

Authors:  D L Remington; J M Thornsberry; Y Matsuoka; L M Wilson; S R Whitt; J Doebley; S Kresovich; M M Goodman; E S Buckler
Journal:  Proc Natl Acad Sci U S A       Date:  2001-09-18       Impact factor: 11.205

2.  Discrimination of homoeologous gene expression in hexaploid wheat by SNP analysis of contigs grouped from a large number of expressed sequence tags.

Authors:  K Mochida; Y Yamazaki; Y Ogihara
Journal:  Mol Genet Genomics       Date:  2003-11-01       Impact factor: 3.291

3.  Transcriptome analysis of salinity stress responses in common wheat using a 22k oligo-DNA microarray.

Authors:  Kanako Kawaura; Keiichi Mochida; Yukiko Yamazaki; Yasunari Ogihara
Journal:  Funct Integr Genomics       Date:  2005-11-19       Impact factor: 3.410

4.  Construction and evaluation of cDNA libraries for large-scale expressed sequence tag sequencing in wheat (Triticum aestivum L.).

Authors:  D Zhang; D W Choi; S Wanamaker; R D Fenton; A Chin; M Malatrasi; Y Turuspekov; H Walia; E D Akhunov; P Kianian; C Otto; K Simons; K R Deal; V Echenique; B Stamova; K Ross; G E Butler; L Strader; S D Verhey; R Johnson; S Altenbach; K Kothari; C Tanaka; M M Shah; D Laudencia-Chingcuanco; P Han; R E Miller; C C Crossman; S Chao; G R Lazo; N Klueva; J P Gustafson; S F Kianian; J Dubcovsky; M K Walker-Simmons; K S Gill; J Dvorák; O D Anderson; M E Sorrells; P E McGuire; C O Qualset; H T Nguyen; T J Close
Journal:  Genetics       Date:  2004-10       Impact factor: 4.562

5.  Gene expression profiles during the initial phase of salt stress in rice.

Authors:  S Kawasaki; C Borchert; M Deyholos; H Wang; S Brazille; K Kawai; D Galbraith; H J Bohnert
Journal:  Plant Cell       Date:  2001-04       Impact factor: 11.277

6.  Tablet--next generation sequence assembly visualization.

Authors:  Iain Milne; Micha Bayer; Linda Cardle; Paul Shaw; Gordon Stephen; Frank Wright; David Marshall
Journal:  Bioinformatics       Date:  2009-12-04       Impact factor: 6.937

7.  Cluster analysis and display of genome-wide expression patterns.

Authors:  M B Eisen; P T Spellman; P O Brown; D Botstein
Journal:  Proc Natl Acad Sci U S A       Date:  1998-12-08       Impact factor: 11.205

8.  Insights into corn genes derived from large-scale cDNA sequencing.

Authors:  Nickolai N Alexandrov; Vyacheslav V Brover; Stanislav Freidin; Maxim E Troukhan; Tatiana V Tatarinova; Hongyu Zhang; Timothy J Swaller; Yu-Ping Lu; John Bouck; Richard B Flavell; Kenneth A Feldmann
Journal:  Plant Mol Biol       Date:  2008-10-21       Impact factor: 4.076

9.  A global assembly of cotton ESTs.

Authors:  Joshua A Udall; Jordan M Swanson; Karl Haller; Ryan A Rapp; Michael E Sparks; Jamie Hatfield; Yeisoo Yu; Yingru Wu; Caitriona Dowd; Aladdin B Arpat; Brad A Sickler; Thea A Wilkins; Jin Ying Guo; Xiao Ya Chen; Jodi Scheffler; Earl Taliercio; Ricky Turley; Helen McFadden; Paxton Payton; Natalya Klueva; Randell Allen; Deshui Zhang; Candace Haigler; Curtis Wilkerson; Jinfeng Suo; Stefan R Schulze; Margaret L Pierce; Margaret Essenberg; Hyeran Kim; Danny J Llewellyn; Elizabeth S Dennis; David Kudrna; Rod Wing; Andrew H Paterson; Cari Soderlund; Jonathan F Wendel
Journal:  Genome Res       Date:  2006-02-14       Impact factor: 9.043

10.  OryzaExpress: an integrated database of gene expression networks and omics annotations in rice.

Authors:  Kazuki Hamada; Kohei Hongo; Keita Suwabe; Akifumi Shimizu; Taishi Nagayama; Reina Abe; Shunsuke Kikuchi; Naoki Yamamoto; Takaaki Fujii; Koji Yokoyama; Hiroko Tsuchida; Kazumi Sano; Takako Mochizuki; Nobuhiko Oki; Youko Horiuchi; Masahiro Fujita; Masao Watanabe; Makoto Matsuoka; Nori Kurata; Kentaro Yano
Journal:  Plant Cell Physiol       Date:  2010-12-23       Impact factor: 4.927

