Genomics and bioinformatics resources for crop improvement.

Abstract

Recent remarkable innovations in platforms for omics-based research and application development provide crucial resources to promote research in model and applied plant species. A combinatorial approach using multiple omics platforms and integration of their outcomes is now an effective strategy for clarifying molecular systems integral to improving plant productivity. Furthermore, promotion of comparative genomics among model and applied plants allows us to grasp the biological properties of each species and to accelerate gene discovery and functional analyses of genes. Bioinformatics platforms and their associated databases are also essential for the effective design of approaches making the best use of genomic resources, including resource integration. We review recent advances in research platforms and resources in plant omics together with related databases and advances in technology.

Year: 2010 PMID: 20208064 PMCID: PMC2852516 DOI: 10.1093/pcp/pcq027

Source DB: PubMed Journal: Plant Cell Physiol ISSN: 0032-0781 Impact factor: 4.927

Introduction

Sustainable agricultural production is an urgent issue in response to global climate change and population increase (Brown and Funk 2008, Turner et al. 2009). Furthermore, recent increased demand for biofuel crops has created a new market for agricultural commodities. One potential solution is to increase plant yield by designing plants based on a molecular understanding of gene function and on the regulatory networks involved in stress tolerance, development and growth (Takeda and Matsuoka 2008). Recent progress in plant genomics has allowed us to discover and isolate important genes and to analyze functions that regulate yields and tolerance to environmental stress. The whole genome sequencing of Arabidopsis thaliana was completed in 2000 (The Arabidopsis Genome Initiative 2000). Subsequently, the National Science Foundation (NSF) Arabidopsis 2010 project in the USA was launched with the stated goal of determining the functions of 25,000 genes of Arabidopsis by 2010 (Somerville and Dangl 2000). Technological advances in each omics research area have become essential resources for the investigation of gene function in association with phenotypic changes. Some of these advances include the development of high-throughput methods for profiling expressions of thousands of genes, for identifying modification events and interactions in the plant proteome, and for measuring the abundance of many metabolites simultaneously. In addition, large-scale collections of bioresources, such as mass-produced mutant lines and clones of full-length cDNAs and their integrative relevant databases, are now available (Brady and Provart 2009, Kuromori et al. 2009, Seki and Shinozaki 2009). The genome sequencing project of japonica rice was completed in 2005, and the Rice Annotation Project (RAP), which was orchestrated via ‘jamboree-style’ annotation meetings, aimed to provide an accurate annotation of the rice genome (International Rice Genome Sequencing Project 2005, Itoh et al. 2007). 
In conjunction with the rice genome sequence and its related genomic resources, advanced development of mapping populations and molecular marker resources has allowed researchers to accelerate the isolation of agronomically important quantitative trait loci (QTLs) (Ashikari et al. 2005, Konishi et al. 2006, Ma et al. 2006, Kurakawa et al. 2007, Ma et al. 2007). The aforementioned recent high-throughput technological advances have provided opportunities to develop collections of sequence-based resources and related resource platforms for specific organisms. A schematic representation of each relevant omics resource is shown in Fig. 1, together with the current status of their availabilities for Arabidopsis, rice and soybean as examples. Each biological element that has been measured comprehensively by a high-throughput method is represented in a corresponding plane in a conceptual model with layers ranging from genome to phenome, a model termed ‘omic space’ (Toyoda and Wada 2004). Such comprehensive models often provide an excellent starting point for designing experiments, generating hypotheses or conceptualizing based on the integrated knowledge found in the omic space of a particular organism. Furthermore, development of such omic resources and data sets for various species allows the comparison of omic properties among species, which promises to be an efficient way to find collateral evidence for conserved gene functions that might be evolutionarily supported. Bioinformatics platforms have become essential tools for accessing omics data sets for the efficient mining and integration of biologically significant knowledge.

Fig. 1

Omic space and related resources in plants. Examples of resources for each omics layer are shown for Arabidopsis (the model plant), rice (a model monocot and a sequenced crop) and soybean (an important crop recently sequenced). These resources are accessible from the following URLs or citations. 1. http://www.arabidopsis.org/, 2. http://www.gramene.org/, 3. http://soybase.org/, 4. http://nazunafox.psc.database.riken.jp, 5. http://rarge.gsc.riken.jp/dsmutant/index.pl, 6. http://signal.salk.edu/tabout.html, 7. http://tilling.fhcrc.org/, 8. Kolesnik et al. (2004), 9. http://www.postech.ac.kr/life/pfg/risd/, 10. http://tos.nias.affrc.go.jp/, 11. http://www.soybeantilling.org/psearch.jsp, 12. http://mulch.cropsoil.uga.edu/~parrottlab/Mutagenesis/acds/index.php, 13. http://arabidopsis.org.uk/home.html, 14. http://abrc.osu.edu/, 15. http://www.shigen.nig.ac.jp/rice/oryzabase/top/top.jsp, 16. http://www.irri.org/grc/GRChome/home.htm, 17. http://www.legumebase.agr.miyazaki-u.ac.jp/index.jsp, 18. http://www.plantcyc.org:1555/ARA/server.html, 19. http://pathway.gramene.org/gramene/ricecyc.shtml, 20. http://www.plantcyc.org/, 21. http://mediccyc.noble.org/, 22. http://prime.psc.riken.jp/, 23. http://csbdb.mpimp-golm.mpg.de/csbdb/gmd/gmd.html, 24. http://ppdb.tc.cornell.edu/, 25. http://phosphat.mpimp-golm.mpg.de/, 26. http://cdna01.dna.affrc.go.jp/RPD/main_en.html, 27. http://proteome.dc.affrc.go.jp/Soybean/, 28. http://oilseedproteomics.missouri.edu/, 29. http://bioinfo.esalq.usp.br/cgi-bin/atpin.pl, 30. http://atpid.biosino.org/, 31. http://suba.plantenergy.uwa.edu.au/, 32. http://proteomics.arabidopsis.info/, 33. http://www.brc.riken.go.jp/lab/epd/catalog/cdnaclone.html, 34. http://rarge.gsc.riken.jp/, 35. http://cdna01.dna.affrc.go.jp/cDNA/, 36. http://rsoy.psc.riken.jp/, 37. http://www.arabidopsis.org/portals/expression/microarray/ATGenExpress.jsp, 38. https://www.genevestigator.com/gv/index.jsp, 39. http://bioinformatics.med.yale.edu/riceatlas/, 40. http://bioinformatics.towson.edu/SGMD/Default.htm, 41. http://soyxpress.agrenv.mcgill.ca/cgi-bin/soy/soybean.cgi, 42. http://mpss.udel.edu/at/, 43. http://mpss.udel.edu/rice/, 44. http://signal.salk.edu/, 45. http://rapdb.dna.affrc.go.jp/, 46. http://rice.plantbiology.msu.edu/, 47. http://www.phytozome.net/, 48. http://walnut.usc.edu/, 49. http://www.oryzasnp.org/, 50. http://www.soymap.org/, 51. http://1001genomes.org/, 52. http://rarge.gsc.riken.jp/rartf/, 53. http://arabidopsis.med.ohio-state.edu/, 54. http://datf.cbi.pku.edu.cn/, 55. http://drtf.cbi.pku.edu.cn/, 56. http://grassius.org/, 57. http://soybeantfdb.psc.riken.jp, 58. http://legumetfdb.psc.riken.jp/.

Sequence resources in plants

Comprehensively collected sequence data provide essential genomic resources for accelerating molecular understanding of biological properties and for promoting the application of such knowledge. The recent accumulation of nucleotide sequences of model plants, as well as of applied species such as crops and domestic animals, has provided fundamental information for the design of sequence-based research applications in functional genomics. In this section, we describe recently developed plant sequence resources. Species-specific nucleotide sequence collections also provide opportunities to identify the genomic aspects of phenotypic characters based on genome-wide comparative analyses and knowledge of model organisms (Cogburn et al. 2007, Flicek et al. 2008, Paterson 2008, Tanaka et al. 2008).

Genome sequencing projects

The first genome sequence of a plant was completed for A. thaliana, which is now used as a model species in plant molecular biology due to its small size, short generation time and high efficiency of transformation. The Arabidopsis genome sequence project was performed as a cooperative project among scientists in Japan, Europe and the USA (Bevan 1997). The genome sequencing was completed and published in 2000 by the Arabidopsis Genome Initiative (AGI) (The Arabidopsis Genome Initiative 2000). The draft genome sequence of rice, both japonica and indica, an important staple food as well as a model monocotyledon, was published in 2002 (Goff et al. 2002, Yu et al. 2002). Subsequently, the genome sequence of japonica rice was completed and published by the International Rice Genome Sequencing Project in 2005 (International Rice Genome Sequencing Project 2005). To date, several genome sequencing projects involving various plant species have been completed (Table 1).

Table 1

Whole genome sequencing projects in plants (http://www.arabidopsis.org/portals/genAnnotation/other_genomes/index.jsp)

Common name	Latin name	Sequencing group	References	URL
Dicots
Mouse ear cress	Arabidopsis thaliana	Consortium (AGI)	The Arabidopsis Genome Initiative (2000)	http://www.arabidopsis.org/
Poplar	Populus trichocarpa	JGI	Tuskan et al. (2006)	http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html
Lyre-leaf rockcress	Arabidopsis lyrata	JGI		http://genome.jgi-psf.org/Araly1/Araly1.home.html
Pink shepherd’s purse	Capsella rubella	JGI		(http://www.jgi.doe.gov/sequencing/why/3066.html)
Rapeseed	Brassica rapa	Consortium (MGBP)		http://www.brassica-rapa.org/BRGP/index.jsp
Tomato	Solanum lycopersicum	Consortium (ITGSP)		http://solgenomics.net/
Potato	Solanum tuberosum	Consortium (PGSC)		http://www.potatogenome.net/index.php/Main_Page
Barrel medic	Medicago truncatula	Consortium (IMGAG)		http://www.medicago.org/genome/
	Lotus japonicus	Consortium	Sato et al. (2008)	http://www.kazusa.or.jp/lotus/
Monkey flower	Mimulus guttatus	JGI		(http://www.jgi.doe.gov/sequencing/why/3062.html)
Soybean	Glycine max	JGI	Schmutz et al. (2010)	http://www.phytozome.net/soybean.php
Cassava	Manihot esculenta	JGI		http://www.phytozome.org/cassava.php
Grape	Vitis vinifera	Consortium	Jaillon et al. (2007)	http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/
Columbine	Aquilegia formosa	JGI		(http://www.jgi.doe.gov/sequencing/why/51280.html)
Eucalyptus	Eucalyptus grandis	JGI		http://bioinformatics.psb.ugent.be/genomes/view/Eucalyptus-grandis
Papaya	Carica papaya	Consortium	Ming et al. (2008)	http://asgpb.mhpcc.hawaii.edu/papaya/
Castor bean	Ricinus communis	TIGR		http://castorbean.jcvi.org/
Monocots
Rice	Oryza sativa japonica	Consortium (IRGSP)	International Rice Genome Sequencing Project (2005)	http://rgp.dna.affrc.go.jp/E/IRGSP/index.html
Rice	Oryza sativa indica	Beijing Genomics Institute	Yu et al. (2002)	http://rice.genomics.org.cn/rice/index2.jsp
Maize	Zea mays	Consortium	Schnable et al. (2009)	http://www.maizegdb.org/
Sorghum	Sorghum bicolor	JGI	Paterson et al. (2009)	http://genome.jgi-psf.org/Sorbi1/Sorbi1.home.html
	Brachypodium distachyon	JGI, Consortium (IBI)	The International Brachypodium Initiative (2010)	http://www.brachypodium.org/
Wheat	Triticum aestivum	Consortium (IWGSC)		http://www.wheatgenome.org/
Barley	Hordeum vulgare	Consortium (IBSC)		http://www.public.iastate.edu/~imagefpc/IBSC%20Webpage/IBSC%20Template-home.html
Other
Moss	Physcomitrella patens	JGI	Rensing et al. (2008)	http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html
Gemmiferous spike moss	Selaginella moellendorffii	JGI		http://genome.jgi-psf.org/Selmo1/Selmo1.home.html
	Cyanidioschyzon merolae	Consortium	Matsuzaki et al. (2004)	http://merolae.biol.s.u-tokyo.ac.jp/

There are a number of providers for plant genome sequences and annotations. Phytozome is a Web-accessible information resource providing genome sequences and annotations of various plant species. This resource is a joint project of the Department of Energy’s Joint Genome Institute (DOE–JGI) and the Center for Integrative Genomics, and is intended to facilitate comparative genomic studies among green plants (http://www.phytozome.net/Phytozome_info.php). The current version of Phytozome (ver. 5.0, January 2010) covers 18 plant species that were sequenced by JGI and other sequencing projects. Gramene (http://www.gramene.org/) is an information resource established as a portal site for grass species, and it provides various kinds of information related to grass genomics, including genome sequences (Ware 2007, Liang et al. 2008). The current version of Gramene (#30, October 2009) provides genome sequence information for 15 plant species, including five wild rice genome assemblies. According to data provided on the Entrez Genome Project Web site (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj), as of November 2009, >150 instances of genome projects in species of Viridiplantae have been tracked, including agronomically important crops such as staple foods, fruit trees, medicinal plants and a number of green algal species. With the ongoing innovations in next-generation sequencing technologies, the release of sequenced genomes is expected to accelerate (Ossowski et al. 2008, Ansorge 2009). Whole-genome sequence information allows us to derive sets of important genomic features, including the identification of protein-coding or non-coding genes and constructs such as gene families, regulatory elements, repetitive sequences, simple sequence repeats (SSRs) and guanine–cytosine (GC) content.
These data sets have become primary sequence material for the design of genome sequence-based platforms such as microarrays, tiling arrays or molecular markers, as well as for reference data sets for the integration of omics elements into a genome sequence. Chromosome-scale comparisons identifying conserved similarities of gene coordinates facilitate documentation of segmental and tandem duplications in related species (Haas et al. 2004, De Bodt et al. 2005). Whole-genome comparisons identifying chromosomal duplication and conserved synteny among related species provide evidence for hypotheses on comparative evolutionary histories with regard to the diversification of species in a related lineage (Paterson et al. 2009, Schnable et al. 2009).
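Two of the genomic features mentioned above, GC content and SSRs, can be derived directly from a nucleotide sequence. The following is a minimal sketch in Python, not a method taken from any of the cited resources; the function names (gc_content, find_ssrs) and the thresholds (repeat units of 2–6 bp, at least four copies) are illustrative assumptions, and only perfect tandem repeats are detected.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6, min_repeats: int = 4):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The thresholds are hypothetical defaults chosen for illustration.
    A homopolymer run is reported with a 2-bp unit such as "AA".
    """
    seq = seq.upper()
    # A lazily matched unit followed by at least (min_repeats - 1) exact copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    example = "TTG" + "AC" * 5 + "GGT"
    print(gc_content("GGCCAATT"))  # 0.5
    print(find_ssrs(example))      # [(3, 'AC', 5)]
```

Dedicated SSR-mining tools additionally handle imperfect and compound repeats; the regular expression here is only meant to make the definition of a perfect repeat concrete.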

Large-scale collections of expressed sequence tags (ESTs) and cDNA clones

ESTs are created by partial ‘one-pass’ sequencing of randomly picked gene transcripts that have been converted into cDNA (Adams et al. 1993). Since cDNA and EST collections can be acquired regardless of genomic complexity, this approach has been applied not only to model species but also to a number of applied species with large genome sizes due to polyploidy and/or to their number of repetitive sequences. As of November 2009, there are >63 million ESTs in the National Center for Biotechnology Information (NCBI)’s dbEST, a public domain EST database (http://www.ncbi.nlm.nih.gov/dbEST/) that includes a number of plant species (Table 2) (Boguski et al. 1993).

Table 2

Numbers of ESTs and unified transcripts in plants (November 2009)

Species	No. of ESTs (dbEST)	No. of entries (UniGene)
Physcomitrella patens	382,584	18,870
Picea glauca (white spruce)	299,455	22,472
Picea sitchensis (Sitka spruce)	175,662	18,838
Pinus taeda (loblolly pine)	328,628	18,921
Aquilegia formosa×Aquilegia pubescens	85,039	8,046
Arabidopsis thaliana (thale cress)	1,527,298	30,579
Artemisia annua (sweet wormwood)	85,402	9,462
Brassica napus (rape)	643,601	26,733
Brassica oleracea	59,946	5,617
Brassica rapa (field mustard)	44,570	14,497
Capsicum annuum	116,541	8,868
Citrus clementina	118,365	9,123
Citrus sinensis (Valencia orange)	208,909	15,808
Glycine max (soybean)	1,422,604	33,001
Gossypium hirsutum (upland cotton)	268,786	21,738
Gossypium raimondii	63,577	3,297
Helianthus annuus (sunflower)	133,682	12,216
Lactuca sativa (garden lettuce)	80,781	7,940
Lotus japonicus	195,385	14,493
Malus × domestica (apple)	324,308	23,731
Medicago truncatula (barrel medic)	269,237	18,098
Nicotiana tabacum (tobacco)	317,190	24,069
Populus tremula × Populus tremuloides (hybrid aspen)	76,160	9,652
Populus trichocarpa (western balsam poplar)	89,943	14,965
Prunus persica (peach)	79,203	7,620
Raphanus raphanistrum (wild radish)	164,119	18,788
Raphanus sativus (radish)	83,034	17,649
Solanum lycopersicum (tomato)	296,848	18,228
Solanum tuberosum (potato)	236,568	18,784
Theobroma cacao	159,320	24,958
Vigna unguiculata (cowpea)	187,443	15,740
Vitis vinifera (wine grape)	357,856	22,083
Selaginella moellendorffii	93,806	8,810
Hordeum vulgare (barley)	501,614	23,595
Oryza sativa (rice)	1,249,110	40,978
Panicum virgatum (switchgrass)	436,535	20,973
Saccharum officinarum (sugarcane)	246,892	15,594
Sorghum bicolor (sorghum)	209,814	13,899
Triticum aestivum (wheat)	1,067,290	40,349
Zea mays (maize)	2,018,798	97,123
Chlamydomonas reinhardtii	204,076	11,310
Volvox carteri	132,038	5,638

Because EST data collected from the cDNA libraries of a particular organism consist of redundant sequence data derived from the same gene locus or transcription unit, it is often necessary to perform EST grouping by transcription units and to assemble these groups in order to create a consolidated alignment and representative sequence of each transcript before further analysis. Such steps are performed computationally: a typical work flow consists of ‘base calling’, i.e. converting the output trace of a sequencer to identified nucleotide data, followed by a cleaning step involving identification and removal of contaminating sequences, the masking out of cloning vector sequences, clustering of identical sequences and alignment of clustered sequences (Ewing et al. 1998, Huang and Madan 1999, Masoudi-Nejad et al. 2006). Then, the obtained data sets of representative transcripts can be used as unified transcript data. There are several data resources that provide such unified data sets of plants, such as NCBI-UniGene, PlantGDB, TIGR Plant Gene Index and HarvEST (Feolo et al. 2000, Lee et al. 2005, Close et al. 2007, Duvick et al. 2008). The comprehensive and rapid accumulation of cDNA clones together with large data sets of their sequence tags has become a significant resource for functional genomics. ESTs derived from various kinds of tissues, including tissues from organisms in a range of developmental stages or under stress, could significantly facilitate gene discovery as well as gene structural annotation, large-scale expression analysis, genome-scale intraspecific and interspecific comparative analysis of expressed genes and the design of expressed gene-oriented molecular markers and probes for microarrays (Ogihara et al. 2003, Zhang et al. 2004, Kawaura et al. 2006, Mochida et al. 2006).
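The clustering step of this work flow can be illustrated with a toy example. The sketch below is not the algorithm used by UniGene, CAP3 or any other cited tool; it simply groups ESTs that share at least one exact k-mer, using a union-find structure. The function name cluster_ests, the k-mer length of 12 and the example sequences are all assumptions made for illustration, and reverse-complement matches, sequencing errors and the subsequent assembly step are ignored.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: dict, k: int = 12) -> list:
    """Group EST identifiers whose sequences share at least one exact k-mer.

    ests maps an identifier to its (already cleaned) sequence.
    Returns a sorted list of sorted identifier lists.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> first EST in which it was observed
    for name, seq in ests.items():
        for kmer in kmers(seq.upper(), k):
            if kmer in first_seen:
                root_a, root_b = find(name), find(first_seen[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two clusters
            else:
                first_seen[kmer] = name

    clusters = defaultdict(list)
    for name in ests:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical reads: est1 and est2 overlap by 14 bases, est3 is unrelated.
    example = {
        "est1": "ACGTACGTTAGCCGATAGGCTA",
        "est2": "TAGCCGATAGGCTAGGATCCAT",
        "est3": "GGGTTTCCCAAAGGGTTTCCCA",
    }
    print(cluster_ests(example))  # [['est1', 'est2'], ['est3']]
```

In practice each cluster would then be aligned and assembled into a representative transcript sequence, which is the unit counted in the UniGene column of Table 2.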

Full-length cDNA projects

Although partial cDNAs are useful for rapidly creating catalogs of expressed genes, they are not suitable for further study of gene function. This is because the most popular method for preparing a cDNA library does not provide a full-length cDNA that includes the capped site sequence. The biotinylated cap trapper method, which uses trehalose-thermostabilized reverse transcriptase and is an efficient method for constructing full-length cDNA-enriched libraries, was developed by Hayashizaki’s group at RIKEN about 10 years ago. Full-length cDNA libraries and large-scale sequence data sets of clones have become invaluable resources for life science projects studying various species (Hayashizaki 2003, Imanishi et al. 2004, Maeda et al. 2006, Tanaka et al. 2008, Yamasaki et al. 2008a). The sequence resources derived from full-length cDNAs can also help substantially in identifying transcribed regions in completed or draft genome sequences. In Arabidopsis and rice, full-length cDNA sequences have been used to identify genomic structural features such as transcription units, transcription start sites (TSSs) and transcriptional variants (Seki et al. 2002b, Iida et al. 2004, Itoh et al. 2007, Yamamoto et al. 2009). In species for which we have draft genomes, such as Physcomitrella, soybean and poplar, full-length cDNA clones have been sequenced to help consolidate genomic infrastructure; this should also contribute to gene discovery (Nanjo et al. 2007, Ralph et al. 2008a, Umezawa et al. 2008) (Table 3). Full-length cDNAs are also useful for determining the three-dimensional (3D) structures of proteins by X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy and for functional biochemical analyses of expressed proteins in the molecular interactions of protein–ligands, protein–proteins and protein–DNAs. 
Furthermore, recent advances in proteomics infrastructure require comprehensive data sets of the full-length amino acid sequences used to assign peptides to a protein. These advances also necessitate functional annotations to support systematic knowledge mined for proteins corresponding to identified peptides and for residues modified by, for example, phosphorylation, for use in combination with comparative analyses of modificomic events among species. The full-length cDNA library has also contributed importantly to functional analysis by creating overexpressors used in reverse genetics. The advent of function-based gene discovery by systems such as full-length cDNA overexpressor (FOX) gene hunting, which use full-length cDNA transgenic plants as overexpressors, has provided an effective approach to high-throughput discovery of functional genes associated with phenotypic changes (Ichikawa et al. 2006, Fujita et al. 2007, Kondou et al. 2009).

Table 3

Large-scale collections of full-length cDNA clones in plants

Species	Database	References
Arabidopsis thaliana	http://rarge.gsc.riken.jp/	Seki et al. (2002b)
Citrus species		Marques et al. (2009)
Cryptomeria japonica		Futamura et al. (2008)
Glycine max	http://rsoy.psc.riken.jp/	Umezawa et al. (2008)
Hordeum vulgare	http://www.shigen.nig.ac.jp/barley/	Sato et al. (2009)
Manihot esculenta	http://amber.gsc.riken.jp/cassava/	Sakurai et al. (2007)
Oryza rufipogon		Lu et al. (2008)
Oryza sativa (japonica)	http://cdna01.dna.affrc.go.jp/cDNA/	Kikuchi et al. (2003)
Oryza sativa (indica)	http://www.ncgr.ac.cn/ricd	Liu et al. (2007)
Physcomitrella patens	http://www.brc.riken.go.jp/lab/epd/catalog/p_patens.html	Nishiyama et al. (2003)
Populus nigra	http://rpop.psc.riken.jp/index.pl	Nanjo et al. (2004); Nanjo et al. (2007)
Populus trichocarpa		Ralph et al. (2008a)
Thellungiella halophila		Taji et al. (2008)
Triticum aestivum	http://trifldb.psc.riken.jp/	Kawaura et al. (2009); Ogihara et al. (2004)
Zea mays	http://www.maizecdna.org/	Soderlund et al. (2009)

Recently, full-length enriched cDNA libraries have been constructed for non-sequenced crops or forestry species, such as wheat (Triticum aestivum), barley (Hordeum vulgare), cassava (Manihot esculenta), Japanese cedar (Cryptomeria japonica) and Sitka spruce (Picea sitchensis), as well as for plant species showing specific biological characters such as salt tolerance in salt cress (Thellungiella halophila) (Table 3). These full-length cDNA libraries have been used to identify biological features through comparisons of target sequences with those of model organisms such as Arabidopsis, rice and poplar. These libraries also serve as primary sequence resources for designing microarray probes and as clone resources for genetic engineering to improve crop efficiency (Sakurai et al. 2007, Futamura et al. 2008, Ralph et al. 2008b, Taji et al. 2008). Because of the various key functionalities of full-length cDNA resources in omic space, it is also essential to establish relevant information resources that provide gateways to these resources as well as to integrate related data sets derived from other omics fields and species (Sakurai et al. 2005, Mochida et al. 2009b).

Ultrahigh-throughput DNA sequencing

During the past decade, the Sanger sequencing method has been used to complete sequencing of microbial and higher eukaryote genomes. In recent years, a number of alternative technologies, which are adaptations of methods such as pyrosequencing procedures, massively parallel DNA sequencing or single molecule sequencing, have become available (Margulies et al. 2005, Ansorge 2009). Such new sequencing technologies have opened new opportunities to address questions at the whole-genome level in the fields of comparative genomics, meta-genomics and evolutionary genomics (Varshney et al. 2009).

Whole-genome resequencing

Next-generation sequencing technology coupled with reference genome sequence data allows us to discover variations among individuals, strains and/or populations. Nucleotide polymorphisms are effectively identified by mapping sequence fragments onto a particular reference genome data set, a capability that is of immense importance in all genetic research. A whole-genome resequencing project to discover whole-genome sequence variations in 1,001 strains (accessions) of Arabidopsis will result in a data set that will become a fundamental resource for promoting future genetics studies to identify alleles in association with phenotypic diversity across the entire genome and across the entire species range (http://1001genomes.org/) (Weigel and Mott 2009). In rice, a high-throughput method for genotyping recombinant populations using whole-genome resequencing data generated by the Illumina Genome Analyzer has been reported (Huang et al. 2009). One of the most anticipated innovations for next-generation sequencers is the application to whole-genome de novo sequencing. Although, to date, this approach has been realized only in bacterial genomes (Farrer et al. 2009, Moran et al. 2009), a number of attempts are being made to realize this advance in higher species.
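The principle of identifying nucleotide polymorphisms by mapping sequence fragments onto a reference can be sketched as a simple pileup. The example below is an illustration only and does not reproduce the pipelines of the cited projects: it assumes that reads have already been aligned without gaps (each read is given as a 0-based start position plus its sequence), and the function name call_snps and the thresholds min_depth and min_fraction are hypothetical choices. Real variant callers also model base quality, mapping quality, indels and heterozygosity.

```python
from collections import Counter, defaultdict


def call_snps(reference: str, aligned_reads, min_depth: int = 3,
              min_fraction: float = 0.8):
    """Report positions where the majority read base differs from the reference.

    aligned_reads is an iterable of (start, read_sequence) pairs with
    0-based start positions and ungapped alignments (an assumption).
    Returns a list of (position, reference_base, variant_base, depth).
    """
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for start, read in aligned_reads:
        for offset, base in enumerate(read.upper()):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (depth >= min_depth
                and base != reference[pos].upper()
                and support / depth >= min_fraction):
            snps.append((pos, reference[pos], base, depth))
    return snps


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    # Three hypothetical reads that all carry T instead of A at position 4.
    reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (3, "TTCGT")]
    print(call_snps(ref, reads))  # [(4, 'A', 'T', 3)]
```

Requiring a minimum depth and a minimum supporting fraction is a crude way to separate true variants from sequencing errors, which is why deeper coverage generally yields more reliable polymorphism calls.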

Comprehensive discovery of small RNAs (sRNAs)

In plants, sRNAs, including microRNAs (miRNAs), short interfering RNAs (siRNAs) and trans-acting siRNAs (ta-siRNAs), play crucial roles in epigenetic processes and gene networks involved in development and homeostasis (Ruiz-Ferrer and Voinnet 2009). These RNA molecules are important targets that should be comprehensively identified, and their expression should be analyzed using next-generation genomic technologies (Nobuta et al. 2007, Chellappan and Jin 2009). In maize, sRNAs in the wild type and in the isogenic mop1-1 loss-of-function mutant were analyzed by deep sequencing using Illumina’s sequencing-by-synthesis (SBS) technology to characterize the complement of maize sRNA (Nobuta et al. 2008). In poplar, expressed sRNAs from leaves and vegetative buds were discovered using high-throughput Roche 454 pyrosequencing; subsequently, the genes of miRNA families, including novel ones, were identified (Barakat et al. 2007). Deep sequencing of Brachypodium sRNAs at the global genome level has also been performed, resulting in identification of miRNAs involved in the cold stress response (J. Zhang et al. 2009). The plant miRNA database (PMRD) is a useful information resource on plant miRNA and is available on the Web (http://bioinformatics.cau.edu.cn/PMRD/) (Z. Zhang et al. 2009).

Resources for variation analysis

Recent innovations related to DNA sequencing technology and the rapid growth of genome and cDNA sequence resources allow us to design various types of molecular markers covering entire genomes (Feltus et al. 2004). For high-throughput genotyping, a number of platforms have been developed that have been applied to genetic map construction, marker-assisted selection and QTL cloning using multiple segregation populations (Hori et al. 2007). Such genotyping systems have also been used in post-genome sequencing projects such as genotyping of genetic resources, accessions to evaluate population structure and association studies to identify genetic loci involved in phenotypic changes of species. This recent expansion of analysis platforms addressing genome-wide polymorphisms provides an essential resource in the ‘variome’ study of plants.

Molecular markers

Accumulation and saturation of available genetic markers contribute to advances in marker-assisted genetic studies and are important resources with a wide range of applications. Genetic markers designed to cover a genome extensively allow not only identification of individual genes associated with complex traits by QTL analysis but also the exploration of genetic diversity with regard to natural variations (Feltus et al. 2004, Varshney et al. 2005, Caicedo et al. 2007). With the progress of genome sequencing and large-scale EST analysis in various species, these sequence data sets have become efficient sequence resources for designing molecular markers. A number of attempts to design polymorphic markers from accumulated sequence data sets have been made for various species. Several genome-wide rice (Oryza sativa) DNA polymorphism data sets have been constructed based on alignment between japonica and indica rice genomes (Han and Xue 2003, Shen et al. 2004). Large-scale EST data sets are also important resources for discovery of sequence polymorphisms, especially for allocating expressed genes onto a genetic map. Therefore, computational discovery of EST-based single-nucleotide polymorphisms (SNPs) and/or EST-SNP markers for the purpose of identifying sequence-tagged site (STS) markers has progressed for numerous species, including barley, wheat, maize, melon, Brassica, common bean and sunflower (Kota et al. 2001, Kantety et al. 2002, Kota et al. 2003, Torada et al. 2006, Heesacker et al. 2008, Kota et al. 2008, Blair et al. 2009, Deleu et al. 2009, Kaur et al. 2009, Li et al. 2009). Several databases provide information on molecular markers in plant species. PlantMarkers is a genetic marker database that contains predicted molecular markers, such as SNP, SSR and conserved ortholog set (COS) markers, from various plant species (Heesacker et al. 2008).
GrainGenes is a popular site for Triticeae genomics; it also provides genetic markers and linkage map data on wheat, barley, rye and oat (Carollo et al. 2005). Gramene is a database for plant comparative genomics that provides genetic maps of various plant species (Liang et al. 2008). The Triticeae Mapped EST database (TriMEDB) provides information regarding mapped cDNA markers that are related to barley and their wheat homologs (Mochida et al. 2008).

Platforms for variation analyses

High-throughput polymorphism analysis is an essential tool for facilitating any genetic map-based approach. So far, genome-wide genotyping using a hybridization-based SNP typing method has been used to analyze representative Arabidopsis ecotypes and rice strains, and the data sets containing the calculated genome-wide variation patterns for each species have been released. As typified by the Arabidopsis 1,001 project, genome-wide variation study is a key analysis that should be performed after genome sequencing has been completed for a particular reference strain. Therefore, the demand for high-throughput and cost-effective platforms for comprehensive variation analysis (also called variome analysis) has rapidly increased. As we have already mentioned, whole-genome resequencing approaches are already being realized as a direct solution for variome analysis in species whose reference genome sequence data are available. Diversity Array Technology (DArT) is a high-throughput genotyping system that was developed based on a microarray platform (http://www.diversityarrays.com/index.html) (Jaccoud et al. 2001, Wenzl et al. 2007). In various crop species such as wheat, barley and sorghum, DArT markers have been used together with conventional molecular markers to construct denser genetic maps and/or to perform association studies (Crossa et al. 2007, Peleg et al. 2008, Mace et al. 2009). In barley and wheat, Affymetrix GeneChip Arrays have been used to discover nucleotide polymorphisms as single-feature polymorphisms based on the differential hybridization of GeneChip probes (Rostoks et al. 2005, Bernardo et al. 2009). The Illumina GoldenGate Assay allows the simultaneous analysis of up to 1,536 SNPs in 96 samples and has been used to analyze genotypes of segregation populations in order to construct genetic maps allocating SNP markers in crops such as barley, wheat and soybean (Hyten et al. 2008, Akhunov et al. 2009, Close et al. 2009).

Transcriptome resources in plants

Comprehensive and high-throughput analysis of gene expression, called transcriptome analysis, is also a significant approach to screen candidate genes, predict gene function and discover cis-regulatory motifs. Hybridization-based methods, such as those used in microarrays and GeneChips, have been well established for acquiring large-scale gene expression profiles for various species. The recent rapid accumulation of large-scale gene expression data sets, together with databases capable of hosting such large repositories, has given us access to large amounts of information in the public domain. These public domain data are an efficient and valuable resource for many secondary uses, such as co-expression and comparative analyses. Furthermore, as a next-generation DNA sequencing application, deep sequencing of short fragments of expressed RNAs, including sRNAs, is quickly becoming an efficient tool for use with genome-sequenced species (Harbers and Carninci 2005, de Hoon and Hayashizaki 2008).

Sequence tag-based platforms in transcriptomics

Large-scale sequencing of ESTs from cDNA libraries was an early approach for acquiring transcriptome profiles. In this approach, ESTs that are randomly sequenced in an unbiased cDNA library are classified into clusters of transcript sequences using sequence-clustering and/or assembling methods. Then, the abundance of transcripts expressed in each tissue is estimated by counting the number of ESTs with identifiers for each cDNA library and/or each sequence cluster. The same methodological principle has been applied in human and mouse in the form of a ‘body map’ to derive the transcriptome in various organs (Hishiki et al. 2000, Kawamoto et al. 2000, Ogasawara et al. 2006). Moreover, this principle has also been used in the digital differential display (DDD) tool, which is a component of NCBI’s UniGene database and has been applied in large-scale cDNA projects for various species, including plants (Mochida et al. 2003, Fei et al. 2004, Sterky et al. 2004, Zhang et al. 2004). Although this approach, coupled with cDNA clone resources, has facilitated gene discovery and expression profiling, its disadvantages include cost and limited resolution due to large-scale sequencing. Serial analysis of gene expression (SAGE) is a method based on deep sequencing of short read cDNA tags. SAGE allows the identification of a large number of transcripts present in tissues and enables quantitative comparison of transcriptomes (Velculescu et al. 1995). SAGE is designed to generate a short specific tag (13–15 bp) from the 3′ end of each mRNA present in a sample, after which >10 tags are concatenated and cloned to generate a SAGE library. The sequencing of selected clones from the SAGE library allows efficient collection of transcript tag sequences. A data set of genome sequences or large-scale ESTs is required to identify genes corresponding to each SAGE tag. Some derivatives of the original protocol (MAGE, SADE, microSAGE, miniSAGE, longSAGE, superSAGE, deepSAGE, 5′ SAGE, etc.) 
have been developed to improve and expand the utility of SAGE (Hashimoto et al. 2004, Anisimov 2008). For example, superSAGE is an improved version of SAGE that produces 26 bp fragment tags from cDNAs. This method has been applied to simultaneous and quantitative gene expression profiling of both host cells and their eukaryotic pathogens in rice (Matsumura et al. 2003). The 26 bp superSAGE tags have also been used to design probes directly for oligo microarrays (Matsumura et al. 2008). Another sequencing-based technology is massively parallel signature sequencing (MPSS). MPSS uses a unique method to quantify gene expression levels; it generates millions of short sequence tags per library by sequencing 16–20 bp from the 3′ side of cDNA using a microbead array (Brenner et al. 2000). Databases containing MPSS data on plant species, including Arabidopsis, rice, grape and Magnaporthe grisea (the rice blast fungus), are available online (http://mpss.udel.edu) (Nakano et al. 2006). In addition, the MPSS method has also been used to perform genome-scale discovery and expression profiling of sRNAs in Arabidopsis and rice (Lu et al. 2006, Nobuta et al. 2007). CT-MPSS is a recently developed method for quantitative analysis of the 5′ ends of transcripts that is coupled with the cap-trapper method for full-length cDNA cloning. This method has been applied to perform high-density mapping of TSS in Arabidopsis in order to identify plant promoters on a genome scale (Yamamoto et al. 2009). The data set of Arabidopsis CT-MPSS tags is accessible from ppdb (http://www.ppdb.gene.nagoya-u.ac.jp), a plant promoter database that provides promoter annotation of Arabidopsis and rice (Yamamoto and Obokata 2008).
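The counting principle shared by EST, SAGE and MPSS profiling (estimating transcript abundance from the number of tags observed per library) can be illustrated with a minimal Python sketch. The tag sequences, library names, function names and the pseudocount are hypothetical choices made for illustration only; they are not taken from any of the tools or databases cited above.

```python
from collections import Counter

def tags_per_million(tags):
    """Normalize raw tag counts from one library to tags per million (TPM)."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count * 1_000_000 / total for tag, count in counts.items()}

def fold_change(tpm_a, tpm_b, pseudocount=1.0):
    """Ratio of normalized abundance (library B over library A) for each tag.

    The pseudocount is an assumption added here so that tags absent from
    one library do not cause a division by zero.
    """
    all_tags = set(tpm_a) | set(tpm_b)
    return {
        tag: (tpm_b.get(tag, 0.0) + pseudocount) / (tpm_a.get(tag, 0.0) + pseudocount)
        for tag in all_tags
    }

# Hypothetical toy libraries: each string stands for one sequenced tag.
leaf = ["CATGAAAC"] * 6 + ["CATGTTTG"] * 3 + ["CATGCCCA"] * 1
root = ["CATGAAAC"] * 2 + ["CATGTTTG"] * 3 + ["CATGGGGT"] * 5

leaf_tpm = tags_per_million(leaf)
root_tpm = tags_per_million(root)
changes = fold_change(leaf_tpm, root_tpm)
```

In practice each tag would first be assigned to a gene by matching it against genome or EST sequences, as noted above, and a statistical test would normally be applied before a difference between libraries is interpreted.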

Hybridization-based platforms in transcriptomics

The history of DNA microarray began with a paper from the P. O. Brown laboratory at Stanford University in 1995 (Schena et al. 1995). Since then, microarray- and DNA chip-related technologies have advanced rapidly and their application has expanded to a wide variety of life sciences disciplines. The methodological principle of DNA microarray or GeneChip analysis is to acquire a comprehensive data set of the molecular abundance of each molecule in a given sample based on its simultaneous hybridization with large numbers of DNA molecular species immobilized on a glass slide or on a silicon chip used as a probe set. DNA microarrays can be classified into two major types: (i) the ‘spotting’ type, which was developed at Stanford University; and (ii) the ‘on-chip synthesis’ type based on manufactured probes. The spotting type was widely used during the early years of transcriptome research. This method entailed preparing DNA microarrays by spotting a cDNA solution onto a glass slide. The on-chip (in situ) oligo synthesis-based method is a light-directed chemical synthesis process that combines solid-phase chemical synthesis with photolithographic fabrication techniques. Initially, this method was employed only in conjunction with the Affymetrix-manufactured GeneChip Array system. In the Affymetrix GeneChip system, a known gene or potentially expressed sequence is represented on the chip by 11–20 unique oligomeric probes that are each 25 bases in length. Roche NimbleGen and Agilent Technology offer platforms to manufacture high-density DNA arrays based, respectively, on Roche’s proprietary Maskless Array Synthesizer (MAS) technology and on a non-contact industrial inkjet printing process, both of which are also used for in situ oligo synthesis. With the recent and rapid increase in the number of sequenced species in whole-genome and/or large-scale cDNA clones, a number of DNA microarrays have also been developed for transcriptome analysis in various plant species. 
For example, Seki and co-workers designed a custom DNA microarray that uses 7,000 full-length cDNA clones of Arabidopsis as probes and successfully screened for genes responsive to abiotic stresses using a two-color method (Seki et al. 2002a). With the recent increase in commercially available DNA microarrays, laboratories are able to use a particular DNA microarray design to obtain transcriptome data from many experiments in order to accumulate a more comprehensive resource for organism-specific transcriptome data. AtGenExpress was a multinational effort designed to uncover the transcriptome of A. thaliana. The data sets collected in AtGenExpress have been one of the most comprehensive resources for the Arabidopsis transcriptome to date (Kilian et al. 2007, Goda et al. 2008). NCBI’s Gene Expression Omnibus (GEO) and the European Bioinformatics Institute (EBI)’s ArrayExpress have been serving as the primary archives of transcriptome data in the public domain (Parkinson et al. 2007, Barrett et al. 2009). There are also several more focused databases that provide calculated transcriptome data with user-friendly interfaces and annotations on probes. ATTED-II (http://atted.jp/) is a database that provides co-expression analysis data calculated from publicly available Arabidopsis ATH1 GeneChip data (Obayashi et al. 2007, Obayashi et al. 2009). Co-expression analysis data sets generated from comprehensively collected transcriptome data sets have become an efficient resource capable of facilitating the discovery of genes closely correlated in their expression patterns. Genevestigator (https://www.genevestigator.com/gv/index.jsp), which is a reference expression database and meta-analysis system, also provides summary information from hundreds of microarray experiments on various organisms, including Arabidopsis, barley and soybean, with easily interpretable results (Zimmermann et al. 2004). 
The electronic fluorescent pictograph (eFP) browser provides gene expression patterns collected from Arabidopsis, poplar, Medicago, rice and barley via a user-friendly interface on the Web (http://www.bar.utoronto.ca/) (Winter et al. 2007). The Arabidopsis Gene Expression Database AREX is a database that provides data sets of high-resolution gene expression patterns of root tissues in Arabidopsis (http://www.arexdb.org/index.jsp) (Birnbaum et al. 2003, Brady et al. 2007). The RICEATLAS is a database housing rice transcriptome data covering various types of tissues (http://bioinformatics.med.yale.edu/riceatlas/) (Jiao et al. 2009). Tiling arrays, which consist of high-density oligonucleotide probes spanning the entire genome of a particular organism, are a platform for analyzing expressed regions throughout a whole genome; they are an effective means of discovering novel genes and elucidating their structure. Seki and co-workers performed transcriptome analysis in Arabidopsis under abiotic stress conditions using a whole-genome tiling array and discovered a number of antisense transcripts induced by abiotic stresses (Matsui et al. 2008). The A. thaliana Tiling Array Express (At-TAX) is a whole-genome tiling array resource for developmental expression analysis and transcript identification in Arabidopsis (Laubinger et al. 2008, Zeller et al. 2009). The usefulness of tiling arrays has recently been extended by coupling this platform with the immunoprecipitation method. For example, the binding sites of AGAMOUS-Like15 (AGL15), a MADS domain transcriptional regulator promoting somatic embryogenesis, were identified using a chromatin immunoprecipitation (ChIP) approach coupled with the Affymetrix tiling array for Arabidopsis. This method found approximately 2,000 sites (Zheng et al. 2009). 
Using the methylcytosine immunoprecipitation (mCIP) method in combination with the Arabidopsis tiling array, a comprehensive DNA methylation map of the genome was constructed as an Arabidopsis methylome data set (Zhang et al. 2006). Sequencing of the DNA co-precipitated with a protein using a next-generation sequencer, known as ‘ChIP-seq’, has also become an alternative approach (Park 2009).
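The idea behind the co-expression analysis described above (finding genes whose expression patterns are closely correlated across many experiments) can be sketched in a few lines of Python. This is a minimal illustration that assumes a Pearson correlation coefficient and an arbitrary threshold of 0.9; it is not the procedure used by any particular database, and the gene names and expression values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles.

    Assumes both profiles have the same length and non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpressed_partners(profiles, query, threshold=0.9):
    """Return (gene, r) pairs whose correlation with the query is >= threshold."""
    partners = []
    for gene, values in profiles.items():
        if gene == query:
            continue
        r = pearson(profiles[query], values)
        if r >= threshold:
            partners.append((gene, r))
    # Strongest correlation first.
    return sorted(partners, key=lambda pair: pair[1], reverse=True)

# Hypothetical expression values of four genes across five experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
partners = coexpressed_partners(profiles, "geneA")
```

With real data, the profiles would come from normalized microarray experiments, and the ranking would typically be computed for every gene pair rather than for a single query.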

Platforms and resources in proteomics

As genome sequencing projects for several organisms have been completed, proteome analysis, which is the detailed investigation of the functions, functional networks and 3D structures of proteins, has gained increasing attention. Large-scale proteome data sets are also an important resource for the better understanding of protein functions in cellular systems, which are controlled by the dynamic properties of proteins. These properties reflect cell and organ states in terms of growth, development and response to environmental changes. The primary objective of functional proteomics was the high-throughput identification of all of the proteins present in cells and/or tissues. Recent, rapid technical advances in proteomics (e.g. protein separation and purification methods, advances in mass spectrometry equipment and methodological developments in protein quantification) have allowed us to progress to the second generation of functional proteomics, including quantitative proteomics, subcellular proteomics and the analysis of various modifications and protein–protein interactions (Rossignol et al. 2006, Jorrin-Novo et al. 2009, Yates et al. 2009). The different Web-accessible plant proteome-related databases are summarized on the proteomics subcommittee of the Multinational Arabidopsis Steering Committee (MASCP) Web site (http://www.masc-proteomics.org/) under the heading of “Proteomic Databases and Resources”.

Proteome profiling

The typical experimental workflow of protein profiling can be summarized as protein sample preparation, separation and detection, then identification. Various technical advances for each step of the process have greatly increased the overall performance of plant proteomics (Jorrin-Novo et al. 2009). Sample preparation is the most critical step in any proteomics experiment. The method that uses trichloroacetic acid (TCA) and acetone is the most commonly used procedure for protein precipitation. A method using phenol and NH4OAc/MeOH is also popular for plant tissues. Sample fractionation effectively improves protein detection and increases proteome coverage by reducing sample complexity. Sequential solubilization is an efficient method for fractionating protein samples based on solubility, molecular mass and isoelectric point. By using a series of different reagents to separate proteins by their different solubilities, sequential solubilization is also an effective way to reduce the complexity of proteins in each fraction and to enrich rare proteins (Agrawal et al. 2005). One-dimensional SDS–PAGE has been widely used to fractionate complex proteins based on their molecular masses. For high-resolution separation of proteins, two-dimensional gel electrophoresis (2-DE), which uses isoelectric focusing (IEF) as the first dimension and SDS–PAGE as the second dimension, is an effective method. Furthermore, the later development of the immobilized pH gradient (IPG)-IEF as the first dimension has improved reproducibility and resolution. The 2-DE methods have been widely used in proteomics in various species (Islam et al. 2004, Mechin et al. 2004, Chen and Harmon 2006), and databases housing 2-DE information have been developed and released [e.g. the Swiss Institute of Bioinformatics’ Expasy SWISS-2DPAGE database (http://au.expasy.org/ch2d/) and the Kazusa DNA Research Institute’s Cyano2Dbase (http://bacteria.kazusa.or.jp/cyano_legacy/Synechocystis/cyano2D/index.html)]. 
Chromatography-based separation methods such as gel filtration chromatography, ion exchange chromatography and affinity chromatography are also effective in separating proteins based on their physicochemical properties. To identify each protein found in a prepared sample, peptide mass fingerprinting has been widely employed. Currently the most efficient method available consists of two steps: (i) enzymatic digestion of separated proteins into peptides; and (ii) accurate mass measurements of peptides using mass spectrometry (MS). In-gel digestion methods have been widely used for protein samples separated by 2-DE. With its rapid technical advances, MS continues to play an important role in proteomics. MS equipment consists of a source to ionize samples and a mass spectrometer(s) to detect the ionized samples. In proteomics, usually the matrix-assisted laser desorption ionization (MALDI) method or the electrospray ionization (ESI) method is applied to ionize sample peptides. The MALDI method is typically used in combination with time of flight (TOF) MS as MALDI-TOF-MS, while the ESI method is usually used in combination with quadrupole (Q) or ion trap (IT) MS. Recently, hybrid MS instruments such as Q-TOF MS, IT-TOF MS and MALDI Q-TOF MS have become popular. Furthermore, ion fragmentation by collision-induced dissociation (CID) using tandem MS such as Q-TOF MS or by post-source decay (PSD) using MALDI-TOF MS has been applied to determine peptide amino acid sequences. To identify target proteins, the peptide mass fingerprint data obtained are searched against a database of theoretically predicted masses of known amino acid sequences (Hirano et al. 2004, Newton et al. 2004). In addition to conventional gel electrophoresis-based separation, the gel-free separation method is often used, particularly in the ‘shotgun proteomics’ approach. In the gel-free method, the protein mixture is directly digested into peptides and separated by the multidimensional separation method. 
The multidimensional separation method is a combination of different online separation methods including multidimensional protein identification technology (MudPIT). The shotgun approach is suitable for the analysis of proteins that are difficult to separate by 2-DE as well as for high-throughput analysis by automated analytical instruments (Yates et al. 2009). Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) possesses high resolution, high sensitivity, high dynamic range and high mass measurement accuracy. The high resolution and precision of FT-ICR MS allows us to carry out ‘top-down proteomics’ in which an intact protein mixture is analyzed directly, without separation (Bogdanov and Smith 2005).
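The two-step logic of peptide mass fingerprinting described above (enzymatic digestion into peptides, then matching measured peptide masses against theoretically predicted masses) can be sketched as follows. This is a simplified illustration under several assumptions: trypsin is modeled by the common rule of cleaving after K or R unless the next residue is P, missed cleavages and modifications are ignored, neutral monoisotopic masses are used rather than the m/z values an instrument would report, and the protein names, sequences and tolerance are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def match_score(observed, sequence, tolerance=0.01):
    """Number of observed masses explained by the predicted peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)

def identify(observed, database, tolerance=0.01):
    """Return the database entry with the highest match score."""
    return max(database, key=lambda name: match_score(observed, database[name], tolerance))

# Hypothetical two-entry sequence database and a simulated measurement.
database = {"protA": "MKTAYRPLGK", "protB": "GASKVLLR"}
observed = [peptide_mass("MK"), peptide_mass("TAYRPLGK")]
best_hit = identify(observed, database)
```

Real search engines additionally score the statistical significance of a match, which this sketch does not attempt.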

Quantitative proteomics

Comprehensive quantification of each protein’s abundance is quite important for a better understanding of the protein dynamics regulated in response to cellular state and environmental changes. A quantitative proteome approach also plays a crucial role in the discovery of key proteomic changes, including expression, interaction and modification, that are associated with genetic variations and/or visible phenotypic changes (Gstaiger and Aebersold 2009). Difference gel electrophoresis (DIGE) is a popular method for differential display of proteins for quantitative protein comparison. In DIGE, protein samples are labeled with different fluorescent dyes before 2-D electrophoresis, enabling accurate analysis of differences in protein abundance between samples (Rossignol et al. 2006). This method is an effective way to remove gel to gel variation while significantly increasing accuracy and reproducibility. Isotope-coded affinity tags (ICATs), isobaric tags for relative and absolute quantitation (iTRAQ) and stable isotope labeling with amino acids in cell culture (SILAC) are widely used methods for protein differential display using stable isotope labeling (Jorrin-Novo et al. 2009). Using a single MS/MS analysis, corresponding peptides from each sample are differentially detectable based on mass shift caused by the differential isotopes; this allows comparison of the relative abundance of the two samples. Recently, label-free quantitative techniques have been developed to facilitate high-throughput comparisons of proteomic expression. For label-free quantification, the proteomes from each of two samples are separately analyzed using liquid chromatography (LC)-MS/MS. Then, each MS1 spectrum is aligned to calculate relative protein abundance changes based on ion intensity changes such as peptide peak areas or peak heights in chromatography. Finally, MS/MS analysis is used to identify peptides (Gstaiger and Aebersold 2009).
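The label-free comparison outlined above can be illustrated with a minimal sketch. It assumes that peak alignment has already been done, so that each run is simply a mapping from peptide to peak area; the total-intensity normalization, the use of the median peptide ratio as the protein-level estimate, and all names and values are assumptions made for illustration rather than a description of any published pipeline.

```python
from math import log2
from statistics import median

def normalize(run):
    """Scale peak areas so that each run sums to 1 (total-intensity normalization)."""
    total = sum(run.values())
    return {peptide: area / total for peptide, area in run.items()}

def protein_log2_ratios(run_a, run_b, peptide_to_protein):
    """Median log2(B/A) of peptide peak areas, grouped by protein.

    Only peptides detected in both runs are used; peak areas are assumed
    to be strictly positive.
    """
    norm_a, norm_b = normalize(run_a), normalize(run_b)
    per_protein = {}
    for peptide in norm_a.keys() & norm_b.keys():
        ratio = log2(norm_b[peptide] / norm_a[peptide])
        per_protein.setdefault(peptide_to_protein[peptide], []).append(ratio)
    return {protein: median(values) for protein, values in per_protein.items()}

# Hypothetical aligned peak areas from two LC-MS runs.
control = {"pep1": 100.0, "pep2": 200.0, "pep3": 300.0, "pep4": 400.0}
treated = {"pep1": 200.0, "pep2": 400.0, "pep3": 150.0, "pep4": 250.0}
peptide_to_protein = {"pep1": "P1", "pep2": "P1", "pep3": "P2", "pep4": "P2"}

ratios = protein_log2_ratios(control, treated, peptide_to_protein)
```

A positive value indicates higher relative abundance in the second run. In practice, replicate runs and a statistical test would be needed before a change is called significant.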

Subcellular proteomics

Large-scale proteome analysis of cell organelles is essential for understanding the enzymatic inventory of a cell organelle; the compartmentalization of metabolic pathways; cellular logistics such as protein targeting, trafficking and regulation; and proteomic dynamics at the organelle level caused by changes in cellular systems (Andersen and Mann 2006, Chen and Harmon 2006, Baginsky 2009). A number of approaches have been applied to analyze the proteome of organelles or subcellular compartments of plant cells such as chloroplasts, etioplasts, amyloplasts, chromoplasts, mitochondria, vacuoles, plasma membranes, nuclei, peroxisomes, cytosolic ribosomes and cell walls (Baginsky 2009). Proteomic analyses of chloroplasts, mitochondria and their further fractions have been carried out to determine detailed localizations of proteins in different suborganelle compartments. Techniques for quantitative proteomics, such as the ICAT and iTRAQ methods described above, are also effective for acquiring quantitative data on proteomes in each organelle. In Arabidopsis, rice and alga, differential proteome profiles of plant plasma membranes were monitored to identify those proteins differentially expressed in response to environmental factors such as cold acclimation, salt stress and bacterial elicitor (Benschop et al. 2007, Katz et al. 2007, Cheng et al. 2009, Minami et al. 2009). Several databases provide subcellular proteome information. The rice proteome database (http://gene64.dna.affrc.go.jp/RPD/) is a 2-DE image database for rice that contains data from various tissues as well as subcellular compartments (Komatsu 2005). The Nottingham Arabidopsis Stock Centre (NASC) Proteomics database (http://proteomics.arabidopsis.info/) and the SUB-cellular location database for Arabidopsis proteins (SUBA) (http://suba.plantenergy.uwa.edu.au/) provide subcellular proteome analysis data for Arabidopsis (Dunkley et al. 2006). 
The soybean proteome database (http://proteome.dc.affrc.go.jp/cgi-bin/2d/2d_view_map.cgi) also provides 2-DE data for various tissues as well as for subcellular compartments (Sakata et al. 2009).

Post-translational protein modifications

Comprehensive approaches to investigate various kinds of post-translational protein modifications also play a key role in the current study of proteomics. This field, which is also called modificome research, aims to identify modified proteins and to elucidate the role of each functional protein modification in relation to its associated biological event (Kwon et al. 2006). Protein phosphorylation is a critical regulatory step in signaling networks and is a widespread protein modification affecting most basic cellular processes in eukaryotic organisms (Schmelzle and White 2006). Advances in MS-based technologies, accompanied by phosphopeptide enrichment techniques, have allowed us to perform high-throughput, large-scale in vivo phosphorylation site mapping. So far, several plant phosphoproteome studies have been reported (Nuhse et al. 2003, Nuhse et al. 2004, de la Fuente van Bentem et al. 2006, Benschop et al. 2007, Nuhse et al. 2007, Sugiyama et al. 2008). For example, the proteome-wide mapping of in vivo phosphorylation sites in Arabidopsis has recently been achieved by using complementary phosphopeptide enrichment techniques coupled with high-accuracy LC-MS/MS with a Finnigan LTQ-Orbitrap (Sugiyama et al. 2008). The Arabidopsis Protein Phosphorylation Site Database (PhosPhAt) provides information on Arabidopsis phosphorylation sites that were identified by MS by different research groups (http://phosphat.mpimp-golm.mpg.de/). The Plant Protein Phosphorylation Database (P3DB) (http://www.p3db.org/) is an information resource for protein phosphorylation data from multiple plants (Gao et al. 2009). Ubiquitination of proteins is also one of the major post-translational modifications occurring in eukaryotic cells. Protein ubiquitination is a key regulatory mechanism that controls protein abundance, localization and activity. Several large-scale analyses of protein ubiquitination in plants have been reported (Maor et al. 2007, Manzano et al. 2008, Igawa et al. 2009). For example, in Arabidopsis, affinity purification using an anti-ubiquitin antibody followed by MS/MS analysis has been performed to identify ubiquitinated proteins (Igawa et al. 2009).

Platform advances in structural proteomics

Large-scale data sets of protein 3D structures are also crucial information resources for elucidating relationships between protein functions and structures or for analyzing molecules in protein complexes. The International Structural Genomics Organization (ISGO, http://www.isgo.org) was formed to facilitate global structural genomics research efforts (Stevens et al. 2001). The key centers for structural genomics have been the RIKEN Structural Genomics/Proteomics Initiative (RSGI) in Japan, the Protein Structure Initiative (PSI) in the USA and the structural genomics centers of Europe (Yokoyama et al. 2000). International efforts to determine protein structures have contributed to increases in the number of solved protein structures. Thus, the number of solved protein structures appearing in the protein data bank, PDB (http://www.pdb.org/pdb/home/home.do), which is the most popular resource for biomolecule structure data sets, has dramatically increased during the past decade (Kouranov et al. 2006). The PSI has promoted large-scale attempts to determine the 3D structures of proteins. The PSI-1 was started as a pilot phase in 2000 with the aim of promoting the strategic development of tools and infrastructures necessary for large-scale determination of protein structures. The PSI-1 centers developed a systematic workflow pipeline encompassing target selection, cloning, expression and purification, followed by crystallization, X-ray crystallography or NMR methods to solve structures. The final step of this pipeline consisted of structure deposition into a database. In 2005, the PSI shifted to its second phase, which was known as PSI-2. The goal of PSI-2’s ‘production phase’ is to solve more challenging structures such as protein complexes and integral membrane proteins (Fox et al. 2008). The RSGI has solved >2,700 protein structures, including 33 from Arabidopsis that appear in the PDB (http://www.rsgi.riken.go.jp/rsgi_e/index.html). 
Although methodological bottlenecks still exist in structural proteomics, some methodological advances have played an important role in this field. One of the major bottlenecks is the production of soluble and folded proteins. Most centers for structural genomics use Escherichia coli cells for protein production in their automated pipelines as an application of the cell-based method. Cell-free expression systems have also become important as a way to address several limitations of cell-based methods, such as protein quality and quantity as well as throughput issues. The E. coli cell-free system has been applied to amino acid-selective stable isotope labeling of proteins for NMR spectroscopy (Yabuki et al. 1998, Kigawa et al. 1999). The wheat germ embryo cell-free system has also been developed as a eukaryotic cell-free system and has the advantage of producing multidomain proteins (Madin et al. 2000, Endo and Sawasaki 2003, Endo and Sawasaki 2006). The wheat germ cell-free system has since been incorporated into a robotic automation platform (Cell-Free Sciences Co. Ltd.). A comparative study of protein production from 96 Arabidopsis open reading frames (ORFs) using cell-based and cell-free systems was reported by the Center for Eukaryotic Structural Genomics (CESG) group in 2005 (Tyler et al. 2005). The technology and platform of NMR spectroscopy has also played an important role in structural proteomics. The cell-free systems of E. coli and wheat germ embryo in combination with selected amino acid labeling, as described above, have produced synergistic advances in the promotion of automated protein structure determination using solution NMR methods (Yokoyama 2003). Furthermore, high-resolution multidimensional solid state NMR methods used in combination with cross polarization (CP), magic angle spinning (MAS) and dipolar decoupling (DD) are also becoming the methods of choice for structural analysis of membrane proteins by NMR platforms (Castellani et al. 
2002, McDermott 2009). Furthermore, recent hardware improvements in NMR probes, including the cryoprobe for improved sensitivity, the micro-coil probe for sample mitigation and the flow-probe designed to shorten preparation time, provide us with the opportunity to use NMR methods to screen ligands that bind to a particular protein. X-ray crystallography has been used to determine the protein structures of almost 90% of the protein entries in the PDB (http://www.rcsb.org/pdb/static.do?p=general_information/pdb_statistics/index.html). In particular, the beamlines of third-generation X-ray synchrotrons have become an essential infrastructure for macromolecular crystallography (MX), which is used to determine the 3D structures of macromolecules (such as large proteins and protein complexes) (Samatey et al. 2001). For example, the synchrotron SPring-8 of RIKEN in Japan is used to determine the structures of important membrane proteins and protein complex supermolecules such as Ca2+-ATPase, rhodopsin and flagellin (Palczewski et al. 2000, Samatey et al. 2001, Toyoshima et al. 2003). By using existing analytical platforms for structural proteomics, the structures of many representative DNA-binding domains (DBDs), namely AP2/ERF, NAC, WRKY, B3 and SBP, of plant-specific transcription factor (TF) families have thus far been determined (Yamasaki et al. 2004, Yamasaki et al. 2008b).

Information resources in structural proteomics

Bioinformatics and related databases are also essential tools for advancing the study of structural proteomics. The methods for computational prediction of protein 3D structure are mainly classified into two types: template-based modeling (TBM) and free modeling (FM) (Zhang 2008, Zhang 2009b). Free modeling, which is also called ‘ab initio’ or ‘de novo’ modeling, is used to predict the 3D structure of proteins without any information on previously solved structures. A number of Web servers and computational tools for free and/or template-based modeling have recently been made available; for example, the I-TASSER internet service, which is used in Critical Assessment of Techniques for Protein Structure Prediction (CASP), was released (Zhang 2009a). Template-based modeling is a comparative method that uses evolutionarily related proteins of known structure as templates. There are many Web services and tools (e.g. Swiss Institute of Bioinformatics’ SWISS-MODEL server) to support template-based modeling (Schwede et al. 2003). Databases housing structures previously predicted from amino acid sequences by template-based modeling for a wide range of species also exist: the Genomes TO Protein structures and functions (GTOP) database (http://spock.genes.nig.ac.jp/~genome/gtop.html) provides information on protein structures and functions obtained through the application of various computational tools for structure prediction and annotation from the amino acid sequences deduced from annotated genes in sequenced genomes (Fukuchi et al. 2009). Databases for structure-based protein classification, as typified by CATH (http://www.cathdb.info/) and the Structural Classification of Proteins (SCOP) database (http://scop.mrc-lmb.cam.ac.uk/scop/), have provided important clues to the relationships between protein structures, protein functions and protein evolution (Greene et al. 2007, Andreeva et al. 2008). 
Databases of protein families based on conserved protein domains, such as Pfam, Superfamily and Protein ANalysis THrough Evolutionary Relationships (PANTHER), are important resources for classifying proteins into families. These types of databases are often used for the functional prediction and classification of proteins; for example, such resources are used in genome-wide identification of genes putatively encoding specific TFs (Mi et al. 2005, Wilson et al. 2007, Finn et al. 2008). Many of these protein family databases can be simultaneously searched using the EBI’s InterPro (Mulder and Apweiler 2008, Hunter et al. 2009).

Platforms and resources in metabolomics

Metabolomics aims to understand metabolic systems based on comprehensive and integrated approaches by taking advantage of various measurement instruments to characterize metabolites. Metabolomic approaches allow us to conduct parallel assessments of multiple metabolites and to undertake quantitative analysis of particular metabolites in ways that provide major advantages over chemical-level phenotyping and diagnostic analysis. It is notable that the plant metabolome represents an enormous chemical diversity due to the complex set of metabolites produced in each plant species (Bino et al. 2004, von Roepenack-Lahaye et al. 2004). Therefore, plant metabolomics is not only a great analytical challenge, but is also quite important. In its ability to elucidate plant cellular systems, metabolomics permits us to engineer molecular breeding to improve the productivity and functionality of plants in areas such as stress tolerance, pharmaceutical production, functional foods, biomaterials and energy (Trethewey 2004, Oksman-Caldentey and Saito 2005, Fernie and Schauer 2009). In this section, we introduce metabolomic analytical platforms for plants, metabolic profiling, and their applications in combination with other omics. We also describe plant metabolomics-related computational tools and databases.

Instruments for metabolomics

Many remarkable technological advances have recently been made in instrumentation related to metabolomics. Metabolomics experiments start with the acquisition of metabolic fingerprints using various analytical instruments such as GC-MS, LC-MS, FT-MS, FT-IR and NMR (Fiehn et al. 2000, Roessner et al. 2001, Fernie et al. 2004). Methods for sample separation, such as gas chromatography (GC), high-performance or ultraperformance LC and capillary electrophoresis (CE), are typically used in conjunction with various types of MS, as detailed below. CE-MS is an especially effective, high-sensitivity method for separating and analyzing polarized molecules in samples (Ramautar et al. 2009). QMS and TOF MS are also well regarded for use in metabolomics. Triple Q (QqQ) MS (a tandem-type MS) and Q-TOF (a hybrid-type MS) are also used. Methods that do not involve pre-separation of samples, e.g. FT-ICR MS, are also being used, allowing for MSn analysis (Werner et al. 2008). NMR-based methods are also used in metabolomic analysis (Dixon et al. 2006, Schripsema 2009). These methods can be broadly classified into solution NMR and insoluble or solid-state NMR, according to sample solubility. Using high-resolution (hr)-MAS techniques, it is possible to acquire metabolic fingerprints from insoluble samples and solid-state samples (Bertocchi and Paci 2008). In one-dimensional NMR, protons (1H) are usually observed (1H-NMR) due to the sensitivity and common occurrence of this magnetic nucleus. More detailed analyses, such as metabolite identification or flux analysis, can be performed with other nuclei, particularly 13C and 15N, that are coupled with 1H nuclei in two-dimensional or multidimensional NMR analysis (Kikuchi et al. 2004, Sekiyama and Kikuchi 2007). The resulting fingerprints (MS or NMR spectra) are then pre-processed by steps that include background noise suppression, peak alignment and peak picking.
Pre-processed data sets are subsequently used to identify metabolites corresponding to each spectrum signal by searching against compound databases. In non-target analyses, spectrum data sets that include spectra of unknown compounds are subjected to statistical analyses, such as multivariate analysis, to mine the data for biological significance (Tikunov et al. 2005). In target analyses, spectrum data sets that are associated with particular compounds are used as metabolic profiles for each compound in further analyses. Data analysis is important in the determination of biological significance in metabolomics. Statistical analyses using multivariate analysis, such as principal component analysis (PCA), hierarchical clustering analysis (HCA) and self-organization mapping (SOM), are typically used to classify samples and/or metabolites (Kose et al. 2001, Hirai et al. 2004, Jonsson et al. 2004, Matsuda et al. 2009). The visualization of metabolic profiles on metabolic pathway maps is also often used and is combined with other omics methods, including gene expression profiles of genes encoding enzymes involved in particular pathways (Thimm et al. 2004, Tokimatsu et al. 2005).
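As a rough illustration of how PCA can separate samples by their metabolic profiles, the following minimal Python sketch autoscales a small intensity matrix and extracts the first principal component by power iteration. The intensity values, the function names and the choice of autoscaling are assumptions made for illustration only; they are not taken from any of the studies cited above, and real analyses would normally rely on an established statistics package.

```python
import math

def autoscale(rows):
    """Mean-center each metabolite (column) and scale it to unit variance."""
    n = len(rows)
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        sd = math.sqrt(var) or 1.0  # constant columns are left at zero
        scaled_cols.append([(v - mean) / sd for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def first_principal_component(rows, iterations=200):
    """Return (scores, loadings, explained fraction) for PC1 via power iteration."""
    x = autoscale(rows)
    n, m = len(x), len(x[0])
    # Covariance matrix of the autoscaled data (metabolites x metabolites).
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1)
            for j in range(m)] for i in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m))
                     for i in range(m))
    scores = [sum(x[k][j] * v[j] for j in range(m)) for k in range(n)]
    explained = eigenvalue / sum(cov[i][i] for i in range(m))
    return scores, v, explained

# Hypothetical peak intensities: 3 control and 3 stressed samples, 4 metabolites.
profiles = [
    [10.0, 5.0, 1.0, 7.0],
    [11.0, 5.5, 1.2, 7.1],
    [9.5, 4.8, 0.9, 6.9],
    [20.0, 9.0, 3.0, 7.0],
    [21.0, 9.5, 3.3, 7.2],
    [19.0, 8.7, 2.9, 6.8],
]
scores, loadings, explained = first_principal_component(profiles)
```

In this toy data set the first three metabolites differ between the two groups, so the PC1 scores place control and stressed samples on opposite sides of zero, while the loadings indicate which metabolites drive the separation.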

Metabolite profiling in plants

The systematic collection of metabolite profiles is the initial step in metabolomics. This step can be performed with various instruments capable of high throughput and simultaneous measurement, as we mentioned above. Comprehensive metabolic profile data sets can contribute to the understanding of the cellular system in response to changes in intracellular and extracellular environments. Furthermore, the changes in metabolic profiles associated with genetic variations can be evaluated as chemical phenotypes to identify genes involved in particular metabolic pathways. A number of studies of metabolic profiling in plant species have been performed that have resulted in the publication of related databases. For example, in the case of Arabidopsis, an NSF-funded multi-institutional project aimed at development of the metabolomics database, Plantmetabolomics, has recently been undertaken (http://lab.bcb.iastate.edu/sandbox/pbais05/alpha/plantmetabolomics_trimmed/index.php). Several databases for Solanaceae species are already available. The Metabolome Tomato Database (MoTo DB) was developed as an LC-MS-based metabolome database (http://appliedbioinformatics.wur.nl/moto/) (Moco et al. 2006). The KOMICS (Kazusa-omics) database collects annotations of metabolite peaks detected by LC-FT-ICR-MS and contains a representative metabolome data set for the tomato cultivar Micro-Tom (Iijima et al. 2008). The Armec Repository Project provides metabolome data on the potato and serves as a data repository for metabolite peaks detected by ESI-MS (http://www.armec.org/MetaboliteLibrary/index.jsp). The Golm Metabolome Database (GMD) provides public access to custom mass spectra libraries and metabolite profiling experiments as well as to additional information and related tools (http://csbdb.mpimp-golm.mpg.de/csbdb/gmd/gmd.html) (Kopka et al. 2005).
The MS/MS spectral tag (MS2T) libraries at the Platform for Riken Metabolomics (PRIMe) website provide access to libraries of phytochemical LC-MS2 spectra obtained from various plant species by using an automatic MS2 acquisition function of LC-ESI-Q-TOF/MS (http://prime.psc.riken.jp/lcms/ms2tview/ms2tview.html) (Matsuda et al. 2009). These databases play crucial roles as information resources and repositories of large-scale data sets and also serve as tools for further integration of metabolic profiles containing comprehensive data acquired from other omics research (Akiyama et al. 2008).

Combinatorial approaches in metabolomics and other omics resources

Metabolome approaches also support the understanding of global relationships among cellular metabolic systems in combination with other omics instances such as profiles of the transcriptome and proteome, and also genetic variations. So far, these combinatorial approaches have been successfully demonstrated in the well-studied Arabidopsis by taking advantage of the many other omics resources that currently exist, including the whole-genome sequence with mature annotations, large-scale transcriptome data sets and related co-expression data, and bioresources such as collections of mutants and full-length cDNA clones. A conceptual scheme for systematic elucidation of gene-to-metabolites molecular networks through a combinatorial approach using transcriptome and metabolome resources has been demonstrated by Saito's group in the RIKEN Plant Science Center (Saito et al. 2008). Data sets containing transcriptome and metabolome changes of Arabidopsis under stress conditions induced by deficiency of sulfur and nitrogen were analyzed using a batch-learning, self-organizing map (BL-SOM) analysis, enabling identification of genes involved in glucosinolate biosynthesis (Hirai et al. 2004). An integrated approach that comprised metabolome and transcriptome analysis was conducted for investigation of an activation-tagged mutant and overexpressors of an MYB TF, PAP1 gene in order to identify genes involved in anthocyanin biosynthesis in Arabidopsis (Tohge et al. 2005). Co-expression data of the Arabidopsis transcriptome provided by the ATTED-II database have been applied to the investigation of key genes involved in specific metabolic pathways and then to the configuration of a metabolome analysis coupled with mutant lines of the targeted genes (Obayashi et al. 2009). 
The ATTED-II database was used to identify novel genes involved in lipid metabolism, leading to identification of a novel gene, UDP-glucose pyrophosphorylase3 (UGP3) that is required for the first step of sulfolipid biosynthesis (Okazaki et al. 2009). Co-expression analysis was also used to identify all of the genes related to flavonoid biosynthesis, which led to further detailed analysis of two flavonoid pathway genes UGT78D3 and RHM1 (Yonekura-Sakakibara et al. 2008). Approaches that integrated metabolome and transcriptome data have also elucidated regulatory networks that act in response to environmental stresses in plants. The metabolic pathways that act in response to cold and dehydration conditions in Arabidopsis were investigated by metabolome analysis using various types of MS coupled with microarray analysis of overexpressors of genes encoding two TFs, DREB1A/CBF3 and DREB2A (Maruyama et al. 2009). Metabolomic profiling was also used to investigate chemical phenotypic changes between wild-type Arabidopsis and a knockout mutant of the NCED3 gene under dehydration stress conditions. The metabolic data were then integrated with transcriptome data to reveal ABA-dependent regulatory networks (Urano et al. 2009). Metabolome profiling has also been used to evaluate chemical phenotypes of natural variations and/or segregation populations simultaneously. A comprehensive exploration of the association between metabolic and genomic diversity will enable the discovery of key genes involved in metabolic changes and would also aid in the identification of genetic associations between metabolic and/or visible phenotypes (Schauer et al. 2008, Fu et al. 2009). Metabolite QTL (mQTL) analysis using segregated populations has been applied to various plant species such as Arabidopsis, poplar and tomato in a popular forward genetics approach (Morreel et al. 2006, Schauer et al. 2006, Lisec et al. 2008, Rowe et al. 2008; Schauer et al. 2008). 
Furthermore, along with the recent availability of data sets of genome-wide variation acquired by high-throughput genotyping methods including resequencing, interest in the discovery of the genetic association between nucleotide variation and phenotypic changes has also increased, especially with regard to the identification of key genes that play significant roles in evolutionary histories. The attempts to mine correlative patterns between metabolic and genomic diversities have recently been applied to sesame and rice using seed stocks of natural variations (Laurentin et al. 2008, Mochida et al. 2009a).
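The co-expression strategy described in this section can be sketched in a few lines of Python: given expression profiles across a set of conditions, genes are ranked by their Pearson correlation with a 'bait' gene of known function. The gene names and expression values below are hypothetical, and plain Pearson correlation is used only as a simple stand-in; it should not be read as the scoring scheme of ATTED-II or of any other database mentioned above.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / math.sqrt(var_a * var_b)

def coexpressed_with(bait, expression, top=3):
    """Rank all other genes by correlation with the bait gene."""
    ranked = sorted(
        ((pearson(expression[bait], profile), gene)
         for gene, profile in expression.items() if gene != bait),
        reverse=True,
    )
    return [(gene, r) for r, gene in ranked[:top]]

# Hypothetical expression values across five conditions.
expression = {
    "bait_gene": [1.0, 2.0, 3.0, 4.0, 5.0],
    "candidate_a": [1.1, 2.1, 2.9, 4.2, 5.1],
    "candidate_b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "candidate_c": [2.0, 2.0, 2.1, 2.0, 2.0],
}
hits = coexpressed_with("bait_gene", expression, top=2)
```

Candidates that rank highly in such a list would then be examined further, for example by metabolome analysis of the corresponding mutant lines, as in the studies cited above.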

Information resources for metabolomics

Various information resources related to metabolomics have played crucial roles not only in metabolome research but also in synergistic integration with other omics data. The Web site of metabolome resources at TAIR (http://www.arabidopsis.org/portals/metabolome/index.jsp) provides a summarized list of Web hyperlinks to resources that facilitate metabolome research. In addition, a data set of biological pathway maps is available from the KEGG PATHWAY Database of the Kyoto Encyclopedia of Genes and Genomes (KEGG), a popular database for information on life sciences (http://www.genome.jp/kegg/pathway.html) (Kanehisa and Goto 2000, Kanehisa et al. 2008). The Plant Metabolic Network (PMN) is a collaborative project that aims to build plant metabolic pathway databases (http://www.plantcyc.org/). One of its main components, PlantCyc, is a comprehensive plant biochemical pathway database that contains curated information from the literature and from computational analyses of genes, enzymes, compounds, reactions and pathways involved in primary and secondary plant metabolism (http://www.plantcyc.org:1555/PLANT/server.html) that can be visualized using a pathway tool (http://bioinformatics.ai.sri.com/ptools/). AraCyc and PoplarCyc are also available at the PMN Web site and provide manually curated or reviewed information about metabolic pathways in Arabidopsis and poplar, respectively (Mueller et al. 2003). There are also metabolic pathway databases for several other plant species generated by PMN collaborators. MapMan is a tool to project omics data sets including gene expression data onto diagrams of metabolic pathways or other processes (http://mapman.gabipd.org/web/guest) (Thimm et al. 2004). KaPPA-View is another Web-based analysis tool that can be used to superimpose transcriptome and metabolome data onto plant metabolic pathway maps (http://kpv.kazusa.or.jp/kappa-view/) (Tokimatsu et al. 2005).
PRIMe is a Web-based service that provides data sets of metabolites measured by multidimensional NMR spectroscopy, GC-MS, LC-MS and CE-MS together with analytical tools to promote integrated approaches using comprehensive data sets within the metabolome and transcriptome (http://prime.psc.riken.jp/) (Akiyama et al. 2008).

Mutant resources for phenome analysis

Analysis of mutants is an effective approach for investigation of gene function (Springer 2000, Stanford et al. 2001). Comprehensive collections of mutant lines are also essential bioresources for radically accelerating forward and reverse genetics. The available mutant resources for phenome analysis in plant species have been well described in a recent review by Kuromori et al. (2009). As described above, various analytical platforms have rapidly evolved, allowing us to discover genes involved in particular phenotypic changes. Along with these analysis platforms, the demands for comprehensive collections of mutants and related information resources have dramatically increased, encouraging high-throughput and genome-wide phenome analysis in plant species (Alonso and Ecker 2006).

Insertion mutant

With the completion of genome sequencing in plants, insertion mutant resources with index data that document the inserted genomic position have become extremely beneficial resources for promoting functional analysis of annotated genes by a reverse genetics approach. Transfer DNA (T-DNA)-tagged lines and transposon-tagged lines have become popular resources for the investigation of insertion mutants in plants. T-DNA-tagged lines have emerged as a popular mutant resource due to the rapid generation of large-scale populations in Arabidopsis (Krysan et al. 1999). The two-component maize transposon, Activator (Ac) / Dissociation (Ds), is a popular system for inducing transposon-based insertion that enables the generation of mutants with a high proportion of single-copy insertions (Long et al. 1993). In rice, the endogenous retrotransposon Tos17, which is activated in particular conditions, is also available for the study of the insertion mutant lines of a japonica rice cultivar, Nipponbare (Miyao et al. 2003, Miyao et al. 2007), and the Web resource that provides information on the rice Tos17 mutant panel with flanking sequences of insertion is available at http://tos.nias.affrc.go.jp/index.html.en. Additionally, the maize Enhancer/Suppressor Mutator (En/Spm) element has also been used as an effective tool for the study of functional genomics in plants (Kumar et al. 2005). The enhancer trap (ET) and the gene trap (GT) constructs have been coupled with T-DNA and Ac/Ds transposons, which additionally allows genes to be trapped by monitoring the activity of adjacent promoters or enhancers (Sundaresan et al. 1995, An et al. 2005). OryGenesDB (http://orygenesdb.cirad.fr/) is a database that integrates information of available insertion mutant resources in rice (Droc et al. 2009). There are a number of resources for insertion mutant populations with insertion site index-tagged data available for various plant species (Kuromori et al. 2009).
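To make the idea of insertion-site index data concrete, the sketch below shows one simple way such an index could be queried in a reverse genetics setting: insertion positions (as would be obtained by mapping flanking sequences to the genome) are intersected with gene coordinates to list the mutant lines that disrupt each annotated gene. All identifiers and coordinates are hypothetical, and the linear scan is chosen for readability rather than speed; the databases cited above may organize their data quite differently.

```python
from collections import defaultdict

def index_insertions(genes, insertions):
    """Map each gene to the mutant lines whose insertion falls inside it.

    genes:      iterable of (gene_id, chromosome, start, end), 1-based inclusive
    insertions: iterable of (line_id, chromosome, position)
    """
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        genes_by_chrom[chrom].append((start, end, gene_id))

    lines_by_gene = defaultdict(list)
    for line_id, chrom, position in insertions:
        # Simple linear scan; an interval tree would scale better.
        for start, end, gene_id in genes_by_chrom.get(chrom, []):
            if start <= position <= end:
                lines_by_gene[gene_id].append(line_id)
    return dict(lines_by_gene)

# Hypothetical annotation and insertion index.
genes = [
    ("geneA", "chr1", 1000, 2500),
    ("geneB", "chr1", 4000, 5200),
    ("geneC", "chr2", 300, 900),
]
insertions = [
    ("line_001", "chr1", 1200),
    ("line_002", "chr1", 4100),
    ("line_003", "chr1", 2400),
    ("line_004", "chr2", 950),   # intergenic: just outside geneC
]
lines_by_gene = index_insertions(genes, insertions)
```

A gene that is absent from the result (here `geneC`) has no indexed insertion, which is the kind of gap that complementary resources such as TILLING populations or gene silencing lines can help to fill.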

Activation tagging

Activation tagging (AT) is a popular method for generating gain-of-function mutant populations. The method uses T-DNA or a transposable element containing cauliflower mosaic virus 35S enhancer multimers (Weigel et al. 2000). Because genes near the insertion are transcriptionally activated, novel phenotypes are expected to appear, which helps to identify the functions of genes that are redundant or essential for survival. Such mutant resources have been used to isolate genes from Arabidopsis, rice, petunia and tomato (Kakimoto 1996, Zubko et al. 2002, Mathews et al. 2003, Mori et al. 2007). Recently, AT systems using a transposon of maize En/Spm or Ac/Ds have been developed in Arabidopsis and rice, respectively (Schneider et al. 2005, Qu et al. 2008). A number of AT projects have been performed in various plant species such as Arabidopsis, rice and soybean (Weigel et al. 2000, An et al. 2005, Kuromori et al. 2009).

The FOX hunting system

The Full-length cDNA Over-eXpressor (FOX) gene hunting system is a recently developed novel gain-of-function system that combines a transformation algorithm with large-scale resources for full-length cDNA clones (Ichikawa et al. 2006). A normalized full-length cDNA library of Arabidopsis was first used to generate the Arabidopsis FOX lines, which are available at http://nazunafox.psc.database.riken.jp. The system was also used to screen salt stress-resistant lines in the T1 generation produced by the transformation of 43 focused stress-inducible TFs of Arabidopsis (Fujita et al. 2007). Then, the system was applied to a set of full-length rice cDNA clones aiming for in planta high-throughput screening of rice functional genes, with Arabidopsis as the host species (http://ricefox.psc.riken.jp) (Kondou et al. 2009). FOX lines with rice full-length cDNAs transformed into rice plants have also been generated as a gain-of-function mutant resource of rice (Nakamura et al. 2007). A similar technique (in terms of overexpressors using cDNA libraries) has also been carried out in tobacco (Lein et al. 2008).

Chemical and physical mutagenesis

Chemical mutagenic agents, such as ethyl methanesulfonate (EMS), sodium azide and methylnitrosourea (MNU), and physical mutagens, such as fast neutrons, gamma rays and ion-beam irradiation, have been used for many years to generate mutant populations for forward genetics in various plant species. Targeting induced local lesions in genomes (TILLING) was developed as a general reverse-genetic strategy that provides an allelic series of induced point mutations in genes of interest (Till et al. 2004, Till et al. 2006). Because high-throughput TILLING permits the rapid and low-cost discovery of induced point mutations in populations of chemically mutagenized individuals, the method has been applied to various animal and plant species. The TILLING technology can also be used to explore allelic variations that occur in natural populations; this application is called EcoTILLING (Comai et al. 2004, Wang et al. 2006). Several laboratory sites have established TILLING and/or EcoTILLING centers for communities of users as a public service (Barkley and Wang 2008). TILLING projects in rice, tomato and Arabidopsis have been performed at the University of California Davis Genome Center (http://tilling.ucdavis.edu/index.php/Main_Page). A soybean mutation database provides soybean mutagenized lines together with data on TILLed genes and their phenotypes (http://www.soybeantilling.org/index.jsp). RevGenUK at the John Innes Centre provides a TILLING service for TILLING populations of Medicago truncatula, Lotus japonicus and Brassica rapa (http://revgenuk.jic.ac.uk/about.htm). UTILLdb of INRA is another database for TILLING populations of pea and tomato that provides an interface to search for TILLed lines based on phenotypes (http://urgv.evry.inra.fr/UTILLdb).

Gene silencing technologies

Although insertion mutagenesis is an effective method for generating loss-of-function mutants, it also has limitations in the case of redundant genes and lethal mutants. To overcome these limitations, methods to interrupt gene expression have been developed and applied to the functional analysis of plant genes. RNA interference (RNAi) is a popular method for RNA-mediated gene silencing by sequence-specific degradation of homologous mRNA triggered by double-stranded RNA (dsRNA), which is also known as post-transcriptional gene silencing (PTGS) (Waterhouse et al. 1998, Chuang and Meyerowitz 2000). Constitutive expression of an intron-containing self-complementary hairpin RNA (ihpRNA) has been an effective method for silencing target genes in plants. To meet the demand for conditional silencing of target genes whose constitutive silencing would prevent plant regeneration or cause embryonic lethality, conditional RNAi systems using a chemical-inducible Cre/loxP recombination system or a promoter of heat shock-inducible genes have recently been developed (Guo et al. 2003, Masclaux et al. 2004). In Arabidopsis, the Complete Arabidopsis Transcriptome MicroArray (CATMA) project has been started to design and produce high-quality gene-specific sequence tags (GSTs) covering most Arabidopsis genes (http://www.catma.org/). Using the GST data set of the CATMA project, the Arabidopsis Genomic RNAi Knock-out Line Analysis (AGRIKOLA) project has also been started with the goal of systematically analyzing Arabidopsis genes by RNAi-based technology (http://www.agrikola.org/index.php?o=/agrikola/html/index) (Hilson et al. 2003). The M. truncatula RNAi Database is also available on the Web as an information resource for RNAi-based gene silencing (https://mtrnai.msi.umn.edu/). Virus-induced gene silencing (VIGS) is a derivative method that takes advantage of the plant RNAi-mediated antiviral defense mechanism.
The VIGS system was used to assess the function of almost 5000 random Nicotiana benthamiana cDNAs in disease resistance (Lu et al. 2003a, Lu et al. 2003b). The chimeric repressor silencing technology (CRES-T) system was developed as a novel method for gene silencing; it takes advantage of the fact that TFs fused to the EAR motif, a plant-specific repression domain, act as dominant repressors in transgenic plants and therefore suppress the expression of target genes (Hiratsu et al. 2003). The CRES-T system has been applied to the TFs annotated for Arabidopsis in order to analyze their biological function and to obtain transgenic plants with agronomically preferable traits. An associated database, FioreDB, is available at http://www.cres-t.org/fiore/public_db/index.shtml (Mitsuda and Ohme-Takagi 2009).

Plant comparative genomics and databases

The recent accumulation of nucleotide sequences for agricultural species, including crops and domestic animals, now allows us to perform genome-wide comparative analyses of model organisms with the goal of discovering key genes involved in phenotypic characteristics (Sato and Tabata 2006, Itoh et al. 2007, Neale and Ingvarsson 2008). The integration of genomic resources derived from various related species, such as large-scale collections of cDNAs and data from whole-genome sequencing projects, should facilitate sharing of information about gene function between models and applied organisms. This will also accelerate molecular elucidation of cellular systems related to agronomically important traits. A number of information resources for plant genomics accessible on the Web have appeared, along with appropriate analytical tools. Here we highlight integrative databases promoting plant comparative genomics that we have not described previously. The URLs of each integrative database in plant genomics are shown in Table 4.

Table 4

Integrative databases in plants

Database name	Species	URL
TAIR	Arabidopsis	http://www.arabidopsis.org/
SIGnAL	Arabidopsis	http://signal.salk.edu/
RARGE	Arabidopsis	http://rarge.psc.riken.jp/
Rice Genome Annotation Project	Rice	http://rice.plantbiology.msu.edu/
RAP-DB	Rice	http://rapdb.dna.affrc.go.jp/
SOL genomics network	Solanaceae	http://solgenomics.net/
Gramene	Gramineae	http://www.gramene.org/
GrainGenes	Triticeae and Avena	http://wheat.pw.usda.gov/GG2/index.shtml
SoyBase	Soybean	http://www.soybase.org/
MaizeGDB	Maize	http://www.maizegdb.org/
CyanoBase	Cyanobacteria	http://genome.kazusa.or.jp/cyanobase/
GDR (Genome Database for Rosaceae)	Rosaceae	http://www.bioinfo.wsu.edu/gdr/
Brassica Genome Gateway	Brassica	http://brassica.bbsrc.ac.uk/
Cucurbit Genomics Database	Cucurbitaceae	http://www.icugi.org/
Phytozome	Plant species (whole genome data available)	http://www.phytozome.net/
PlantGDB	Plant species (whole genome and/or large-scale EST data available)	http://www.plantgdb.org/
EnsemblPlants	Plant species (whole genome data available)	http://plants.ensembl.org/index.html
ChloroplastDB	Plant species (Chloroplast genome data available)	http://chloroplast.cbio.psu.edu/
KEGG PLANT	Plant species (whole genome and/or large-scale EST data available)	http://www.genome.jp/kegg/plant/


Portal information resources in plants

TAIR is one of the most popular and integrated information resources in plant science, and it plays an essential role as a portal site in Arabidopsis research (http://www.arabidopsis.org/) (Swarbreck et al. 2008). The Salk Institute Genomic Analysis Laboratory (SIGnAL) is also an information resource that integrates various data sets of significant omics results mainly related to Arabidopsis (http://signal.salk.edu/). The RIKEN Arabidopsis Genome Encyclopedia (RARGE) provides information on various genomic resources built at RIKEN for Arabidopsis research (http://rarge.gsc.riken.jp/db_home.pl) (Sakurai et al. 2005). Such portal sites have provided gateways for access to comprehensive omics data and/or bioresources. These sites also house cross-referenced data sets built between each annotated gene and its associated instances, such as gene–full-length cDNA clones, gene–mutants, gene–expression patterns and gene–homologous genes. To visualize an annotated gene along with genome sequences and associated information, genome browsers such as Gbrowse have been implemented on Web sites (Donlin 2007). Gramene is a popular portal site that is not only an integrated rice information resource but also a portal for promoting plant comparative genomics (http://www.gramene.org/). Gramene offers integrated genome-oriented data including gene annotation and molecular markers, and also a QTL database mainly for Gramineae species (Liang et al. 2008). Along with the launch of genome sequencing projects, portal sites to share the progression of outcomes and to integrate related resources have appeared for various species. The Sol genomics network is a portal site for Solanaceae genome resources that includes information on the tomato genome sequencing project (http://solgenomics.net/) (Mueller et al. 2005). SoyBase is a resource portal site for genomic soybean research, and it includes released whole-genome sequence data (http://soybase.org/).
The MaizeGDB is the community database for biological information about Zea mays, and includes genetic and genomic data sets and related information (http://www.maizegdb.org/) (Lawrence et al. 2004).

Genome-wide comparisons among plant species

With the completion of genome sequencing in a number of plant species, genome-scale comparative analyses can be used to produce and publish data sets that facilitate identification of conserved and/or characteristic properties among plant species. Using modeled proteome data sets deduced from sequenced genomes in plants, several efforts have been completed to construct comprehensive gene families with the aim of establishing platforms to verify gene content and elucidating the process of gene duplication and functional diversification among species (Sterck et al. 2007). Comprehensive gene family data sets are usually produced by computational procedures including a step that conducts an all-against-all sequence similarity search and then a step for building clusters of protein families by methods such as Markov Clustering (MCL) or consideration of protein domain structures (Hulsen et al. 2006). The results of such studies can themselves yield databases that are useful for further phylogenetic studies (Horan et al. 2005, Conte et al. 2008, Wall et al. 2008). Correlated gene arrangements among taxa along with chromosomal allocation, also known as synteny and collinearity, have become valuable frameworks for inference of shared ancestry of genes and for transfer of knowledge from one species to another related species (Tang et al. 2008a). The plant genome duplication database (PGDD) provides a data set of intragenome or cross-genome syntenic relationships identified throughout genome-sequenced plant species (http://chibba.agtec.uga.edu/duplication/) (Tang et al. 2008b).
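The two-step procedure for building gene families (an all-against-all similarity search followed by clustering) can be illustrated with a deliberately small Python sketch of Markov Clustering. The similarity matrix below is hypothetical and stands in for scores that would normally come from a sequence similarity search; the inflation value, the convergence tolerance and the way clusters are read from the final matrix are simplifying assumptions, and production analyses would use the dedicated MCL software rather than this dense-matrix toy version.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            matrix[i][j] /= total
    return matrix

def markov_cluster(similarity, inflation=2.0, max_iterations=50, tol=1e-9):
    """Toy MCL: alternate expansion (matrix squaring) and inflation."""
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops keep every column non-empty
    normalize_columns(m)
    for _ in range(max_iterations):
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Rows that retain weight act as attractors; their non-zero columns
    # are the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)

# Hypothetical similarities among six proteins: two tight families
# (0-2 and 3-5) joined only by one weak hit between proteins 2 and 3.
similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
families = markov_cluster(similarity)
```

The inflation step strengthens strong links and weakens weak ones, so the single weak hit between the two groups is progressively suppressed and the proteins fall into two families.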

Focused database for plant genomics

Databases housing focused data sets together with rich annotations and well interrelated cross-references are also quite useful for the better understanding of focused issues in particular gene families and/or particular cellular processes. Sequence-specific DNA-binding TFs are key molecular switches that control or influence many biological processes, such as development or responses to environmental changes. In plants, the genome-wide identification of repertoires of genes encoding TFs of the Arabidopsis genome was reported first, and comparisons with other organisms revealed the properties of plant-specific TFs (Riechmann et al. 2000). In the past decade, with the availability of complete genome sequences, we have been able to compile catalogs describing the function and organization of TF regulatory systems in a number of organisms. There are many databases that provide data sets of genes putatively encoding TFs in many plant species; these are usually predictions based on computational methods such as sequence similarity search and/or hidden Markov model search of conserved DNA-binding domains (Table 5). Recently, further integration of data sets of TF-encoding genes has been performed, thus establishing an integrative, knowledge-based resource of TFs across related plant species in terms of comparative genomics of transcriptional regulatory networks. GRASSIUS provides the first step toward building a comprehensive platform for integration of information, tools and resources for comparative regulatory genomics across the grass species (Yilmaz et al. 2009). The Grass Transcription Factor Database (GrassTFDB) of GRASSIUS houses integrated information on MaizeTFDB, RiceTFDB, SorghumTFDB and CaneTFDB (http://grassius.org/grasstfdb.html). The LegumeTFDB provides predicted TF-encoding genes annotated in the genome sequences of three major legume species: soybean, L. japonicus and M. truncatula (http://legumetfdb.psc.riken.jp/).
This database is an extended version of the SoybeanTFDB (http://soybeantfdb.psc.riken.jp/) and is aimed at integrating knowledge on legume TFs and providing a public resource for comparative genomics of the TFs of legumes, non-legume plants and other organisms (Mochida et al. 2009c, Mochida et al. 2010).
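The domain-based prediction of TF-encoding genes mentioned above can be sketched as a small rule table applied to domain search results. In the Python example below, each protein carries a list of (domain, E-value) hits such as a hidden Markov model search might report, and a family is assigned when all required domains are present and no forbidden domain is found. The protein identifiers, the E-value cutoff and the rule table itself are hypothetical and purely illustrative; they do not reproduce the classification rules of any database listed in Table 5.

```python
# Illustrative rules only: domains that must be present ("required") and
# domains that exclude a family ("forbidden").
FAMILY_RULES = {
    "WRKY": {"required": {"WRKY"}, "forbidden": set()},
    "AP2/ERF": {"required": {"AP2"}, "forbidden": {"B3"}},
    "RAV": {"required": {"AP2", "B3"}, "forbidden": set()},
    "NAC": {"required": {"NAM"}, "forbidden": set()},
}

def classify_tfs(domain_hits, max_evalue=1e-5):
    """Assign putative TF families from per-protein domain hits.

    domain_hits: dict mapping protein id -> list of (domain name, E-value)
    """
    assignments = {}
    for protein, hits in domain_hits.items():
        # Keep only domains whose hit passes the significance cutoff.
        domains = {name for name, evalue in hits if evalue <= max_evalue}
        families = sorted(
            family for family, rule in FAMILY_RULES.items()
            if rule["required"] <= domains and not (rule["forbidden"] & domains)
        )
        if families:
            assignments[protein] = families
    return assignments

# Hypothetical search results for four predicted proteins.
domain_hits = {
    "protein_1": [("WRKY", 1e-30)],
    "protein_2": [("AP2", 1e-20), ("B3", 1e-15)],
    "protein_3": [("AP2", 1e-3)],     # too weak to count as a hit
    "protein_4": [("AP2", 1e-12)],
}
tf_families = classify_tfs(domain_hits)
```

Because the same rule table can be applied to the predicted proteomes of several species, this kind of procedure lends itself to the cross-species comparisons of TF repertoires that the integrated databases aim to support.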

Table 5

Transcription factor database in plants

Database	URL	Species	References
RARTF	http://rarge.gsc.riken.jp/rartf/	Arabidopsis	Iida et al. (2005)
AGRIS, AtTFDB	http://arabidopsis.med.ohio-state.edu/AtTFDB/	Arabidopsis	Palaniswamy et al. (2006)
DATF	http://datf.cbi.pku.edu.cn/	Arabidopsis	Guo et al. (2005)
DRTF	http://drtf.cbi.pku.edu.cn/	Rice	Gao et al. (2006)
DPTF	http://dptf.cbi.pku.edu.cn/	Poplar	Zhu et al. (2007)
TOBFAC	http://compsysbio.achs.virginia.edu/tobfac/	Tobacco	Rushton et al. (2008)
SoybeanTFDB	http://soybeantfdb.psc.riken.jp/	Soybean	Mochida et al. (2009c)
PlantTFDB	http://planttfdb.cbi.pku.edu.cn/	22 plant species	Guo et al. (2008)
PlnTFDB	http://plntfdb.bio.uni-potsdam.de/v3.0/	20 plant species	Riano-Pachon et al. (2007)
GRASSIUS, GrassTFDB	http://grassius.org/grasstfdb.html	Maize, rice, sorghum, sugarcane	Yilmaz et al. (2009)
LegumeTFDB	http://legumetfdb.psc.riken.jp/	Soybean, Lotus japonicus, Medicago truncatula	Mochida et al. (2010)
DBD	http://dbd.mrc-lmb.cam.ac.uk/DBD/index.cgi?Home	>700 species	Wilson et al. (2008)


323 in total

1. An SNP caused loss of seed shattering during rice domestication.

Authors: Saeko Konishi; Takeshi Izawa; Shao Yang Lin; Kaworu Ebana; Yoshimichi Fukuta; Takuji Sasaki; Masahiro Yano
Journal: Science Date: 2006-04-13 Impact factor: 47.728

2. Transcriptome analysis of salinity stress responses in common wheat using a 22k oligo-DNA microarray.

Authors: Kanako Kawaura; Keiichi Mochida; Yukiko Yamazaki; Yasunari Ogihara
Journal: Funct Integr Genomics Date: 2005-11-19 Impact factor: 3.410

3. De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads.

Authors: Rhys A Farrer; Eric Kemen; Jonathan D G Jones; David J Studholme
Journal: FEMS Microbiol Lett Date: 2008-12-09 Impact factor: 2.742

4. Identification of ubiquitinated proteins in Arabidopsis.

Authors: Concepción Manzano; Zamira Abraham; Gema López-Torrejón; Juan C Del Pozo
Journal: Plant Mol Biol Date: 2008-06-06 Impact factor: 4.076

Review 5. Plant proteome analysis: a 2004-2006 update.

Authors: Michel Rossignol; Jean-Benoît Peltier; Hans-Peter Mock; Andrea Matros; Ana M Maldonado; Jesús V Jorrín
Journal: Proteomics Date: 2006-10 Impact factor: 3.984

Review 6. Phenome analysis in plant species using loss-of-function and gain-of-function mutants.

Authors: Takashi Kuromori; Shinya Takahashi; Youichi Kondou; Kazuo Shinozaki; Minami Matsui
Journal: Plant Cell Physiol Date: 2009-06-05 Impact factor: 4.927

7. A collection of 10,096 indica rice full-length cDNAs reveals highly expressed sequence divergence between Oryza sativa indica and japonica subspecies.

Authors: Xiaohui Liu; Tingting Lu; Shuliang Yu; Ying Li; Yuchen Huang; Tao Huang; Lei Zhang; Jingjie Zhu; Qiang Zhao; Danlin Fan; Jie Mu; Yingying Shangguan; Qi Feng; Jianping Guan; Kai Ying; Yu Zhang; Zhixin Lin; Zongxiu Sun; Qian Qian; Yuping Lu; Bin Han
Journal: Plant Mol Biol Date: 2007-05-24 Impact factor: 4.076

8. A consensus genetic map of sorghum that integrates multiple component maps and high-throughput Diversity Array Technology (DArT) markers.

Authors: Emma S Mace; Jean-Francois Rami; Sophie Bouchet; Patricia E Klein; Robert R Klein; Andrzej Kilian; Peter Wenzl; Ling Xia; Kirsten Halloran; David R Jordan
Journal: BMC Plant Biol Date: 2009-01-26 Impact factor: 4.215

9. PlnTFDB: an integrative plant transcription factor database.

Authors: Diego Mauricio Riaño-Pachón; Slobodan Ruzicic; Ingo Dreyer; Bernd Mueller-Roeber
Journal: BMC Bioinformatics Date: 2007-02-07 Impact factor: 3.169

10. Metabolite annotations based on the integration of mass spectral information.

Authors: Yoko Iijima; Yukiko Nakamura; Yoshiyuki Ogata; Ken'ichi Tanaka; Nozomu Sakurai; Kunihiro Suda; Tatsuya Suzuki; Hideyuki Suzuki; Koei Okazaki; Masahiko Kitayama; Shigehiko Kanaya; Koh Aoki; Daisuke Shibata
Journal: Plant J Date: 2008-02-07 Impact factor: 6.417
