
Bioinformatics Approach in Plant Genomic Research.

Quang Ong¹, Phuc Nguyen², Nguyen Phuong Thao², Ly Le².

Abstract

Advances in genomics technology have dramatically changed plant biology research. Plant biologists can now easily access enormous amounts of genomic data to study high-density genetic variation in plants at the molecular level. A thorough understanding of bioinformatics tools, and skill in using them to manage and analyze these data, are therefore essential in current plant genome research. Many plant genome databases have been established and continue to expand. Meanwhile, bioinformatics-based analytical methods have matured in many areas of plant genomic research, including comparative genomic analysis, phylogenomics and evolutionary analysis, and genome-wide association studies. However, the constant need to upgrade computational infrastructure, such as high-capacity data storage and high-performance analysis software, remains a real challenge for plant genome research. This review focuses on the challenges and opportunities that bioinformatics knowledge and skills bring to plant scientists in the present plant genomics era, and on the critical future need for effective tools that translate knowledge from new sequencing data into enhanced plant productivity.


Keywords: Bioinformatics; Comparative genomics; GWAS; Next-generation sequencing; Phylogenomics; Plant genomics

Year: 2016 PMID: 27499685 PMCID: PMC4955030 DOI: 10.2174/1389202917666160331202956

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

The plant kingdom is vital not only for humans but also for other living organisms. One of the crucial roles of plants is to provide a huge amount of food [1]. Plants are also used to make many human medicines [2] and have been selected as model organisms to study transposable elements in heterochromatin and epigenetic control [3]. Because of this vital role, plant biology has been studied broadly since the early stages of human history. Modern technologies have pushed the study of plant biology to a higher level than before [4]. The innovation of high-throughput sequencing methods gives scientists the ability to explore the structure of genetic material at the molecular level, a field known as “genomics”. Plant genomics has expanded rapidly in recent years and has become a main theme in plant research, driven by the rapid increase in the number of sequenced plant genomes [5]. Plant genome research has had a huge impact on the improvement of economically important plants and on the knowledge of plant biology [6]. Open access to this plant genomic information, together with its constant updating, creates a fertile environment for plant research to grow. This requires strong connection and cooperation within the global biological community [7]. In this paper, we first review the development of genomic sequencing technologies and their applications in plant genomic research. We then introduce recent bioinformatics approaches to managing and analyzing plant genomic databases. In particular, we summarize the most popular plant genomic resources. In addition, we provide fundamental knowledge of key methods for integrating and analyzing these genomic data, such as comparative genomic analysis, phylogenomics, evolutionary analysis and genome-wide association studies (GWAS) in plants.

NEXT GENERATION SEQUENCING TECHNOLOGY IN PLANT GENOMIC RESEARCH

The development of DNA sequencing technology has been a long and memorable journey marked by many historic events. For many years, nearly all DNA sequence production relied on capillary-based, semi-automated implementations of Sanger biochemistry and its variants [8-10]. Over the years, a series of scientific breakthroughs revived the field of DNA sequencing and allowed it to prosper. These technological advances in turn encouraged the development of novel experimental designs [11]. Ultimately, next-generation sequencing (NGS) technologies were released in 2005 [12]. They are known as “high throughput sequencing technologies that parallelize the sequencing process, producing millions of sequences at once at a much lower per-base cost than conventional Sanger sequencing” [13]. Based on NGS technologies, large companies such as Roche, Illumina and Applied Biosystems have developed many automated, ultrahigh-throughput platforms, all of which are well suited to current and even future large-scale sequencing needs. In general, Sanger’s dideoxy chain termination chemistry is no longer used in these NGS platforms. Instead, more advanced methods are applied, such as pyrosequencing, sequencing-by-synthesis, sequencing-by-ligation, ion semiconductor-based non-optical sequencing, single molecule sequencing and nanopore sequencing [14]. The sequencing-by-synthesis platform uses DNA polymerase to extend many DNA strands in parallel [15]. This method uses modified deoxynucleoside triphosphates (dNTPs) carrying a terminator that prevents further polymerization, so that DNA polymerase can add only a single base to each growing DNA copy strand. The newly incorporated nucleotide or oligonucleotide can therefore be identified as extension proceeds. The pyrosequencing platform is also based on the principle of sequencing-by-synthesis (SBS) [16]. 
It relies on detecting the pyrophosphate released when DNA polymerase incorporates a nucleotide; the pyrophosphate drives a series of enzymatic reactions that finally produces a light signal as luciferase oxidizes luciferin to oxyluciferin. The sequencing-by-ligation platform uses DNA ligase to perform sequential ligation of dye-labeled oligonucleotides. This process enables massively parallel sequencing of clonally amplified DNA fragments [17]. The mismatch sensitivity of DNA ligase is then used to determine the underlying sequence of the target DNA molecule. The ion semiconductor-based non-optical sequencing platform detects the hydrogen ions released during DNA polymerization. Single molecule sequencing is based on “the successive enzymatic degradation of fluorescently labeled single DNA molecules, and the detection and identification of the released monomer molecules according to their sequential order in a micro-structured channel” [18]. A single molecule sequencer does not require any amplification of DNA fragments prior to sequencing [19]. Nanopore sequencing identifies individual nucleotides, one base at a time, from changes in the ion current as the DNA strand passes through a membrane-inserted protein nanopore [20]. Well-known, commercially available NGS platforms include the Genome Sequencer from Roche/454 (pyrosequencing); the Genome Analyzer from Illumina/Solexa (sequencing-by-synthesis); SOLiD from Applied Biosystems and the Polonator from Dover Systems (both sequencing-by-ligation); Ion Torrent from Life Science, Inc. (ion semiconductor-based non-optical sequencing); Heliscope sequencers from Helicos Bioscience Corporation (true single molecule sequencing); PacBio RS sequencers from Pacific Biosciences (single molecule, real-time sequencing); and the GridION and miniaturized MinION sequencers from Oxford Nanopore Technologies (nanopore sequencing) [4, 14]. 
The main differences among these systems are the read length, the characteristic error model of each platform and the operating cost [21-24]. Depending on the application, these differences may affect how the reads are used in bioinformatics analyses [19]. However, most comparisons have shown that the data produced by these methods are similar [21-24]. The choice of sequencing method therefore depends mainly on the ultimate goal of a particular study. With their rapid innovation, NGS technologies have been applied to many aspects of plant genomic research, such as exome sequencing and the study of the genetic transmission of alleles/quantitative trait loci (QTLs) through whole genome sequencing [14]. Exome sequencing can help in exploring biodiversity, studying host–pathogen interactions, investigating the natural evolution of crops, testing the inheritance of genetic markers, providing large-scale genetic resources for crop improvement, and identifying genes and functional gene sets involved in symbiotic or other co-existential systems [14]. In addition, NGS methods with single-base resolution can provide epigenomic information. For instance, a study of the A. thaliana epigenome revealed that the location and abundance of small RNA targets were significantly related to cytosine methylation [25]. Another application of plant genome sequencing is genotyping by sequencing (GBS), which is emerging as a high-throughput and inexpensive method for genotyping populations. GBS supports several approaches to genomic map construction, especially the identification of single nucleotide polymorphisms (SNPs) [26]. For example, GBS of 2,815 maize inbred accessions yielded 681,257 SNP markers, among which associations with trait-related genes were found [27]. The successful application of NGS in plant genomic research is undeniable. 
However, developing computational tools for analyzing genome sequences remains challenging. Galaxy (http://galaxyproject.org) is a software system that lets researchers run analysis tools through web-based interfaces and gives access to a large body of freely available biological data [28]. Another program is Artemis, a genome browser and annotation tool that is freely available from the Sanger Institute (http://www.sanger.ac.uk/) [29]. Several other genome sequence analysis tools are provided by the Broad Institute's Genome Sequencing and Analysis Program (GSAP) (http://www.broadinstitute.org/). In addition, the rapid decrease in the cost of genome sequencing creates an urgent need for large-scale database storage and management. Indeed, more and more plant genomic databases have been created to meet that demand.
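
As a rough illustration of the one-base-per-cycle logic of sequencing-by-synthesis described in this section, the toy Python sketch below simulates terminator-limited extension on a single template strand. It is a deliberate simplification and not a model of any commercial platform: the function name `sequence_by_synthesis`, the example template, the error-free base calling and the removal of the terminator between cycles are all assumptions introduced here for illustration.

```python
# Watson-Crick pairing used to pick the base the polymerase incorporates.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(template, n_cycles):
    """Return the read produced by n_cycles of single-base extension.

    Hypothetical helper: each cycle adds exactly one complementary base,
    because the terminator on the modified dNTP blocks further extension.
    """
    read = []
    for cycle in range(min(n_cycles, len(template))):
        incorporated = COMPLEMENT[template[cycle]]
        # Detection step: the identity of the newly added base is recorded.
        read.append(incorporated)
        # Assumption: the terminator is removed here so the next cycle can run.
    return "".join(read)


# Hypothetical 8-base template; the read is its base-by-base complement.
example_read = sequence_by_synthesis("ATGCCGTA", n_cycles=5)
print(example_read)
```

The read length is bounded by the number of cycles, which is one reason read length differs between platforms and run configurations.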

PLANT GENOMIC RESOURCES

The history of plant genomics was changed dramatically by the introduction of expressed sequence tag (EST) sequencing, a high-throughput gene discovery method [30], and by the release of the complete Arabidopsis thaliana genomic sequence in 2000 [31]. Following that success, the complete genomic sequence of rice became available only 2 years later [32]. These events made powerful waves in both plant biotechnology and crop bioinformatics. Since then, more sequencing projects on important plant species have been carried out, combining novel in silico technologies from genomic research with traditional breeding schemes to further enhance crop quality. With the advent of NGS technology in 2005 [33], the number of sequenced plant genomes increased dramatically, to more than 100 species in 2014 according to CoGepedia, a platform that aims to record all plant genomes with published or in-progress sequences [12]. Over the years, these genomes have contributed much valuable material for plant research in the modern molecular genomics era. On that foundation, the genetic and biological activities of many critical genes and pathways have been revealed [34]. For instance, plant species such as Arabidopsis [31], Brachypodium distachyon (grass) [35], Physcomitrella patens (moss) [36] and Setaria italica (millet) [37, 38] can be used as scientific models for genomic studies of drought tolerance [39]. Others, such as Oryza sativa (rice) [40, 41], Populus trichocarpa (poplar) [42], Zea mays (maize) [43], Glycine max (soybean) [44], Solanum lycopersicum (tomato) [45] and Pinus taeda (loblolly pine) [46], can serve as both crops and functional models [34]. Non-model and non-crop plant genomes can also tell a story about genome construction and flowering plant evolution [34]. For example, the genomes of Utricularia gibba (bladderwort) and Genlisea aurea (corkscrew plant) provide significant insight into genome size variation [47, 48]. 
Furthermore, the genome of Spirodela polyrhiza (greater duckweed) is similar in size to that of Arabidopsis but contains 28% fewer genes [49]. In another case, the genomes of Selaginella moellendorffii (spikemoss) and Amborella trichopoda bridge key steps in the evolution of vascular plants and angiosperms, respectively. They reveal fundamental insights into the trajectory of plant-specific gene families and the radiation of flowering plants, and thus shed more light on flowering plant evolution [34]. The gene knowledge drawn from genomics can be used to recognize, classify, exploit and tag individual alleles, as well as to develop and apply molecular markers that track the desired alleles in breeding programs [50]. For those reasons, many genome sequencing projects on horticultural crops have been carried out, such as the Tomato genome sequencing project (www.sgn.cornell.edu/about/tomato) [45], the Potato genome sequencing consortium (www.potatogenome.net) [51], the Papaya genome sequencing project (www.asgpb.mhpcc.hawaii.edu/papaya/) [52], the Grape genome sequencing project (www.vitaceae.org) [53] and the Floral genome sequencing project (www.fgp.bio.psu.edu/) [54]; hopefully many more will become publicly available for scientific use in the near future. These projects combined traditional methods with advanced sequencing technologies to ensure the generation of high-quality sequences within a budget-efficient design [55]. These whole-genome sequencing projects may therefore have a significant impact on global food security and bio-energy advancement by providing invaluable resources for comparative and functional genomic studies [55]. If current research keeps moving forward, applying genomic resources to horticultural plant species may have a noticeable impact on global human well-being. 
The availability of complete genome sequences, together with the explosion of sequence data, creates an urgent need for well-catalogued and annotated DNA sequence databases. The largest and best known of these sequence databases are GenBank, EMBL and the DNA Data Bank of Japan [32]. These databases are acknowledged worldwide as the standard public collections of annotated DNA sequences and contain millions of plant DNA sequences. Taking NCBI as an example, by 2015 the NCBI Genome database had grown to a total of 5,132,285 plant accession entries according to the RefSeq Growth Statistics (http://www.ncbi.nlm.nih.gov/refseq/statistics/). In 2004 there were only 88,972 entries, so the average growth rate is approximately 458,483 entries per year over the eleven years from 2004 to 2015, which means that more than 38,000 sequences are added monthly. Other public databases provide additional information on plant genomes, such as Phytozome [56], PlantGDB [57], EnsemblPlants, ChloroplastDB [58], KEGG [59], the Genomes On-Line Database (GOLD) [12] and the CoGepedia wiki (Table ). Recently, in addition to these general sequence data banks, databases that focus on specific plant species have become available. Examples of such species-specific sequence databases are The Arabidopsis Information Resource (TAIR) [60], The Salk Institute Genomics Analysis Laboratory (SIGnAL), The RIKEN Arabidopsis Genome Encyclopedia (RARGE) [61], The Rice Genome Annotation Project (RGAP) [62], The Rice Annotation Project (RAP-DB) [63], The Solanaceae (SOL) Genomics Network (SGN) [64], Gramene [65], GrainGenes [66], SoyBase [67], MaizeGDB [68], CyanoBase [69], the Genome Database for Rosaceae (GDR) [70], Brassica Genome Gateway and Cucurbit Genomics Database (Table 1) [71]. 
Commonly, these databases and their associated web portals incorporate a set of analytical, visualization and query tools for studying the genomic sequences they hold, such as BLAST for identifying sequence similarity in large datasets.
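
The growth figures quoted in this section for the NCBI plant entries can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the two entry counts come from the text, while the function name `average_growth` and the assumption of linear growth are introduced here. Dividing by the eleven calendar years between 2004 and 2015 is what reproduces the quoted yearly figure.

```python
# Plant accession entry counts quoted in the text (RefSeq Growth Statistics).
ENTRIES_2004 = 88_972
ENTRIES_2015 = 5_132_285


def average_growth(start_count, end_count, start_year, end_year):
    """Return (entries per year, entries per month), assuming linear growth."""
    years = end_year - start_year
    per_year = (end_count - start_count) / years
    return per_year, per_year / 12


per_year, per_month = average_growth(ENTRIES_2004, ENTRIES_2015, 2004, 2015)
print(f"{per_year:,.0f} entries per year")
print(f"{per_month:,.0f} entries per month")
```

The yearly average comes out at 458,483 entries and the monthly average at a little over 38,000, consistent with the values stated above.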

PLANT COMPARATIVE GENOMIC ANALYSIS

Once whole genomes have been sequenced, defining and describing the genes and non-coding content in these sequences is an important process [72]. For that reason, plant comparative genomic analysis has arisen as a new field of modern biotechnology; its main purpose is to predict the functions of many unknown genes by studying the significant differences and similarities among species. These genes, however, must appear in the available datasets of orthologs that evolved from the same ancestor [73]. New tools and strategies to manage and analyze these tremendous amounts of data are therefore urgently needed. Recent approaches in bioinformatics and systems biology have met some of those demands but still face further challenges.

Tools and Databases for Plant Comparative Genomic Analysis

Using the comparative genomic approach, more and more genes in plant species have been annotated. For instance, several known stress-responsive transcription factors (TFs) in Arabidopsis and rice were used to correctly predict stress-responsive TFs in many other plant species, such as soybean, maize, sorghum, barley and wheat [74-76]. Moreover, comparison is not limited to plant species: comparative genomics between plants and distantly related prokaryotes can also help infer functionally associated genes. For example, the function of the NiaP protein family in plants was determined from the known role of those proteins in bacteria [77]. Similar strategies that identify functional genes among different plants by comparative analysis also help researchers annotate genes in newly sequenced plant species [78]. In addition, comparative genomics can discover missing biosynthetic genes through co-expression analysis [79]. This method assumes that an unknown gene that is co-expressed with several genes from a metabolic pathway is likely to function in that pathway [80, 81]. GolmTranscriptome DB [82] and ATTED-II [83] are two popular tools for this type of analysis in plants. One example is the discovery of the trans-prenyldiphosphate synthase responsible for making the solanesyl moiety of ubiquinone-9. The Arabidopsis gene At2g34630 was identified as an alternative candidate using co-expression and under-expression analysis in Arabidopsis and functional complementation in yeast [84]. Besides tools and strategies for analysis, powerful computational resources are essential for storing and managing massive genomic data. Many online platforms have been developed and published and are available for comparative genomic studies among different plant species. The plant genomic data platforms described below are among the most representative and widely used. Phytozome. 
One of the largest comparative databases for plant species (http://phytozome.jgi.doe.gov/pz/portal.html). It contains plant genomes, gene family data and evolutionary history information. At the beginning, only 25 sequenced and annotated plant genomes were included; this number has since increased to more than 50 species. Phytozome also provides powerful tools for comparative analysis at the level of sequence, gene structure, gene family and genome organization. With those tools and a comprehensive web portal, Phytozome enables scientists worldwide to conduct intensive plant research [56]. PLAZA. Known as the most comprehensive online platform for plant comparative genomics, PLAZA integrates functional and structural annotation of all currently published crop plant genomes (http://plaza.psb.ugent.be/). Together with that huge set of data, PLAZA provides many interactive tools to study genes, genome evolution and gene function. Its pre-computed datasets cover intraspecies dot plots, whole-genome multiple sequence alignments, homologous gene families, phylogenetic trees and genomic colinearity between species [85]. GreenPhylDB. A publicly accessible web resource that belongs to the South Green Bioinformatics Platform (http://southgreen.cirad.fr/). GreenPhylDB is designed for comparative and functional genomics in plants. At the current release (version 4), this database contains 37 full genomes of members of the plant kingdom. The GreenPhylDB catalogue of gene families is built from the gene predictions of these genomes and covers a broad taxonomy of green plants. Its web interfaces have been continually developed to improve navigation through the information related to each gene or gene family, such as gene composition, protein domains, publications, orthologous gene predictions and external links. The latest version of this database also makes it possible to browse the full Gene Ontology, which supports gene discovery [86]. PlantsDB. 
This is one of the most commonly used plant database resources for integrative and comparative plant genome research (http://mips.helmholtz-muenchen.de/plant/genomes.jsp). PlantsDB comprises database instances for tomato, Medicago, Arabidopsis, Brachypodium, Sorghum, maize, rice, barley and wheat. The platform stores and provides individual plant genomes. It is also equipped with up-to-date bioinformatics tools to visualize synteny, transfer data from model systems to crops and explore the similarities and peculiarities of different plant species. Further important analysis resources developed within PlantsDB are repeat catalogs and classification systems for all plant species [87].
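
The co-expression strategy described earlier in this section, in which an uncharacterized gene is linked to a pathway because its expression tracks that of known pathway genes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the method of GolmTranscriptome DB or ATTED-II: the gene names, expression values and the use of a mean Pearson correlation as the ranking score are all hypothetical choices made here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_candidates(pathway, candidates):
    """Rank candidate genes by mean correlation with known pathway genes."""
    scores = {}
    for name, profile in candidates.items():
        correlations = [pearson(profile, known) for known in pathway.values()]
        scores[name] = sum(correlations) / len(correlations)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression levels across six conditions (invented values).
pathway_genes = {
    "pathway_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "pathway_B": [2.0, 4.0, 5.0, 8.0, 10.0, 13.0],
}
candidate_genes = {
    "cand_1": [1.1, 2.2, 2.9, 4.1, 5.2, 5.8],
    "cand_2": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
}

ranked = rank_candidates(pathway_genes, candidate_genes)
for gene, score in ranked:
    print(f"{gene}: mean r = {score:.2f}")
```

In this toy data set `cand_1` rises and falls with both pathway genes and is ranked first, which is the kind of signal that would then be followed up experimentally, for example by functional complementation.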

Remaining Challenges

The amount of plant genomic data is increasing rapidly. Thousands of Gb of plant sequences are deposited in NCBI and other public databases every month. However, a reference genome sequence with the basic annotation provided by current comparative genomic databases is only a foundation. It still needs to be integrated with specific biological data, such as plant epigenetic modifications and gene expression under varying environmental conditions, developmental stages and tissue types, in order to obtain more detailed genome maps [34]. Moreover, because plant genomes are constantly being sequenced and re-sequenced, keeping databases up to date is a growing problem. Updates should propagate to all comparative genomic databases, not only to the individual genome database concerned. This technical problem requires efforts to synchronize data updates among different plant genomic platforms. Developing a strong community network of plant researchers might be one solution [88]. Several databases have been developed and published and are available to compare plant genomes and tentatively identify orthologs (Table 1). With its powerful application in gene prediction, comparative genomics has recently played an important role in building the functional annotation infrastructure on which future plant biotechnology researchers will rely.

PHYLOGENOMICS AND EVOLUTIONARY ANALYSIS IN PLANT

Phylogenomics is a form of molecular phylogenetic analysis that uses genome-scale datasets to predict gene function and to explore the evolutionary relationships among species. This definition originated in the late 1990s, when a hypothesis about predicting protein function through evolutionary analysis of a gene and its homologs was published [89]. Phylogenomics has also been defined as the new era of phylogenetic analysis that began as more complete genomes were sequenced [90]. Phylogenomics in plants has one advantage over that in other organisms: hundreds of low-copy-number nuclear genes can be identified, which makes it easier to study molecular systematics and evolutionary biology [91]. Current NGS approaches also provide plant phylogenomics research with useful information about plant genome diversity, such as the nature and frequency of genome duplication across a diversity of plant lineages [92-94]. Phylogenomic research aims to accomplish two important goals. The first is to discover the evolutionary patterns among plant species using nuclear genomic information. The second is to derive new hypotheses about the unknown functions of plant genes associated with major divergence events in the evolution of plant species [95]. Genomic data offer advantages in evolutionary studies over morphological data, which can easily mislead, and fossil data, which are usually fragmentary. Phylogenomics also places sets of orthologs from genomic sequences in a phylogenetic context to generate hypotheses about genes and biological processes [96]. Functional phylogenomics thus differs from classical phylogenetic analysis and from current functional genomic methods; in the latter, genomic information is mined without incorporating a phylogenetic context during the search for orthologs or candidate genes of functional importance [97]. 
However, constructing the tree of life (the phylogeny of all organisms), in which evolutionary relationships are inferred with phylogenomics as the advanced method, remains a matter of debate. Some studies have repeatedly re-evaluated the positions of certain plant species in biological taxonomy [98-100] to obtain the most accurate tree possible. Drawing a scientifically meaningful topology is still problematic because of limitations such as conflicts among methodologies and character sets [101] and systematic errors that arise from merely adding more sequences [102]. The main problem of phylogenomics therefore lies in handling large-scale genomic data properly so as to avoid systematically misleading (biased) inferences. Statistical confidence (the P value), which is normally used in such phylogenetic questions, has been reported to be unreliable. The authors of that report suggested that the magnitudes of differences (effect sizes) and biological relevance deserve more attention if results are to be trustworthy [103]. Another solution is to improve existing phylogenetic algorithms so that phylogenomic relationships can be inferred with minimal technical bias and greater computational efficiency [104]. New methods and tools have been developed to gradually overcome these limitations of plant phylogenomics. For instance, de novo assembly of short-read RNA-seq data dramatically improves gene coverage by producing non-redundant and non-chimeric transcripts that are optimized for downstream phylogenomic analysis [105]. Another protocol, Hyb-Seq, combines target enrichment of low-copy nuclear exons and flanking regions with genome skimming of high-copy repeats and organelle genomes to efficiently produce genome-scale datasets for plant phylogenomics [106]. 
More recently, ExaML (Exascale Maximum Likelihood), a code for large-scale phylogenetic analyses on the Intel MIC (Many Integrated Core) hardware platform, was updated to version 3. This program represents an achievement in developing better phylogenetic analysis algorithms: it is now possible to analyze datasets with 10-20 genes and up to 55,000 taxa [107]. However, although this version was released only a few months ago, ExaML still has its limits, since it can only run on supercomputers with Linux/Mac systems. New plant phylogenomic tools similar to ExaML, offering high performance and easy operation on any computational system, are clearly needed.
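
To make the idea of combining several orthologous genes into one phylogenomic data set more concrete, the Python sketch below concatenates per-gene alignments into a single supermatrix and computes pairwise p-distances (the proportion of differing sites). This is only a toy illustration of a common preliminary step and is not the maximum-likelihood approach implemented in ExaML; the taxon names, the sequences and the helper functions `concatenate` and `p_distance` are hypothetical.

```python
from itertools import combinations


def concatenate(alignments):
    """Join per-gene alignments (taxon -> aligned sequence) into one matrix."""
    taxa = sorted(alignments[0])
    return {taxon: "".join(gene[taxon] for gene in alignments) for taxon in taxa}


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns that contain a gap."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped sites shared by the two sequences")
    return sum(a != b for a, b in compared) / len(compared)


# Two hypothetical aligned orthologs for three hypothetical taxa.
gene_1 = {"sp_A": "ATGCCA", "sp_B": "ATGCTA", "sp_C": "TTGATA"}
gene_2 = {"sp_A": "GGA-TC", "sp_B": "GGAATC", "sp_C": "GCAATA"}

supermatrix = concatenate([gene_1, gene_2])
distances = {
    (a, b): p_distance(supermatrix[a], supermatrix[b])
    for a, b in combinations(sorted(supermatrix), 2)
}
closest_pair = min(distances, key=distances.get)

for pair, dist in distances.items():
    print(pair, round(dist, 3))
print("closest pair:", closest_pair)
```

A distance matrix like this could feed a simple clustering method, but real phylogenomic analyses rely on explicit substitution models and, as the text notes, on careful handling of systematic bias.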

GENOME-WIDE ASSOCIATION STUDIES IN PLANT

Understanding phenotypic variation, such as variation in the agronomically important traits used as plant breeding resources, has been a main focus of plant genetic studies. In classical crop breeding, biparental cross-mapping is still a major method for the genetic dissection of traits, although it maps QTLs at low resolution (typically spanning several megabases) [108]. To overcome that disadvantage, and thanks to the advent of NGS, GWAS is now a favored tool for exploring allelic variation across a broader range of phenotypic diversity and for mapping QTLs at higher resolution. Many research projects have used GWAS to investigate the association between genetic variation and valuable plant traits. GWAS has been successfully applied to Arabidopsis thaliana, a typical model plant, in which more than 1,300 distinct accessions have been genotyped for 250,000 SNPs [109] and 107 phenotypes have been studied [110]. Building on this foundation, GWAS has been conducted successfully on numerous other traits of interest in Arabidopsis, such as glucosinolate levels [111], shade avoidance [112], heavy metal [113], salt tolerance [114] and flowering time [115]. Besides Arabidopsis, rice, one of the most important crop species in the world, has also been the focus of intense efforts to map the ancestral genetic variation that underlies agronomic traits such as heading date, grain size and starch quality [116]. A few rice genes with large effects on traits that determine yield, morphology, stress tolerance and nutritional quality have also been identified [117]. GWAS has also been widely used to dissect complex traits in other major crops, e.g., maize and soybean [118-122]. 
GWAS is undeniably a powerful approach in plant species for identifying trait-associated loci that underlie phenotypic diversity, as well as allelic variation in candidate genes for quantitative and complex traits [123, 124]. However, accelerating genetic mapping and gene discovery in plants with GWAS requires more than massive DNA variation data from NGS: it also requires high-throughput phenotyping facilities that can capture specific traits in detail, to enhance GWAS results and yield more significant gene identification information [125]. This is a challenging but promising road for future plant genomic mapping research, and efforts are under way to produce high-quality phenotyping data [126-129]. Computational tools to assist GWAS are a further concern. A GWAS tool needs to perform well on three main factors: computing speed, memory requirements and statistical power [130]. At present, several bioinformatics approaches have been introduced as GWAS acceleration tools. Some examples follow. Heap. Heap is a SNP detection tool for NGS data designed with GWAS and genomic prediction (GP) in mind. Heap detects a larger number of variants by taking advantage of information on whether the samples are inbred (homozygosity assumption) or not. For data portability to GWAS/GP, Heap outputs variant information in VCF, Beagle and PED/MAP format files that are compatible with existing GWAS/GP tools [131]. GnpIS-Asso. GnpIS-Asso is a generic database for managing and exploiting plant genetic association studies. This database provides tools that allow plant scientists and breeders to retrieve association values between traits and markers obtained in several association studies. Results can also be viewed graphically in dedicated, dynamically generated plots (QQ plot, Manhattan plot), and data can be exported to files for further analysis with external tools. 
After the best markers associated with the trait of interest have been selected, a dedicated tool automatically jumps to the genome to show where those markers are located on the chromosomes and which genes, other markers or features of interest are nearby. This database is currently used for GWAS in two species: tomato and maize [132]. BioGPU. As a high-performance computing tool for GWAS, BioGPU effectively controls false positives caused by population structure and unequal relatedness among individuals, and improves statistical power compared with mixed linear model methods. The BioGPU method has a much lower computing time complexity. BioGPU was developed with parallel computing capacity to increase computing speed, so that computing time decreases linearly with the number of central processing units. To address the memory footprint bottleneck, BioGPU lets users directly control memory usage when large datasets are analyzed on computers with limited memory; in other words, users can trade computing time for lower memory usage. Based on these features, BioGPU makes analyses of large and complex datasets feasible without supercomputers [130]. BHIT. The Bayesian high-order interaction toolkit (BHIT) first builds a Bayesian model on both continuous and discrete data, which is capable of detecting high-order interactions among SNPs related to case-control or quantitative phenotypes. Using both simulated data and soybean seed composition studies on oil and protein content, BHIT effectively detected the high-order interactions associated with phenotypes, and it outperformed a number of other currently available tools. BHIT was also applied to Soybean 50K SNP array data using diverse computational strategies, and a series of multi-order SNP interactions associated with oil and protein phenotypes was detected. BHIT is freely available at http://digbio.missouri.edu/BHIT/ for academic users [133]. 
While QTL analysis of even a small dataset was time-consuming in the past, recent bioinformatics approaches make it possible to run GWAS with a simple marker scan of a few hundred thousand SNPs on a PC or web-based software within a few minutes [123]. However, future GWAS tools still need to improve in speed and memory capacity in order to keep pace with rapidly growing plant genomic data. Moreover, to ensure the accuracy of GWAS results, statistical testing is very important and must be applied rigorously, with mixed models used to account for genetic background as a source of error [134, 135]. One example is the online GWAS tool for Arabidopsis, which was developed in the R and Python programming languages [136]. This web-based server comprises common accessions with their genotyping information, offers several statistical options, and integrates correlation analysis among published traits [136]. In combination with high-resolution phenotyping technologies, GWAS is a novel strategy for research on plant genetics, genomics, gene characterization and breeding [137].

Nevertheless, GWAS still has another limitation: most studies fail to detect epistatic and gene-environment interactions [138]. Because living organisms express their phenotypes as the result of several factors, including epistatic effects and interactions with the environment, it is important to estimate gene-gene and gene-environment interactions for better breeding systems [139, 140]. Focusing on a single main SNP that correlates with a specific phenotype, as in the usual GWAS output, may miss key genetic variants with particular environmental responses in the context of complex traits [141]. Here again, bioinformatics approaches offer a solution.
The generalized multifactor dimensionality reduction (GMDR) algorithm, run on a computing system with graphics processing units (GPUs), is one of the currently available methods that can screen potential candidate variants and then use a mixed linear model to detect epistatic and gene-environment interactions [142]. This new GWAS strategy was successfully applied to identify four significant SNPs associated with additive, epistatic, and gene-environment interaction effects in rice [138]. A similar GWAS method using epistatic association mapping (EAM) also successfully detected three epistatic QTLs in soybean [143]. The methods presented here are just the groundwork; future bioinformatics tools have to be more powerful in statistical methodology and overcome the heavy computational burden of current approaches [144-146].
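To make the notion of an epistatic effect concrete, the sketch below computes additive and interaction contrasts for a pair of biallelic SNPs from the four two-locus genotype class means. It is a textbook-style illustration under simplifying assumptions (fully inbred lines, genotypes coded 0/1, no environment term, no significance test) and is not an implementation of GMDR or EAM; the function name and data are hypothetical.

```python
def epistasis_contrast(snp1, snp2, phenotype):
    """Additive and interaction contrasts for two biallelic SNPs in inbred lines.

    snp1, snp2: lists of 0/1 codes (homozygous reference / alternative allele);
                any other code raises KeyError
    phenotype:  list of trait values in the same order
    Returns (effect1, effect2, interaction) computed from the four two-locus
    genotype class means; raises ValueError if any class is empty.
    """
    sums = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    counts = {key: 0 for key in sums}
    for a, b, y in zip(snp1, snp2, phenotype):
        sums[(a, b)] += y
        counts[(a, b)] += 1
    if min(counts.values()) == 0:
        raise ValueError("every two-locus genotype class needs at least one line")
    m = {key: sums[key] / counts[key] for key in sums}
    # Main effect of each locus, averaged over the background of the other
    effect1 = ((m[1, 0] - m[0, 0]) + (m[1, 1] - m[0, 1])) / 2
    effect2 = ((m[0, 1] - m[0, 0]) + (m[1, 1] - m[1, 0])) / 2
    # Epistasis: does the effect of locus 2 depend on the allele at locus 1?
    interaction = (m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0])
    return effect1, effect2, interaction

# Toy data (assumed): the trait rises only when both alternative alleles co-occur
snp1 = [0, 0, 1, 1, 0, 0, 1, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]
trait = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 3.0]
print(epistasis_contrast(snp1, snp2, trait))
```

A non-zero interaction contrast means the two loci do not act additively, which a single-SNP scan cannot capture. Testing every marker pair this way scales quadratically with the number of SNPs, which is one reason candidate screening and GPU acceleration are used in practice.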

BIOINFORMATIC ADVANCES BEYOND PLANT GENOMIC RESEARCH

The world is now in the post-genomic era, as DNA sequencing technology continues to reach unprecedented scale and throughput. The term “genomics” itself is only a small part of the whole picture named “omics”. With the development of modern technology, several new omics layers have emerged to deepen knowledge of plant molecular systems [147]. The most recently added omics layers include interactomics, epigenomics, hormonomics, and metabolomics. NGS provides whole-genome sequencing/re-sequencing for the various genomic analyses discussed in this paper, and it is also established for transcriptome and non-coding RNAome analysis by RNA sequencing (RNA-seq), quantitative detection of epigenomic dynamics, and ChIP-seq analysis of DNA–protein interactions [148]. In addition, approaches to transcriptional regulatory network research based on omics data have been published, such as interactome analysis for networks formed by protein–protein interactions [149], hormonome analysis for phytohormone-mediated cellular signaling [150], and metabolome analysis for metabolic systems [151]. Evidently, these rapidly growing omics databases widen the scale of genomic resources. Bioinformatics has therefore become more essential than ever, so that every aspect of omics-based research can be well managed and effectively analyzed.

CONCLUSIONS

Recent advances in bioinformatics applications for plant genomes bring not only huge potential for large-scale genomic research across plant species but also many technical challenges. NGS technologies and platforms will make plant genetic data abundant in the next few years. With these accessible genomic data, the development of effective tools for data management and analysis becomes increasingly important. Indeed, more and more genome databases of plant species are continuously being established and combined with different analysis methods. Comparative genomic analysis gives specific insight into functional genes within and among plant species. Phylogenomic results provide more accurate evidence for evolutionary studies and for hypothesized gene functions in plants. GWAS, which is currently used in plant research, successfully points out loci and allelic variation related to valuable traits. On the other hand, one of the main challenges facing plant genomic researchers is the high demand for knowledge and skills in bioinformatics and computer science needed to manage and intensively exploit the results from the growing body of large-scale plant genomic data. Moreover, since high-density genotype information is being exploited rapidly, high-throughput phenotyping is urgently needed to provide plant genomic analysis results at high resolution. In brief, the recent wealth of plant genomic resources, along with advances in bioinformatics, has enabled plant researchers to achieve a fundamental and systematic understanding of economically important plants and plant processes, which is critical for advancing crop improvement. Despite these exciting achievements, there remains a critical need for effective tools and methodologies to advance plant biotechnology, to tackle questions that are hard to solve using current approaches, and to facilitate the translation of this newly discovered knowledge into improved plant productivity.

Table 1

List of plant genomic databases.

Type of Database	URL
General Plant Genome Database
NCBI Genome	http://www.ncbi.nlm.nih.gov/genome/
Phytozome v10.2	http://phytozome.jgi.doe.gov/pz/portal.html
PLAZA	http://plaza.psb.ugent.be/
PlantGDB	http://www.plantgdb.org/
Ensembl Plants	http://plants.ensembl.org/index.html
ChloroplastDB	http://chloroplast.cbio.psu.edu/
KEGG	http://www.genome.jp/kegg/
GOLD v.5	https://gold.jgi-psf.org/
CoGepedia	https://genomevolution.org/wiki/index.php/Main_Page
Species-specific sequence databases
TAIR (Arabidopsis)	http://www.arabidopsis.org/
SIGnAL (Arabidopsis)	http://signal.salk.edu/
RARGE (Arabidopsis)	http://rarge.psc.riken.jp/
RGAP v.7 (Rice)	http://rice.plantbiology.msu.edu/
RAP-DB (Rice)	http://rapdb.dna.affrc.go.jp/
SGN (Solanaceae)	http://solgenomics.net/solanaceae-project/index.pl
Gramene (Gramineae)	http://www.gramene.org/
GrainGenes (Triticeae and Avena)	http://wheat.pw.usda.gov/GG3/
SoyBase (Soybean)	http://soybase.org/
MaizeGDB (Maize)	http://www.maizegdb.org/
CyanoBase (Cyanobacteria)	http://genome.microbedb.jp/cyanobase/
GDR (Rosaceae)	https://www.rosaceae.org/
Brassica Genome Gateway (Brassica)	http://brassica.nbi.ac.uk/
Cucurbit Genomics Database (Cucurbitaceae)	http://www.icugi.org/cgi-bin/ICuGI/index.cgi
Comparative genomics analysis databases
Golm transcriptome db	http://csbdb.mpimp-golm.mpg.de/csbdb/dbxp/ath/ath_xpmgq.html
ATTED-II	http://atted.jp/
Other databases and tool resources
Galaxy	http://galaxyproject.org
Sanger institute	http://www.sanger.ac.uk/
GSAP	http://www.broadinstitute.org/
