Evolutionary divergence and limits of conserved non-coding sequence detection in plant genomes.

Anna R Reineke, Erich Bornberg-Bauer, Jenny Gu.

Abstract

The discovery of regulatory motifs embedded in upstream regions of plants is a particularly challenging bioinformatics task. Previous studies have shown that motifs in plants are short compared with those found in vertebrates. Furthermore, plant genomes have undergone several diversification mechanisms such as genome duplication events which impact the evolution of regulatory motifs. In this article, a systematic phylogenomic comparison of upstream regions is conducted to further identify features of the plant regulatory genomes, the component of genomes regulating gene expression, to enable future de novo discoveries. The findings highlight differences in upstream region properties between major plant groups and the effects of divergence times and duplication events. First, clear differences in upstream region evolution can be detected between monocots and dicots, thus suggesting that a separation of these groups should be made when searching for novel regulatory motifs, particularly since universal motifs such as the TATA box are rare. Second, investigating the decay rate of significantly aligned regions suggests that a divergence time of ~100 mya sets a limit for reliable conserved non-coding sequence (CNS) detection. Insights presented here will set a framework to help identify embedded motifs of functional relevance by understanding the limits of bioinformatics detection for CNSs.

Year: 2011 PMID: 21470961 PMCID: PMC3152334 DOI: 10.1093/nar/gkr179

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

A major focus in the post-genomic era is to understand the temporal expression of genes defining developmental stages (1), physiological states (2), stress responses (3) and adaptive fitness (4). Gene expression is controlled by regulatory elements embedded in non-coding regions of the genome, the discovery of which remains elusive (5). Many genomic surveys of conserved non-coding sequences (CNSs), which are used as a proxy to identify embedded putative gene expression regulators, are proving to be particularly challenging (6–13). While the conservation of examples such as the TATA-box cis-regulatory motif has been identified by Berendzen et al. (14) across all species, such universal, highly conserved motifs are few. Unlike cis-elements of vertebrates, which are frequently ≥100 bp in length (15), experimentally identified plant cis-elements are infrequently over 30 bp, with a median observed length of 8 bp (8,16,17). Bioinformatic scans of genomes estimate plant CNSs to have a predicted median length of 25 bp (18); however, the functional relevance of these CNSs remains to be confirmed. Although a CNS does not necessarily have a regulatory role, CNSs are often used as proxies to identify possibly embedded cis-regulatory motifs. In addition, while nuclear genes of plants and animals show similar substitution rates (9,19,20), non-coding regions in plants appear to have a higher degeneracy (16,21). This may be a consequence of differences in genomic evolutionary mechanisms observed between animals and plants. Genomes of plants have been found to be more diverse than vertebrate genomes due to, for example, increased duplication events (22), polyploidy (23,24), increased recombination (25), transposable elements (TEs) (26) and gene silencing (21,27,28). These evolutionary processes compound the challenges of detecting CNSs when conservation becomes difficult to distinguish from the degenerate background frequency (13).
Therefore, the bioinformatics discovery of possibly embedded regulatory motifs for further characterization and understanding of gene regulation is also complicated by the decay of strong functionally important sequence signatures in upstream regions. Several bioinformatics strategies have been employed to identify putative CNSs and embedded regulatory motifs, among them leveraging conservation through sequence comparisons that suggest possible functional importance. Early approaches use probability-based algorithms employing a variety of strategies, such as Expectation Maximization and Gibbs sampling, to find overrepresented motifs. These algorithms include MEME (29), MotifSampler (30) and AlignACE (31). Improvements in recent algorithms are often achieved through flexible search parameters (32,33), suffix trees (34), development of mixed models (35,36), analysis aided by supplementary high-throughput experimental data (30,37–39), prior knowledge of transcription factor binding site features (40–42), implementation of graph-based methods (43) and the incorporation of phylogenetic relationships (7,44). Finally, consensus interpretation of results from multiple different algorithms has been proposed to improve the discovery of conserved motifs (34,45). These strategies, however, have mostly been applied to metazoan CNSs, and their reported success often remains limited to them (13,46–48). More recently, a new method has been developed to identify CNSs in plants by determining the statistical significance of aligned segments (49). Nevertheless, the success of any algorithm improves with a stronger understanding of the data from which the desired feature is to be extracted. As genomes become increasingly available through advances in high-throughput sequencing, comparative genomics using orthologs and paralogs is becoming a popular strategy that has yielded some success in CNS identification (6,9,12,50–53).
However, the validity of such comparisons needs to be explored, as a number of evolutionary mechanisms lead to rapid divergence of sequences and thus make significant similarities difficult to detect. The useful divergence of sequences for comparison is a recognized issue in comparative genomics-based analysis (54,55). This article addresses the limits within which comparative genomics can be employed, specifically in plants, to identify putative CNSs by investigating the decay rate of significantly aligned sequences in upstream regions with respect to the age of divergence. Most critical to the success of any algorithm is a proper null model, so that statistical significance can be properly assigned to CNSs. This entails understanding the effects of underlying biological processes. Furthermore, defining the divergence limit for CNS discovery also highlights which comparisons will yield interpretable results. Upstream regions spanning 5 kb, as well as regions downstream of the coding sequence, were compared for monocots and dicots. These findings provide practical guidelines and considerations for future research efforts in understanding the regulatory genome of plants.

MATERIALS AND METHODS

Data sets used for phylogenomic comparisons

The following plant genomes were used for comparative analysis of upstream regions: Arabidopsis thaliana v9.0 (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/) (56), Arabidopsis lyrata v1.0 (http://www.jgi.doe.gov/, http://genome.jgi-psf.org/Araly1/Araly1.download.ftp.html), Carica papaya v1.0 (http://www.life.illinois.edu/ming/, ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Cpapaya/) (57), Glycine max v1.01 (ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Gmax/annotation/) (58), Medicago truncatula v3.0 (http://www.medicago.org/genome/downloads/Mt3/) (59,60), Populus trichocarpa v2.0 (ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Ptrichocarpa/annotation/) (61), Ricinus communis v0.1 (http://www.phytozome.net/ricinus, http://castorbean.jcvi.org/castorbean_downloads.shtml), Manihot esculenta v1.1 (http://www.phytozome.net/cassava, ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Mesculenta/), Prunus persica v1 (http://www.phytozome.org/peach), Cucumis sativus v1 (http://www.phytozome.net/cucumber.php) (62), Zea mays v4a.53 (http://ftp.maizesequence.org/current/filtered-set/) (63), Sorghum bicolor v1.0 (ftp://ftp.jgi-psf.org/pub/JGI_data/Sorghum_bicolor/v1.0/Sbi/annotation/Sbi1.4/) (64), Oryza sativa v6.0 (ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/) (65), Brachypodium distachyon v1.0 (http://files.brachypodium.org/Annotation/) (66), Ostreococcus lucimarinus v2.0 (http://genome.jgi-psf.org/Ost9901_3/Ost9901_3.download.ftp.html) (67), Ostreococcus tauri v2.0 (http://genome.jgi-psf.org/Ostta4/Ostta4.download.ftp.html) (67) and Micromonas pusilla v2.0 (http://genome.jgi-psf.org/MicpuC2/MicpuC2.download.ftp.html) (68).

Estimating time of divergence between compared genomes

MUSCLE (69) was used to align the rbcL, matK and atpB genes, markers that have been used for previous phylogenetic tree constructions (http://www.ncbi.nlm.nih.gov/) (70–73). BEAST (74) was used to construct a phylogenetic tree from these alignments to estimate divergence times between the species used in this investigation. BEAUti (74) was used to define the following settings for BEAST: the WAG substitution model, a relaxed clock model, a randomly generated starting tree and an MCMC chain length of 10 000 000. Taxa were set based on the APG III classification (72) with the following priors: A. thaliana–A. lyrata ∼5.15 mya (75), P. trichocarpa–R. communis ∼81 mya (76), A. thaliana–C. papaya ∼72 mya (76) and Z. mays–S. bicolor ∼16 mya (77). Glycine max and Medicago truncatula are estimated to have diverged ∼50–54 mya (personal communication with Nevin D. Young). The first 200 trees were discarded as burn-in with TreeAnnotator (74), BEAST results were analyzed with Tracer (74) and the resulting phylogenetic trees were visualized with FigTree (74). The divergence times of the three gene trees generated from the rbcL, matK and atpB genes (Supplementary Data S1) were averaged.

Ortholog detection and comparative analysis

Inparanoid (78) was used with default parameters to identify corresponding orthologous genes between the genomes. The longest splice variant was used as the representative for each ortholog, and identified clusters were separated into two sets: a set containing only one-to-one relationships between two orthologous genes (singleton orthologs) and a second set with clusters containing multiple genes (multiple orthologs). The alignment similarities for multiple-ortholog clusters were calculated using DIALIGN for all possible combinations of orthologous pairs. The best and average similarity scores for each multiple-ortholog cluster were used for comparison in the analysis.
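The cluster handling described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the cluster representation (a pair of gene-ID lists, one per genome), the function names `split_clusters` and `cluster_scores`, and the `similarity` callback standing in for a DIALIGN-derived score are all hypothetical. It also assumes that "all possible combinations" means every cross-genome gene pair within a cluster.

```python
from itertools import product
from statistics import mean


def split_clusters(clusters):
    """Separate ortholog clusters into singleton (one-to-one) and multiple sets.

    Each cluster is a pair (genes_a, genes_b) holding the gene IDs that the
    ortholog finder grouped together for genome A and genome B.
    """
    singletons, multiples = [], []
    for genes_a, genes_b in clusters:
        if len(genes_a) == 1 and len(genes_b) == 1:
            singletons.append((genes_a, genes_b))
        else:
            multiples.append((genes_a, genes_b))
    return singletons, multiples


def cluster_scores(cluster, similarity):
    """Return (best, average) similarity over all cross-genome pairs in a cluster.

    `similarity` is a hypothetical callback returning the upstream-region
    similarity score for one gene pair.
    """
    genes_a, genes_b = cluster
    scores = [similarity(a, b) for a, b in product(genes_a, genes_b)]
    return max(scores), mean(scores)


# Toy usage with made-up gene IDs and scores.
toy_clusters = [(["a1"], ["b1"]), (["a2", "a3"], ["b2"])]
toy_scores = {("a2", "b2"): 30.0, ("a3", "b2"): 10.0}
singleton_set, multiple_set = split_clusters(toy_clusters)
best, average = cluster_scores(multiple_set[0], lambda a, b: toy_scores[(a, b)])
```

Reporting both the best and the average score per cluster mirrors the two measures used later when multiple orthologs are compared with singleton orthologs.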

Retrieval of upstream and downstream regions

Upstream regions were extracted based on the annotated transcription start site (TSS, corresponding to the mRNA start position noted in the gff3 file) and translation start site (ATG, the start codon position noted in the gff3 file) for each respective genome data set. The retrieved upstream regions were then partitioned into 10 × 500 bp segments from 0 to +5 kb. Downstream regions were extracted with a length of 500 bp from the end of the transcript representing the longest splice variant (−500 bp).
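A minimal sketch of the extraction and partitioning step follows, assuming the chromosome is available as a plain string and the reference point (TSS or ATG) is given as a 1-based coordinate of the first base of the feature on the gene's own strand. The function names and the convention that segment 0 is the segment closest to the reference point are assumptions made for illustration; the source does not describe its implementation.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, ref_pos, strand, length=5000):
    """Return up to `length` bp upstream of a 1-based reference position.

    The result is always oriented 5'->3' on the gene's strand, so the base
    adjacent to the reference point is the last character.
    """
    if strand == "+":
        start = max(0, ref_pos - 1 - length)
        return chrom_seq[start:ref_pos - 1]
    # Minus strand: upstream lies at higher coordinates on the plus strand.
    return reverse_complement(chrom_seq[ref_pos:ref_pos + length])


def partition_segments(region, size=500):
    """Split an upstream region into segments, nearest to the gene first."""
    segments = []
    end = len(region)
    while end > 0:
        start = max(0, end - size)
        segments.append(region[start:end])
        end = start
    return segments


# Toy usage on short made-up sequences (length 5 instead of 5000).
plus_example = upstream_region("TTTTTACGTAGGG", 11, "+", length=5)
minus_example = upstream_region("CCCTACGTAAAAA", 3, "-", length=5)
toy_segments = partition_segments("AAACCCGG", size=3)
```

With the defaults, a complete 5 kb region yields ten 500 bp segments; regions truncated by a chromosome end simply yield a shorter final segment.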

Tool performance comparisons to identify CNS

Several tools were used and compared to identify aligned upstream regions with a minimum length of 8 bp, which we designate in this investigation as putative CNSs: BLASTN (79), CHAOS (80), DIALIGN (81) and LAGAN (82). Comparisons were made using the calculated sequence similarities for +500 bp upstream of the TSS between all plant genome pairings, and the results were then used as a benchmark. Upstream regions were aligned when each sequence in the comparison contained no more than 5% missing nucleotides and had a length of 500 bp. Default settings were used for DIALIGN and LAGAN for alignment. For these tools, a CNS was defined as an aligned region with a minimum length of 8 consecutive base pairs. Alignments were concatenated when separated by a maximum of five unaligned bases. Default parameter settings and a word length of 8 bp were used for BLASTN and CHAOS to detect sequence similarity. Alignments were also concatenated using the following strategy to calculate similarity scores for CNSs: (i) overlapping alignments were resolved first by taking the longer of the two alignments; (ii) for overlapping alignments of equal length, the one with the lower e-value was selected. The similarity score S of an alignment was calculated as the sum of non-overlapping CNS lengths located within the respective 500 bp segment, where i indexes the identified putative CNSs: $S = \sum_i \mathrm{length}(\mathrm{CNS}_i)$. This metric was used as a measure to detect the reduction in CNS search space with respect to evolutionary decay and proximity to either the transcription or translation start site. Similarity score distributions of 2000 randomly chosen upstream regions for each of the compared plant genomes were used to construct cumulative null models for dicots and monocots, respectively, which were used in the calculation of statistical significance.
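The scoring rules above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' code: aligned blocks are represented as half-open `(start, end)` intervals within one 500 bp segment, blocks shorter than 8 bp are discarded before concatenation, and a concatenated CNS is counted over its full span including the bridged gap. The greedy overlap resolution is one possible reading of rules (i) and (ii). All function names are hypothetical.

```python
def merge_blocks(blocks, max_gap=5):
    """Concatenate blocks separated by at most `max_gap` unaligned bases."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


def putative_cns(blocks, min_len=8, max_gap=5):
    """Keep blocks of at least `min_len` bp, then concatenate nearby ones."""
    kept = [(s, e) for s, e in blocks if e - s >= min_len]
    return merge_blocks(kept, max_gap)


def similarity_score(blocks, min_len=8, max_gap=5):
    """S = sum of the lengths of the non-overlapping putative CNSs."""
    return sum(e - s for s, e in putative_cns(blocks, min_len, max_gap))


def resolve_overlaps(hits):
    """Resolve overlapping (start, end, evalue) hits.

    Longer hits win; among equally long hits the lower e-value wins.
    """
    ranked = sorted(hits, key=lambda h: (-(h[1] - h[0]), h[2]))
    kept = []
    for hit in ranked:
        if all(hit[1] <= k[0] or hit[0] >= k[1] for k in kept):
            kept.append(hit)
    return sorted(kept)


# Toy usage with made-up coordinates.
toy_blocks = [(0, 10), (12, 22), (40, 45), (100, 108)]
toy_score = similarity_score(toy_blocks)
toy_hits = [(0, 20, 1e-5), (10, 25, 1e-3), (30, 40, 1e-2), (35, 45, 1e-4)]
toy_resolved = resolve_overlaps(toy_hits)
```

In the toy example the two 10 bp blocks are joined across a 2 bp gap, the 5 bp block is dropped and the 8 bp block is kept, so the score is the sum of a 22 bp and an 8 bp CNS.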

Identifying effects of indels and TEs

Effects of insertions, deletions and TEs were investigated using DIALIGN (81) and BLASTN (79) to find CNS disruptions within the +5 kb upstream regions of the TSS and ATG. Cross-comparison alignments were made between all 500 bp segments up to 5 kb in the upstream regions of the compared species. Two 500 bp regions are loosely defined as significantly similar when their similarity score lies more than 1.5 standard deviations (σ) above the mean of the null model. Corresponding 500 bp sequences whose best scores are at the same position are assumed to have no disruptions in upstream regions resulting from indels. TEs were identified with RepeatMasker (http://www.repeatmasker.org) in the 5 kb upstream regions with a div-setting of 20% against the angiosperm library. The distribution of TEs in sequences was calculated by counting all masked positions of the sequences in the RepeatMasker output file, where all interspersed repeats in the sequence are masked.
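Both quantities in this subsection are simple to compute once the null-model scores and the masked sequences are in hand. The sketch below assumes the threshold is the null-model mean plus 1.5 sample standard deviations, and that masked positions appear as `N` or `X` characters in the masked FASTA output; both are assumptions, as are the function names. Note that pre-existing `N` characters from sequencing gaps would be counted as well, which is consistent with the caveat raised for Z. mays in the Results.

```python
from statistics import mean, stdev


def significance_threshold(null_scores, k=1.5):
    """Return mean + k * standard deviation of the null-model scores.

    The sample standard deviation is used here as an assumption.
    """
    return mean(null_scores) + k * stdev(null_scores)


def is_significant(score, null_scores, k=1.5):
    """True when a similarity score exceeds the null-model threshold."""
    return score > significance_threshold(null_scores, k)


def masked_fraction(masked_seq, mask_chars="NX"):
    """Fraction of positions that are masked in a masked sequence."""
    if not masked_seq:
        return 0.0
    masked = sum(1 for base in masked_seq.upper() if base in mask_chars)
    return masked / len(masked_seq)


# Toy usage with made-up values.
toy_null = [10, 20, 30]
toy_threshold = significance_threshold(toy_null)
toy_fraction = masked_fraction("ACGTNNNNAC")
```

In practice the null distribution would hold the similarity scores of the 2000 randomly paired upstream regions described in the previous subsection rather than three toy values.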

Statistical analysis

Statistical tests were performed on the alignment similarity score distributions of dicots, monocots, and algae using the Mann–Whitney U test with R. Statistical analyses between all plant pair comparisons were performed on similarity scores calculated with the singleton ortholog data set using the Kruskal–Wallis test, a non-parametric ANOVA-like multivariance test from the pgirmess package of R. The Kruskal–Wallis test was also applied for the cross-comparison analysis between all singleton orthologs and the comparison between downstream and upstream regions. Both the Mann–Whitney U-test and Kruskal–Wallis test were applied to identify differences in CNS amount and length.

RESULTS

Workflow for CNS divergence and tool comparisons

A workflow including ortholog detection, comprehensive cross-comparison of upstream regions up to 5 kb, tool comparisons and TE identification was implemented to understand the decay rate of CNSs in plant regulatory genomes (Figure 1). CNSs are defined in this investigation as aligned upstream regions of similarity with a length of ≥8 bp. First, technical considerations were addressed through tool comparisons, data quality and divergence time between compared genomes using only the first +500 bp of upstream regions from different starting reference points. The search space for CNSs was then expanded to 5 kb in order to discern positional effects and those that may be introduced by putative TEs.

Figure 1.

Workflow for phylogenomic CNS identification. A workflow was implemented to address the effects of tool performance, data quality, divergence age and localization for bioinformatic identification of putative CNSs when using comparative genomics in plants. Orthologous gene pairs were found with Inparanoid and upstream regions of genes were extracted using both the TSS and ATG reference points for a length of 500 bp and 5 kb. Tool comparisons were performed using the first 500 bp of orthologously paired upstream regions as benchmark. Comparisons of upstream region similarity with respect to divergence time, proximity to TSS and ATG and between monocots and dicots were conducted. Cross-comparisons between 500 bp segments covering a search space of 5 kb were also conducted to localize regions with the highest alignment similarity. RepeatMasker was used to identify putative TEs.

The first 500 bp upstream regions using both the ATG and TSS reference points were extracted for all orthologous genes (Figure 1, left path) to address technical considerations in searching for CNSs. Orthologs between two genomes were found using Inparanoid (78). For this study, singleton orthologs were used as the benchmark to ensure our analysis addresses effects of evolutionary decay rather than alternative interfering mechanisms that may be associated with duplication and multiple orthologs. Singleton orthologs, genes with only direct one-to-one relationships, have been shown to have an increased possibility of sharing gene function compared with multiple orthologs, where neo- or subfunctionalization often occurs after gene duplication (83). Pairwise upstream region similarities were then compared to evaluate tool performances (Supplementary Figure S3). The tools evaluated were BLASTN (79), CHAOS (80), DIALIGN (81) and LAGAN (82).
Comparisons showed BLASTN and DIALIGN to perform best, and therefore these tools were used to estimate the decay of upstream regions with respect to divergence time. Furthermore, comparison of similarity scores and CNS properties between monocots and dicots, using both the ATG/TSS reference points, were performed for the first 500 bp upstream region. After benchmarking with the first 500 bp, the search space for putative CNSs was then expanded to include 5 kb upstream regions using both the ATG and TSS (Figure 1, right path) as reference points. A cross-comparison of all ten 500 bp segments of the extracted 5-kb was performed to localize regions with the highest aligned similarity. Interspersed repetitive elements that may be putative TEs in the 5-kb upstream region were also analyzed with RepeatMasker.

Estimation of divergence time between plant species

Since a phylogenomic strategy was employed in this investigation, the phylogenetic relationship and age of divergence must be estimated between all compared species. The majority of the genome data sets used in this investigation have previously published divergence times, which made it possible to estimate the split between the remaining species using BEAST (Figure 2) (74). The result suggests that R. communis and M. esculenta diverged more recently, at ∼41 mya, than G. max and M. truncatula, whose divergence is estimated at 50–54 mya (N. D. Young, personal communication). Since the published divergence times are based on different methods, a comparison of these results may not be appropriate. Nevertheless, the published data and results of the BEAST analysis are good estimates for our research aims, and the sequence of divergence events is more important than the exact divergence times for this investigation. Due to possible substitution rate variations between the different plant pairs (84), O. sativa was used as the fixed species for comparisons in monocots, and A. thaliana and P. trichocarpa for dicot comparisons, depending on the possible comparisons for the given divergence age (Figure 2B).

Figure 2.

Phylogenetic tree with estimated divergence times used for comparative analysis. (A) Phylogenetic tree and estimated divergence times were calculated using BEAST (74). The following previously published divergence times were used: C. papaya–A. thaliana, 72 mya (76); A. thaliana–P. trichocarpa, 108 mya (76); P. trichocarpa–M. truncatula, 98 mya (76); P. persica–C. sativus, 90 mya (76); R. communis–P. trichocarpa, 81 mya (76); Z. mays–S. bicolor, 16 mya (77); O. sativa–B. distachyon, 50 mya (64) and O. sativa–S. bicolor, 60 mya (64). Divergence times of R. communis–M. esculenta and G. max–M. truncatula were estimated using BEAST with the rbcL, atpB and matK genes. The divergence time of the Euphorbiaceae pair (R. communis–M. esculenta) was more recent than that of the Fabaceae pair (G. max–M. truncatula). WGD events (lightning) are also marked (17). (B) Table of comparisons made between plants. Fixed plant species, depending on the divergence age being compared in the analysis, are circled.

Comparison of alignment tool performance in identifying CNS

The alignment performance of the different tools was compared using the similarity scores resulting from BLASTN (79), CHAOS (80), DIALIGN (81) and LAGAN (82), and the comparisons revealed DIALIGN and BLASTN to perform best (Figure 3, Supplementary Figures S3 and S4). The results of DIALIGN are comparable with those of BLASTN (79) and better than those of LAGAN (82) and CHAOS (80). For DIALIGN and BLASTN, alignments could be made for nearly all sequence pairs to calculate similarity, which is necessary for estimating the decay rate, whereas LAGAN and CHAOS did not successfully identify CNSs in many sequence comparisons and therefore no similarity score could be calculated (Supplementary Figure S3). BLASTN implements a local alignment strategy, whereas DIALIGN uses local alignments to seed a subsequent global alignment. Both of these tools have previously been used in other CNS investigations (6,8,85,86). A case study using a known heat shock element (74,75) shows that both DIALIGN and BLASTN were successful in identifying the motif TTCnnGAA, with DIALIGN being more sensitive (Figure 3). Results from DIALIGN are reported for the remainder of this investigation; complementary BLASTN results can be found in Supplementary Data.

Figure 3.

Upstream region conservation in dicots, monocots and algae. (A) The distributions of the similarity levels for orthologous upstream regions in all studied dicot plant pairs (red) and monocot plant pairs (blue) are shown for DIALIGN. Similarity scores of monocots have a lower distribution compared with those of dicots (Mann–Whitney U-test, P = 2.2e−16). (B) Distribution of similarity scores using DIALIGN with respect to divergence time and null model (grey) for dicots and (C) monocots. Median values are shown (center black bars). (D) Results from using BLASTN for dicots (red) and monocots (blue) are significantly different (Mann–Whitney U-test, P = 2.2e−16). (E) Similarity scores with respect to divergence time and null model (grey) for dicots and (F) monocots analyzed with BLASTN. Distributions that differ significantly from the null models constructed from randomly paired upstream regions are marked (asterisk, Kruskal–Wallis test, P = 0.01). Case study results of tool performance using well-characterized heat shock elements are also shown. DIALIGN identified the heat shock element in orthologous genes from five out of seven plants [(i) A. thaliana, (ii) A. lyrata, (iii) C. papaya, (iv) P. trichocarpa, (v) M. truncatula, (vi) G. max and (vii) R. communis], whereas BLASTN identified the motif in only three of the same seven plants.

Selection of reference point: transcription versus translation start sites

Because genome quality varies, alignments starting from two different reference points, the transcription start site (TSS) and the translation start site (ATG), were compared to determine possible effects on alignment performance. Using the ATG as the reference point slightly improved the similarity scores of the +500 bp upstream region in monocots (Figure 3C and Supplementary Figure S4.1.C). The majority of distances between TSS and ATG were calculated to be <500 bp (Supplementary Figure S6). For dicots, however, the choice of TSS or ATG had no effect on the results. The distributions of ATG–TSS distances in monocots and dicots are not significantly different (Supplementary Figure S6A), but genome quality seems to affect the distance between the annotated sites. In the Z. mays genome, a large fraction of the distances between TSS and ATG are >500 bp. Incidentally, the Z. mays genome also contains the highest amount of masked regions, indicative of poor sequence quality. These observations suggest that upstream regions in Z. mays are of poor quality compared with the other plant species used in this study. Alternatively, large distances between TSS and ATG may be caused by incorrect annotation of the first exon or by alternative TSSs (87). For example, some genes contain multiple TSSs, and the median distance between the two TSSs was observed to be 184 and 149 bp for A. thaliana and O. sativa, respectively (87). Finally, genome quality remains an issue while the annotation of these genomes is still improving. A substantial portion of the distances between the TSS and ATG sites could not be calculated due to the lack of positional information and, consequently, of annotation (Supplementary Figure S6). For example, previous investigations have shown that ∼66% of the TSSs in A. thaliana are annotated with available 5′-UTR positional information (88). Results from using both the TSS and ATG reference points are reported (Supplementary Data).

Monocot CNSs decay faster than those of dicots

Significantly lower similarity scores were observed in monocots compared with dicots using the Mann–Whitney U-test (Figure 3A and D, P = 2.2e−16). As an outgroup, three green algae species from the Mamiellaceae family, Ostreococcus lucimarinus, Ostreococcus tauri and Micromonas pusilla, were also included for alignment score comparisons using the ATG, as calculated with DIALIGN and BLASTN (Supplementary Figure S4.1A and S4.2A). TSS information for algae was not available, and therefore TSS-based comparisons were not conducted for them. The distributions of upstream region similarity scores for the three different plant groups were found to differ significantly (Mann–Whitney U-test, P = 2.2e−16). Furthermore, while similarity values between orthologs observed for dicots differ from those observed for green algae, both distribution curves have a similarity peak at ∼35%, compared with the monocot distribution, which peaked at ∼25% (Supplementary Figure S4.1A and S4.2A). Variation in selection pressures in the non-coding regions immediately flanking the transcripts was also investigated. A comparison of the downstream region (−500 bp) with the first two upstream 500 bp segments in monocots and dicots shows a difference in the ranking of regions with the highest distribution of similarity scores. In dicots, the first +500 bp of the upstream region had the highest similarity values, followed by the downstream region and the second 500 bp segment (+0.5–1 kb upstream). In contrast, monocots had better alignment scores in downstream regions, followed by the first and then the second 500 bp segment upstream of the TSS (Supplementary Figure S8, Kruskal–Wallis test, P = 0.001). The findings further suggest differences in regulatory genome evolution between dicots and monocots.

Decay of upstream regions with respect to divergence time

Comparative analysis of singleton orthologous upstream regions was conducted between plant genomes paired based on the estimated divergence time calculated with BEAST (see ‘Materials and Methods’ section). The results show that similarity scores between orthologous upstream regions of different genome pairings decrease with increasing divergence age (Figure 3). The similarity values become difficult to distinguish from the null model of randomly paired genes as genomes approach a divergence time of 100 mya ± 10 mya, and therefore comparative genomic identification of CNSs is challenged at this divergence limit. The observation holds true for both monocots and dicots. Gene duplication events leading to sub- or neofunctionalization may also affect the conservation of upstream regions, and therefore potential impacts were investigated for gene families containing multiple paralogs. Upstream regions of singleton orthologs were compared with those of gene families with multiple orthologs (Supplementary Figure S2). Two measures were used to discern whether differences in the similarity scores could be observed between these two groups of orthologs. First, the average similarity score calculated for alignments between all pairings within multiple orthologs was compared with the similarity values observed for singleton orthologs. The second measure used the best similarity score calculated among the compared pairings within multiple orthologs. Results show that while the distribution of best-hit similarity scores is significantly higher than that of singleton orthologs, the averages of pairwise comparisons between multiple orthologs are lower (Wilcoxon test, P = 2.2e−16).
The lower variance and average similarity values observed for multiple orthologs, compared with the best calculated similarity values within the set and with singletons, suggest that upstream regions of multiple orthologs may be subjected to less selection pressure to maintain conservation (Supplementary Figure S2).

Decay rate with respect to distance from the TSS

Since TEs can make up nearly half the genome of many species and reshape genomic structure (89), the mobility of these elements can present problems in detecting CNSs. Therefore, upstream regions were aligned with respect to the identified orthologous gene coding regions to account for possible displacement events. Furthermore, each 500 bp segment of the upstream regions was aligned across 5 kb of the orthologous upstream region to account for possible indel events resulting in shifted frames. The analysis was done reciprocally between all compared species with the exception of C. papaya, due to the lack of available upstream data beyond 500 bp. This cross-combinatorial design also addresses potential issues with respect to errors in TSS annotation, which can significantly affect alignment results (Figure 1). Scans of 500 bp segments across 5 kb show possible effects of TEs in repositioning putative CNSs. Direct conservation, wherein significantly aligned regions corresponded to the orthologous 500 bp segment without relocation, appeared mostly in the +500 bp region for all plant pairs, with detected occurrences decreasing with divergence time from ∼35% of A. thaliana–A. lyrata comparisons to ∼8% in A. thaliana–P. trichocarpa for dicots (Figure 4A). Monocots show the same trend, from ∼40% in Z. mays–S. bicolor to ∼20% for O. sativa–S. bicolor (Figure 4B). Additionally, the frequencies of direct conservation also decrease with distance from the TSS (Figure 4A and B). Similar results were observed when the ATG reference point or BLASTN was used instead (Supplementary Figure S5). The decay appears to plateau at ∼3 kb upstream for dicots (Figure 4A) and ∼2.5 kb for monocots (Figure 4B). As expected, the highest instances of significantly aligned regions can be detected with shorter divergence time and in closer proximity to the TSS (Kruskal–Wallis test, P < 0.001, Figure 4A and B).
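The notion of direct conservation used above can be made concrete with a small sketch. It assumes that, for one orthologous gene pair, a score matrix has already been computed in which entry `[i][j]` is the similarity score between segment `i` of species A and segment `j` of species B (segment 0 being closest to the reference point). The matrix layout, the function name and the tie-breaking rule (the first maximum wins) are illustrative assumptions, not details given in the source.

```python
def direct_conservation(score_matrix, threshold):
    """Return segment indices that are conserved in place.

    A segment i counts as directly conserved when its best-scoring partner
    in the orthologous upstream region is the segment at the same position
    and that score exceeds the significance threshold.
    """
    conserved = []
    for i, row in enumerate(score_matrix):
        best_j = max(range(len(row)), key=lambda j: row[j])
        if best_j == i and row[i] > threshold:
            conserved.append(i)
    return conserved


def relocated(score_matrix, threshold):
    """Return (i, j) pairs whose significant best match sits at another position."""
    pairs = []
    for i, row in enumerate(score_matrix):
        best_j = max(range(len(row)), key=lambda j: row[j])
        if best_j != i and row[best_j] > threshold:
            pairs.append((i, best_j))
    return pairs


# Toy 3 x 3 matrix with made-up scores and a made-up threshold.
toy_matrix = [
    [40, 10, 5],
    [12, 8, 36],
    [3, 9, 38],
]
toy_direct = direct_conservation(toy_matrix, threshold=35)
toy_moved = relocated(toy_matrix, threshold=35)
```

Counting how often index 0 appears in the directly conserved list across all gene pairs of a species pairing would give a percentage comparable in spirit to the values reported for the +500 bp segment.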

Figure 4.

Upstream region decay with respect to divergence time. Detected decay rate of upstream regions based on DIALIGN results. (A) Percentage of direct regional conservation is shown to decrease with divergence time and distance from the TSS for dicots and (B) monocots (Kruskal–Wallis test P = 0.001, transparent bars). A correction was applied to identify alignments with similarity levels >1.5σ from the mean of the null model (solid bars). Statistical difference between comparisons is noted in the inserted table (T = true and F = false). (C) The decay rate of identified putative CNSs was fitted with an exponential function for an estimation of the decay rate in dicots and (D) monocots. The exponential decay rate and the corresponding correlation coefficient ($R^2$) are shown in the inserted table.

The decay rates of dicot upstream regions can be fitted with an exponential function (Figure 4C) and are observed to be highest in close proximity to the TSS. The first +500 bp segment in the upstream region has the highest decay rate, with $d(t) = 33.1\,e^{-0.02t}$, followed by the second segment (+0.5–1.0 kb) with $d(t) = 20.8\,e^{-0.016t}$. Decay rates in monocots were also estimated (Figure 4D), although there is insufficient data for a proper fitting at this time. As soon as more genomes become available for monocot comparisons, this topic will have to be revisited.
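As an illustration of how such a curve can be estimated, the sketch below fits d(t) = a·exp(−b·t) by ordinary least squares on log-transformed values. Two assumptions should be noted: the fitted function is read here as a decaying exponential with a negative exponent, and the log-linear fitting procedure is a stand-in, since the source does not state which fitting method was used. The data points in the usage example are synthetic values generated from the reported dicot coefficients, not measurements from the study.

```python
import math


def fit_exponential_decay(times, values):
    """Fit d(t) = a * exp(-b * t) by linear regression of ln(d) on t.

    Returns (a, b). All values must be positive.
    """
    n = len(times)
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    log_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (l - log_mean) for t, l in zip(times, logs))
    slope = sxy / sxx
    intercept = log_mean - slope * t_mean
    return math.exp(intercept), -slope


def decay(t, a, b):
    """Evaluate d(t) = a * exp(-b * t)."""
    return a * math.exp(-b * t)


# Synthetic usage: divergence times in mya, values generated from the curve.
toy_times = [5.15, 50.0, 72.0, 108.0]
toy_values = [decay(t, 33.1, 0.02) for t in toy_times]
fitted_a, fitted_b = fit_exponential_decay(toy_times, toy_values)
```

A log-linear fit weights small values more heavily than a direct non-linear fit would, so the two approaches can give slightly different coefficients on noisy data.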

Distribution of putative TEs

The number of sequences with detected interspersed repeats that may be putative TEs varied significantly between the different plant species. Almost no interspersed repeats were detected in species such as P. persica and A. lyrata, while ∼70% of O. sativa upstream regions were found to contain TEs (Figure 5A). Statistically significant differences in the amount of TEs between monocots and dicots could not be detected. Interestingly, fewer interspersed repeats were detected in the first 500 bp upstream of the ATG compared with regions further upstream for all plants (Figure 5B). This result explains the strong differences in the decay rate between the first +500 bp and segments further upstream (Figure 4C and D). This analysis was conducted using the region upstream of the ATG instead of the TSS in order to also identify putative TEs between these two reference points. The findings support a potentially higher success rate of identifying embedded motifs in the first +500 bp segment for initial bioinformatics analysis. For Z. mays, a large portion of sequences was already masked due to sequence quality issues; these masked regions, in combination with detected interspersed repeats, resulted in the higher observed frequency of putative TEs in Z. mays compared with the other plants.
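The per-segment TE frequencies could be tallied along the following lines. This is a sketch under stated assumptions: upstream sequences are oriented so that index 0 is the base nearest the ATG, and repeats have been soft-masked to lowercase (as RepeatMasker's `-xsmall` option produces). The function names and the `count_n` switch are hypothetical; the switch mimics the Z. mays caveat by optionally counting pre-masked `N` runs as repeats.

```python
def is_masked(base, count_n=False):
    """True for soft-masked (lowercase) bases; optionally also for hard-masked N."""
    return base.islower() or (count_n and base == "N")

def percent_with_repeats(seqs, bin_size=500, n_bins=10, count_n=False):
    """Percentage of sequences with at least one masked base in each bin.

    Sequences shorter than a bin's start position are ignored for that bin.
    """
    result = []
    for b in range(n_bins):
        start, end = b * bin_size, (b + 1) * bin_size
        covered = [s[start:end] for s in seqs if len(s) > start]
        if not covered:
            result.append(0.0)
            continue
        hits = sum(1 for seg in covered
                   if any(is_masked(c, count_n) for c in seg))
        result.append(100.0 * hits / len(covered))
    return result
```

Comparing the output with and without `count_n=True` gives a rough sense of how much pre-existing masking, rather than detected repeats, contributes to the apparent TE frequency.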

Figure 5.

TE distribution in upstream regions. (A) Percentage of 5 kb upstream region sequences with identified putative TEs in each plant species. (B) Percentage of upstream regions with putative TEs, plotted with respect to upstream position from the ATG.

Other identified CNS features

Several features of CNSs were identified in this comparative study of plant regulatory genomes. First, the density of identified CNSs in the +500 bp region differs significantly between monocots and dicots (Supplementary Figure S7A). Monocots contain, on average, 10 CNSs in the +500 bp region upstream of the TSS, while dicots show a slightly higher average of 12 CNSs (Mann–Whitney U-test, P = 2.2e−16, Supplementary Figure S7A). However, using the ATG instead, an average density of 12 CNSs was calculated for both monocots and dicots, with algae showing an average of 14 CNSs (Supplementary Figure S7). Second, the length of CNSs was also investigated (Supplementary Figure S7B). Monocot CNSs are significantly shorter than those of dicots and algae (Kruskal–Wallis test, P = 0.001). The finding holds true for CNSs found in the +500 bp upstream regions of both the ATG and the TSS. The length of CNSs is observed to increase with decreasing divergence time, as expected (Supplementary Figure S7B). Finally, the nucleotide composition of the first 500 bp segments was also investigated and shows significant differences in distribution between dicots, monocots and algae (ANOVA, P = 2.2e−16). A higher AT content was observed in dicots (66%) than in algae (32%). A more balanced nucleotide composition is observed for monocots, at 51% AT. It should be noted that the two groups with a skewed AT/GC content, dicots and algae, were also found to have higher similarity levels in aligned CNSs.
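For reference, the AT content of an upstream segment can be computed as in the following minimal sketch. Excluding ambiguous or masked bases (anything other than A, C, G or T) from the denominator is an assumption made here, not a detail reported in the text.

```python
def at_content(seq):
    """Percentage of A/T among unambiguous bases (A, C, G, T) in a sequence."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    at = sum(1 for b in counted if b in "AT")
    return 100.0 * at / len(counted)

def mean_at_content(seqs):
    """Average AT percentage over a collection of sequences."""
    values = [at_content(s) for s in seqs]
    return sum(values) / len(values) if values else 0.0
```

Applied to the first 500 bp segment of every gene in a plant group, `mean_at_content` would give a group-level value comparable in kind to the percentages quoted above.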

DISCUSSION

The increasing availability of plant genomes allows us to conduct comparative analyses between species to identify conserved features of regulatory genomes. However, previous research efforts have shown that the identification of CNSs and, more importantly, regulatory motifs that form binding sites for transcription factors is difficult due to properties of plant CNSs. Identified motifs are often short, averaging from 20 to 30 bp in length (10), and evolutionary mechanisms such as gene duplication (90,91) and mobile TEs (27,89,92–94) resulting in high diversification are more frequently observed (21). As such, exploring the limits of when comparative genomics is useful for bioinformatics investigations is needed to advance future efforts to understand the regulatory genome. First, the evolutionary splits between the analyzed plant species were successfully reconstructed to estimate the divergence ages necessary for further phylogenomic comparison of upstream regions. Using this constructed time tree (Figure 1), systematic pairwise comparisons between two plant species based on the divergence time were conducted to align orthologous upstream regions. The limits of bioinformatics detection for putative CNSs were explored, although the functional validity of these identified CNSs needs to be further experimentally substantiated. Current probabilistic algorithms depend on the premise that orthologous sequences must share similarity in order to detect signals of putative CNSs, which are used as proxies to subsequently identify possibly embedded cis-regulatory motifs. Both BLASTN and DIALIGN showed similar results (Figure 3), indicating that the presented results are reliable despite the different alignment algorithms.
Sensitivity to weakly conserved sequences, which were missed by both CHAOS and LAGAN (Supplementary Figure S4), was important for this study in order to estimate the decay rate of CNSs, although comparable performance with BLASTN and DIALIGN can be achieved in regions of high similarity. Therefore, the use of either BLASTN or DIALIGN is encouraged for investigations requiring alignments of non-coding regions between more distantly related species. Common among all plant groups, however, is that the similarity of orthologous upstream regions is significantly different from that of randomly paired upstream regions up to ∼100 mya (Figure 3). Beyond this age, the similarity between aligned CNSs is difficult to distinguish from the null model. The findings suggest that the divergence age between compared genomes should be considered to increase the success of finding conserved motifs. The boundary of 100 mya may be a useful divergence limit for identifying statistically significant CNSs when using probabilistic mining strategies. Other null models or strategies should be considered for more distant comparisons, and the findings encourage further bioinformatics development to consider the biological features identified in this investigation. Conversely, the divergence time between sister species containing non-saturated substitution patterns, such as between A. thaliana and A. lyrata, should also not be neglected (95). CNSs identified as proxies for cis-regulatory elements between recently diverged species may not reflect true functional conservation. However, when multiple and more distantly related sequences are included, the impact of non-saturated substitutions between more closely related species is reduced.

The effects of duplication events on shared similarity levels between orthologous upstream regions were also detected. For example, although the divergence age of O. sativa–Z. mays is similar to that of O. sativa–S. bicolor, the findings show orthologous upstream regions from O. sativa–Z. mays to have lower scores. This may be a consequence of an independent whole genome duplication (WGD) occurring at 11.4 mya followed by gene deletion events in the lineage of Z. mays (63). Similarly, G. max and M. truncatula share a common WGD followed by additional independent WGDs (23). Their divergence age is estimated to be similar to that of R. communis–M. esculenta; however, their orthologous upstream regions show lower alignment scores than those of the two Euphorbiaceae. WGDs are frequently observed in plant evolution (21,23,90,91) and must be considered in phylogenomic comparisons.

The distances calculated between the TSS and ATG using the currently available annotations are mostly <500 bp (Supplementary Figure S6). Monocots show slight improvements in alignments when the ATG is used as the reference point instead of the TSS (Figure 3 and Supplementary Figure S4). This may be caused by the large distances between the TSS and ATG in Z. mays due to the presence of alternative open reading frames. For example, an average of two TSSs with a median distance of 149 bp has been observed for each locus in O. sativa (87). Unlike transcription-regulating motifs, which are mostly found upstream of the TSS to activate gene expression, motifs embedded in the 5′-UTR between the TSS and ATG have a higher impact on modulating the abundance of expressed transcripts (96,97). Since this study is dependent on the quality of genome annotation, it should be noted that the boundaries of the 5′-UTR and 3′-UTR, and therefore the TSS, are not always known (88). The analyses were performed for upstream sequences using both positional reference points, TSS and ATG, to minimize the effects of genome quality on the interpretation of our analysis. Both analyses lead to the same conclusions and are available for review (Supplementary Data).
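The comparison against a null model of randomly paired upstream regions, together with the >1.5σ correction mentioned in the Figure 4 legend, can be sketched as follows. The score lists are placeholders for alignment similarity scores, the function names are hypothetical, and the use of the sample standard deviation is an assumption; the study's actual scoring and null construction are not reproduced here.

```python
import statistics

def null_threshold(null_scores, n_sigma=1.5):
    """Cutoff at mean + n_sigma * sample standard deviation of the null scores."""
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return mu + n_sigma * sigma

def significant_alignments(ortholog_scores, null_scores, n_sigma=1.5):
    """Return the orthologous-pair scores that exceed the null-model cutoff."""
    cutoff = null_threshold(null_scores, n_sigma)
    return [s for s in ortholog_scores if s > cutoff]

def percent_significant(ortholog_scores, null_scores, n_sigma=1.5):
    """Percentage of orthologous pairs scoring above the null-model cutoff."""
    if not ortholog_scores:
        return 0.0
    kept = significant_alignments(ortholog_scores, null_scores, n_sigma)
    return 100.0 * len(kept) / len(ortholog_scores)
```

As the divergence time between two species grows, the orthologous score distribution would be expected to approach the null distribution, so that `percent_significant` falls toward the small fraction expected by chance.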
The region of highest conservation was identified in the +0.5 kb upstream of the TSS, as expected from previous findings, suggesting that the immediate proximity has a higher propensity to host modules of regulatory elements (Figure 4A and B) (14,98,99). Moreover, the frequency of TEs is lower in this region compared with analyzed segments further upstream, which has major implications for the effects of indels on upstream region conservation over the analyzed 5 kb (Figure 5B). The amount of conserved regions is observed to decrease with increasing distance from the TSS and to plateau between +2.5 and 3 kb upstream. CNS conservation appears to be disrupted by indels such as TEs, which are found, for example, in over 70% of upstream regions in rice. The amount of identified TEs varies strongly between the different species, in agreement with previous observations (100–102). TEs are partly responsible for genome size variation in closely related species, and TE amplification, as well as gene duplication, has been shown to contribute to high C-values (101). Nevertheless, the results should be interpreted cautiously because of bias that may be introduced by the TE library used in this study. While all species were compared with the same curated angiosperm library implemented in RepeatMasker, the library contains more sequences for the well-studied model plant O. sativa than for recently sequenced plants such as P. persica and R. communis.

The decay rates estimated with a fitted exponential function are higher for the +500 bp segment than for upstream regions more distant from the TSS. The fitting suggests that most of the observed degeneration occurred immediately following the divergence, since <10% of the sequences retained good alignment scores in more distant comparisons. Furthermore, the data also suggest that the largest degeneration is found closest to the TSS.
This may be a consequence of the larger number of identified CNSs compared with other regions (14,103), resulting in a higher probability of observing a phenotypic change if possibly embedded regulatory motifs are disrupted. Furthermore, as mentioned earlier, recently diverged sister species are not yet saturated with substitutions, and therefore a high decay rate of putative CNSs that have yet to be subjected to strong functional selection pressure is not unexpected.

Differences can be clearly observed between monocots and dicots, with monocots showing lower alignment scores in the upstream region (Figure 3). Previous studies have shown that monocot chloroplast genomes have a higher evolutionary rate in the coding regions compared with dicots (71). The effects are also evident in the analysis presented here for non-coding regions of the nuclear genome. This suggests that the selection of genomes for comparisons between different plant groups should be considered carefully. However, it should be noted that the currently available monocot genomes are all from grasses, which may bias the conclusions. Earlier studies have shown a higher evolutionary rate in grasses compared with other monocots (73). Differences in CNSs associated with non-grass monocots will need to be investigated when these genomes become available for analysis. Monocots additionally show the highest alignment scores in the downstream region (−500 bp), followed by the first and second upstream segments from the TSS. In dicots, by contrast, the first 500 bp upstream segment has the highest distribution of similarity scores, followed by the downstream region, whose alignment score distribution is similar to that of the second 500 bp upstream segment. This may be indicative of three possible scenarios. First, the selection pressure to conserve CNSs in the 0.5 kb upstream and downstream regions may be similar in monocots.
Second, the selection pressure in monocots on the +500 bp upstream region may not be as strong as on the corresponding region in dicots. Third, CNSs in monocots may have a higher decay rate, and therefore significant alignments between orthologous CNSs are not retained over long periods of time.

Additional plant group-specific features have also been observed with respect to the density, length and nucleotide composition of CNSs. Dicots contain an overall higher density of detected CNSs per 500 bp when compared with monocots (Figure 4). The majority of identified CNSs have a length between 8 and 40 bp, suggesting that CNS regulatory motif length may be static. Monocot CNSs appear to be significantly shorter than dicot and algal CNSs; however, this finding may be an artifact resulting from the lower CNS coverage in the 500 bp promoter of monocots (Supplementary Figure S7). Interestingly, this distribution does not change significantly with divergence age. These features provide additional support for the suggestion that different regulatory strategies are employed by the respective plant groups. Furthermore, previous studies have shown that some genes corresponding to the GO term stress response have a particularly high number of CNSs that can be located up to several hundred base pairs, suggesting that gene function should be considered in understanding CNS features (11). In Arabidopsis, 252 genes, called bigfoot genes, have been identified as highly enriched in CNSs (18). Considering that 18 000 orthologous pairs between A. thaliana and A. lyrata were compared in this investigation, these bigfoot genes comprise only a minor fraction and therefore have a low impact on the results of this investigation.

Finally, the nucleotide composition differs significantly between plant groups, with a preference for AT content in dicots and GC content in algae. A higher GC content in rice compared with A. thaliana has been found previously.
Interestingly, although different nucleotide compositions are observed in the two species, an AT enrichment is nevertheless observed in the first +500 bp upstream region of the TSS for both species (87). This may be a result of sequence signature effects associated with the TATA-box. A previous comparison between A. thaliana and A. lyrata found that mutation rates vary among genomic regions as a function of base composition and are largely dependent on GC content (95). GC-rich regions have an increased transition:transversion ratio, which may be one contributing factor to the higher observed decay rate (104). The low number of detected CNSs in monocots compared with dicots could be a result of differences in the decay rate associated with regions of higher GC content, which was observed to be 49% in monocots and 33% in dicots. However, this interpretation cannot clearly explain the observations in algae, where a higher GC content is also observed but the distribution of identified CNSs does not differ significantly from that of dicots.
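The transition:transversion ratio referred to above can be counted from a pairwise alignment as in this minimal sketch. It assumes two aligned sequences of equal length with gaps written as `-`, and it skips gaps and ambiguous bases; the function names are hypothetical and no correction for multiple substitutions is applied.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    Columns with gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv

def ts_tv_ratio(seq_a, seq_b):
    """Transition:transversion ratio, or None if no transversions are observed."""
    ts, tv = count_ts_tv(seq_a, seq_b)
    return ts / tv if tv else None
```

Computing this ratio separately for GC-rich and AT-rich aligned upstream segments would be one way to explore the relationship between base composition and substitution pattern discussed above.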

CONCLUSIONS

The results of this systematic phylogenetic approach to understanding properties of plant CNSs highlight specific differences and important considerations to guide future research efforts to understand the regulatory component of genomes. First, the selection of plant group and species used for comparative genomics to identify novel regulatory motifs must be considered. We recommend making comparisons within monocots and within dicots separately, since different evolutionary rates have been observed for each group. Second, we estimate ∼100 mya to be the divergence limit up to which plant upstream regions can be compared. Beyond this time, significant similarities between putative orthologous CNSs cannot be distinguished from alignments between randomly selected upstream regions. The findings do not imply that CNSs cannot be detected beyond a divergence age of 100 mya, since non-coding regions showing no conservation in sequence can nonetheless be conserved in function (8). Instead, the analysis suggests that alternative null models and strategies that incorporate additional information for a large-scale global search will be needed. Current algorithms that combine multiple features, such as comparative genomics with gene expression data (105,106) and evolutionary models (53,107), to distinguish functional CNSs hold promise. Finally, although we are not the first to investigate the limits of comparative genomics to study CNSs (51), our investigation has addressed these fundamental issues in a systematic, generalized manner with respect to divergence time, plant groups, proximity to the coding region and various other features. These results will contribute to our fundamental knowledge of CNS identification and guide future research efforts and algorithm development to understand the transcriptional regulation of genes in systems biology.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Human Frontier Science Program (R6P0033/2006-C); Alexander von Humboldt Foundation. Funding for open access charge: Alexander von Humboldt Foundation, Deutsche Forschungsgemeinschaft, and the Open Access Publication fund of University of Muenster.

Conflict of interest statement. None declared.

99 in total

1. DNA sequence evidence for the segmental allotetraploid origin of maize.

Authors: B S Gaut; J F Doebley
Journal: Proc Natl Acad Sci U S A Date: 1997-06-24 Impact factor: 11.205

2. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou
Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043

3. Evolutionary analysis of regulatory sequences (EARS) in plants.

Authors: Emma Picot; Peter Krusche; Alexander Tiskin; Isabelle Carré; Sascha Ott
Journal: Plant J Date: 2010-09-16 Impact factor: 6.417

4. Divergence in expression between duplicated genes in Arabidopsis.

Authors: Eric W Ganko; Blake C Meyers; Todd J Vision
Journal: Mol Biol Evol Date: 2007-08-01 Impact factor: 16.240

Review 5. Sequencing the genespaces of Medicago truncatula and Lotus japonicus.

Authors: Nevin D Young; Steven B Cannon; Shusei Sato; Dongjin Kim; Douglas R Cook; Chris D Town; Bruce A Roe; Satoshi Tabata
Journal: Plant Physiol Date: 2005-04 Impact factor: 8.340

Review 6. Conserved noncoding sequences (CNSs) in higher plants.

Authors: Michael Freeling; Shabarinath Subramaniam
Journal: Curr Opin Plant Biol Date: 2009-02-25 Impact factor: 7.834

7. Many or most genes in Arabidopsis transposed after the origin of the order Brassicales.

Authors: Michael Freeling; Eric Lyons; Brent Pedersen; Maqsudul Alam; Ray Ming; Damon Lisch
Journal: Genome Res Date: 2008-10-03 Impact factor: 9.043

8. The contribution of transposable elements to expressed coding sequence in Arabidopsis thaliana.

Authors: Steven Lockton; Brandon S Gaut
Journal: J Mol Evol Date: 2009-01-03 Impact factor: 2.395

9. Low nucleotide diversity in man.

Authors: W H Li; L A Sadler
Journal: Genetics Date: 1991-10 Impact factor: 4.562

10. AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors.

Authors: Ramana V Davuluri; Hao Sun; Saranyan K Palaniswamy; Nicole Matthews; Carlos Molina; Mike Kurtz; Erich Grotewold
Journal: BMC Bioinformatics Date: 2003-06-23 Impact factor: 3.169
