
The word landscape of the non-coding segments of the Arabidopsis thaliana genome.

Jens Lichtenberg, Alper Yilmaz, Joshua D Welch, Kyle Kurz, Xiaoyu Liang, Frank Drews, Klaus Ecker, Stephen S Lee, Matt Geisler, Erich Grotewold, Lonnie R Welch.

Abstract

BACKGROUND: Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.
RESULTS: Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.
CONCLUSION: Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.

Year:  2009        PMID: 19814816      PMCID: PMC2770528          DOI: 10.1186/1471-2164-10-463

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

All genomes are composed of nucleotides, which are represented abstractly as letters (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T)). Strings of such letters can be conceptualized as words, which provide the blueprints for organisms. Each word is found a specific number of times in a particular genome. Note that the expected frequency of a word is inversely related to the word's length. Some nucleotides appear more frequently than others (e.g. A/T in Arabidopsis), giving each genome a distinct (G+C)% content and biasing expected word frequencies. Higher order frequencies (dinucleotide and trinucleotide) also show distinct biases beyond those expected for single nucleotide frequencies [1]. Distinct selective pressures shape words positioned in different genomic regions. For example, a word in an open reading frame (ORF) has a direct influence on the primary amino acid sequence of a protein and hence is under strong selective pressure. In contrast, words in introns are likely to be under more relaxed selective constraints, unless they are important for gene functions, for example by providing docking sites for splicing factors [2] or for enzymes involved in the post-transcriptional processing of a transcript [3,4]. The gene sections corresponding to the 5' and 3' untranslated regions (5'UTRs and 3'UTRs, respectively) are also likely to be under less selective constraints than the ORFs, yet signatures of strong selection in UTRs have been described (reviewed in [5]). The constant formation of DNA microsatellites through slippage by the replication machinery, and the action of viruses and transposons, also complicate the word landscape, especially in regions with lower selective constraints (such as introns, UTRs and intergenic regions) [6,7]. This manuscript describes the results of a genome-wide analysis to discover putative regulatory words. 
Within this context, we define the cis-regulatory apparatus as all the DNA segments that are located proximal to a gene, and that also contribute to the gene's expression. It is the function of transcription factors, miRNAs, or other molecules that interact with DNA, to interpret the words (sequence code) hardwired in the cis-regulatory apparatus and to 'execute' them, thereby generating signals to the basal transcription machinery that result in changes to the rate of RNA production by the corresponding DNA-dependent RNA polymerases. When located upstream of the transcription start site (TSS), the cis-regulatory apparatus is often referred to as the promoter of a gene. Promoters are typically divided into three regions: core, proximal and distal. The core promoter, a region at location [+1;-100] relative to the TSS, performs a central role in the formation of pre-initiation transcriptional complexes. Immediately upstream of the core promoter is the proximal promoter, which is located at position [-101;-1000] relative to the TSS and serves as a docking site for transcription factors. The distal promoter is located at [-1001;-3000] relative to the TSS and contains the regulatory elements that are commonly known as enhancers and silencers. The participation of a particular DNA segment in the regulation of gene expression can only be demonstrated experimentally. Thus, understanding the rules at play in deciphering the transcriptional regulatory code remains one of the most significant challenges in biology today. Although most regulatory elements are present in the UTRs and upstream regions, due to their proximity to the TSS, studies have shown the presence of regulatory elements in introns, and, to a much lesser extent, in coding regions [2,8-16]. Building on this knowledge, a segment-based analysis was performed that is focused on non-coding regions within the open reading frames (i.e., introns) and flanking non-coding regions (i.e., UTRs and upstream regions). 
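The three promoter regions defined above can be illustrated with a minimal Python sketch. The function name `split_promoter` is hypothetical, and the sketch assumes that the upstream sequence is written 5' to 3' with its last base at position -1 relative to the TSS; the source does not state how the boundary base at +1 is handled.

```python
def split_promoter(upstream: str) -> dict:
    """Split an upstream sequence into core, proximal and distal promoter parts.

    Assumption: `upstream` is written 5'->3' and its last base sits at
    position -1 relative to the TSS, so slices are taken from the end.
    """
    n = len(upstream)
    core_start = max(n - 100, 0)    # positions -100..-1
    prox_start = max(n - 1000, 0)   # positions -1000..-101
    dist_start = max(n - 3000, 0)   # positions -3000..-1001
    return {
        "core": upstream[core_start:],
        "proximal": upstream[prox_start:core_start],
        "distal": upstream[dist_start:prox_start],
    }


# Toy input: 2000 A's (distal), 900 C's (proximal), 100 G's (core).
parts = split_promoter("A" * 2000 + "C" * 900 + "G" * 100)
print({name: len(seq) for name, seq in parts.items()})
```

For a full 3,000 bp upstream sequence this gives parts of 100, 900 and 2,000 bp; shorter inputs simply yield shorter or empty proximal and distal parts.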
The coding regions were omitted from this analysis because they are under other selection pressures corresponding to the amino acid sequences of the proteins they produce, and thus they are subjected to biases other than regulation. Arabidopsis thaliana provides an ideal reference organism to investigate the word landscape of a plant genome, and to relate said landscape to important biological features. The Arabidopsis genome consists of 125 Mbp arranged into five chromosomes [17,18]. The genome is well annotated and regions corresponding to introns, 3'UTRs, 5'UTRs, and intergenic genomic spaces are all available from The Arabidopsis Information Resource (TAIR) [19]. Many studies have characterized Arabidopsis DNA sequence motifs that participate in the regulation of particular genes (e.g., [20-23]), and public databases such as AthaMap [24] and AGRIS [25] provide comprehensive collections of cis-regulatory elements likely to participate in the regulation of gene expression. However, a systematic analysis of all the words present in the Arabidopsis genome is still lacking. To analyze the different segments of the Arabidopsis genome, an enumerative word discovery approach was used to detect statistically overrepresented words. Similar approaches have been successfully applied over the last decade in the area of motif discovery [26-37]. In a 2005 study, Tompa et al. [38] showed that enumerative methods outperformed heuristic methods in many cases. They are particularly applicable in this research, because they allow the study of the entire 'word landscape' of a genomic data set. Our approach scans the sequences and produces a set of words and word frequencies. This information is employed by a Markov model to compute expected word frequencies.
Words with unexpectedly high frequencies are putative functional elements, and thus they are further characterized by comparing word frequencies and positions to gene induction or suppression using the method of Geisler et al. [39]. Additionally, clusters of similar words are formed and used to create motifs for putative transcription factor binding sites. Sequences that contain the same functional elements are grouped together into putative 'nodes' of regulatory networks. Words that co-occur often are identified as putative transcription factor binding modules.
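The enumerate-then-model idea described above can be sketched in Python as follows. This is an illustrative sketch only: the helper names `count_words` and `expected_count` are hypothetical, and the Markov order (1 by default) and the estimation of transition probabilities from the same sequences are assumptions, since the text here does not specify the order or training data of the model actually used.

```python
from collections import Counter


def count_words(seqs, k):
    """Count every overlapping k-letter word in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def expected_count(word, seqs, order=1):
    """Expected occurrences of `word` under an order-`order` Markov chain.

    Transition probabilities are estimated from `seqs` themselves
    (an assumption made for this sketch). Requires len(word) > order.
    """
    ctx = count_words(seqs, order)       # counts of contexts
    ext = count_words(seqs, order + 1)   # counts of context + next letter
    total_ctx = sum(ctx.values())
    if total_ctx == 0:
        return 0.0
    prob = ctx[word[:order]] / total_ctx  # probability of the starting context
    for i in range(order, len(word)):
        context = word[i - order:i]
        denom = sum(ext[context + base] for base in "ACGT")
        if denom == 0:
            return 0.0
        prob *= ext[context + word[i]] / denom
    # Number of positions at which a word of this length could start.
    positions = sum(max(len(s) - len(word) + 1, 0) for s in seqs)
    return prob * positions


toy = ["ACGTACGT"]
print(count_words(toy, 2)["AC"])       # observed count of "AC"
print(expected_count("ACG", toy))      # expected count of "ACG"
```

On the toy sequence the observed count of "AC" is 2 and the expected count of "ACG" is 1.5; real use would pass all sequences of one genomic segment and a word length of 8.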

Results and Discussion

Distribution of 8-letter words in the Arabidopsis genome

To determine the word distributions in the segments of the Arabidopsis thaliana genome that contribute to the cis-regulatory apparatus, a comprehensive analysis of 8-letter words in the entire genome was conducted and compared with segments corresponding to non-coding regions. Words of length 6-16 were examined and the complete results have been made available via AGRIS [25,40]. This article reports findings for words of length eight because they correspond to the typical DNA sequence length recognized by transcription factors (usually 6-8 bp [38,41]). Furthermore, 8-mers are long enough to offer sufficient diversity of word choices (4^8 = 65,536) to reduce false positive results, while retaining sufficient word counts to be statistically informative. The genome was sub-divided into segments comprising the 3'UTRs, 5'UTRs, promoters and introns (Table 1). The promoter segment was further dissected into the core promoter, corresponding to [-100; +1]; proximal promoter [-1000; -101]; and distal promoter [-3000; -1001]. The general properties of the six genome segments are shown in Table 1. As in a similar study, which was aimed at discovering regulatory elements involved in human DNA-repair pathways [26], word-based genomic signatures were created for each segment. Specifically, the following were identified for each of the genome segments: (1) the set of overrepresented words (signature words), (2) words missing from the sequences (unwords), (3) word-based clusters, (4) word co-occurrences and (5) functional categorizations of the signature words. The results are detailed in the remainder of this section.
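As a small illustration of the word space and of item (2), the following Python sketch enumerates all possible words of a given length and lists those absent from a set of sequences. The function names `all_words` and `unwords` are hypothetical, and the sketch assumes sequences contain only the letters A, C, G and T.

```python
from itertools import product


def all_words(k, alphabet="ACGT"):
    """Yield every possible k-letter word over the alphabet (4**k words)."""
    return ("".join(letters) for letters in product(alphabet, repeat=k))


def unwords(seqs, k):
    """Return the k-letter words that never occur in any of the sequences."""
    seen = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            seen.add(s[i:i + k])
    return [w for w in all_words(k) if w not in seen]


print(sum(1 for _ in all_words(8)))   # size of the 8-letter word space
print(len(unwords(["AACC"], 2)))      # 2-letter words missing from a toy sequence
```

The first line prints 65536, the number of distinct 8-letter words; the second prints 13, because the toy sequence contains only three of the sixteen possible 2-letter words.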
Table 1

Segment characteristics for Arabidopsis thaliana

| Data Set | # Sequences/Chromosomes | Min. Seq. Length | Max. Seq. Length | Mean Seq. Length | Std. Deviation | Total Nucleotides | Genome Percentage |
|---|---|---|---|---|---|---|---|
| 3' UTRs | 19,771 | 8 | 3,118 | 228.134 | 152.106 | 4,510,410 | 3.78 |
| 5' UTRs | 18,585 | 8 | 3,214 | 140.088 | 130.288 | 2,603,531 | 2.18 |
| Introns | 118,319 | 8 | 10,234 | 164.446 | 178.484 | 19,457,029 | 16.32 |
| Core Promoters | 27,023 | 100 | 100 | 100 | 0 | 2,702,300 | 2.27 |
| Proximal Promoters | 27,023 | 900 | 900 | 900 | 0 | 24,320,700 | 20.41 |
| Distal Promoters | 27,025 | 1,371 | 2,000 | 1,999.96 | 5.01105 | 54,048,839 | 45.35 |
| Genome-wide | 5 | 18,585,000 | 30,432,600 | 23,837,300 | 4,432,780 | 119,186,497 | 100.00 |

Overview of the characteristic properties of the non-coding segments and the entire genome of Arabidopsis thaliana. The number of sequences refers to the number of unique sequences in the specific segment. In the case of the entire genome, the sequences are the complete chromosomes. Min. Seq. Length refers to the length of the shortest sequence in the set, while Max. Seq. Length refers to the length of the longest sequence in the set. Mean Seq. Length provides the average length of the sequences in the set, while Std. Deviation describes the deviation from said mean. Finally, Total Nucleotides describes the total number of nucleotides contained within the sequences of the set, and Genome Percentage gives the nucleotide count of the set as a percentage of the entire genome.

Some sequences in the segments are shorter than 8 nucleotides. Since these sequences cannot harbour any putative regulatory elements in the context of this study, the sequences are removed from the table. For the 3'UTRs this results in a total of 179 nt being omitted, for 5'UTRs 1207 nt and for introns 26 nt. They are, however, included in the calculation of the background for the different segments since they contribute to the overall nucleotide distribution.
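The columns of Table 1 can be derived from a list of sequence lengths, as in the Python sketch below. The function name `segment_summary` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def segment_summary(lengths, genome_total):
    """Summarize one segment from its sequence lengths (in nucleotides)."""
    total = sum(lengths)
    return {
        "sequences": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": total / len(lengths),
        "std": statistics.pstdev(lengths),  # population std. dev. (assumption)
        "total": total,
        "genome_pct": 100.0 * total / genome_total,
    }


# Toy example with three sequences and a 300 nt "genome".
print(segment_summary([8, 10, 12], 300))

# Genome Percentage for the 3' UTRs, using the totals reported in Table 1.
print(round(100.0 * 4_510_410 / 119_186_497, 2))
```

The last line reproduces the 3.78% reported for the 3' UTRs from the total nucleotide counts in the table.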


Overrepresented Words

All 8-letter words present in the segments were identified and scored using observed:expected frequency ratios (O/E). Specifically, each word was scored and ranked by using the function S*ln(S/E), where S is the number of sequences that contained the word, 'ln' is the natural logarithm, and E is the number of sequences in which the word was expected to occur. Words discovered in the whole genome were analyzed using the O*ln(O/E) score, with O referring to the overall occurrence of a word across the entire genome and E representing the expected occurrence of that word. The 25 top-ranked words, corresponding to ~0.04% of all possible words and also to ~0.04% of the discovered words, were taken as an exemplary subset of the results and further examined (see Tables 2, 3, 4, 5, 6, 7 and 8 and Additional files 1, 2, 3, 4, 5, 6 and 7).
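The scoring function can be written as a short Python sketch. The names `word_score` and `rank_words` are hypothetical, and returning 0.0 for non-positive inputs is a choice made for this sketch rather than something stated in the text.

```python
import math


def word_score(observed, expected):
    """Return observed * ln(observed / expected).

    Works for both S*ln(S/E) (sequence counts) and O*ln(O/E) (occurrences).
    Non-positive inputs give 0.0 (a convention chosen for this sketch).
    """
    if observed <= 0 or expected <= 0:
        return 0.0
    return observed * math.log(observed / expected)


def rank_words(table):
    """Rank words by score, highest first; `table` maps word -> (S, E)."""
    return sorted(table, key=lambda w: word_score(*table[w]), reverse=True)


# S and ES values of two 3'UTR words as listed in Table 2.
utr3 = {"TTTTTCTT": (2171, 1981.63), "TTTTTGTT": (2264, 2066.82)}
print(rank_words(utr3))
print(round(word_score(2264, 2066.82), 1))
```

With the Table 2 values for TTTTTGTT (S = 2264, ES = 2066.82) the score comes out at about 206.3, in line with the SlnSES value listed for that word.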
Table 2

The top 25 words in 3'UTRs

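| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TTTTTGTT | 2264 | 2066.82 | 2488 | 2306.04 | 206.297 | 2279 | 2066.89 | 2501 | 2331.04 | 222.643 | AACAAAAA | 40 | No | 9.38E-05 |
| TTTTTCTT | 2171 | 1981.63 | 2404 | 2203.7 | 198.149 | 2183 | 1978.5 | 2427 | 2222.83 | 214.723 | AAGAAAAA | 49 | No | 1.34E-05 |
| TTTTTTGG | 998 | 824.458 | 1046 | 877.255 | 190.646 | 1003 | 831.208 | 1053 | 888.417 | 188.434 | CCAAAAAA | 651 | No | 1.71E-08 |
| ATTTTGTA | 732 | 583.938 | 752 | 615.741 | 165.421 | 738 | 599.956 | 759 | 634.768 | 152.831 | TACAAAAT | 37 | No | 6.00E-08 |
| TAATTTTT | 787 | 642.133 | 810 | 678.585 | 160.101 | 797 | 646.36 | 821 | 685.263 | 166.97 | AAAAATTA | 164 | No | 5.24E-07 |
| ATGTTTTA | 589 | 469.818 | 601 | 493.292 | 133.161 | 610 | 486.404 | 624 | 512.055 | 138.116 | TAAAACAT | 284 | No | 1.48E-06 |
| TTTGTTTT | 2517 | 2402.46 | 2847 | 2715.8 | 117.227 | 2555 | 2406.15 | 2897 | 2753.88 | 153.362 | AAAACAAA | 1963 | No | 0.006347 |
| GTTTTTGA | 491 | 390.189 | 504 | 408.466 | 112.838 | 512 | 407.532 | 527 | 427.529 | 116.841 | TCAAAAAC | 5031 | No | 2.76E-06 |
| AAATTTTG | 588 | 491.471 | 603 | 516.445 | 105.443 | 604 | 504.212 | 621 | 531.22 | 109.069 | CAAAATTT | 376 | No | 0.00011 |
| ATTTTTTA | 482 | 387.674 | 498 | 405.795 | 104.97 | 492 | 406.16 | 510 | 426.064 | 94.3317 | TAAAAAAT | 100 | No | 5.33E-06 |
| ATTTTTCA | 446 | 354.812 | 450 | 370.941 | 102.014 | 453 | 365.873 | 457 | 383.118 | 96.7633 | TGAAAAAT | 170 | No | 3.83E-05 |
| TGTTTTGT | 1227 | 1133.19 | 1326 | 1219.91 | 97.5897 | 1255 | 1162.02 | 1359 | 1260.07 | 96.6082 | ACAAAACA | 659 | No | 0.001413 |
| ATAAAAAT | 564 | 474.529 | 580 | 498.326 | 97.4203 | 566 | 480.088 | 581 | 505.265 | 93.1776 | ATTTTTAT | 27 | No | 0.000192 |
| TTTTTTCT | 1721 | 1628.11 | 1839 | 1786.09 | 95.4882 | 1722 | 1625.78 | 1847 | 1798.84 | 99.0176 | AGAAAAAA | 106 | No | 0.107802 |
| AAAAATTG | 397 | 312.488 | 400 | 326.178 | 95.0296 | 414 | 323.794 | 419 | 338.423 | 101.744 | CAATTTTT | 66 | No | 4.26E-05 |
| TATAATAT | 505 | 419.081 | 519 | 439.185 | 94.1802 | 514 | 429.108 | 530 | 450.594 | 92.7844 | ATATTATA | 275 | No | 0.000114 |
| CTCTGTTT | 763 | 674.497 | 814 | 713.654 | 94.0706 | 796 | 706.86 | 852 | 751.4 | 94.5386 | AAACAGAG | 227 | No | 0.000125 |
| TTTTTAAT | 897 | 808.297 | 929 | 859.536 | 93.4009 | 905 | 811.646 | 942 | 866.766 | 98.5274 | ATTAAAAA | 95 | No | 0.009964 |
| TTCTTTTT | 1884 | 1795.18 | 2075 | 1982.05 | 90.9811 | 1879 | 1764.9 | 2059 | 1964.59 | 117.709 | AAAAAGAA | 130 | No | 0.019465 |
| TTTTTGGT | 989 | 902.56 | 1029 | 963.191 | 90.453 | 1006 | 920.175 | 1052 | 987.344 | 89.7087 | ACCAAAAA | 9144 | No | 0.018455 |
| ATTTTCTG | 324 | 245.197 | 330 | 255.296 | 90.2932 | 340 | 264.756 | 346 | 275.991 | 85.047 | CAGAAAAT | 241 | No | 4.24E-06 |
| AATATATT | 462 | 382.795 | 474 | 400.615 | 86.8857 | 477 | 412.829 | 490 | 433.187 | 68.9186 | AATATATT | 21 | Yes | 0.000195 |
| TTTGTGTG | 688 | 607.303 | 705 | 640.94 | 85.8355 | 705 | 625.577 | 726 | 662.623 | 84.2635 | CACACAAA | 8153 | No | 0.006617 |
| TGTTTTTT | 1716 | 1632.37 | 1839 | 1791.05 | 85.7404 | 1730 | 1636.78 | 1864 | 1811.88 | 95.8269 | AAAAAACA | 1065 | No | 0.131261 |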

Top 25 overrepresented words for the 3'Untranslated Regions in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.
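The RevComp, RC_Pos and Pal attributes described above can be derived as in the following Python sketch. The function names are hypothetical, and treating RC_Pos as a 0-based rank is an assumption, although it appears consistent with the values listed in the tables (for example, a word at the top of a list is referenced as position 0 by its reverse complement).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def is_palindrome(word):
    """A genomic palindrome equals its own reverse complement."""
    return word == reverse_complement(word)


def annotate(ranked_words):
    """For each ranked word give (word, RevComp, RC_Pos, Pal).

    RC_Pos is the 0-based rank of the reverse complement (an assumption),
    or None when the reverse complement is not in the list.
    """
    position = {w: i for i, w in enumerate(ranked_words)}
    rows = []
    for w in ranked_words:
        rc = reverse_complement(w)
        rows.append((w, rc, position.get(rc), is_palindrome(w)))
    return rows


print(reverse_complement("TTTTTGTT"))
print(is_palindrome("AATATATT"))
for row in annotate(["CTCTTCTC", "AATATATT", "GAGAAGAG"]):
    print(row)
```

The first two lines print AACAAAAA and True, matching the RevComp and Pal entries given for TTTTTGTT and AATATATT in Table 2.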

Table 3

The top 25 words in 5'UTRs

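| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CTCTTCTC | 871 | 614.433 | 992 | 668.648 | 303.928 | 883 | 669.295 | 972 | 729.203 | 244.68 | GAGAAGAG | 4 | No | -2.22E-16 |
| CTTTCTCT | 1154 | 1003.84 | 1293 | 1115.45 | 160.868 | 1204 | 1040.02 | 1327 | 1164.52 | 176.278 | AGAGAAAG | 15 | No | 1.14E-07 |
| AACAAAAA | 1051 | 920.535 | 1134 | 1018.31 | 139.302 | 1082 | 933.212 | 1157 | 1036.72 | 160.064 | TTTTTGTT | 16 | No | 0.000192 |
| TTTCTTCA | 611 | 492.734 | 631 | 532.75 | 131.443 | 808 | 714.439 | 849 | 780.981 | 99.4364 | TGAAGAAA | 227 | No | 1.88E-05 |
| GAGAAGAG | 316 | 211.511 | 360 | 225.309 | 126.863 | 305 | 219.262 | 327 | 231.047 | 100.664 | CTCTTCTC | 0 | No | 0 |
| TTCTCTCC | 455 | 346.314 | 464 | 371.543 | 124.193 | 504 | 412.082 | 517 | 440.518 | 101.482 | GGAGAGAA | 130 | No | 2.11E-06 |
| CTTTCTTC | 883 | 771.778 | 929 | 846.965 | 118.876 | 960 | 807.394 | 1006 | 888.66 | 166.197 | GAAGAAAG | 87 | No | 0.00285 |
| CTCTCTTT | 1229 | 1116.97 | 1351 | 1248.77 | 117.468 | 1284 | 1161.65 | 1410 | 1312.47 | 128.577 | AAAGAGAG | 9 | No | 0.002211 |
| TTTCTCTC | 1421 | 1308.64 | 1554 | 1478.35 | 117.051 | 1494 | 1385.35 | 1636 | 1591.45 | 112.808 | GAGAGAAA | 74 | No | 0.025997 |
| AAAGAGAG | 666 | 561.408 | 709 | 609.221 | 113.781 | 625 | 511.53 | 649 | 550.867 | 125.216 | CTCTCTTT | 7 | No | 4.30E-05 |
| AGAAAAAA | 1078 | 972.588 | 1154 | 1078.91 | 110.928 | 1097 | 983.999 | 1179 | 1097.24 | 119.255 | TTTTTTCT | 93 | No | 0.012195 |
| AAAGAAAA | 978 | 875.456 | 1093 | 966.097 | 108.328 | 1000 | 886.23 | 1111 | 981.116 | 120.779 | TTTTCTTT | 35 | No | 3.32E-05 |
| ATCTCTCA | 332 | 243.705 | 342 | 260.045 | 102.647 | 380 | 308.328 | 392 | 327.073 | 79.4223 | TGAGAGAT | 448 | No | 6.93E-07 |
| AAAAAACA | 759 | 663.266 | 803 | 723.672 | 102.333 | 774 | 675.404 | 814 | 736.19 | 105.466 | TGTTTTTT | 298 | No | 0.001952 |
| TTTTTCTT | 1020 | 923.944 | 1116 | 1022.27 | 100.884 | 1501 | 1398.57 | 1742 | 1608.22 | 106.097 | AAGAAAAA | 20 | No | 0.001995 |
| AGAGAAAG | 589 | 496.468 | 634 | 536.894 | 100.664 | 548 | 457.974 | 578 | 491.244 | 98.3457 | CTTTCTCT | 1 | No | 2.45E-05 |
| TTTTTGTT | 811 | 719.391 | 885 | 787.265 | 97.2085 | 1506 | 1441.03 | 1818 | 1662.31 | 66.4099 | AACAAAAA | 2 | No | 0.000332 |
| ACAAAAAA | 845 | 754.352 | 901 | 827.069 | 95.888 | 865 | 767.534 | 916 | 842.311 | 103.408 | TTTTTTGT | 37 | No | 0.005817 |
| TAAAAAAG | 231 | 152.899 | 238 | 162.371 | 95.3195 | 272 | 196.748 | 284 | 206.973 | 88.0952 | CTTTTTTA | 149 | No | 1.66E-08 |
| CAAAAACC | 357 | 273.395 | 362 | 292.183 | 95.2547 | 386 | 290.194 | 393 | 307.419 | 110.121 | GGTTTTTG | 59 | No | 4.45E-05 |
| AAGAAAAA | 1104 | 1013.1 | 1209 | 1126.3 | 94.8599 | 1134 | 1021.85 | 1230 | 1142.64 | 118.087 | TTTTTCTT | 14 | No | 0.007636 |
| CCTCTCTT | 351 | 268.225 | 358 | 286.579 | 94.4052 | 372 | 313.865 | 375 | 333.083 | 63.2147 | AAGAGAGG | 550 | No | 2.65E-05 |
| TCTTCTCC | 907 | 817.38 | 946 | 899.203 | 94.3624 | 899 | 804.147 | 934 | 884.875 | 100.239 | GGAGAAGA | 676 | No | 0.062179 |
| TTCTCTCA | 473 | 387.786 | 484 | 416.951 | 93.9572 | 538 | 481.457 | 555 | 517.331 | 59.7404 | TGAGAGAA | 126 | No | 0.000721 |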

Top 25 overrepresented words for the 5'Untranslated Regions in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.

Table 4

The top 25 words in Introns

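| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TTTTTGTT | 10048 | 9365.74 | 11094 | 10679.8 | 706.524 | 9819 | 9103.26 | 10783 | 10355.3 | 743.17 | TTTTTGTT | 10048 | 9365.74 | 3.44E-05 |
| TTTTTCTT | 9144 | 8495.68 | 10021 | 9609.91 | 672.454 | 8939 | 8293.57 | 9751 | 9363.74 | 669.915 | TTTTTCTT | 9144 | 8495.68 | 1.58E-05 |
| CTTTTTTC | 2764 | 2170.42 | 2821 | 2314.32 | 668.224 | 2713 | 2187.97 | 2767 | 2333.43 | 583.515 | CTTTTTTC | 2764 | 2170.42 | 8.88E-16 |
| GTTTTTGA | 2673 | 2105.13 | 2742 | 2243.33 | 638.372 | 2631 | 2056.65 | 2696 | 2190.66 | 647.973 | GTTTTTGA | 2673 | 2105.13 | -2.22E-16 |
| TTTTGCAG | 3505 | 2959.4 | 3523 | 3179.19 | 593.06 | 3452 | 2920.63 | 3470 | 3136.4 | 577.016 | TTTTGCAG | 3505 | 2959.4 | 1.07E-09 |
| TTTTTTGT | 7618 | 7067.97 | 8198 | 7889.79 | 570.901 | 7400 | 6823.86 | 7922 | 7600.06 | 599.8 | TTTTTTGT | 7618 | 7067.97 | 0.000286 |
| TTTTTTGG | 3765 | 3238.3 | 3942 | 3487.94 | 567.378 | 3635 | 3124.76 | 3795 | 3362.05 | 549.804 | TTTTTTGG | 3765 | 3238.3 | 2.62E-14 |
| TTTTCTTT | 9256 | 8733.23 | 10299 | 9900.39 | 538.109 | 9041 | 8500.19 | 9994 | 9615.3 | 557.761 | TTTTCTTT | 9256 | 8733.23 | 3.48E-05 |
| TGTTTTTT | 7487 | 6984.58 | 8028 | 7790.67 | 520.072 | 7254 | 6759.65 | 7750 | 7524.05 | 512 | TGTTTTTT | 7487 | 6984.58 | 0.003768 |
| CTCTCTTT | 3193 | 2716.79 | 3289 | 2911.9 | 515.697 | 3086 | 2625.01 | 3165 | 2811.09 | 499.291 | CTCTCTTT | 3193 | 2716.79 | 3.97E-12 |
| ATTTTTTA | 2508 | 2044.78 | 2645 | 2177.76 | 512.128 | 2383 | 2003.78 | 2486 | 2133.28 | 413.027 | ATTTTTTA | 2508 | 2044.78 | 3.33E-16 |
| TTTTTTCC | 3166 | 2702.47 | 3253 | 2896.16 | 501.186 | 3086 | 2616.31 | 3161 | 2801.55 | 509.528 | TTTTTTCC | 3166 | 2702.47 | 4.13E-11 |
| TGTTTCAG | 2215 | 1790.21 | 2239 | 1902.05 | 471.614 | 2153 | 1745.3 | 2177 | 1853.55 | 451.987 | TGTTTCAG | 2215 | 1790.21 | 3.01E-14 |
| GGTTTTTG | 2029 | 1611.17 | 2092 | 1708.92 | 467.851 | 1997 | 1584.97 | 2058 | 1680.71 | 461.47 | GGTTTTTG | 2029 | 1611.17 | 1.11E-16 |
| TTTTGTTT | 12142 | 11689.3 | 13879 | 13619.2 | 461.327 | 11843 | 11368.1 | 13438 | 13205.7 | 484.659 | TTTTGTTT | 12142 | 11689.3 | 0.013306 |
| TTTGTTTT | 11017 | 10569.9 | 12527 | 12188.1 | 456.39 | 10729 | 10259.7 | 12106 | 11796.5 | 479.827 | TTTGTTTT | 11017 | 10569.9 | 0.00113 |
| CTTTTTTA | 2234 | 1828.76 | 2282 | 1943.72 | 447.149 | 2178 | 1816.31 | 2220 | 1930.26 | 395.524 | CTTTTTTA | 2234 | 1828.76 | 4.17E-14 |
| AATATATT | 2022 | 1642.55 | 2143 | 1742.72 | 420.253 | 1925 | 1679.14 | 2019 | 1782.16 | 263.038 | AATATATT | 2022 | 1642.55 | 4.44E-16 |
| ATTTTTCA | 2411 | 2030.35 | 2467 | 2162.1 | 414.291 | 2349 | 1971.89 | 2398 | 2098.68 | 411.073 | ATTTTTCA | 2411 | 2030.35 | 7.51E-11 |
| ATTTTTTC | 2810 | 2425.9 | 2881 | 2592.99 | 413.021 | 2736 | 2412.96 | 2800 | 2578.85 | 343.758 | ATTTTTTC | 2810 | 2425.9 | 1.43E-08 |
| CAATTTTT | 2402 | 2023.84 | 2481 | 2155.04 | 411.472 | 2320 | 1952.98 | 2388 | 2078.19 | 399.534 | CAATTTTT | 2402 | 2023.84 | 3.73E-12 |
| TTTTTTCT | 7674 | 7280.17 | 8254 | 8142.69 | 404.295 | 7476 | 7074.7 | 8001 | 7897.8 | 412.475 | TTTTTTCT | 7674 | 7280.17 | 0.109849 |
| TGTTGCAG | 1922 | 1563.72 | 1933 | 1657.84 | 396.507 | 1891 | 1543.21 | 1902 | 1635.78 | 384.332 | TGTTGCAG | 1922 | 1563.72 | 2.42E-11 |
| TTTCATTT | 4636 | 4258.39 | 4840 | 4630.74 | 393.879 | 4538 | 4169.05 | 4731 | 4529.8 | 384.813 | TTTCATTT | 4636 | 4258.39 | 0.001152 |
| TTTTTATT | 5647 | 5276.08 | 6142 | 5792.21 | 383.658 | 5417 | 5037.47 | 5842 | 5517.96 | 393.481 | TTTTTATT | 5647 | 5276.08 | 2.72E-06 |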

Top 25 overrepresented words for the Introns in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.

Table 5

The top 25 words in Core Promoters

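| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TATAAATA | 1355 | 1071.69 | 1369 | 1175.57 | 317.831 | 1300 | 1029.92 | 1311 | 1128.85 | 302.753 | TATTTATA | 69 | No | 2.02E-08 |
| CTATAAAT | 712 | 474.27 | 716 | 514.446 | 289.286 | 704 | 464.711 | 708 | 503.987 | 292.416 | ATTTATAG | 2504 | No | 7.77E-16 |
| CTATATAA | 636 | 410.261 | 638 | 444.486 | 278.826 | 626 | 450.579 | 628 | 488.533 | 205.839 | TTATATAG | 18530 | No | 1.11E-16 |
| ATATAAAC | 560 | 350.797 | 560 | 379.643 | 261.928 | 554 | 347.685 | 554 | 376.253 | 258.091 | GTTTATAT | 26957 | No | 4.44E-16 |
| TAAAAAAT | 473 | 295.342 | 480 | 319.301 | 222.765 | 453 | 298.58 | 460 | 322.82 | 188.835 | ATTTTTTA | 12 | No | -2.22E-16 |
| ATATATAC | 544 | 394.869 | 559 | 427.688 | 174.295 | 507 | 330.093 | 515 | 357.099 | 217.573 | GTATATAT | 5651 | No | 7.41E-10 |
| AATATATT | 300 | 181.346 | 300 | 195.646 | 151.012 | 287 | 195.452 | 287 | 210.918 | 110.256 | AATATATT | 6 | Yes | 2.74E-12 |
| TTATATAA | 524 | 397.031 | 529 | 430.047 | 145.398 | 514 | 430.79 | 518 | 466.905 | 90.7739 | TTATATAA | 7 | Yes | 2.22E-06 |
| AAGAAAAA | 1261 | 1129.24 | 1318 | 1240.05 | 139.165 | 1189 | 1063 | 1238 | 1165.84 | 133.189 | TTTTTCTT | 25 | No | 0.014544 |
| ATATAAAG | 378 | 262.861 | 380 | 284.014 | 137.316 | 375 | 261.181 | 377 | 282.19 | 135.643 | CTTTATAT | 377 | No | 3.41E-08 |
| TATATAAA | 1260 | 1131.11 | 1276 | 1242.15 | 135.966 | 1234 | 1102.41 | 1250 | 1209.97 | 139.143 | TTTATATA | 1458 | No | 0.171817 |
| AGAAAAAA | 1127 | 1000.04 | 1170 | 1095.49 | 134.693 | 1063 | 936.863 | 1099 | 1025.06 | 134.271 | TTTTTTCT | 31 | No | 0.01331 |
| ATTTTTTA | 312 | 204.097 | 315 | 220.282 | 132.415 | 299 | 207.163 | 302 | 223.604 | 109.715 | TAAAAAAT | 4 | No | 1.17E-09 |
| TTTTAAAA | 688 | 568.245 | 696 | 617.46 | 131.571 | 658 | 543.865 | 665 | 590.7 | 125.351 | TTTTAAAA | 13 | Yes | 0.001019 |
| CTCTTCTC | 402 | 294.202 | 429 | 318.061 | 125.499 | 371 | 277.661 | 390 | 300.087 | 107.516 | GAGAAGAG | 444 | No | 1.97E-09 |
| ACAAAAAA | 958 | 840.585 | 988 | 918.052 | 125.259 | 917 | 799.552 | 939 | 872.564 | 125.681 | TTTTTTGT | 45 | No | 0.011607 |
| ATAAATAC | 578 | 466.039 | 582 | 505.44 | 124.446 | 574 | 459.992 | 578 | 498.825 | 127.095 | GTATTTAT | 14072 | No | 0.000465 |
| TTATAAAA | 507 | 397.553 | 508 | 430.617 | 123.294 | 490 | 386.47 | 491 | 418.525 | 116.302 | TTTTATAA | 945 | No | 0.000153 |
| AAATTAAA | 718 | 609.913 | 745 | 663.251 | 117.144 | 682 | 578.03 | 705 | 628.206 | 112.806 | TTTAATTT | 96 | No | 0.000967 |
| GCCCATTA | 374 | 273.89 | 396 | 295.991 | 116.512 | 372 | 272.658 | 394 | 294.653 | 115.571 | TAATGGGC | 190 | No | 1.82E-08 |
| AAAAAACA | 893 | 787.368 | 924 | 859.073 | 112.42 | 849 | 736.927 | 874 | 803.277 | 120.193 | TGTTTTTT | 33 | No | 0.014723 |
| TTAAAAAA | 805 | 701.565 | 828 | 764.227 | 110.71 | 768 | 667.112 | 788 | 726.227 | 108.159 | TTTTTTAA | 27 | No | 0.01177 |
| ATTAAAAA | 708 | 609.58 | 719 | 662.885 | 105.969 | 671 | 581.412 | 681 | 631.921 | 96.1611 | TTTTTAAT | 316 | No | 0.016276 |
| GCCCAATA | 322 | 231.782 | 340 | 250.291 | 105.859 | 321 | 228.286 | 337 | 246.5 | 109.41 | TATTGGGC | 130 | No | 4.26E-08 |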

Top 25 overrepresented words for the core promoter regions in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.

Table 6

The top 25 words in Proximal Promoters

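| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TAAAAAAT | 4249 | 3411.11 | 4837 | 3674.74 | 933.272 | 3681 | 3028.65 | 4071 | 3237.18 | 718.039 | ATTTTTTA | 1 | No | 0 |
| ATTTTTTA | 3876 | 3135.31 | 4372 | 3358.5 | 822.011 | 3313 | 2758.58 | 3636 | 2932.38 | 606.738 | TAAAAAAT | 0 | No | 2.22E-16 |
| TTATATAA | 3094 | 2505.92 | 3390 | 2650.31 | 652.239 | 2712 | 2508.38 | 2934 | 2653.02 | 211.674 | TTATATAA | 2 | Yes | 7.77E-16 |
| AATATATT | 3636 | 3104.08 | 4093 | 3322.92 | 575.097 | 3178 | 3009.54 | 3503 | 3215.49 | 173.09 | AATATATT | 3 | Yes | 1.67E-15 |
| GAAAAAAG | 2066 | 1652.5 | 2182 | 1718.49 | 461.395 | 1956 | 1621.19 | 2053 | 1684.9 | 367.226 | CTTTTTTC | 5 | No | 1.11E-16 |
| CTTTTTTC | 1960 | 1578.31 | 2072 | 1638.97 | 424.512 | 1869 | 1559.58 | 1969 | 1618.92 | 338.269 | GAAAAAAG | 4 | No | 1.11E-16 |
| AAAAATTG | 2975 | 2595.17 | 3208 | 2749.61 | 406.363 | 2737 | 2368.41 | 2938 | 2497.98 | 395.888 | CAATTTTT | 9 | No | -6.66E-16 |
| TAAAATTT | 4339 | 3951.48 | 5058 | 4305.15 | 405.93 | 3764 | 3348.9 | 4214 | 3603.07 | 439.821 | AAATTTTA | 10 | No | -6.66E-16 |
| TAATTTTT | 4656 | 4272.02 | 5336 | 4686.12 | 400.739 | 4125 | 3726.41 | 4609 | 4040.78 | 419.188 | AAAAATTA | 19 | No | 0 |
| CAATTTTT | 2872 | 2499.79 | 3110 | 2643.5 | 398.638 | 2633 | 2269.83 | 2829 | 2389.32 | 390.785 | AAAAATTG | 6 | No | 6.66E-16 |
| AAATTTTA | 4239 | 3880.57 | 4921 | 4221.59 | 374.5 | 3651 | 3305.77 | 4102 | 3553.5 | 362.665 | TAAAATTT | 7 | No | 8.88E-16 |
| TACAAAAT | 2589 | 2241.1 | 2821 | 2357.73 | 373.61 | 2344 | 2040.96 | 2514 | 2138.69 | 324.496 | ATTTTGTA | 26 | No | 6.66E-16 |
| ATTTTCTA | 2206 | 1886.09 | 2346 | 1970.39 | 345.622 | 2022 | 1748.93 | 2142 | 1822.19 | 293.357 | TAGAAAAT | 17 | No | 8.88E-16 |
| TGAAAAAT | 2374 | 2075.6 | 2517 | 2176.47 | 318.891 | 2230 | 1927.32 | 2354 | 2015.09 | 325.288 | ATTTTTCA | 21 | No | 5.64E-13 |
| AAAAAATC | 3874 | 3607.85 | 4265 | 3902.57 | 275.738 | 3494 | 3280.06 | 3823 | 3524 | 220.77 | GATTTTTT | 68 | No | 5.63E-09 |
| CATTTTTC | 1675 | 1426.93 | 1760 | 1477.44 | 268.478 | 1558 | 1356.8 | 1624 | 1402.92 | 215.428 | GAAAAATG | 29 | No | 5.16E-13 |
| TAAGAAAT | 1895 | 1645.36 | 1990 | 1710.83 | 267.683 | 1773 | 1553.49 | 1856 | 1612.42 | 234.336 | ATTTCTTA | 23 | No | 2.52E-11 |
| TAGAAAAT | 2154 | 1904.65 | 2281 | 1990.5 | 265.005 | 1971 | 1754.61 | 2083 | 1828.31 | 229.215 | ATTTTCTA | 12 | No | 1.04E-10 |
| GGAAAAAA | 2679 | 2426.86 | 2853 | 2562.63 | 264.801 | 2506 | 2238.07 | 2643 | 2354.4 | 283.363 | TTTTTTCC | 98 | No | 9.20E-09 |
| AAAAATTA | 4735 | 4477.84 | 5547 | 4933.58 | 264.404 | 4109 | 3862.67 | 4667 | 4200.51 | 254.025 | TAATTTTT | 8 | No | 1.33E-15 |
| CAAAATTT | 3347 | 3092.9 | 3655 | 3310.2 | 264.267 | 3054 | 2796.42 | 3304 | 2974.88 | 269.093 | AAATTTTG | 60 | No | 1.95E-09 |
| ATTTTTCA | 2338 | 2088.5 | 2489 | 2190.56 | 263.846 | 2169 | 1928.62 | 2295 | 2016.5 | 254.769 | TGAAAAAT | 13 | No | 2.29E-10 |
| TTTTTTGG | 3369 | 3120.79 | 3724 | 3341.96 | 257.829 | 3050 | 2802.67 | 3330 | 2981.91 | 257.935 | CCAAAAAA | 28 | No | 4.49E-11 |
| ATTTCTTA | 1947 | 1705.79 | 2052 | 1775.75 | 257.518 | 1800 | 1598.57 | 1900 | 1660.66 | 213.623 | TAAGAAAT | 16 | No | 8.37E-11 |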

Top 25 overrepresented words for the proximal promoters in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.

Table 7

The top 25 words in Distal Promoters

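| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked SlnSES | Masked S | Masked ES | Masked O | Masked EO | Masked SlnSES | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ATTTTTTA | 5789 | 4874.02 | 7202 | 5393.37 | 995.937 | 4920 | 4189.9 | 5773 | 4568.53 | 790.309 | TAAAAAAT | 1 | No | 6.66E-16 |
| TAAAAAAT | 5865 | 4983.57 | 7314 | 5527.8 | 955.154 | 5003 | 4269.17 | 5877 | 4662.83 | 793.568 | ATTTTTTA | 0 | No | 6.66E-16 |
| GAAAAAAG | 3578 | 2825.77 | 3921 | 2995.09 | 844.484 | 3394 | 2744.34 | 3697 | 2903.99 | 721.112 | CTTTTTTC | 3 | No | 8.88E-16 |
| CTTTTTTC | 3546 | 2878.92 | 3904 | 3054.71 | 739.005 | 3345 | 2798.31 | 3662 | 2964.33 | 596.918 | GAAAAAAG | 2 | No | 0 |
| TTATATAA | 4781 | 4107.17 | 5656 | 4470.46 | 726.305 | 4138 | 3955.09 | 4717 | 4291.1 | 187.08 | TTATATAA | 4 | Yes | 0 |
| AATATATT | 5432 | 4895.21 | 6702 | 5419.31 | 565.205 | 4688 | 4574.65 | 5538 | 5029.33 | 114.742 | AATATATT | 5 | Yes | 0 |
| CAAGAAAC | 2910 | 2459.44 | 3187 | 2587.64 | 489.513 | 2818 | 2410.32 | 3089 | 2533.47 | 440.364 | GTTTCTTG | 7 | No | -4.44E-16 |
| GTTTCTTG | 2912 | 2482.93 | 3182 | 2613.58 | 464.176 | 2842 | 2430.36 | 3108 | 2555.55 | 444.685 | CAAGAAAC | 6 | No | 0 |
| GAAAAATG | 3158 | 2736.51 | 3416 | 2895.24 | 452.402 | 2871 | 2566.09 | 3080 | 2705.63 | 322.343 | CATTTTTC | 29 | No | 0 |
| GTTTTTGA | 3516 | 3093.27 | 3830 | 3296.52 | 450.382 | 3207 | 2816.69 | 3462 | 2984.91 | 416.186 | TCAAAAAC | 13 | No | 8.88E-16 |
| GAAAAAAC | 3013 | 2605.34 | 3240 | 2749.19 | 438.004 | 2744 | 2495.22 | 2935 | 2627.17 | 260.786 | GTTTTTTC | 26 | No | 5.55E-16 |
| CAATTTTT | 4457 | 4041.77 | 4991 | 4393.18 | 435.864 | 4009 | 3601.54 | 4440 | 3878.67 | 429.685 | AAAAATTG | 25 | No | 1.67E-15 |
| ATTTTGTA | 4098 | 3689.96 | 4626 | 3981.23 | 429.814 | 3735 | 3342.23 | 4123 | 3580.11 | 414.995 | TACAAAAT | 69 | No | 1.55E-15 |
| TCAAAAAC | 3414 | 3011.29 | 3688 | 3203.78 | 428.513 | 3129 | 2749.95 | 3358 | 2910.25 | 404.054 | GTTTTTGA | 9 | No | 7.77E-16 |
| GAAGAAAG | 3851 | 3448.5 | 4291 | 3702.07 | 425.126 | 3664 | 3290.44 | 4048 | 3520.87 | 394.006 | CTTTCTTC | 59 | No | 1.11E-16 |
| GTTTTATG | 2173 | 1793.07 | 2293 | 1861.81 | 417.607 | 2048 | 1720.91 | 2156 | 1784.36 | 356.372 | CATAAAAC | 57 | No | 1.11E-16 |
| CTTTATTC | 1618 | 1250.45 | 1676 | 1284.79 | 416.937 | 1500 | 1215.7 | 1548 | 1248.25 | 315.217 | GAATAAAG | 43 | No | 4.44E-16 |
| GTTTTAAG | 1957 | 1584.64 | 2054 | 1638.71 | 413.031 | 1791 | 1482.73 | 1871 | 1530.29 | 338.304 | CTTAAAAC | 28 | No | 1.33E-15 |
| ATTTTTCA | 4081 | 3695.36 | 4496 | 3987.5 | 405.137 | 3743 | 3364 | 4095 | 3605.05 | 399.585 | TGAAAAAT | 40 | No | 6.66E-16 |
| TAAGAAGT | 1465 | 1112.41 | 1517 | 1139.93 | 403.359 | 1388 | 1100.56 | 1435 | 1127.54 | 322.073 | ACTTCTTA | 62 | No | -8.88E-16 |
| CTTGTTTC | 2351 | 1980.52 | 2504 | 2064.03 | 403.153 | 2269 | 1929.76 | 2415 | 2009.12 | 367.453 | GAAACAAG | 35 | No | 0 |
| CAAAAAAG | 3391 | 3011.99 | 3696 | 3204.57 | 401.915 | 3126 | 2864.52 | 3392 | 3038.54 | 273.068 | CTTTTTTG | 88 | No | 0 |
| TAGAAAAT | 3556 | 3178.38 | 3887 | 3393.13 | 399.217 | 3219 | 2901.76 | 3488 | 3080.38 | 333.981 | ATTTTCTA | 41 | No | 0 |
| ATTCTTCA | 2716 | 2348.17 | 2896 | 2465.08 | 395.248 | 2529 | 2255.7 | 2691 | 2363.65 | 289.221 | TGAAGAAT | 31 | No | 1.11E-16 |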

Top 25 overrepresented words for the distal promoters in Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of sequences a word occurs in and the number of sequences the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score SlnSES describes a statistical coverage of the sequences analyzed in the set and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked SlnSES score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.
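The derived columns described in this caption can be illustrated with a minimal Python sketch. It assumes that SlnSES denotes S·ln(S/ES), as the column name suggests, and that a genomic palindrome is a word equal to its own reverse complement; the function names are hypothetical and are not taken from the authors' software.

```python
import math

# Translation table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Complement every base, then reverse the word."""
    return word.translate(_COMPLEMENT)[::-1]


def is_genomic_palindrome(word: str) -> bool:
    """A word is a genomic palindrome if it equals its reverse complement."""
    return word == reverse_complement(word)


def sln_ses(s: float, es: float) -> float:
    """Assumed form of the SlnSES score: S * ln(S / ES)."""
    return s * math.log(s / es)


# Values taken from the unmasked columns of the first row of Table 7.
print(reverse_complement("ATTTTTTA"))
print(is_genomic_palindrome("TTATATAA"))
print(round(sln_ses(5789, 4874.02), 2))
```

Under these assumptions, the computed score for ATTTTTTA agrees with the tabulated value of 995.937 up to the rounding of ES.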

Table 8

The top 25 words in the entire genome

          Unmasked                         Masked                           Unmasked
Word      S  ES  O       EO       OlnOEO   S  ES  O       EO       OlnOEO   RevComp   RC_Pos  Pal  PValues
AAAAAAAA  5  5   128631  119310   9675.67  5  5   101229  95334    6073.66  TTTTTTTT  1       No   0
TTTTTTTT  5  5   126533  117302   9585.11  5  5   98883   93091.2  5968.36  AAAAAAAA  0       No   1.67E-15
TATATATA  5  5   58215   49385.7  9575.32  5  5   29264   27159.9  2183.54  TATATATA  2       Yes  3.89E-15
ATATATAT  5  5   59429   53453    6298.28  5  5   30192   29596.8  601.111  ATATATAT  3       Yes  3.00E-15
TAAAAAAT  5  5   14823   11276.3  4053.8   5  5   11492   9148.23  2621.21  ATTTTTTA  5       No   4.44E-16
ATTTTTTA  5  5   14743   11385.1  3810.52  5  5   11392   9219.87  2409.99  TAAAAAAT  4       No   3.33E-16
GAAGAAGA  5  5   30102   26908.7  3375.68  5  5   22784   20523.6  2380.53  TCTTCTTC  7       No   0
TCTTCTTC  5  5   30267   27090.3  3356.11  5  5   23044   20902.7  2247.42  GAAGAAGA  6       No   0
TTTTAAAA  5  5   29354   26314.9  3208.24  5  5   19409   17519.9  1987.46  TTTTAAAA  8       Yes  2.55E-15
AATATATT  5  5   14170   11353.5  3140.06  5  5   11168   10179.5  1035.06  AATATATT  9       Yes  1.11E-16
TTTTCTTT  5  5   31066   28174.8  3034.69  5  5   26876   24423.6  2571.58  AAAGAAAA  11      No   0
AAAGAAAA  5  5   31033   28187.3  2984.8   5  5   26861   24502.1  2469     TTTTCTTT  10      No   1.11E-16
AGAGAGAG  5  5   19376   16630.5  2960.63  5  5   12615   11397.8  1280.05  CTCTCTCT  16      No   1.11E-16
TCTCTCTC  5  5   19179   16519.7  2862.73  5  5   12912   11634.1  1345.64  GAGAGAGA  14      No   4.44E-16
GAGAGAGA  5  5   20064   17413.4  2842.81  5  5   13136   11970.7  1220.21  TCTCTCTC  13      No   1.89E-15
AAGAAGAA  5  5   32397   29731.9  2781.12  5  5   24352   23296.2  1079.35  TTCTTCTT  19      No   0
CTCTCTCT  5  5   18513   15956.1  2751.61  5  5   12312   11212.7  1151.45  AGAGAGAG  12      No   1.11E-16
AGAAGAAG  5  5   26477   24049.7  2545.91  5  5   19161   18013.6  1183.17  CTTCTTCT  20      No   8.88E-16
TTATATAA  5  5   11402   9138.11  2523.66  5  5   9262    8518.12  775.46   TTATATAA  18      Yes  1.11E-15
TTCTTCTT  5  5   32333   29910    2518.58  5  5   24550   23579.9  989.811  AAGAAGAA  15      No   0
CTTCTTCT  5  5   26463   24183.9  2383.23  5  5   19432   18332.3  1132.03  AGAAGAAG  17      No   0
TTTTTCTT  5  5   30561   28331    2315.57  5  5   26516   24717.1  1862.84  AAGAAAAA  22      No   0
AAGAAAAA  5  5   30461   28234.7  2311.9   5  5   26488   24756.8  1790.32  TTTTTCTT  21      No   4.44E-16
TTTGTTTT  5  5   32141   29931    2289.6   5  5   27813   26102.2  1765.71  AAAACAAA  36      No   8.88E-16

Top 25 overrepresented words for the entire genome of Arabidopsis thaliana. The Word attribute describes the short nucleotide sequence associated with a putative word. S and ES describe the number of chromosomes a word occurs in and the number of chromosomes the word was expected to occur in respectively, while O and EO describe the total number of occurrences and the expected total number of occurrences. The score OlnOEO describes a statistical overrepresentation of the word in the genome and is based on a Markov Chain Background Model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment with the emphasis placed on the unmasked version (i.e. sorting of the table based on the unmasked OlnOEO score).

Further information for the word is provided through its reverse complement (RevComp) and the position of the reverse complement in the set of results (RC_Pos) as well as a notion describing if the word is a genomic palindrome (Pal).

Finally, PValues describes a p-value that is assigned in order to provide statistical insight allowing the determination if a word is relevant or was discovered as interesting by random chance.

A detailed analysis of the words identified a minimal overlap between the sets of overrepresented words for the different segments. Specifically, considering the list of top 25 words discovered in any of the six segments (and in the genome wide analysis), 175 words were unique to one specific set, 15 words occurred uniquely in two sets, 7 in three sets, 4 in four sets and none in five sets. Only two words (ATTTTTTA, and AATATATT) were shared in six out of seven sets (neither word was present in the 5'UTR set). Note that the word AATATATT has a significant similarity to the sequence of the TATA-box, a regulatory element that is (1) often found in core promoters and (2) known to contribute to the correct positioning of the core transcriptional machinery [42]. It is conceivable that the absence of AATATATT in the 5'UTR set prevents the initiation of transcription at incorrect sites. 
The large differences between the various sets of words provide evidence for the existence of segment-specific signatures. Of additional interest is the uniqueness of the word-based genomic signatures in comparison to the signature for the entire Arabidopsis genome. Clearly, the segments' signatures distinguish them from each other and from the entire genome. In addition to uniquely characterizing each segment, the top words discovered in each data set also have a strong probability of being functional regulatory elements. This argument was strengthened by a functional analysis, which is described later in this section.
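The overlap tally reported above (how many words appear in exactly one set, two sets, and so on) can be sketched in Python as follows. The function name and the three short word lists are illustrative assumptions only; they are not the actual top-25 lists of any segment.

```python
from collections import Counter


def set_membership_histogram(top_words_by_set):
    """Map 'number of sets a word appears in' to 'number of such words'."""
    sets_per_word = Counter()
    for words in top_words_by_set.values():
        # Use set() so a word counts at most once per segment.
        sets_per_word.update(set(words))
    return Counter(sets_per_word.values())


# Toy input (hypothetical lists, far shorter than the real top-25 sets).
example = {
    "introns": ["ATTTTTTA", "AATATATT", "TTTTATTT"],
    "core_promoters": ["ATTTTTTA", "AATATATT", "TATAAATA"],
    "5utr": ["CTCTCTTT", "TTTCTCTC"],
}

hist = set_membership_histogram(example)
for n_sets in sorted(hist):
    print(f"{hist[n_sets]} word(s) occur in exactly {n_sets} set(s)")
```

Applied to the seven real lists, a tally of this kind would yield the counts quoted in the text (175 words in one set, 15 in two, and so on).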

Missing Words

Another interesting component of our word-based signature is the set of words NOT contained within the different segments (see Tables 9, 10, 11, and 12 and Additional files 8, 9, 10, and 11), referred to as unwords [43] or nullomers [44,45]. The absence of words in particular segments is an interesting phenomenon and may represent negative selection pressure or increased mutation rates specific to these words, or structural constraints on DNA [44]. Thus, the missing word sets, which show unwords and their associated scores, serve as important 'fingerprints' for the segments.
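As a rough illustration of how unwords can be found by enumeration, the Python sketch below lists every word of length k over the DNA alphabet that never occurs in a set of sequences. The function name and the toy sequences are hypothetical, and the sketch does not compute the expected counts (E_S and E) that the tables use for ranking, since those depend on the background model.

```python
from itertools import product


def find_unwords(sequences, k):
    """Return all length-k DNA words that occur in none of the sequences."""
    observed = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            observed.add(seq[i:i + k])
    all_words = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(w for w in all_words if w not in observed)


# Toy example with k = 2; the study itself uses k = 8 (65,536 words).
toy_sequences = ["ACGTAC", "TTGA"]
missing = find_unwords(toy_sequences, 2)
print(len(missing), missing)
```

For k = 8 the full enumeration is still small enough to hold in memory, which is what makes an exhaustive approach practical at this word length.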
Table 9

Words not detected in the 3'UTRs

#WORD     E_S      E
CTAGCAGG  5.98269  6.17391
ACTGCCAG  4.99319  5.1526
CGCCTGAT  4.97776  5.13667
GCGTCCGA  4.52742  4.67187
GGGGTGGC  4.5248   4.66917
ACTCCGCC  4.38831  4.5283
CCCGTTCC  4.25101  4.3866
ACACGCCG  4.21714  4.35165
CCCGCTCA  4.193    4.32673
CTGGGCGT  4.06873  4.19847
GACCTGCG  3.71851  3.83704
GCGCAGTA  3.68699  3.80451
GCACCCGA  3.6084   3.7234
GCACCCTC  3.59671  3.71134
CGCACCCA  3.54333  3.65625
CCGCCGTC  3.53385  3.64646
GGGTCGGC  3.52406  3.63636
GCACGCCT  3.35465  3.46154
GCGCAGCC  3.31181  3.41732
CGTCCGCT  3.28252  3.3871
CTGGCGCC  3.2624   3.36634
GGCGACCT  3.25626  3.36
ATACGCCC  3.18816  3.28972
AGCGCTCC  2.98494  3.08
TAGCGCGG  2.98494  3.08

Top 25 words that were expected to occur in the 3'UTR but are not part of the sequences. Each word is identified through its nucleotide sequence and is listed with the expected number of sequences it was computed to occur in (E_S) as well as the expected number of total occurrences in the set of sequences (E). The words are sorted by their expected sequence occurrence.

Table 10

Words not detected in the 5'UTRs

#WORD     E_S      E
GGAACTGC  5.1333   5.40909
GAGGACCC  5.02658  5.29661
GCCCTATA  5.015    5.2844
CCGTACCT  4.98236  5.25
GCGAGTAT  4.94491  5.21053
TATCGCAC  4.83088  5.09034
GGTTGCGG  4.69443  4.94652
GCGGAGTG  4.66421  4.91468
AGTACAGC  4.51745  4.76
GTGCCGAT  4.4368   4.675
GTCCTGGG  4.41572  4.65278
CGGCCGTG  4.3768   4.61176
GGTCGGGG  4.16843  4.39216
GTGCTGGG  4.13122  4.35294
TAGTGCAC  4.12843  4.35
TACCGGCC  4.08277  4.30189
GCCTACGC  4.03144  4.24779
CACCGCGG  3.94494  4.15663
GCGGCGTG  3.90217  4.11155
CGCCTTAG  3.77819  3.98089
CAGCCCAG  3.74709  3.94811
TGAACGGG  3.74703  3.94805
CGTACTGC  3.74638  3.94737
GTGCGCCG  3.68013  3.87755
AGTCCTGG  3.67692  3.87417

Top 25 words that were expected to occur in the 5'UTR but are not part of the sequences. Each word is identified through its nucleotide sequence and is listed with the expected number of sequences it was computed to occur in (E_S) as well as the expected number of total occurrences in the set of sequences (E). The words are sorted by their expected sequence occurrence.

Table 11

Words not detected in the Introns

#WORD     E_S      E
CGCGGACA  6.1805   6.4557
CCCGGGAG  4.57278  4.77632
CCGGCCCC  4.46781  4.66667
CGCCCCCC  4.45254  4.65072
GCCCACCG  4.16782  4.35331
GCCGCGGG  3.47686  3.63158
CCGAGGGG  3.34433  3.49315
AAGCGCCC  3.17737  3.31875
CGCCAGCG  2.99188  3.125
CGCTCGCG  2.91507  3.04478
GCGTCGCG  2.8245   2.95017
CCGGCACG  2.48216  2.59259
CCGGGGCG  2.25483  2.35514
CCCGCGCC  2.16189  2.25806
TCGGGCGC  2.11021  2.20408
GCGCACGG  2.02051  2.11039
CGCTCCGC  2.00514  2.09434
CGCGACGC  1.99945  2.0884
TGCGCCCG  1.9539   2.04082
GGTGCGCG  1.92911  2.01493
GCGGGCCC  1.90464  1.98936
CGCGGCGA  1.86163  1.94444
GCGCGACG  1.83299  1.91453
GGGCGGGC  1.79662  1.87654
CCGCCGGG  1.73887  1.81622

Top 25 words that were expected to occur in the introns but are not part of the sequences. Each word is identified through its nucleotide sequence and is listed with the expected number of sequences it was computed to occur in (E_S) as well as the expected number of total occurrences in the set of sequences (E). The words are sorted by their expected sequence occurrence.

Table 12

Words not detected in the Core Promoters

#WORD     E_S      E
CGCACACC  5.86109  6.3029
GTCCGAAC  5.46787  5.88
GCCCTATG  5.23895  5.6338
GGACGTCG  4.98873  5.36471
GGCCCTAG  4.47129  4.80822
CGCGAGCG  4.35999  4.68852
GATCCCCC  3.92081  4.21622
GGCCGCAT  3.82028  4.10811
TACCCAGG  3.80429  4.09091
GGCCCCTG  3.67267  3.94937
CGCATCCG  3.66922  3.94565
CACGCCGA  3.56933  3.83824
CCGGCCGC  3.51312  3.77778
CGCGGTCA  3.51079  3.77528
AGGGCCCT  3.50922  3.77358
GGCGCTGT  3.49296  3.7561
ACGCCCTG  3.45587  3.71622
GCGGACAC  3.30648  3.55556
AGTGGCGC  3.29952  3.54808
GGGCGTTC  3.26995  3.51628
CGCGCAAG  3.25481  3.5
ACCCGCGT  3.22635  3.46939
TTACCCCG  3.22482  3.46774
CCGGTGCG  3.18249  3.42222
TAGGGCCG  3.18249  3.42222

Top 25 words that were expected to occur in the core promoters but are not part of the sequences. Each word is identified through its nucleotide sequence and is listed with the expected number of sequences it was computed to occur in (E_S) as well as the expected number of total occurrences in the set of sequences (E). The words are sorted by their expected sequence occurrence.


Word-based Clusters

Any biologically required sequence experiences evolutionary pressure (in this case purifying selection) resulting in a narrowing of the range of allowable sequence mutations. Often, a word and various mutations of the word exhibit the same functionality. To incorporate this into our analysis, clusters were built around each of the top overrepresented words, forming groups of words that are similar to each 'seed word.' Word similarity was measured through the Hamming distance metric, which models single point mutations. A Hamming distance of 1 was used to form the clusters. Each cluster is depicted via a sequence logo, providing a visual motif of the characteristics of the cluster. Selected clusters and the corresponding sequence logos are shown in Additional file 12. Two representative motifs are presented for each segment. Motifs for each segment were chosen in order to provide a variety of examples of putative binding sites for the non-coding segments. The presented motifs correspond to well-known regulatory elements and complex motifs, which represent sets of putative regulatory elements. Of particular interest in Additional file 12 are the word-based clusters for the core promoters (in the left column) which correspond to the TATA-box. Also known as the Goldberg-Hogness box [46], the TATA-box is a well-characterized regulatory element appearing 31 bp upstream of the transcription start site in 30% of the promoter sequences in Arabidopsis [23]. The core promoters also contain another interesting motif, (CGACGTCG), which is involved in stress response in Arabidopsis thaliana [22]. An extensive functional characterization is described later in this section.
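The clustering step described above can be sketched in Python as follows: a cluster is formed from all candidate words within Hamming distance 1 of a seed word, and per-position base counts are then tallied as the raw input for a sequence logo. The function names and the candidate list are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires words of equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_around(seed, candidates, max_distance=1):
    """Keep the candidates within max_distance point mutations of the seed."""
    return [w for w in candidates
            if len(w) == len(seed) and hamming(seed, w) <= max_distance]


def position_counts(words):
    """Per-position base counts, from which a sequence logo can be drawn."""
    return [Counter(column) for column in zip(*words)]


# Hypothetical candidate list; the seed resembles a TATA-box word.
seed = "TATAAATA"
candidates = ["TATAAATA", "TATATATA", "TATAAAAA", "CTATATAA", "GCCCAATA"]
cluster = cluster_around(seed, candidates)
print(cluster)
print(position_counts(cluster)[4])
```

Only substitutions are modeled here, which matches the single point mutations captured by the Hamming distance; insertions and deletions would require a different metric.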

Word Location Distribution

The locations of a particular word within a segment can provide insight into functional properties of the word. For example, functional TATA motifs are located at a specific distance upstream of the TSS [23,46]. We identified the segment-specific locations of the seed words of the clusters shown in Additional file 12. Being selected for their high complexity, these words are expected to exhibit a distribution bias, manifesting as peaks in the scatterplots of the distribution across sequences, as shown in Figures 1, 2, 3 and 4.
Figure 1

Word location distribution across introns. Word location distributions for interesting words within the introns. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 2

Word location distribution across core promoters. Word location distribution for interesting words within the core promoters. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 3

Word location distribution across proximal promoters. Word location distributions for interesting words within proximal promoters. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 4

Word location distribution across the entire genome. Word location distributions for interesting words within the genome. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

The figures contain histograms showing the number of occurrences of specific words at each point along the sequences. For uniformity, sequence lengths are normalized to the range [1;100]. Strong peaks can indeed be found for the words selected in the intron, core promoter, and proximal promoter regions. The peaks detected for the intron segment are at both the 5' and 3' ends of the introns, which means that the words occur in close proximity to flanking exons. Close proximity to the intron-exon boundaries is expected for splicing regulatory sequences [2,8-16]. The peaks exhibited in core and proximal promoters are not surprising: the distributions of word locations in these segments are expected to show clustering, due to positional conservation of cis-regulatory elements [23]. 
Of particular interest is the location of the peak for the first word chosen for the core promoter distribution (TATAATA), the TATA-box. A location of around 31 bp upstream from the TSS corresponds to the findings in [23]. Interestingly, we also detect strong peaks for the example words chosen for the genome wide word landscape, possibly indicating an important chromosomal feature that is not yet understood.
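One way to build such a location histogram is sketched below in Python. The text states only that sequence lengths are normalized to the range [1;100]; the linear mapping of start positions onto 100 bins used here is an assumption, as are the function name and the toy sequences.

```python
from collections import Counter


def normalized_position_histogram(sequences, word, bins=100):
    """Count word occurrences per normalized position bin (1..bins)."""
    k = len(word)
    hist = Counter()
    for seq in sequences:
        last_start = len(seq) - k  # last index at which the word can start
        if last_start < 0:
            continue  # sequence is shorter than the word
        start = seq.find(word)
        while start != -1:
            # Relative position in [0, 1]; 0 if only one start is possible.
            frac = start / last_start if last_start > 0 else 0.0
            # Assumed linear mapping onto bins 1..bins.
            bin_index = 1 + min(bins - 1, int(frac * bins))
            hist[bin_index] += 1
            # Continue one base further so overlapping hits are counted.
            start = seq.find(word, start + 1)
    return hist


# Toy sequences of different lengths (hypothetical data).
toy = ["TATAGGGGGG", "GGGGGGTATA", "TATAGG"]
hist = normalized_position_histogram(toy, "TATA")
print(sorted(hist.items()))
```

Because positions are expressed relative to sequence length, occurrences at the start or end of sequences of different lengths fall into the same bins, which is what allows peaks near segment boundaries to become visible.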

Word Co-occurrences

Genes are usually controlled by a combination of multiple transcription factors, or by transcription factor complexes binding to different sites embedded in the genes' regulatory non-coding regions. In order to detect the interacting transcription factor binding sites of a complex, we examined the positional relationships of words. The top 25 overrepresented words were paired, and the overrepresentation of each pair was determined using a Markovian background model of order 6. The top 25 overrepresented word pairs for each segment are displayed in Tables 13, 14, 15, 16, 17 and 18 (see also Additional files 13, 14, 15, 16, 17, and 18). The limited overlap between the word pairs of different segments indicates additional unique word-based signatures for genomic segments.
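The pair analysis can be illustrated with a short Python sketch that counts the sequences in which two words co-occur without overlapping (S) and then applies the score S*ln(S/ES) used in the tables. The expected count ES is taken as a given input here, because deriving it requires the order-6 Markov background model, which is not reproduced; all function names and toy sequences are hypothetical.

```python
import math


def occurrences(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions, start = [], seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def co_occur_without_overlap(seq, w1, w2):
    """True if seq holds an occurrence of w1 and one of w2 that do not overlap."""
    p1, p2 = occurrences(seq, w1), occurrences(seq, w2)
    return any(a + len(w1) <= b or b + len(w2) <= a for a in p1 for b in p2)


def pair_support(sequences, w1, w2):
    """S: number of sequences containing the non-overlapping word pair."""
    return sum(co_occur_without_overlap(s, w1, w2) for s in sequences)


def pair_score(s, es):
    """S * ln(S / ES); ES must be supplied by a background model."""
    return s * math.log(s / es) if s > 0 else 0.0


toy = ["TTCTTTTTACGTTTTTCTT",  # both words present, not overlapping
       "TTCTTTTTCTT",          # both present, but only overlapping
       "ACGTACGT"]             # neither word present
support = pair_support(toy, "TTCTTTTT", "TTTTTCTT")
print(support)
# S and ES taken from the first row of Table 13.
print(round(pair_score(322, 238.5802), 4))
```

With S = 322 and ES = 238.5802, the score evaluates to approximately 96.55, matching the first row of Table 13.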
Table 13

Co-occurrence in 3'UTRs

Word1     Word2     S    ES        S*ln(S/ES)
TTCTTTTT  TTTTTCTT  322  238.5802  96.5504
TGTTTTTT  TTTTTCTT  283  217.7183  74.2154
TTCTTTTT  TTTTTTCT  260  197.5705  71.3925
TTTTTCTT  TTTTTGTT  326  273.0848  57.7395
TCTTTTTT  TTTTTCTT  270  218.9471  56.5898
TTTTCTTT  TTTTTTCT  278  226.8886  56.479
TTTTTTGG  TTTTTGTT  161  116.5969  51.9517
TTATTTTT  TTTTTCTT  211  166.8299  49.5604
TTCTTTTT  TTTTTGTT  290  248.3755  44.9324
TGTTTTTT  TTCTTTTT  239  198.0677  44.8973
TTTTCTTT  TCTTTTTT  270  228.7449  44.7699
TCTTTTTC  TTTTTTCT  112  76.7939   42.2658
TGTTTTTT  TTTTTTGG  129  93.1111   42.0564
TTTTTTGG  TTTTTCTT  148  112.0287  41.2117
TTTTTTCT  TTTTTTGG  128  92.8787   41.0542
TTTTCTTT  TGTTTTTT  265  227.4605  40.4796
TTTGTTTT  TTTTTTGG  170  134.4256  39.9138
TTCTTTTT  TTTTTTGG  136  101.9687  39.1665
TCTTTTTT  TTTTTTGG  127  93.6332   38.7099
TTTTCTTT  TTCTTTTT  285  249.2674  38.1794
TTTTTATT  TTATTTTT  137  103.7794  38.0467
TGTTTTTT  TTTTTTCT  215  180.3272  37.8109
TCTTTTTT  TTTTTTCT  216  181.3431  37.7758
TTTTTGGT  TTTTTGTT  161  127.4072  37.6766
ATTTTTTA  TTTTTCTT  82   53.2457   35.4078

Overrepresented non-overlapping word-pairs detected in the 3'Untranslated Regions of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).

Table 14

Co-occurrence in 5'UTRs

Word1     Word2     S    ES        S*ln(S/ES)
CTCTCTTT  CTTTCTCT  209  108.1185  137.7533
TTTCTCTC  CTCTCTTT  214  139.4419  91.6622
TTTCTCTC  CTTTCTCT  198  125.808   89.7949
TTTTTTGT  TTTTCTTT  97   41.7516   81.7683
CTTCTCTT  CTCTTCTC  97   45.9973   72.3745
CTCTGTTT  TTTTTCTT  105  54.0587   69.7085
TTTTTTGT  TTTTTCTT  97   48.6186   66.9983
TTTTCTTT  TTTTTCTT  122  71.3728   65.4048
TTTTTGTT  TTTTTCTT  115  65.2326   65.2019
TTTCTCTC  CTCTTCTC  128  78.07     63.2863
TTTTCTTT  TTTTTGTT  103  56.0093   62.7487
CTCTGTTT  TTTTTGTT  87   42.4337   62.4629
AAAGAAAA  AGAAAAAA  130  82.9236   58.4498
CTCTCTGT  CTTTCTCT  90   47.3124   57.8733
CTTTCTCT  CTCTTCTC  105  60.5869   57.7376
TTTTCTCC  CTCTTCTC  61   23.918    57.1107
ACAAAAAA  AAAAAACA  92   49.5364   56.9554
CTTTCTTC  CTCTTCTC  88   47.0073   55.179
AAGAAAAA  AGAAAAAA  141  95.4769   54.9724
CTCTCTTT  CTCTTCTC  109  67.1219   52.8472
GAAAGAGA  AGAGAAAG  57   22.6518   52.6003
TTTCCTCT  CTTTCTCT  79   40.6193   52.5511
TTTCCTCT  TTTCTCTC  91   52.3194   50.3678
TTTTCTTT  CTCTCTTT  127  85.6598   50.013
TTCTCTCC  CTCTTCTC  53   21.4631   47.9097

Overrepresented non-overlapping word-pairs detected in the 5'Untranslated Regions of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).

Table 15

Co-occurrence in Introns

Word1     Word2     S    ES        S*ln(S/ES)
TTTTATTT  ATTTTTTA  393  217.8144  231.9354
TTTTTATT  ATTTTTTA  334  186.0726  195.3914
TAAAAAAT  AATATATT  147  39.3119   193.8792
TTTTTAAT  TTTTTATT  460  306.2869  187.084
TAAAAAAT  TTTTATTT  273  140.3538  181.6284
TAATTTTT  ATTTTTTA  238  113.2939  176.6639
CTCTGTTT  CTGTTTTT  346  208.3136  175.5583
TTTTATTT  AATATATT  308  175.8151  172.6854
TTTTATTT  TTTTTAAT  505  358.7745  172.6415
TAAAAAAT  ATTTTTTA  149  48.6332   166.8264
TAAAAAAT  TTTTTAAT  189  79.759    163.0573
TAAAAAAT  TAATTTTT  179  73.1119   160.2756
TTTTATTT  TAATTTTT  461  328.5857  156.0948
TTTTTAAT  ATTTTTTA  238  123.6151  155.9133
TAAAAAAT  TTTTTCTT  305  185.7949  151.1788
TAAAAAAT  TTTTTATT  230  119.9486  149.7338
TTTTTATT  AATATATT  261  150.2261  144.1709
TAATTTTT  TTTTTAAT  300  186.1617  143.1501
TTTTTAAT  AATATATT  202  99.8493   142.3303
TTTTATTT  TTTTTATT  670  542.1648  141.8441
TAAAAAAT  TTTTTTGT  262  157.163   133.898
TAATTTTT  AATATATT  187  91.5206   133.6198
ATTTTTTA  TTTTTTGT  354  243.9756  131.769
TAAAAAAT  TTTTGTTT  357  246.9371  131.5909
TTTTTAAT  TTTTTGTT  638  519.9558  130.5312

Overrepresented non-overlapping word-pairs detected in the introns of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).

Table 16

Co-occurrence in Core Promoters

Word1     Word2     S     ES        S*ln(S/ES)
GCCCAATA  GCCCATTA  32    2.3492    83.5729
TTTTTTCT  TTTTTCTT  68    22.9531   73.8516
AATAAAAA  AAGAAAAA  84    41.5798   59.069
CTCTCTTT  CTTTCTCT  40    9.1626    58.95
AATAAAAA  ATTAAAAA  57    22.4453   53.1222
ACAAAAAA  AAGAAAAA  71    35.1265   49.9645
ACAAAAAA  AGAAAAAA  66    31.1075   49.6455
ATTTCTCA  TATAAATA  30    6.1031    47.772
AATAAAAA  TAAAAAAT  38    10.8748   47.5432
AAAAAACA  ACAAAAAA  56    24.4921   46.3121
AAAAATAT  AAAAAACA  44    15.5191   45.8533
AACAAAAA  AAGAAAAA  77    42.5433   45.6828
AACAAAAA  AGAAAAAA  69    37.6758   41.7512
TTTCTTTT  TTTTTTGT  40    14.2927   41.1653
AAAAAACA  ATATAAAG  30    7.659     40.9596
AAAAAACA  CTATATAA  36    11.9538   39.689
AAAAATAT  CTATATAA  30    8.0863    39.3309
TATATAAA  TAAAAAAT  36    12.3623   38.4793
AATAAAAA  TTAAAAAA  53    25.8324   38.0892
TTTTATTT  TTTTTTAA  38    14.0039   37.9336
TTTTATTT  TTTTTCTT  50    23.5743   37.5932
TTCTTTTT  TTTTTCTT  46    20.3942   37.416
AAATTAAA  ACAAAAAA  44    18.9721   37.0137
AATAAAAA  AGAAAAAA  65    36.8225   36.938
TTTCTTTT  TTTTTGTT  41    16.8429   36.4755

Overrepresented non-overlapping word-pairs detected in the core promoters of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).

Table 17

Co-occurrence in Proximal Promoters

Word1     Word2     S     ES        S*ln(S/ES)
AAATTTTA  TAAAAAAT  996   489.8445  706.8206
ATTTTTTA  TAAAAAAT  869   395.77    683.4771
TAAATTTT  TAAAAAAT  970   501.8706  639.1852
AAAAATTA  TAAAAAAT  1040  565.2386  634.1171
TAAAATTT  TAAAAAAT  963   498.7952  633.5171
TAAAATTT  ATTTTTTA  892   458.4645  593.7003
AAATTTTA  ATTTTTTA  868   450.2375  569.7695
AAAAATTA  ATTTTTTA  947   519.5356  568.5445
AAAATTTA  TAAAAAAT  919   496.1801  566.4231
TAATTTTT  TAAAAAAT  965   539.2575  561.5671
AAAATTTA  ATTTTTTA  865   456.0608  553.6894
TAATTTTT  ATTTTTTA  907   495.6552  548.0656
AATATATT  TAAAAAAT  776   391.8276  530.2646
AAAATTTA  AAATTTTA  973   564.4665  529.8015
AAATTTTA  TAAAATTT  976   567.4415  529.3092
AAAAATTA  TAATTTTT  1125  707.8947  521.1483
AATATATT  ATTTTTTA  730   360.1459  515.7708
TAAATTTT  ATTTTTTA  845   461.2912  511.4845
AAAAATTA  TAAAATTT  1052  654.7789  498.8066
AAAATTTA  AAAAATTA  1044  651.346   492.5318
AAAATTTA  TAAAATTT  958   574.7807  489.4031
AAATTTTA  TAATTTTT  993   613.4724  478.2242
TAATTTTT  TAAAATTT  995   624.6821  463.1724
AAAATTTA  TAATTTTT  990   621.407   461.0615
TTATATAA  TAAAAAAT  645   316.3233  459.5531

Overrepresented non-overlapping word-pairs detected in the proximal promoters of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).

Table 18

Co-occurrence in Distal Promoters

Word1     Word2     S     ES        S*ln(S/ES)
TAAAAAAT  ATTTTTTA  1855  898.8038  1344.087
AATATATT  TAAAAAAT  1759  902.7094  1173.429
AATATATT  ATTTTTTA  1692  882.8679  1100.631
TTATATAA  ATTTTTTA  1478  740.7429  1020.99
TTATATAA  TAAAAAAT  1464  757.3903  964.8477
AATATATT  TTATATAA  1447  743.9616  962.6287
AAAAATTG  TAAAAAAT  1301  747.7933  720.4442
CAATTTTT  TAAAAAAT  1279  745.3293  690.6698
AAAAATTG  ATTTTTTA  1237  731.3568  650.0966
ATTTTGTA  ATTTTTTA  1156  665.4975  638.3272
CAATTTTT  ATTTTTTA  1200  728.947   598.171
TAGAAAAT  TAAAAAAT  1024  586.114   571.3484
ATTTTGTA  TAAAAAAT  1108  680.4539  540.2074
CAATTTTT  AATATATT  1162  732.1145  536.7987
ATTTTTCA  ATTTTTTA  1078  666.4705  518.3745
AAAAATTG  AATATATT  1148  734.5348  512.627
CAATTTTT  TTATATAA  1003  614.2579  491.8069
TAGAAAAT  AATATATT  956   575.7221  484.8189
ATTTTCTA  ATTTTTTA  952   574.2477  481.2399
ATTTTCTA  TAAAAAAT  964   587.1534  477.9562
TAGAAAAT  ATTTTTTA  941   573.2313  466.4103
ATTTTTCA  TAAAAAAT  1058  681.4487  465.4297
TGAAAAAT  ATTTTTTA  1020  658.2655  446.7086
TGAAAAAT  TAAAAAAT  1033  673.0593  442.5259
AAAAATTG  TTATATAA  970   616.2886  439.9733

Overrepresented non-overlapping word-pairs detected in the distal promoters of Arabidopsis thaliana. A word-pair is characterized through the two nucleotide sequences associated with it (Word1 and Word2), the number of sequences the pair occurs in (S) as well as the expected number of sequences (ES) and a statistical score symbolizing the overrepresentation of the word-pair in the specific sequence set (S*ln(S/ES)).


AGRIS Lookup

The AGRIS database [25] contains a large collection of known regulatory elements for Arabidopsis thaliana. The words discovered in this study were compared to the regulatory elements of equal or lesser length in AGRIS. Table 19 provides the overview of the motifs and their locations within the results.
Table 19

AGRIS Lookup

3'UTRs  5'UTRs  Intron  Core Promoters  Proximal Promoters  Distal Promoters
Regulatory Element from AGRIS database [25]  Rank  Score  Rank  Score  Rank  Score  Rank  Score  Rank  Score  Rank  Score
Bellringer/replumless/pennywise BS3 IN AG----435030.0479784--646180.95590956341-103.557

CBF1 BS in cor15a----48346-1.48116--48521.349881162424.1708

Octamer promoter motif----414350.673899--119351.28979238584.69741

Bellringer/replumless/pennywise BS1 IN AG7267.631135235.2087574127.46819117.1448751.075958337-186.12

ABRE-like binding site motif544511.7462113821.75561524216.048830441.9698531.45099109255.929

G-box promoter motif185221.1577113821.75561202320.828230441.9698531.45099102260.604

DPBF1&2 binding site motif372014.7278296313.7441346054.809435539.88271371.36496102260.604

MYB1 binding site motif430613.622344632.0594140786.763840038.364717851.11027255776.5745

RAV1-A binding site motif56834.060314849.0095200073.672645136.31111351.20169289186.355

W-box promoter motif75130.776967527.0198458139.17253334.1751761.19182756131.24

CBF2 binding site motif and GBF1/2/3 BS in ADH1----349492.8718754034.05627291.293998117.554

ARF and ARF1 binding site motif97627.580921642.5544741116.21456833.561928521.07934230680.856

L1-box promoter motif269717.6326--582438.291258533.08328891.05367223581.9035

GATA promoter motif118625.635374126.1103124791.671580229.3553551.081611033115.612

ATB2/AtbZIP53/AtbZIP44/GBF5 BS in ProDH175721.6648122520.9254289060.580690827.913913131.12688320467.6808

SORLIP2365814.866390246.911971636114.6754100626.53835501.34186780129.375

MYB binding site promoter476212.8183246215.1743189775.734103226.169249311.06605201086.739

CCA1 binding site motif123025.132537134.5029520241.532122524.4536619900.9976558013-161.161

TGA1 binding site motif--132904.966621032624.0526123324.391916601.21323187989.7072

SORLIP1529711.962561729.00641107622.5348128623.889949651.15533409758.1886

T-box promoter motif63932.6567153219.0267774114.265132523.56091931.27522205212.153

Ibox promoter motif215619.64935835.0463322357.1901179720.450710811.14622628140.679

Box II promoter motif140323.9863499310.3195143785.6577180420.425419861.30314669136.891

Hexamer promoter motif75909.4166161618.59911034724.0156222518.656834771.244191252107.567

AtMYC2 BS in RD22119325.5614402611.6309346054.8094282316.61936461.21499207385.133

RAV1-B binding site motif70549.9457182507.40511158921.6087299616.097560841.12709201786.5658

RY-repeat promoter motif18249.4382--530132.253309715.8378721.2930561302.629

MYB3 binding site motif512812.2348105756.06616140786.7638329215.395332881.083241154624.3649

Bellringer/replumless/pennywise BS2 IN AG312616.2923--64424-30.4349369414.5011627770.9797658184-172.62

AtMYB2 BS in RD22679710.194996306.55608496142.997448013.038335701.07359321867.5209

E2F binding site motif and E2F/DP BS in AtCDC6--407811.544346644-0.929602495312.223609661.2070355143-85.466

ERF1 BS in AtCHI-B and GCC-box promoter motif--68126.94462082210.4265635910.501643401.35349173593.0802

Z-box promoter motif----360292.48082101447.62515391991.00107267841.42726

LTRE promoter motif--62308.955121603615.0374112487.01938112961.13624715538.6247

SORLIP5517012.1706317513.31371401717.6817116146.82909149841.04471222676.5221

ABFs and ABRE binding site motif85408.603562668.92287291095.33319122506.521587251.255981490100.349

PI promoter motif94367.96403--60410-9.96838145965.56209245401.01231790235.621

Observations about the regulatory elements (length = 8) contained in the AGRIS database [25].


Functional Categorizations of Words

In order to reveal the biological meaning of overrepresented words, we established associations between the overrepresented words and the biological functions of the genes that harbour a particular word in their corresponding segment (Table 1). For a single word, all the genes that contain that word in their selected segment were found and the corresponding overrepresented Gene Ontology (GO) terms were identified. Overrepresentation of a GO term is determined by using the Arabidopsis gene GO term distributions as a background model. The developmental and experimental parameters that perturb the expression of genes harbouring a particular word were determined by comparing the number of induced, suppressed or neutral genes to that expected by chance in a collection of 1305 tissue and stress microarrays from the public domain. Significant enrichment or depletion of induced or suppressed genes has been shown to correlate strongly with factors affecting regulation of a cis-regulatory element [39].

As shown in Figures 5, 6, 7, 8, 9 and 10, we identified overrepresented functional categories (y-axis) of genes that carry a particular word (x-axis, top panel) in either their 3'UTR (Figure 5), 5'UTR (Figure 6), intron (Figure 7), or promoter regions (Core, Proximal and Distal Promoters, Figures 8, 9 and 10, respectively). The red squares depict overrepresented categories with lowest p-value, calculated for each segment separately, smaller than 10E-16. For example, the word GTTTTTGA was significantly enriched in the 3'UTRs of genes that participate in the GO category "Protein Synthesis" (including the sub-categories ribosome biogenesis, ribosomal proteins, translation), and is correlated with genes suppressed in flowers and early stage siliques (p-value 4E-14). Based on microarray expression of micro-dissected tissues (see Methods), the word TGTTTTTT is present in the 3'UTR of genes induced in roots (p-value 1E-8), in the atrichoblast (hairless) cell files of the root (p-value 7E-25), the root cortex (p-value 2E-23), endodermis (p-value 2E-51), and lateral root cap (p-value 4E-20). The word CTCTCTTT, enriched in introns, was correlated with differential induction in cotyledons (p-value 8E-20) and with suppression in young flowers, especially carpels (p-value 1E-14), and heart stage embryos (p-value 3E-20).

Surprisingly, the presence of these words in the UTRs and introns was strongly correlated with tissue specific profiles, but was only weakly enriched or strongly depleted for responses to hormones, biotic and abiotic stresses. There was no significant correlation to any of the 1305 surveyed conditions if the words were located in the 1000 bp upstream or downstream regions. This is strikingly different to the well characterized abscisic acid responsive element (ABRE) (CACGTGTC) [22], which, when found in the 1000 bp 5' upstream region, was strongly correlated to induction by 10 μM abscisic acid (ABA) (p-value 4E-49), by cold, salt and drought stresses (p-values < 1E-40), and in flowers (p-value 1E-31), and to suppression in roots (p-value 4E-7); no significant correlations were observed when ABRE was present in the 3'UTRs, 5'UTRs or introns.

We also analyzed primary promoter regions where most of the basal promoter elements are expected to be located. The frequency of words is calculated as described above, and genes that contain the high scoring word in their primary promoter region were queried for enriched biological function. For example, GCCCATTA is found in core promoter regions of genes preferably involved in ribosome biogenesis and translation. Genes with this word in the upstream promoter are significantly depleted for response to all hormones, biotic and abiotic stresses (typically p-value 1E-8 or better). In other words, genes harbouring this word in their upstream promoter region tend to be less responsive to stresses than randomly chosen genes. In contrast, the word CTATAAAT was found in core promoter regions of genes preferably functioning as storage facilitating proteins (Figure 8). Genes with this word in the upstream promoter are rapidly induced by 10 nM brassinolide (p-value 1E-9) and by salt stress in roots (p-value 4E-9). These genes are also induced in roots, flowers, pollen, and during seed development, and strongly suppressed in young leaves and cotyledons.
Figure 5

Cellular functions in 3'UTRs. Enriched functional categories within the set of genes associated with each word in the top 25 words of the 3'UTRs. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 6

Cellular functions in 5'UTRs. Enriched functional categories within the set of genes associated with each word in the top 25 words of the 5'UTRs. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 7

Cellular functions in introns. Enriched functional categories within the set of genes associated with each word in the top 25 words of the introns. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 8

Cellular functions in core promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the core promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 9

Cellular functions in proximal promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the proximal promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 10

Cellular functions in distal promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the distal promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

A set of 10 frequently enriched cis-elements was recently provided for the ATH95 gene coexpression neighbourhood (AAACCCTA, CTTATCCN, GGCCCANN, GCCACGTN, GCGGGAAN, GACCGTTN, AANGTCAA, CNGATCNA, NCGTGTCN, CATGCANN) [47]. Our results show a direct overlap with two of those words (AAACCCTA, NCGTGTCN), which are detected and marked as 'interesting' in the 5'UTRs and the proximal promoters, respectively. Several words were hit partially as members of the 'interesting' word clusters (CTTATCCN, GCCACGTN, AANGTCAA, CNGATCNA), while others were not represented in the selected word clusters and the top 25 words. While no overlap for GACCGTTN could be found, it is possible to validate the significance of GGCCCANN and GCGGGAAN through the detection of these two words as unwords in the introns, marking them as interesting regulatory elements associated with expression, but not necessarily with the regulation of the associated alternative splicing process.

Conclusion

The analyses described here provide a first view of the word landscape within the non-coding regions of the Arabidopsis thaliana genome. An analysis centred on the statistically interesting words furnishes important insights into the unique elements of each segment. The correlations of particular words with cellular functions or expression patterns provide valuable hypotheses for further experimentation. Correlation between word position and expression also seems strong, with one class of words only present in the 5'/3'UTRs and introns, and another class of words putatively functioning only in the region upstream of the TSS. Words in the first class seem more directed at regulation of tissue and cellular identity, while words which function upstream appear more likely to be involved in environmental responses.

Methods

Word-based genomic signatures are the union of results generated by applying the software pipeline shown in Figure 11. Statistically relevant words are extracted from a set of genomic sequences, and are analyzed to determine similarity, location distribution, groupings, and predicted cellular function.
Figure 11

Process Flowchart. Methodology flow applied for the discovery of word-based genomic signatures in non-coding Arabidopsis thaliana.


Sequence Data

This manuscript reports the results of analyzing DNA sequences of Arabidopsis thaliana. The non-coding genomic segments (specifically, the 3'UTRs, 5'UTRs, promoters and introns) and the entire genomic sequence (as complete chromosomes) were obtained from TAIR (release 8) [19]. Both masked and unmasked versions of the genome were analyzed. Ambiguous nucleotides, depicted in the sequences by the letters [R, Y, W, S, K, M], were removed because they represent sequencing anomalies; this resulted in the removal of 0.15% (or 188,820) of the nucleotides. In this study, only protein-coding genes were considered as genes; transposable element-like genes and pseudogenes were omitted. Thus, the total number of genes in this study is ~27,000. Because promoter elements differ in length and location, a gene can have a core promoter but no distal promoter, since the distal region would fall into another gene or even outside of the chromosome. The number of genes in the 3'UTR and 5'UTR sets differs from that in the other sets because some genes lack an annotated UTR (it has yet to be discovered). Whenever multiple spliced transcripts were available for a gene, a major transcript was chosen (Atngnnnnn.1) to prevent bias towards genes that contain multiple transcripts. Likewise, only introns of major transcripts were selected.
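The removal of ambiguous nucleotides can be illustrated with a minimal sketch. It is not the pipeline's code: the function name `strip_ambiguous` is hypothetical, and the sketch assumes that the ambiguity codes are simply deleted from each sequence, which the text does not spell out.

```python
from typing import Tuple

# Ambiguity codes listed in the text.
AMBIGUOUS = set("RYWSKM")

def strip_ambiguous(seq: str) -> Tuple[str, int]:
    """Return the sequence without ambiguity codes and the number removed."""
    seq = seq.upper()
    kept = [base for base in seq if base not in AMBIGUOUS]
    return "".join(kept), len(seq) - len(kept)

# Toy sequence, not taken from the TAIR data.
cleaned, removed = strip_ambiguous("ACGTRYACGTWSKM")
```

Summing the second return value over all input sequences would give the total number of removed nucleotides reported above.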

Word Enumeration and Scoring

The first pipeline stage employs a radix trie data structure [48] to enumerate all subsequences (words) of a specified length in the given DNA input sequences. For each word w, with o total occurrences in s sequences, a word score is computed as s*ln(s/Es(w)). The expected number of sequences containing word w, Es(w), is computed as the product of (1) the probability for each observed word to occur anywhere in the input sequences and (2) the total length of the sequences. This model implicitly assumes a binomial model for the word distribution, i.e., that the word probabilities are independent of the positions of the words within the sequences [49,50]. The probability is computed by using a maximum-order homogeneous Markov chain model [49] where the transition probabilities are determined using the Maximum Likelihood Method [50]. (Note that under this model, the (G+C)% biasing is achieved for any order of Markov model greater than or equal to zero, since the frequencies of individual nucleotides are taken into consideration for all orders.) The order of the Markov model was chosen by using a standard chi-square test to assess the appropriateness of Markov chains of orders 0 to 6. To provide the highest precision for computation of expected values, the highest order model that passed the chi-square test was selected. Thus, an order 6 model was selected. A p-value for each word (representing the probability of obtaining a score at least as high as the one observed [51]) is calculated by using a binomial word distribution to determine the probability of obtaining at least o repeats in the s input sequences that contain w.
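The scoring step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a plain `Counter` is swapped in for the radix trie, the function names are hypothetical, the Markov order is left as a parameter (the study selected order 6), and the p-value computation is omitted.

```python
import math
from collections import Counter

def count_words(seqs, k):
    """Return (occurrences, sequence counts) for every k-letter word."""
    occurrences, seq_counts = Counter(), Counter()
    for seq in seqs:
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        occurrences.update(words)       # o: total occurrences of each word
        seq_counts.update(set(words))   # s: sequences containing each word
    return occurrences, seq_counts

def word_probability(word, seqs, order):
    """Maximum-likelihood probability of `word` under a homogeneous
    Markov chain of the given order, estimated from `seqs`."""
    starts = Counter()   # counts of order-mers (initial state)
    trans = Counter()    # counts of (order+1)-mers (transitions)
    for seq in seqs:
        for i in range(len(seq) - order + 1):
            starts[seq[i:i + order]] += 1
        for i in range(len(seq) - order):
            trans[seq[i:i + order + 1]] += 1
    context = Counter()  # how often each context is followed by a letter
    for kmer, n in trans.items():
        context[kmer[:-1]] += n
    p = starts[word[:order]] / sum(starts.values()) if order > 0 else 1.0
    for i in range(order, len(word)):
        ctx = word[i - order:i]
        if context[ctx] == 0:
            return 0.0
        p *= trans[ctx + word[i]] / context[ctx]
    return p

def word_score(s, expected):
    """Score s*ln(s/Es(w)) as defined in the text."""
    return s * math.log(s / expected)

# Toy input, not taken from the Arabidopsis data.
seqs = ["ACGTACGT", "ACGTTTTT"]
occurrences, seq_counts = count_words(seqs, 4)
p = word_probability("ACGT", seqs, order=1)
expected = p * sum(len(seq) for seq in seqs)   # Es(w) = p(w) * total length
score = word_score(seq_counts["ACGT"], expected)
```

With order 0 the chain reduces to independent nucleotide frequencies, which is why the (G+C)% bias is captured at every order.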

Word Clustering

The Word Clustering stage computes a cluster for each of the top scoring words (seed words) identified in the Word Scoring phase. A cluster is computed from a seed word by determining the set of words whose Hamming distance is within a user-specified threshold. A Position/Weight Matrix (PWM) is constructed for each cluster [52], and a sequence logo is created from each PWM using the TFBS module by Lenhart and Wasserman [53]. For example, the PWM for the seed word ATTTTGTA in the 3'UTRs is as follows: The columns of the PWM correspond to nucleotide positions and the rows correspond to the nucleotides A, C, G, and T, respectively. For selected words from the different segments it was determined if they were clustered at specific locations along the corresponding sequences in which they occur. In order to detect a location bias, representative of such clusters, histograms were created to show the numbers of occurrences of a specific word at each point corresponding to a positional offset from the transcription start site (TSS). For uniformity, sequence lengths were normalized to the range [1;100], to represent the number of nucleotides between the position and the TSS.
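The clustering step can be sketched as follows. The sketch assumes an unweighted count matrix, in which every cluster member contributes one count per position; the text does not state how the PWM entries are weighted, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def build_cluster(seed, words, max_distance):
    """Words of the seed's length within `max_distance` of the seed."""
    return [w for w in words
            if len(w) == len(seed) and hamming(seed, w) <= max_distance]

def count_matrix(cluster):
    """Rows A, C, G, T; columns are nucleotide positions.
    ASSUMPTION: unweighted counts, one per cluster member."""
    length = len(cluster[0])
    matrix = {base: [0] * length for base in "ACGT"}
    for word in cluster:
        for pos, base in enumerate(word):
            matrix[base][pos] += 1
    return matrix

# Toy word list around the seed word mentioned in the text.
words = ["ATTTTGTA", "ATTTTGTT", "CTTTTGTA", "GGGGGGGG"]
cluster = build_cluster("ATTTTGTA", words, max_distance=1)
pwm = count_matrix(cluster)
```

Dividing each column by the cluster size would turn the counts into the per-position frequencies from which a sequence logo is drawn.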

Co-Occurrence Analysis

The Co-Occurrence Analysis considers all non-overlapping pairs of the top ranked words and computes the expected number of sequences that contain both words. Subsequently, the observed number of sequences that contain both words is determined, and an observed-to-expected ratio is computed (using a binomial word distribution) for each word pair. Previously published and curated binding site motifs which are equal to or shorter than eight base pairs were extracted from the AGRIS AtcisDB database [25], and were compared with the word lists generated for the different segments. For each motif the corresponding entries in word list were determined and the highest scoring word was identified.
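The pair scoring can be sketched as follows. The expected count used here is an assumption: the two words are treated as occurring independently across sequences, whereas the text only states that a binomial word distribution is used. All function names are hypothetical.

```python
import math

def positions(seq, word):
    """Start positions of every occurrence of `word` in `seq`."""
    return [i for i in range(len(seq) - len(word) + 1)
            if seq.startswith(word, i)]

def cooccur_nonoverlapping(seq, w1, w2):
    """True if `seq` holds one occurrence of each word that do not overlap."""
    return any(i + len(w1) <= j or j + len(w2) <= i
               for i in positions(seq, w1)
               for j in positions(seq, w2))

def pair_statistics(seqs, w1, w2):
    """Observed (S) and expected (ES) number of sequences with both words."""
    n = len(seqs)
    s1 = sum(1 for seq in seqs if w1 in seq)
    s2 = sum(1 for seq in seqs if w2 in seq)
    observed = sum(1 for seq in seqs if cooccur_nonoverlapping(seq, w1, w2))
    expected = s1 * s2 / n   # ASSUMPTION: independence of the two words
    return observed, expected

def pair_score(s, es):
    """Score S*ln(S/ES) as used in the co-occurrence tables."""
    return s * math.log(s / es)

# Toy input, not taken from the Arabidopsis data.
seqs = ["AAAACCCC", "AAAAGGGG", "CCCCAAAA", "GGGGGGGG"]
observed, expected = pair_statistics(seqs, "AAAA", "CCCC")
```

As a check on the formula, `pair_score(209, 108.1185)` evaluates to approximately 137.75, in agreement with the first row of Table 14.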

Determine Cellular Function

The MIPS Functional Catalogue Database (FunCatDB) [54] was used for determining over-represented cellular functions in each gene list containing a particular word. The workflow of the cellular function analysis, labelled as "Cellular Function" in the larger process flow (Figure 11), is as follows. For each word in the 'top 25' lists (Tables 2, 3, 4, 5, 6, 7 and 8) we determined the list of genes that contained the word being analyzed in the corresponding region. Then we determined the functional category of each gene by using the functional category scheme (version 2.1) retrieved from FunCatDB. The p-values for enrichment of categories were calculated by statistical tests with the hypergeometric distribution. After filtering out p-values greater than 1E-5, results were visualized by the matrix2png software package [55]. Analysis of the correlation between word location and gene expression was done as described in [39] with the following exceptions. A larger database was constructed from 1305 available raw microarray datasets (Additional file 19) present in NASC affyarrays and the Gene Expression Omnibus. The p-value was calculated using a chi-squared test comparing genes 2-fold induced, 2-fold suppressed, or neutral between observed (all genes harbouring the word) and expected values (based on genomic average). The Bonferroni correction was used to adjust for multiple hypothesis testing. Microarray sources included a large tissue macro-dissection [56], and the follow-up studies on stress, hormones, and pathogens [57]. We included the laser capture microdissected tissue microarray datasets [58], the gene expression profile of the Arabidopsis root [59], analysis of brassinosteroids [60], and the numerous other experiments found in the collected dataset in the above mentioned repositories. Data were normalized using global scaling of the middle 96% data points, and then noise filtered using a t-test of signal vs. background, and a t-test of signal vs. control.
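The hypergeometric enrichment test can be sketched as follows. The function name and the toy numbers are hypothetical, and the sketch assumes a one-sided (upper tail) test, which is the usual choice for enrichment but is not stated explicitly in the text.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn from `population` genes,
    of which `annotated` belong to the functional category."""
    upper = min(annotated, sample)
    tail = sum(comb(annotated, x) * comb(population - annotated, sample - x)
               for x in range(hits, upper + 1))
    return tail / comb(population, sample)

# Toy numbers: 10 genes in total, 4 in the category,
# 3 genes carry the word, 2 of them fall in the category.
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)

# Categories are kept only if they pass the 1E-5 cutoff named in the text.
is_reported = p_value <= 1e-5
```

In the analysis described above, `population` would correspond to the genes with a functional category assignment and `sample` to the genes containing the word in the segment under study.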

Authors' contributions

JL contributed to the development of algorithms and models, the implementation of algorithms, generation of the results and drafting of the document. JDW contributed to the development of the models and algorithms and the implementation of the approaches. KK contributed to the development, implementation and testing of models and algorithms. XL contributed to the development of the models and algorithms for co-occurrence analysis and generated the respective data. FD contributed to the development of models and algorithms, and to the implementation of the methods. MG conducted correlation analysis between word presence/location and gene expression pattern. KE contributed the idea of Hamming-distance-based clustering. SSL contributed to the statistical foundations of the scoring model. AY's contributions include extraction of data sets, functional analysis of words, and writing the manuscript. EG contributed to writing the manuscript and integrating the identified words with existing knowledge on control of gene expression. In addition to architecting the software pipeline employed in this research, LRW contributed to the design, implementation and validation of models and algorithms (especially in the areas of word searching and word scoring) and to the writing of this manuscript. All authors read and approved the final manuscript.

Additional file 1

Words discovered in 3'UTRs. Entire set of words discovered in the 3'UTRs with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 2

Words discovered in 5'UTRs. Entire set of words discovered in the 5'UTRs with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 3

Words discovered in introns. Entire set of words discovered in the introns with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 4

Words discovered in core promoters. Entire set of words discovered in the core promoters [-100;+1] with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 5

Words discovered in proximal promoters. Entire set of words discovered in the proximal promoters [-1,000;-101] with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 6

Words discovered in distal promoters. Entire set of words discovered in the distal promoters [-3,000;-1,001] with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 7

Words discovered in entire genome. Entire set of words discovered in the complete genome with occurrences, expected occurrences, scores, reverse complement information and p-value.

Additional file 8

Words missed in 3'UTRs. Entire set of words expected to occur but not discovered in the 3'UTRs with expected occurrences.

Additional file 9

Words missed in 5'UTRs. Entire set of words expected to occur but not discovered in the 5'UTRs with expected occurrences.

Additional file 10

Words missed in introns. Entire set of words expected to occur but not discovered in the introns with expected occurrences.

Additional file 11

Words missed in core promoters. Entire set of words expected to occur but not discovered in the core promoters with expected occurrences.

Additional file 12

Word-based clusters. Word-based clusters built around two overrepresented words from each non-coding segment of Arabidopsis thaliana. Each cluster is shown with its member words and the associated sequence logo. Each word in a cluster is listed with its nucleotide sequence, sequence count, overall count and S*ln(S/E_S) score.

Additional file 13

Word co-occurrences in 3'UTRs. Entire set of co-occurring words (taken from the top 25 words) discovered in the 3'UTRs with occurrences, expected occurrences and scores.

Additional file 14

Word co-occurrences in 5'UTRs. Entire set of co-occurring words (taken from the top 25 words) discovered in the 5'UTRs with occurrences, expected occurrences and scores.

Additional file 15

Word co-occurrences in introns. Entire set of co-occurring words (taken from the top 25 words) discovered in the introns with occurrences, expected occurrences and scores.

Additional file 16

Word co-occurrences in core promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the core promoters with occurrences, expected occurrences and scores.

Additional file 17

Word co-occurrences in proximal promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the proximal promoters with occurrences, expected occurrences and scores.

Additional file 18

Word co-occurrences in distal promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the distal promoters with occurrences, expected occurrences and scores.

Additional file 19

NASC Microarrays. Entire set of microarray experiments available in NASC that were used for the cellular functional analysis.
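
To make the columns of the word tables above more concrete, the following is a minimal Python sketch of enumerative word counting. It is not the authors' pipeline: the function names (reverse_complement, count_words, absent_words) are hypothetical, and the model behind the "expected occurrences", score and p-value columns is not described in these captions, so it is deliberately left out. Note also that absent_words only lists words with zero occurrences; the "missed" words in Additional files 8 to 11 are additionally required to have a non-negligible expected count.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word (A<->T, C<->G, reversed)."""
    return word.translate(COMPLEMENT)[::-1]


def count_words(sequences, k=8):
    """Count every overlapping k-letter word in a collection of sequences.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. This skipping rule is an assumption of the sketch.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def absent_words(counts, k=8):
    """List all of the 4**k possible words that never occur in the counts."""
    all_words = ("".join(letters) for letters in product(ALPHABET, repeat=k))
    return [word for word in all_words if word not in counts]


if __name__ == "__main__":
    # Toy input; a real analysis would read 3'UTR, intron or promoter sequences.
    toy_segment = ["ACGTACGTAC", "TTTTACGTNN"]
    counts = count_words(toy_segment, k=4)
    for word, n in counts.most_common(3):
        print(word, n, reverse_complement(word), counts[reverse_complement(word)])
    print(len(absent_words(counts, k=4)), "of", 4 ** 4, "words absent")
```

With k=8 the enumeration covers the 65,536 possible words mentioned in the abstract, so a plain dictionary of counts is sufficient for a single segment type.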
