
The word landscape of the non-coding segments of the Arabidopsis thaliana genome.

Jens Lichtenberg, Alper Yilmaz, Joshua D Welch, Kyle Kurz, Xiaoyu Liang, Frank Drews, Klaus Ecker, Stephen S Lee, Matt Geisler, Erich Grotewold, Lonnie R Welch.

Abstract

BACKGROUND: Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.
RESULTS: Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.
CONCLUSION: Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.

Year: 2009 PMID: 19814816 PMCID: PMC2770528 DOI: 10.1186/1471-2164-10-463

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969

Background

All genomes are composed of nucleotides, which are represented abstractly as letters (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T)). Strings of such letters can be conceptualized as words, which provide the blueprints for organisms. Each word is found a specific number of times in a particular genome. Note that the expected frequency of a word is inversely related to the word's length. Some nucleotides appear more frequently than others (e.g. A/T in Arabidopsis), giving each genome a distinct (G+C)% content and biasing expected word frequencies. Higher order frequencies (dinucleotide and trinucleotide) also show distinct biases beyond those expected for single nucleotide frequencies [1]. Distinct selective pressures shape words positioned in different genomic regions. For example, a word in an open reading frame (ORF) has a direct influence on the primary amino acid sequence of a protein and hence is under strong selective pressure. In contrast, words in introns are likely to be under more relaxed selective constraints, unless they are important for gene functions, for example by providing docking sites for splicing factors [2] or for enzymes involved in the post-transcriptional processing of a transcript [3,4]. The gene sections corresponding to the 5' and 3' untranslated regions (5'UTRs and 3'UTRs, respectively) are also likely to be under less selective constraints than the ORFs, yet signatures of strong selection in UTRs have been described (reviewed in [5]). The constant formation of DNA microsatellites through slippage by the replication machinery, and the action of viruses and transposons, also complicate the word landscape, especially in regions with lower selective constraints (such as introns, UTRs and intergenic regions) [6,7]. This manuscript describes the results of a genome-wide analysis to discover putative regulatory words. 
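The two effects described above (longer words are expected less often, and nucleotide composition biases expected frequencies) can be illustrated with a minimal sketch. It assumes the simplest possible background, in which every base is drawn independently; this is not the Markov model used later in the paper. The function name `expected_count` and the AT-rich base frequencies are hypothetical, chosen only for illustration.

```python
from math import prod

def expected_count(word, base_freq, seq_length):
    """Expected number of occurrences of `word` in a random sequence of
    `seq_length` bases, assuming each base is drawn independently with the
    probabilities in `base_freq` (a simplifying assumption)."""
    positions = max(seq_length - len(word) + 1, 0)  # possible start sites
    return positions * prod(base_freq[b] for b in word)

# Illustrative compositions only; these values are not taken from the paper.
uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

for word in ("TTTT", "TTTTTTTT", "GCGCGCGC"):
    print(word, expected_count(word, uniform, 10_000),
          expected_count(word, at_rich, 10_000))
```

Under the AT-rich composition the expected count of an A/T-only word rises and that of a G/C-only word falls, which is why expected frequencies have to be computed against a background model rather than assumed uniform.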
Within this context, we define the cis-regulatory apparatus as all the DNA segments that are located proximal to a gene, and that also contribute to the gene's expression. It is the function of transcription factors, miRNAs, or other molecules that interact with DNA, to interpret the words (sequence code) hardwired in the cis-regulatory apparatus and to 'execute' them, thereby generating signals to the basal transcription machinery that result in changes to the rate of RNA production by the corresponding DNA-dependent RNA polymerases. When located upstream of the transcription start site (TSS), the cis-regulatory apparatus is often referred to as the promoter of a gene. Promoters are typically divided into three regions: core, proximal and distal. The core promoter, a region at location [+1;-100] relative to the TSS, performs a central role in the formation of pre-initiation transcriptional complexes. Immediately upstream of the core promoter is the proximal promoter, which is located at position [-101;-1000] relative to the TSS and serves as a docking site for transcription factors. The distal promoter is located at [-1001;-3000] relative to the TSS and contains the regulatory elements that are commonly known as enhancers and silencers. The participation of a particular DNA segment in the regulation of gene expression can only be demonstrated experimentally. Thus, understanding the rules at play in deciphering the transcriptional regulatory code remains one of the most significant challenges in biology today. Although most regulatory elements are present in the UTRs and upstream regions, due to their proximity to the TSS, studies have shown the presence of regulatory elements in introns, and, to a much lesser extent, in coding regions [2,8-16]. Building on this knowledge, a segment-based analysis was performed that is focused on non-coding regions within the open reading frames (i.e., introns) and flanking non-coding regions (i.e., UTRs and upstream regions). 
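The promoter subdivision described above can be sketched as simple string slicing. The sketch assumes the input is the upstream sequence written 5' to 3', with its last base immediately upstream of the TSS; the region lengths (100, 900 and up to 2,000 nt) follow the coordinates given in the text. The function name `split_promoter` is hypothetical.

```python
def split_promoter(upstream):
    """Split an upstream sequence (5'->3', last base adjacent to the TSS)
    into distal, proximal and core promoter regions.

    Assumed layout, following the coordinates in the text:
      core     = last 100 nt
      proximal = the 900 nt upstream of the core
      distal   = up to 2,000 nt upstream of the proximal region
    """
    return {
        "distal": upstream[-3000:-1000],
        "proximal": upstream[-1000:-100],
        "core": upstream[-100:],
    }

# Usage with a synthetic 3,000-nt upstream sequence.
regions = split_promoter("A" * 2000 + "C" * 900 + "G" * 100)
print({name: len(seq) for name, seq in regions.items()})
```

Because Python slices clamp at the start of the string, an upstream sequence shorter than 3,000 nt simply yields a shorter distal region.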
The coding regions were omitted from this analysis because they are under other selection pressures corresponding to the amino acid sequences of the proteins they produce, and thus they are subjected to biases other than regulation. Arabidopsis thaliana provides an ideal reference organism to investigate the word landscape of a plant genome, and to relate said landscape to important biological features. The Arabidopsis genome consists of 125 Mbp arranged into five chromosomes [17,18]. The genome is well annotated and regions corresponding to introns, 3'UTRs, 5'UTRs, and intergenic genomic spaces are all available from The Arabidopsis Information Resource (TAIR) [19]. Many studies have characterized Arabidopsis DNA sequence motifs that participate in the regulation of particular genes (e.g., [20-23]), and public databases such as AthaMap [24] and AGRIS [25] provide comprehensive collections of cis-regulatory elements likely to participate in the regulation of gene expression. However, a systematic analysis of all the words present in the Arabidopsis genome is still lacking. To analyze the different segments of the Arabidopsis genome, an enumerative word discovery approach was used to detect statistically overrepresented words. Similar approaches have been successfully applied over the last decade in the area of motif discovery [26-37]. In a 2005 study, Tompa et al. [38] showed that enumerative methods outperformed heuristic methods in many cases. Enumerative methods are particularly applicable in this research, because they allow the study of the entire 'word landscape' of a genomic data set. Our approach scans the sequences and produces a set of words and word frequencies. This information is employed by a Markov model to compute expected word frequencies.
Words with unexpectedly high frequencies are putative functional elements, and thus they are further characterized by comparing word frequencies and positions to gene induction or suppression using the method of Geisler et al. [39]. Additionally, clusters of similar words are formed and used to create motifs for putative transcription factor binding sites. Sequences that contain the same functional elements are grouped together into putative 'nodes' of regulatory networks. Words that co-occur often are identified as putative transcription factor binding modules.
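The scanning step and the Markov-based expectation described above can be sketched as follows. This is a minimal illustration, not the authors' software: the text does not state the order of the Markov model, so the sketch assumes the common maximal-order estimate E(w) = N(w[:-1]) * N(w[1:]) / N(w[1:-1]), which needs words of at least three letters. The names `count_words` and `markov_expected` are hypothetical.

```python
from collections import Counter

def count_words(sequences, k=8):
    """Return (O, S): total occurrences of each k-letter word (overlaps
    included) and the number of sequences that contain it."""
    occurrences, seq_hits = Counter(), Counter()
    for seq in sequences:
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        occurrences.update(words)      # every occurrence counts toward O
        seq_hits.update(set(words))    # each sequence counts once toward S
    return occurrences, seq_hits

def markov_expected(word, sequences):
    """Expected occurrences of `word` under an assumed maximal-order
    Markov model, estimated from the (k-1)- and (k-2)-letter word counts."""
    k = len(word)
    longer, _ = count_words(sequences, k - 1)
    shorter, _ = count_words(sequences, k - 2)
    denom = shorter[word[1:-1]]
    return longer[word[:-1]] * longer[word[1:]] / denom if denom else 0.0

# Usage on toy sequences.
O, S = count_words(["AAAAAAAAA", "AAAAAAAACC", "ACG"], k=8)
print(O["AAAAAAAA"], S["AAAAAAAA"])
print(markov_expected("ACG", ["ACGTACGT"]))
```

Sequences shorter than the word length contribute no words, which mirrors the treatment of sub-8-nt sequences noted under Table 1.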

Results and Discussion

Distribution of 8-letter words in the Arabidopsis genome

To determine the word distributions in the segments of the Arabidopsis thaliana genome that contribute to the cis-regulatory apparatus, a comprehensive analysis of 8-letter words in the entire genome was conducted and compared with segments corresponding to non-coding regions. Words of length 6-16 were examined and the complete results have been made available via AGRIS [25,40]. This article reports findings for words of length eight because they correspond to the typical DNA sequence length recognized by transcription factors (usually 6-8 bp [38,41]). Furthermore, 8-mers are long enough to offer sufficient diversity of word choices (4^8 = 65,536 possible words) to reduce false positive results, while retaining sufficient word counts to be statistically informative. The genome was sub-divided into segments comprising the 3'UTRs, 5'UTRs, promoters and introns (Table 1). The promoter segment was further dissected into the core promoter, corresponding to [-100; +1]; proximal promoter [-1000; -101]; and distal promoter [-3000; -1001]. The general properties of the six genome segments are shown in Table 1. As in a similar study, which was aimed at discovering regulatory elements involved in human DNA-repair pathways [26], word-based genomic signatures were created for each segment. Specifically, the following were identified for each of the genome segments: (1) the set of overrepresented words (signature words), (2) words missing from the sequences (unwords), (3) word-based clusters, (4) word co-occurrences and (5) functional categorizations of the signature words. The results are detailed in the remainder of this section.
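The size of the 8-letter word space, and the notion of 'unwords' (words missing from a set of sequences), can be made concrete with a short sketch. It enumerates the full space and subtracts the words actually observed; the function names `all_words` and `unwords` are hypothetical and the code is only an illustration of the idea, not the authors' implementation.

```python
from itertools import product

def all_words(k=8):
    """Generate every possible k-letter word over the alphabet A, C, G, T."""
    return ("".join(letters) for letters in product("ACGT", repeat=k))

def unwords(sequences, k=8):
    """Return the k-letter words that never occur in any of the sequences."""
    seen = {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}
    return [word for word in all_words(k) if word not in seen]

# Usage: the full 8-letter space, and a toy example with 2-letter words.
print(sum(1 for _ in all_words(8)))
print(len(unwords(["ACGT"], k=2)))
```

Enumerating the space explicitly is feasible for k = 8; for the longer words mentioned in the text (up to 16 letters) the space grows as 4^k, so a practical implementation would more likely work from the observed words only.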

Table 1

Segment characteristics for Arabidopsis thaliana

Data Set	# Sequences/Chromosomes	Min. Seq. Length	Max. Seq. Length	Mean Seq. Length	Std. Deviation	Total Nucleotides	Genome Percentage
3' UTRs	19,771	8	3,118	228.134	152.106	4,510,410	3.78
5' UTRs	18,585	8	3,214	140.088	130.288	2,603,531	2.18
Introns	118,319	8	10,234	164.446	178.484	19,457,029	16.32
Core Promoters	27,023	100	100	100	0	2,702,300	2.27
Proximal Promoters	27,023	900	900	900	0	24,320,700	20.41
Distal Promoters	27,025	1,371	2,000	1,999.96	5.01105	54,048,839	45.35
Genome-wide	5	18,585,000	30,432,600	23,837,300	4,432,780	119,186,497	100.00

Overview of the characteristic properties of the non-coding segments and the entire genome of Arabidopsis thaliana. The number of sequences refers to the number of unique sequences in the specific segment. In the case of the entire genome, the sequences are the complete chromosomes. Min. Seq. Length refers to the length of the shortest sequence in the set, while Max. Seq. Length refers to the length of the longest sequence in the set. Mean Seq. Length provides the average length of the sequences in the set, while Std. Deviation describes the deviation from said mean. Finally, Total Nucleotides gives the total number of nucleotides contained within the sequences of the set, and Genome Percentage expresses the nucleotide count of the set as a percentage of the nucleotide count of the entire genome.

Some sequences in the segments are shorter than 8 nucleotides. Since these sequences cannot harbour any putative regulatory elements in the context of this study, the sequences are removed from the table. For the 3'UTRs this results in a total of 179 nt being omitted, for 5'UTRs 1207 nt and for introns 26 nt. They are however included in the calculation of the background for the different segments since they contribute to the overall nucleotide distribution.
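The Genome Percentage column of Table 1 can be reproduced from the Total Nucleotides column with a one-line calculation. The sketch below uses only the totals printed in Table 1; the names `GENOME_TOTAL`, `SEGMENT_TOTALS` and `genome_percentage` are hypothetical.

```python
# Total nucleotides per segment, copied from Table 1.
GENOME_TOTAL = 119_186_497
SEGMENT_TOTALS = {
    "3' UTRs": 4_510_410,
    "5' UTRs": 2_603_531,
    "Introns": 19_457_029,
    "Core Promoters": 2_702_300,
    "Proximal Promoters": 24_320_700,
    "Distal Promoters": 54_048_839,
}

def genome_percentage(segment_total, genome_total=GENOME_TOTAL):
    """Segment nucleotide count as a percentage of the genome-wide count,
    rounded to two decimals as in Table 1."""
    return round(100 * segment_total / genome_total, 2)

for name, total in SEGMENT_TOTALS.items():
    print(f"{name}: {genome_percentage(total)}%")
```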


Overrepresented Words

All 8-letter words present in the segments were identified and scored using observed:expected frequency ratios (O/E). Specifically, each word was scored and ranked using the function S*ln(S/E), where S is the number of sequences that contained the word, 'ln' is the natural logarithm, and E is the number of sequences in which the word was expected to occur. Words discovered in the whole genome were analyzed using the O*ln(O/E) score, with O referring to the overall occurrence of a word across the entire genome and E representing the expected occurrence of that word. The 25 top-ranked words, corresponding to ~0.04% of all possible words and likewise ~0.04% of the discovered words, were taken as an exemplary subset of the results and further examined (see Tables 2, 3, 4, 5, 6, 7, and 8 and Additional files 1, 2, 3, 4, 5, 6, and 7).
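The scoring function S*ln(S/E) can be written out directly. The sketch below applies it to the unmasked S and ES values of the first three rows of Table 2 and ranks the words by score; because the tabulated ES values are rounded, the recomputed scores agree with the table only to about two decimal places. The function name `sln_score` is hypothetical, and returning 0 for a word that was never observed is an assumption made here to avoid taking the logarithm of zero.

```python
from math import log

def sln_score(observed, expected):
    """Score a word as observed * ln(observed / expected).

    `observed` is S (or O for genome-wide scoring) and `expected` is the
    corresponding expected value, which must be positive."""
    if observed == 0:
        return 0.0  # assumed convention: the limit of x*ln(x) as x -> 0
    return observed * log(observed / expected)

# (word, S, ES) for the unmasked 3'UTR data, first three rows of Table 2.
rows = [
    ("TTTTTTGG", 998, 824.458),
    ("TTTTTGTT", 2264, 2066.82),
    ("TTTTTCTT", 2171, 1981.63),
]
ranked = sorted(rows, key=lambda r: sln_score(r[1], r[2]), reverse=True)
for word, s, es in ranked:
    print(word, round(sln_score(s, es), 2))
```

Multiplying the log ratio by the observed count favours words that are both enriched and frequent, so a rare word with a high O/E ratio does not automatically outrank a common word with a modest one.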

Table 2

The top 25 words in 3'UTRs

	Unmasked					Masked					Unmasked
Word	S	ES	O	EO	SlnSES	S	ES	O	EO	SlnSES	RevComp	RC_Pos	Pal	PValues
TTTTTGTT	2264	2066.82	2488	2306.04	206.297	2279	2066.89	2501	2331.04	222.643	AACAAAAA	40	No	9.38E-05
TTTTTCTT	2171	1981.63	2404	2203.7	198.149	2183	1978.5	2427	2222.83	214.723	AAGAAAAA	49	No	1.34E-05
TTTTTTGG	998	824.458	1046	877.255	190.646	1003	831.208	1053	888.417	188.434	CCAAAAAA	651	No	1.71E-08
ATTTTGTA	732	583.938	752	615.741	165.421	738	599.956	759	634.768	152.831	TACAAAAT	37	No	6.00E-08
TAATTTTT	787	642.133	810	678.585	160.101	797	646.36	821	685.263	166.97	AAAAATTA	164	No	5.24E-07
ATGTTTTA	589	469.818	601	493.292	133.161	610	486.404	624	512.055	138.116	TAAAACAT	284	No	1.48E-06
TTTGTTTT	2517	2402.46	2847	2715.8	117.227	2555	2406.15	2897	2753.88	153.362	AAAACAAA	1963	No	0.006347
GTTTTTGA	491	390.189	504	408.466	112.838	512	407.532	527	427.529	116.841	TCAAAAAC	5031	No	2.76E-06
AAATTTTG	588	491.471	603	516.445	105.443	604	504.212	621	531.22	109.069	CAAAATTT	376	No	0.00011
ATTTTTTA	482	387.674	498	405.795	104.97	492	406.16	510	426.064	94.3317	TAAAAAAT	100	No	5.33E-06
ATTTTTCA	446	354.812	450	370.941	102.014	453	365.873	457	383.118	96.7633	TGAAAAAT	170	No	3.83E-05
TGTTTTGT	1227	1133.19	1326	1219.91	97.5897	1255	1162.02	1359	1260.07	96.6082	ACAAAACA	659	No	0.001413
ATAAAAAT	564	474.529	580	498.326	97.4203	566	480.088	581	505.265	93.1776	ATTTTTAT	27	No	0.000192
TTTTTTCT	1721	1628.11	1839	1786.09	95.4882	1722	1625.78	1847	1798.84	99.0176	AGAAAAAA	106	No	0.107802
AAAAATTG	397	312.488	400	326.178	95.0296	414	323.794	419	338.423	101.744	CAATTTTT	66	No	4.26E-05
TATAATAT	505	419.081	519	439.185	94.1802	514	429.108	530	450.594	92.7844	ATATTATA	275	No	0.000114
CTCTGTTT	763	674.497	814	713.654	94.0706	796	706.86	852	751.4	94.5386	AAACAGAG	227	No	0.000125
TTTTTAAT	897	808.297	929	859.536	93.4009	905	811.646	942	866.766	98.5274	ATTAAAAA	95	No	0.009964
TTCTTTTT	1884	1795.18	2075	1982.05	90.9811	1879	1764.9	2059	1964.59	117.709	AAAAAGAA	130	No	0.019465
TTTTTGGT	989	902.56	1029	963.191	90.453	1006	920.175	1052	987.344	89.7087	ACCAAAAA	9144	No	0.018455
ATTTTCTG	324	245.197	330	255.296	90.2932	340	264.756	346	275.991	85.047	CAGAAAAT	241	No	4.24E-06
AATATATT	462	382.795	474	400.615	86.8857	477	412.829	490	433.187	68.9186	AATATATT	21	Yes	0.000195
TTTGTGTG	688	607.303	705	640.94	85.8355	705	625.577	726	662.623	84.2635	CACACAAA	8153	No	0.006617
TGTTTTTT	1716	1632.37	1839	1791.05	85.7404	1730	1636.78	1864	1811.88	95.8269	AAAAAACA	1065	No	0.131261

Top 25 overrepresented words for the 3' untranslated regions in Arabidopsis thaliana. The Word attribute gives the short nucleotide sequence associated with a putative word. S and ES give the number of sequences a word occurs in and the number of sequences the word was expected to occur in, respectively, while O and EO give the total number of occurrences and the expected total number of occurrences. The score SlnSES, computed as S*ln(S/ES), describes a statistical coverage of the sequences analyzed in the set and is based on a Markov chain background model. Each set of attributes was computed for the masked as well as the unmasked version of the corresponding segment, with the emphasis placed on the unmasked version (i.e., the table is sorted by the unmasked SlnSES score).

Further information for each word is provided through its reverse complement (RevComp), the position of the reverse complement in the set of results (RC_Pos), and a flag indicating whether the word is a genomic palindrome (Pal).
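The RevComp and Pal columns can be derived mechanically from the Word column. The following minimal sketch assumes words contain only the letters A, C, G and T; the function names `reverse_complement` and `is_palindrome` are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(_COMPLEMENT)[::-1]

def is_palindrome(word):
    """A genomic palindrome reads the same as its reverse complement."""
    return word == reverse_complement(word)

# Usage with three words from Table 2.
for word in ("TTTTTGTT", "ATAAAAAT", "AATATATT"):
    print(word, reverse_complement(word), "Yes" if is_palindrome(word) else "No")
```

Of the words listed in Table 2, only AATATATT equals its own reverse complement, which is why it is the single row marked 'Yes' in the Pal column.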